Fabian Boehnlein created SPARK-6573:
---------------------------------------
             Summary: expect pandas null values as numpy.nan (not only as None)
                 Key: SPARK-6573
                 URL: https://issues.apache.org/jira/browse/SPARK-6573
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Fabian Boehnlein

In pandas it is common to use numpy.nan as the null value for missing data:
http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna

createDataFrame, however, only works with None as the null value, parsing it as None in the RDD. I suggest adding support for np.nan values in pandas DataFrames.

Current stack trace when calling createDataFrame on a DataFrame whose object-typed columns contain np.nan values (which are floats):

{code}
TypeError                                 Traceback (most recent call last)
<ipython-input-38-34f0263f0bf4> in <module>()
----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
    340
--> 341         return self.applySchema(data, schema)
    342
    343     def registerDataFrameAsTable(self, rdd, tableName):

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
    246
    247         for row in rows:
--> 248             _verify_type(row, schema)
    249
    250         # convert python objects to sql data

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1064                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
   1065         for v, f in zip(obj, dataType.fields):
-> 1066             _verify_type(v, f.dataType)
   1067
   1068 _cached_cls = weakref.WeakValueDictionary()

/opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
   1048     if type(obj) not in _acceptable_types[_type]:
   1049         raise TypeError("%s can not accept object in type %s"
-> 1050                         % (dataType, type(obj)))
   1051
   1052     if isinstance(dataType, ArrayType):

TypeError: StringType can not accept object in type <type 'float'>
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
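As a workaround until np.nan is accepted directly, the NaN cells can be mapped to None before the data reaches createDataFrame, since Spark's type verification accepts None but rejects the float NaN in string columns. A minimal sketch, not from the report (the sample DataFrame and the nan_to_none helper are illustrative):

{code}
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["alice", np.nan], "age": [10.0, np.nan]})


def nan_to_none(value):
    """Map float NaN cells to None; leave every other value untouched."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# Plain tuples with NaN replaced by None, suitable for passing to
# sqlCtx.createDataFrame(rows, schema) instead of the raw pandas frame.
rows = [tuple(nan_to_none(v) for v in row)
        for row in df.itertuples(index=False, name=None)]
{code}

This keeps valid floats intact (np.float64 subclasses Python float, so the isnan check applies to both object and numeric columns) and only rewrites the missing cells.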