[jira] [Commented] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389386#comment-14389386 ] Reynold Xin commented on SPARK-6573:

I just tested. NaN in Python is turned into NaN in the JVM. I think we can treat NaN as null, convert all NaN values into null in-bound, and then users never have to worry about it again.

> expect pandas null values as numpy.nan (not only as None)
> ---------------------------------------------------------
>
> Key: SPARK-6573
> URL: https://issues.apache.org/jira/browse/SPARK-6573
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Fabian Boehnlein
>
> In pandas it is common to use numpy.nan as the null value for missing data.
> http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
> http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
> http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
> createDataFrame, however, only works with None as the null value, parsing it as None in the RDD.
> I suggest adding support for np.nan values in pandas DataFrames.
> Current stack trace when calling createDataFrame on a DataFrame with object-type columns containing np.nan values (which are floats):
> {code}
> TypeError                                 Traceback (most recent call last)
> in ()
> ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
>     340
> --> 341         return self.applySchema(data, schema)
>     342
>     343     def registerDataFrameAsTable(self, rdd, tableName):
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
>     246
>     247         for row in rows:
> --> 248             _verify_type(row, schema)
>     249
>     250         # convert python objects to sql data
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1064                             "length of fields (%d)" % (len(obj), len(dataType.fields)))
>    1065         for v, f in zip(obj, dataType.fields):
> -> 1066             _verify_type(v, f.dataType)
>    1067
>    1068 _cached_cls = weakref.WeakValueDictionary()
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1048     if type(obj) not in _acceptable_types[_type]:
>    1049         raise TypeError("%s can not accept object in type %s"
> -> 1050                         % (dataType, type(obj)))
>    1051
>    1052     if isinstance(dataType, ArrayType):
>
> TypeError: StringType can not accept object in type
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
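The in-bound "treat NaN as null" conversion Reynold proposes can be sketched in plain Python. The helper names `nan_to_none` and `clean_row` are hypothetical illustrations, not part of PySpark; the key fact they rely on is that numpy.nan is an ordinary Python float, which is exactly why the type check in the trace above rejects it for a StringType column:

```python
import math

def nan_to_none(value):
    """Map a float NaN to None so it arrives as SQL null; pass everything else through."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

def clean_row(row):
    """Apply the NaN -> None mapping to every field of a row (tuple)."""
    return tuple(nan_to_none(v) for v in row)

# NaN is just a float, and it compares unequal even to itself,
# so an isnan() check is needed rather than equality:
nan = float("nan")
print(type(nan).__name__)          # float
print(nan == nan)                  # False
print(clean_row((1, "a", nan)))    # (1, 'a', None)
```

Applied to every row before schema verification, this would let NaN-bearing pandas data pass `_verify_type` for non-double columns.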
[jira] [Commented] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388204#comment-14388204 ] Fabian Boehnlein commented on SPARK-6573:

I don't understand. numpy.nan values would have to be caught as a special case, since a datatype check always comes back with 'float', and _type_mappings in pyspark/sql/types.py maps 'float' to SQL DoubleType and only type(None) to SQL NullType. Looking at the code for the first time, it is probably too messy to treat numpy.nan specially. There is a fairly easy workaround:

{code}
df = df.where(pandas.notnull(df), None)
{code}

to replace numpy.nan with None (NoneType) before calling sqlCtx.createDataFrame(df). The alternative would be to make the user aware of this and throw one warning for numpy.nan in DoubleType columns, and another for float (i.e. numpy.nan) values in columns of other, non-DoubleType types. I have yet to test what happens to numpy.nan values in float columns, where there is no actual type mismatch.
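The workaround above can be exercised end to end with plain pandas, no Spark needed. The `astype(object)` cast is an extra precaution not in the original snippet: on float columns, some pandas versions coerce the substituted None straight back to NaN unless the column dtype is object first.

```python
import numpy as np
import pandas as pd

# A frame where an object column carries np.nan (a float) among strings,
# mirroring the situation that triggers the StringType TypeError.
df = pd.DataFrame({"name": ["alice", np.nan], "score": [1.0, np.nan]})

# where() keeps values where the mask is True and substitutes None elsewhere;
# casting to object first prevents None -> NaN coercion in the float column.
df_clean = df.astype(object).where(pd.notnull(df), None)

print(df_clean["name"][1] is None)   # True
print(df_clean["score"][1] is None)  # True
```

The resulting frame contains genuine None values, which createDataFrame already maps to SQL null.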
[jira] [Commented] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387921#comment-14387921 ] Reynold Xin commented on SPARK-6573:

Is numpy.nan turned into Double.NaN in the JVM? If so, maybe we should treat all NaN numbers as null in the JVM.