[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611565#comment-14611565 ]
Josh Rosen commented on SPARK-6573:
-----------------------------------

NaN can lead to confusing exceptions during sorting if it appears in a column. I just ran into an issue where Sort threw a "Comparison method violates its general contract!" error for data containing NaN columns. See my comments at https://github.com/apache/spark/pull/7179#discussion_r33749911

> Convert inbound NaN values as null
> ----------------------------------
>
>                 Key: SPARK-6573
>                 URL: https://issues.apache.org/jira/browse/SPARK-6573
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Fabian Boehnlein
>
> In pandas it is common to use numpy.nan as the null value for missing data.
> http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
> http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
> http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
> createDataFrame, however, only accepts None as the null value, parsing it as None in the RDD.
> I suggest adding support for np.nan values in pandas DataFrames.
> Current stack trace when calling createDataFrame on a DataFrame whose object-typed columns contain np.nan values (which are floats):
> {code}
> TypeError                                 Traceback (most recent call last)
> <ipython-input-38-34f0263f0bf4> in <module>()
> ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
>     340
> --> 341         return self.applySchema(data, schema)
>     342
>     343     def registerDataFrameAsTable(self, rdd, tableName):
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
>     246
>     247         for row in rows:
> --> 248             _verify_type(row, schema)
>     249
>     250         # convert python objects to sql data
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1064                             "length of fields (%d)" % (len(obj), len(dataType.fields)))
>    1065         for v, f in zip(obj, dataType.fields):
> -> 1066             _verify_type(v, f.dataType)
>    1067
>    1068 _cached_cls = weakref.WeakValueDictionary()
>
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1048     if type(obj) not in _acceptable_types[_type]:
>    1049         raise TypeError("%s can not accept object in type %s"
> -> 1050                         % (dataType, type(obj)))
>    1051
>    1052     if isinstance(dataType, ArrayType):
>
> TypeError: StringType can not accept object in type <type 'float'>
> {code}
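
For illustration, a minimal Python sketch of both points (the broken ordering Josh mentions and the NaN-to-None workaround). The names pdf and pdf_clean are hypothetical, and the sqlCtx/schema objects from the traceback above are assumed to already exist in the reporter's session; this is a sketch against the pandas/NumPy of that era, not a patch to Spark itself.

{code}
import numpy as np
import pandas as pd

# NaN compares False against everything, including itself, so a comparator
# derived from "<" cannot form a total order -- this is what triggers
# "Comparison method violates its general contract!" in comparator-based sorts.
nan = float("nan")
assert not (nan < 1.0) and not (1.0 < nan) and not (nan == nan)

# Hypothetical pandas frame where NaN stands in for a missing string value.
pdf = pd.DataFrame({"name": ["alice", np.nan], "age": [1.0, 2.0]})

# Replacing NaN with None before handing the frame to createDataFrame avoids
# the "StringType can not accept object in type <type 'float'>" error, since
# None is parsed as SQL null while NaN is just a float.
pdf_clean = pdf.where(pd.notnull(pdf), None)

# With a SQLContext and schema in scope (not constructed here), the reporter's
# failing call would then become:
#   sqldf = sqlCtx.createDataFrame(pdf_clean, schema=schema)
{code}

Doing that conversion inside createDataFrame is essentially what this ticket proposes, so users would not have to pre-process their DataFrames by hand.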