[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611565#comment-14611565 ]

Josh Rosen commented on SPARK-6573:
-----------------------------------

NaN can lead to confusing exceptions during sorting if it appears in a column.
I just ran into an issue where Sort threw a "Comparison method violates its
general contract!" error for data whose columns contained NaN values.  See my comments at
https://github.com/apache/spark/pull/7179#discussion_r33749911
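
For context, here is a minimal sketch (plain Python, with a hypothetical naive_cmp helper) of why NaN breaks total ordering and can make a sort comparator violate its contract:

{code}
# NaN compares false against everything, so any comparator built on
# < / > stops being a total order once NaN shows up.
nan = float('nan')

print(nan < 1.0)   # False
print(nan > 1.0)   # False
print(nan == nan)  # False

# A naive three-way comparator derived from these operators:
def naive_cmp(a, b):
    if a < b:
        return -1
    if a > b:
        return 1
    return 0

print(naive_cmp(1.0, nan))  # 0
print(naive_cmp(nan, 2.0))  # 0
print(naive_cmp(1.0, 2.0))  # -1  -> "equal" is no longer transitive
{code}

Java's TimSort detects exactly this kind of inconsistency and reports it as "Comparison method violates its general contract!".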

> Convert inbound NaN values as null
> ----------------------------------
>
>                 Key: SPARK-6573
>                 URL: https://issues.apache.org/jira/browse/SPARK-6573
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Fabian Boehnlein
>
> In pandas it is common to use numpy.nan as the null value for missing data.
> http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
> http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
> http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
> createDataFrame, however, only accepts None as a null value, parsing it as None in the RDD.
> I suggest adding support for np.nan values in pandas DataFrames.
> Current stack trace when calling createDataFrame on a pandas DataFrame with object-type columns containing np.nan values (which are floats):
> {code}
> TypeError                                 Traceback (most recent call last)
> <ipython-input-38-34f0263f0bf4> in <module>()
> ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
>     340 
> --> 341         return self.applySchema(data, schema)
>     342 
>     343     def registerDataFrameAsTable(self, rdd, tableName):
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
>     246 
>     247         for row in rows:
> --> 248             _verify_type(row, schema)
>     249 
>     250         # convert python objects to sql data
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1064                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
>    1065         for v, f in zip(obj, dataType.fields):
> -> 1066             _verify_type(v, f.dataType)
>    1067 
>    1068 _cached_cls = weakref.WeakValueDictionary()
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1048     if type(obj) not in _acceptable_types[_type]:
>    1049         raise TypeError("%s can not accept object in type %s"
> -> 1050                         % (dataType, type(obj)))
>    1051 
>    1052     if isinstance(dataType, ArrayType):
> TypeError: StringType can not accept object in type <type 'float'>
> {code}
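
Until such support exists, a workaround sketch on the pandas side (the DataFrame and column below are hypothetical; sqlCtx and schema are the objects from the stack trace and are assumed to exist) is to replace NaN with None before calling createDataFrame:

{code}
import numpy as np
import pandas as pd

# Hypothetical frame with an object-typed column holding np.nan
# (a float), which StringType rejects in _verify_type.
df_ = pd.DataFrame({'name': ['alice', np.nan, 'bob']}, dtype=object)

# Map NaN -> None so the values pass the StringType check.
df_clean = df_.where(pd.notnull(df_), None)

# sqldf = sqlCtx.createDataFrame(df_clean, schema=schema)
{code}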


