[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388204#comment-14388204 ]

Fabian Boehnlein commented on SPARK-6573:
-----------------------------------------

I don't understand.
numpy.nan values would need to be caught as a special case: a type check on
them always comes back with 'float', and _type_mappings in pyspark/sql/types.py
maps 'float' to the SQL DoubleType and only type(None) to the SQL NullType.

From a first look at the code, it would probably be too messy to treat
numpy.nan specially there.
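For reference, this is why the type check always sees a plain float (numpy.nan
is just a Python float):
{code}
>>> import numpy as np
>>> type(np.nan)
<type 'float'>
>>> type(None)
<type 'NoneType'>
{code}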

There is a fairly easy 'hack', though, where you do
{code}
df = df.where(pandas.notnull(df), None)
{code}
to replace numpy.nan with None (NoneType)
before calling sqlCtx.createDataFrame(df).
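For example (a minimal sketch with hypothetical data; assumes an existing
SQLContext named sqlCtx, as in the trace below):
{code}
import numpy as np
import pandas as pd

# object-dtype column: the missing value is np.nan, i.e. a float
df = pd.DataFrame({'name': ['a', np.nan, 'c']})

# pandas.notnull(df) is False exactly where df holds NaN,
# so where() swaps those cells for None
df = df.where(pd.notnull(df), None)

sqldf = sqlCtx.createDataFrame(df)
{code}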

So an alternative would be to make the user aware of this and throw a warning:
one for numpy.nan in DoubleType columns, and one for a float (i.e. numpy.nan)
in columns of any other, non-DoubleType type.
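Roughly like this (an untested sketch; warn_on_nan is a hypothetical helper,
not existing Spark code, that _verify_type could call before its type check):
{code}
import math
import warnings

from pyspark.sql.types import DoubleType

def warn_on_nan(obj, dataType):
    # hypothetical helper: warn instead of failing when a NaN float shows up
    if isinstance(obj, float) and math.isnan(obj):
        if isinstance(dataType, DoubleType):
            warnings.warn("numpy.nan in a DoubleType column; "
                          "use None if you mean SQL null")
        else:
            warnings.warn("numpy.nan (a float) in a %s column; "
                          "replace it with None before createDataFrame" % dataType)
{code}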

I have yet to test what happens to numpy.nan values in float columns, where
there is no actual type mismatch.
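Something like this would be the check to run (untested; assumes the same
sqlCtx as above):
{code}
import numpy as np
import pandas as pd

from pyspark.sql.types import StructType, StructField, DoubleType

# np.nan in a genuine double column, so declared and actual types match
df = pd.DataFrame({'value': [1.0, np.nan]})
schema = StructType([StructField('value', DoubleType(), True)])

sqldf = sqlCtx.createDataFrame(df, schema=schema)
sqldf.collect()  # does np.nan survive as NaN, or become a SQL null?
{code}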

> expect pandas null values as numpy.nan (not only as None)
> ---------------------------------------------------------
>
>                 Key: SPARK-6573
>                 URL: https://issues.apache.org/jira/browse/SPARK-6573
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Fabian Boehnlein
>
> In pandas it is common to use numpy.nan as the null value for missing data:
> http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
> http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
> http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
> createDataFrame, however, only works with None as the null value, parsing it
> as None in the RDD.
> I suggest adding support for np.nan values in pandas DataFrames.
> Current stack trace when calling createDataFrame on a DataFrame that has
> object-type columns holding np.nan values (which are floats):
> {code}
> TypeError                                 Traceback (most recent call last)
> <ipython-input-38-34f0263f0bf4> in <module>()
> ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
>     340 
> --> 341         return self.applySchema(data, schema)
>     342 
>     343     def registerDataFrameAsTable(self, rdd, tableName):
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
>     246 
>     247         for row in rows:
> --> 248             _verify_type(row, schema)
>     249 
>     250         # convert python objects to sql data
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1064                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
>    1065         for v, f in zip(obj, dataType.fields):
> -> 1066             _verify_type(v, f.dataType)
>    1067 
>    1068 _cached_cls = weakref.WeakValueDictionary()
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1048     if type(obj) not in _acceptable_types[_type]:
>    1049         raise TypeError("%s can not accept object in type %s"
> -> 1050                         % (dataType, type(obj)))
>    1051 
>    1052     if isinstance(dataType, ArrayType):
> TypeError: StringType can not accept object in type <type 'float'>
> {code}
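> For reference, a minimal sketch that should reproduce the trace above
> (hypothetical data; sqlCtx is the SQLContext from the trace):
> {code}
> import numpy as np
> import pandas as pd
>
> from pyspark.sql.types import StructType, StructField, StringType
>
> # object-dtype column whose missing value is np.nan, i.e. a float
> df_ = pd.DataFrame({'name': ['a', np.nan, 'c']})
> schema = StructType([StructField('name', StringType(), True)])
>
> # _verify_type sees a float where StringType is declared -> TypeError
> sqldf = sqlCtx.createDataFrame(df_, schema=schema)
> {code}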


