[jira] [Commented] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)

2015-03-31 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389386#comment-14389386 ]

Reynold Xin commented on SPARK-6573:


I just tested: NaN in Python is turned into NaN in the JVM. I think we can 
treat NaN as null and convert all NaN values into null on the way in, so users 
never have to worry about it again. 
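
A minimal sketch of that inbound conversion (hypothetical helper, not the actual Spark code path):
{code}
import math

def nan_to_none(row):
    """Replace float NaN values with None so they become SQL nulls downstream."""
    return tuple(None if isinstance(v, float) and math.isnan(v) else v for v in row)

# e.g. applied to the Python-side rows before schema verification:
# rdd = rdd.map(nan_to_none)
{code}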

> expect pandas null values as numpy.nan (not only as None)
> ---------------------------------------------------------
>
> Key: SPARK-6573
> URL: https://issues.apache.org/jira/browse/SPARK-6573
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Fabian Boehnlein
>
> In pandas it is common to use numpy.nan as the null value for missing data.
> http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions
> http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none
> http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna
> createDataFrame, however, only accepts None as the null value, parsing it as None in the RDD.
> I suggest adding support for np.nan values in pandas DataFrames.
> Current stack trace when calling createDataFrame on a DataFrame with object-type columns containing np.nan values (which are floats):
> {code}
> TypeError                                 Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema)
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     339             schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio)
>     340 
> --> 341         return self.applySchema(data, schema)
>     342 
>     343     def registerDataFrameAsTable(self, rdd, tableName):
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema)
>     246 
>     247         for row in rows:
> --> 248             _verify_type(row, schema)
>     249 
>     250         # convert python objects to sql data
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1064                              "length of fields (%d)" % (len(obj), len(dataType.fields)))
>    1065         for v, f in zip(obj, dataType.fields):
> -> 1066             _verify_type(v, f.dataType)
>    1067 
>    1068 _cached_cls = weakref.WeakValueDictionary()
> /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType)
>    1048     if type(obj) not in _acceptable_types[_type]:
>    1049         raise TypeError("%s can not accept object in type %s"
> -> 1050                         % (dataType, type(obj)))
>    1051 
>    1052     if isinstance(dataType, ArrayType):
> TypeError: StringType can not accept object in type <type 'float'>
> {code}






[jira] [Commented] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)

2015-03-31 Thread Fabian Boehnlein (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388204#comment-14388204 ]

Fabian Boehnlein commented on SPARK-6573:
-----------------------------------------

I don't understand.
numpy.nan values would need to be caught as a special case, since a type check 
always returns 'float', and _type_mappings in pyspark/sql/types.py maps 'float' 
to SQL DoubleType and only type(None) to SQL NullType.
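
To illustrate the point (small snippet; _type_mappings and _verify_type are the names from pyspark/sql/types.py referenced above):
{code}
import numpy as np

# numpy.nan is a plain Python float, so a type() check cannot tell it apart
# from an ordinary float value:
print(type(np.nan) is float)   # True
print(type(None))              # NoneType

# _type_mappings maps float -> DoubleType and type(None) -> NullType, so a
# np.nan in a StringType column fails _verify_type's acceptable-types check.
{code}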

Looking at the code for the first time, it would probably be too messy to treat 
numpy.nan specially. 

There is a quite easy 'hack':
{code}
df = df.where(pandas.notnull(df), None)
{code}
which replaces numpy.nan with None (NoneType)
before calling sqlCtx.createDataFrame(df).
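
For example (column names and data made up; assumes an existing SQLContext named sqlCtx):
{code}
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'name': ['alice', np.nan], 'city': ['x', 'y']})  # object-typed columns
pdf = pdf.where(pd.notnull(pdf), None)   # the np.nan in 'name' becomes None
# sqldf = sqlCtx.createDataFrame(pdf)    # now passes _verify_type for the StringType column
{code}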

So the alternative would be to make the user aware of this and throw a warning 
for numpy.nan in DoubleType columns, and another for float values (i.e. numpy.nan) 
in columns of other, non-double types.

I have yet to test what happens to numpy.nan values in float columns where 
there is no actual type mismatch.




[jira] [Commented] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)

2015-03-30 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387921#comment-14387921 ]

Reynold Xin commented on SPARK-6573:


Are numpy.nan values turned into Double.NaN in the JVM? If so, maybe we should 
treat all NaN values as null in the JVM.
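
One way to check this from the Python side (sketch; assumes a running SQLContext named sqlCtx):
{code}
import math

row = sqlCtx.createDataFrame([(float('nan'),)], ['x']).collect()[0]
print(math.isnan(row.x))   # True would mean NaN survives the round trip to the JVM as NaN
{code}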
