[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140419964
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1760,13 +1760,39 @@ def toPandas(self):
                      "if using spark.sql.execution.arrow.enable=true"
                raise ImportError("%s\n%s" % (e.message, msg))
        else:
+           import numpy as np
            dtype = {}
+           nullable_int_columns = set()
+
+           def null_handler(rows, nullable_int_columns):
+               requires_double_precision = set()
+               for row in rows:
+                   row = row.asDict()
+                   for column in nullable_int_columns:
+                       val = row[column]
+                       dt = dtype[column]
+                       if val is None and dt not in (np.float32, np.float64):
+                           dt = np.float64 if column in requires_double_precision else np.float32
+                           dtype[column] = dt
+                       elif val is not None:
+                           if abs(val) > 16777216:  # Max value before np.float32 loses precision.
--- End diff --

Values above this cannot be represented losslessly as a `np.float32`.
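For context, a quick standalone way to see that 2**24 = 16777216 cutoff (illustrative snippet, not part of the diff):

```python
import numpy as np

# np.float32 has a 24-bit significand, so 2**24 = 16777216 is the last
# point at which every consecutive integer is exactly representable.
print(np.float32(16777216) == np.float32(16777217))  # True: 16777217 rounds away
print(np.float64(16777216) == np.float64(16777217))  # False: float64 is still exact here
```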


---




[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...

2017-09-21 Thread logannc
Github user logannc commented on the issue:

https://github.com/apache/spark/pull/18945
  
I've continued to use @HyukjinKwon 's suggestion because it should be more 
performant and can handle the conversion without loss of precision. I believe 
I've addressed your concerns by only changing the type when we actually 
encounter a null.


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140414783
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                raise ImportError("%s\n%s" % (e.message, msg))
        else:
            dtype = {}
+           columns_with_null_int = set()
+           def null_handler(rows, columns_with_null_int):
+               for row in rows:
+                   row = row.asDict()
+                   for column in columns_with_null_int:
+                       val = row[column]
+                       dt = dtype[column]
+                       if val is not None:
+                           if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                               val = np.float64(val)
+                               dt = np.float64
+                               dtype[column] = np.float64
+                           else:
+                               val = np.float32(val)
+                               if dt not in (np.float32, np.float64):
+                                   dt = np.float32
+                                   dtype[column] = np.float32
+                       row[column] = val
+                   row = Row(**row)
+                   yield row
+           row_handler = lambda x, y: x
            for field in self.schema:
                pandas_type = _to_corrected_pandas_type(field.dataType)
+               if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                   columns_with_null_int.add(field.name)
+                   row_handler = null_handler
+                   pandas_type = np.float32
--- End diff --

Ah, I see where I got confused. I had started with @ueshin 's suggestion 
but abandoned it because I didn't want to create the DataFrame before the type 
correction; I was also looking at @HyukjinKwon 's suggestion and somehow ended 
up combining the two incorrectly.


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140413579
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                raise ImportError("%s\n%s" % (e.message, msg))
        else:
            dtype = {}
+           columns_with_null_int = set()
+           def null_handler(rows, columns_with_null_int):
+               for row in rows:
+                   row = row.asDict()
+                   for column in columns_with_null_int:
+                       val = row[column]
+                       dt = dtype[column]
+                       if val is not None:
+                           if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                               val = np.float64(val)
+                               dt = np.float64
+                               dtype[column] = np.float64
+                           else:
+                               val = np.float32(val)
+                               if dt not in (np.float32, np.float64):
+                                   dt = np.float32
+                                   dtype[column] = np.float32
+                       row[column] = val
+                   row = Row(**row)
+                   yield row
+           row_handler = lambda x, y: x
            for field in self.schema:
                pandas_type = _to_corrected_pandas_type(field.dataType)
+               if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                   columns_with_null_int.add(field.name)
+                   row_handler = null_handler
+                   pandas_type = np.float32
--- End diff --

Can you elaborate? I believe it is, per my reply to your comment in the 
`null_handler`.


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140412857
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                raise ImportError("%s\n%s" % (e.message, msg))
        else:
            dtype = {}
+           columns_with_null_int = {}
+           def null_handler(rows, columns_with_null_int):
+               for row in rows:
+                   row = row.asDict()
+                   for column in columns_with_null_int:
+                       val = row[column]
+                       dt = dtype[column]
+                       if val is not None:
--- End diff --

If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and 
there are ANY non-null values, the dtype of the column is changed to 
`np.float32` or `np.float64`, both of which properly handle `None` values.

That said, if the entire column was `None`, it would fail. Therefore I have 
preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, 
it may still change to `np.float64` if needed.
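For illustration, a minimal standalone snippet (not part of the patch) showing why a fully-null column needs a float dtype up front:

```python
import numpy as np
import pandas as pd

# An integer dtype cannot hold missing values, so coercing an all-None
# column to int fails...
try:
    pd.Series([None, None]).astype(np.int32)
except (TypeError, ValueError) as exc:
    print("int dtype rejected:", exc)

# ...while a float dtype represents None as NaN without complaint.
print(pd.Series([None, None], dtype=np.float32))
```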


---




[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...

2017-09-21 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140412745
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                raise ImportError("%s\n%s" % (e.message, msg))
        else:
            dtype = {}
+           columns_with_null_int = {}
+           def null_handler(rows, columns_with_null_int):
+               for row in rows:
+                   row = row.asDict()
+                   for column in columns_with_null_int:
+                       val = row[column]
+                       dt = dtype[column]
+                       if val is not None:
--- End diff --

If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and 
there are ANY non-null values, the dtype of the column is changed to 
`np.float32` or `np.float64`, both of which properly handle `None` values.

That said, if the entire column was `None`, it would fail. Therefore I have 
preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, 
it may still change to `np.float64` if needed.


---




[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

2017-09-20 Thread logannc
Github user logannc commented on the issue:

https://github.com/apache/spark/pull/18945
  
Hm. Where would I add tests?


---




[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

2017-09-20 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140153330
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                raise ImportError("%s\n%s" % (e.message, msg))
        else:
            dtype = {}
+           columns_with_null_int = {}
+           def null_handler(rows, columns_with_null_int):
+               for row in rows:
+                   row = row.asDict()
+                   for column in columns_with_null_int:
+                       val = row[column]
+                       dt = dtype[column]
+                       if val is not None:
+                           if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                               val = np.float64(val)
+                               if np.float64 != dt:
+                                   dt = np.float64
+                                   dtype[column] = np.float64
+                           else:
+                               val = np.float32(val)
+                               if dt not in (np.float32, np.float64):
+                                   dt = np.float32
+                                   dtype[column] = np.float32
+                       row[column] = val
+                   row = Row(**row)
+                   yield row
+           row_handler = lambda x, y: x
            for field in self.schema:
                pandas_type = _to_corrected_pandas_type(field.dataType)
+               if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                   columns_with_null_int.add(field.name)
--- End diff --

Fixed...


---




[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

2017-09-20 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140148777
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                raise ImportError("%s\n%s" % (e.message, msg))
        else:
            dtype = {}
+           columns_with_null_int = {}
+           def null_handler(rows, columns_with_null_int):
+               for row in rows:
+                   row = row.asDict()
+                   for column in columns_with_null_int:
+                       val = row[column]
+                       dt = dtype[column]
+                       if val is not None:
+                           if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                               val = np.float64(val)
+                               if np.float64 != dt:
--- End diff --

No, it's not strictly necessary, but it's hardly harmful and may future-proof 
things a bit. Anyway, it can be removed if you think it should be.


---




[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

2017-09-20 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140148164
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
 raise ImportError("%s\n%s" % (e.message, msg))
 else:
 dtype = {}
+columns_with_null_int = {}
+def null_handler(rows, columns_with_null_int):
+for row in rows:
+row = row.asDict()
+for column in columns_with_null_int:
+val = row[column]
+dt = dtype[column]
+if val is not None:
+if abs(val) > 16777216: # Max value before 
np.float32 loses precision.
+val = np.float64(val)
+if np.float64 != dt:
+dt = np.float64
+dtype[column] = np.float64
+else:
+val = np.float32(val)
+if dt not in (np.float32, np.float64):
+dt = np.float32
+dtype[column] = np.float32
+row[column] = val
+row = Row(**row)
+yield row
+row_handler = lambda x,y: x
 for field in self.schema:
 pandas_type = _to_corrected_pandas_type(field.dataType)
+if pandas_type in (np.int8, np.int16, np.int32) and 
field.nullable:
+columns_with_null_int.add(field.name)
--- End diff --

Ack, I noticed I did that and then forgot to change it. On it...


---




[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

2017-09-20 Thread logannc
Github user logannc commented on a diff in the pull request:

https://github.com/apache/spark/pull/18945#discussion_r140147933
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                raise ImportError("%s\n%s" % (e.message, msg))
        else:
            dtype = {}
+           columns_with_null_int = {}
+           def null_handler(rows, columns_with_null_int):
+               for row in rows:
+                   row = row.asDict()
+                   for column in columns_with_null_int:
+                       val = row[column]
+                       dt = dtype[column]
+                       if val is not None:
--- End diff --

They are handled by Pandas already, so I am just letting them pass through.


---




[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

2017-09-20 Thread logannc
Github user logannc commented on the issue:

https://github.com/apache/spark/pull/18945
  
Sorry I fell off the face of the earth. I finally had some time to sit down 
and do this. I took your suggestions but implemented them a little differently. 
Unless I've made a dumb mistake, I think I improved on it a bit.


---




[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

2017-08-27 Thread logannc
Github user logannc commented on the issue:

https://github.com/apache/spark/pull/18945
  
Sorry for the delay. Things got busy and now there is the storm in Houston. 
Will update this per these suggestions soon.


---



[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...

2017-08-17 Thread logannc
Github user logannc commented on the issue:

https://github.com/apache/spark/pull/18945
  
I read the contributing guide. It said that simple changes didn't need a 
JIRA. This code change is certainly quite simple; I just wasn't sure whether 
there would be enough discussion to warrant a JIRA. Now I know.

So, rather than return np.float32, return None? That would probably also 
work, though the bug might get reintroduced by someone unfamiliar with the 
problem. That is why I preferred the explicitness of returning a type.
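For context, a rough sketch of the helper under discussion, which maps Spark SQL types to NumPy dtypes and returns `None` when no correction is needed (simplified and abridged; see `python/pyspark/sql/dataframe.py` for the real thing):

```python
import numpy as np
from pyspark.sql.types import ByteType, ShortType, IntegerType, FloatType

def _to_corrected_pandas_type(dt):
    # Return the NumPy dtype a column should be coerced to, or None to
    # leave pandas' default type inference alone. Simplified sketch.
    if type(dt) == ByteType:
        return np.int8
    elif type(dt) == ShortType:
        return np.int16
    elif type(dt) == IntegerType:
        return np.int32
    elif type(dt) == FloatType:
        return np.float32
    return None
```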


---



[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...

2017-08-14 Thread logannc
GitHub user logannc opened a pull request:

https://github.com/apache/spark/pull/18945

Add option to convert nullable int columns to float columns in toPandas to 
prevent needless Exceptions during routine use.

Add a `strict=True` kwarg to DataFrame.toPandas to allow a non-strict 
interpretation of the schema of a dataframe. This is currently limited to 
allowing a nullable int column to be interpreted as a float column, because 
that is the only way Pandas supports nullable int columns; without this, 
toPandas actually crashes.
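A minimal sketch of the intended call, assuming the kwarg keeps this name:

```python
# Proposed usage: with strict=False, nullable int columns are returned
# as float columns instead of raising during conversion.
pdf = df.toPandas(strict=False)
```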

I consider this small change to be a massive quality-of-life improvement 
for DataFrames with lots of nullable int columns, which would otherwise need a 
litany of `df.withColumn(name, F.col(name).cast(DoubleType()))` calls (see the 
sketch below) just to view them easily or interact with them in memory.
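For reference, a hedged sketch of that manual workaround (assumes an existing DataFrame `df`; the loop and names are illustrative, not part of the patch):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ByteType, ShortType, IntegerType, DoubleType

# Hand-cast every nullable int column to double so toPandas() succeeds
# even when a column contains nulls.
for field in df.schema:
    if field.nullable and isinstance(field.dataType, (ByteType, ShortType, IntegerType)):
        df = df.withColumn(field.name, F.col(field.name).cast(DoubleType()))
pdf = df.toPandas()
```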

**Possible Objections**
* I foresee concerns with the name of the kwarg, for which I am open to 
suggestions.
* I also foresee possible objections to the needless conversion of nullable 
int columns to floats when there are actually no null values. I would counter 
that it only occurs when strict=False, which is not the default, and that it 
can be avoided on a per-column basis by setting the `nullable` property of the 
schema to False.

**Alternatives**
* Rename the kwarg to be specific to the current change, e.g., 
`nullable_int_to_float` instead of `strict`, or some other similar name.
* Fix Pandas to allow nullable int columns. (Very difficult, per Wes 
McKinney, due to lack of NumPy support. 
https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/logannc/spark nullable_int_pandas

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18945.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18945


commit bceeefca77dd3414e4ec97ad3570043ec3ce3059
Author: Logan Collins 
Date:   2017-08-15T01:30:08Z

Add option to convert nullable int columns to float columns in toPandas to 
prevent needless crashes.




---