[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140419964

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1760,13 +1760,39 @@ def toPandas(self):
                       "if using spark.sql.execution.arrow.enable=true"
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
+            import numpy as np
             dtype = {}
+            nullable_int_columns = set()
+
+            def null_handler(rows, nullable_int_columns):
+                requires_double_precision = set()
+                for row in rows:
+                    row = row.asDict()
+                    for column in nullable_int_columns:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is None and dt not in (np.float32, np.float64):
+                            dt = np.float64 if column in requires_double_precision else np.float32
+                            dtype[column] = dt
+                        elif val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
--- End diff --

Values above this cannot be represented losslessly as a `np.float32`.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
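[Editor's note] The 16777216 threshold in the diff above is 2**24, the point past which an IEEE-754 single can no longer represent every integer exactly (np.float32 has a 24-bit significand, counting the implicit leading bit). A minimal check:

```python
import numpy as np

LIMIT = 16777216  # 2**24

# Every integer up to 2**24 round-trips through float32 exactly.
exact = int(np.float32(LIMIT)) == LIMIT

# 2**24 + 1 does not: it rounds back down to 16777216.
inexact = int(np.float32(LIMIT + 1)) == LIMIT + 1

print(exact, inexact)  # True False
```

This is why the PR promotes a column to np.float64 as soon as it sees a value whose magnitude exceeds the limit.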
[GitHub] spark issue #18945: [SPARK-21766][SQL] Convert nullable int columns to float...
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945

I've continued to use @HyukjinKwon's suggestion because it should be more performant and can handle large values without loss of precision. I believe I've addressed your concerns by only changing the type when we encounter a null.
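[Editor's note] The "only change the type when we encounter a null" idea can be sketched independently of the PR. The helper below is hypothetical (not the PR's `null_handler`; it works on plain dicts rather than pyspark `Row`s), but it shows the single-pass dtype-promotion strategy being discussed: an int column's target dtype switches to a float dtype only when a null actually appears, and to float64 only when a value too large for float32 appears.

```python
import numpy as np

def promote_on_null(rows, int_columns, dtype):
    """Hypothetical sketch: walk rows once, promoting an int column's
    target dtype lazily -- to float32 when a null is seen, to float64
    when a value exceeds float32's exact-integer range."""
    for row in rows:
        for col in int_columns:
            val = row[col]
            if val is None and dtype[col] not in (np.float32, np.float64):
                dtype[col] = np.float32
            elif val is not None and abs(val) > 2 ** 24:
                dtype[col] = np.float64  # float32 would lose precision
        yield row

dtype = {"a": np.int32}
rows = list(promote_on_null([{"a": 1}, {"a": None}], ["a"], dtype))
print(dtype["a"])  # promoted only because a null was seen
```

Columns that never contain a null keep their original integer dtype, which is the behavior the reviewers asked for.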
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140414783

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x, y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

Ah, I see where I got confused. I had started with @ueshin's suggestion but abandoned it because I didn't want to create the DataFrame before the type correction, since I was also looking at @HyukjinKwon's suggestion. I somehow ended up combining them incorrectly.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140413579

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = set()
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                dt = np.float64
+                                dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x, y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
+                    row_handler = null_handler
+                    pandas_type = np.float32
--- End diff --

Can you elaborate? I believe it is, per my reply to your comment in the `null_handler`.
[GitHub] spark pull request #18945: [SPARK-21766][SQL] Convert nullable int columns t...
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140412857

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
--- End diff --

If `pandas_type in (np.int8, np.int16, np.int32) and field.nullable` and there are ANY non-null values, the dtype of the column is changed to `np.float32` or `np.float64`, both of which properly handle `None` values. That said, if the entire column were `None`, it would fail. Therefore I have preemptively changed the type on line 1787 to `np.float32`. Per `null_handler`, it may still change to `np.float64` if needed.
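[Editor's note] The None-handling claim above can be verified in plain pandas, independent of the PR code: a float column absorbs missing values as NaN (even an all-None column), while a NumPy integer dtype has no missing-value representation at all.

```python
import numpy as np
import pandas as pd

# A float32 column represents every missing value as NaN,
# even when the whole column is missing.
s = pd.Series([None, None], dtype=np.float32)
print(s.isna().all(), s.dtype)  # True float32

# A NumPy integer dtype cannot hold None/NaN, so the same data fails.
int_ok = True
try:
    pd.Series([None, None], dtype=np.int32)
except (TypeError, ValueError):
    int_ok = False
print("int32 accepted:", int_ok)  # int32 accepted: False
```

This is why the preemptive switch to `np.float32` is needed for the all-None case.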
[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945

Hm. Where would I add tests?
[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140153330

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                if np.float64 != dt:
+                                    dt = np.float64
+                                    dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x, y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
--- End diff --

Fixed...
[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140148777

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                if np.float64 != dt:
--- End diff --

No, not strictly necessary, but also hardly harmful, and it may future-proof a bit...? Anyway, it can be removed if you think it should be.
[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140148164

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
+                            if abs(val) > 16777216:  # Max value before np.float32 loses precision.
+                                val = np.float64(val)
+                                if np.float64 != dt:
+                                    dt = np.float64
+                                    dtype[column] = np.float64
+                            else:
+                                val = np.float32(val)
+                                if dt not in (np.float32, np.float64):
+                                    dt = np.float32
+                                    dtype[column] = np.float32
+                        row[column] = val
+                    row = Row(**row)
+                    yield row
+            row_handler = lambda x, y: x
             for field in self.schema:
                 pandas_type = _to_corrected_pandas_type(field.dataType)
+                if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
+                    columns_with_null_int.add(field.name)
--- End diff --

Ack, I noticed I did that and then forgot to change it. On it...
[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...
Github user logannc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18945#discussion_r140147933

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1761,12 +1761,37 @@ def toPandas(self):
                 raise ImportError("%s\n%s" % (e.message, msg))
         else:
             dtype = {}
+            columns_with_null_int = {}
+            def null_handler(rows, columns_with_null_int):
+                for row in rows:
+                    row = row.asDict()
+                    for column in columns_with_null_int:
+                        val = row[column]
+                        dt = dtype[column]
+                        if val is not None:
--- End diff --

They are handled by Pandas already, so I am just letting them pass through.
[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945

Sorry I fell off the face of the earth. I finally had some time to sit down and do this. I took your suggestions but implemented them a little differently. Unless I've made a dumb mistake, I think I improved on them a bit.
[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945

Sorry for the delay. Things got busy, and now there is the storm in Houston. I will update this per these suggestions soon.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...
Github user logannc commented on the issue:

    https://github.com/apache/spark/pull/18945

I read the contributing guide. It said that simple changes didn't need a JIRA. This code change is certainly quite simple; I just wasn't sure whether there would be enough discussion to warrant a JIRA. Now I know.

So, rather than return np.float32, return None? That would probably also work, though the bug might get reintroduced by someone unfamiliar with the problem. That is why I preferred the explicitness of returning a type.
[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...
GitHub user logannc opened a pull request:

    https://github.com/apache/spark/pull/18945

Add option to convert nullable int columns to float columns in toPandas to prevent needless Exceptions during routine use.

Add the `strict=True` kwarg to DataFrame.toPandas to allow for a non-strict interpretation of the schema of a dataframe. This is currently limited to allowing a nullable int column to be interpreted as a float column (because that is the only way Pandas supports nullable int columns, and toPandas actually crashes without this). I consider this small change to be a massive quality-of-life improvement for DataFrames with lots of nullable int columns, which would otherwise need a litany of `df.withColumn(name, F.col(name).cast(DoubleType()))`, etc., just to view them easily or interact with them in-memory.

**Possible Objections**
* I foresee concerns with the name of the kwarg, for which I am open to suggestions.
* I also foresee possible objections due to the potential for needless conversion of nullable int columns to floats when there are actually no null values. I would counter those objections by noting that the conversion only occurs when strict=False, which is not the default, and can be avoided on a per-column basis by setting the `nullable` property of the schema to False.

**Alternatives**
* Rename the kwarg to be specific to the current change, i.e., `nullable_int_to_float` instead of `strict`, or some other, similar name.
* Fix Pandas to allow nullable int columns. (Very difficult, per Wes McKinney, due to lack of NumPy support. https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/logannc/spark nullable_int_pandas

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18945.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18945

commit bceeefca77dd3414e4ec97ad3570043ec3ce3059
Author: Logan Collins
Date: 2017-08-15T01:30:08Z

    Add option to convert nullable int columns to float columns in toPandas to prevent needless crashes.
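[Editor's note] The underlying pandas behavior motivating the PR is easy to reproduce, independent of Spark: because NumPy integer arrays cannot represent NaN, a single null silently forces an integer column to float64.

```python
import numpy as np
import pandas as pd

# pandas has no NaN-capable NumPy int dtype, so one null
# upcasts the whole column to float64 with NaN for the missing value.
nullable = pd.Series([1, 2, None])
print(nullable.dtype)    # float64
print(nullable.tolist())  # [1.0, 2.0, nan]
```

This is the same type coercion the PR applies explicitly (and with float32 where the values permit it) so that toPandas does not crash on nullable int columns.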