[
https://issues.apache.org/jira/browse/ARROW-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rok Mihevc updated ARROW-5016:
------------------------------
External issue URL: https://github.com/apache/arrow/issues/21514
> [Python] Failed to convert 'float' to 'double' when using pandas_udf and
> pyspark
> --------------------------------------------------------------------------------
>
> Key: ARROW-5016
> URL: https://issues.apache.org/jira/browse/ARROW-5016
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.12.1
> Environment: Linux 68b0517ddf1c 3.10.0-862.11.6.el7.x86_64 #1 SMP
> GNU/Linux
> Reporter: Dat Nguyen
> Priority: Minor
> Labels: newbie
>
> Hi everyone,
> I would like to report a (potential) bug. I followed the official [Usage
> Guide for Pandas with Apache
> Arrow|https://spark.apache.org/docs/2.4.0/sql-pyspark-pandas-with-arrow.html].
> However, `libarrow` throws an error when converting from float to double.
> Here is the example and its output.
> pyarrow==0.12.1
> pyarrow==0.12.1
> {code:title=reproduce_bug.py}
> from pyspark.sql import SparkSession, SQLContext
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col
>
> spark = SparkSession.builder.appName('ReproduceBug').getOrCreate()
>
> df = spark.createDataFrame(
>     [(1, "a"), (1, "a"), (1, "b")],
>     ("id", "value"))
> df.show()
> # Spark DataFrame
> # +---+-----+
> # | id|value|
> # +---+-----+
> # |  1|    a|
> # |  1|    a|
> # |  1|    b|
> # +---+-----+
>
> # Potential Bug #
> @pandas_udf('double', PandasUDFType.SCALAR)
> def compute_frequencies(value_col):
>     total = value_col.count()
>     per_groups = value_col.groupby(value_col).transform('count')
>     score = per_groups / total
>     return score
>
> df.groupBy("id")\
>     .agg(compute_frequencies(col('value')))\
>     .show()
>
> spark.stop()
> {code}
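>
> To help narrow it down: the failure seems to sit at the Arrow layer rather
> than in Spark itself. The UDF is declared SCALAR but used inside `agg()`,
> so each group hands Arrow a pandas Series where a single scalar is
> expected, and pyarrow refuses to convert a Series-of-Series to double.
> A minimal sketch of the same error (my own reconstruction, assuming only
> pandas and pyarrow 0.12.1, no Spark); the full Spark output follows below.
> {code:title=pyarrow_only_repro.py}
> import pandas as pd
> import pyarrow as pa
>
> scores = pd.Series([0.666667, 0.666667, 0.333333])
>
> # A flat float64 Series converts to double without trouble.
> print(pa.Array.from_pandas(scores, type=pa.float64()))
>
> # An object Series whose single element is itself a Series cannot be
> # converted; this raises the same "tried to convert to double" error.
> nested = pd.Series([scores])
> try:
>     pa.Array.from_pandas(nested, type=pa.float64())
> except pa.lib.ArrowInvalid as err:
>     print(err)
> {code}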
>
> {code:title=output}
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-3-d4f781f64db1> in <module>
>      32
>      33 df.groupBy("id")\
> ---> 34     .agg(compute_frequencies(col('value')))\
>      35     .show()
>      36
> /usr/local/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
>     376         """
>     377         if isinstance(truncate, bool) and truncate:
> --> 378             print(self._jdf.showString(n, 20, vertical))
>     379         else:
>     380             print(self._jdf.showString(n, int(truncate), vertical))
> /usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1255         answer = self.gateway_client.send_command(command)
>    1256         return_value = get_return_value(
> -> 1257             answer, self.gateway_client, self.target_id, self.name)
>    1258
>    1259         for temp_arg in temp_args:
> /usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>      61     def deco(*a, **kw):
>      62         try:
> ---> 63             return f(*a, **kw)
>      64         except py4j.protocol.Py4JJavaError as e:
>      65             s = e.java_exception.toString()
> /usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     326                 raise Py4JJavaError(
>     327                     "An error occurred while calling {0}{1}{2}.\n".
> --> 328                     format(target_id, ".", name), value)
>     329             else:
>     330                 raise Py4JError(
> Py4JJavaError: An error occurred while calling o186.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 44
> in stage 23.0 failed 1 times, most recent failure: Lost task 44.0 in stage
> 23.0 (TID 601, localhost, executor driver):
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
>     process()
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 284, in dump_stream
>     batch = _create_batch(series, self._timezone)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 253, in _create_batch
>     arrs = [create_array(s, t) for s, t in series]
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 253, in <listcomp>
>     arrs = [create_array(s, t) for s, t in series]
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 251, in create_array
>     return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "pyarrow/array.pxi", line 536, in pyarrow.lib.Array.from_pandas
>   File "pyarrow/array.pxi", line 176, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Could not convert 0    0.666667
> 1    0.666667
> 2    0.333333
> Name: _0, dtype: float64 with type Series: tried to convert to double
> {code}
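>
> A possible workaround, assuming the intent is per-row value frequencies
> within each `id` group: declare the UDF as GROUPED_MAP and apply it with
> `groupby().apply()` instead of `agg()`. This is only a sketch against the
> Spark 2.4 API, not a fix for the Arrow error itself.
> {code:title=grouped_map_workaround.py}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> # GROUPED_MAP receives one pandas DataFrame per group and returns a
> # DataFrame, so a per-row score column can be added without handing
> # Arrow a Series where a single scalar is expected.
> @pandas_udf("id long, value string, score double", PandasUDFType.GROUPED_MAP)
> def compute_frequencies(pdf):
>     counts = pdf["value"].groupby(pdf["value"]).transform("count")
>     pdf["score"] = counts / len(pdf)
>     return pdf
>
> df.groupby("id").apply(compute_frequencies).show()
> {code}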
> Please let me know if you need any further information.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)