[jira] [Commented] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1
[ https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547031#comment-16547031 ]

Hichame El Khalfi commented on SPARK-24644:
---

Indeed, we were using an old version of pandas. After updating it to 0.19.2, there is no crash or error to report. Thank you [~bryanc] and [~hyukjin.kwon] for your valuable help and input.

> Pyarrow exception while running pandas_udf on pyspark 2.3.1
> ---
>
>                 Key: SPARK-24644
>                 URL: https://issues.apache.org/jira/browse/SPARK-24644
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: os: centos
> pyspark 2.3.1
> spark 2.3.1
> pyarrow >= 0.8.0
>            Reporter: Hichame El Khalfi
>            Priority: Major
>
> Hello,
> When I try to run a `pandas_udf` on my Spark dataframe, I get this error:
>
> {code:java}
>   File "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py", line 280, in load_stream
>     pdf = batch.to_pandas()
>   File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226)
>     return Table.from_batches([self]).to_pandas(nthreads=nthreads)
>   File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331)
>     mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 528, in table_to_blockmanager
>     blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 622, in _table_to_blocks
>     return [_reconstruct_block(item) for item in result]
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 446, in _reconstruct_block
>     block = _int.make_block(block_arr, placement=placement)
> TypeError: make_block() takes at least 3 arguments (2 given)
> {code}
>
> More than happy to provide any additional information.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
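For context, the failure in the quoted description surfaces on the executor when Arrow converts a record batch back to pandas. A hypothetical minimal sketch of the kind of job that exercises this path (the UDF, column names, and app name are illustrative, not taken from the reporter's job):

```python
def add_one(x):
    # Column transformation applied inside the pandas_udf. When Spark calls
    # it, x is a pandas Series and the addition is vectorized, but it also
    # works on plain ints, which keeps it checkable without a cluster.
    return x + 1


def run_on_spark():
    # Requires pyspark >= 2.3 and pyarrow >= 0.8.0; per this thread, pandas
    # must also be reasonably recent (>= 0.19.2). On the affected environment
    # (pandas 0.13.0), materializing the result raised:
    #   TypeError: make_block() takes at least 3 arguments (2 given)
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.appName("pandas-udf-repro").getOrCreate()
    df = spark.range(10)  # a single bigint column named "id"
    add_one_udf = pandas_udf(add_one, returnType=LongType())
    df.select(add_one_udf(df["id"]).alias("id_plus_one")).show()
    spark.stop()
```

Factoring the pandas logic out of the UDF wrapper keeps it testable on its own, independently of the Arrow serialization path where the error occurred.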
[ https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545589#comment-16545589 ]

Bryan Cutler commented on SPARK-24644:
---

[~helkhalfi], the error in the stack trace is coming from pandas internals, and it looks like you are using a fairly old version, so my guess is that you need to upgrade pandas to solve this. For Spark, we currently test pyarrow with pandas 0.19.2, and I would recommend that version or higher.
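One way to act on this advice is to fail fast on the driver before any UDF runs. A minimal sketch of such a guard (the helper names and version-parsing rules are illustrative, not part of Spark or pandas):

```python
MIN_PANDAS = (0, 19, 2)  # minimum version recommended in this thread


def parse_version(version_string):
    # Turn a dotted version such as "0.19.2" into a comparable tuple of
    # ints, stopping at the first component that is not purely numeric
    # (so "0.19.2rc1" parses as (0, 19)).
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def pandas_is_supported(installed_version, minimum=MIN_PANDAS):
    # Tuple comparison handles multi-digit components correctly, unlike a
    # plain string comparison (as strings, "0.9.0" sorts above "0.19.2").
    return parse_version(installed_version) >= minimum


def require_pandas():
    # Raise early if the installed pandas is older than the minimum,
    # instead of failing later inside the Arrow-to-pandas conversion.
    import pandas
    if not pandas_is_supported(pandas.__version__):
        raise ImportError(
            "pandas >= %s is required for pandas_udf, found %s"
            % (".".join(str(n) for n in MIN_PANDAS), pandas.__version__))
```

With this guard, the reporter's environment (pandas 0.13.0) would be rejected up front rather than crashing inside `pyarrow.pandas_compat` on the executors.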
[ https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539626#comment-16539626 ]

Hyukjin Kwon commented on SPARK-24644:
---

Thanks, [~helkhalfi]. Mind if I ask you to post the code you ran, so that I or someone else can reproduce and investigate further?
[ https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532178#comment-16532178 ]

Hichame El Khalfi commented on SPARK-24644:
---

Hello [~hyukjin.kwon],

Thanks for taking the time on this ticket. Regarding the environment, we are using:
* CentOS 7
* JDK 1.8.0_101-b13
* CPython interpreter 2.7
* Spark 2.3.1 in distributed mode
* pandas 0.13.0
* pyarrow 0.9.0
[ https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523399#comment-16523399 ]

Hyukjin Kwon commented on SPARK-24644:
---

Can you clarify the environment, in particular the PyArrow and Pandas versions?