[ https://issues.apache.org/jira/browse/SPARK-25491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674330#comment-16674330 ]
Hyukjin Kwon commented on SPARK-25491:
--------------------------------------

Let me leave this resolved then. We will likely bump up the PyArrow version in Spark 3.0.0 soon anyway.

> pandas_udf(GROUPED_MAP) fails when using ArrayType(ArrayType(DoubleType()))
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25491
>                 URL: https://issues.apache.org/jira/browse/SPARK-25491
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: Linux
> python 2.7.9
> pyspark 2.3.1 (also reproduces on pyspark 2.3.0)
> pyarrow 0.9.0 (working OK when using pyarrow 0.8.0)
>            Reporter: Ofer Fridman
>            Priority: Major
>
> After upgrading from pyarrow-0.8.0 to pyarrow-0.9.0, using pandas_udf (in
> PandasUDFType.GROUPED_MAP) results in an error:
> {quote}Caused by: java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:158)
>  ... 24 more
> {quote}
> The problem occurs only when using a complex type like
> ArrayType(ArrayType(DoubleType())); usage of ArrayType(DoubleType()) did not
> reproduce this issue.
> Here is a simple example to reproduce this issue:
> {quote}import pandas as pd
> import numpy as np
> from pyspark.sql import SparkSession
> from pyspark.context import SparkContext, SparkConf
> from pyspark.sql.types import *
> import pyspark.sql.functions as sprk_func
>
> sp_conf = SparkConf().setAppName("stam").setMaster("local[1]").set('spark.driver.memory', '4g')
> sc = SparkContext(conf=sp_conf)
> spark = SparkSession(sc)
>
> pd_data = pd.DataFrame({'id': (np.random.rand(20) * 10).astype(int)})
> data_df = spark.createDataFrame(pd_data, StructType([StructField('id', IntegerType(), True)]))
>
> @sprk_func.pandas_udf(StructType([StructField('mat', ArrayType(ArrayType(DoubleType())), True)]),
>                       sprk_func.PandasUDFType.GROUPED_MAP)
> def return_mat_group(group):
>     pd_data = pd.DataFrame({'mat': np.random.rand(7, 4, 4).tolist()})
>     return pd_data
>
> data_df.groupby(data_df.id).apply(return_mat_group).show(){quote}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
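
The report says that ArrayType(DoubleType()) did not reproduce the failure, so one possible workaround while staying on pyarrow 0.9.0 would be to return each matrix as a flat list plus its column count, and restore the shape after the grouped map. The following is only a minimal sketch of that idea in plain Python. The helper names flatten_matrix and unflatten_matrix are hypothetical (they are not part of PySpark or PyArrow), and whether this actually avoids the EOFException in a given setup is an assumption that would need to be checked.

```python
def flatten_matrix(matrix):
    """Flatten a rectangular list of non-empty rows into (flat_values, n_cols).

    Hypothetical helper: the flat list fits a single-level array column
    instead of a nested one.
    """
    if not matrix:
        return [], 0
    n_cols = len(matrix[0])
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("matrix rows must all have the same length")
    return [value for row in matrix for value in row], n_cols


def unflatten_matrix(flat_values, n_cols):
    """Rebuild the list of rows from the flat values and the column count."""
    if n_cols == 0:
        return []
    if len(flat_values) % n_cols != 0:
        raise ValueError("flat_values length must be a multiple of n_cols")
    return [flat_values[i:i + n_cols] for i in range(0, len(flat_values), n_cols)]


# Round trip on a small 3x2 matrix.
matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat, n_cols = flatten_matrix(matrix)
restored = unflatten_matrix(flat, n_cols)
```

Under this sketch, the UDF's return schema would presumably declare an ArrayType(DoubleType()) field for the flat values and an IntegerType() field for the column count, in place of the nested ArrayType(ArrayType(DoubleType())) field.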