[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590397#comment-16590397 ]
Wenchen Fan commented on SPARK-25213: ------------------------------------- [~rdblue] I think we have a problem here. In https://github.com/apache/spark/pull/21118 we allow data source to produce internal row instead of unsafe row, assuming Filter/Project is OK with internal row. However, we missed python UDF, which can be in Project/Filter and must work with unsafe row. If we require python UDF to work with internal row, that is probably a perf regression as the Python UDF needs to do an extra unsafe projection. > DataSourceV2 doesn't seem to produce unsafe rows > ------------------------------------------------- > > Key: SPARK-25213 > URL: https://issues.apache.org/jira/browse/SPARK-25213 > Project: Spark > Issue Type: Task > Components: SQL > Affects Versions: 2.4.0 > Reporter: Li Jin > Priority: Major > > Reproduce (Need to compile test-classes): > bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes > {code:java} > datasource_v2_df = spark.read \ > .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") > \ > .load() > result = datasource_v2_df.withColumn('x', udf(lambda x: x, > 'int')(datasource_v2_df['i'])) > result.show() > {code} > The above code fails with: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > {code} > Seems like Data Source V2 doesn't produce unsafeRows here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org