[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-28 Thread Li Jin (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16594973#comment-16594973 ]

Li Jin commented on SPARK-25213:


This is resolved by https://github.com/apache/spark/pull/22104

> DataSourceV2 doesn't seem to produce unsafe rows
> ------------------------------------------------
>
> Key: SPARK-25213
> URL: https://issues.apache.org/jira/browse/SPARK-25213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> To reproduce (requires compiled test classes):
> bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes
> {code:python}
> datasource_v2_df = spark.read \
>     .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") \
>     .load()
> result = datasource_v2_df.withColumn('x', udf(lambda x: x, 'int')(datasource_v2_df['i']))
> result.show()
> {code}
> The above code fails with:
> {code:java}
> Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
>   at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
> {code}
> It appears that Data Source V2 doesn't produce UnsafeRows here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-27 Thread Apache Spark (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593858#comment-16593858 ]

Apache Spark commented on SPARK-25213:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22244




[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-23 Thread Apache Spark (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590585#comment-16590585 ]

Apache Spark commented on SPARK-25213:
--------------------------------------

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/22206




[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-23 Thread Ryan Blue (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590531#comment-16590531 ]

Ryan Blue commented on SPARK-25213:
-----------------------------------

Sorry, I just realized the point is that the filter could contain a Python UDF. In that case, we need to add the project before the filter runs. I'll take a look at it.
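
The rewrite Ryan describes can be sketched in miniature. The toy plan-node classes below are hypothetical stand-ins, not Spark's Catalyst classes; they only model the idea of inserting a row-conversion Project beneath a Filter that contains a Python UDF, so the UDF never sees a non-unsafe row:

```python
# Hypothetical plan nodes, for illustration only (not Spark APIs).
class Scan:
    """Stands in for the v2 scan, which may emit generic (non-unsafe) rows."""

class Project:
    """Stands in for the projection that converts rows to the unsafe format."""
    def __init__(self, child):
        self.child = child

class Filter:
    def __init__(self, child, has_python_udf):
        self.child = child
        self.has_python_udf = has_python_udf

def insert_unsafe_projection(plan):
    """Rewrite the plan, placing a Project between a UDF-bearing Filter
    and a bare Scan beneath it, so conversion happens before the UDF runs."""
    if isinstance(plan, Filter):
        child = insert_unsafe_projection(plan.child)
        if plan.has_python_udf and isinstance(child, Scan):
            child = Project(child)
        return Filter(child, plan.has_python_udf)
    return plan

plan = Filter(Scan(), has_python_udf=True)
rewritten = insert_unsafe_projection(plan)
assert isinstance(rewritten.child, Project)  # conversion now sits below the filter
```

The key point of the sketch is ordering: the conversion node must be a child of the filter, not its parent, because the UDF evaluates while the filter runs.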




[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-23 Thread Ryan Blue (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590406#comment-16590406 ]

Ryan Blue commented on SPARK-25213:
-----------------------------------

[~cloud_fan], that PR adds a Project node on top of the v2 scan so that the rows are converted to unsafe. We should be able to look at the physical plan to see whether the Project is there. If it isn't, we should find out why; if it is, we should find out why it isn't producing unsafe rows.
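
The failure mode and the fix Ryan describes can be modeled in plain Python. Everything below is a hypothetical analogy (the class and function names mimic Spark's but are not its APIs): a downstream operator that unconditionally expects the unsafe row type fails on generic rows, and a projection step inserted above the scan makes it succeed:

```python
# Hypothetical stand-ins for Spark's row classes, for illustration only.
class GenericInternalRow:
    def __init__(self, values):
        self.values = values

class UnsafeRow(GenericInternalRow):
    """Models the compact binary row format downstream operators cast to."""

def v2_scan():
    # The v2 scan is allowed to emit generic rows.
    yield GenericInternalRow([1])
    yield GenericInternalRow([2])

def unsafe_project(rows):
    # Models the Project node the PR adds: convert every row to unsafe.
    for row in rows:
        yield UnsafeRow(row.values)

def eval_python_exec(rows):
    # Models EvalPythonExec, which assumes every input row is an UnsafeRow.
    out = []
    for row in rows:
        if not isinstance(row, UnsafeRow):
            # Analogous to the reported ClassCastException.
            raise TypeError("GenericInternalRow cannot be cast to UnsafeRow")
        out.append(row.values[0])
    return out

# Without the projection the "cast" fails, mirroring the reported stack trace.
try:
    eval_python_exec(v2_scan())
except TypeError:
    pass

# With the projection in place, the query runs.
result = eval_python_exec(unsafe_project(v2_scan()))
```

Checking the physical plan, as Ryan suggests, amounts to verifying that something like `unsafe_project` actually sits between the scan and the Python evaluation.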




[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-23 Thread Wenchen Fan (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590397#comment-16590397 ]

Wenchen Fan commented on SPARK-25213:
-------------------------------------

[~rdblue] I think we have a problem here. In https://github.com/apache/spark/pull/21118 we allowed data sources to produce InternalRow instead of UnsafeRow, assuming Filter/Project can work with InternalRow. However, we missed Python UDFs, which can appear in a Project/Filter and must work with UnsafeRow.

If we require Python UDFs to work with InternalRow, that is likely a performance regression, because the Python UDF would need to do an extra unsafe projection.
