[jira] [Assigned] (SPARK-40946) Introduce a new DataSource V2 interface SupportsPushDownClusterKeys

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40946:


Assignee: Apache Spark

> Introduce a new DataSource V2 interface SupportsPushDownClusterKeys
> ---
>
> Key: SPARK-40946
> URL: https://issues.apache.org/jira/browse/SPARK-40946
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> A mix-in interface for ScanBuilder. Data sources can implement this interface 
> to push down all the join or aggregate keys to the data source. A return value 
> of true indicates that the data source will return input partitions that follow 
> the clustering keys. A return value of false indicates that the data source 
> does not make such a guarantee, even though it may still report a partitioning 
> that may or may not be compatible with the given clustering keys; in that case 
> it is Spark's responsibility to group the input partitions if the clustering 
> can be applied.
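
For readers unfamiliar with the DataSource V2 mix-ins, the following Scala sketch shows roughly what such an interface could look like. It is an illustration only: the method name, parameter type, and return contract below are assumptions based on this description, not the API that was actually committed (see the linked PR for the real interface).

{code}
import org.apache.spark.sql.connector.expressions.Expression
import org.apache.spark.sql.connector.read.ScanBuilder

// Hypothetical rendering of the proposed mix-in; not the committed API.
trait SupportsPushDownClusterKeys extends ScanBuilder {
  // Spark passes the join/aggregate (clustering) keys it would like pushed
  // down. Returning true means the source guarantees that the input
  // partitions it reports follow these keys; returning false means Spark
  // must group the input partitions itself if the reported partitioning
  // allows it.
  def pushDownClusterKeys(clusterKeys: Array[Expression]): Boolean
}
{code}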






[jira] [Commented] (SPARK-40946) Introduce a new DataSource V2 interface SupportsPushDownClusterKeys

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626007#comment-17626007
 ] 

Apache Spark commented on SPARK-40946:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/38434

> Introduce a new DataSource V2 interface SupportsPushDownClusterKeys
> ---
>
> Key: SPARK-40946
> URL: https://issues.apache.org/jira/browse/SPARK-40946
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> A mix-in interface for ScanBuilder. Data sources can implement this interface 
> to push down all the join or aggregate keys to the data source. A return value 
> of true indicates that the data source will return input partitions that follow 
> the clustering keys. A return value of false indicates that the data source 
> does not make such a guarantee, even though it may still report a partitioning 
> that may or may not be compatible with the given clustering keys; in that case 
> it is Spark's responsibility to group the input partitions if the clustering 
> can be applied.






[jira] [Assigned] (SPARK-40946) Introduce a new DataSource V2 interface SupportsPushDownClusterKeys

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40946:


Assignee: (was: Apache Spark)

> Introduce a new DataSource V2 interface SupportsPushDownClusterKeys
> ---
>
> Key: SPARK-40946
> URL: https://issues.apache.org/jira/browse/SPARK-40946
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> A mix-in interface for ScanBuilder. Data sources can implement this interface 
> to push down all the join or aggregate keys to the data source. A return value 
> of true indicates that the data source will return input partitions that follow 
> the clustering keys. A return value of false indicates that the data source 
> does not make such a guarantee, even though it may still report a partitioning 
> that may or may not be compatible with the given clustering keys; in that case 
> it is Spark's responsibility to group the input partitions if the clustering 
> can be applied.






[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40964:
-
Description: 
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
If you try to start the Spark History Server with the shaded client jars and 
enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.

{code}
# spark-env.sh
export 
SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
 
-Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
{code}


{code}
# spark history server's out file
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
at 
org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar imports 
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}
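
A minimal, purely illustrative way to confirm this from a small Scala program (assuming the shaded Hadoop client jars and a javax.servlet-api jar are on the classpath; this check is not part of the original report):

{code}
// Hypothetical diagnostic: inspect which Filter interface the shaded
// AuthenticationFilter class actually implements.
object AuthenticationFilterCheck {
  def main(args: Array[String]): Unit = {
    val cls = Class.forName(
      "org.apache.hadoop.security.authentication.server.AuthenticationFilter")
    // List the interfaces implemented by the loaded class.
    cls.getInterfaces.foreach(i => println(i.getName))
    // Prints false when the class only implements the relocated
    // org.apache.hadoop.shaded.javax.servlet.Filter, which is exactly why
    // Jetty's FilterHolder rejects it in the stack trace above.
    println(classOf[javax.servlet.Filter].isAssignableFrom(cls))
  }
}
{code}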

This mismatch causes the exception mentioned above.

I'm not sure what the best answer is.
A workaround is to avoid the Spark distribution pre-built for Apache Hadoop and 
instead specify `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the 
Spark History Server.

The possible options seem to be:
- Do not shade "javax.servlet.Filter" in the Hadoop shaded jar
- Or also shade "javax.servlet.Filter" in Jetty.

  was:
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
If you try to start the Spark History Server with the shaded client jars and 
enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.

{code}
export 
SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
 
-Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
{code}


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: cla

[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40964:
-
Description: 
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
If you try to start the Spark History Server with the shaded client jars and 
enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.

{code}
export 
SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
 
-Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
{code}


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
at 
org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar imports 
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}

This mismatch causes the exception mentioned above.

I'm not sure what the best answer is.
A workaround is to avoid the Spark distribution pre-built for Apache Hadoop and 
instead specify `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the 
Spark History Server.

The possible options seem to be:
- Do not shade "javax.servlet.Filter" in the Hadoop shaded jar
- Or also shade "javax.servlet.Filter" in Jetty.

  was:
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
If you try to start the Spark History Server with the shaded client jars and 
enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
  

[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40964:
-
Description: 
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
If you try to start the Spark History Server with the shaded client jars and 
enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
at 
org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar imports 
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}

This mismatch causes the exception mentioned above.

I'm not sure what the best answer is.
A workaround is to avoid the Spark distribution pre-built for Apache Hadoop and 
instead specify `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the 
Spark History Server.

The possible options seem to be:
- Do not shade "javax.servlet.Filter" in the Hadoop shaded jar
- Or also shade "javax.servlet.Filter" in Jetty.

  was:
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
In this situation, if you try to start the Spark History Server with the shaded 
client jars and enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletHandler.initi

[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40964:
-
Description: 
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
In this situation, if you try to start the Spark History Server with the shaded 
client jars and enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
at 
org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar imports 
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}

This mismatch causes the exception mentioned above.

I'm not sure what the best answer is.
A workaround is to avoid the Spark distribution pre-built for Apache Hadoop and 
instead specify `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the 
Spark History Server.

The possible options seem to be:
- Do not shade "javax.servlet.Filter" in the Hadoop shaded jar
- Or also shade "javax.servlet.Filter" in Jetty.

  was:
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
In this situation, if you try to start the Spark History Server with the shaded 
client jars and enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletH

[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YUBI LEE updated SPARK-40964:
-
Description: 
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
In this situation, if you try to start the Spark History Server with the shaded 
client jars and enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
at 
org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar imports 
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}

This mismatch causes the exception mentioned above.

I'm not sure what the best answer is.
A workaround is to avoid the Spark distribution pre-built for Apache Hadoop and 
instead specify `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the 
Spark History Server.

The possible options seem to be:
- Do not shade "javax.servlet.Filter" in the Hadoop shaded jar
- Or also shade "javax.servlet.Filter" in Jetty.

  was:
Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
In this situation, if you try to start the Spark History Server with the shaded 
client jars and enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.Se

[jira] [Created] (SPARK-40964) Cannot run spark history server with shaded hadoop jar

2022-10-28 Thread YUBI LEE (Jira)
YUBI LEE created SPARK-40964:


 Summary: Cannot run spark history server with shaded hadoop jar
 Key: SPARK-40964
 URL: https://issues.apache.org/jira/browse/SPARK-40964
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.2.2
Reporter: YUBI LEE


Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+.
In this situation, if you try to start the Spark History Server with the shaded 
client jars and enable security using 
org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will 
hit the following exception.


{code}
22/10/27 15:29:48 INFO AbstractConnector: Started 
ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on 
port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: 
org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class 
org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a 
javax.servlet.Filter
at 
org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at 
org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at 
java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at 
org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
at 
org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
at 
org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
at 
org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
at 
org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}


I think "AuthenticationFilter" in the shaded jar imports 
"org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter".
This causes the exception shown above.

I'm not sure what the best answer is.
A workaround is to avoid the Spark distribution pre-built for Apache Hadoop and 
instead specify `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the 
Spark History Server.

The possible options seem to be:
- Do not shade "javax.servlet.Filter" in the Hadoop shaded jar
- Or also shade "javax.servlet.Filter" in Jetty.






[jira] [Updated] (SPARK-40963) containsNull in array type attributes is not updated from child output

2022-10-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-40963:
--
Affects Version/s: 3.1.3

> containsNull in array type attributes is not updated from child output
> --
>
> Key: SPARK-40963
> URL: https://issues.apache.org/jira/browse/SPARK-40963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.2, 3.4.0, 3.3.1
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Example:
> {noformat}
> select c1, explode(c4) as c5 from (
>   select c1, array(c3) as c4 from (
> select c1, explode_outer(c2) as c3
> from values
> (1, array(1, 2)),
> (2, array(2, 3)),
> (3, null)
> as data(c1, c2)
>   )
> );
> +---+---+
> |c1 |c5 |
> +---+---+
> |1  |1  |
> |1  |2  |
> |2  |2  |
> |2  |3  |
> |3  |0  |
> +---+---+
> {noformat}
> In the last row, {{c5}} is 0, but should be {{NULL}}.
> At the time {{ResolveGenerate.makeGeneratorOutput}} is called for 
> {{explode(c4)}}, {{c3}} has nullable set to false, so {{c4}}'s data type has 
> {{containsNull}} also set to false. Later, {{c3}}'s nullability is updated 
> and c4's data type reports containsNull = true, but two things go wrong:
> * The {{containsNull}} setting for {{c4}} is not propagated to parent 
> operators (so the attribute {{c4}} in {{explode(c4)}} still has containsNull 
> = false)
> * Even if it were propagated, {{generatorOutput}} for {{explode(c4)}} is 
> already determined and won't be recalculated.
> Another example:
> {noformat}
> select c1, inline_outer(c4) from (
>   select c1, array(c3) as c4 from (
> select c1, explode_outer(c2) as c3
> from values
> (1, array(named_struct('a', 1, 'b', 2))),
> (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
> (3, null)
> as data(c1, c2)
>   )
> );
> 22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>   
> -
> -
> -
> {noformat}






[jira] [Updated] (SPARK-40963) containsNull in array type attributes is not updated from child output

2022-10-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-40963:
--
Affects Version/s: 3.2.2

> containsNull in array type attributes is not updated from child output
> --
>
> Key: SPARK-40963
> URL: https://issues.apache.org/jira/browse/SPARK-40963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.2, 3.4.0, 3.3.1
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Example:
> {noformat}
> select c1, explode(c4) as c5 from (
>   select c1, array(c3) as c4 from (
> select c1, explode_outer(c2) as c3
> from values
> (1, array(1, 2)),
> (2, array(2, 3)),
> (3, null)
> as data(c1, c2)
>   )
> );
> +---+---+
> |c1 |c5 |
> +---+---+
> |1  |1  |
> |1  |2  |
> |2  |2  |
> |2  |3  |
> |3  |0  |
> +---+---+
> {noformat}
> In the last row, {{c5}} is 0, but should be {{NULL}}.
> At the time {{ResolveGenerate.makeGeneratorOutput}} is called for 
> {{explode(c4)}}, {{c3}} has nullable set to false, so {{c4}}'s data type has 
> {{containsNull}} also set to false. Later, {{c3}}'s nullability is updated 
> and c4's data type reports containsNull = true, but two things go wrong:
> * The {{containsNull}} setting for {{c4}} is not propagated to parent 
> operators (so the attribute {{c4}} in {{explode(c4)}} still has containsNull 
> = false)
> * Even if it were propagated, {{generatorOutput}} for {{explode(c4)}} is 
> already determined and won't be recalculated.
> Another example:
> {noformat}
> select c1, inline_outer(c4) from (
>   select c1, array(c3) as c4 from (
> select c1, explode_outer(c2) as c3
> from values
> (1, array(named_struct('a', 1, 'b', 2))),
> (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
> (3, null)
> as data(c1, c2)
>   )
> );
> 22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>   
> -
> -
> -
> {noformat}






[jira] [Updated] (SPARK-40963) containsNull in array type attributes is not updated from child output

2022-10-28 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-40963:
--
Affects Version/s: 3.3.1

> containsNull in array type attributes is not updated from child output
> --
>
> Key: SPARK-40963
> URL: https://issues.apache.org/jira/browse/SPARK-40963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Example:
> {noformat}
> select c1, explode(c4) as c5 from (
>   select c1, array(c3) as c4 from (
> select c1, explode_outer(c2) as c3
> from values
> (1, array(1, 2)),
> (2, array(2, 3)),
> (3, null)
> as data(c1, c2)
>   )
> );
> +---+---+
> |c1 |c5 |
> +---+---+
> |1  |1  |
> |1  |2  |
> |2  |2  |
> |2  |3  |
> |3  |0  |
> +---+---+
> {noformat}
> In the last row, {{c5}} is 0, but should be {{NULL}}.
> At the time {{ResolveGenerate.makeGeneratorOutput}} is called for 
> {{explode(c4)}}, {{c3}} has nullable set to false, so {{c4}}'s data type has 
> {{containsNull}} also set to false. Later, {{c3}}'s nullability is updated 
> and c4's data type reports containsNull = true, but two things go wrong:
> * The {{containsNull}} setting for {{c4}} is not propagated to parent 
> operators (so the attribute {{c4}} in {{explode(c4)}} still has containsNull 
> = false)
> * Even if it were propagated, {{generatorOutput}} for {{explode(c4)}} is 
> already determined and won't be recalculated.
> Another example:
> {noformat}
> select c1, inline_outer(c4) from (
>   select c1, array(c3) as c4 from (
> select c1, explode_outer(c2) as c3
> from values
> (1, array(named_struct('a', 1, 'b', 2))),
> (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
> (3, null)
> as data(c1, c2)
>   )
> );
> 22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>   
> -
> -
> -
> {noformat}






[jira] [Commented] (SPARK-40963) containsNull in array type attributes is not updated from child output

2022-10-28 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625955#comment-17625955
 ] 

Bruce Robbins commented on SPARK-40963:
---

I'll take a stab at fixing this in the next few days.

> containsNull in array type attributes is not updated from child output
> --
>
> Key: SPARK-40963
> URL: https://issues.apache.org/jira/browse/SPARK-40963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Example:
> {noformat}
> select c1, explode(c4) as c5 from (
>   select c1, array(c3) as c4 from (
> select c1, explode_outer(c2) as c3
> from values
> (1, array(1, 2)),
> (2, array(2, 3)),
> (3, null)
> as data(c1, c2)
>   )
> );
> +---+---+
> |c1 |c5 |
> +---+---+
> |1  |1  |
> |1  |2  |
> |2  |2  |
> |2  |3  |
> |3  |0  |
> +---+---+
> {noformat}
> In the last row, {{c5}} is 0, but should be {{NULL}}.
> At the time {{ResolveGenerate.makeGeneratorOutput}} is called for 
> {{explode(c4)}}, {{c3}} has nullable set to false, so {{c4}}'s data type has 
> {{containsNull}} also set to false. Later, {{c3}}'s nullability is updated 
> and c4's data type reports containsNull = true, but two things go wrong:
> * The {{containsNull}} setting for {{c4}} is not propagated to parent 
> operators (so the attribute {{c4}} in {{explode(c4)}} still has containsNull 
> = false)
> * Even if it were propagated, {{generatorOutput}} for {{explode(c4)}} is 
> already determined and won't be recalculated.
> Another example:
> {noformat}
> select c1, inline_outer(c4) from (
>   select c1, array(c3) as c4 from (
> select c1, explode_outer(c2) as c3
> from values
> (1, array(named_struct('a', 1, 'b', 2))),
> (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
> (3, null)
> as data(c1, c2)
>   )
> );
> 22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>   
> -
> -
> -
> {noformat}






[jira] [Created] (SPARK-40963) containsNull in array type attributes is not updated from child output

2022-10-28 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-40963:
-

 Summary: containsNull in array type attributes is not updated from 
child output
 Key: SPARK-40963
 URL: https://issues.apache.org/jira/browse/SPARK-40963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Bruce Robbins


Example:
{noformat}
select c1, explode(c4) as c5 from (
  select c1, array(c3) as c4 from (
select c1, explode_outer(c2) as c3
from values
(1, array(1, 2)),
(2, array(2, 3)),
(3, null)
as data(c1, c2)
  )
);

+---+---+
|c1 |c5 |
+---+---+
|1  |1  |
|1  |2  |
|2  |2  |
|2  |3  |
|3  |0  |
+---+---+
{noformat}
In the last row, {{c5}} is 0, but should be {{NULL}}.

At the time {{ResolveGenerate.makeGeneratorOutput}} is called for 
{{explode(c4)}}, {{c3}} has nullable set to false, so {{c4}}'s data type has 
{{containsNull}} also set to false. Later, {{c3}}'s nullability is updated and 
c4's data type reports containsNull = true, but two things go wrong:

* The {{containsNull}} setting for {{c4}} is not propagated to parent operators 
(so the attribute {{c4}} in {{explode(c4)}} still has containsNull = false)
* Even if it were propagated, {{generatorOutput}} for {{explode(c4)}} is 
already determined and won't be recalculated.

Another example:
{noformat}
select c1, inline_outer(c4) from (
  select c1, array(c3) as c4 from (
select c1, explode_outer(c2) as c3
from values
(1, array(named_struct('a', 1, 'b', 2))),
(2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
(3, null)
as data(c1, c2)
  )
);

22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)

-
-
-
{noformat}
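
As a reproduction aid (not a fix), the first example above can be pasted into spark-shell as follows; it assumes the usual {{spark}} SparkSession provided by the shell:

{code}
// Scala reproduction of the first query above, runnable in spark-shell.
val inner = spark.sql("""
  select c1, array(c3) as c4 from (
    select c1, explode_outer(c2) as c3
    from values (1, array(1, 2)), (2, array(2, 3)), (3, null) as data(c1, c2)
  )""")
// Inspect how the analyzer reports c4's element nullability (containsNull).
inner.printSchema()
// Per the report, the last row comes back as (3, 0) instead of (3, NULL),
// because the generator output for explode(c4) was fixed while containsNull
// was still false.
inner.selectExpr("c1", "explode(c4) as c5").show(false)
{code}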







[jira] [Commented] (SPARK-40939) Release a shaded version of Apache Spark / shade jars on main jar

2022-10-28 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625953#comment-17625953
 ] 

Erik Krogen commented on SPARK-40939:
-

As a reference for prior work, there is also HADOOP-11656, in which Hadoop 
began publishing a new {{hadoop-client-runtime}} JAR into which all of the 
transitive dependencies are shaded. In the 
[proposal|https://issues.apache.org/jira/secure/attachment/12709266/HADOOP-11656_proposal.md]
 a technique similar to Flink's was proposed and eventually rejected due to 
higher maintenance burden to publish separate artifacts for each downstream 
library that is shaded.

There are some pitfalls that come with Spark being a Scala project, unlike 
Hadoop/Flink which are Java based. Most shading tools cannot handle certain 
Scala language elements, specifically {{ScalaSig}} causes problems because 
shading tools that are not Scala-aware do not perform relocations within the 
{{ScalaSig}} (see examples 
[one|https://github.com/coursier/coursier/issues/454#issuecomment-288969207] 
and [two|https://lists.apache.org/thread/x7b4z0os9zbzzprb5scft7b4wnr7c3mv] and 
[this previous Spark PR that tried to shade 
Jackson|https://github.com/apache/spark/pull/10931]). That being said, 
{{sbt}}'s [assembly plugin has had support for this since 
2020|https://github.com/sbt/sbt-assembly/pull/393], and this functionality was 
subsequently pulled out into a standalone library, [Jar Jar 
Abrams|http://eed3si9n.com/jarjar-abrams/]. So there is hope that this should 
be more achievable now than it was back in 2016 when that PR was filed. There's 
also been [interest in shading all of Spark's dependencies on the Spark 
dev-list|https://lists.apache.org/thread/vkkx8s2zv0ln7j7oo46k30x084mn163p].
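
To make the ScalaSig point concrete, here is a hedged build.sbt sketch (my illustration, not something proposed in this thread): with a ScalaSig-aware sbt-assembly / Jar Jar Abrams, a shaded Spark build would express relocations roughly like this. The target package names and the choice of rules are invented:

{code}
// build.sbt fragment (illustrative only)
import sbtassembly.AssemblyPlugin.autoImport._

assembly / assemblyShadeRules := Seq(
  // Relocate Jackson and Guava into a private namespace. A ScalaSig-aware
  // assembler also rewrites these references inside Scala pickled
  // signatures, which is what older shading tools missed.
  ShadeRule.rename("com.fasterxml.jackson.**" -> "org.sparkproject.jackson.@1").inAll,
  ShadeRule.rename("com.google.common.**" -> "org.sparkproject.guava.@1").inAll
)
{code}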

I would love to hear what the community thinks of pursuing this earnestly with 
the tools available in 2022, though [~almogtavor] I'll note that this type of 
large change is better discussed on the dev mailing list (and probably an 
accompanying SPIP).

> Release a shaded version of Apache Spark / shade jars on main jar
> -
>
> Key: SPARK-40939
> URL: https://issues.apache.org/jira/browse/SPARK-40939
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.4.0
>Reporter: Almog Tavor
>Priority: Major
>
> I suggest shading dependencies in Apache Spark to resolve the dependency hell 
> that may occur when building / deploying Apache Spark. This mainly affects Java 
> projects and Hadoop environments, but shading would also help when using Spark 
> with Scala and even Python.
> Flink has a similar solution, delivering 
> [flink-shaded|https://github.com/apache/flink-shaded/blob/master/README.md].
> The dependencies I think that are relevant for shading are Jackson, Guava, 
> Netty & any of the Hadoop ecosystems if possible.
> As for releasing sources for the shaded version, I think the [issue that has 
> been raised in Flink|https://github.com/apache/flink-shaded/issues/25] is 
> relevant and unanswered here too, hence I don't think that's an option 
> currently (personally I don't see any value for it either).






[jira] [Assigned] (SPARK-40943) Make MSCK optional in MSCK REPAIR TABLE commands

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40943:


Assignee: Apache Spark

> Make MSCK optional in MSCK REPAIR TABLE commands
> 
>
> Key: SPARK-40943
> URL: https://issues.apache.org/jira/browse/SPARK-40943
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Ben Zhang
>Assignee: Apache Spark
>Priority: Major
>
> The current syntax for `MSCK REPAIR TABLE` is complex and difficult to 
> understand. The proposal is to make the `MSCK` keyword optional so that 
> `REPAIR TABLE` may be used in its stead.
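
For illustration, once the change lands the two statements below would be equivalent; the table name is made up and the shorter form is hypothetical until the proposal is merged:

{code}
spark.sql("MSCK REPAIR TABLE db.events")  // existing syntax
spark.sql("REPAIR TABLE db.events")       // proposed shorthand with MSCK optional
{code}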






[jira] [Assigned] (SPARK-40943) Make MSCK optional in MSCK REPAIR TABLE commands

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40943:


Assignee: (was: Apache Spark)

> Make MSCK optional in MSCK REPAIR TABLE commands
> 
>
> Key: SPARK-40943
> URL: https://issues.apache.org/jira/browse/SPARK-40943
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Ben Zhang
>Priority: Major
>
> The current syntax for `MSCK REPAIR TABLE` is complex and difficult to 
> understand. The proposal is to make the `MSCK` keyword optional so that 
> `REPAIR TABLE` may be used in its stead.






[jira] [Commented] (SPARK-40943) Make MSCK optional in MSCK REPAIR TABLE commands

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625952#comment-17625952
 ] 

Apache Spark commented on SPARK-40943:
--

User 'ben-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38433

> Make MSCK optional in MSCK REPAIR TABLE commands
> 
>
> Key: SPARK-40943
> URL: https://issues.apache.org/jira/browse/SPARK-40943
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Ben Zhang
>Priority: Major
>
> The current syntax for `MSCK REPAIR TABLE` is complex and difficult to 
> understand. The proposal is to make the `MSCK` keyword optional so that 
> `REPAIR TABLE` may be used in its stead.






[jira] [Commented] (SPARK-36124) Support set operators to be on correlation paths

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625881#comment-17625881
 ] 

Apache Spark commented on SPARK-36124:
--

User 'jchen5' has created a pull request for this issue:
https://github.com/apache/spark/pull/38432

> Support set operators to be on correlation paths
> 
>
> Key: SPARK-36124
> URL: https://issues.apache.org/jira/browse/SPARK-36124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> A correlation path is defined as the sub-tree of all the operators that are 
> on the path from the operator hosting the correlated expressions up to the 
> operator producing the correlated values. 
> We want to support set operators such as union and intersect to be on 
> correlation paths by adding support for them in DecorrelateInnerQuery. Please see page 
> 391 for more details: 
> [https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf] 
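
To illustrate the query shape (table and column names below are invented), a set operator is on the correlation path when the outer reference is used underneath a UNION inside the subquery:

{code}
// Illustrative only: t.id is referenced on both sides of a UNION ALL inside
// the correlated subquery, so the union sits on the correlation path and
// would need to be handled by DecorrelateInnerQuery.
spark.sql("""
  SELECT t.id,
         (SELECT COUNT(*)
          FROM (SELECT a.x FROM a WHERE a.id = t.id
                UNION ALL
                SELECT b.x FROM b WHERE b.id = t.id) u) AS cnt
  FROM t
""")
{code}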






[jira] [Commented] (SPARK-36124) Support set operators to be on correlation paths

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625879#comment-17625879
 ] 

Apache Spark commented on SPARK-36124:
--

User 'jchen5' has created a pull request for this issue:
https://github.com/apache/spark/pull/38432

> Support set operators to be on correlation paths
> 
>
> Key: SPARK-36124
> URL: https://issues.apache.org/jira/browse/SPARK-36124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> A correlation path is defined as the sub-tree of all the operators that are 
> on the path from the operator hosting the correlated expressions up to the 
> operator producing the correlated values. 
> We want to support set operators such as union and intersect to be on 
> correlation paths by adding support for them in DecorrelateInnerQuery. Please see page 
> 391 for more details: 
> [https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf] 






[jira] [Assigned] (SPARK-36124) Support set operators to be on correlation paths

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36124:


Assignee: Apache Spark

> Support set operators to be on correlation paths
> 
>
> Key: SPARK-36124
> URL: https://issues.apache.org/jira/browse/SPARK-36124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> A correlation path is defined as the sub-tree of all the operators that are 
> on the path from the operator hosting the correlated expressions up to the 
> operator producing the correlated values. 
> We want to support set operators such as union and intersect to be on 
> correlation paths by adding them in DecorrelateInnerQuery. Please see page 
> 391 for more details: 
> [https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36124) Support set operators to be on correlation paths

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36124:


Assignee: (was: Apache Spark)

> Support set operators to be on correlation paths
> 
>
> Key: SPARK-36124
> URL: https://issues.apache.org/jira/browse/SPARK-36124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> A correlation path is defined as the sub-tree of all the operators that are 
> on the path from the operator hosting the correlated expressions up to the 
> operator producing the correlated values. 
> We want to support set operators such as union and intersect to be on 
> correlation paths by adding them in DecorrelateInnerQuery. Please see page 
> 391 for more details: 
> [https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39319) Make query context as part of SparkThrowable

2022-10-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-39319:
---
Parent: SPARK-38615
Issue Type: Sub-task  (was: Improvement)

> Make query context as part of SparkThrowable
> 
>
> Key: SPARK-39319
> URL: https://issues.apache.org/jira/browse/SPARK-39319
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39365) Truncate fragment of query context if it is too long

2022-10-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-39365:
---
Parent: (was: SPARK-39319)
Issue Type: Task  (was: Sub-task)

> Truncate fragment of query context if it is too long
> 
>
> Key: SPARK-39365
> URL: https://issues.apache.org/jira/browse/SPARK-39365
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625853#comment-17625853
 ] 

Apache Spark commented on SPARK-40956:
--

User 'carlfu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/38404

> SQL Equivalent for Dataframe overwrite command
> --
>
> Key: SPARK-40956
> URL: https://issues.apache.org/jira/browse/SPARK-40956
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: chengyan fu
>Priority: Minor
>
> Proposing syntax
> {code:java}
>  INSERT INTO tbl REPLACE whereClause identifierList{code}
> to Spark SQL, as the equivalent of the 
> [dataframe.overwrite()|https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L163]
>  command. 
>  
> For Example
> {code:java}
>  INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code}
> will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows 
> from table2 
>  
>  
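
For comparison, a rough sketch (table names reused from the example above, assuming a SparkSession named spark) of the existing DataFrameWriterV2 call that the proposed SQL syntax would mirror:

{code:scala}
import org.apache.spark.sql.functions.col

// DataFrameWriterV2.overwrite atomically replaces the rows matching the
// condition with the contents of the source DataFrame -- the behaviour the
// proposed INSERT INTO ... REPLACE WHERE syntax would expose in SQL.
spark.table("table2")
  .writeTo("table1")
  .overwrite(col("key") === 3)
{code}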



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40956:


Assignee: (was: Apache Spark)

> SQL Equivalent for Dataframe overwrite command
> --
>
> Key: SPARK-40956
> URL: https://issues.apache.org/jira/browse/SPARK-40956
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: chengyan fu
>Priority: Minor
>
> Proposing syntax
> {code:java}
>  INSERT INTO tbl REPLACE whereClause identifierList{code}
> to Spark SQL, as the equivalent of the 
> [dataframe.overwrite()|https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L163]
>  command. 
>  
> For Example
> {code:java}
>  INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code}
> will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows 
> from table2 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40956:


Assignee: Apache Spark

> SQL Equivalent for Dataframe overwrite command
> --
>
> Key: SPARK-40956
> URL: https://issues.apache.org/jira/browse/SPARK-40956
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: chengyan fu
>Assignee: Apache Spark
>Priority: Minor
>
> Proposing syntax
> {code:java}
>  INSERT INTO tbl REPLACE whereClause identifierList{code}
> to Spark SQL, as the equivalent of the 
> [dataframe.overwrite()|https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L163]
>  command. 
>  
> For Example
> {code:java}
>  INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code}
> will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows 
> from table2 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625851#comment-17625851
 ] 

Apache Spark commented on SPARK-40956:
--

User 'carlfu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/38404

> SQL Equivalent for Dataframe overwrite command
> --
>
> Key: SPARK-40956
> URL: https://issues.apache.org/jira/browse/SPARK-40956
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: chengyan fu
>Priority: Minor
>
> Proposing syntax
> {code:java}
>  INSERT INTO tbl REPLACE whereClause identifierList{code}
> to Spark SQL, as the equivalent of the 
> [dataframe.overwrite()|https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L163]
>  command. 
>  
> For Example
> {code:java}
>  INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code}
> will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows 
> from table2 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread chengyan fu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chengyan fu updated SPARK-40956:

Description: 
Proposing syntax
{code:java}
 INSERT INTO tbl REPLACE whereClause identifierList{code}
to Spark SQL, as the equivalent of the 
[dataframe.overwrite()|https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L163]
 command. 

 

For Example
{code:java}
 INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code}
will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows 
from table2 

 

 

  was:
Proposing syntax
{code:java}
 INSERT INTO tbl REPLACE whereClause identifierList{code}
to the spark SQL, as the equivalent of dataframe overwrite() command. 

 

For Example
{code:java}
 INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code}
will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows 
from table2 

 

 


> SQL Equivalent for Dataframe overwrite command
> --
>
> Key: SPARK-40956
> URL: https://issues.apache.org/jira/browse/SPARK-40956
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: chengyan fu
>Priority: Minor
>
> Proposing syntax
> {code:java}
>  INSERT INTO tbl REPLACE whereClause identifierList{code}
> to Spark SQL, as the equivalent of the 
> [dataframe.overwrite()|https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L163]
>  command. 
>  
> For Example
> {code:java}
>  INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code}
> will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows 
> from table2 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38615) SQL Error Attribution Framework

2022-10-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-38615:
---
Summary: SQL Error Attribution Framework  (was: Provide error context for 
runtime ANSI failures)

> SQL Error Attribution Framework
> ---
>
> Key: SPARK-38615
> URL: https://issues.apache.org/jira/browse/SPARK-38615
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently,  there is not enough error context for runtime ANSI failures.
> In the following example, the error message only says that there is a 
> "divide by zero" error, without pointing out which part of the SQL statement caused it.
> {code:java}
> > SELECT
>   ss1.ca_county,
>   ss1.d_year,
>   ws2.web_sales / ws1.web_sales web_q1_q2_increase,
>   ss2.store_sales / ss1.store_sales store_q1_q2_increase,
>   ws3.web_sales / ws2.web_sales web_q2_q3_increase,
>   ss3.store_sales / ss2.store_sales store_q2_q3_increase
> FROM
>   ss ss1, ss ss2, ss ss3, ws ws1, ws ws2, ws ws3
> WHERE
>   ss1.d_qoy = 1
> AND ss1.d_year = 2000
> AND ss1.ca_county = ss2.ca_county
> AND ss2.d_qoy = 2
> AND ss2.d_year = 2000
> AND ss2.ca_county = ss3.ca_county
> AND ss3.d_qoy = 3
> AND ss3.d_year = 2000
> AND ss1.ca_county = ws1.ca_county
> AND ws1.d_qoy = 1
> AND ws1.d_year = 2000
> AND ws1.ca_county = ws2.ca_county
> AND ws2.d_qoy = 2
> AND ws2.d_year = 2000
> AND ws1.ca_county = ws3.ca_county
> AND ws3.d_qoy = 3
> AND ws3.d_year = 2000
> AND CASE WHEN ws1.web_sales > 0
> THEN ws2.web_sales / ws1.web_sales
> ELSE NULL END
> > CASE WHEN ss1.store_sales > 0
> THEN ss2.store_sales / ss1.store_sales
>   ELSE NULL END
> AND CASE WHEN ws2.web_sales > 0
> THEN ws3.web_sales / ws2.web_sales
> ELSE NULL END
> > CASE WHEN ss2.store_sales > 0
> THEN ss3.store_sales / ss2.store_sales
>   ELSE NULL END
> ORDER BY ss1.ca_county
>  {code}
> {code:java}
> org.apache.spark.SparkArithmeticException: divide by zero at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:140)
>  at 
> org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:437)
>  at 
> org.apache.spark.sql.catalyst.expressions.DivModLike.eval$(arithmetic.scala:425)
>  at 
> org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:534)
> {code}
>  
> I suggest that we provide details in the error message,  including:
>  * the problematic expression from the original SQL query, e.g. 
> "ss3.store_sales / ss2.store_sales store_q2_q3_increase"
>  * the line number and starting char position of the problematic expression, 
> in case of queries like "select a + b from t1 union select a + b from t2"
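
As a minimal illustration (a sketch, not from the ticket), even a trivial ANSI-mode division reproduces the problem described above: the exception names the error but not the offending fragment of the query text:

{code:scala}
// Sketch: reproduce a context-free "divide by zero" error under ANSI mode.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 1 / 0 AS x").collect()
// At the time of this report this raised
// org.apache.spark.SparkArithmeticException: divide by zero
// without pointing back to the "1 / 0" fragment or its position in the query.
{code}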



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread chengyan fu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chengyan fu updated SPARK-40956:

Description: 
Proposing syntax
{code:java}
 INSERT INTO tbl REPLACE whereClause identifierList{code}
to the spark SQL, as the equivalent of dataframe overwrite() command. 

 

For Example
{code:java}
 INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code}
will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows 
from table2 

 

 

  was:
Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
the spark SQL, as the equivalent of dataframe overwrite command. 

 

For Example

```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an 
atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 

 

 


> SQL Equivalent for Dataframe overwrite command
> --
>
> Key: SPARK-40956
> URL: https://issues.apache.org/jira/browse/SPARK-40956
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: chengyan fu
>Priority: Minor
>
> Proposing syntax
> {code:java}
>  INSERT INTO tbl REPLACE whereClause identifierList{code}
> to the spark SQL, as the equivalent of dataframe overwrite() command. 
>  
> For Example
> {code:java}
>  INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code}
> will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows 
> from table2 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-40686) Support data masking and redacting built-in functions

2022-10-28 Thread Vinod KC (Jira)


[ https://issues.apache.org/jira/browse/SPARK-40686 ]


Vinod KC deleted comment on SPARK-40686:
--

was (Author: vinodkc):
I'm working on these 6 sub-tasks

> Support data masking and redacting built-in functions
> -
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking and redacting functions 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40686) Support data masking and redacting built-in functions

2022-10-28 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625827#comment-17625827
 ] 

Daniel commented on SPARK-40686:


> Daniel , Yes, please add them here. Let us change the description and title 
> of this parent Jira accordingly.

(y) I added a few subtasks for new functions. 

> Support data masking and redacting built-in functions
> -
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking and redacting functions 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40686) Support data masking and redacting built-in functions

2022-10-28 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625827#comment-17625827
 ] 

Daniel edited comment on SPARK-40686 at 10/28/22 5:40 PM:
--

bq. Daniel , Yes, please add them here. Let us change the description and title 
of this parent Jira accordingly.

(y) I added a few subtasks for new functions. 


was (Author: JIRAUSER285772):
> Daniel , Yes, please add them here. Let us change the description and title 
> of this parent Jira accordingly.

(y) I added a few subtasks for new functions. 

> Support data masking and redacting built-in functions
> -
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking and redacting functions 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40918) Mismatch between ParquetFileFormat and FileSourceScanExec in # columns for WSCG.isTooManyFields when using _metadata

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625826#comment-17625826
 ] 

Apache Spark commented on SPARK-40918:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/38431

> Mismatch between ParquetFileFormat and FileSourceScanExec in # columns for 
> WSCG.isTooManyFields when using _metadata
> 
>
> Key: SPARK-40918
> URL: https://issues.apache.org/jira/browse/SPARK-40918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> _metadata.columns are taken into account in 
> FileSourceScanExec.supportColumnar, but not when the parquet reader is 
> created. This can result in the Parquet reader outputting columnar batches (because it 
> has fewer columns than the WSCG.isTooManyFields limit), whereas FileSourceScanExec wants 
> row output (because with the extra metadata columns it hits the 
> isTooManyFields limit).
> I have a fix forthcoming.
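
A rough sketch of the shape of the mismatch (the path and column count are assumptions, not taken from the ticket):

{code:scala}
import org.apache.spark.sql.functions.col

// A Parquet table whose column count is just under spark.sql.codegen.maxFields
// (default 100) crosses the limit once the _metadata struct is also selected,
// so FileSourceScanExec plans row output while the Parquet reader, which was
// sized without _metadata, still produces columnar batches.
val wide = spark.read.parquet("/tmp/wide_table") // assume ~95-100 data columns
wide.select(col("*"), col("_metadata")).collect()
{code}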



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40962) Support data masking built-in function 'diff_privacy'

2022-10-28 Thread Daniel (Jira)
Daniel created SPARK-40962:
--

 Summary: Support data masking built-in function 'diff_privacy'
 Key: SPARK-40962
 URL: https://issues.apache.org/jira/browse/SPARK-40962
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel


This function can compute the Laplace distribution for integer values.

It may be sufficient to implement a basic version of this for integer values 
only; full differential privacy support remains a separate effort.

Background on the math:

https://en.wikipedia.org/wiki/Laplace_distribution
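
For reference, a minimal sketch (not from the ticket) of drawing Laplace noise via inverse-CDF sampling, the usual building block behind such a function:

{code:scala}
import scala.util.Random

// Laplace(mu, b): x = mu - b * sign(u) * ln(1 - 2|u|), with u uniform on (-0.5, 0.5).
def laplaceNoise(mu: Double, b: Double, rng: Random = new Random()): Double = {
  val u = rng.nextDouble() - 0.5
  mu - b * math.signum(u) * math.log(1 - 2 * math.abs(u))
}

// A possible shape for the integer-only version: perturb a count and round back.
def noisyCount(count: Long, scale: Double): Long =
  math.round(count + laplaceNoise(0.0, scale))
{code}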



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40961) Support data masking built-in functions for HIPAA compliant age/date/birthday/zipcode display

2022-10-28 Thread Daniel (Jira)
Daniel created SPARK-40961:
--

 Summary: Support data masking built-in functions for HIPAA 
compliant age/date/birthday/zipcode display
 Key: SPARK-40961
 URL: https://issues.apache.org/jira/browse/SPARK-40961
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel


For example, these could be named `phi_age`, `phi_date`, `phi_dob`, `phi_zip3`, 
respectively.
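
As a rough illustration (column and table names are made up; the phi_* names above are only proposals), the safe-harbor style of masking can already be approximated with existing functions:

{code:scala}
// Sketch: keep only the first three ZIP digits and cap reported ages at 90,
// two transformations commonly used for HIPAA safe-harbor de-identification.
spark.sql("""
  SELECT
    concat(substring(zip, 1, 3), 'XX')       AS zip3,
    CASE WHEN age >= 90 THEN 90 ELSE age END AS safe_age
  FROM patients
""")
{code}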



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40686) Support data masking and redacting built-in functions

2022-10-28 Thread Vinod KC (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-40686:
-
Description: Support built-in data masking and redacting functions   (was: 
Support built-in data masking functions )

> Support data masking and redacting built-in functions
> -
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking and redacting functions 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40960) Support data masking built-in functions 'bcrypt'

2022-10-28 Thread Daniel (Jira)
Daniel created SPARK-40960:
--

 Summary: Support data masking built-in functions 'bcrypt'
 Key: SPARK-40960
 URL: https://issues.apache.org/jira/browse/SPARK-40960
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel


This function can compute password hashing with encryption.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40959) Support data masking built-in function 'mask_default'

2022-10-28 Thread Daniel (Jira)
Daniel created SPARK-40959:
--

 Summary: Support data masking built-in function 'mask_default'
 Key: SPARK-40959
 URL: https://issues.apache.org/jira/browse/SPARK-40959
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel


This can be a simple masking function to set a default value for each type.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40958) Support data masking built-in function 'null'

2022-10-28 Thread Daniel (Jira)
Daniel created SPARK-40958:
--

 Summary: Support data masking built-in function 'null'
 Key: SPARK-40958
 URL: https://issues.apache.org/jira/browse/SPARK-40958
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel


This can be a simple function that returns a NULL value of the given input type.
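
As a small point of reference (a sketch, not from the ticket): a typed NULL can already be produced with a cast, which is essentially what such a built-in would wrap:

{code:scala}
// Sketch: typed NULLs via CAST; a 'null' masking function would presumably
// infer the type from its input column instead of requiring an explicit cast.
spark.sql("SELECT typeof(CAST(NULL AS INT)), typeof(CAST(NULL AS TIMESTAMP))").show()
{code}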



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40686) Support data masking and redacting built-in functions

2022-10-28 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel updated SPARK-40686:
---
Summary: Support data masking and redacting built-in functions  (was: 
Support data masking built-in functions)

> Support data masking and redacting built-in functions
> -
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking functions 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40957) Add in memory cache in HDFSMetadataLog

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625821#comment-17625821
 ] 

Apache Spark commented on SPARK-40957:
--

User 'jerrypeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38430

> Add in memory cache in HDFSMetadataLog
> --
>
> Key: SPARK-40957
> URL: https://issues.apache.org/jira/browse/SPARK-40957
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Every time entries in the offset log or commit log need to be accessed, we read 
> from disk, which is slow. A cache of recent entries can speed up reads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40957) Add in memory cache in HDFSMetadataLog

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40957:


Assignee: Apache Spark

> Add in memory cache in HDFSMetadataLog
> --
>
> Key: SPARK-40957
> URL: https://issues.apache.org/jira/browse/SPARK-40957
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Boyang Jerry Peng
>Assignee: Apache Spark
>Priority: Major
>
> Every time entries in the offset log or commit log need to be accessed, we read 
> from disk, which is slow. A cache of recent entries can speed up reads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40957) Add in memory cache in HDFSMetadataLog

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625820#comment-17625820
 ] 

Apache Spark commented on SPARK-40957:
--

User 'jerrypeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38430

> Add in memory cache in HDFSMetadataLog
> --
>
> Key: SPARK-40957
> URL: https://issues.apache.org/jira/browse/SPARK-40957
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Every time entries in the offset log or commit log need to be accessed, we read 
> from disk, which is slow. A cache of recent entries can speed up reads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40957) Add in memory cache in HDFSMetadataLog

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40957:


Assignee: (was: Apache Spark)

> Add in memory cache in HDFSMetadataLog
> --
>
> Key: SPARK-40957
> URL: https://issues.apache.org/jira/browse/SPARK-40957
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Every time entries in the offset log or commit log need to be accessed, we read 
> from disk, which is slow. A cache of recent entries can speed up reads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40686) Support data masking built-in functions

2022-10-28 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625819#comment-17625819
 ] 

Vinod KC commented on SPARK-40686:
--

[~dtenedor] , Yes, please add them here. Let us change the description and 
title of this parent Jira accordingly.

> Support data masking built-in functions
> ---
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking functions 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40957) Add in memory cache in HDFSMetadataLog

2022-10-28 Thread Boyang Jerry Peng (Jira)
Boyang Jerry Peng created SPARK-40957:
-

 Summary: Add in memory cache in HDFSMetadataLog
 Key: SPARK-40957
 URL: https://issues.apache.org/jira/browse/SPARK-40957
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Boyang Jerry Peng


Every time entries in the offset log or commit log need to be accessed, we read from 
disk, which is slow. A cache of recent entries can speed up reads.
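
A minimal sketch of the idea (class and method names here are made up, not taken from the PR): a small bounded cache of recent entries, keyed by batch id, sitting in front of the on-disk log:

{code:scala}
import scala.collection.mutable

// Sketch: keep the most recent N metadata-log entries in memory and fall back
// to the on-disk HDFSMetadataLog only on a cache miss.
class RecentBatchCache[T](maxEntries: Int) {
  private val cache = mutable.LinkedHashMap.empty[Long, T]

  def getOrLoad(batchId: Long)(loadFromDisk: => Option[T]): Option[T] = synchronized {
    cache.get(batchId).orElse {
      val loaded = loadFromDisk
      loaded.foreach { entry =>
        cache.put(batchId, entry)
        if (cache.size > maxEntries) cache.remove(cache.head._1) // evict the oldest entry
      }
      loaded
    }
  }
}
{code}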



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread chengyan fu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chengyan fu updated SPARK-40956:

Description: 
Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
the spark SQL, as the equivalent of dataframe overwrite command. 

 

For Example

```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an 
atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 

 

 

  was:
 
{code:java}
 {code}
Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
the spark SQL, as the equivalent of dataframe overwrite command. 

 

For Example

```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an 
atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 

 

 


> SQL Equivalent for Dataframe overwrite command
> --
>
> Key: SPARK-40956
> URL: https://issues.apache.org/jira/browse/SPARK-40956
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: chengyan fu
>Priority: Minor
>
> Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
> the spark SQL, as the equivalent of dataframe overwrite command. 
>  
> For Example
> ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in 
> an atomic operation, 1) delete rows with key = 3 and 2) insert rows from 
> table2 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread chengyan fu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chengyan fu updated SPARK-40956:

Description: 
 
{code:java}
 {code}
Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
the spark SQL, as the equivalent of dataframe overwrite command. 

 

For Example

```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an 
atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 

 

 

  was:
Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
the spark SQL, as the equivalent of dataframe overwrite command. 

For Example

```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an 
atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 

 

 


> SQL Equivalent for Dataframe overwrite command
> --
>
> Key: SPARK-40956
> URL: https://issues.apache.org/jira/browse/SPARK-40956
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: chengyan fu
>Priority: Minor
>
>  
> {code:java}
>  {code}
> Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
> the spark SQL, as the equivalent of dataframe overwrite command. 
>  
> For Example
> ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in 
> an atomic operation, 1) delete rows with key = 3 and 2) insert rows from 
> table2 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread chengyan fu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chengyan fu updated SPARK-40956:

Description: 
Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
the spark SQL, as the equivalent of dataframe overwrite command. 

For Example

```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an 
atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 

 

 

  was:
Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
the spark SQL, as the equivalent of dataframe overwrite command. 

For Example

```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will in an 
atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 

 

 


> SQL Equivalent for Dataframe overwrite command
> --
>
> Key: SPARK-40956
> URL: https://issues.apache.org/jira/browse/SPARK-40956
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: chengyan fu
>Priority: Minor
>
> Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
> the spark SQL, as the equivalent of dataframe overwrite command. 
> For Example
> ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in 
> an atomic operation, 1) delete rows with key = 3 and 2) insert rows from 
> table2 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40956) SQL Equivalent for Dataframe overwrite command

2022-10-28 Thread chengyan fu (Jira)
chengyan fu created SPARK-40956:
---

 Summary: SQL Equivalent for Dataframe overwrite command
 Key: SPARK-40956
 URL: https://issues.apache.org/jira/browse/SPARK-40956
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.2
Reporter: chengyan fu


Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to 
the spark SQL, as the equivalent of dataframe overwrite command. 

For Example

```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will in an 
atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption

2022-10-28 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625809#comment-17625809
 ] 

Nikhil Sharma edited comment on SPARK-34827 at 10/28/22 4:54 PM:
-

Thank you for sharing such good information. Very informative and effective 
post. 

+[https://www.igmguru.com/digital-marketing-programming/golang-training/]+


was (Author: JIRAUSER295436):
Thank you for sharing such good information. Very informative and effective 
post. 

[golang 
certification|{+}[https://www.igmguru.com/digital-marketing-programming/golang-training/]{+}]

> Support fetching shuffle blocks in batch with i/o encryption
> 
>
> Key: SPARK-34827
> URL: https://issues.apache.org/jira/browse/SPARK-34827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption

2022-10-28 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625809#comment-17625809
 ] 

Nikhil Sharma edited comment on SPARK-34827 at 10/28/22 4:53 PM:
-

Thank you for sharing such good information. Very informative and effective 
post. 

[golang 
certification|{+}[https://www.igmguru.com/digital-marketing-programming/golang-training/]{+}]


was (Author: JIRAUSER295436):
Thank you for sharing such good information. Very informative and effective 
post. 

[golang 
certification|+[https://www.igmguru.com/digital-marketing-programming/golang-training/]+]

> Support fetching shuffle blocks in batch with i/o encryption
> 
>
> Key: SPARK-34827
> URL: https://issues.apache.org/jira/browse/SPARK-34827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption

2022-10-28 Thread Nikhil Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625809#comment-17625809
 ] 

Nikhil Sharma commented on SPARK-34827:
---

Thank you for sharing such good information. Very informative and effective 
post. 

[golang 
certification|+[https://www.igmguru.com/digital-marketing-programming/golang-training/]+]

> Support fetching shuffle blocks in batch with i/o encryption
> 
>
> Key: SPARK-34827
> URL: https://issues.apache.org/jira/browse/SPARK-34827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40686) Support data masking built-in functions

2022-10-28 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625805#comment-17625805
 ] 

Daniel commented on SPARK-40686:


In addition to the functions listed in subtasks here so far, I was also 
considering the following similar functions. Should we think about 
adding them here too? (I know the original intent of this Jira was to reach 
parity with Hive.)

||Name||Description||
|bcrypt|Computes password hashing with encryption|
|diff_privacy|Computes the Laplace distribution for integer values|
|fnv_hash|Fowler–Noll–Vo hash function|
|mask_default|Masking function to set a default value for each type|
|null|Returns a NULL value of the given input type|
|phi_age|HIPAA compliant age display|
|phi_date|HIPAA compliant date display|
|phi_dob|HIPAA compliant date of birth display|
|phi_zip3|HIPAA compatible ZIP code display|
|random_ccn|Computes a random credit card number|
|tokenize|Computes a format-preserving encoding using NIST compatible FPE FF1|
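
For context, a rough sketch of how some of this can be approximated today with existing built-ins (column and table names are made up; the names in the table above are proposals, not shipped functions):

{code:scala}
// Sketch using existing Spark SQL functions only: one-way hashing, a crude
// digit mask, and a typed NULL -- the patterns the proposed built-ins would
// make first-class.
spark.sql("""
  SELECT
    sha2(ssn, 256)                      AS ssn_hashed,
    regexp_replace(phone, '[0-9]', 'X') AS phone_masked,
    CAST(NULL AS STRING)                AS email_redacted
  FROM customers
""")
{code}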

> Support data masking built-in functions
> ---
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking functions 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40686) Support data masking built-in functions

2022-10-28 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625803#comment-17625803
 ] 

Daniel commented on SPARK-40686:


I closed https://issues.apache.org/jira/browse/SPARK-40623 as a duplicate of 
this Jira.

> Support data masking built-in functions
> ---
>
> Key: SPARK-40686
> URL: https://issues.apache.org/jira/browse/SPARK-40686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Support built-in data masking functions 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40623) Add new SQL built-in functions to help with redacting data

2022-10-28 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel resolved SPARK-40623.

Resolution: Duplicate

closing as a dup of https://issues.apache.org/jira/browse/SPARK-40686

> Add new SQL built-in functions to help with redacting data
> --
>
> Key: SPARK-40623
> URL: https://issues.apache.org/jira/browse/SPARK-40623
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>
> This issue tracks building new scalar SQL functions into Spark for purposes 
> of redacting sensitive information from fields. These can be useful for 
> creating copies of tables with sensitive information removed, but retaining 
> the same schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40652) Add MASK_PHONE and TRY_MASK_PHONE functions

2022-10-28 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625800#comment-17625800
 ] 

Daniel commented on SPARK-40652:


I am going to close this as there is a duplicate effort underway in 
https://issues.apache.org/jira/browse/SPARK-40686.

> Add MASK_PHONE and TRY_MASK_PHONE functions
> ---
>
> Key: SPARK-40652
> URL: https://issues.apache.org/jira/browse/SPARK-40652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40955) allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter

2022-10-28 Thread RJ Marcus (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RJ Marcus updated SPARK-40955:
--
Description: 
{+}overall{+}: 


Allow FileScanBuilder to push `Predicate` instead of `Filter` for data filters 
being pushed down to source. This would allow new (arbitrary) DS V2 Predicates 
to be pushed down to the file source. 




Hello spark developers,

Thank you in advance for reading. Please excuse me if I make mistakes; this is 
my first time working on apache/spark internals. I am asking these questions to 
better understand whether my proposed changes fall within the intended scope of 
Data Source V2 API functionality.


+Motivation / Background:+

I am working on a branch in 
[apache/incubator-sedona|https://github.com/apache/incubator-sedona] to extend 
its support of geoparquet files to include predicate pushdown of postGIS style 
spatial predicates (e.g. `ST_Contains()`) that can take advantage of spatial 
info in file metadata. We would like to inherit as much as possible from the 
Parquet classes (because geoparquet basically just adds a binary geometry 
column). However, {{FileScanBuilder.scala}} appears to be missing some 
functionality I need for DSV2 {{{}Predicates{}}}.


+My understanding of the problem so far:+

The ST_* {{Expression}} must be detected as a pushable predicate 
(ParquetScanBuilder.scala:71) and passed as a {{pushedDataFilter}} to the 
{{parquetPartitionReaderFactory}} where it will be translated into a (user 
defined) {{{}FilterPredicate{}}}.

The [Filter class is 
sealed|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala]
 so the sedona package can’t define new Filters; DSV2 Predicate appears to be 
the preferred method for accomplishing this task (referred to as “V2 Filter”, 
SPARK-39966). However, `pushedDataFilters` in FileScanBuilder.scala is of type 
{{{}sources.Filter{}}}.

Some recent work (SPARK-39139) added the ability to detect user defined 
functions in  {{DataSourceV2Strategy.translateFilterV2() > 
V2ExpressionBuilder.generateExpression()}}  , which I think could accomplish 
detection correctly if {{FileScanBuilder}} called 
{{DataSourceV2Strategy.translateFilterV2()}} instead of 
{{{}DataSourceStrategy.translateFilter(){}}}.

However, changing {{FileScanBuilder}} to use {{Predicate}} instead of 
{{Filter}} would require many changes to all file based data sources. I don’t 
want to spend effort making sweeping changes if the current behavior of Spark 
is intentional.

 

+Concluding Questions:+

Should {{FileScanBuilder}} be pushing {{Predicate}} instead of {{Filter}} for 
data filters being pushed down to source? Or maybe in a FileScanBuilderV2?

If not, how can a developer of a data source push down a new (or user defined) 
predicate to the file source?

Thank you again for reading. Pending feedback, I will start working on a PR for 
this functionality.



[~beliefer] [~cloud_fan] [~huaxingao]   have worked on DSV2 related spark 
issues and I welcome your input. Please ignore this if I "@" you incorrectly.

  was:
{+}overall{+}: 
Allow FileScanBuilder to push `Predicate` instead of `Filter` for data filters 
being pushed down to source. This would allow new (arbitrary) DS V2 Predicates 
to be pushed down to the file source. 




Hello spark developers,

Thank you in advance for reading. Please excuse me if I make mistakes; this is 
my first time working on apache/spark internals. I am asking these questions to 
better understand whether my proposed changes fall within the intended scope of 
Data Source V2 API functionality.

+
Motivation / Background:+

I am working on a branch in 
[apache/incubator-sedona|https://github.com/apache/incubator-sedona] to extend 
its support of geoparquet files to include predicate pushdown of postGIS style 
spatial predicates (e.g. `ST_Contains()`) that can take advantage of spatial 
info in file metadata. We would like to inherit as much as possible from the 
Parquet classes (because geoparquet basically just adds a binary geometry 
column). However, {{FileScanBuilder.scala}} appears to be missing some 
functionality I need for DSV2 {{{}Predicates{}}}.

+
My understanding of the problem so far:+

The ST_* {{Expression}} must be detected as a pushable predicate 
(ParquetScanBuilder.scala:71) and passed as a {{pushedDataFilter}} to the 
{{parquetPartitionReaderFactory}} where it will be translated into a (user 
defined) {{{}FilterPredicate{}}}.

The [Filter class is 
sealed|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala]
 so the sedona package can’t define new Filters; DSV2 Predicate appears to be 
the preferred method for accomplishing this task (referred to as “V2 Filter”, 
SPARK-39966). However, `pushedDataFilters` in FileScanBuilder.scala is of type 
{{{}sources.Filter{}}}.

Some recent wo

[jira] [Created] (SPARK-40955) allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter

2022-10-28 Thread RJ Marcus (Jira)
RJ Marcus created SPARK-40955:
-

 Summary: allow DSV2 Predicate pushdown in 
FileScanBuilder.pushedDataFilter 
 Key: SPARK-40955
 URL: https://issues.apache.org/jira/browse/SPARK-40955
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, SQL
Affects Versions: 3.3.1
Reporter: RJ Marcus


{+}overall{+}: 
Allow FileScanBuilder to push `Predicate` instead of `Filter` for data filters 
being pushed down to source. This would allow new (arbitrary) DS V2 Predicates 
to be pushed down to the file source. 




Hello spark developers,

Thank you in advance for reading. Please excuse me if I make mistakes; this is 
my first time working on apache/spark internals. I am asking these questions to 
better understand whether my proposed changes fall within the intended scope of 
Data Source V2 API functionality.

+Motivation / Background:+

I am working on a branch in 
[apache/incubator-sedona|https://github.com/apache/incubator-sedona] to extend 
its support of geoparquet files to include predicate pushdown of postGIS style 
spatial predicates (e.g. `ST_Contains()`) that can take advantage of spatial 
info in file metadata. We would like to inherit as much as possible from the 
Parquet classes (because geoparquet basically just adds a binary geometry 
column). However, {{FileScanBuilder.scala}} appears to be missing some 
functionality I need for DSV2 {{{}Predicates{}}}.

+My understanding of the problem so far:+

The ST_* {{Expression}} must be detected as a pushable predicate 
(ParquetScanBuilder.scala:71) and passed as a {{pushedDataFilter}} to the 
{{parquetPartitionReaderFactory}} where it will be translated into a (user 
defined) {{{}FilterPredicate{}}}.

The [Filter class is 
sealed|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala]
 so the sedona package can’t define new Filters; DSV2 Predicate appears to be 
the preferred method for accomplishing this task (referred to as “V2 Filter”, 
SPARK-39966). However, `pushedDataFilters` in FileScanBuilder.scala is of type 
{{{}sources.Filter{}}}.

Some recent work (SPARK-39139) added the ability to detect user defined 
functions in  {{DataSourceV2Strategy.translateFilterV2() > 
V2ExpressionBuilder.generateExpression()}}  , which I think could accomplish 
detection correctly if {{FileScanBuilder}} called 
{{DataSourceV2Strategy.translateFilterV2()}} instead of 
{{{}DataSourceStrategy.translateFilter(){}}}.

However, changing {{FileScanBuilder}} to use {{Predicate}} instead of 
{{Filter}} would require many changes to all file based data sources. I don’t 
want to spend effort making sweeping changes if the current behavior of Spark 
is intentional.

+Concluding Questions:+

Should {{FileScanBuilder}} be pushing {{Predicate}} instead of {{Filter}} for 
data filters being pushed down to source? Or maybe in a FileScanBuilderV2?

If not, how can a developer of a data source push down a new (or user defined) 
predicate to the file source?

Thank you again for reading. Pending feedback, I will start working on a PR for 
this functionality.


[~beliefer] [~cloud_fan] [~huaxingao]   have worked on DSV2 related spark 
issues and I welcome your input. Please ignore this if I "@" you incorrectly.
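
To make the two translation paths mentioned above concrete, a small sketch (treat the exact helper signatures as assumptions based on branch-3.3; the column name is made up):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.DataSourceStrategy
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy
import org.apache.spark.sql.functions.col

// Stand-in Catalyst expression; in the Sedona case this would be an ST_* predicate.
val expr: Expression = (col("geom_col") === 1).expr

// Path FileScanBuilder uses today: Catalyst Expression -> sources.Filter.
// The Filter hierarchy is sealed, so an external ST_Contains cannot become one.
val asV1Filter = DataSourceStrategy.translateFilter(expr, supportNestedPredicatePushdown = true)

// Path the ticket asks about: Catalyst Expression -> connector Predicate
// ("V2 Filter"), which SPARK-39139 extended to cover user-defined functions.
val asV2Predicate = DataSourceV2Strategy.translateFilterV2(expr)
{code}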



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40954) Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker

2022-10-28 Thread Anton Ippolitov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Ippolitov updated SPARK-40954:

Description: 
h2. Description

I tried running Kubernetes integration tests with the Minikube backend (+ 
Docker driver) from commit c26d99e3f104f6603e0849d82eca03e28f196551 on Spark's 
master branch. I ran them with the following command:

 
{code:java}
mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \
-Pkubernetes -Pkubernetes-integration-tests \
-Phadoop-3 \
-Dspark.kubernetes.test.imageTag=MY_IMAGE_TAG_HERE \
-Dspark.kubernetes.test.imageRepo=docker.io/kubespark \
-Dspark.kubernetes.test.namespace=spark \
-Dspark.kubernetes.test.serviceAccountName=spark \
-Dspark.kubernetes.test.deployMode=minikube  {code}
However the test suite got stuck literally for hours on my machine. 

 
h2. Investigation

I ran {{jstack}} on the process that was running the tests and saw that it was 
stuck here:

 
{noformat}
"ScalaTest-main-running-KubernetesSuite" #1 prio=5 os_prio=31 
tid=0x7f78d580b800 nid=0x2503 runnable [0x000304749000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:255)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    - locked <0x00076c0b6f40> (a 
java.lang.UNIXProcess$ProcessPipeInputStream)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    - locked <0x00076c0bb410> (a java.io.InputStreamReader)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    - locked <0x00076c0bb410> (a java.io.InputStreamReader)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at 
org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2(ProcessUtils.scala:45)
    at 
org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2$adapted(ProcessUtils.scala:45)
    at 
org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$$$Lambda$322/20156341.apply(Unknown
 Source)
    at 
org.apache.spark.deploy.k8s.integrationtest.Utils$.tryWithResource(Utils.scala:49)
    at 
org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.executeProcess(ProcessUtils.scala:45)
    at 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.executeMinikube(Minikube.scala:103)
    at 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.minikubeServiceAction(Minikube.scala:112)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$getServiceUrl$1(DepsTestsSuite.scala:281)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$611/1461360262.apply(Unknown
 Source)
    at 
org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:184)
    at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:196)
    at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
    at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
    at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
    at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.getServiceUrl(DepsTestsSuite.scala:278)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.tryDepsTest(DepsTestsSuite.scala:325)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$$init$$1(DepsTestsSuite.scala:160)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$178/1750286943.apply$mcV$sp(Unknown
 Source)
[...]{noformat}
 So the issue is coming from {{DepsTestsSuite}} when it is setting up 
{{{}minio{}}}. After [creating the minio StatefulSet and 
Service|https://github.com/apache/spark/blob/5ea2b386eb866e20540660cdb6ed43792cb29969/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala#L85],
 it 
[executes|https://github.com/apache/spark/blob/5ea2b386eb866e20540660cdb6ed43792cb29969/resource-managers/kubernetes/integration-te
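
A standalone sketch of the blocking pattern shown in the jstack output above (this is not the reporter's TestProcess.scala attachment; the service name and namespace below are illustrative). With the Docker driver on macOS, {{minikube service --url}} keeps a tunnel open and does not exit, so a line-by-line read of its stdout never finishes:
{code:java}
// Standalone sketch: reading a child process' stdout with getLines() only
// finishes when the process closes its output. Because `minikube service --url`
// keeps running under the Docker driver on macOS, the foreach below hangs in
// the same way as the stack trace above. Arguments are illustrative.
import scala.io.Source

object MinikubeServiceReadHang {
  def main(args: Array[String]): Unit = {
    val process = new ProcessBuilder(
      "minikube", "service", "minio-s3", "--url", "-n", "spark")
      .redirectErrorStream(true)
      .start()
    // Hangs here while minikube keeps the tunnel (and its stdout) open.
    Source.fromInputStream(process.getInputStream).getLines().foreach(println)
    println(s"minikube exited with ${process.waitFor()}") // never reached
  }
}
{code}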

[jira] [Updated] (SPARK-40954) Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker

2022-10-28 Thread Anton Ippolitov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Ippolitov updated SPARK-40954:

Attachment: TestProcess.scala

> Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker
> ---
>
> Key: SPARK-40954
> URL: https://issues.apache.org/jira/browse/SPARK-40954
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.1
> Environment: MacOS 12.6 (Mac M1)
> Minikube 1.27.1
> Docker 20.10.17
>Reporter: Anton Ippolitov
>Priority: Minor
> Attachments: TestProcess.scala
>
>
> h2. Description
> I tried running Kubernetes integration tests with the Minikube backend (+ 
> Docker driver) from commit c26d99e3f104f6603e0849d82eca03e28f196551 on 
> Spark's master branch. I ran them with the following command:
>  
> {code:java}
> mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \
> -Pkubernetes -Pkubernetes-integration-tests \
> -Phadoop-3 \
> -Dspark.kubernetes.test.imageTag=MY_IMAGE_TAG_HERE \
> -Dspark.kubernetes.test.imageRepo=docker.io/kubespark 
> \
> -Dspark.kubernetes.test.namespace=spark \
> -Dspark.kubernetes.test.serviceAccountName=spark \
> -Dspark.kubernetes.test.deployMode=minikube  {code}
> However, the test suite got stuck literally for hours on my machine. 
>  
> h2. Investigation
> I ran {{jstack}} on the process that was running the tests and saw that it 
> was stuck here:
>  
> {noformat}
> "ScalaTest-main-running-KubernetesSuite" #1 prio=5 os_prio=31 
> tid=0x7f78d580b800 nid=0x2503 runnable [0x000304749000]
>    java.lang.Thread.State: RUNNABLE
>     at java.io.FileInputStream.readBytes(Native Method)
>     at java.io.FileInputStream.read(FileInputStream.java:255)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     - locked <0x00076c0b6f40> (a 
> java.lang.UNIXProcess$ProcessPipeInputStream)
>     at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>     - locked <0x00076c0bb410> (a java.io.InputStreamReader)
>     at java.io.InputStreamReader.read(InputStreamReader.java:184)
>     at java.io.BufferedReader.fill(BufferedReader.java:161)
>     at java.io.BufferedReader.readLine(BufferedReader.java:324)
>     - locked <0x00076c0bb410> (a java.io.InputStreamReader)
>     at java.io.BufferedReader.readLine(BufferedReader.java:389)
>     at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2(ProcessUtils.scala:45)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2$adapted(ProcessUtils.scala:45)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$$$Lambda$322/20156341.apply(Unknown
>  Source)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.Utils$.tryWithResource(Utils.scala:49)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.executeProcess(ProcessUtils.scala:45)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.executeMinikube(Minikube.scala:103)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.minikubeServiceAction(Minikube.scala:112)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$getServiceUrl$1(DepsTestsSuite.scala:281)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$611/1461360262.apply(Unknown
>  Source)
>     at 
> org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:184)
>     at 
> org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:196)
>     at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
>     at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
>     at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
>     at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.getServiceUrl(DepsTestsSuite.scala:278)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.tryDepsTest(DepsT

[jira] [Created] (SPARK-40954) Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker

2022-10-28 Thread Anton Ippolitov (Jira)
Anton Ippolitov created SPARK-40954:
---

 Summary: Kubernetes integration tests stuck forever on Mac M1 with 
Minikube + Docker
 Key: SPARK-40954
 URL: https://issues.apache.org/jira/browse/SPARK-40954
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.3.1
 Environment: MacOS 12.6 (Mac M1)

Minikube 1.27.1

Docker 20.10.17
Reporter: Anton Ippolitov


h2. Description

I tried running Kubernetes integration tests with the Minikube backend (+ 
Docker driver) from commit c26d99e3f104f6603e0849d82eca03e28f196551 on Spark's 
master branch. I ran them with the following command:

 
{code:java}
mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \
-Pkubernetes -Pkubernetes-integration-tests \
-Phadoop-3 \
-Dspark.kubernetes.test.imageTag=MY_IMAGE_TAG_HERE \
-Dspark.kubernetes.test.imageRepo=docker.io/kubespark \
-Dspark.kubernetes.test.namespace=spark \
-Dspark.kubernetes.test.serviceAccountName=spark \
-Dspark.kubernetes.test.deployMode=minikube  {code}
However, the test suite got stuck literally for hours on my machine. 

 
h2. Investigation

I ran {{jstack}} on the process that was running the tests and saw that it was 
stuck here:

 
{noformat}
"ScalaTest-main-running-KubernetesSuite" #1 prio=5 os_prio=31 
tid=0x7f78d580b800 nid=0x2503 runnable [0x000304749000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:255)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    - locked <0x00076c0b6f40> (a 
java.lang.UNIXProcess$ProcessPipeInputStream)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    - locked <0x00076c0bb410> (a java.io.InputStreamReader)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    - locked <0x00076c0bb410> (a java.io.InputStreamReader)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at 
org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2(ProcessUtils.scala:45)
    at 
org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2$adapted(ProcessUtils.scala:45)
    at 
org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$$$Lambda$322/20156341.apply(Unknown
 Source)
    at 
org.apache.spark.deploy.k8s.integrationtest.Utils$.tryWithResource(Utils.scala:49)
    at 
org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.executeProcess(ProcessUtils.scala:45)
    at 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.executeMinikube(Minikube.scala:103)
    at 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.minikubeServiceAction(Minikube.scala:112)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$getServiceUrl$1(DepsTestsSuite.scala:281)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$611/1461360262.apply(Unknown
 Source)
    at 
org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:184)
    at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:196)
    at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
    at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
    at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
    at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.getServiceUrl(DepsTestsSuite.scala:278)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.tryDepsTest(DepsTestsSuite.scala:325)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$$init$$1(DepsTestsSuite.scala:160)
    at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$178/1750286943.apply$mcV$sp(Unknown
 Source)
[...]{noformat}
 So the issue is coming from {{DepsTestsSuite}} when it is setting up 
{{{}minio{}}}. After [creating the minio StatefulSet and 
Service|https://github.com/apache/spark/blob/5ea2b386eb8

[jira] [Commented] (SPARK-40800) Always inline expressions in OptimizeOneRowRelationSubquery

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625783#comment-17625783
 ] 

Apache Spark commented on SPARK-40800:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/38429

> Always inline expressions in OptimizeOneRowRelationSubquery
> ---
>
> Key: SPARK-40800
> URL: https://issues.apache.org/jira/browse/SPARK-40800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> SPARK-39699 made `CollapseProject` more conservative. This has impacted 
> correlated subqueries that Spark used to be able to support. For example, a 
> correlated one-row scalar subquery that has a higher-order function:
> {code:java}
> CREATE TEMP VIEW t1 AS SELECT ARRAY('a', 'b') a 
> SELECT (
>   SELECT array_sort(a, (i, j) -> rank[i] - rank[j]) AS sorted
>   FROM (SELECT MAP('a', 1, 'b', 2) rank)
> ) FROM t1{code}
> This will throw an exception after SPARK-39699:
> {code:java}
> Unexpected operator Join Inner
> :- Aggregate [[a,b]], [[a,b] AS a#252]
> :  +- OneRowRelation
> +- Project [map(keys: [a,b], values: [1,2]) AS rank#241]
>    +- OneRowRelation
>  in correlated subquery{code}
> because the projects inside the subquery can no longer be collapsed. We 
> should always inline expressions if possible to avoid adding domain joins and 
> support a wider range of correlated subqueries. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40800) Always inline expressions in OptimizeOneRowRelationSubquery

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625782#comment-17625782
 ] 

Apache Spark commented on SPARK-40800:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/38429

> Always inline expressions in OptimizeOneRowRelationSubquery
> ---
>
> Key: SPARK-40800
> URL: https://issues.apache.org/jira/browse/SPARK-40800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> SPARK-39699 made `CollapseProject` more conservative. This has impacted 
> correlated subqueries that Spark used to be able to support. For example, a 
> correlated one-row scalar subquery that has a higher-order function:
> {code:java}
> CREATE TEMP VIEW t1 AS SELECT ARRAY('a', 'b') a 
> SELECT (
>   SELECT array_sort(a, (i, j) -> rank[i] - rank[j]) AS sorted
>   FROM (SELECT MAP('a', 1, 'b', 2) rank)
> ) FROM t1{code}
> This will throw an exception after SPARK-39699:
> {code:java}
> Unexpected operator Join Inner
> :- Aggregate [[a,b]], [[a,b] AS a#252]
> :  +- OneRowRelation
> +- Project [map(keys: [a,b], values: [1,2]) AS rank#241]
>    +- OneRowRelation
>  in correlated subquery{code}
> because the projects inside the subquery can no longer be collapsed. We 
> should always inline expressions if possible to avoid adding domain joins and 
> support a wider range of correlated subqueries. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625712#comment-17625712
 ] 

Apache Spark commented on SPARK-40950:
--

User 'eejbyfeldt' has created a pull request for this issue:
https://github.com/apache/spark/pull/38427

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> --
>
> Key: SPARK-40950
> URL: https://issues.apache.org/jira/browse/SPARK-40950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Emil Ejbyfeldt
>Priority: Major
>
> On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, 
> while in 2.12 they would be backed by an ArrayBuffer. This means that 
> calculating the length of the blocks, which is done in isRemoteAddressMaxedOut 
> and other places, is now much more expensive. This is because in 2.13 `Seq` 
> can no longer be backed by a mutable collection.
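
A quick, self-contained sketch of the cost difference being described (not Spark code; sizes and iteration counts are arbitrary):
{code:java}
// Self-contained sketch: length/size on a List-backed Seq walks the whole list
// (O(n)), while an ArrayBuffer-backed one answers in O(1). collection.Seq is
// used so the same code compiles on both 2.12 and 2.13.
import scala.collection.mutable.ArrayBuffer

object SeqLengthCost {
  def main(args: Array[String]): Unit = {
    val listBacked: scala.collection.Seq[Int] = List.fill(100000)(1)
    val bufferBacked: scala.collection.Seq[Int] = ArrayBuffer.fill(100000)(1)

    def time(label: String)(body: => Unit): Unit = {
      val start = System.nanoTime()
      body
      println(f"$label%-20s ${(System.nanoTime() - start) / 1e6}%.1f ms")
    }

    // Calling .size once per request, as the isRemoteAddressMaxedOut check
    // effectively does, becomes linear per call when the Seq is a List.
    time("List-backed")(for (_ <- 1 to 1000) listBacked.size)
    time("ArrayBuffer-backed")(for (_ <- 1 to 1000) bufferBacked.size)
  }
}
{code}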



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40950:


Assignee: (was: Apache Spark)

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> --
>
> Key: SPARK-40950
> URL: https://issues.apache.org/jira/browse/SPARK-40950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Emil Ejbyfeldt
>Priority: Major
>
> On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, 
> while in 2.12 they would be backed by an ArrayBuffer. This means that 
> calculating the length of the blocks, which is done in isRemoteAddressMaxedOut 
> and other places, is now much more expensive. This is because in 2.13 `Seq` 
> can no longer be backed by a mutable collection.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625710#comment-17625710
 ] 

Apache Spark commented on SPARK-40950:
--

User 'eejbyfeldt' has created a pull request for this issue:
https://github.com/apache/spark/pull/38427

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> --
>
> Key: SPARK-40950
> URL: https://issues.apache.org/jira/browse/SPARK-40950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Emil Ejbyfeldt
>Priority: Major
>
> On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, 
> while in 2.12 they would be backed by an ArrayBuffer. This means that 
> calculating the length of the blocks, which is done in isRemoteAddressMaxedOut 
> and other places, is now much more expensive. This is because in 2.13 `Seq` 
> can no longer be backed by a mutable collection.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40950:


Assignee: Apache Spark

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> --
>
> Key: SPARK-40950
> URL: https://issues.apache.org/jira/browse/SPARK-40950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Emil Ejbyfeldt
>Assignee: Apache Spark
>Priority: Major
>
> On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, 
> while in 2.12 they would be backed by an ArrayBuffer. This means that 
> calculating the length of the blocks, which is done in isRemoteAddressMaxedOut 
> and other places, is now much more expensive. This is because in 2.13 `Seq` 
> can no longer be backed by a mutable collection.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40932) Barrier: messages for allGather will be overridden by the following barrier APIs

2022-10-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40932.
-
Fix Version/s: 3.3.2
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 38410
[https://github.com/apache/spark/pull/38410]

> Barrier: messages for allGather will be overridden by the following barrier 
> APIs
> 
>
> Key: SPARK-40932
> URL: https://issues.apache.org/jira/browse/SPARK-40932
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Bobby Wang
>Assignee: Bobby Wang
>Priority: Critical
> Fix For: 3.3.2, 3.4.0
>
>
> While working on an internal project (which has not been open sourced), I 
> found a bug: the messages returned by Barrier.allGather may be overridden by 
> subsequent barrier API calls, which means the user can't get the correct 
> allGather messages.
>  
> This issue can easily be reproduced with the following unit test.
>  
>  
> {code:java}
> test("SPARK-XXX, messages of allGather should not been overridden " +
>   "by the following barrier APIs") {
>   sc = new SparkContext(new 
> SparkConf().setAppName("test").setMaster("local[2]"))
>   sc.setLogLevel("INFO")
>   val rdd = sc.makeRDD(1 to 10, 2)
>   val rdd2 = rdd.barrier().mapPartitions { it =>
> val context = BarrierTaskContext.get()
> // Sleep for a random time before global sync.
> Thread.sleep(Random.nextInt(1000))
> // Pass partitionId message in
> val message: String = context.partitionId().toString
> val messages: Array[String] = context.allGather(message)
> context.barrier()
> Iterator.single(messages.toList)
>   }
>   val messages = rdd2.collect()
>   // All the task partitionIds are shared across all tasks
>   assert(messages.length === 2)
>   messages.foreach(m => println("--- " + m))
>   assert(messages.forall(_ == List("0", "1")))
> } {code}
>  
>  
> Before the assertion (assert(messages.forall(_ == List("0", "1")))) fails, 
> the printed output is 
>  
> {code:java}
> --- List(, )
> --- List(, ) {code}
>  
>  
> As you can see, the messages are empty because they have been overridden by 
> the context.barrier() call.
>  
> Below is the spark log,
>  
> _22/10/27 17:03:50.236 Executor task launch worker for task 0.0 in stage 0.0 
> (TID 1) INFO Executor: Running task 0.0 in stage 0.0 (TID 1)_
> _22/10/27 17:03:50.236 Executor task launch worker for task 1.0 in stage 0.0 
> (TID 0) INFO Executor: Running task 1.0 in stage 0.0 (TID 0)_
> _22/10/27 17:03:50.949 Executor task launch worker for task 0.0 in stage 0.0 
> (TID 1) INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered 
> the global sync, current barrier epoch is 0._
> _22/10/27 17:03:50.964 dispatcher-event-loop-1 INFO BarrierCoordinator: 
> Current barrier epoch for Stage 0 (Attempt 0) is 0._
> _22/10/27 17:03:50.966 dispatcher-event-loop-1 INFO BarrierCoordinator: 
> Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 1, 
> current progress: 1/2._
> _22/10/27 17:03:51.436 Executor task launch worker for task 1.0 in stage 0.0 
> (TID 0) INFO BarrierTaskContext: Task 0 from Stage 0(Attempt 0) has entered 
> the global sync, current barrier epoch is 0._
> _22/10/27 17:03:51.437 dispatcher-event-loop-0 INFO BarrierCoordinator: 
> Current barrier epoch for Stage 0 (Attempt 0) is 0._
> _22/10/27 17:03:51.437 dispatcher-event-loop-0 INFO BarrierCoordinator: 
> Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 0, 
> current progress: 2/2._
> _22/10/27 17:03:51.440 dispatcher-event-loop-0 INFO BarrierCoordinator: 
> Barrier sync epoch 0 from Stage 0 (Attempt 0) received all updates from 
> tasks, finished successfully._
> _22/10/27 17:03:51.958 Executor task launch worker for task 0.0 in stage 0.0 
> (TID 1) INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) finished 
> global sync successfully, waited for 1 seconds, current barrier epoch is 1._
> _22/10/27 17:03:51.959 Executor task launch worker for task 0.0 in stage 0.0 
> (TID 1) INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered 
> the global sync, current barrier epoch is 1._
> _22/10/27 17:03:51.960 dispatcher-event-loop-1 INFO BarrierCoordinator: 
> Current barrier epoch for Stage 0 (Attempt 0) is 1._
> _22/10/27 17:03:51.960 dispatcher-event-loop-1 INFO BarrierCoordinator: 
> Barrier sync epoch 1 from Stage 0 (Attempt 0) received update from Task 1, 
> current progress: 1/2._
> _22/10/27 17:03:52.437 Executor task launch worker for task 1.0 in stage 0.0 
> (TID 0) INFO BarrierTaskContext: Task 0 from Stage 0(Attempt 0) finished 
> global sync successfully, waited for 1 seconds, current barrier epoch is 1._
> _22/10/27 17:03:52.438 Executor task 

[jira] [Assigned] (SPARK-40932) Barrier: messages for allGather will be overridden by the following barrier APIs

2022-10-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40932:
---

Assignee: Bobby Wang

> Barrier: messages for allGather will be overridden by the following barrier 
> APIs
> 
>
> Key: SPARK-40932
> URL: https://issues.apache.org/jira/browse/SPARK-40932
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Bobby Wang
>Assignee: Bobby Wang
>Priority: Critical
>
> While working on an internal project (which has not been open sourced), I 
> found a bug: the messages returned by Barrier.allGather may be overridden by 
> subsequent barrier API calls, which means the user can't get the correct 
> allGather messages.
>  
> This issue can easily be reproduced with the following unit test.
>  
>  
> {code:java}
> test("SPARK-XXX, messages of allGather should not been overridden " +
>   "by the following barrier APIs") {
>   sc = new SparkContext(new 
> SparkConf().setAppName("test").setMaster("local[2]"))
>   sc.setLogLevel("INFO")
>   val rdd = sc.makeRDD(1 to 10, 2)
>   val rdd2 = rdd.barrier().mapPartitions { it =>
> val context = BarrierTaskContext.get()
> // Sleep for a random time before global sync.
> Thread.sleep(Random.nextInt(1000))
> // Pass partitionId message in
> val message: String = context.partitionId().toString
> val messages: Array[String] = context.allGather(message)
> context.barrier()
> Iterator.single(messages.toList)
>   }
>   val messages = rdd2.collect()
>   // All the task partitionIds are shared across all tasks
>   assert(messages.length === 2)
>   messages.foreach(m => println("--- " + m))
>   assert(messages.forall(_ == List("0", "1")))
> } {code}
>  
>  
> Before the assertion (assert(messages.forall(_ == List("0", "1")))) fails, 
> the printed output is 
>  
> {code:java}
> --- List(, )
> --- List(, ) {code}
>  
>  
> As you can see, the messages are empty because they have been overridden by 
> the context.barrier() call.
>  
> Below is the spark log,
>  
> _22/10/27 17:03:50.236 Executor task launch worker for task 0.0 in stage 0.0 
> (TID 1) INFO Executor: Running task 0.0 in stage 0.0 (TID 1)_
> _22/10/27 17:03:50.236 Executor task launch worker for task 1.0 in stage 0.0 
> (TID 0) INFO Executor: Running task 1.0 in stage 0.0 (TID 0)_
> _22/10/27 17:03:50.949 Executor task launch worker for task 0.0 in stage 0.0 
> (TID 1) INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered 
> the global sync, current barrier epoch is 0._
> _22/10/27 17:03:50.964 dispatcher-event-loop-1 INFO BarrierCoordinator: 
> Current barrier epoch for Stage 0 (Attempt 0) is 0._
> _22/10/27 17:03:50.966 dispatcher-event-loop-1 INFO BarrierCoordinator: 
> Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 1, 
> current progress: 1/2._
> _22/10/27 17:03:51.436 Executor task launch worker for task 1.0 in stage 0.0 
> (TID 0) INFO BarrierTaskContext: Task 0 from Stage 0(Attempt 0) has entered 
> the global sync, current barrier epoch is 0._
> _22/10/27 17:03:51.437 dispatcher-event-loop-0 INFO BarrierCoordinator: 
> Current barrier epoch for Stage 0 (Attempt 0) is 0._
> _22/10/27 17:03:51.437 dispatcher-event-loop-0 INFO BarrierCoordinator: 
> Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 0, 
> current progress: 2/2._
> _22/10/27 17:03:51.440 dispatcher-event-loop-0 INFO BarrierCoordinator: 
> Barrier sync epoch 0 from Stage 0 (Attempt 0) received all updates from 
> tasks, finished successfully._
> _22/10/27 17:03:51.958 Executor task launch worker for task 0.0 in stage 0.0 
> (TID 1) INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) finished 
> global sync successfully, waited for 1 seconds, current barrier epoch is 1._
> _22/10/27 17:03:51.959 Executor task launch worker for task 0.0 in stage 0.0 
> (TID 1) INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered 
> the global sync, current barrier epoch is 1._
> _22/10/27 17:03:51.960 dispatcher-event-loop-1 INFO BarrierCoordinator: 
> Current barrier epoch for Stage 0 (Attempt 0) is 1._
> _22/10/27 17:03:51.960 dispatcher-event-loop-1 INFO BarrierCoordinator: 
> Barrier sync epoch 1 from Stage 0 (Attempt 0) received update from Task 1, 
> current progress: 1/2._
> _22/10/27 17:03:52.437 Executor task launch worker for task 1.0 in stage 0.0 
> (TID 0) INFO BarrierTaskContext: Task 0 from Stage 0(Attempt 0) finished 
> global sync successfully, waited for 1 seconds, current barrier epoch is 1._
> _22/10/27 17:03:52.438 Executor task launch worker for task 1.0 in stage 0.0 
> (TID 0) INFO BarrierTaskContext: Task 0 from Stage 0(Attempt 0) has entered 
> the global sync, current barrier epoch is 1.

[jira] [Assigned] (SPARK-40912) Overhead of Exceptions in DeserializationStream

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40912:


Assignee: Apache Spark

> Overhead of Exceptions in DeserializationStream 
> 
>
> Key: SPARK-40912
> URL: https://issues.apache.org/jira/browse/SPARK-40912
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Emil Ejbyfeldt
>Assignee: Apache Spark
>Priority: Minor
>
> The DeserializationStream interface forces implementations to raise 
> EOFException to indicate that there is no more data. For 
> KryoDeserializationStream it is even worse: since the Kryo library does not 
> raise EOFException itself, we pay the price of two exceptions for each 
> stream. For large shuffles with lots of small streams this is a fairly large 
> overhead (a couple percent of CPU time has been observed). Relying on 
> exceptions is also less safe, since they might be raised for other reasons, 
> such as corrupt data, which currently causes data loss.
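
A small standalone sketch of the pattern being described, using plain java.io rather than the Spark classes:
{code:java}
// Standalone sketch (not the Spark classes): an EOFException-driven read loop.
// The stream offers no "is there more?" query, so every stream ends by
// constructing and unwinding at least one exception, and a corrupt stream is
// hard to distinguish from a clean end of data.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, EOFException,
  ObjectInputStream, ObjectOutputStream}

object EofDrivenRead {
  def readAll(in: ObjectInputStream): List[AnyRef] = {
    val out = List.newBuilder[AnyRef]
    try {
      while (true) out += in.readObject()
    } catch {
      case _: EOFException => () // end of data is signalled only by throwing
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    Seq("a", "b", "c").foreach(oos.writeObject)
    oos.close()

    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    println(readAll(ois)) // List(a, b, c) -- but only after one EOFException
    ois.close()
  }
}
{code}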



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40912) Overhead of Exceptions in DeserializationStream

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40912:


Assignee: (was: Apache Spark)

> Overhead of Exceptions in DeserializationStream 
> 
>
> Key: SPARK-40912
> URL: https://issues.apache.org/jira/browse/SPARK-40912
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Emil Ejbyfeldt
>Priority: Minor
>
> The DeserializationStream interface forces implementations to raise 
> EOFException to indicate that there is no more data. For 
> KryoDeserializationStream it is even worse: since the Kryo library does not 
> raise EOFException itself, we pay the price of two exceptions for each 
> stream. For large shuffles with lots of small streams this is a fairly large 
> overhead (a couple percent of CPU time has been observed). Relying on 
> exceptions is also less safe, since they might be raised for other reasons, 
> such as corrupt data, which currently causes data loss.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40912) Overhead of Exceptions in DeserializationStream

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625695#comment-17625695
 ] 

Apache Spark commented on SPARK-40912:
--

User 'eejbyfeldt' has created a pull request for this issue:
https://github.com/apache/spark/pull/38428

> Overhead of Exceptions in DeserializationStream 
> 
>
> Key: SPARK-40912
> URL: https://issues.apache.org/jira/browse/SPARK-40912
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Emil Ejbyfeldt
>Priority: Minor
>
> The DeserializationStream interface forces implementations to raise 
> EOFException to indicate that there is no more data. For 
> KryoDeserializationStream it is even worse: since the Kryo library does not 
> raise EOFException itself, we pay the price of two exceptions for each 
> stream. For large shuffles with lots of small streams this is a fairly large 
> overhead (a couple percent of CPU time has been observed). Relying on 
> exceptions is also less safe, since they might be raised for other reasons, 
> such as corrupt data, which currently causes data loss.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40952) Exception when handling timestamp data in PySpark Structured Streaming

2022-10-28 Thread Kai-Michael Roesner (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai-Michael Roesner updated SPARK-40952:

Description: 
I'm trying to process data that contains timestamps in PySpark "Structured 
Streaming" using the 
[foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
 option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} 
exception in {{\pyspark\sql\types.py}} at the
{noformat}
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
1000000)
{noformat}
statement.

I have boiled down my Spark job to the essentials:
{noformat}
from pyspark.sql import SparkSession

def handle_row(row):
  print(f'Processing: \{row}')

spark = (SparkSession.builder
  .appName('test.stream.tstmp.byrow')
  .getOrCreate())

data = (spark.readStream
  .option('delimiter', ',')
  .option('header', True)
  .schema('a integer, b string, c timestamp')
  .csv('data/test'))

query = (data.writeStream
  .foreach(handle_row)
  .start())

query.awaitTermination()
{noformat}
In the {{data/test}} folder I have one csv file:
{noformat}
a,b,c
1,x,1970-01-01 00:59:59.999
2,y,1999-12-31 23:59:59.999
3,z,2022-10-18 15:53:12.345
{noformat}
If I change the csv schema to {{'a integer, b string, c string'}} everything 
works fine and I get the expected output of
{noformat}
Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999')
Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999')
Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345')
{noformat}
Also, if I change the stream handling to 
[micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch]
 like so:
{noformat}
...
def handle_batch(df, epoch_id):
  print(f'Processing: \{df} - Epoch: \{epoch_id}')
...
query = (data.writeStream
  .foreachBatch(handle_batch)
  .start())
{noformat}
I get the expected output of
{noformat}
Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0
{noformat}

But "by row" handling should work with the row having the correct column data 
type of {{timestamp}}.

This issue also affects using the {{foreach}} sink in Structured Streaming with 
the Kafka data source, since Kafka events contain a timestamp.

  was:
I'm trying to process data that contains timestamps in PySpark "Structured 
Streaming" using the 
[foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
 option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} 
exception in {{\pyspark\sql\types.py}} at the
{noformat}
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
1000000)
{noformat}
statement.

I have boiled down my Spark job to the essentials:
{noformat}
from pyspark.sql import SparkSession

def handle_row(row):
  print(f'Processing: \{row}')

spark = (SparkSession.builder
  .appName('test.stream.tstmp.byrow')
  .getOrCreate())

data = (spark.readStream
  .option('delimiter', ',')
  .option('header', True)
  .schema('a integer, b string, c timestamp')
  .csv('data/test'))

query = (data.writeStream
  .foreach(handle_row)
  .start())

query.awaitTermination()
{noformat}
In the {{data/test}} folder I have one csv file:
{noformat}
a,b,c
1,x,1970-01-01 00:59:59.999
2,y,1999-12-31 23:59:59.999
3,z,2022-10-18 15:53:12.345
{noformat}
If I change the csv schema to {{'a integer, b string, c string'}} everything 
works fine and I get the expected output of
{noformat}
Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999')
Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999')
Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345')
{noformat}
Also, if I change the stream handling to 
[micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch]
 like so:
{noformat}
...
def handle_batch(df, epoch_id):
  print(f'Processing: \{df} - Epoch: \{epoch_id}')
...
query = (data.writeStream
  .foreachBatch(handle_batch)
  .start())
{noformat}
I get the expected output of
{noformat}
Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0
{noformat}
But "by row" handling should work with the row having the correct column data 
type of {{{}timestamp{}}}.


> Exception when handling timestamp data in PySpark Structured Streaming
> --
>
> Key: SPARK-40952
> URL: https://issues.apache.org/jira/browse/SPARK-40952
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming, Windows
>Affects Versions: 3.3.0
> Environment: OS: Windows 10 
>Reporter: Kai-Michael Roesner
>Priority: Minor
>
> I'm trying to process data that contains timestamps in PySpark "Structured 
> Streaming" using the 
> [foreach|https://spark.apache.org/docs/latest/structured-s

[jira] [Commented] (SPARK-40749) Migrate type check failures of generators onto error classes

2022-10-28 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625674#comment-17625674
 ] 

BingKun Pan commented on SPARK-40749:
-

I will work on it.

> Migrate type check failures of generators onto error classes
> 
>
> Key: SPARK-40749
> URL: https://issues.apache.org/jira/browse/SPARK-40749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the generator 
> expressions:
> 1. Stack (3): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L163-L170
> 2. ExplodeBase (1): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L299
> 3. Inline (1):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L441



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40953) Add missing `limit(n)` in DataFrame.head

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40953:


Assignee: (was: Apache Spark)

> Add missing `limit(n)` in DataFrame.head
> 
>
> Key: SPARK-40953
> URL: https://issues.apache.org/jira/browse/SPARK-40953
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40953) Add missing `limit(n)` in DataFrame.head

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625662#comment-17625662
 ] 

Apache Spark commented on SPARK-40953:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38424

> Add missing `limit(n)` in DataFrame.head
> 
>
> Key: SPARK-40953
> URL: https://issues.apache.org/jira/browse/SPARK-40953
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40953) Add missing `limit(n)` in DataFrame.head

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625664#comment-17625664
 ] 

Apache Spark commented on SPARK-40953:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38424

> Add missing `limit(n)` in DataFrame.head
> 
>
> Key: SPARK-40953
> URL: https://issues.apache.org/jira/browse/SPARK-40953
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40953) Add missing `limit(n)` in DataFrame.head

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40953:


Assignee: Apache Spark

> Add missing `limit(n)` in DataFrame.head
> 
>
> Key: SPARK-40953
> URL: https://issues.apache.org/jira/browse/SPARK-40953
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40953) Add missing `limit(n)` in DataFrame.head

2022-10-28 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40953:
-

 Summary: Add missing `limit(n)` in DataFrame.head
 Key: SPARK-40953
 URL: https://issues.apache.org/jira/browse/SPARK-40953
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40889) Check error classes in PlanResolutionSuite

2022-10-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40889.
--
Resolution: Fixed

Issue resolved by pull request 38421
[https://github.com/apache/spark/pull/38421]

> Check error classes in PlanResolutionSuite
> --
>
> Key: SPARK-40889
> URL: https://issues.apache.org/jira/browse/SPARK-40889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in PlanResolutionSuite by using checkError() instead of 
> assertUnsupported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40889) Check error classes in PlanResolutionSuite

2022-10-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40889:


Assignee: BingKun Pan

> Check error classes in PlanResolutionSuite
> --
>
> Key: SPARK-40889
> URL: https://issues.apache.org/jira/browse/SPARK-40889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Major
>  Labels: starter
> Fix For: 3.4.0
>
>
> Check error classes in PlanResolutionSuite by using checkError() instead of 
> assertUnsupported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40936) Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`

2022-10-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40936.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38413
[https://github.com/apache/spark/pull/38413]

> Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the 
> `SparkFunSuite#checkError`
> --
>
> Key: SPARK-40936
> URL: https://issues.apache.org/jira/browse/SPARK-40936
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40936) Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`

2022-10-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40936:


Assignee: Yang Jie

> Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the 
> `SparkFunSuite#checkError`
> --
>
> Key: SPARK-40936
> URL: https://issues.apache.org/jira/browse/SPARK-40936
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40952) Exception when handling timestamp data in PySpark Structured Streaming

2022-10-28 Thread Kai-Michael Roesner (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai-Michael Roesner updated SPARK-40952:

Description: 
I'm trying to process data that contains timestamps in PySpark "Structured 
Streaming" using the 
[foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
 option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} 
exception in {{\pyspark\sql\types.py}} at the
{noformat}
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
1000000)
{noformat}
statement.

I have boiled down my Spark job to the essentials:
{noformat}
from pyspark.sql import SparkSession

def handle_row(row):
  print(f'Processing: \{row}')

spark = (SparkSession.builder
  .appName('test.stream.tstmp.byrow')
  .getOrCreate())

data = (spark.readStream
  .option('delimiter', ',')
  .option('header', True)
  .schema('a integer, b string, c timestamp')
  .csv('data/test'))

query = (data.writeStream
  .foreach(handle_row)
  .start())

query.awaitTermination()
{noformat}
In the {{data/test}} folder I have one csv file:
{noformat}
a,b,c
1,x,1970-01-01 00:59:59.999
2,y,1999-12-31 23:59:59.999
3,z,2022-10-18 15:53:12.345
{noformat}
If I change the csv schema to {{'a integer, b string, c string'}} everything 
works fine and I get the expected output of
{noformat}
Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999')
Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999')
Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345')
{noformat}
Also, if I change the stream handling to 
[micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch]
 like so:
{noformat}
...
def handle_batch(df, epoch_id):
  print(f'Processing: \{df} - Epoch: \{epoch_id}')
...
query = (data.writeStream
  .foreachBatch(handle_batch)
  .start())
{noformat}
I get the expected output of
{noformat}
Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0
{noformat}
But "by row" handling should work with the row having the correct column data 
type of {{{}timestamp{}}}.

  was:
I'm trying to process data that contains timestamps in PySpark "Structured 
Streaming" using the 
[foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
 option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} 
exception in {{\pyspark\sql\types.py}} at the
{noformat}
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
1000000)
{noformat}
statement.

I have boiled down my Spark job to the essentials:
{noformat}
from pyspark.sql import SparkSession

def handle_row(row):
  print(f'Processing: \{row}')

spark = (SparkSession.builder
  .appName('test.stream.tstmp.byrow')
  .getOrCreate())

data = (spark.readStream
  .option('delimiter', ',')
  .option('header', True)
  .schema('a integer, b string, c timestamp')
  .csv('data/test'))

query = (data.writeStream
  .foreach(handle_row)
  .start())

query.awaitTermination()
{noformat}
In the `data/test` folder I have one csv file:
{noformat}
a,b,c
1,x,1970-01-01 00:59:59.999
2,y,1999-12-31 23:59:59.999
3,z,2022-10-18 15:53:12.345
{noformat}
If I change the csv schema to `'a integer, b string, c string'` everything 
works fine and I get the expected output of
{noformat}
Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999')
Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999')
Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345')
{noformat}
Also, if I change the stream handling to 
[micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch]
 like so:
{noformat}
...
def handle_batch(df, epoch_id):
  print(f'Processing: \{df} - Epoch: \{epoch_id}')
...
query = (data.writeStream
  .foreachBatch(handle_batch)
  .start())
{noformat}
I get the expected output of
{noformat}
Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0
{noformat}
But "by row" handling should work with the row having the correct column data 
type of `timestamp`.


> Exception when handling timestamp data in PySpark Structured Streaming
> --
>
> Key: SPARK-40952
> URL: https://issues.apache.org/jira/browse/SPARK-40952
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming, Windows
>Affects Versions: 3.3.0
> Environment: OS: Windows 10 
>Reporter: Kai-Michael Roesner
>Priority: Minor
>
> I'm trying to process data that contains timestamps in PySpark "Structured 
> Streaming" using the 
> [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
>  option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} 
> exception in {{\pyspark\sq

[jira] [Updated] (SPARK-40952) Exception when handling timestamp data in PySpark Structured Streaming

2022-10-28 Thread Kai-Michael Roesner (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai-Michael Roesner updated SPARK-40952:

Description: 
I'm trying to process data that contains timestamps in PySpark "Structured 
Streaming" using the 
[foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
 option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} 
exception in {{\pyspark\sql\types.py}} at the
{noformat}
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
1000000)
{noformat}
statement.

I have boiled down my Spark job to the essentials:
{noformat}
from pyspark.sql import SparkSession

def handle_row(row):
  print(f'Processing: \{row}')

spark = (SparkSession.builder
  .appName('test.stream.tstmp.byrow')
  .getOrCreate())

data = (spark.readStream
  .option('delimiter', ',')
  .option('header', True)
  .schema('a integer, b string, c timestamp')
  .csv('data/test'))

query = (data.writeStream
  .foreach(handle_row)
  .start())

query.awaitTermination()
{noformat}
In the `data/test` folder I have one csv file:
{noformat}
a,b,c
1,x,1970-01-01 00:59:59.999
2,y,1999-12-31 23:59:59.999
3,z,2022-10-18 15:53:12.345
{noformat}
If I change the csv schema to `'a integer, b string, c string'` everything 
works fine and I get the expected output of
{noformat}
Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999')
Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999')
Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345')
{noformat}
Also, if I change the stream handling to 
[micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch]
 like so:
{noformat}
...
def handle_batch(df, epoch_id):
  print(f'Processing: \{df} - Epoch: \{epoch_id}')
...
query = (data.writeStream
  .foreachBatch(handle_batch)
  .start())
{noformat}
I get the expected output of
{noformat}
Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0
{noformat}
But "by row" handling should work with the row having the correct column data 
type of `timestamp`.

  was:
I'm trying to process data that contains timestamps in PySpark "Structured 
Streaming" using the 
[foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
 option. When I run the job I get a `OSError: [Errno 22] Invalid argument` 
exception in \pyspark\sql\types.py at the
{noformat}
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
1000000)
{noformat}
statement.

I have boiled down my Spark job to the essentials:
{noformat}
from pyspark.sql import SparkSession

def handle_row(row):
  print(f'Processing: \{row}')

spark = (SparkSession.builder
  .appName('test.stream.tstmp.byrow')
  .getOrCreate())

data = (spark.readStream
  .option('delimiter', ',')
  .option('header', True)
  .schema('a integer, b string, c timestamp')
  .csv('data/test'))

query = (data.writeStream
  .foreach(handle_row)
  .start())

query.awaitTermination()
{noformat}

In the `data/test` folder I have one csv file:
{noformat}
a,b,c
1,x,1970-01-01 00:59:59.999
2,y,1999-12-31 23:59:59.999
3,z,2022-10-18 15:53:12.345
{noformat}

If I change the csv schema to `'a integer, b string, c string'` everything 
works fine and I get the expected output of
{noformat}
Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999')
Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999')
Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345')
{noformat}

Also, if I change the stream handling to 
[micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch]
 like so:
{noformat}
...
def handle_batch(df, epoch_id):
  print(f'Processing: {df} - Epoch: {epoch_id}')
...
query = (data.writeStream
  .foreachBatch(handle_batch)
  .start())
{noformat}
I get the expected output of
{noformat}
Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0
{noformat}

But "by row" handling should work with the row having the correct column data 
type of `timestamp`.


> Exception when handling timestamp data in PySpark Structured Streaming
> --
>
> Key: SPARK-40952
> URL: https://issues.apache.org/jira/browse/SPARK-40952
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming, Windows
>Affects Versions: 3.3.0
> Environment: OS: Windows 10 
>Reporter: Kai-Michael Roesner
>Priority: Minor
>
> I'm trying to process data that contains timestamps in PySpark "Structured 
> Streaming" using the 
> [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
>  option. When I run the job I get an {{OSError: [Errno 22] Invalid argument}} 
> exception in {{\pyspark\sql\types.py}}

[jira] [Created] (SPARK-40952) Exception when handling timestamp data in PySpark Structured Streaming

2022-10-28 Thread Kai-Michael Roesner (Jira)
Kai-Michael Roesner created SPARK-40952:
---

 Summary: Exception when handling timestamp data in PySpark 
Structured Streaming
 Key: SPARK-40952
 URL: https://issues.apache.org/jira/browse/SPARK-40952
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Structured Streaming, Windows
Affects Versions: 3.3.0
 Environment: OS: Windows 10 
Reporter: Kai-Michael Roesner


I'm trying to process data that contains timestamps in PySpark "Structured 
Streaming" using the 
[foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach]
 option. When I run the job I get an `OSError: [Errno 22] Invalid argument` 
exception in \pyspark\sql\types.py at the
{noformat}
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
{noformat}
statement.

I have boiled down my Spark job to the essentials:
{noformat}
from pyspark.sql import SparkSession

def handle_row(row):
  print(f'Processing: {row}')

spark = (SparkSession.builder
  .appName('test.stream.tstmp.byrow')
  .getOrCreate())

data = (spark.readStream
  .option('delimiter', ',')
  .option('header', True)
  .schema('a integer, b string, c timestamp')
  .csv('data/test'))

query = (data.writeStream
  .foreach(handle_row)
  .start())

query.awaitTermination()
{noformat}

In the `data/test` folder I have one csv file:
{noformat}
a,b,c
1,x,1970-01-01 00:59:59.999
2,y,1999-12-31 23:59:59.999
3,z,2022-10-18 15:53:12.345
{noformat}

If I change the csv schema to `'a integer, b string, c string'` everything 
works fine and I get the expected output of
{noformat}
Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999')
Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999')
Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345')
{noformat}

Also, if I change the stream handling to 
[micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch]
 like so:
{noformat}
...
def handle_batch(df, epoch_id):
  print(f'Processing: {df} - Epoch: {epoch_id}')
...
query = (data.writeStream
  .foreachBatch(handle_batch)
  .start())
{noformat}
I get the expected output of
{noformat}
Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0
{noformat}

But "by row" handling should work with the row having the correct column data 
type of `timestamp`.
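
If the row handler only needs the value, a possible workaround on Windows is to 
declare column {{c}} as a string in the streaming schema (which already works, as 
shown above) and parse it inside the handler. A minimal sketch of that idea, reusing 
the {{spark}} session from the job above; this is illustrative only and not a fix 
for the underlying issue:
{noformat}
import datetime

def handle_row(row):
  # Parse on the Python side: strptime builds the datetime directly and does
  # not go through fromtimestamp(), so pre-1970 values also work on Windows.
  c = datetime.datetime.strptime(row.c, '%Y-%m-%d %H:%M:%S.%f')
  print(f'Processing: a={row.a}, b={row.b}, c={c!r}')

data = (spark.readStream
  .option('delimiter', ',')
  .option('header', True)
  .schema('a integer, b string, c string')  # c kept as string instead of timestamp
  .csv('data/test'))

query = (data.writeStream
  .foreach(handle_row)
  .start())
{noformat}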



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40951) pyspark-connect tests should be skipped if pandas doesn't exist

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40951:


Assignee: Apache Spark

> pyspark-connect tests should be skipped if pandas doesn't exist
> ---
>
> Key: SPARK-40951
> URL: https://issues.apache.org/jira/browse/SPARK-40951
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40951) pyspark-connect tests should be skipped if pandas doesn't exist

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625566#comment-17625566
 ] 

Apache Spark commented on SPARK-40951:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38426

> pyspark-connect tests should be skipped if pandas doesn't exist
> ---
>
> Key: SPARK-40951
> URL: https://issues.apache.org/jira/browse/SPARK-40951
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
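
The pull request above presumably follows the usual pattern of probing for the 
optional package and skipping with {{unittest.skipIf}}. A minimal sketch of that 
pattern, with illustrative names only (not necessarily what the linked change does):
{noformat}
import unittest

# Probe for pandas once; tests that need it are skipped instead of failing
# when the package is not installed.
try:
    import pandas  # noqa: F401
    have_pandas = True
except ImportError:
    have_pandas = False


@unittest.skipIf(not have_pandas, "pandas is required for pyspark-connect tests")
class ConnectDataFrameTests(unittest.TestCase):  # hypothetical test class
    def test_placeholder(self):
        self.assertTrue(have_pandas)
{noformat}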




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40951) pyspark-connect tests should be skipped if pandas doesn't exist

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40951:


Assignee: (was: Apache Spark)

> pyspark-connect tests should be skipped if pandas doesn't exist
> ---
>
> Key: SPARK-40951
> URL: https://issues.apache.org/jira/browse/SPARK-40951
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40951) pyspark-connect tests should be skipped if pandas doesn't exist

2022-10-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-40951:
-

 Summary: pyspark-connect tests should be skipped if pandas doesn't 
exist
 Key: SPARK-40951
 URL: https://issues.apache.org/jira/browse/SPARK-40951
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, Tests
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625550#comment-17625550
 ] 

Apache Spark commented on SPARK-40229:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38425

> Re-enable excel I/O test for pandas API on Spark.
> -
>
> Key: SPARK-40229
> URL: https://issues.apache.org/jira/browse/SPARK-40229
> Project: Spark
>  Issue Type: Test
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently we're skipping the `read_excel` and `to_excel` tests for the pandas API 
> on Spark, since `openpyxl` is not installed in our test environments.
> We should enable these tests for better test coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13

2022-10-28 Thread Emil Ejbyfeldt (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emil Ejbyfeldt updated SPARK-40950:
---
Summary: isRemoteAddressMaxedOut performance overhead on scala 2.13  (was: 
On scala 2.13 isRemoteAddressMaxedOut performance overhead)

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> --
>
> Key: SPARK-40950
> URL: https://issues.apache.org/jira/browse/SPARK-40950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Emil Ejbyfeldt
>Priority: Major
>
> On Scala 2.13 the blocks in a FetchRequest are sometimes backed by a `List`, 
> while in 2.12 they would be backed by an `ArrayBuffer`. This means that 
> calculating the length of the blocks, which is done in isRemoteAddressMaxedOut 
> and other places, is now much more expensive, since `List.length` is linear in 
> the number of elements. This is because in 2.13 a `Seq` can no longer be backed 
> by a mutable collection.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40950) On scala 2.13 isRemoteAddressMaxedOut performance overhead

2022-10-28 Thread Emil Ejbyfeldt (Jira)
Emil Ejbyfeldt created SPARK-40950:
--

 Summary: On scala 2.13 isRemoteAddressMaxedOut performance overhead
 Key: SPARK-40950
 URL: https://issues.apache.org/jira/browse/SPARK-40950
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.1, 3.2.2
Reporter: Emil Ejbyfeldt


On Scala 2.13 the blocks in a FetchRequest are sometimes backed by a `List`, while 
in 2.12 they would be backed by an `ArrayBuffer`. This means that calculating the 
length of the blocks, which is done in isRemoteAddressMaxedOut and other places, is 
now much more expensive, since `List.length` is linear in the number of elements. 
This is because in 2.13 a `Seq` can no longer be backed by a mutable collection.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40949) Implement `DataFrame.sortWithinPartitions`

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40949:


Assignee: (was: Apache Spark)

> Implement `DataFrame.sortWithinPartitions`
> --
>
> Key: SPARK-40949
> URL: https://issues.apache.org/jira/browse/SPARK-40949
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40949) Implement `DataFrame.sortWithinPartitions`

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625526#comment-17625526
 ] 

Apache Spark commented on SPARK-40949:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38423

> Implement `DataFrame.sortWithinPartitions`
> --
>
> Key: SPARK-40949
> URL: https://issues.apache.org/jira/browse/SPARK-40949
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
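
For reference, the behaviour to mirror on the Connect side is the existing PySpark 
{{DataFrame.sortWithinPartitions}}, which orders rows only within each partition and 
does not shuffle data across partitions. A short usage sketch against a regular 
(non-Connect) session:
{noformat}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sort-within-partitions-demo').getOrCreate()

df = spark.range(8).repartition(2)

# Each partition is sorted independently; unlike orderBy/sort there is no
# total ordering across partitions and no global shuffle.
df.sortWithinPartitions('id', ascending=False).show()
{noformat}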




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40949) Implement `DataFrame.sortWithinPartitions`

2022-10-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40949:


Assignee: Apache Spark

> Implement `DataFrame.sortWithinPartitions`
> --
>
> Key: SPARK-40949
> URL: https://issues.apache.org/jira/browse/SPARK-40949
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40949) Implement `DataFrame.sortWithinPartitions`

2022-10-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625524#comment-17625524
 ] 

Apache Spark commented on SPARK-40949:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38423

> Implement `DataFrame.sortWithinPartitions`
> --
>
> Key: SPARK-40949
> URL: https://issues.apache.org/jira/browse/SPARK-40949
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


