[jira] [Assigned] (SPARK-40946) Introduce a new DataSource V2 interface SupportsPushDownClusterKeys
[ https://issues.apache.org/jira/browse/SPARK-40946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40946:
------------------------------------

    Assignee: Apache Spark

> Introduce a new DataSource V2 interface SupportsPushDownClusterKeys
> -------------------------------------------------------------------
>
>                 Key: SPARK-40946
>                 URL: https://issues.apache.org/jira/browse/SPARK-40946
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Huaxin Gao
>            Assignee: Apache Spark
>            Priority: Major
>
> A mix-in interface for ScanBuilder. Data sources can implement this interface
> to push down all the join or aggregate keys to the data source. A return value of
> true indicates that the data source will return input partitions that follow the
> clustering keys. A return value of false indicates that the data source makes no
> such guarantee, even though it may still report a partitioning that may or may
> not be compatible with the given clustering keys; in that case it is Spark's
> responsibility to group the input partitions where applicable.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
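Based on the description above, a minimal sketch of what such a mix-in might look like. The trait name comes from the issue title, but the method name, its signature, and the use of plain column-name strings for the cluster keys are assumptions for illustration, not the final API:

```scala
// Hedged sketch of the proposed ScanBuilder mix-in; signature is assumed.
trait ScanBuilder

trait SupportsPushDownClusterKeys extends ScanBuilder {
  // true  -> the source guarantees input partitions clustered on `keys`
  // false -> no guarantee; Spark must group the input partitions itself
  def pushDownClusterKeys(keys: Array[String]): Boolean
}

// Hypothetical source that can only honor clustering on its single
// partition column, here called "part_col" (illustrative name).
class ToyScanBuilder extends SupportsPushDownClusterKeys {
  def pushDownClusterKeys(keys: Array[String]): Boolean =
    keys.sameElements(Array("part_col"))
}
```

Under this sketch, Spark would call the method during scan planning and skip its own shuffle/grouping only when the source returns true.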
[jira] [Commented] (SPARK-40946) Introduce a new DataSource V2 interface SupportsPushDownClusterKeys
[ https://issues.apache.org/jira/browse/SPARK-40946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626007#comment-17626007 ]

Apache Spark commented on SPARK-40946:
--------------------------------------

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/38434
[jira] [Assigned] (SPARK-40946) Introduce a new DataSource V2 interface SupportsPushDownClusterKeys
[ https://issues.apache.org/jira/browse/SPARK-40946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40946:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-40964) Cannot run spark history server with shaded hadoop jar
[ https://issues.apache.org/jira/browse/SPARK-40964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YUBI LEE updated SPARK-40964:
-----------------------------
    Description:

Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+. If you start the Spark History Server with the shaded client jars and enable security using org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will hit the following exception.

{code}
# spark-env.sh
export SPARK_HISTORY_OPTS='-Dspark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter -Dspark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=kerberos,kerberos.principal=HTTP/some.example@example.com,kerberos.keytab=/etc/security/keytabs/spnego.service.keytab"'
{code}

{code}
# spark history server's out file
22/10/27 15:29:48 INFO AbstractConnector: Started ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a javax.servlet.Filter
        at org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
        at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
        at org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
        at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
        at org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
        at org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
        at org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
        at org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
        at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
        at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
        at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
        at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
        at org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}

I think "AuthenticationFilter" in the shaded jar implements the relocated "org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter":

{code}
❯ grep -r org.apache.hadoop.shaded.javax.servlet.Filter *
Binary file hadoop-client-runtime-3.3.1.jar matches
{code}

That mismatch causes the exception above. I'm not sure what the best answer is. A workaround is not to use a Spark build pre-built for Apache Hadoop, and instead to set `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the Spark History Server. Possible options are:
- do not shade "javax.servlet.Filter" in the Hadoop shaded jar, or
- shade "javax.servlet.Filter" in Jetty as well.
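The failure mode described above can be sketched in a few lines. The traits below are stand-ins, not the real javax.servlet API: a class that implements a *relocated* copy of an interface is not an instance of the original interface, which is exactly the check Jetty's FilterHolder performs before installing a filter.

```scala
// Stand-in types (illustrative, not the real servlet API):
trait OriginalFilter                      // plays javax.servlet.Filter
trait RelocatedFilter                     // plays org.apache.hadoop.shaded.javax.servlet.Filter
class AuthFilter extends RelocatedFilter  // plays the shaded AuthenticationFilter

// Jetty effectively performs this instance check before starting the filter;
// for the shaded class it is false, hence the IllegalStateException.
def isServletFilter(obj: Any): Boolean = obj.isInstanceOf[OriginalFilter]
```

Relocation rewrites the bytecode's interface reference, so no classpath ordering can make the check pass; only unshading one side or shading both consistently resolves it.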
[jira] [Created] (SPARK-40964) Cannot run spark history server with shaded hadoop jar
YUBI LEE created SPARK-40964:
--------------------------------

             Summary: Cannot run spark history server with shaded hadoop jar
                 Key: SPARK-40964
                 URL: https://issues.apache.org/jira/browse/SPARK-40964
             Project: Spark
          Issue Type: Bug
          Components: Web UI
    Affects Versions: 3.2.2
            Reporter: YUBI LEE

Since SPARK-33212, Spark uses the shaded client jars from Hadoop 3.x+. In this situation, if you start the Spark History Server with the shaded client jars and enable security using org.apache.hadoop.security.authentication.server.AuthenticationFilter, you will hit the following exception.

{code}
22/10/27 15:29:48 INFO AbstractConnector: Started ServerConnector@5ca1f591{HTTP/1.1, (http/1.1)}{0.0.0.0:18081}
22/10/27 15:29:48 INFO Utils: Successfully started service 'HistoryServerUI' on port 18081.
22/10/27 15:29:48 INFO ServerInfo: Adding filter to /: org.apache.hadoop.security.authentication.server.AuthenticationFilter
22/10/27 15:29:48 ERROR HistoryServer: Failed to bind HistoryServer
java.lang.IllegalStateException: class org.apache.hadoop.security.authentication.server.AuthenticationFilter is not a javax.servlet.Filter
        at org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:103)
        at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
        at org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$0(ServletHandler.java:730)
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
        at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
        at org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:755)
        at org.sparkproject.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:379)
        at org.sparkproject.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:910)
        at org.sparkproject.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:288)
        at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
        at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:491)
        at org.apache.spark.ui.WebUI.$anonfun$bind$3(WebUI.scala:148)
        at org.apache.spark.ui.WebUI.$anonfun$bind$3$adapted(WebUI.scala:148)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.ui.WebUI.bind(WebUI.scala:148)
        at org.apache.spark.deploy.history.HistoryServer.bind(HistoryServer.scala:164)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:310)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
{code}

I think "AuthenticationFilter" in the shaded jar implements the relocated "org.apache.hadoop.shaded.javax.servlet.Filter", not "javax.servlet.Filter". That mismatch causes the exception above. I'm not sure what the best answer is. A workaround is not to use a Spark build pre-built for Apache Hadoop, and instead to set `HADOOP_HOME` or `SPARK_DIST_CLASSPATH` in spark-env.sh for the Spark History Server. Possible options are:
- do not shade "javax.servlet.Filter" in the Hadoop shaded jar, or
- shade "javax.servlet.Filter" in Jetty as well.
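The workaround mentioned in the report could look roughly like the spark-env.sh fragment below. The installation path is purely illustrative; the point is that the History Server picks up an unshaded Hadoop classpath instead of the bundled shaded client jars.

```shell
# spark-env.sh (sketch; /opt/hadoop is a hypothetical install location)
export HADOOP_HOME=/opt/hadoop
# Use the unshaded Hadoop jars, whose AuthenticationFilter implements the
# real javax.servlet.Filter, instead of hadoop-client-runtime's relocated copy.
export SPARK_DIST_CLASSPATH="$("${HADOOP_HOME}/bin/hadoop" classpath)"
```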
[jira] [Updated] (SPARK-40963) containsNull in array type attributes is not updated from child output
[ https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-40963: -- Affects Version/s: 3.1.3 > containsNull in array type attributes is not updated from child output > -- > > Key: SPARK-40963 > URL: https://issues.apache.org/jira/browse/SPARK-40963 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.2, 3.4.0, 3.3.1 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Example: > {noformat} > select c1, explode(c4) as c5 from ( > select c1, array(c3) as c4 from ( > select c1, explode_outer(c2) as c3 > from values > (1, array(1, 2)), > (2, array(2, 3)), > (3, null) > as data(c1, c2) > ) > ); > +---+---+ > |c1 |c5 | > +---+---+ > |1 |1 | > |1 |2 | > |2 |2 | > |2 |3 | > |3 |0 | > +---+---+ > {noformat} > In the last row, {{c5}} is 0, but should be {{NULL}}. > At the time {{ResolveGenerate.makeGeneratorOutput}} is called for > {{explode(c4)}}, {{c3}} has nullable set to false, so {{c4}}'s data type has > {{containsNull}} also set to false. Later, {{c3}}'s nullability is updated > and {{c4}}'s data type reports containsNull = true, but two things go wrong: > * The {{containsNull}} setting for {{c4}} is not propagated to parent > operators (so the attribute {{c4}} in {{explode(c4)}} still has containsNull > = false) > * Even if it were propagated, {{generatorOutput}} for {{explode(c4)}} is > already determined and won't be recalculated. 
> Another example: > {noformat} > select c1, inline_outer(c4) from ( > select c1, array(c3) as c4 from ( > select c1, explode_outer(c2) as c3 > from values > (1, array(named_struct('a', 1, 'b', 2))), > (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))), > (3, null) > as data(c1, c2) > ) > ); > 22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364) > > - > - > - > {noformat}
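The NULL-becomes-0 symptom above has a simple mechanical analogy (plain Python, not Spark's actual codegen; names and logic are illustrative): a reader that trusts a schema wrongly claiming containsNull = false skips the null check, so a null slot surfaces as the element type's primitive default.

```python
# Analogy for the bug above: a reader that trusts a (stale) containsNull flag.
# This is illustrative only, not how Spark's generated code is written.
def read_int_slot(slot, contains_null_declared):
    """Read an array element through a declared element-nullability flag."""
    if contains_null_declared:
        return slot  # None is preserved and surfaces as SQL NULL
    # "Primitive" read path: no null check, so None degrades to the default 0.
    return slot if slot is not None else 0

# With the stale containsNull = false (as in the report), NULL shows up as 0:
print(read_int_slot(None, contains_null_declared=False))  # prints 0
# With the correct flag, NULL is preserved:
print(read_int_slot(None, contains_null_declared=True))   # prints None
```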
[jira] [Updated] (SPARK-40963) containsNull in array type attributes is not updated from child output
[ https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-40963: -- Affects Version/s: 3.2.2
[jira] [Updated] (SPARK-40963) containsNull in array type attributes is not updated from child output
[ https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-40963: -- Affects Version/s: 3.3.1
[jira] [Commented] (SPARK-40963) containsNull in array type attributes is not updated from child output
[ https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625955#comment-17625955 ] Bruce Robbins commented on SPARK-40963: --- I'll take a stab at fixing this in the next few days.
[jira] [Created] (SPARK-40963) containsNull in array type attributes is not updated from child output
Bruce Robbins created SPARK-40963: - Summary: containsNull in array type attributes is not updated from child output Key: SPARK-40963 URL: https://issues.apache.org/jira/browse/SPARK-40963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Bruce Robbins
[jira] [Commented] (SPARK-40939) Release a shaded version of Apache Spark / shade jars on main jar
[ https://issues.apache.org/jira/browse/SPARK-40939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625953#comment-17625953 ] Erik Krogen commented on SPARK-40939: - As a reference for prior work, there is also HADOOP-11656, in which Hadoop began publishing a new {{hadoop-client-runtime}} JAR into which all of the transitive dependencies are shaded. In the [proposal|https://issues.apache.org/jira/secure/attachment/12709266/HADOOP-11656_proposal.md] a technique similar to Flink's was proposed and eventually rejected due to higher maintenance burden to publish separate artifacts for each downstream library that is shaded. There are some pitfalls that come with Spark being a Scala project, unlike Hadoop/Flink which are Java based. Most shading tools cannot handle certain Scala language elements, specifically {{ScalaSig}} causes problems because shading tools that are not Scala-aware do not perform relocations within the {{ScalaSig}} (see examples [one|https://github.com/coursier/coursier/issues/454#issuecomment-288969207] and [two|https://lists.apache.org/thread/x7b4z0os9zbzzprb5scft7b4wnr7c3mv] and [this previous Spark PR that tried to shade Jackson|https://github.com/apache/spark/pull/10931]). That being said, {{sbt}}'s [assembly plugin has had support for this since 2020|https://github.com/sbt/sbt-assembly/pull/393], and this functionality was subsequently pulled out into a standalone library, [Jar Jar Abrams|http://eed3si9n.com/jarjar-abrams/]. So there is hope that this should be more achievable now than it was back in 2016 when that PR was filed. There's also been [interest in shading all of Spark's dependencies on the Spark dev-list|https://lists.apache.org/thread/vkkx8s2zv0ln7j7oo46k30x084mn163p]. 
I would love to hear what the community thinks of pursuing this earnestly with the tools available in 2022, though [~almogtavor] I'll note that this type of large change is better discussed on the dev mailing list (and probably with an accompanying SPIP). > Release a shaded version of Apache Spark / shade jars on main jar > - > > Key: SPARK-40939 > URL: https://issues.apache.org/jira/browse/SPARK-40939 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.4.0 >Reporter: Almog Tavor >Priority: Major > > I suggest shading dependencies in Apache Spark to resolve the dependency hell that may > occur when building or deploying Apache Spark. This mainly occurs in Java > projects and Hadoop environments, but shading will also help when using Spark > with Scala and even Python. > Flink has a similar solution, delivering > [flink-shaded|https://github.com/apache/flink-shaded/blob/master/README.md]. > The dependencies I think are relevant for shading are Jackson, Guava, > Netty & any of the Hadoop ecosystem if possible. > As for releasing sources for the shaded version, I think the [issue that has > been raised in Flink|https://github.com/apache/flink-shaded/issues/25] is > relevant and unanswered here too, hence I don't think that's an option > currently (personally I don't see any value in it either).
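For concreteness, Scala-aware shading with sbt-assembly (which, as noted above, now rewrites ScalaSig via Jar Jar Abrams) looks roughly like the sketch below. This is a hypothetical build fragment, not a vetted change to Spark's build; the relocation prefixes are placeholders.

```scala
// build.sbt -- hypothetical sbt-assembly shading rules for a Spark-like project.
// Requires the sbt-assembly plugin; recent versions relocate references
// inside ScalaSig as well, which plain jarjar-style tools miss.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "org.sparkproject.jackson.@1").inAll,
  ShadeRule.rename("com.google.common.**"     -> "org.sparkproject.guava.@1").inAll
)
```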
[jira] [Assigned] (SPARK-40943) Make MSCK optional in MSCK REPAIR TABLE commands
[ https://issues.apache.org/jira/browse/SPARK-40943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40943: Assignee: Apache Spark > Make MSCK optional in MSCK REPAIR TABLE commands > > > Key: SPARK-40943 > URL: https://issues.apache.org/jira/browse/SPARK-40943 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.1 >Reporter: Ben Zhang >Assignee: Apache Spark >Priority: Major > > The current syntax for `MSCK REPAIR TABLE` is complex and difficult to > understand. The proposal is to make the `MSCK` keyword optional so that > `REPAIR TABLE` may be used in its stead.
[jira] [Assigned] (SPARK-40943) Make MSCK optional in MSCK REPAIR TABLE commands
[ https://issues.apache.org/jira/browse/SPARK-40943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40943: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-40943) Make MSCK optional in MSCK REPAIR TABLE commands
[ https://issues.apache.org/jira/browse/SPARK-40943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625952#comment-17625952 ] Apache Spark commented on SPARK-40943: -- User 'ben-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/38433
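Under the proposal, both spellings below would be equivalent (the table name is a placeholder):

```sql
-- Today's required form:
MSCK REPAIR TABLE my_partitioned_table;

-- Proposed: the MSCK keyword becomes optional.
REPAIR TABLE my_partitioned_table;
```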
[jira] [Commented] (SPARK-36124) Support set operators to be on correlation paths
[ https://issues.apache.org/jira/browse/SPARK-36124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625881#comment-17625881 ] Apache Spark commented on SPARK-36124: -- User 'jchen5' has created a pull request for this issue: https://github.com/apache/spark/pull/38432 > Support set operators to be on correlation paths > > > Key: SPARK-36124 > URL: https://issues.apache.org/jira/browse/SPARK-36124 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > A correlation path is defined as the sub-tree of all the operators that are > on the path from the operator hosting the correlated expressions up to the > operator producing the correlated values. > We want to support set operators such as union and intersect to be on > correlation paths by adding them in DecorrelateInnerQuery. Please see page > 391 for more details: > [https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf]
[jira] [Commented] (SPARK-36124) Support set operators to be on correlation paths
[ https://issues.apache.org/jira/browse/SPARK-36124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625879#comment-17625879 ] Apache Spark commented on SPARK-36124: -- User 'jchen5' has created a pull request for this issue: https://github.com/apache/spark/pull/38432
[jira] [Assigned] (SPARK-36124) Support set operators to be on correlation paths
[ https://issues.apache.org/jira/browse/SPARK-36124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36124: Assignee: Apache Spark
[jira] [Assigned] (SPARK-36124) Support set operators to be on correlation paths
[ https://issues.apache.org/jira/browse/SPARK-36124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36124: Assignee: (was: Apache Spark)
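For intuition about what "a set operator on a correlation path" means, here is a sketch (table and column names are invented) of a query where a UNION ALL sits between the operators hosting the outer-column references and the subquery root:

```sql
SELECT *
FROM t1
WHERE EXISTS (
  -- The UNION ALL lies on the correlation path: each branch below it
  -- references the outer column t1.id.
  SELECT 1 FROM t2 WHERE t2.fk = t1.id
  UNION ALL
  SELECT 1 FROM t3 WHERE t3.fk = t1.id
);
```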
[jira] [Updated] (SPARK-39319) Make query context as part of SparkThrowable
[ https://issues.apache.org/jira/browse/SPARK-39319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-39319: --- Parent: SPARK-38615 Issue Type: Sub-task (was: Improvement) > Make query context as part of SparkThrowable > > > Key: SPARK-39319 > URL: https://issues.apache.org/jira/browse/SPARK-39319 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 >
[jira] [Updated] (SPARK-39365) Truncate fragment of query context if it is too long
[ https://issues.apache.org/jira/browse/SPARK-39365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-39365: --- Parent: (was: SPARK-39319) Issue Type: Task (was: Sub-task) > Truncate fragment of query context if it is too long > > > Key: SPARK-39365 > URL: https://issues.apache.org/jira/browse/SPARK-39365 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major >
[jira] [Commented] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
[ https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625853#comment-17625853 ] Apache Spark commented on SPARK-40956: -- User 'carlfu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/38404 > SQL Equivalent for Dataframe overwrite command > -- > > Key: SPARK-40956 > URL: https://issues.apache.org/jira/browse/SPARK-40956 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.2 >Reporter: chengyan fu >Priority: Minor > > Proposing syntax > {code:java} > INSERT INTO tbl REPLACE whereClause identifierList{code} > to Spark SQL, as the equivalent of the > [dataframe.overwrite()|https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L163] > command. > > For example > {code:java} > INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code} > will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows > from table2
[jira] [Assigned] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
[ https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40956: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
[ https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40956: Assignee: Apache Spark
[jira] [Commented] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
[ https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625851#comment-17625851 ] Apache Spark commented on SPARK-40956: -- User 'carlfu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/38404
[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
[ https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengyan fu updated SPARK-40956: Description: Proposing syntax {code:java} INSERT INTO tbl REPLACE whereClause identifierList{code} to the spark SQL, as the equivalent of [dataframe.overwrite()|https://github.com/apache/spark/blob/35d00df9bba7238ad4f40617fae4d04ddbfd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala#L163] command. For Example {code:java} INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code} will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 was: Proposing syntax {code:java} INSERT INTO tbl REPLACE whereClause identifierList{code} to the spark SQL, as the equivalent of dataframe overwrite() command. For Example {code:java} INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code} will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2
[jira] [Updated] (SPARK-38615) SQL Error Attribution Framework
[ https://issues.apache.org/jira/browse/SPARK-38615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-38615: --- Summary: SQL Error Attribution Framework (was: Provide error context for runtime ANSI failures) > SQL Error Attribution Framework > --- > > Key: SPARK-38615 > URL: https://issues.apache.org/jira/browse/SPARK-38615 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Priority: Major > > Currently, there is not enough error context for runtime ANSI failures. > In the following example, the error message only tells that there is a > "divide by zero" error, without pointing out where the exact SQL statement is. > {code:java} > > SELECT > ss1.ca_county, > ss1.d_year, > ws2.web_sales / ws1.web_sales web_q1_q2_increase, > ss2.store_sales / ss1.store_sales store_q1_q2_increase, > ws3.web_sales / ws2.web_sales web_q2_q3_increase, > ss3.store_sales / ss2.store_sales store_q2_q3_increase > FROM > ss ss1, ss ss2, ss ss3, ws ws1, ws ws2, ws ws3 > WHERE > ss1.d_qoy = 1 > AND ss1.d_year = 2000 > AND ss1.ca_county = ss2.ca_county > AND ss2.d_qoy = 2 > AND ss2.d_year = 2000 > AND ss2.ca_county = ss3.ca_county > AND ss3.d_qoy = 3 > AND ss3.d_year = 2000 > AND ss1.ca_county = ws1.ca_county > AND ws1.d_qoy = 1 > AND ws1.d_year = 2000 > AND ws1.ca_county = ws2.ca_county > AND ws2.d_qoy = 2 > AND ws2.d_year = 2000 > AND ws1.ca_county = ws3.ca_county > AND ws3.d_qoy = 3 > AND ws3.d_year = 2000 > AND CASE WHEN ws1.web_sales > 0 > THEN ws2.web_sales / ws1.web_sales > ELSE NULL END > > CASE WHEN ss1.store_sales > 0 > THEN ss2.store_sales / ss1.store_sales > ELSE NULL END > AND CASE WHEN ws2.web_sales > 0 > THEN ws3.web_sales / ws2.web_sales > ELSE NULL END > > CASE WHEN ss2.store_sales > 0 > THEN ss3.store_sales / ss2.store_sales > ELSE NULL END > ORDER BY ss1.ca_county > {code} > {code:java} > org.apache.spark.SparkArithmeticException: divide by zero at > 
org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:140) > at > org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:437) > at > org.apache.spark.sql.catalyst.expressions.DivModLike.eval$(arithmetic.scala:425) > at > org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:534) > {code} > > I suggest that we provide details in the error message, including: > * the problematic expression from the original SQL query, e.g. > "ss3.store_sales / ss2.store_sales store_q2_q3_increase" > * the line number and starting char position of the problematic expression, > in case of queries like "select a + b from t1 union select a + b from t2" -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
[ https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengyan fu updated SPARK-40956: Description: Proposing syntax {code:java} INSERT INTO tbl REPLACE whereClause identifierList{code} to the spark SQL, as the equivalent of dataframe overwrite() command. For Example {code:java} INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code} will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 was: Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to the spark SQL, as the equivalent of dataframe overwrite command. For Example ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 > SQL Equivalent for Dataframe overwrite command > -- > > Key: SPARK-40956 > URL: https://issues.apache.org/jira/browse/SPARK-40956 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.2 >Reporter: chengyan fu >Priority: Minor > > Proposing syntax > {code:java} > INSERT INTO tbl REPLACE whereClause identifierList{code} > to the spark SQL, as the equivalent of dataframe overwrite() command. > > For Example > {code:java} > INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2{code} > will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows > from table2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-40686) Support data masking and redacting built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40686 ] Vinod KC deleted comment on SPARK-40686: -- was (Author: vinodkc): I'm working on these 6 sub-tasks > Support data masking and redacting built-in functions > - > > Key: SPARK-40686 > URL: https://issues.apache.org/jira/browse/SPARK-40686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > Support built-in data masking and redacting functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40686) Support data masking and redacting built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625827#comment-17625827 ] Daniel commented on SPARK-40686: > Daniel , Yes, please add them here. Let us change the description and title > of this parent Jira title accordingly (y) I added a few subtasks for new functions. > Support data masking and redacting built-in functions > - > > Key: SPARK-40686 > URL: https://issues.apache.org/jira/browse/SPARK-40686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > Support built-in data masking and redacting functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40686) Support data masking and redacting built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625827#comment-17625827 ] Daniel edited comment on SPARK-40686 at 10/28/22 5:40 PM: -- bq. Daniel , Yes, please add them here. Let us change the description and title of this parent Jira title accordingly (y) I added a few subtasks for new functions. was (Author: JIRAUSER285772): > Daniel , Yes, please add them here. Let us change the description and title > of this parent Jira title accordingly (y) I added a few subtasks for new functions. > Support data masking and redacting built-in functions > - > > Key: SPARK-40686 > URL: https://issues.apache.org/jira/browse/SPARK-40686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > Support built-in data masking and redacting functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40918) Mismatch between ParquetFileFormat and FileSourceScanExec in # columns for WSCG.isTooManyFields when using _metadata
[ https://issues.apache.org/jira/browse/SPARK-40918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625826#comment-17625826 ] Apache Spark commented on SPARK-40918: -- User 'juliuszsompolski' has created a pull request for this issue: https://github.com/apache/spark/pull/38431 > Mismatch between ParquetFileFormat and FileSourceScanExec in # columns for > WSCG.isTooManyFields when using _metadata > > > Key: SPARK-40918 > URL: https://issues.apache.org/jira/browse/SPARK-40918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Juliusz Sompolski >Priority: Major > > _metadata.columns are taken into account in > FileSourceScanExec.supportColumnar, but not when the parquet reader is > created. This can result in Parquet reader outputting columnar (because it > has less columns than WSCG.isTooManyFields), whereas FileSourceScanExec wants > row output (because with the extra metadata columns it hits the > isTooManyFields limit). > I have a fix forthcoming. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40962) Support data masking built-in function 'diff_privacy'
Daniel created SPARK-40962: -- Summary: Support data masking built-in function 'diff_privacy' Key: SPARK-40962 URL: https://issues.apache.org/jira/browse/SPARK-40962 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Daniel This function can compute the Laplace distribution for integer values. It may be sufficient to implement a basic version of this for integer values only; full differential privacy support remains a separate effort. Background on the math: https://en.wikipedia.org/wiki/Laplace_distribution -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
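The Laplace-noise idea behind `diff_privacy` can be sketched as follows; the function name mirrors the proposed built-in, but the signature and parameters here are assumptions, not the Spark API:

```python
import math
import random

def diff_privacy(value, epsilon, sensitivity=1, rng=random):
    """Return `value` plus integer-rounded Laplace(0, sensitivity/epsilon) noise.

    Uses inverse-CDF sampling: if U ~ Uniform(-0.5, 0.5), then
    -scale * sign(U) * ln(1 - 2|U|) follows a Laplace(0, scale) distribution.
    Smaller epsilon means a larger scale, i.e. more noise and more privacy.
    """
    scale = sensitivity / epsilon
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return value + round(noise)
```

A basic integer-only version like this matches the scoping note above; calibrating `sensitivity` per query and tracking a privacy budget are the parts that make full differential-privacy support a separate effort.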
[jira] [Created] (SPARK-40961) Support data masking built-in functions for HIPAA compliant age/date/birthday/zipcode display
Daniel created SPARK-40961: -- Summary: Support data masking built-in functions for HIPAA compliant age/date/birthday/zipcode display Key: SPARK-40961 URL: https://issues.apache.org/jira/browse/SPARK-40961 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Daniel For example, these could be named `phi_age`, `phi_date`, `phi_dob`, `phi_zip3`, respectively. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
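For `phi_zip3`, the HIPAA Safe Harbor rule allows keeping the first three digits of a ZIP code unless that three-digit area has 20,000 or fewer residents, in which case it must be reported as 000. A minimal sketch, with the caveat that the restricted-prefix set below is a placeholder, not the actual census-derived list:

```python
# PLACEHOLDER restricted prefixes; the real list comes from census data
# (the three-digit areas with 20,000 or fewer residents).
RESTRICTED_ZIP3 = {"036", "059", "102"}

def phi_zip3(zip_code):
    """Truncate a ZIP code to its first three digits, or '000' if restricted."""
    prefix = str(zip_code)[:3]
    return "000" if prefix in RESTRICTED_ZIP3 else prefix
```

The age/date/birthday variants would follow the same pattern: generalize the value (e.g. cap ages at 90, drop day/month from dates) rather than remove it entirely.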
[jira] [Updated] (SPARK-40686) Support data masking and redacting built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-40686: - Description: Support built-in data masking and redacting functions (was: Support built-in data masking functions ) > Support data masking and redacting built-in functions > - > > Key: SPARK-40686 > URL: https://issues.apache.org/jira/browse/SPARK-40686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > Support built-in data masking and redacting functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40960) Support data masking built-in function 'bcrypt'
Daniel created SPARK-40960: -- Summary: Support data masking built-in function 'bcrypt' Key: SPARK-40960 URL: https://issues.apache.org/jira/browse/SPARK-40960 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Daniel This function can compute salted password hashes (bcrypt is a hash function, not an encryption scheme). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40959) Support data masking built-in function 'mask_default'
Daniel created SPARK-40959: -- Summary: Support data masking built-in function 'mask_default' Key: SPARK-40959 URL: https://issues.apache.org/jira/browse/SPARK-40959 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Daniel This can be a simple masking function to set a default value for each type. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
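A per-type default table is the whole trick for `mask_default`; the mapping below is illustrative, and the actual defaults for a Spark built-in would need to be specified per Catalyst data type:

```python
import datetime

# Illustrative per-type defaults. bool must precede int in the lookup,
# since bool is a subclass of int in Python.
_DEFAULTS = {
    bool: False,
    int: 0,
    float: 0.0,
    str: "",
    datetime.date: datetime.date(1970, 1, 1),
}

def mask_default(value):
    """Replace `value` with a fixed default for its type; None stays None."""
    if value is None:
        return None
    for tp, default in _DEFAULTS.items():
        if isinstance(value, tp):
            return default
    raise TypeError(f"no default registered for {type(value).__name__}")
```

This preserves the column's schema (the masked value has the same type as the original), which is the property that distinguishes `mask_default` from the `null` function in the sibling sub-task.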
[jira] [Created] (SPARK-40958) Support data masking built-in function 'null'
Daniel created SPARK-40958: -- Summary: Support data masking built-in function 'null' Key: SPARK-40958 URL: https://issues.apache.org/jira/browse/SPARK-40958 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Daniel This can be a simple function that returns a NULL value of the given input type. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40686) Support data masking and redacting built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel updated SPARK-40686: --- Summary: Support data masking and redacting built-in functions (was: Support data masking built-in functions) > Support data masking and redacting built-in functions > - > > Key: SPARK-40686 > URL: https://issues.apache.org/jira/browse/SPARK-40686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > Support built-in data masking functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40957) Add in memory cache in HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-40957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625821#comment-17625821 ] Apache Spark commented on SPARK-40957: -- User 'jerrypeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38430 > Add in memory cache in HDFSMetadataLog > -- > > Key: SPARK-40957 > URL: https://issues.apache.org/jira/browse/SPARK-40957 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Boyang Jerry Peng >Priority: Major > > Every time an entry in the offset log or commit log needs to be accessed, we read > from disk, which is slow. A cache of recent entries could speed up reads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40957) Add in memory cache in HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-40957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40957: Assignee: Apache Spark > Add in memory cache in HDFSMetadataLog > -- > > Key: SPARK-40957 > URL: https://issues.apache.org/jira/browse/SPARK-40957 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Boyang Jerry Peng >Assignee: Apache Spark >Priority: Major > > Every time an entry in the offset log or commit log needs to be accessed, we read > from disk, which is slow. A cache of recent entries could speed up reads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40957) Add in memory cache in HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-40957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625820#comment-17625820 ] Apache Spark commented on SPARK-40957: -- User 'jerrypeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38430 > Add in memory cache in HDFSMetadataLog > -- > > Key: SPARK-40957 > URL: https://issues.apache.org/jira/browse/SPARK-40957 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Boyang Jerry Peng >Priority: Major > > Every time an entry in the offset log or commit log needs to be accessed, we read > from disk, which is slow. A cache of recent entries could speed up reads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40957) Add in memory cache in HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-40957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40957: Assignee: (was: Apache Spark) > Add in memory cache in HDFSMetadataLog > -- > > Key: SPARK-40957 > URL: https://issues.apache.org/jira/browse/SPARK-40957 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Boyang Jerry Peng >Priority: Major > > Every time an entry in the offset log or commit log needs to be accessed, we read > from disk, which is slow. A cache of recent entries could speed up reads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40686) Support data masking built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625819#comment-17625819 ] Vinod KC commented on SPARK-40686: -- [~dtenedor] , Yes, please add them here. Let us change the description and title of this parent Jira title accordingly > Support data masking built-in functions > --- > > Key: SPARK-40686 > URL: https://issues.apache.org/jira/browse/SPARK-40686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > Support built-in data masking functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40957) Add in memory cache in HDFSMetadataLog
Boyang Jerry Peng created SPARK-40957: - Summary: Add in memory cache in HDFSMetadataLog Key: SPARK-40957 URL: https://issues.apache.org/jira/browse/SPARK-40957 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Boyang Jerry Peng Every time an entry in the offset log or commit log needs to be accessed, we read from disk, which is slow. A cache of recent entries could speed up reads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
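The caching idea can be sketched with a small LRU wrapper around a slow read path. The class and method names below are hypothetical and unrelated to the actual HDFSMetadataLog implementation; it only illustrates the read-through pattern:

```python
from collections import OrderedDict

class CachedMetadataLog:
    """Wrap a slow get(batch_id) -> entry reader with a small LRU cache.

    Recent entries (the common access pattern for offset/commit logs)
    are served from memory; only cache misses touch the disk.
    """
    def __init__(self, disk_read, capacity=10):
        self._disk_read = disk_read   # slow function: batch_id -> entry
        self._capacity = capacity
        self._cache = OrderedDict()

    def get(self, batch_id):
        if batch_id in self._cache:
            self._cache.move_to_end(batch_id)   # mark as recently used
            return self._cache[batch_id]
        entry = self._disk_read(batch_id)       # cache miss: read from disk
        self._cache[batch_id] = entry
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)     # evict least recently used
        return entry
```

A bounded capacity matters here because streaming queries run indefinitely; an unbounded cache of log entries would grow with the batch count.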
[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
[ https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengyan fu updated SPARK-40956: Description: Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to the spark SQL, as the equivalent of dataframe overwrite command. For Example ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 was: {code:java} {code} Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to the spark SQL, as the equivalent of dataframe overwrite command. For Example ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 > SQL Equivalent for Dataframe overwrite command > -- > > Key: SPARK-40956 > URL: https://issues.apache.org/jira/browse/SPARK-40956 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.2 >Reporter: chengyan fu >Priority: Minor > > Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to > the spark SQL, as the equivalent of dataframe overwrite command. > > For Example > ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in > an atomic operation, 1) delete rows with key = 3 and 2) insert rows from > table2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
[ https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengyan fu updated SPARK-40956: Description: {code:java} {code} Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to the spark SQL, as the equivalent of dataframe overwrite command. For Example ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 was: Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to the spark SQL, as the equivalent of dataframe overwrite command. For Example ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 > SQL Equivalent for Dataframe overwrite command > -- > > Key: SPARK-40956 > URL: https://issues.apache.org/jira/browse/SPARK-40956 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.2 >Reporter: chengyan fu >Priority: Minor > > > {code:java} > {code} > Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to > the spark SQL, as the equivalent of dataframe overwrite command. > > For Example > ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in > an atomic operation, 1) delete rows with key = 3 and 2) insert rows from > table2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
[ https://issues.apache.org/jira/browse/SPARK-40956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengyan fu updated SPARK-40956: Description: Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to the spark SQL, as the equivalent of dataframe overwrite command. For Example ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 was: Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to the spark SQL, as the equivalent of dataframe overwrite command. For Example ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 > SQL Equivalent for Dataframe overwrite command > -- > > Key: SPARK-40956 > URL: https://issues.apache.org/jira/browse/SPARK-40956 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.2 >Reporter: chengyan fu >Priority: Minor > > Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to > the spark SQL, as the equivalent of dataframe overwrite command. > For Example > ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will, in > an atomic operation, 1) delete rows with key = 3 and 2) insert rows from > table2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40956) SQL Equivalent for Dataframe overwrite command
chengyan fu created SPARK-40956: --- Summary: SQL Equivalent for Dataframe overwrite command Key: SPARK-40956 URL: https://issues.apache.org/jira/browse/SPARK-40956 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.2 Reporter: chengyan fu Proposing syntax ```INSERT INTO tbl REPLACE whereClause identifierList``` to the spark SQL, as the equivalent of dataframe overwrite command. For Example ```INSERT INTO table1 REPLACE WHERE key = 3 SELECT * FROM table2``` will in an atomic operation, 1) delete rows with key = 3 and 2) insert rows from table2 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40686) Support data masking built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625805#comment-17625805 ] Daniel commented on SPARK-40686: In addition to the functions listed in subtasks here so far, I was also considering the following similar functions. Should we think about adding them here too? (I know the original intent for this Jira was to achieve parity with Hive.) ||Name||Description|| |bcrypt|Computes password hashing with encryption| |diff_privacy|Computes the Laplace distribution for integer values| |fnv_hash|Fowler–Noll–Vo hash function| |mask_default|Masking function to set a default value for each type| |null|Returns a NULL value of the given input type| |phi_age|HIPAA compliant age display| |phi_date|HIPAA compliant date display| |phi_dob|HIPAA compliant date of birth display| |phi_zip3|HIPAA compatible ZIP code display| |random_ccn|Computes a random credit card number| |tokenize|Computes a format-preserving encoding using NIST compatible FPE FF1| > Support data masking built-in functions > --- > > Key: SPARK-40686 > URL: https://issues.apache.org/jira/browse/SPARK-40686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > Support built-in data masking functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
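Of the candidates in the table above, fnv_hash is the easiest to pin down precisely, since the Fowler-Noll-Vo constants are published. A sketch of the 32-bit FNV-1a variant:

```python
# Published 32-bit FNV-1a constants (offset basis and prime).
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: xor each byte into the hash, then multiply by the prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the running hash to 32 bits
    return h
```

Note FNV is a fast non-cryptographic hash, so it suits redaction-with-joinability use cases but not ones where the original value must be unrecoverable; bcrypt or tokenize from the same table would cover those.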
[jira] [Commented] (SPARK-40686) Support data masking built-in functions
[ https://issues.apache.org/jira/browse/SPARK-40686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625803#comment-17625803 ] Daniel commented on SPARK-40686: I closed https://issues.apache.org/jira/browse/SPARK-40623 as a duplicate of this Jira. > Support data masking built-in functions > --- > > Key: SPARK-40686 > URL: https://issues.apache.org/jira/browse/SPARK-40686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Vinod KC >Priority: Minor > > Support built-in data masking functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40623) Add new SQL built-in functions to help with redacting data
[ https://issues.apache.org/jira/browse/SPARK-40623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel resolved SPARK-40623. Resolution: Duplicate closing as a dup of https://issues.apache.org/jira/browse/SPARK-40686 > Add new SQL built-in functions to help with redacting data > -- > > Key: SPARK-40623 > URL: https://issues.apache.org/jira/browse/SPARK-40623 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > > This issue tracks building new scalar SQL functions into Spark for purposes > of redacting sensitive information from fields. These can be useful for > creating copies of tables with sensitive information removed, but retaining > the same schema. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40652) Add MASK_PHONE and TRY_MASK_PHONE functions
[ https://issues.apache.org/jira/browse/SPARK-40652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625800#comment-17625800 ] Daniel commented on SPARK-40652: I am going to close this as there is a duplicate effort underway in https://issues.apache.org/jira/browse/SPARK-40686. > Add MASK_PHONE and TRY_MASK_PHONE functions > --- > > Key: SPARK-40652 > URL: https://issues.apache.org/jira/browse/SPARK-40652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40955) allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter
[ https://issues.apache.org/jira/browse/SPARK-40955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Marcus updated SPARK-40955: -- Description: {+}overall{+}: Allow FileScanBuilder to push `Predicate` instead of `Filter` for data filters being pushed down to source. This would allow new (arbitrary) DS V2 Predicates to be pushed down to the file source. Hello spark developers, Thank you in advance for reading. Please excuse me if I make mistakes; this is my first time working on apache/spark internals. I am asking these questions to better understand whether my proposed changes fall within the intended scope of Data Source V2 API functionality. +Motivation / Background:+ I am working on a branch in [apache/incubator-sedona|https://github.com/apache/incubator-sedona] to extend its support of geoparquet files to include predicate pushdown of postGIS style spatial predicates (e.g. `ST_Contains()`) that can take advantage of spatial info in file metadata. We would like to inherit as much as possible from the Parquet classes (because geoparquet basically just adds a binary geometry column). However, {{FileScanBuilder.scala}} appears to be missing some functionality I need for DSV2 {{{}Predicates{}}}. +My understanding of the problem so far:+ The ST_* {{Expression}} must be detected as a pushable predicate (ParquetScanBuilder.scala:71) and passed as a {{pushedDataFilter}} to the {{parquetPartitionReaderFactory}} where it will be translated into a (user defined) {{{}FilterPredicate{}}}. The [Filter class is sealed|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala] so the sedona package can’t define new Filters; DSV2 Predicate appears to be the preferred method for accomplishing this task (referred to as “V2 Filter”, SPARK-39966). However, `pushedDataFilters` in FileScanBuilder.scala is of type {{{}sources.Filter{}}}. 
Some recent work (SPARK-39139) added the ability to detect user defined functions in {{DataSourceV2Strategy.translateFilterV2() > V2ExpressionBuilder.generateExpression()}}, which I think could accomplish detection correctly if {{FileScanBuilder}} called {{DataSourceV2Strategy.translateFilterV2()}} instead of {{{}DataSourceStrategy.translateFilter(){}}}. However, changing {{FileScanBuilder}} to use {{Predicate}} instead of {{Filter}} would require many changes to all file-based data sources. I don’t want to spend effort making sweeping changes if the current behavior of Spark is intentional. +Concluding Questions:+ Should {{FileScanBuilder}} be pushing {{Predicate}} instead of {{Filter}} for data filters being pushed down to the source? Or maybe in a FileScanBuilderV2? If not, how can a developer of a data source push down a new (or user defined) predicate to the file source? Thank you again for reading. Pending feedback, I will start working on a PR for this functionality. [~beliefer] [~cloud_fan] [~huaxingao] have worked on DSV2-related Spark issues and I welcome your input. Please ignore this if I "@" you incorrectly.
[jira] [Created] (SPARK-40955) allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter
RJ Marcus created SPARK-40955: - Summary: allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter Key: SPARK-40955 URL: https://issues.apache.org/jira/browse/SPARK-40955 Project: Spark Issue Type: Improvement Components: Input/Output, SQL Affects Versions: 3.3.1 Reporter: RJ Marcus {+}overall{+}: Allow FileScanBuilder to push `Predicate` instead of `Filter` for data filters being pushed down to source. This would allow new (arbitrary) DS V2 Predicates to be pushed down to the file source. Hello spark developers, Thank you in advance for reading. Please excuse me if I make mistakes; this is my first time working on apache/spark internals. I am asking these questions to better understand whether my proposed changes fall within the intended scope of Data Source V2 API functionality. + Motivation / Background:+ I am working on a branch in [apache/incubator-sedona|https://github.com/apache/incubator-sedona] to extend its support of geoparquet files to include predicate pushdown of postGIS style spatial predicates (e.g. `ST_Contains()`) that can take advantage of spatial info in file metadata. We would like to inherit as much as possible from the Parquet classes (because geoparquet basically just adds a binary geometry column). However, {{FileScanBuilder.scala}} appears to be missing some functionality I need for DSV2 {{{}Predicates{}}}. + My understanding of the problem so far:+ The ST_* {{Expression}} must be detected as a pushable predicate (ParquetScanBuilder.scala:71) and passed as a {{pushedDataFilter}} to the {{parquetPartitionReaderFactory}} where it will be translated into a (user defined) {{{}FilterPredicate{}}}. The [Filter class is sealed|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala] so the sedona package can’t define new Filters; DSV2 Predicate appears to be the preferred method for accomplishing this task (referred to as “V2 Filter”, SPARK-39966). 
However, `pushedDataFilters` in FileScanBuilder.scala is of type {{{}sources.Filter{}}}. Some recent work (SPARK-39139) added the ability to detect user defined functions in {{DataSourceV2Strategy.translateFilterV2() > V2ExpressionBuilder.generateExpression()}}, which I think could accomplish detection correctly if {{FileScanBuilder}} called {{DataSourceV2Strategy.translateFilterV2()}} instead of {{{}DataSourceStrategy.translateFilter(){}}}. However, changing {{FileScanBuilder}} to use {{Predicate}} instead of {{Filter}} would require many changes to all file-based data sources. I don’t want to spend effort making sweeping changes if the current behavior of Spark is intentional. +Concluding Questions:+ Should {{FileScanBuilder}} be pushing {{Predicate}} instead of {{Filter}} for data filters being pushed down to the source? Or maybe in a FileScanBuilderV2? If not, how can a developer of a data source push down a new (or user defined) predicate to the file source? Thank you again for reading. Pending feedback, I will start working on a PR for this functionality. [~beliefer] [~cloud_fan] [~huaxingao] have worked on DSV2-related Spark issues and I welcome your input. Please ignore this if I "@" you incorrectly.
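The core constraint in SPARK-40955 is that the sealed `sources.Filter` hierarchy cannot be extended by third-party packages, while an open interface like the DSV2 `Predicate` can. A minimal self-contained illustration (the types below are stand-ins, not Spark's or Sedona's actual classes):

```java
public class PredicateSketch {
    // Closed hierarchy, analogous to Spark's sealed sources.Filter: the permits
    // clause means no new subtypes can be declared outside this file, which is
    // why Sedona cannot add a spatial Filter.
    sealed interface Filter permits EqualTo {}
    record EqualTo(String attr, Object value) implements Filter {}

    // Open interface, analogous to the DSV2 Predicate: any module may implement it.
    interface Predicate { String name(); }

    // Hypothetical third-party spatial predicate (name is illustrative).
    record StContains(String geometryColumn) implements Predicate {
        public String name() { return "ST_CONTAINS"; }
    }

    public static void main(String[] args) {
        Predicate p = new StContains("geometry");
        System.out.println(p.name());
    }
}
```

This is why pushing `Predicate` instead of `Filter` in `FileScanBuilder`, as the ticket proposes, would let external data sources define and push down their own predicates.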
[jira] [Updated] (SPARK-40954) Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker
[ https://issues.apache.org/jira/browse/SPARK-40954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Ippolitov updated SPARK-40954: Description: h2. Description I tried running Kubernetes integration tests with the Minikube backend (+ Docker driver) from commit c26d99e3f104f6603e0849d82eca03e28f196551 on Spark's master branch. I ran them with the following command: {code:java} mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \ -Pkubernetes -Pkubernetes-integration-tests \ -Phadoop-3 \ -Dspark.kubernetes.test.imageTag=MY_IMAGE_TAG_HERE \ -Dspark.kubernetes.test.imageRepo=docker.io/kubespark \ -Dspark.kubernetes.test.namespace=spark \ -Dspark.kubernetes.test.serviceAccountName=spark \ -Dspark.kubernetes.test.deployMode=minikube {code} However the test suite got stuck literally for hours on my machine. h2. Investigation I ran {{jstack}} on the process that was running the tests and saw that it was stuck here: {noformat} "ScalaTest-main-running-KubernetesSuite" #1 prio=5 os_prio=31 tid=0x7f78d580b800 nid=0x2503 runnable [0x000304749000] java.lang.Thread.State: RUNNABLE at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:255) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) - locked <0x00076c0b6f40> (a java.lang.UNIXProcess$ProcessPipeInputStream) at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) - locked <0x00076c0bb410> (a java.io.InputStreamReader) at java.io.InputStreamReader.read(InputStreamReader.java:184) at java.io.BufferedReader.fill(BufferedReader.java:161) at java.io.BufferedReader.readLine(BufferedReader.java:324) - locked <0x00076c0bb410> (a java.io.InputStreamReader) at java.io.BufferedReader.readLine(BufferedReader.java:389) at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2(ProcessUtils.scala:45) at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2$adapted(ProcessUtils.scala:45) at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$$$Lambda$322/20156341.apply(Unknown Source) at org.apache.spark.deploy.k8s.integrationtest.Utils$.tryWithResource(Utils.scala:49) at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.executeProcess(ProcessUtils.scala:45) at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.executeMinikube(Minikube.scala:103) at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.minikubeServiceAction(Minikube.scala:112) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$getServiceUrl$1(DepsTestsSuite.scala:281) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$611/1461360262.apply(Unknown Source) at org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:184) at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:196) at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226) at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.getServiceUrl(DepsTestsSuite.scala:278) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.tryDepsTest(DepsTestsSuite.scala:325) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$$init$$1(DepsTestsSuite.scala:160) at 
org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$178/1750286943.apply$mcV$sp(Unknown Source) [...]{noformat} So the issue is coming from {{DepsTestsSuite}} when it is setting up {{{}minio{}}}. After [creating the minio StatefulSet and Service|https://github.com/apache/spark/blob/5ea2b386eb866e20540660cdb6ed43792cb29969/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala#L85], it [executes|https://github.com/apache/spark/blob/5ea2b386eb866e20540660cdb6ed43792cb29969/resource-managers/kubernetes/integration-te
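The stack trace above shows the test thread blocked in `readLine()` while consuming the stdout of `minikube service --url`, which under the Docker driver keeps running and never closes its output stream. A hedged sketch of a timeout-guarded read (the command and timeout are illustrative; `echo` stands in for the long-running command, and this is not the actual `ProcessUtils` fix):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TimedRead {
    // Read the first stdout line of a command, giving up after `seconds`
    // instead of blocking forever the way an unguarded readLine() loop does.
    static String readFirstLineWithTimeout(List<String> cmd, long seconds) throws Exception {
        Process proc = new ProcessBuilder(cmd).start();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> firstLine = pool.submit(() -> {
                try (BufferedReader r = new BufferedReader(
                        new InputStreamReader(proc.getInputStream()))) {
                    return r.readLine();
                }
            });
            // get(timeout) throws TimeoutException if the process never writes,
            // so a hung `minikube service --url` fails fast instead of stalling.
            return firstLine.get(seconds, TimeUnit.SECONDS);
        } finally {
            pool.shutdownNow();
            proc.destroy();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readFirstLineWithTimeout(
                List.of("echo", "http://127.0.0.1:30000"), 10));
    }
}
```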
[jira] [Updated] (SPARK-40954) Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker
[ https://issues.apache.org/jira/browse/SPARK-40954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Ippolitov updated SPARK-40954: Attachment: TestProcess.scala > Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker > --- > > Key: SPARK-40954 > URL: https://issues.apache.org/jira/browse/SPARK-40954 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 3.3.1 > Environment: MacOS 12.6 (Mac M1) > Minikube 1.27.1 > Docker 20.10.17 >Reporter: Anton Ippolitov >Priority: Minor > Attachments: TestProcess.scala > > > h2. Description > I tried running Kubernetes integration tests with the Minikube backend (+ > Docker driver) from commit c26d99e3f104f6603e0849d82eca03e28f196551 on > Spark's master branch. I ran them with the following command: > > {code:java} > mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \ > -Pkubernetes -Pkubernetes-integration-tests \ > -Phadoop-3 \ > -Dspark.kubernetes.test.imageTag=MY_IMAGE_TAG_HERE \ > -Dspark.kubernetes.test.imageRepo=docker.io/kubespark > \ > -Dspark.kubernetes.test.namespace=spark \ > -Dspark.kubernetes.test.serviceAccountName=spark \ > -Dspark.kubernetes.test.deployMode=minikube {code} > However the test suite got stuck literally for hours on my machine. > > h2. 
Investigation > I ran {{jstack}} on the process that was running the tests and saw that it > was stuck here: > > {noformat} > "ScalaTest-main-running-KubernetesSuite" #1 prio=5 os_prio=31 > tid=0x7f78d580b800 nid=0x2503 runnable [0x000304749000] > java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <0x00076c0b6f40> (a > java.lang.UNIXProcess$ProcessPipeInputStream) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > - locked <0x00076c0bb410> (a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > - locked <0x00076c0bb410> (a java.io.InputStreamReader) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at > org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2(ProcessUtils.scala:45) > at > org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2$adapted(ProcessUtils.scala:45) > at > org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$$$Lambda$322/20156341.apply(Unknown > Source) > at > org.apache.spark.deploy.k8s.integrationtest.Utils$.tryWithResource(Utils.scala:49) > at > org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.executeProcess(ProcessUtils.scala:45) > at > 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.executeMinikube(Minikube.scala:103) > at > org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.minikubeServiceAction(Minikube.scala:112) > at > org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$getServiceUrl$1(DepsTestsSuite.scala:281) > at > org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$611/1461360262.apply(Unknown > Source) > at > org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:184) > at > org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:196) > at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457) > at > org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.getServiceUrl(DepsTestsSuite.scala:278) > at > org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.tryDepsTest(DepsT
[jira] [Created] (SPARK-40954) Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker
Anton Ippolitov created SPARK-40954: --- Summary: Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker Key: SPARK-40954 URL: https://issues.apache.org/jira/browse/SPARK-40954 Project: Spark Issue Type: Bug Components: Kubernetes, Tests Affects Versions: 3.3.1 Environment: MacOS 12.6 (Mac M1) Minikube 1.27.1 Docker 20.10.17 Reporter: Anton Ippolitov h2. Description I tried running Kubernetes integration tests with the Minikube backend (+ Docker driver) from commit c26d99e3f104f6603e0849d82eca03e28f196551 on Spark's master branch. I ran them with the following command: {code:java} mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \ -Pkubernetes -Pkubernetes-integration-tests \ -Phadoop-3 \ -Dspark.kubernetes.test.imageTag=MY_IMAGE_TAG_HERE \ -Dspark.kubernetes.test.imageRepo=docker.io/kubespark \ -Dspark.kubernetes.test.namespace=spark \ -Dspark.kubernetes.test.serviceAccountName=spark \ -Dspark.kubernetes.test.deployMode=minikube {code} However the test suite got stuck literally for hours on my machine. h2. 
Investigation I ran {{jstack}} on the process that was running the tests and saw that it was stuck here: {noformat} "ScalaTest-main-running-KubernetesSuite" #1 prio=5 os_prio=31 tid=0x7f78d580b800 nid=0x2503 runnable [0x000304749000] java.lang.Thread.State: RUNNABLE at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:255) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) - locked <0x00076c0b6f40> (a java.lang.UNIXProcess$ProcessPipeInputStream) at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) - locked <0x00076c0bb410> (a java.io.InputStreamReader) at java.io.InputStreamReader.read(InputStreamReader.java:184) at java.io.BufferedReader.fill(BufferedReader.java:161) at java.io.BufferedReader.readLine(BufferedReader.java:324) - locked <0x00076c0bb410> (a java.io.InputStreamReader) at java.io.BufferedReader.readLine(BufferedReader.java:389) at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2(ProcessUtils.scala:45) at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2$adapted(ProcessUtils.scala:45) at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$$$Lambda$322/20156341.apply(Unknown Source) at org.apache.spark.deploy.k8s.integrationtest.Utils$.tryWithResource(Utils.scala:49) at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.executeProcess(ProcessUtils.scala:45) at 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.executeMinikube(Minikube.scala:103) at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.minikubeServiceAction(Minikube.scala:112) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$getServiceUrl$1(DepsTestsSuite.scala:281) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$611/1461360262.apply(Unknown Source) at org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:184) at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:196) at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226) at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.getServiceUrl(DepsTestsSuite.scala:278) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.tryDepsTest(DepsTestsSuite.scala:325) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$$init$$1(DepsTestsSuite.scala:160) at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$178/1750286943.apply$mcV$sp(Unknown Source) [...]{noformat} So the issue is coming from {{DepsTestsSuite}} when it is setting up {{{}minio{}}}. After [creating the minio StatefulSet and Service|https://github.com/apache/spark/blob/5ea2b386eb8
[jira] [Commented] (SPARK-40800) Always inline expressions in OptimizeOneRowRelationSubquery
[ https://issues.apache.org/jira/browse/SPARK-40800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625783#comment-17625783 ] Apache Spark commented on SPARK-40800: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/38429 > Always inline expressions in OptimizeOneRowRelationSubquery > --- > > Key: SPARK-40800 > URL: https://issues.apache.org/jira/browse/SPARK-40800 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.4.0 > > > SPARK-39699 made `CollapseProject` more conservative. This has impacted > correlated subqueries that Spark used to be able to support. For example, a > correlated one-row scalar subquery that has a higher-order function: > {code:java} > CREATE TEMP VIEW t1 AS SELECT ARRAY('a', 'b') a > SELECT ( > SELECT array_sort(a, (i, j) -> rank[i] - rank[j]) AS sorted > FROM (SELECT MAP('a', 1, 'b', 2) rank) > ) FROM t1{code} > This will throw an exception after SPARK-39699: > {code:java} > Unexpected operator Join Inner > :- Aggregate [[a,b]], [[a,b] AS a#252] > : +- OneRowRelation > +- Project [map(keys: [a,b], values: [1,2]) AS rank#241] > +- OneRowRelation > in correlated subquery{code} > because the projects inside the subquery can no longer be collapsed. We > should always inline expressions if possible to avoid adding domain joins and > support a wider range of correlated subqueries.
[jira] [Commented] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625712#comment-17625712 ] Apache Spark commented on SPARK-40950: -- User 'eejbyfeldt' has created a pull request for this issue: https://github.com/apache/spark/pull/38427 > isRemoteAddressMaxedOut performance overhead on scala 2.13 > -- > > Key: SPARK-40950 > URL: https://issues.apache.org/jira/browse/SPARK-40950 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2, 3.3.1 >Reporter: Emil Ejbyfeldt >Priority: Major > > On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, > while in 2.12 it would be an ArrayBuffer. This means that calculating the length > of the blocks, which is done in isRemoteAddressMaxedOut and other places, is now > much more expensive. This is because in 2.13 `Seq` can no longer be > backed by a mutable collection.
[jira] [Assigned] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40950: Assignee: (was: Apache Spark) > isRemoteAddressMaxedOut performance overhead on scala 2.13 > -- > > Key: SPARK-40950 > URL: https://issues.apache.org/jira/browse/SPARK-40950 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2, 3.3.1 >Reporter: Emil Ejbyfeldt >Priority: Major > > On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, > while in 2.12 it would be an ArrayBuffer. This means that calculating the length > of the blocks, which is done in isRemoteAddressMaxedOut and other places, is now > much more expensive. This is because in 2.13 `Seq` can no longer be > backed by a mutable collection.
[jira] [Assigned] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40950: Assignee: Apache Spark > isRemoteAddressMaxedOut performance overhead on scala 2.13 > -- > > Key: SPARK-40950 > URL: https://issues.apache.org/jira/browse/SPARK-40950 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2, 3.3.1 >Reporter: Emil Ejbyfeldt >Assignee: Apache Spark >Priority: Major > > On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, > while in 2.12 it would be an ArrayBuffer. This means that calculating the length > of the blocks, which is done in isRemoteAddressMaxedOut and other places, is now > much more expensive. This is because in 2.13 `Seq` can no longer be > backed by a mutable collection.
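To make the cost in SPARK-40950 concrete: an immutable cons list, like Scala 2.13's `List`, only yields its length through a full traversal, so calling `.length` on every check turns a linear pass into quadratic work. A small illustrative sketch (a Java stand-in for the Scala collections involved, not the actual fix in the linked PR):

```java
public class CachedLength {
    // Minimal cons list whose length, like Scala's immutable List, is only
    // obtainable by an O(n) traversal.
    record Node(int value, Node next) {}

    static int length(Node head) {
        int n = 0;
        for (Node cur = head; cur != null; cur = cur.next) n++;
        return n;
    }

    public static void main(String[] args) {
        Node blocks = new Node(1, new Node(2, new Node(3, null)));
        // Re-computing length(blocks) on every check costs O(n) per call;
        // hoisting it into a cached value pays the traversal only once.
        int cachedSize = length(blocks);
        System.out.println(cachedSize);
    }
}
```

Caching the size once (or using an indexed collection with O(1) `length`) is the general pattern for avoiding this overhead.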
[jira] [Resolved] (SPARK-40932) Barrier: messages for allGather will be overridden by the following barrier APIs
[ https://issues.apache.org/jira/browse/SPARK-40932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40932. - Fix Version/s: 3.3.2 3.4.0 Resolution: Fixed Issue resolved by pull request 38410 [https://github.com/apache/spark/pull/38410] > Barrier: messages for allGather will be overridden by the following barrier > APIs > > > Key: SPARK-40932 > URL: https://issues.apache.org/jira/browse/SPARK-40932 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0, 3.3.1 >Reporter: Bobby Wang >Assignee: Bobby Wang >Priority: Critical > Fix For: 3.3.2, 3.4.0 > > > While working on an internal project that has not been open sourced, I > found a bug: the messages for Barrier.allGather may be overridden by > the following Barrier APIs, which means the user can't get the correct > allGather message. > > This issue can easily be reproduced by the following unit test. > > > {code:java} > test("SPARK-XXX, messages of allGather should not be overridden " + > "by the following barrier APIs") { > sc = new SparkContext(new > SparkConf().setAppName("test").setMaster("local[2]")) > sc.setLogLevel("INFO") > val rdd = sc.makeRDD(1 to 10, 2) > val rdd2 = rdd.barrier().mapPartitions { it => > val context = BarrierTaskContext.get() > // Sleep for a random time before global sync. 
> Thread.sleep(Random.nextInt(1000)) > // Pass partitionId message in > val message: String = context.partitionId().toString > val messages: Array[String] = context.allGather(message) > context.barrier() > Iterator.single(messages.toList) > } > val messages = rdd2.collect() > // All the task partitionIds are shared across all tasks > assert(messages.length === 2) > messages.foreach(m => println("--- " + m)) > assert(messages.forall(_ == List("0", "1"))) > } {code} > > > Before the exception is thrown by assert(messages.forall(_ == List("0", > "1"))), the printed log is > > {code:java} > --- List(, ) > --- List(, ) {code} > > > You can see the messages are empty: they have been overridden by the > context.barrier() API. > > Below is the Spark log: > > _22/10/27 17:03:50.236 Executor task launch worker for task 0.0 in stage 0.0 > (TID 1) INFO Executor: Running task 0.0 in stage 0.0 (TID 1)_ > _22/10/27 17:03:50.236 Executor task launch worker for task 1.0 in stage 0.0 > (TID 0) INFO Executor: Running task 1.0 in stage 0.0 (TID 0)_ > _22/10/27 17:03:50.949 Executor task launch worker for task 0.0 in stage 0.0 > (TID 1) INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered > the global sync, current barrier epoch is 0._ > _22/10/27 17:03:50.964 dispatcher-event-loop-1 INFO BarrierCoordinator: > Current barrier epoch for Stage 0 (Attempt 0) is 0._ > _22/10/27 17:03:50.966 dispatcher-event-loop-1 INFO BarrierCoordinator: > Barrier sync epoch 0 from Stage 0 (Attempt 0) received update from Task 1, > current progress: 1/2._ > _22/10/27 17:03:51.436 Executor task launch worker for task 1.0 in stage 0.0 > (TID 0) INFO BarrierTaskContext: Task 0 from Stage 0(Attempt 0) has entered > the global sync, current barrier epoch is 0._ > _22/10/27 17:03:51.437 dispatcher-event-loop-0 INFO BarrierCoordinator: > Current barrier epoch for Stage 0 (Attempt 0) is 0._ > _22/10/27 17:03:51.437 dispatcher-event-loop-0 INFO BarrierCoordinator: > Barrier sync epoch 0 from Stage 0 
(Attempt 0) received update from Task 0, > current progress: 2/2._ > _22/10/27 17:03:51.440 dispatcher-event-loop-0 INFO BarrierCoordinator: > Barrier sync epoch 0 from Stage 0 (Attempt 0) received all updates from > tasks, finished successfully._ > _22/10/27 17:03:51.958 Executor task launch worker for task 0.0 in stage 0.0 > (TID 1) INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) finished > global sync successfully, waited for 1 seconds, current barrier epoch is 1._ > _22/10/27 17:03:51.959 Executor task launch worker for task 0.0 in stage 0.0 > (TID 1) INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered > the global sync, current barrier epoch is 1._ > _22/10/27 17:03:51.960 dispatcher-event-loop-1 INFO BarrierCoordinator: > Current barrier epoch for Stage 0 (Attempt 0) is 1._ > _22/10/27 17:03:51.960 dispatcher-event-loop-1 INFO BarrierCoordinator: > Barrier sync epoch 1 from Stage 0 (Attempt 0) received update from Task 1, > current progress: 1/2._ > _22/10/27 17:03:52.437 Executor task launch worker for task 1.0 in stage 0.0 > (TID 0) INFO BarrierTaskContext: Task 0 from Stage 0(Attempt 0) finished > global sync successfully, waited for 1 seconds, current barrier epoch is 1._ > _22/10/27 17:03:52.438 Executor task
[jira] [Assigned] (SPARK-40932) Barrier: messages for allGather will be overridden by the following barrier APIs
[ https://issues.apache.org/jira/browse/SPARK-40932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40932: --- Assignee: Bobby Wang
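The report boils down to allGather's result being clobbered by the next barrier sync. The toy simulation below (invented names, not Spark's actual implementation; the real fix is in pull request 38410) reproduces the symptom with two threads sharing one message buffer, and shows how handing each task a private copy of the gathered messages avoids it:

```python
import threading

class ToyBarrierContext:
    """Toy stand-in for BarrierTaskContext; all tasks share one message buffer,
    which is the failure mode the bug report describes."""
    def __init__(self, n_tasks):
        self.sync = threading.Barrier(n_tasks)
        self.shared = [None] * n_tasks          # one message slot per task

    def all_gather(self, task_id, message, copy_result):
        self.shared[task_id] = message
        self.sync.wait()                        # every task has contributed
        # Buggy behaviour: return the shared buffer itself.
        # Fixed behaviour: return a private copy.
        result = list(self.shared) if copy_result else self.shared
        self.sync.wait()                        # every task has taken its result
        return result

    def barrier(self, task_id):
        self.shared[task_id] = ""               # the next sync reuses the buffer
        self.sync.wait()

def run(copy_result):
    ctx = ToyBarrierContext(2)
    out = [None, None]
    def task(tid):
        msgs = ctx.all_gather(tid, str(tid), copy_result)
        ctx.barrier(tid)                        # a later barrier() follows allGather
        out[tid] = list(msgs)
    threads = [threading.Thread(target=task, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

With `copy_result=False` both tasks observe emptied messages (the `--- List(, )` output above); with `copy_result=True` each task keeps `["0", "1"]`.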
[jira] [Assigned] (SPARK-40912) Overhead of Exceptions in DeserializationStream
[ https://issues.apache.org/jira/browse/SPARK-40912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40912: Assignee: Apache Spark > Overhead of Exceptions in DeserializationStream > > > Key: SPARK-40912 > URL: https://issues.apache.org/jira/browse/SPARK-40912 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Emil Ejbyfeldt >Assignee: Apache Spark >Priority: Minor > > The interface of DeserializationStream forces implementations to raise > EOFException to indicate that there is no more data. For the > KryoDeserializationStream it is even worse: since the kryo library does not > raise EOFException, we pay the price of two exceptions for each stream. For > large shuffles with lots of small streams this is quite a large overhead > (a couple % of CPU time has been observed). It is also less safe to depend on > exceptions, as they might be raised for different reasons, such as corrupt > data, and that currently causes data loss. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40912) Overhead of Exceptions in DeserializationStream
[ https://issues.apache.org/jira/browse/SPARK-40912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40912: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-40912) Overhead of Exceptions in DeserializationStream
[ https://issues.apache.org/jira/browse/SPARK-40912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625695#comment-17625695 ] Apache Spark commented on SPARK-40912: -- User 'eejbyfeldt' has created a pull request for this issue: https://github.com/apache/spark/pull/38428
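The interface concern above can be illustrated with a small hypothetical Python sketch (Spark's actual interfaces are Scala; ExceptionStream and SentinelStream are invented here). One interface forces every consumer to catch an end-of-file exception as control flow; the alternative returns an ordinary sentinel value, so an exhausted stream costs no exception at the interface boundary and end-of-data cannot be confused with a corruption error:

```python
import io
import pickle

def make_buffer(records):
    """Frame records back-to-back in one buffer, as in a shuffle block."""
    buf = io.BytesIO()
    for r in records:
        pickle.dump(r, buf)
    buf.seek(0)
    return buf

class ExceptionStream:
    """EOF signalled by an exception, like DeserializationStream.readObject():
    every consumer must use try/except as control flow."""
    def __init__(self, buf):
        self.buf = buf
    def read_object(self):
        return pickle.load(self.buf)  # raises EOFError when the buffer is drained

class SentinelStream:
    """Alternative interface: end-of-stream is an ordinary return value."""
    EOF = object()
    def __init__(self, buf):
        self.buf = buf
    def next_object(self):
        try:
            return pickle.load(self.buf)
        except EOFError:
            return SentinelStream.EOF

def drain_exception_style(buf):
    out, stream = [], ExceptionStream(buf)
    while True:
        try:
            out.append(stream.read_object())
        except EOFError:          # one exception per stream, even on the happy path
            return out

def drain_sentinel_style(buf):
    out, stream = [], SentinelStream(buf)
    while (obj := stream.next_object()) is not SentinelStream.EOF:
        out.append(obj)
    return out
```

Both drains yield the same records; the difference is that the sentinel design confines exception handling to one place and leaves genuine exceptions (such as corrupt data) to mean only errors.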
[jira] [Updated] (SPARK-40952) Exception when handling timestamp data in PySpark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-40952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai-Michael Roesner updated SPARK-40952: Description: I'm trying to process data that contains timestamps in PySpark "Structured Streaming" using the [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} exception in {{\pyspark\sql\types.py}} at the {noformat} return datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100) {noformat} statement. I have boiled down my Spark job to the essentials: {noformat} from pyspark.sql import SparkSession def handle_row(row): print(f'Processing: \{row}') spark = (SparkSession.builder .appName('test.stream.tstmp.byrow') .getOrCreate()) data = (spark.readStream .option('delimiter', ',') .option('header', True) .schema('a integer, b string, c timestamp') .csv('data/test')) query = (data.writeStream .foreach(handle_row) .start()) query.awaitTermination() {noformat} In the {{data/test}} folder I have one csv file: {noformat} a,b,c 1,x,1970-01-01 00:59:59.999 2,y,1999-12-31 23:59:59.999 3,z,2022-10-18 15:53:12.345 {noformat} If I change the csv schema to {{'a integer, b string, c string'}} everything works fine and I get the expected output of {noformat} Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999') Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999') Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345') {noformat} Also, if I change the stream handling to [micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch] like so: {noformat} ... def handle_batch(df, epoch_id): print(f'Processing: \{df} - Epoch: \{epoch_id}') ... 
query = (data.writeStream .foreachBatch(handle_batch) .start()) {noformat} I get the expected output of {noformat} Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0 {noformat} But "by row" handling should work with the row having the correct column data type of {{timestamp}}. This issue also affects using the {{foreach}} sink in Structured Streaming with the Kafka data souce since Kafka events contain a timestamp was: I'm trying to process data that contains timestamps in PySpark "Structured Streaming" using the [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} exception in {{\pyspark\sql\types.py}} at the {noformat} return datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100) {noformat} statement. I have boiled down my Spark job to the essentials: {noformat} from pyspark.sql import SparkSession def handle_row(row): print(f'Processing: \{row}') spark = (SparkSession.builder .appName('test.stream.tstmp.byrow') .getOrCreate()) data = (spark.readStream .option('delimiter', ',') .option('header', True) .schema('a integer, b string, c timestamp') .csv('data/test')) query = (data.writeStream .foreach(handle_row) .start()) query.awaitTermination() {noformat} In the {{data/test}} folder I have one csv file: {noformat} a,b,c 1,x,1970-01-01 00:59:59.999 2,y,1999-12-31 23:59:59.999 3,z,2022-10-18 15:53:12.345 {noformat} If I change the csv schema to {{'a integer, b string, c string'}} everything works fine and I get the expected output of {noformat} Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999') Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999') Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345') {noformat} Also, if I change the stream handling to [micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch] like so: {noformat} ... 
def handle_batch(df, epoch_id): print(f'Processing: \{df} - Epoch: \{epoch_id}') ... query = (data.writeStream .foreachBatch(handle_batch) .start()) {noformat} I get the expected output of {noformat} Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0 {noformat} But "by row" handling should work with the row having the correct column data type of {{{}timestamp{}}}. > Exception when handling timestamp data in PySpark Structured Streaming > -- > > Key: SPARK-40952 > URL: https://issues.apache.org/jira/browse/SPARK-40952 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming, Windows >Affects Versions: 3.3.0 > Environment: OS: Windows 10 >Reporter: Kai-Michael Roesner >Priority: Minor > > I'm trying to process data that contains timestamps in PySpark "Structured > Streaming" using the > [foreach|https://spark.apache.org/docs/latest/structured-s
[jira] [Commented] (SPARK-40749) Migrate type check failures of generators onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625674#comment-17625674 ] BingKun Pan commented on SPARK-40749: - I work on it. > Migrate type check failures of generators onto error classes > > > Key: SPARK-40749 > URL: https://issues.apache.org/jira/browse/SPARK-40749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the generator > expressions: > 1. Stack (3): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L163-L170 > 2. ExplodeBase (1): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L299 > 3. Inline (1): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L441 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40953) Add missing `limit(n)` in DataFrame.head
[ https://issues.apache.org/jira/browse/SPARK-40953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40953: Assignee: (was: Apache Spark) > Add missing `limit(n)` in DataFrame.head > > > Key: SPARK-40953 > URL: https://issues.apache.org/jira/browse/SPARK-40953 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40953) Add missing `limit(n)` in DataFrame.head
[ https://issues.apache.org/jira/browse/SPARK-40953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625662#comment-17625662 ] Apache Spark commented on SPARK-40953: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38424
[jira] [Commented] (SPARK-40953) Add missing `limit(n)` in DataFrame.head
[ https://issues.apache.org/jira/browse/SPARK-40953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625664#comment-17625664 ] Apache Spark commented on SPARK-40953: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38424
[jira] [Assigned] (SPARK-40953) Add missing `limit(n)` in DataFrame.head
[ https://issues.apache.org/jira/browse/SPARK-40953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40953: Assignee: Apache Spark
[jira] [Created] (SPARK-40953) Add missing `limit(n)` in DataFrame.head
Ruifeng Zheng created SPARK-40953: - Summary: Add missing `limit(n)` in DataFrame.head Key: SPARK-40953 URL: https://issues.apache.org/jira/browse/SPARK-40953 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40889) Check error classes in PlanResolutionSuite
[ https://issues.apache.org/jira/browse/SPARK-40889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40889. -- Resolution: Fixed Issue resolved by pull request 38421 [https://github.com/apache/spark/pull/38421] > Check error classes in PlanResolutionSuite > -- > > Key: SPARK-40889 > URL: https://issues.apache.org/jira/browse/SPARK-40889 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: BingKun Pan >Priority: Major > Labels: starter > Fix For: 3.4.0 > > > Check error classes in PlanResolutionSuite by using checkError() instead of > assertUnsupported. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40889) Check error classes in PlanResolutionSuite
[ https://issues.apache.org/jira/browse/SPARK-40889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40889: Assignee: BingKun Pan
[jira] [Resolved] (SPARK-40936) Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`
[ https://issues.apache.org/jira/browse/SPARK-40936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40936. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38413 [https://github.com/apache/spark/pull/38413] > Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the > `SparkFunSuite#checkError` > -- > > Key: SPARK-40936 > URL: https://issues.apache.org/jira/browse/SPARK-40936 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40936) Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`
[ https://issues.apache.org/jira/browse/SPARK-40936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40936: Assignee: Yang Jie
[jira] [Updated] (SPARK-40952) Exception when handling timestamp data in PySpark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-40952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai-Michael Roesner updated SPARK-40952: Description: I'm trying to process data that contains timestamps in PySpark "Structured Streaming" using the [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} exception in {{\pyspark\sql\types.py}} at the {noformat} return datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100) {noformat} statement. I have boiled down my Spark job to the essentials: {noformat} from pyspark.sql import SparkSession def handle_row(row): print(f'Processing: \{row}') spark = (SparkSession.builder .appName('test.stream.tstmp.byrow') .getOrCreate()) data = (spark.readStream .option('delimiter', ',') .option('header', True) .schema('a integer, b string, c timestamp') .csv('data/test')) query = (data.writeStream .foreach(handle_row) .start()) query.awaitTermination() {noformat} In the {{data/test}} folder I have one csv file: {noformat} a,b,c 1,x,1970-01-01 00:59:59.999 2,y,1999-12-31 23:59:59.999 3,z,2022-10-18 15:53:12.345 {noformat} If I change the csv schema to {{'a integer, b string, c string'}} everything works fine and I get the expected output of {noformat} Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999') Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999') Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345') {noformat} Also, if I change the stream handling to [micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch] like so: {noformat} ... def handle_batch(df, epoch_id): print(f'Processing: \{df} - Epoch: \{epoch_id}') ... 
query = (data.writeStream .foreachBatch(handle_batch) .start()) {noformat} I get the expected output of {noformat} Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0 {noformat} But "by row" handling should work with the row having the correct column data type of {{{}timestamp{}}}. was: I'm trying to process data that contains timestamps in PySpark "Structured Streaming" using the [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} exception in {{\pyspark\sql\types.py}} at the {noformat} return datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100) {noformat} statement. I have boiled down my Spark job to the essentials: {noformat} from pyspark.sql import SparkSession def handle_row(row): print(f'Processing: \{row}') spark = (SparkSession.builder .appName('test.stream.tstmp.byrow') .getOrCreate()) data = (spark.readStream .option('delimiter', ',') .option('header', True) .schema('a integer, b string, c timestamp') .csv('data/test')) query = (data.writeStream .foreach(handle_row) .start()) query.awaitTermination() {noformat} In the `data/test` folder I have one csv file: {noformat} a,b,c 1,x,1970-01-01 00:59:59.999 2,y,1999-12-31 23:59:59.999 3,z,2022-10-18 15:53:12.345 {noformat} If I change the csv schema to `'a integer, b string, c string'` everything works fine and I get the expected output of {noformat} Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999') Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999') Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345') {noformat} Also, if I change the stream handling to [micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch] like so: {noformat} ... def handle_batch(df, epoch_id): print(f'Processing: \{df} - Epoch: \{epoch_id}') ... 
query = (data.writeStream .foreachBatch(handle_batch) .start()) {noformat} I get the expected output of {noformat} Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0 {noformat} But "by row" handling should work with the row having the correct column data type of `timestamp`. > Exception when handling timestamp data in PySpark Structured Streaming > -- > > Key: SPARK-40952 > URL: https://issues.apache.org/jira/browse/SPARK-40952 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming, Windows >Affects Versions: 3.3.0 > Environment: OS: Windows 10 >Reporter: Kai-Michael Roesner >Priority: Minor > > I'm trying to process data that contains timestamps in PySpark "Structured > Streaming" using the > [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] > option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} > exception in {{\pyspark\sq
[jira] [Updated] (SPARK-40952) Exception when handling timestamp data in PySpark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-40952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai-Michael Roesner updated SPARK-40952: Description: I'm trying to process data that contains timestamps in PySpark "Structured Streaming" using the [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} exception in {{\pyspark\sql\types.py}} at the {noformat} return datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100) {noformat} statement. I have boiled down my Spark job to the essentials: {noformat} from pyspark.sql import SparkSession def handle_row(row): print(f'Processing: \{row}') spark = (SparkSession.builder .appName('test.stream.tstmp.byrow') .getOrCreate()) data = (spark.readStream .option('delimiter', ',') .option('header', True) .schema('a integer, b string, c timestamp') .csv('data/test')) query = (data.writeStream .foreach(handle_row) .start()) query.awaitTermination() {noformat} In the `data/test` folder I have one csv file: {noformat} a,b,c 1,x,1970-01-01 00:59:59.999 2,y,1999-12-31 23:59:59.999 3,z,2022-10-18 15:53:12.345 {noformat} If I change the csv schema to `'a integer, b string, c string'` everything works fine and I get the expected output of {noformat} Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999') Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999') Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345') {noformat} Also, if I change the stream handling to [micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch] like so: {noformat} ... def handle_batch(df, epoch_id): print(f'Processing: \{df} - Epoch: \{epoch_id}') ... 
query = (data.writeStream .foreachBatch(handle_batch) .start()) {noformat} I get the expected output of {noformat} Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0 {noformat} But "by row" handling should work with the row having the correct column data type of `timestamp`. was: I'm trying to process data that contains timestamps in PySpark "Structured Streaming" using the [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] option. When I run the job I get a `OSError: [Errno 22] Invalid argument` exception in \pyspark\sql\types.py at the {noformat} return datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 100) {noformat} statement. I have boiled down my Spark job to the essentials: {noformat} from pyspark.sql import SparkSession def handle_row(row): print(f'Processing: \{row}') spark = (SparkSession.builder .appName('test.stream.tstmp.byrow') .getOrCreate()) data = (spark.readStream .option('delimiter', ',') .option('header', True) .schema('a integer, b string, c timestamp') .csv('data/test')) query = (data.writeStream .foreach(handle_row) .start()) query.awaitTermination() {noformat} In the `data/test` folder I have one csv file: {noformat} a,b,c 1,x,1970-01-01 00:59:59.999 2,y,1999-12-31 23:59:59.999 3,z,2022-10-18 15:53:12.345 {noformat} If I change the csv schema to `'a integer, b string, c string'` everything works fine and I get the expected output of {noformat} Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999') Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999') Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345') {noformat} Also, if I change the stream handling to [micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch] like so: {noformat} ... def handle_batch(df, epoch_id): print(f'Processing: \{df} - Epoch: \{epoch_id}') ... 
query = (data.writeStream .foreachBatch(handle_batch) .start()) {noformat} I get the expected output of {noformat} Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0 {noformat} But "by row" handling should work with the row having the correct column data type of `timestamp`. > Exception when handling timestamp data in PySpark Structured Streaming > -- > > Key: SPARK-40952 > URL: https://issues.apache.org/jira/browse/SPARK-40952 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming, Windows >Affects Versions: 3.3.0 > Environment: OS: Windows 10 >Reporter: Kai-Michael Roesner >Priority: Minor > > I'm trying to process data that contains timestamps in PySpark "Structured > Streaming" using the > [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] > option. When I run the job I get a {{OSError: [Errno 22] Invalid argument}} > exception in {{\pyspark\sql\types.py}}
[jira] [Created] (SPARK-40952) Exception when handling timestamp data in PySpark Structured Streaming
Kai-Michael Roesner created SPARK-40952:
-------------------------------------------

             Summary: Exception when handling timestamp data in PySpark Structured Streaming
                 Key: SPARK-40952
                 URL: https://issues.apache.org/jira/browse/SPARK-40952
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Structured Streaming, Windows
    Affects Versions: 3.3.0
         Environment: OS: Windows 10
            Reporter: Kai-Michael Roesner


I'm trying to process data that contains timestamps in PySpark "Structured Streaming" using the [foreach|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreach] option. When I run the job I get an `OSError: [Errno 22] Invalid argument` exception in \pyspark\sql\types.py at the
{noformat}
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
{noformat}
statement. I have boiled down my Spark job to the essentials:
{noformat}
from pyspark.sql import SparkSession

def handle_row(row):
    print(f'Processing: {row}')

spark = (SparkSession.builder
    .appName('test.stream.tstmp.byrow')
    .getOrCreate())

data = (spark.readStream
    .option('delimiter', ',')
    .option('header', True)
    .schema('a integer, b string, c timestamp')
    .csv('data/test'))

query = (data.writeStream
    .foreach(handle_row)
    .start())

query.awaitTermination()
{noformat}
In the `data/test` folder I have one csv file:
{noformat}
a,b,c
1,x,1970-01-01 00:59:59.999
2,y,1999-12-31 23:59:59.999
3,z,2022-10-18 15:53:12.345
{noformat}
If I change the csv schema to `'a integer, b string, c string'` everything works fine and I get the expected output of
{noformat}
Processing: Row(a=1, b='x', c='1970-01-01 00:59:59.999')
Processing: Row(a=2, b='y', c='1999-12-31 23:59:59.999')
Processing: Row(a=3, b='z', c='2022-10-18 15:53:12.345')
{noformat}
Also, if I change the stream handling to [micro-batches|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch] like so:
{noformat}
...
def handle_batch(df, epoch_id):
    print(f'Processing: {df} - Epoch: {epoch_id}')
...

query = (data.writeStream
    .foreachBatch(handle_batch)
    .start())
{noformat}
I get the expected output of
{noformat}
Processing: DataFrame[a: int, b: string, c: timestamp] - Epoch: 0
{noformat}
But "by row" handling should work with the row having the correct column data type of `timestamp`.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
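The failing statement in {{pyspark\sql\types.py}} converts epoch microseconds via `datetime.datetime.fromtimestamp`, which on Windows raises `OSError: [Errno 22] Invalid argument` for timestamps before the epoch; the first CSV row (1970-01-01 00:59:59.999, parsed in a local zone east of UTC) would produce exactly such a negative value. That reading of the report is an assumption; the sketch below shows a conversion that avoids `fromtimestamp` entirely (`from_micros` is a hypothetical helper, not Spark API):

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def from_micros(ts: int) -> datetime:
    # Naive UTC datetime from epoch microseconds. Unlike
    # datetime.fromtimestamp, this also handles negative (pre-epoch)
    # values, which raise OSError: [Errno 22] on Windows.
    return EPOCH + timedelta(microseconds=ts)

print(from_micros(946_684_799_999_000))  # 1999-12-31 23:59:59.999000
print(from_micros(-1_000))               # 1969-12-31 23:59:59.999000
```

Note the result is timezone-naive UTC; `fromtimestamp` without a `tz` argument returns local time, so the two are only interchangeable after fixing a timezone convention.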
[jira] [Assigned] (SPARK-40951) pyspark-connect tests should be skipped if pandas doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-40951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40951:
------------------------------------

    Assignee: Apache Spark

> pyspark-connect tests should be skipped if pandas doesn't exist
> ---------------------------------------------------------------
>
>                 Key: SPARK-40951
>                 URL: https://issues.apache.org/jira/browse/SPARK-40951
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, Tests
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Minor
>
[jira] [Commented] (SPARK-40951) pyspark-connect tests should be skipped if pandas doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-40951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625566#comment-17625566 ]

Apache Spark commented on SPARK-40951:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38426

> pyspark-connect tests should be skipped if pandas doesn't exist
> ---------------------------------------------------------------
>
>                 Key: SPARK-40951
>                 URL: https://issues.apache.org/jira/browse/SPARK-40951
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, Tests
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Assigned] (SPARK-40951) pyspark-connect tests should be skipped if pandas doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-40951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40951:
------------------------------------

    Assignee: (was: Apache Spark)

> pyspark-connect tests should be skipped if pandas doesn't exist
> ---------------------------------------------------------------
>
>                 Key: SPARK-40951
>                 URL: https://issues.apache.org/jira/browse/SPARK-40951
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, Tests
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Minor
>
[jira] [Created] (SPARK-40951) pyspark-connect tests should be skipped if pandas doesn't exist
Dongjoon Hyun created SPARK-40951:
-------------------------------------

             Summary: pyspark-connect tests should be skipped if pandas doesn't exist
                 Key: SPARK-40951
                 URL: https://issues.apache.org/jira/browse/SPARK-40951
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark, Tests
    Affects Versions: 3.4.0
            Reporter: Dongjoon Hyun
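The guard this ticket asks for is commonly written as an import probe plus `unittest.skipIf`, which is also the style PySpark's test suites use for optional dependencies. A generic sketch of that pattern (the class name and skip message are illustrative, not the actual Spark change):

```python
import unittest

# Probe for the optional dependency once at import time.
try:
    import pandas  # noqa: F401
    have_pandas = True
except ImportError:
    have_pandas = False

@unittest.skipIf(not have_pandas, "pandas is not installed; skipping connect tests")
class ConnectDataFrameTests(unittest.TestCase):
    # Placeholder test body; real tests would exercise pandas-dependent code.
    def test_placeholder(self):
        self.assertTrue(have_pandas)
```

With the decorator on the class, every test in it is reported as skipped (rather than errored) when pandas is absent, which keeps CI green on minimal environments.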
[jira] [Commented] (SPARK-40229) Re-enable excel I/O test for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-40229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625550#comment-17625550 ]

Apache Spark commented on SPARK-40229:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38425

> Re-enable excel I/O test for pandas API on Spark.
> -------------------------------------------------
>
>                 Key: SPARK-40229
>                 URL: https://issues.apache.org/jira/browse/SPARK-40229
>             Project: Spark
>          Issue Type: Test
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0
>            Reporter: Haejoon Lee
>            Assignee: Haejoon Lee
>            Priority: Major
>             Fix For: 3.4.0
>
> Currently we're skipping the `read_excel` and `to_excel` tests for the pandas API on Spark, since `openpyxl` is not installed in our test environments. We should enable these tests for better test coverage.
[jira] [Updated] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emil Ejbyfeldt updated SPARK-40950:
-----------------------------------
    Summary: isRemoteAddressMaxedOut performance overhead on scala 2.13  (was: On scala 2.13 isRemoteAddressMaxedOut performance overhead)

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> ----------------------------------------------------------
>
>                 Key: SPARK-40950
>                 URL: https://issues.apache.org/jira/browse/SPARK-40950
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.2, 3.3.1
>            Reporter: Emil Ejbyfeldt
>            Priority: Major
>
> On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, while in 2.12 they would be backed by an `ArrayBuffer`. This means that calculating the length of the blocks, which is done in isRemoteAddressMaxedOut and other places, is now much more expensive. This is because in 2.13 a `Seq` can no longer be backed by a mutable collection.
[jira] [Created] (SPARK-40950) On scala 2.13 isRemoteAddressMaxedOut performance overhead
Emil Ejbyfeldt created SPARK-40950:
--------------------------------------

             Summary: On scala 2.13 isRemoteAddressMaxedOut performance overhead
                 Key: SPARK-40950
                 URL: https://issues.apache.org/jira/browse/SPARK-40950
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.3.1, 3.2.2
            Reporter: Emil Ejbyfeldt


On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, while in 2.12 they would be backed by an `ArrayBuffer`. This means that calculating the length of the blocks, which is done in isRemoteAddressMaxedOut and other places, is now much more expensive. This is because in 2.13 a `Seq` can no longer be backed by a mutable collection.
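The cost the ticket describes comes from `Seq.length` on a Scala 2.13 `List` being an O(n) walk of a linked structure, so recomputing it on every isRemoteAddressMaxedOut call adds up. Since the code samples in this digest are Python, here is an illustration of the same pattern in Python (a sketch, not the Spark code; names are invented):

```python
from typing import Optional

class Node:
    """One cell of a singly linked list, like Scala's immutable List."""
    def __init__(self, value: int, next: "Optional[Node]" = None):
        self.value = value
        self.next = next

def length(node: Optional[Node]) -> int:
    # O(n) walk, analogous to Seq.length on a List in Scala 2.13.
    n = 0
    while node is not None:
        n += 1
        node = node.next
    return n

# Build 1 -> 2 -> 3
blocks = Node(1, Node(2, Node(3)))

# Naive hot path: an O(n) walk on every call.
total = sum(length(blocks) for _ in range(3))

# Fix pattern: compute the size once and reuse the cached value.
cached = length(blocks)
```

Caching the size (or materializing the blocks into an indexed collection once) restores the O(1) length lookups that 2.12's `ArrayBuffer`-backed `Seq` gave for free.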
[jira] [Assigned] (SPARK-40949) Implement `DataFrame.sortWithinPartitions`
[ https://issues.apache.org/jira/browse/SPARK-40949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40949:
------------------------------------

    Assignee: (was: Apache Spark)

> Implement `DataFrame.sortWithinPartitions`
> ------------------------------------------
>
>                 Key: SPARK-40949
>                 URL: https://issues.apache.org/jira/browse/SPARK-40949
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Commented] (SPARK-40949) Implement `DataFrame.sortWithinPartitions`
[ https://issues.apache.org/jira/browse/SPARK-40949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625526#comment-17625526 ]

Apache Spark commented on SPARK-40949:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38423

> Implement `DataFrame.sortWithinPartitions`
> ------------------------------------------
>
>                 Key: SPARK-40949
>                 URL: https://issues.apache.org/jira/browse/SPARK-40949
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
[jira] [Assigned] (SPARK-40949) Implement `DataFrame.sortWithinPartitions`
[ https://issues.apache.org/jira/browse/SPARK-40949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40949:
------------------------------------

    Assignee: Apache Spark

> Implement `DataFrame.sortWithinPartitions`
> ------------------------------------------
>
>                 Key: SPARK-40949
>                 URL: https://issues.apache.org/jira/browse/SPARK-40949
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Assignee: Apache Spark
>            Priority: Major
>
[jira] [Commented] (SPARK-40949) Implement `DataFrame.sortWithinPartitions`
[ https://issues.apache.org/jira/browse/SPARK-40949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625524#comment-17625524 ]

Apache Spark commented on SPARK-40949:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38423

> Implement `DataFrame.sortWithinPartitions`
> ------------------------------------------
>
>                 Key: SPARK-40949
>                 URL: https://issues.apache.org/jira/browse/SPARK-40949
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>