[jira] [Comment Edited] (SPARK-34107) Spark History not loading when service has to load 300k applications initially from S3

2021-01-14 Thread Shashank Pedamallu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264672#comment-17264672
 ] 

Shashank Pedamallu edited comment on SPARK-34107 at 1/14/21, 7:59 AM:
--

Screenshot of the spark history:

!blank_shs.png|height=120,width=180!

Also, please find attached the dynamic tracing analysis (using 
[btrace|[http://example.com|https://github.com/btraceio/btrace]][^SHS_Profiling_Sorted.csv]


was (Author: spedamallu):
Screenshot of the spark history:

!blank_shs.png!

 

 

Also, please find attached the dynamic tracing analysis (using 
[btrace|[http://example.com|https://github.com/btraceio/btrace]][^SHS_Profiling_Sorted.csv]

> Spark History not loading when service has to load 300k applications 
> initially from S3
> --
>
> Key: SPARK-34107
> URL: https://issues.apache.org/jira/browse/SPARK-34107
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shashank Pedamallu
>Priority: Major
> Attachments: SHS_Profiling_Sorted.csv, blank_shs.png
>
>
> The Spark History Server has trouble loading when it initially has to load 
> 300k+ applications from S3. Details and snapshots follow:
> Number of files in `spark.history.fs.logDirectory`: (Using xxx for anonymity)
> {noformat}
> spedamallu@spedamallu-mbp143 ~/src/spark (spark-bug) $
> | => aws s3 ls s3://-company/spark-history-fs-logDirectory/ | wc -l
>   305571
> spedamallu@spedamallu-mbp143 ~/src/spark (spark-bug) $
> {noformat}
> Logs when starting SparkHistory:
> {noformat}
> root@shs-with-statsd-86d7f54679-t8fqr:/go/src/github.com/-company/spark-private#
>  
> /go/src/github.com/-company/spark-private/bootstrap/start-history-server.sh
>  --properties-file /etc/spark-history-config/shs-default.properties
>  2021/01/14 02:40:28 Spark spark wrapper is disabled
>  2021/01/14 02:40:28 Attempt number 0, Max attempts 0, Left Attempts 0
>  2021/01/14 02:40:28 Statsd disabled
>  2021/01/14 02:40:28 Debug log: /tmp/.log
>  2021/01/14 02:40:28 Job submitted 0 seconds ago, Operator 0, ETL 0, Flyte 0 
> Mozart 0
>  2021/01/14 02:40:28 Running command /opt/spark/bin/spark-class.orig with 
> arguments [org.apache.spark.deploy.history.HistoryServer --properties-file 
> /etc/spark-history-config/shs-default.properties]
>  21/01/14 02:40:29 INFO HistoryServer: Started daemon with process name: 
> 2077@shs-with-statsd-86d7f54679-t8fqr
>  21/01/14 02:40:29 INFO SignalUtils: Registered signal handler for TERM
>  21/01/14 02:40:29 INFO SignalUtils: Registered signal handler for HUP
>  21/01/14 02:40:29 INFO SignalUtils: Registered signal handler for INT
>  21/01/14 02:40:30 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
>  21/01/14 02:40:30 INFO SecurityManager: Changing view acls to: root
>  21/01/14 02:40:30 INFO SecurityManager: Changing modify acls to: root
>  21/01/14 02:40:30 INFO SecurityManager: Changing view acls groups to:
>  21/01/14 02:40:30 INFO SecurityManager: Changing modify acls groups to:
>  21/01/14 02:40:30 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); groups 
> with view permissions: Set(); users with modify permissions: Set(root); 
> groups with modify permissions: Set()
>  21/01/14 02:40:30 INFO FsHistoryProvider: History server ui acls disabled; 
> users with admin permissions: ; groups with admin permissions
>  21/01/14 02:40:30 WARN MetricsConfig: Cannot locate configuration: tried 
> hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
>  21/01/14 02:40:30 INFO MetricsSystemImpl: Scheduled Metric snapshot period 
> at 10 second(s).
>  21/01/14 02:40:30 INFO MetricsSystemImpl: s3a-file-system metrics system 
> started
>  21/01/14 02:40:31 INFO log: Logging initialized @1933ms to 
> org.sparkproject.jetty.util.log.Slf4jLog
>  21/01/14 02:40:31 INFO Server: jetty-9.4.z-SNAPSHOT; built: 
> 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 
> 1.8.0_242-b08
>  21/01/14 02:40:31 INFO Server: Started @1999ms
>  21/01/14 02:40:31 INFO AbstractConnector: Started ServerConnector@51751e5f
> {HTTP/1.1,[http/1.1]} {0.0.0.0:18080}
> 21/01/14 02:40:31 INFO Utils: Successfully started service on port 18080.
>  21/01/14 02:40:31 INFO ContextHandler: Started 
> o.s.j.s.ServletContextHandler@b9dfc5a
> {/,null,AVAILABLE,@Spark}
> 21/01/14 02:40:31 INFO ContextHandler: Started 
> o.s.j.s.ServletContextHandler@1bbae752
> {/json,null,AVAILABLE,@Spark}
> 21/01/14 02:40:31 INFO ContextHandler: Started 
> o.s.j.s.ServletContextHandler@5cf87cfd
> {/api,null,AVAILABLE,@Spark}

[jira] [Comment Edited] (SPARK-34107) Spark History not loading when service has to load 300k applications initially from S3

2021-01-14 Thread Shashank Pedamallu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264672#comment-17264672
 ] 

Shashank Pedamallu edited comment on SPARK-34107 at 1/14/21, 7:59 AM:
--

Screenshot of the spark history:

!blank_shs.png|height=120,width=240!

Also, please find attached the dynamic tracing analysis (using 
[btrace|[http://example.com|https://github.com/btraceio/btrace]][^SHS_Profiling_Sorted.csv]


was (Author: spedamallu):
Screenshot of the spark history:

!blank_shs.png|height=120,width=180!

Also, please find attached the dynamic tracing analysis (using 
[btrace|[http://example.com|https://github.com/btraceio/btrace]][^SHS_Profiling_Sorted.csv]


[jira] [Comment Edited] (SPARK-34107) Spark History not loading when service has to load 300k applications initially from S3

2021-01-14 Thread Shashank Pedamallu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264672#comment-17264672
 ] 

Shashank Pedamallu edited comment on SPARK-34107 at 1/14/21, 8:00 AM:
--

Screenshot of the spark history:

!blank_shs.png|width=240,height=120!

Also, please find attached the dynamic tracing analysis (using 
[btrace|https://github.com/btraceio/btrace][^SHS_Profiling_Sorted.csv]


was (Author: spedamallu):
Screenshot of the spark history:

!blank_shs.png|height=120,width=240!

Also, please find attached the dynamic tracing analysis (using 
[btrace|[http://example.com|https://github.com/btraceio/btrace]][^SHS_Profiling_Sorted.csv]


[jira] [Comment Edited] (SPARK-34107) Spark History not loading when service has to load 300k applications initially from S3

2021-01-14 Thread Shashank Pedamallu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264672#comment-17264672
 ] 

Shashank Pedamallu edited comment on SPARK-34107 at 1/14/21, 8:01 AM:
--

Screenshot of the spark history:

!blank_shs.png|width=240,height=120!

Also, please find attached the dynamic tracing analysis (using 
[btrace|[http://example.com].|https://github.com/btraceio/btrace]


was (Author: spedamallu):
Screenshot of the spark history:

!blank_shs.png|width=240,height=120!

Also, please find attached the dynamic tracing analysis (using 
[btrace|https://github.com/btraceio/btrace][^SHS_Profiling_Sorted.csv]


[jira] [Comment Edited] (SPARK-34107) Spark History not loading when service has to load 300k applications initially from S3

2021-01-14 Thread Shashank Pedamallu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264672#comment-17264672
 ] 

Shashank Pedamallu edited comment on SPARK-34107 at 1/14/21, 8:01 AM:
--

Screenshot of the spark history:

!blank_shs.png|width=240,height=120!

Also, please find attached the dynamic tracing analysis (using 
[btrace|https://github.com/btraceio/btrace]


was (Author: spedamallu):
Screenshot of the spark history:

!blank_shs.png|width=240,height=120!

Also, please find attached the dynamic tracing analysis (using 
[btrace|[http://example.com].|https://github.com/btraceio/btrace]


[jira] [Comment Edited] (SPARK-34107) Spark History not loading when service has to load 300k applications initially from S3

2021-01-14 Thread Shashank Pedamallu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264672#comment-17264672
 ] 

Shashank Pedamallu edited comment on SPARK-34107 at 1/14/21, 8:01 AM:
--

Screenshot of the spark history:

!blank_shs.png|width=240,height=120!

Also, please find attached the dynamic tracing analysis (using 
[btrace)|https://github.com/btraceio/btrace]


was (Author: spedamallu):
Screenshot of the spark history:

!blank_shs.png|width=240,height=120!

Also, please find attached the dynamic tracing analysis (using 
[btrace|https://github.com/btraceio/btrace]


[jira] [Resolved] (SPARK-34107) Spark History not loading when service has to load 300k applications initially from S3

2021-01-14 Thread Shashank Pedamallu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shashank Pedamallu resolved SPARK-34107.

Resolution: Duplicate

Duplicate of SPARK-33790
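For context on the symptom, the following is a hedged sketch of history-server settings commonly tuned for large S3-backed log directories. The option names are real Spark configurations, but the bucket name and values are illustrative only, and this is not the fix adopted in SPARK-33790:

{noformat}
# Illustrative values only
spark.history.fs.logDirectory      s3a://example-bucket/spark-history
spark.history.fs.numReplayThreads  32     # parallelism for replaying event logs
spark.history.fs.update.interval   1min   # how often the log directory is rescanned
spark.history.store.path           /var/spark/shs-store   # keep listing data on local disk
{noformat}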


[jira] [Comment Edited] (SPARK-34107) Spark History not loading when service has to load 300k applications initially from S3

2021-01-14 Thread Shashank Pedamallu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264686#comment-17264686
 ] 

Shashank Pedamallu edited comment on SPARK-34107 at 1/14/21, 8:21 AM:
--

Duplicate of SPARK-33790


was (Author: spedamallu):
Dubplicate of SPARK-33790


[jira] [Resolved] (SPARK-34067) PartitionPruning push down pruningHasBenefit function into insertPredicate function to decrease calculate time

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34067.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31122
[https://github.com/apache/spark/pull/31122]

> PartitionPruning push down pruningHasBenefit function into insertPredicate 
> function to decrease calculate time
> --
>
> Key: SPARK-34067
> URL: https://issues.apache.org/jira/browse/SPARK-34067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.2.0, 3.1.1
>Reporter: jiahong.li
>Assignee: jiahong.li
>Priority: Minor
> Fix For: 3.2.0
>
>
> In the PartitionPruning rule's prune function, push the pruningHasBenefit 
> check into the insertPredicate function, since `SQLConf.get.exchangeReuseEnabled` 
> and `SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly` both default to 
> true. By making hasBenefit lazy, we avoid invoking it unnecessarily and so 
> save computation time.
> Solved by #31122.
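As a hedged, self-contained sketch of the lazy-evaluation point above (not the actual Spark source; the two flags are stand-ins for the SQLConf entries named in the description):

{code}
object LazyBenefitSketch {
  // Stand-ins for the SQLConf flags, which both default to true.
  val exchangeReuseEnabled = true
  val reuseBroadcastOnly = true

  // With `lazy`, this potentially expensive estimate runs only if the
  // guarding flags do not already decide the outcome.
  lazy val pruningHasBenefit: Boolean = {
    println("computing benefit estimate...") // never printed while both flags are true
    true
  }

  def shouldInsertPredicate: Boolean =
    (exchangeReuseEnabled && reuseBroadcastOnly) || pruningHasBenefit

  def main(args: Array[String]): Unit =
    println(shouldInsertPredicate) // prints true without computing the estimate
}
{code}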






[jira] [Assigned] (SPARK-34067) PartitionPruning push down pruningHasBenefit function into insertPredicate function to decrease calculate time

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-34067:
---

Assignee: jiahong.li

> PartitionPruning push down pruningHasBenefit function into insertPredicate 
> function to decrease calculate time
> --
>
> Key: SPARK-34067
> URL: https://issues.apache.org/jira/browse/SPARK-34067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.2.0, 3.1.1
>Reporter: jiahong.li
>Assignee: jiahong.li
>Priority: Minor
>
> In the PartitionPruning rule's prune function, push the pruningHasBenefit 
> check into the insertPredicate function, since `SQLConf.get.exchangeReuseEnabled` 
> and `SQLConf.get.dynamicPartitionPruningReuseBroadcastOnly` both default to 
> true. By making hasBenefit lazy, we avoid invoking it unnecessarily and so 
> save computation time.
> Solved by #31122.






[jira] [Created] (SPARK-34113) Use metric data to update metadata statistics

2021-01-14 Thread angerszhu (Jira)
angerszhu created SPARK-34113:
-

 Summary: Use metric data to update metadata statistics
 Key: SPARK-34113
 URL: https://issues.apache.org/jira/browse/SPARK-34113
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


Use metric data to update metadata statistics
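As a hedged illustration of what "metadata statistics" refers to (the table name is hypothetical; assumes a running SparkSession named `spark`): today these statistics are typically refreshed with a manual ANALYZE pass, which this improvement would fold into the write path using metrics Spark already collects:

{code}
// Hypothetical table; real Spark SQL syntax.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS") // refreshes rowCount/sizeInBytes
// Proposal: update the same statistics automatically from write-side
// metrics (rows/bytes written), without a separate ANALYZE pass.
{code}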






[jira] [Assigned] (SPARK-34113) Use metric data to update metadata statistics

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34113:


Assignee: (was: Apache Spark)

> Use metric data to update metadata statistics
> -
>
> Key: SPARK-34113
> URL: https://issues.apache.org/jira/browse/SPARK-34113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Minor
>
> Use metric data to update metadata statistics






[jira] [Commented] (SPARK-34113) Use metric data to update metadata statistics

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264697#comment-17264697
 ] 

Apache Spark commented on SPARK-34113:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/31179

> Use metric data to update metadata statistics
> -
>
> Key: SPARK-34113
> URL: https://issues.apache.org/jira/browse/SPARK-34113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Minor
>
> Use metric data to update metadata statistics






[jira] [Assigned] (SPARK-34113) Use metric data to update metadata statistics

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34113:


Assignee: Apache Spark

> Use metric data to update metadata statistics
> -
>
> Key: SPARK-34113
> URL: https://issues.apache.org/jira/browse/SPARK-34113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Minor
>
> Use metric data to update metadata statistics






[jira] [Commented] (SPARK-33354) New explicit cast syntax rules in ANSI mode

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264699#comment-17264699
 ] 

Apache Spark commented on SPARK-33354:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31180

> New explicit cast syntax rules in ANSI mode
> ---
>
> Key: SPARK-33354
> URL: https://issues.apache.org/jira/browse/SPARK-33354
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> In section 6.13 of the ANSI SQL standard, there are syntax rules for valid 
> combinations of the source and target data types.
> To make Spark's ANSI mode more ANSI SQL compatible, I propose to disallow the 
> following casts in ANSI mode:
> {code:java}
> TimeStamp <=> Boolean
> Date <=> Boolean
> Numeric <=> Timestamp
> Numeric <=> Date
> Numeric <=> Binary
> String <=> Array
> String <=> Map
> String <=> Struct
> {code}
> The following casts are considered invalid in the ANSI SQL standard, but they 
> are quite straightforward. Let's allow them for now:
> {code:java}
> Numeric <=> Boolean
> String <=> Boolean
> String <=> Binary
> {code}
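As a hedged illustration of the proposed behavior (assumes a running SparkSession named `spark`; the exact error class may differ once the rules land):

{code}
spark.conf.set("spark.sql.ansi.enabled", "true")

// Timestamp <=> Boolean would be disallowed under the proposed rules,
// so this cast should be rejected at analysis time:
spark.sql("SELECT CAST(TIMESTAMP '2021-01-14 00:00:00' AS BOOLEAN)")

// Still allowed for now, although the ANSI standard is stricter here:
spark.sql("SELECT CAST('true' AS BOOLEAN)").show()
{code}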






[jira] [Created] (SPARK-34114) should not trim right for read-side char length check and padding

2021-01-14 Thread Kent Yao (Jira)
Kent Yao created SPARK-34114:


 Summary: should not trim right for read-side char length check and 
padding
 Key: SPARK-34114
 URL: https://issues.apache.org/jira/browse/SPARK-34114
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Kent Yao









[jira] [Updated] (SPARK-34110) Upgrade ZooKeeper to 3.6.2

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34110:

Description: 
When running Spark on JDK 14:
{noformat}
21/01/13 20:25:32,533 WARN 
[Driver-SendThread(apache-spark-zk-3.vip.hadoop.com:2181)] 
zookeeper.ClientCnxn:1164 : Session 0x0 for server 
apache-spark-zk-3.vip.hadoop.com/:2181, unexpected error, closing 
socket connection and attempting reconnect
java.lang.IllegalArgumentException: Unable to canonicalize address 
apache-spark-zk-3.vip.hadoop.com/:2181 because it's not resolvable
at 
org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65)
at 
org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
at 
org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
{noformat}

Please see ZOOKEEPER-3779 for more details.


  was:
When running Spark on JDK 14:
{noformat}
21/01/13 20:25:32,533 WARN 
[Driver-SendThread(apache-spark-zk-3.vip.hadoop.com:2181)] 
zookeeper.ClientCnxn:1164 : Session 0x0 for server 
apache-spark-zk-3.vip.hadoop.com/:2181, unexpected error, closing 
socket connection and attempting reconnect
java.lang.IllegalArgumentException: Unable to canonicalize address 
carmel-rno-zk-3.vip.hadoop.ebay.com/:2181 because it's not 
resolvable
at 
org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65)
at 
org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
at 
org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
{noformat}

Please see ZOOKEEPER-3779 for more details.



> Upgrade ZooKeeper to 3.6.2
> --
>
> Key: SPARK-34110
> URL: https://issues.apache.org/jira/browse/SPARK-34110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> When running Spark on JDK 14:
> {noformat}
> 21/01/13 20:25:32,533 WARN 
> [Driver-SendThread(apache-spark-zk-3.vip.hadoop.com:2181)] 
> zookeeper.ClientCnxn:1164 : Session 0x0 for server 
> apache-spark-zk-3.vip.hadoop.com/:2181, unexpected error, closing 
> socket connection and attempting reconnect
> java.lang.IllegalArgumentException: Unable to canonicalize address 
> apache-spark-zk-3.vip.hadoop.com/:2181 because it's not resolvable
>   at 
> org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65)
>   at 
> org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
>   at 
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> {noformat}
> Please see ZOOKEEPER-3779 for more details.
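As a hedged sketch of the underlying failure (host name taken from the redacted log above): the older ZooKeeper client cannot canonicalize an unresolvable quorum address and raises the IllegalArgumentException shown, which ZOOKEEPER-3779 addresses. A plain JDK resolvability probe shows the root condition:

{code}
import java.net.{InetAddress, UnknownHostException}

try {
  // getByName throws if the quorum host cannot be resolved at all.
  val canonical =
    InetAddress.getByName("apache-spark-zk-3.vip.hadoop.com").getCanonicalHostName
  println(s"resolves to: $canonical")
} catch {
  case e: UnknownHostException => println(s"not resolvable: ${e.getMessage}")
}
{code}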






[jira] [Commented] (SPARK-34114) should not trim right for read-side char length check and padding

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264748#comment-17264748
 ] 

Apache Spark commented on SPARK-34114:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31181

> should not trim right for read-side char length check and padding
> -
>
> Key: SPARK-34114
> URL: https://issues.apache.org/jira/browse/SPARK-34114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Priority: Critical
>







[jira] [Assigned] (SPARK-34114) should not trim right for read-side char length check and padding

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34114:


Assignee: (was: Apache Spark)

> should not trim right for read-side char length check and padding
> -
>
> Key: SPARK-34114
> URL: https://issues.apache.org/jira/browse/SPARK-34114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Priority: Critical
>







[jira] [Assigned] (SPARK-34114) should not trim right for read-side char length check and padding

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34114:


Assignee: Apache Spark

> should not trim right for read-side char length check and padding
> -
>
> Key: SPARK-34114
> URL: https://issues.apache.org/jira/browse/SPARK-34114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Critical
>







[jira] [Commented] (SPARK-34114) should not trim right for read-side char length check and padding

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264749#comment-17264749
 ] 

Apache Spark commented on SPARK-34114:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31181

> should not trim right for read-side char length check and padding
> -
>
> Key: SPARK-34114
> URL: https://issues.apache.org/jira/browse/SPARK-34114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Priority: Critical
>







[jira] [Updated] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2021-01-14 Thread jiulongzhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiulongzhu updated SPARK-26385:
---
Attachment: (was: SPARK-26385.0001.patch)

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in 
> cache
> ---
>
> Key: SPARK-26385
> URL: https://issues.apache.org/jira/browse/SPARK-26385
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0, Spark 2.4.0
>Reporter: T M
>Priority: Major
>
>  
> Hello,
>  
> I have a Spark Structured Streaming job which is running on YARN (Hadoop 2.6.0, 
> Spark 2.4.0). After 25-26 hours, my job stops working with the following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR 
> org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query 
> TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = 
> a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, 
> realUser=, issueDate=1544903057122, maxDate=1545507857122, 
> sequenceNumber=10314, masterKeyId=344) can't be found in cache at 
> org.apache.hadoop.ipc.Client.call(Client.java:1470) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1401) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at 
> org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
> org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at 
> org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecuto

[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2021-01-14 Thread jiulongzhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264772#comment-17264772
 ] 

jiulongzhu commented on SPARK-26385:


https://issues.apache.org/jira/browse/HDFS-9276

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in 
> cache
> ---
>
> Key: SPARK-26385
> URL: https://issues.apache.org/jira/browse/SPARK-26385
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0, Spark 2.4.0
>Reporter: T M
>Priority: Major
>
>  
> Hello,
>  
> I have a Spark Structured Streaming job running on YARN (Hadoop 2.6.0, 
> Spark 2.4.0). After 25-26 hours, my job stops working with the following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR 
> org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query 
> TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = 
> a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, 
> realUser=, issueDate=1544903057122, maxDate=1545507857122, 
> sequenceNumber=10314, masterKeyId=344) can't be found in cache at 
> org.apache.hadoop.ipc.Client.call(Client.java:1470) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1401) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at 
> org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
> org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at 
> org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution

[jira] [Updated] (SPARK-34115) Long runtime on many environment variables

2021-01-14 Thread Norbert Schultz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norbert Schultz updated SPARK-34115:

Description: 
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it ran perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule, which calls
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the cost adds up quickly.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

{code:java}
sys.env.contains("SPARK_TESTING") {code}

to make it less expensive.
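A minimal sketch of that lazy val idea (illustrative names; a simplified 
stand-in for the real method):

{code:scala}
object Utils {
  // Each call to sys.env rebuilds an immutable Map from System.getenv,
  // traversing every environment variable.
  def isTestingEachCall: Boolean = sys.env.contains("SPARK_TESTING")

  // Cached variant: the environment is scanned once, on first access,
  // and later calls just read the memoized Boolean.
  lazy val isTestingCached: Boolean = sys.env.contains("SPARK_TESTING")
}
{code}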

  was:
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it ran perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule, which calls
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the cost adds up quickly.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

```

sys.env.contains("SPARK_TESTING")

```

to make it less expensive.


> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
>
> I am not sure if this is a bug report or a feature request. The code is the 
> same in current versions of Spark, and maybe this ticket saves someone some 
> time debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the unit tests on 
> our build machine were much slower than expected.
> On local machines it ran perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule, which calls
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so the cost adds up quickly.
>  
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> to make it less expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34115) Long runtime on many environment variables

2021-01-14 Thread Norbert Schultz (Jira)
Norbert Schultz created SPARK-34115:
---

 Summary: Long runtime on many environment variables
 Key: SPARK-34115
 URL: https://issues.apache.org/jira/browse/SPARK-34115
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
 Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
Reporter: Norbert Schultz


I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it ran perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule, which calls
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the cost adds up quickly.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for 
"sys.env.contains("SPARK_TESTING")" to make it less expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34115) Long runtime on many environment variables

2021-01-14 Thread Norbert Schultz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norbert Schultz updated SPARK-34115:

Description: 
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it ran perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule, which calls
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the cost adds up quickly.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

```

sys.env.contains("SPARK_TESTING")

```

to make it less expensive.

  was:
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it ran perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule, which calls
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the cost adds up quickly.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for 
"sys.env.contains("SPARK_TESTING")" to make it less expensive.


> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
>
> I am not sure if this is a bug report or a feature request. The code is the 
> same in current versions of Spark, and maybe this ticket saves someone some 
> time debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the unit tests on 
> our build machine were much slower than expected.
> On local machines it ran perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule, which calls
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so the cost adds up quickly.
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
> ```
> sys.env.contains("SPARK_TESTING")
> ```
> to make it less expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34115) Long runtime on many environment variables

2021-01-14 Thread Norbert Schultz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norbert Schultz updated SPARK-34115:

Description: 
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time debugging.

We migrated some older code to Spark 2.4.0, and suddenly the integration tests 
on our build machine were much slower than expected.

On local machines it ran perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule, which calls
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the cost adds up quickly.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

{code:java}
sys.env.contains("SPARK_TESTING") {code}

to make it less expensive.

  was:
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it ran perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule, which calls
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the cost adds up quickly.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

{code:java}
sys.env.contains("SPARK_TESTING") {code}

to make it less expensive.


> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
>
> I am not sure if this is a bug report or a feature request. The code is the 
> same in current versions of Spark, and maybe this ticket saves someone some 
> time debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the integration 
> tests on our build machine were much slower than expected.
> On local machines it ran perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule, which calls
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so the cost adds up quickly.
>  
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> to make it less expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34102) Spark SQL cannot escape both \ and other special characters

2021-01-14 Thread Noah Kawasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noah Kawasaki updated SPARK-34102:
--
Labels: escaping filter string-manipulation  (was: )

> Spark SQL cannot escape both \ and other special characters 
> 
>
> Key: SPARK-34102
> URL: https://issues.apache.org/jira/browse/SPARK-34102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.0, 2.4.5, 3.0.1
>Reporter: Noah Kawasaki
>Priority: Minor
>  Labels: escaping, filter, string-manipulation
>
> Spark literal string parsing does not properly escape backslashes or other 
> special characters. This is an extension of this issue: 
> https://issues.apache.org/jira/browse/SPARK-17647
>  
> The issue is that depending on how spark.sql.parser.escapedStringLiterals is 
> set, you either get correctly escaped backslashes in a string literal but 
> not other special characters, or correctly escaped special characters but 
> not backslashes.
> So you have to choose which configuration you care about more.
> I have tested Spark versions 2.1, 2.2, 2.3, 2.4, and 3.0, and they all 
> exhibit the issue:
> {code:java}
> # These do not return the expected backslash
> SET spark.sql.parser.escapedStringLiterals=false;
> SELECT '\\';
> > \
> (should return \\)
> SELECT 'hi\hi';
> > hihi
> (should return hi\hi) 
> # These are correctly escaped
> SELECT '\"';
> > "
>  SELECT '\'';
> > '{code}
> If I switch this: 
> {code:java}
> # These now work
> SET spark.sql.parser.escapedStringLiterals=true;
> SELECT '\\';
> > \\
> SELECT 'hi\hi';
> > hi\hi
> # These are now not correctly escaped
> SELECT '\"';
> > \"
> (should return ")
> SELECT '\'';
> > \'
> (should return ' ){code}
>  So basically we have to choose:
> SET spark.sql.parser.escapedStringLiterals=false; if we want backslashes 
> correctly escaped but not other special characters
> SET spark.sql.parser.escapedStringLiterals=true; if we want other special 
> characters correctly escaped but not backslashes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33989) Strip auto-generated cast when using Cast.sql

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33989:
---

Assignee: ulysses you

> Strip auto-generated cast when using Cast.sql
> -
>
> Key: SPARK-33989
> URL: https://issues.apache.org/jira/browse/SPARK-33989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> During analysis we may implicitly introduce a Cast when a type cast is 
> needed. That makes the assigned name unclear.
> Say we have the SQL `select id == null` where id is of int type; the output 
> field name will then be `(id = CAST(null as int))`.
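For illustration, a minimal sketch of the symptom (assumes a spark-shell 
session where {{spark}} and its implicits are in scope; the exact rendered 
name may vary by version):

{code:scala}
import spark.implicits._

val df = Seq(1).toDF("id")                 // id: int
println(df.selectExpr("id = null").columns.head)
// Before the fix this prints something like "(id = CAST(NULL AS INT))";
// with the auto-generated cast stripped it would be "(id = NULL)".
{code}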



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33989) Strip auto-generated cast when using Cast.sql

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33989.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31034
[https://github.com/apache/spark/pull/31034]

> Strip auto-generated cast when using Cast.sql
> -
>
> Key: SPARK-33989
> URL: https://issues.apache.org/jira/browse/SPARK-33989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> During analysis we may implicitly introduce a Cast when a type cast is 
> needed. That makes the assigned name unclear.
> Say we have the SQL `select id == null` where id is of int type; the output 
> field name will then be `(id = CAST(null as int))`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28263) Spark-submit can not find class (ClassNotFoundException)

2021-01-14 Thread Jack LIN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264990#comment-17264990
 ] 

Jack LIN commented on SPARK-28263:
--

Just guessing, this probably relates to the settings of `sbt`

> Spark-submit can not find class (ClassNotFoundException)
> 
>
> Key: SPARK-28263
> URL: https://issues.apache.org/jira/browse/SPARK-28263
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.4.3
>Reporter: Zhiyuan
>Priority: Major
>  Labels: beginner
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> I am trying to run the Main class from my folder using the following command 
> in a script:
>  
> {code:java}
> spark-shell --class com.navercorp.Main /target/node2vec-0.0.1-SNAPSHOT.jar 
> --cmd node2vec ../graph/karate.edgelist --output ../walk/walk.txt {code}
> But it raises the error:
> {code:java}
> 19/07/05 14:39:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 19/07/05 14:39:20 WARN deploy.SparkSubmit$$anon$2: Failed to load 
> com.navercorp.Main.
> java.lang.ClassNotFoundException: com.navercorp.Main
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:810)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
> I have the jar file in my folder; this is the structure:
> {code:java}
> node2vec
>   node2vec_spark
>     main
>       resources
>       com
>         novercorp
>           lib
>             Main
>             Node2vec
>             Word2vec
>   target
>     lib
>     classes
>     maven-archiver
>     node2vec-0.0.1-SNAPSHOT.jar
>   graph
>     karate.edgelist
>   walk
>     walk.txt
> {code}
> Also, I attach the structure of the jar file:
> {code:java}
> META-INF/
> META-INF/MANIFEST.MF
> log4j2.properties
> com/
> com/navercorp/
> com/navercorp/Node2vec$.class
> com/navercorp/Main$Params$$typecreator1$1.class
> com/navercorp/Main$$anon$1$$anonfun$11.class
> com/navercorp/Word2vec$.class
> com/navercorp/Main$$anon$1$$anonfun$8.class
> com/navercorp/Node2vec$$anonfun$randomWalk$1$$anonfun$8.class
> com/navercorp/Node2vec$$anonfun$indexingGraph$4.class
> com/navercorp/Node2vec$$anonfun$initTransitionProb$1.class
> com/navercorp/Main$.class
> com/navercorp/Node2vec$$anonfun$loadNode2Id$1.class
> com/navercorp/Node2vec$$anonfun$14.class
> com/navercorp/Node2vec$$anonfun$readIndexedGraph$2$$anonfun$1.class
> {code}
> Could someone give me advice on how to get the Main class loaded?
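Note: the command above invokes spark-shell, while --class is a spark-submit 
option, and the jar path starts at the filesystem root (/target/...). A hedged 
sketch of the likely intended invocation, assuming the jar lives under the 
project's target directory:

{code:java}
spark-submit --class com.navercorp.Main target/node2vec-0.0.1-SNAPSHOT.jar \
  --cmd node2vec ../graph/karate.edgelist --output ../walk/walk.txt
{code}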



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28263) Spark-submit can not find class (ClassNotFoundException)

2021-01-14 Thread Jack LIN (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack LIN updated SPARK-28263:
-
Comment: was deleted

(was: Just guessing, this probably relates to the settings of `sbt`)

> Spark-submit can not find class (ClassNotFoundException)
> 
>
> Key: SPARK-28263
> URL: https://issues.apache.org/jira/browse/SPARK-28263
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.4.3
>Reporter: Zhiyuan
>Priority: Major
>  Labels: beginner
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> I am trying to run the Main class from my folder using the following command 
> in a script:
>  
> {code:java}
> spark-shell --class com.navercorp.Main /target/node2vec-0.0.1-SNAPSHOT.jar 
> --cmd node2vec ../graph/karate.edgelist --output ../walk/walk.txt {code}
> But it raises the error:
> {code:java}
> 19/07/05 14:39:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 19/07/05 14:39:20 WARN deploy.SparkSubmit$$anon$2: Failed to load 
> com.navercorp.Main.
> java.lang.ClassNotFoundException: com.navercorp.Main
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:810)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
> I have the jar file in my folder; this is the structure:
> {code:java}
> node2vec
>   node2vec_spark
>     main
>       resources
>       com
>         novercorp
>           lib
>             Main
>             Node2vec
>             Word2vec
>   target
>     lib
>     classes
>     maven-archiver
>     node2vec-0.0.1-SNAPSHOT.jar
>   graph
>     karate.edgelist
>   walk
>     walk.txt
> {code}
> Also, I attach the structure of the jar file:
> {code:java}
> META-INF/
> META-INF/MANIFEST.MF
> log4j2.properties
> com/
> com/navercorp/
> com/navercorp/Node2vec$.class
> com/navercorp/Main$Params$$typecreator1$1.class
> com/navercorp/Main$$anon$1$$anonfun$11.class
> com/navercorp/Word2vec$.class
> com/navercorp/Main$$anon$1$$anonfun$8.class
> com/navercorp/Node2vec$$anonfun$randomWalk$1$$anonfun$8.class
> com/navercorp/Node2vec$$anonfun$indexingGraph$4.class
> com/navercorp/Node2vec$$anonfun$initTransitionProb$1.class
> com/navercorp/Main$.class
> com/navercorp/Node2vec$$anonfun$loadNode2Id$1.class
> com/navercorp/Node2vec$$anonfun$14.class
> com/navercorp/Node2vec$$anonfun$readIndexedGraph$2$$anonfun$1.class
> {code}
> Could someone give me advice on how to get the Main class loaded?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34095) Spark 3 is giving WARN while running job

2021-01-14 Thread Sachit Murarka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265019#comment-17265019
 ] 

Sachit Murarka commented on SPARK-34095:


[~sowen] [~holden] -> Can you please suggest something on this?

> Spark 3 is giving WARN while running job
> 
>
> Key: SPARK-34095
> URL: https://issues.apache.org/jira/browse/SPARK-34095
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.1
>Reporter: Sachit Murarka
>Priority: Major
>
> I am running Spark 3.0.1 with Java 11 and am getting the following WARN:
>  
> WARNING: An illegal reflective access operation has occurred
>  WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> ([file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar|file://opt/spark/jars/spark-unsafe_2.12-3.0.1.jar])
>  to constructor java.nio.DirectByteBuffer(long,int)
>  WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
>  WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
>  WARNING: All illegal access operations will be denied in a future release
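For reference, this warning is generally benign on Java 11. A hedged sketch of 
one common way to silence it, assuming the reflective access comes from 
Platform's use of java.nio (--add-opens is a standard JDK 9+ flag, and the 
extraJavaOptions settings are standard Spark configs):

{code:java}
spark-submit \
  --conf "spark.driver.extraJavaOptions=--add-opens=java.base/java.nio=ALL-UNNAMED" \
  --conf "spark.executor.extraJavaOptions=--add-opens=java.base/java.nio=ALL-UNNAMED" \
  ...
{code}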



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34095) Spark 3.0.1 is giving WARN while running job

2021-01-14 Thread Sachit Murarka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachit Murarka updated SPARK-34095:
---
Summary: Spark 3.0.1 is giving WARN while running job  (was: Spark 3 is 
giving WARN while running job)

> Spark 3.0.1 is giving WARN while running job
> 
>
> Key: SPARK-34095
> URL: https://issues.apache.org/jira/browse/SPARK-34095
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.1
>Reporter: Sachit Murarka
>Priority: Major
>
> I am running Spark 3.0.1 with Java 11 and am getting the following WARN:
>  
> WARNING: An illegal reflective access operation has occurred
>  WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> ([file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar|file://opt/spark/jars/spark-unsafe_2.12-3.0.1.jar])
>  to constructor java.nio.DirectByteBuffer(long,int)
>  WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
>  WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
>  WARNING: All illegal access operations will be denied in a future release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34042) Column pruning is not working as expected for PERMISSIVE mode

2021-01-14 Thread Marius Butan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265022#comment-17265022
 ] 

Marius Butan commented on SPARK-34042:
--

Like I said in the description, I made some tests in 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]:

All the tests run on a file with 3 columns.

withPruningEnabledAndMapSingleColumn -> with pruning enabled, when we have one 
column in the schema and use it in the select query, the expected result is 2, 
not 0

withPruningEnabledAndMap2ColumnsButUse1InSql -> with pruning enabled, when we 
have 2 columns in the schema and use only 1 in the select, the expected result 
is 2 and it is correct

withPruningDisableAndMap2ColumnsButUse1InSql -> with pruning disabled, it 
works correctly

> Column pruning is not working as expected for PERMISSIVE mode
> 
>
> Key: SPARK-34042
> URL: https://issues.apache.org/jira/browse/SPARK-34042
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.7
>Reporter: Marius Butan
>Priority: Major
>
> In PERMISSIVE mode:
> Given a CSV with multiple columns per row, if your file schema has a single 
> column and you are doing a SELECT in SQL with a condition like 
> ' is null', the row is marked as corrupted.
>  
> BUT if you add an extra column to the file schema and do not put that column 
> in the SQL SELECT, the row is not marked as corrupted.
>  
> PS. I don't know exactly what the right behavior is; I didn't find it 
> documented for PERMISSIVE mode.
> What I found is: As an example, a CSV file contains the "id,name" header and 
> one row "1234". In Spark 2.4, the selection of the id column consists of a 
> row with one column value 1234, but in Spark 2.3 and earlier it is empty in 
> the DROPMALFORMED mode. To restore the previous behavior, set 
> {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.
>  
> [https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
>  
> I made a "unit" test in order to exemplify the issue: 
> [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34108) Caching with permanent view doesn't work in certain cases

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265048#comment-17265048
 ] 

Apache Spark commented on SPARK-34108:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31182

> Caching with permanent view doesn't work in certain cases
> -
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it 
> inserts an extra Project operator, which makes the comparison of 
> canonicalized plans fail during cache lookup.
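For reference, a quick way to check whether the last SELECT hits the cache 
(a hedged sketch using the tables from the example above):

{code:scala}
spark.sql("CACHE TABLE v1")
// If the cached view is picked up, the physical plan contains an
// InMemoryTableScan / InMemoryRelation node instead of a fresh scan of t.
spark.sql("SELECT key FROM t ORDER BY key").explain()
{code}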



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34108) Caching with permanent view doesn't work in certain cases

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34108:


Assignee: (was: Apache Spark)

> Caching with permanent view doesn't work in certain cases
> -
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it 
> inserts an extra Project operator, which makes the comparison of 
> canonicalized plans fail during cache lookup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34108) Caching with permanent view doesn't work in certain cases

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34108:


Assignee: Apache Spark

> Caching with permanent view doesn't work in certain cases
> -
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it 
> inserts an extra Project operator, which makes the comparison of 
> canonicalized plans fail during cache lookup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34108) Caching with permanent view doesn't work in certain cases

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265049#comment-17265049
 ] 

Apache Spark commented on SPARK-34108:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31182

> Caching with permanent view doesn't work in certain cases
> -
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it 
> inserts an extra Project operator, which makes the comparison of 
> canonicalized plans fail during cache lookup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253044#comment-17253044
 ] 

Nicholas Chammas edited comment on SPARK-12890 at 1/14/21, 5:41 PM:


I think this is still an open issue. On Spark 2.4.6:
{code:java}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).explain()
== Physical Plan == 
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], 
output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:java}
$ aws s3 ls s3://some/dataset/
   PRE file_date=2018-10-02/
   PRE file_date=2018-10-08/
   PRE file_date=2018-10-15/ 
   ...{code}
Schema merging is not enabled, as far as I can tell.

Shouldn't Spark be able to answer this query without going through ~20K files?

Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when 
it's only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog 
to list partitions and extract what's needed that 
way|https://stackoverflow.com/a/65724151/877069]. But it seems like Spark 
should handle this situation better.
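For illustration, a minimal Scala sketch of that catalog-based workaround 
(hypothetical table name {{some_db.some_table}}; assumes the dataset is 
registered in the metastore with {{file_date}} as its partition column):

{code:scala}
// Ask the catalog for partition values instead of scanning ~20K data files.
val latest = spark.sql("SHOW PARTITIONS some_db.some_table")
  .collect()                              // rows like "file_date=2018-10-02"
  .map(_.getString(0).split("=", 2)(1))   // keep the value after '='
  .max                                    // lexicographic max works for ISO dates
println(latest)
{code}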


was (Author: nchammas):
I think this is still an open issue. On Spark 2.4.6:
{code:java}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).explain()
== Physical Plan == 
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], 
output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:java}
$ aws s3 ls s3://some/dataset/
   PRE file_date=2018-10-02/
   PRE file_date=2018-10-08/
   PRE file_date=2018-10-15/ 
   ...{code}
Schema merging is not enabled, as far as I can tell.

Shouldn't Spark be able to answer this query without going through ~20K files?

Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when 
it's only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog 
to list partitions and extract what's needed that 
way|https://stackoverflow.com/a/57440760/877069]. But it seems like Spark 
should handle this situation better.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which touches only partition fields. The query ends up 
> scanning all the data, which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-34116:
---

 Summary: Separate state store numKeys metric test
 Key: SPARK-34116
 URL: https://issues.apache.org/jira/browse/SPARK-34116
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed with 
the numKeys metric test. I found this flaky when testing with another 
StateStore implementation. Specifically, we are also able to check these 
metrics after the state store is updated (committed). So I think we can 
refactor the test a little to make it easier to incorporate other StateStore 
implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265110#comment-17265110
 ] 

Apache Spark commented on SPARK-34116:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31183

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found this flaky when testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little to make it easier to incorporate other StateStore 
> implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34116:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found this flaky when testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little to make it easier to incorporate other StateStore 
> implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34116:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found this flaky when testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little to make it easier to incorporate other StateStore 
> implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2021-01-14 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-31693.
-
Resolution: Fixed

> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems 
> that there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download the maven mirror. (SPARK-31691) -> The primary 
> host is okay.
> - The node failed to communicate with repository.apache.org. (Current master 
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6

2021-01-14 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265139#comment-17265139
 ] 

Shane Knapp commented on SPARK-29183:
-

In the next month or so, we'll be reimaging the remaining Ubuntu workers, and
Java 11 will be at 11.0.9+. The new Ubuntu 20 workers (r-j-w-01..06) all
currently have 11.0.9 installed.

> Upgrade JDK 11 Installation to 11.0.6
> -
>
> Key: SPARK-29183
> URL: https://issues.apache.org/jira/browse/SPARK-29183
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Major
>
> Every JDK 11.0.x release has many fixes, including performance regression
> fixes. We had better upgrade to the latest release, 11.0.4.
> - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2021-01-14 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp closed SPARK-31693.
---

> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that
> there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download from the Maven mirror. (SPARK-31691) -> The primary
> host is okay.
> - The node failed to communicate with repository.apache.org. (Current master
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6

2021-01-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265140#comment-17265140
 ] 

Dongjoon Hyun commented on SPARK-29183:
---

Yay! That will be great. Thank you for sharing that, [~shaneknapp]. I'm looking
forward to seeing it.

> Upgrade JDK 11 Installation to 11.0.6
> -
>
> Key: SPARK-29183
> URL: https://issues.apache.org/jira/browse/SPARK-29183
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Major
>
> Every JDK 11.0.x release has many fixes, including performance regression
> fixes. We had better upgrade to the latest release, 11.0.4.
> - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30747) Update roxygen2 to 7.0.1

2021-01-14 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265142#comment-17265142
 ] 

Shane Knapp commented on SPARK-30747:
-

We're currently running 7.1.1 on all of the workers.

> Update roxygen2 to 7.0.1
> 
>
> Key: SPARK-30747
> URL: https://issues.apache.org/jira/browse/SPARK-30747
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Shane Knapp
>Priority: Minor
>
> Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old
> (2015-11-11), so it could be a good idea to update it along with the current R
> updates.
> At a crude inspection:
> * SPARK-22430 was resolved a while ago.
> * SPARK-30737, SPARK-27262,  https://github.com/apache/spark/pull/27437 and
> https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed
>  resolved the persisting warnings.
> * Documentation builds and CRAN checks pass.
> * Generated HTML docs are identical to those from 5.0.1.
> Since {{roxygen2}} shares some potentially unstable dependencies with
> {{devtools}} (primarily {{rlang}}), it might be a good idea to keep these in
> sync (as a bonus, we wouldn't have to worry about {{DESCRIPTION}} being
> overwritten by local tests).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30747) Update roxygen2 to 7.0.1

2021-01-14 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-30747.
-
Resolution: Fixed

> Update roxygen2 to 7.0.1
> 
>
> Key: SPARK-30747
> URL: https://issues.apache.org/jira/browse/SPARK-30747
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Shane Knapp
>Priority: Minor
>
> Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old
> (2015-11-11), so it could be a good idea to update it along with the current R
> updates.
> At a crude inspection:
> * SPARK-22430 was resolved a while ago.
> * SPARK-30737, SPARK-27262,  https://github.com/apache/spark/pull/27437 and
> https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed
>  resolved the persisting warnings.
> * Documentation builds and CRAN checks pass.
> * Generated HTML docs are identical to those from 5.0.1.
> Since {{roxygen2}} shares some potentially unstable dependencies with
> {{devtools}} (primarily {{rlang}}), it might be a good idea to keep these in
> sync (as a bonus, we wouldn't have to worry about {{DESCRIPTION}} being
> overwritten by local tests).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-14 Thread Shashi Kangayam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265426#comment-17265426
 ] 

Shashi Kangayam commented on SPARK-33711:
-

[~attilapiros]

Our jobs on Spark 3.0.1 are exhibiting the same behavior.

Can you please backport this fix to Spark 3.0?

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD
> changes, which could wrongly lead to the executor POD lifecycle manager
> detecting missing PODs (PODs known by the scheduler backend but missing from
> POD snapshots).
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't
> get a reason why. Marking the executor as failed. The executor may have been
> deleted but the driver missed the deletion event."
> So one of the problems is running the missing POD detection even when only a
> single pod has changed, without having a full, consistent snapshot of all the
> PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race
> between the executor POD lifecycle manager and the scheduler backend.
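
A minimal Scala sketch of the idea behind the fix (illustrative only, not
Spark's actual classes): a known executor may be declared missing only on a
full polled snapshot, never on a single-POD watch delta:

{code}
object MissingPodDetectionSketch extends App {
  sealed trait SnapshotKind
  case object WatchDelta      extends SnapshotKind // single-POD change event
  case object FullPolledState extends SnapshotKind // consistent view of all PODs

  def missing(known: Set[String], seen: Set[String], kind: SnapshotKind): Set[String] =
    kind match {
      case WatchDelta      => Set.empty // partial view: cannot conclude anything
      case FullPolledState => known -- seen
    }

  val known = Set("exec-1", "exec-2")
  println(missing(known, Set("exec-1"), WatchDelta))      // Set() - no false alarm
  println(missing(known, Set("exec-1"), FullPolledState)) // Set(exec-2)
}
{code}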



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34116.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 31183
[https://github.com/apache/spark/pull/31183]

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed
> with the numKeys metric test. I found it flaky when I was testing with another
> StateStore implementation. Specifically, we are also able to check these
> metrics after the state store is updated (committed). So I think we can
> refactor the test a little bit to make it easier to incorporate other
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34116:
--
Fix Version/s: (was: 3.1.0)
   3.1.1

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0, 3.1.1
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.1
>
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed
> with the numKeys metric test. I found it flaky when I was testing with another
> StateStore implementation. Specifically, we are also able to check these
> metrics after the state store is updated (committed). So I think we can
> refactor the test a little bit to make it easier to incorporate other
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34116:
--
Affects Version/s: 3.1.1

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0, 3.1.1
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed
> with the numKeys metric test. I found it flaky when I was testing with another
> StateStore implementation. Specifically, we are also able to check these
> metrics after the state store is updated (committed). So I think we can
> refactor the test a little bit to make it easier to incorporate other
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30747) Update roxygen2 to 7.0.1

2021-01-14 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265566#comment-17265566
 ] 

Maciej Szymkiewicz commented on SPARK-30747:


Thanks for working on this, [~shaneknapp]!

> Update roxygen2 to 7.0.1
> 
>
> Key: SPARK-30747
> URL: https://issues.apache.org/jira/browse/SPARK-30747
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Shane Knapp
>Priority: Minor
>
> Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old
> (2015-11-11), so it could be a good idea to update it along with the current R
> updates.
> At a crude inspection:
> * SPARK-22430 was resolved a while ago.
> * SPARK-30737, SPARK-27262,  https://github.com/apache/spark/pull/27437 and
> https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed
>  resolved the persisting warnings.
> * Documentation builds and CRAN checks pass.
> * Generated HTML docs are identical to those from 5.0.1.
> Since {{roxygen2}} shares some potentially unstable dependencies with
> {{devtools}} (primarily {{rlang}}), it might be a good idea to keep these in
> sync (as a bonus, we wouldn't have to worry about {{DESCRIPTION}} being
> overwritten by local tests).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265604#comment-17265604
 ] 

Dongjoon Hyun commented on SPARK-33711:
---

+1 for having this in branch-3.0. Could you make a backport, [~attilapiros]?

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD
> changes, which could wrongly lead to the executor POD lifecycle manager
> detecting missing PODs (PODs known by the scheduler backend but missing from
> POD snapshots).
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't
> get a reason why. Marking the executor as failed. The executor may have been
> deleted but the driver missed the deletion event."
> So one of the problems is running the missing POD detection even when only a
> single pod has changed, without having a full, consistent snapshot of all the
> PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race
> between the executor POD lifecycle manager and the scheduler backend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34080:
-
Target Version/s: 3.1.1  (was: 3.2.0)

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share
> a common set of params. They are named after the underlying test, which
> requires users to understand the test to find the matching scenarios. It would
> be nice to introduce a single class called UnivariateFeatureSelector that
> accepts a selection criterion and a score method (as string names). Then we
> can deprecate all the other univariate selectors.
> For the params, instead of asking users to specify which score function to
> use, it is friendlier to ask users to specify the feature and label types
> (continuous or categorical), and we set a default score function for each
> combo. We can also detect the types from feature metadata if given. Advanced
> users can override it (if there are multiple score functions that are
> compatible with the feature type and label type combo). Example (param names
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]
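
One plausible shape for the per-combo defaults described above, as a Scala
sketch (the score function names and supported combos here are hypothetical,
not a finalized API):

{code}
object DefaultScoreFunctions {
  // Hypothetical defaults keyed by (featureType, labelType); advanced users
  // would override the default when several score functions fit one combo.
  val byCombo: Map[(String, String), String] = Map(
    ("categorical", "categorical") -> "chi2",
    ("continuous",  "categorical") -> "anova_f",
    ("continuous",  "continuous")  -> "pearson_f"
  )
}
{code}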



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34080:
-
Priority: Critical  (was: Major)

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Priority: Critical
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share
> a common set of params. They are named after the underlying test, which
> requires users to understand the test to find the matching scenarios. It would
> be nice to introduce a single class called UnivariateFeatureSelector that
> accepts a selection criterion and a score method (as string names). Then we
> can deprecate all the other univariate selectors.
> For the params, instead of asking users to specify which score function to
> use, it is friendlier to ask users to specify the feature and label types
> (continuous or categorical), and we set a default score function for each
> combo. We can also detect the types from feature metadata if given. Advanced
> users can override it (if there are multiple score functions that are
> compatible with the feature type and label type combo). Example (param names
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265611#comment-17265611
 ] 

Hyukjin Kwon commented on SPARK-34080:
--

I am okay with having this in Spark 3.1.1.

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Priority: Critical
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share
> a common set of params. They are named after the underlying test, which
> requires users to understand the test to find the matching scenarios. It would
> be nice to introduce a single class called UnivariateFeatureSelector that
> accepts a selection criterion and a score method (as string names). Then we
> can deprecate all the other univariate selectors.
> For the params, instead of asking users to specify which score function to
> use, it is friendlier to ask users to specify the feature and label types
> (continuous or categorical), and we set a default score function for each
> combo. We can also detect the types from feature metadata if given. Advanced
> users can override it (if there are multiple score functions that are
> compatible with the feature type and label type combo). Example (param names
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265647#comment-17265647
 ] 

Apache Spark commented on SPARK-34118:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31184

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  
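
The rewrite is not only stylistic: exists and forall short-circuit, while
filter(...).size always evaluates the predicate on every element. A small
self-contained Scala demonstration:

{code}
object ShortCircuitDemo extends App {
  var calls = 0
  val p = (n: Int) => { calls += 1; n > 0 }
  val seq = Seq(1, 2, 3, 4, 5)

  seq.exists(p)                                    // stops at the first match
  println(s"exists: $calls predicate calls")       // 1

  calls = 0
  println(seq.filter(p).size > 0)                  // true
  println(s"filter+size: $calls predicate calls")  // 5
}
{code}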



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34118:


Assignee: Apache Spark

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34118:


Assignee: (was: Apache Spark)

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265648#comment-17265648
 ] 

Apache Spark commented on SPARK-34118:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31184

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34117:

Attachment: disable_pushdown.jpg
current.jpg

> Disable LeftSemi/LeftAnti push down over Aggregate
> --
>
> Key: SPARK-34117
> URL: https://issues.apache.org/jira/browse/SPARK-34117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: current.jpg, disable_pushdown.jpg
>
>
> After SPARK-34081, TPC-DS q38 and q87 improve, but the optimizer still cannot
> handle [this case (rewritten from
> q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
> {code:sql}
> SELECT i_item_sk ss_item_sk
>   FROM item,
> (SELECT
>   distinct
>   iss.i_brand_id brand_id,
>   iss.i_class_id class_id,
>   iss.i_category_id category_id
> FROM store_sales, item iss, date_dim d1
> WHERE ss_item_sk = iss.i_item_sk
>   AND ss_sold_date_sk = d1.d_date_sk
>   AND d1.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   ics.i_brand_id,
>   ics.i_class_id,
>   ics.i_category_id
> FROM catalog_sales, item ics, date_dim d2
> WHERE cs_item_sk = ics.i_item_sk
>   AND cs_sold_date_sk = d2.d_date_sk
>   AND d2.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   iws.i_brand_id,
>   iws.i_class_id,
>   iws.i_category_id
> FROM web_sales, item iws, date_dim d3
> WHERE ws_item_sk = iws.i_item_sk
>   AND ws_sold_date_sk = d3.d_date_sk
>   AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
>   WHERE i_brand_id = brand_id
> AND i_class_id = class_id
> AND i_category_id = category_id;
> {code}
> Optimized Logical Plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date

[jira] [Updated] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34117:

Description: 
After SPARK-34081, TPC-DS q38 and q87 improve, but the optimizer still cannot
handle [this case (rewritten from
q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
{code:sql}
SELECT i_item_sk ss_item_sk
  FROM item,
(SELECT
  distinct
  iss.i_brand_id brand_id,
  iss.i_class_id class_id,
  iss.i_category_id category_id
FROM store_sales, item iss, date_dim d1
WHERE ss_item_sk = iss.i_item_sk
  AND ss_sold_date_sk = d1.d_date_sk
  AND d1.d_year BETWEEN 1999 AND 1999 + 2
INTERSECT
SELECT
distinct
  ics.i_brand_id,
  ics.i_class_id,
  ics.i_category_id
FROM catalog_sales, item ics, date_dim d2
WHERE cs_item_sk = ics.i_item_sk
  AND cs_sold_date_sk = d2.d_date_sk
  AND d2.d_year BETWEEN 1999 AND 1999 + 2
INTERSECT
SELECT
distinct
  iws.i_brand_id,
  iws.i_class_id,
  iws.i_category_id
FROM web_sales, item iws, date_dim d3
WHERE ws_item_sk = iws.i_item_sk
  AND ws_sold_date_sk = d3.d_date_sk
  AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
  WHERE i_brand_id = brand_id
AND i_class_id = class_id
AND i_category_id = category_id;
{code}

Optimized Logical Plan:
{noformat}
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
class_id#160)) AND (i_category_id#18 = category_id#161)), 
Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
rowCount=3.69E+5)
   : +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
  +- Aggregate [brand_id#159, class_id#160, category_id#161], 
[brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 
B)
 +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 
<=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
(class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
class_id#160, i_category_id#18 AS category_id#161], 
Statistics(sizeInBytes=2.73E+21 B)
:  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
Statistics(sizeInBytes=3.83E+21 B)
:  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
:  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
Statistics(sizeInBytes=516.5 PiB)
:  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
Statistics(sizeInBytes=61.1 GiB)
:  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
Statistics(sizeInBytes=580.6 GiB)
:  : : : :  +- Project [d_date_sk#52], 
Statistics(sizeInBytes=8.6 KiB, rowCount=731)
:  : : : : +- Filter ((((d_year#58 >= 1999) AND
(d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
Statistics(sizeInBytes=175.6 KiB, rowCount=731)
:  : : : :+- 
Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
:  : : : +- 
Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_a

[jira] [Created] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34117:
---

 Summary: Disable LeftSemi/LeftAnti push down over Aggregate
 Key: SPARK-34117
 URL: https://issues.apache.org/jira/browse/SPARK-34117
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


After SPARK-34081, TPC-DS q38 and q87 improve, but the optimizer still cannot
handle [this case (rewritten from
q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
{code:sql}
SELECT i_item_sk ss_item_sk
  FROM item,
(SELECT
  distinct
  iss.i_brand_id brand_id,
  iss.i_class_id class_id,
  iss.i_category_id category_id
FROM store_sales, item iss, date_dim d1
WHERE ss_item_sk = iss.i_item_sk
  AND ss_sold_date_sk = d1.d_date_sk
  AND d1.d_year BETWEEN 1999 AND 1999 + 2
INTERSECT
SELECT
distinct
  ics.i_brand_id,
  ics.i_class_id,
  ics.i_category_id
FROM catalog_sales, item ics, date_dim d2
WHERE cs_item_sk = ics.i_item_sk
  AND cs_sold_date_sk = d2.d_date_sk
  AND d2.d_year BETWEEN 1999 AND 1999 + 2
INTERSECT
SELECT
distinct
  iws.i_brand_id,
  iws.i_class_id,
  iws.i_category_id
FROM web_sales, item iws, date_dim d3
WHERE ws_item_sk = iws.i_item_sk
  AND ws_sold_date_sk = d3.d_date_sk
  AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
  WHERE i_brand_id = brand_id
AND i_class_id = class_id
AND i_category_id = category_id;
{code}

Optimized Logical Plan:
{noformat}
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
class_id#160)) AND (i_category_id#18 = category_id#161)), 
Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
rowCount=3.69E+5)
   : +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
  +- Aggregate [brand_id#159, class_id#160, category_id#161], 
[brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 
B)
 +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 
<=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
(class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
class_id#160, i_category_id#18 AS category_id#161], 
Statistics(sizeInBytes=2.73E+21 B)
:  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
Statistics(sizeInBytes=3.83E+21 B)
:  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
:  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
Statistics(sizeInBytes=516.5 PiB)
:  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
Statistics(sizeInBytes=61.1 GiB)
:  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
Statistics(sizeInBytes=580.6 GiB)
:  : : : :  +- Project [d_date_sk#52], 
Statistics(sizeInBytes=8.6 KiB, rowCount=731)
:  : : : : +- Filter ((((d_year#58 >= 1999) AND
(d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
Statistics(sizeInBytes=175.6 KiB, rowCount=731)
:  : : : :+- 
Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
:  : : : +- 
Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk
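
For context: Spark's optimizer rewrites a distinct INTERSECT into a LeftSemi
join under a Distinct/Aggregate (ReplaceIntersectWithSemiJoin), which is why
the query above produces the LeftSemi-over-Aggregate pattern this ticket
targets. Roughly, in illustrative plan notation:

{noformat}
a INTERSECT b  =>  Distinct(Join(a, b, LeftSemi, a.col1 <=> b.col1 AND ...))
{noformat}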

[jira] [Created] (SPARK-34118) Replaces filter and check for emptiness with exists or forall.

2021-01-14 Thread Yang Jie (Jira)
Yang Jie created SPARK-34118:


 Summary: Replaces filter and check for emptiness with exists or 
forall.
 Key: SPARK-34118
 URL: https://issues.apache.org/jira/browse/SPARK-34118
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.2.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34118) Replaces filter and check for emptiness with exists or forall.

2021-01-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-34118:
-
Description: 
Semantic consistency and code simplifications

before:
{code:java}
seq.filter(p).size == 0
seq.filter(p).length > 0
seq.filterNot(p).isEmpty
seq.filterNot(p).nonEmpty
{code}
after:
{code:java}
!seq.exists(p)
seq.exists(p)
seq.forall(p)
!seq.forall(p)
{code}
 

> Replaces filter and check for emptiness with exists or forall.
> --
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-34118:
-
Summary: Replaces filter and check for emptiness with exists or forall  
(was: Replaces filter and check for emptiness with exists or forall.)

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34118:
-
Fix Version/s: 3.2.0

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-33790:
-
Priority: Critical  (was: Trivial)

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has the FileStatus when constructing
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can reduce a lot of RPC calls and improve the speed of the history
> server.
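
A simplified Scala sketch of the idea (the class and parameter names here are
illustrative, not Spark's exact signatures): pass the FileStatus that
checkForLogs already holds into the reader, so fileSizeForLastIndex does not
trigger a second getFileStatus RPC:

{code}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

class CachedStatusReader(fs: FileSystem, path: Path,
                         cachedStatus: Option[FileStatus] = None) {
  // Falls back to an RPC only when the caller did not supply a status.
  private lazy val status: FileStatus =
    cachedStatus.getOrElse(fs.getFileStatus(path))

  def fileSizeForLastIndex: Long = status.getLen
}
{code}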



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34119:
---

 Summary: Keep necessary stats after partition pruning
 Key: SPARK-34119
 URL: https://issues.apache.org/jira/browse/SPARK-34119
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


Stats are missing if the filter conditions contain dynamicpruning; we should
keep these stats after partition pruning:
{noformat}
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
class_id#160)) AND (i_category_id#18 = category_id#161)), 
Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
rowCount=3.69E+5)
   : +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
  +- Aggregate [brand_id#159, class_id#160, category_id#161], 
[brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 
B)
 +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 
<=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
(class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
class_id#160, i_category_id#18 AS category_id#161], 
Statistics(sizeInBytes=2.73E+21 B)
:  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
Statistics(sizeInBytes=3.83E+21 B)
:  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
:  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
Statistics(sizeInBytes=516.5 PiB)
:  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
Statistics(sizeInBytes=61.1 GiB)
:  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
Statistics(sizeInBytes=580.6 GiB)
:  : : : :  +- Project [d_date_sk#52], 
Statistics(sizeInBytes=8.6 KiB, rowCount=731)
:  : : : : +- Filter ((((d_year#58 >= 1999) AND
(d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
Statistics(sizeInBytes=175.6 KiB, rowCount=731)
:  : : : :+- 
Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
:  : : : +- 
Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
 parquet, Statistics(sizeInBytes=580.6 GiB)
:  : : +- Project [i_item_sk#7, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
rowCount=3.69E+5)
:  : :+- Filter (((isnotnull(i_brand_id#14) AND 
isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
:  : :   +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, S

[jira] [Commented] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-14 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265657#comment-17265657
 ] 

Yuming Wang commented on SPARK-34119:
-

[~leoluan] Would you like to work on this?

> Keep necessary stats after partition pruning
> 
>
> Key: SPARK-34119
> URL: https://issues.apache.org/jira/browse/SPARK-34119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Stats are missing if the filter conditions contain dynamicpruning; we should
> keep these stats after partition pruning:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
> :  : : : +- 
> Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
>  parquet, Statistics(sizeInBytes=580.6 GiB)
> :  : : +- Project [i_item_sk#7, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
> rowCount=3.69E+5)
> :  : :+- Filter (((isnotnull(i_brand_id#14) AND 
> isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
> isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 

[jira] [Commented] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265631#comment-17265631
 ] 

Yuming Wang commented on SPARK-34117:
-

I'm working on it.

> Disable LeftSemi/LeftAnti push down over Aggregate
> --
>
> Key: SPARK-34117
> URL: https://issues.apache.org/jira/browse/SPARK-34117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: current.jpg, disable_pushdown.jpg
>
>
> SPARK-34081 improves TPC-DS q38 and q87, but it still cannot 
> handle [this case (rewritten from 
> q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
> {code:sql}
> SELECT i_item_sk ss_item_sk
>   FROM item,
> (SELECT
>   distinct
>   iss.i_brand_id brand_id,
>   iss.i_class_id class_id,
>   iss.i_category_id category_id
> FROM store_sales, item iss, date_dim d1
> WHERE ss_item_sk = iss.i_item_sk
>   AND ss_sold_date_sk = d1.d_date_sk
>   AND d1.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   ics.i_brand_id,
>   ics.i_class_id,
>   ics.i_category_id
> FROM catalog_sales, item ics, date_dim d2
> WHERE cs_item_sk = ics.i_item_sk
>   AND cs_sold_date_sk = d2.d_date_sk
>   AND d2.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   iws.i_brand_id,
>   iws.i_class_id,
>   iws.i_category_id
> FROM web_sales, item iws, date_dim d3
> WHERE ws_item_sk = iws.i_item_sk
>   AND ws_sold_date_sk = d3.d_date_sk
>   AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
>   WHERE i_brand_id = brand_id
> AND i_class_id = class_id
> AND i_category_id = category_id;
> {code}
> Optimized Logical Plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#

[jira] [Created] (SPARK-34120) Improve the statistics estimation

2021-01-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34120:
---

 Summary: Improve the statistics estimation
 Key: SPARK-34120
 URL: https://issues.apache.org/jira/browse/SPARK-34120
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


This umbrella ticket aims to track improvements in statistics estimation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33959:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Improve the statistics estimation of the Tail
> -
>
> Key: SPARK-33959
> URL: https://issues.apache.org/jira/browse/SPARK-33959
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
> e").write.saveAsTable("t1")
> println(Tail(Literal(5), spark.sql("SELECT * FROM 
> t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=3.8 KiB)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
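A minimal sketch of the intended estimation, assuming Tail clamps the child's row count to the limit and scales sizeInBytes by the average row size (hypothetical helper with illustrative numbers, not the actual patch):

{code:scala}
// Returns (rowCount, sizeInBytes) for Tail given the child's statistics.
def tailStats(limit: BigInt, childRows: BigInt, childSize: BigInt): (BigInt, BigInt) = {
  val rows = childRows.min(limit)
  // derive an average row size from the child's statistics
  val avgRowSize = if (childRows > 0) childSize / childRows else childSize
  (rows, (rows * avgRowSize).max(1))
}

// With 100 child rows of ~40 bytes each: tailStats(5, 100, 4000) == (5, 200),
// i.e. sizeInBytes=200.0 B and rowCount=5, matching the expected plan above.
{code}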



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34031) Union operator missing rowCount when enable CBO

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34031:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Union operator missing rowCount when enable CBO
> ---
>
> Key: SPARK-34031
> URL: https://issues.apache.org/jira/browse/SPARK-34031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("CREATE TABLE t1 USING parquet AS SELECT id FROM RANGE(10)")
> spark.sql("CREATE TABLE t2 USING parquet AS SELECT id FROM RANGE(10)")
> spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Union false, false, Statistics(sizeInBytes=320.0 B)
> :- Relation[id#5880L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
> +- Relation[id#5881L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
> {noformat}
> Expected
> {noformat}
> == Optimized Logical Plan ==
> Union false, false, Statistics(sizeInBytes=320.0 B, rowCount=20)
> :- Relation[id#2138L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
> +- Relation[id#2139L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
> {noformat}
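A minimal sketch of the missing propagation, assuming Union can sum child row counts whenever every child defines one (hypothetical helper, not Spark's API):

{code:scala}
// Union's rowCount: the sum of the children's row counts, if all are known.
def unionRowCount(children: Seq[Option[BigInt]]): Option[BigInt] =
  if (children.forall(_.isDefined)) Some(children.flatten.sum) else None

// unionRowCount(Seq(Some(BigInt(10)), Some(BigInt(10)))) == Some(20)
// unionRowCount(Seq(Some(BigInt(10)), None)) == None
{code}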



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33954) Some operator missing rowCount when enable CBO

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33954:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Some operator missing rowCount when enable CBO
> --
>
> Key: SPARK-33954
> URL: https://issues.apache.org/jira/browse/SPARK-33954
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Some operators are missing rowCount when CBO is enabled, for example:
> {code:scala}
> spark.range(1000).selectExpr("id as a", "id as b").write.saveAsTable("t1")
> spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql("set spark.sql.cbo.planStats.enabled=true")
> spark.sql("select * from (select * from t1 distribute by a limit 100) 
> distribute by b").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB)
> +- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
>+- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB)
>   +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB)
>  +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 
> KiB, rowCount=1.00E+3)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB, 
> rowCount=100)
> +- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
>+- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
>   +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB, 
> rowCount=1.00E+3)
>  +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 
> KiB, rowCount=1.00E+3)
> {noformat}
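A minimal sketch of the idea, assuming row-preserving operators can simply forward the child's rowCount (hypothetical types, not Spark's API):

{code:scala}
// Hypothetical stats record for illustration.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

// RepartitionByExpression only reshuffles rows, so it can forward child stats.
def repartitionStats(child: Stats): Stats = child

// LocalLimit can keep the child's rowCount as an upper-bound estimate, as in
// the expected plan above; GlobalLimit then applies the limit itself.
def localLimitStats(limit: BigInt, child: Stats): Stats = child
{code}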



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33956) Add rowCount for Range operator

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33956:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Add rowCount for Range operator
> ---
>
> Key: SPARK-33956
> URL: https://issues.apache.org/jira/browse/SPARK-33956
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql("select id from range(100)").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, 
> rowCount=100)
> {noformat}
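A minimal sketch: Range knows its exact cardinality, so rowCount can be derived from start, end, and step (positive steps shown for brevity; hypothetical helper, not Spark's API):

{code:scala}
// Number of values produced by [start, end) with a positive step.
def rangeRowCount(start: Long, end: Long, step: Long): BigInt = {
  require(step > 0, "sketch handles positive steps only")
  ((BigInt(end) - BigInt(start) + step - 1) / step).max(0)
}

// rangeRowCount(0, 100, 1) == 100, matching the expected plan above.
{code}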



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34119:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Keep necessary stats after partition pruning
> 
>
> Key: SPARK-34119
> URL: https://issues.apache.org/jira/browse/SPARK-34119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Stats are missing if the filter condition contains dynamic pruning; we should 
> keep these stats after partition pruning:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
> :  : : : +- 
> Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
>  parquet, Statistics(sizeInBytes=580.6 GiB)
> :  : : +- Project [i_item_sk#7, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
> rowCount=3.69E+5)
> :  : :+- Filter (((isnotnull(i_brand_id#14) AND 
> isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
> isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
>

[jira] [Assigned] (SPARK-34099) Refactor table caching in `DataSourceV2Strategy`

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34099:
---

Assignee: Maxim Gekk

> Refactor table caching in `DataSourceV2Strategy`
> 
>
> Key: SPARK-34099
> URL: https://issues.apache.org/jira/browse/SPARK-34099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, invalidateCache() performs 3 calls to the Cache Manager to refresh 
> the table cache. We can perform a single call to recacheByPlan() instead.
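A hedged sketch of what a single-call refresh could look like (not the actual patch; `session` and `relation` are assumed to be in scope):

{code:scala}
// One round trip to the Cache Manager instead of three separate calls.
val cacheManager = session.sharedState.cacheManager
cacheManager.recacheByPlan(session, relation)
{code}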



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34099) Refactor table caching in `DataSourceV2Strategy`

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34099.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31172
[https://github.com/apache/spark/pull/31172]

> Refactor table caching in `DataSourceV2Strategy`
> 
>
> Key: SPARK-34099
> URL: https://issues.apache.org/jira/browse/SPARK-34099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, invalidateCache() performs 3 calls to the Cache Manager to refresh 
> the table cache. We can perform a single call to recacheByPlan() instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34121) Intersect operator missing rowCount when CBO enabled

2021-01-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34121:
---

 Summary: Intersect operator missing rowCount when CBO enabled
 Key: SPARK-34121
 URL: https://issues.apache.org/jira/browse/SPARK-34121
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265662#comment-17265662
 ] 

Jungtaek Lim commented on SPARK-33790:
--

I've revisited this and realized it is a performance regression for event log 
v1 (SPARK-28869 caused the regression).

I'll submit PRs for the branches below. This should be fixed in 3.1.x / 3.0.x as 
well.

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has FileStatus when constructing 
> SingleFileEventLogFileReader, and there is no need to get the FileStatus 
> again when calling SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can reduce a lot of rpc calls and improve the speed of the history 
> server.
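A minimal sketch of the idea, simplified from the description (hypothetical class, not the actual Spark source):

{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem}

// Keep the FileStatus that FsHistoryProvider#checkForLogs already fetched,
// instead of issuing a second getFileStatus RPC per event log.
class SingleFileReaderSketch(fs: FileSystem, status: FileStatus) {
  // before: fs.getFileStatus(status.getPath).getLen  -- an extra RPC
  def fileSizeForLastIndex: Long = status.getLen
}
{code}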



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34111) Deconflict the jars jakarta.servlet-api-4.0.3.jar and javax.servlet-api-3.1.0.jar

2021-01-14 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265666#comment-17265666
 ] 

Kent Yao commented on SPARK-34111:
--

I have found the problem, LOL

In Hadoop 3.2 (*should be*):

<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>

In Hadoop 2.7:

<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>

> Deconflict the jars jakarta.servlet-api-4.0.3.jar and 
> javax.servlet-api-3.1.0.jar
> -
>
> Key: SPARK-34111
> URL: https://issues.apache.org/jira/browse/SPARK-34111
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> After SPARK-33705, we now happened to have two jars in the release artifact 
> with Hadoop 3:
> {{dev/deps/spark-deps-hadoop-3.2-hive-2.3}}:
> {code}
> ...
> jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
> ...
> javax.servlet-api/3.1.0//javax.servlet-api-3.1.0.jar
> ...
> {code}
> It can potentially cause an issue, and we had better remove 
> {{javax.servlet-api-3.1.0.jar}}, which is apparently only required for YARN 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34111) Deconflict the jars jakarta.servlet-api-4.0.3.jar and javax.servlet-api-3.1.0.jar

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265668#comment-17265668
 ] 

Apache Spark commented on SPARK-34111:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31185

> Deconflict the jars jakarta.servlet-api-4.0.3.jar and 
> javax.servlet-api-3.1.0.jar
> -
>
> Key: SPARK-34111
> URL: https://issues.apache.org/jira/browse/SPARK-34111
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> After SPARK-33705, we now happened to have two jars in the release artifact 
> with Hadoop 3:
> {{dev/deps/spark-deps-hadoop-3.2-hive-2.3}}:
> {code}
> ...
> jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
> ...
> javax.servlet-api/3.1.0//javax.servlet-api-3.1.0.jar
> ...
> {code}
> It can potentially cause an issue, and we had better remove 
> {{javax.servlet-api-3.1.0.jar}}, which is apparently only required for YARN 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34111) Deconflict the jars jakarta.servlet-api-4.0.3.jar and javax.servlet-api-3.1.0.jar

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34111:


Assignee: Apache Spark

> Deconflict the jars jakarta.servlet-api-4.0.3.jar and 
> javax.servlet-api-3.1.0.jar
> -
>
> Key: SPARK-34111
> URL: https://issues.apache.org/jira/browse/SPARK-34111
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Blocker
>
> After SPARK-33705, we now happened to have two jars in the release artifact 
> with Hadoop 3:
> {{dev/deps/spark-deps-hadoop-3.2-hive-2.3}}:
> {code}
> ...
> jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
> ...
> javax.servlet-api/3.1.0//javax.servlet-api-3.1.0.jar
> ...
> {code}
> It can potentially cause an issue, and we had better remove 
> {{javax.servlet-api-3.1.0.jar}}, which is apparently only required for YARN 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34111) Deconflict the jars jakarta.servlet-api-4.0.3.jar and javax.servlet-api-3.1.0.jar

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34111:


Assignee: (was: Apache Spark)

> Deconflict the jars jakarta.servlet-api-4.0.3.jar and 
> javax.servlet-api-3.1.0.jar
> -
>
> Key: SPARK-34111
> URL: https://issues.apache.org/jira/browse/SPARK-34111
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> After SPARK-33705, we now happened to have two jars in the release artifact 
> with Hadoop 3:
> {{dev/deps/spark-deps-hadoop-3.2-hive-2.3}}:
> {code}
> ...
> jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
> ...
> javax.servlet-api/3.1.0//javax.servlet-api-3.1.0.jar
> ...
> {code}
> It can potentially cause an issue, and we had better remove 
> {{javax.servlet-api-3.1.0.jar}}, which is apparently only required for YARN 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265674#comment-17265674
 ] 

Hyukjin Kwon commented on SPARK-12890:
--

[~nchammas], shall we file a new JIRA? Your description looks much better than 
the current one, and easier to follow.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.
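A hedged illustration: older Spark versions expose a metadata-only optimization that can answer partition-column-only aggregates from catalog metadata; the config below exists in those versions but is deprecated in newer ones, so treat this as a sketch rather than a recommendation:

{code:scala}
// May answer max(date) from partition metadata instead of scanning files.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
spark.sql("SELECT max(date) FROM table").explain()
{code}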



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34101) Make spark-sql CLI configurable for the behavior of printing header by SET command

2021-01-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34101.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31173
[https://github.com/apache/spark/pull/31173]

> Make spark-sql CLI configurable for the behavior of printing header by SET 
> command
> --
>
> Key: SPARK-34101
> URL: https://issues.apache.org/jira/browse/SPARK-34101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> Like the Hive CLI, the spark-sql CLI accepts the hive.cli.print.header 
> property, which controls whether the header is printed.
> But the spark-sql CLI doesn't allow users to change Hive-specific 
> configurations dynamically via the SET command.
> So, it's better to support changing this behavior via the SET command.
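A hedged sketch of the intended usage once configurable (the property name comes from the description; exact CLI behavior may differ):

{code:scala}
// In the spark-sql CLI this would be typed directly as:
//   SET hive.cli.print.header=true;
// The programmatic equivalent:
spark.sql("SET hive.cli.print.header=true")
{code}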



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265679#comment-17265679
 ] 

Apache Spark commented on SPARK-33790:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31186

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has FileStatus when constructing 
> SingleFileEventLogFileReader, and there is no need to get the FileStatus 
> again when calling SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can reduce a lot of rpc calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34114) should not trim right for read-side char length check and padding

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34114:
---

Assignee: Kent Yao

> should not trim right for read-side char length check and padding
> -
>
> Key: SPARK-34114
> URL: https://issues.apache.org/jira/browse/SPARK-34114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34114) should not trim right for read-side char length check and padding

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34114.
-
Fix Version/s: 3.1.1
   Resolution: Fixed

Issue resolved by pull request 31181
[https://github.com/apache/spark/pull/31181]

> should not trim right for read-side char length check and padding
> -
>
> Key: SPARK-34114
> URL: https://issues.apache.org/jira/browse/SPARK-34114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Critical
> Fix For: 3.1.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265691#comment-17265691
 ] 

dzcxzl commented on SPARK-33790:


This is indeed a performance regression problem.

The following is my case: in the 2.x version, EventLoggingListener.codecMap is 
of type mutable.HashMap, which is not thread-safe and may hang.

The 3.x version changed this to EventLogFileReader.codecMap of type 
ConcurrentHashMap.

In the 2.x version, the history server may not work.

I tried the 3.x version and found that a round of scanning slowed down a lot: 
7 min rose to about 23 min.

In addition, do I need to fix the thread-safety issue in the 2.x version?

[~kabhwan]
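For illustration, a minimal sketch of the thread-safety difference described above (hypothetical cache, not the actual Spark source):

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

// 2.x-style: a shared mutable.HashMap is not thread-safe; concurrent
// getOrElseUpdate calls can corrupt its internal state and hang.
val unsafeCodecMap = mutable.HashMap.empty[String, String]

// 3.x-style: ConcurrentHashMap tolerates concurrent access safely.
val safeCodecMap = new ConcurrentHashMap[String, String]()
safeCodecMap.computeIfAbsent("lz4", codec => s"codec-for-$codec")
{code}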

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has FileStatus when constructing 
> SingleFileEventLogFileReader, and there is no need to get the FileStatus 
> again when calling SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can reduce a lot of rpc calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265693#comment-17265693
 ] 

Apache Spark commented on SPARK-33790:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31187

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has FileStatus when constructing 
> SingleFileEventLogFileReader, and there is no need to get the FileStatus 
> again when calling SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can reduce a lot of rpc calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265695#comment-17265695
 ] 

Apache Spark commented on SPARK-33790:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31187

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has FileStatus when constructing 
> SingleFileEventLogFileReader, and there is no need to get the FileStatus 
> again when calling SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can reduce a lot of rpc calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265697#comment-17265697
 ] 

Apache Spark commented on SPARK-34118:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31188

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
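A quick equivalence check of the rewrites (sketch; exists/forall also short-circuit and avoid building an intermediate collection):

{code:scala}
val seq = Seq(1, 2, 3)
val p = (x: Int) => x > 2

assert((seq.filter(p).size == 0) == !seq.exists(p))
assert((seq.filter(p).length > 0) == seq.exists(p))
assert(seq.filterNot(p).isEmpty == seq.forall(p))
assert(seq.filterNot(p).nonEmpty == !seq.forall(p))
{code}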
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34095) Spark 3.0.1 is giving WARN while running job

2021-01-14 Thread Sachit Murarka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachit Murarka updated SPARK-34095:
---
Priority: Minor  (was: Major)

> Spark 3.0.1 is giving WARN while running job
> 
>
> Key: SPARK-34095
> URL: https://issues.apache.org/jira/browse/SPARK-34095
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.1
>Reporter: Sachit Murarka
>Priority: Minor
>
> I am running Spark 3.0.1 with Java 11 and get the following WARN messages:
>  
> WARNING: An illegal reflective access operation has occurred
>  WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar)
>  to constructor java.nio.DirectByteBuffer(long,int)
>  WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
>  WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
>  WARNING: All illegal access operations will be denied in a future release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34122) Remove duplicated branches in case when

2021-01-14 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-34122:
--

 Summary: Remove duplicated branches in case when
 Key: SPARK-34122
 URL: https://issues.apache.org/jira/browse/SPARK-34122
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Lantao Jin


CaseWhen with duplicated branches could be deduplicated:
{code}
SELECT CASE WHEN key = 1 THEN 1 WHEN key = 1 THEN 1 WHEN key = 1 THEN 1
ELSE 2 END FROM testData WHERE key = 1 group by key
{code}
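A minimal sketch of the dedup idea (my reading of the ticket, not the actual rule): since the first matching WHEN branch wins, a later duplicate of an earlier (condition, value) pair is unreachable and can be dropped:

{code:scala}
// Keep the first occurrence of each (condition, value) branch, preserving order.
def dedupBranches[A, B](branches: Seq[(A, B)]): Seq[(A, B)] = branches.distinct

// dedupBranches(Seq(("key=1", 1), ("key=1", 1), ("key=1", 1))) == Seq(("key=1", 1))
{code}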



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


