[jira] [Updated] (SPARK-35502) Spark Executor metrics are not produced/showed
[ https://issues.apache.org/jira/browse/SPARK-35502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mati updated SPARK-35502: - Description: Recently we enabled the prometheusServlet configuration in order to have Spark master, worker, driver and executor metrics. We can see and use the Spark master, worker and driver metrics, but we can't see the Spark executor metrics. We are running a Spark Streaming standalone cluster, version 3.0.1, on physical servers. We took one of our jobs and added the following parameters to the job configuration, but couldn't see executor metrics when curling both the driver and the executor workers of this job. These are the parameters:
--conf spark.ui.prometheus.enabled=true \
--conf spark.executor.processTreeMetrics.enabled=true
Curl commands:
[00764f](root@sparktest-40005-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[00764f](root@sparktest-40005-prod-chidc2:~)#
Driver of this job - sparktest-40004:
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
Our UI port is 4050. I understand that the executor Prometheus endpoint is still experimental, which may explain the inconsistent behaviour we see, but is there a plan to fix it? Are there any known issues regarding this?

was: Recently we enabled the prometheusServlet configuration in order to have Spark master, worker, driver and executor metrics. We can see and use the Spark master, worker and driver metrics, but we can't see the Spark executor metrics. We are running a Spark Streaming standalone cluster, version 3.0.1, on physical servers.
We took one of our jobs and added the following parameters to the job configuration, but couldn't see executor metrics when curling both the driver and the executor workers of this job. These are the parameters:
--conf spark.ui.prometheus.enabled=true \
--conf spark.executor.processTreeMetrics.enabled=true
Curl commands:
[00764f](root@sparktest-40005-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[00764f](root@sparktest-40005-prod-chidc2:~)#
Driver of this job - sparktest-40004:
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
Out UI port is on 4050. I understood that the executor Prometheus endpoint is still experimental, which may explain the inconsistent behaviour we see, but is there a plan to fix it? Are there any known issues regarding this?
Environment: metrics.properties
{code:java}
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus

## Below may be removed after finalizing the native Prometheus implementation
#
# Enable Prometheus for driver
#
driver.sink.prometheus_chidc2.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
driver.sink.prometheus_chidc2.report-instance-id=false
# Prometheus pushgateway address
driver.sink.prometheus_chidc2.pushgateway-address-protocol=http
driver.sink.prometheus_chidc2.pushgateway-address=pushgateway-spark-master-staging-pod.service.chidc2.consul:9091
driver.sink.prometheus_chidc2.period=60
driver.sink.prometheus_chidc2.pushgateway-enable-timestamp=false
driver.sink.prometheus_chidc2.labels=cluster_name=apache-test-v3,datacenter=chidc2
driver.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001-prod-chidc2.chidc2.outbrain.com
driver.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
driver.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
driver.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops
{code}
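The metrics-name-capture-regex / metrics-name-replacement pair in the sink configuration above can be sanity-checked outside Spark. A rough sketch with Python's re module against one made-up metric name (the Banzai Cloud sink's exact matching semantics may differ):

```python
import re

# First capture pattern from the sink config above; the metric name
# below is a hypothetical example in the shape the pattern expects.
capture = re.compile(r"application_(\S+)_([0-9]+)_cores")
metric = "application_myapp_1621900000_cores"

m = capture.fullmatch(metric)
# The replacement rewrites to a stable __name__ plus a start_time label,
# which is what makes the metric queryable across application restarts.
renamed = f"application_{m.group(1)}_cores"
start_time = m.group(2)
print(renamed, start_time)  # application_myapp_cores 1621900000
```

Note that the greedy `\S+` still yields the application name alone, because the regex engine backtracks until the trailing `_([0-9]+)_cores` can match.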
[jira] [Assigned] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-35506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35506: Assignee: Hyukjin Kwon (was: Apache Spark) > Run tests with Python 3.9 in GitHub Actions > --- > > Key: SPARK-35506 > URL: https://issues.apache.org/jira/browse/SPARK-35506 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We're currently running PySpark tests with Python 3.8. We should run it with > Python 3.9 to verify the latest python support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-35506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350783#comment-17350783 ] Apache Spark commented on SPARK-35506: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32657 > Run tests with Python 3.9 in GitHub Actions > --- > > Key: SPARK-35506 > URL: https://issues.apache.org/jira/browse/SPARK-35506 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We're currently running PySpark tests with Python 3.8. We should run it with > Python 3.9 to verify the latest python support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-35506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35506: Assignee: Apache Spark (was: Hyukjin Kwon) > Run tests with Python 3.9 in GitHub Actions > --- > > Key: SPARK-35506 > URL: https://issues.apache.org/jira/browse/SPARK-35506 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > We're currently running PySpark tests with Python 3.8. We should run it with > Python 3.9 to verify the latest python support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35499) Apply black to pandas API on Spark codes.
[ https://issues.apache.org/jira/browse/SPARK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-35499: Parent: SPARK-34849 Issue Type: Sub-task (was: Improvement) > Apply black to pandas API on Spark codes. > - > > Key: SPARK-35499 > URL: https://issues.apache.org/jira/browse/SPARK-35499 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > To make static analysis easier and more efficient, we should apply > `black` to the pandas API on Spark. > The Koalas project uses black in its [reformatting > script|https://github.com/databricks/koalas/blob/master/dev/reformat]. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35433) Move CSV data source options from Python and Scala into a single page.
[ https://issues.apache.org/jira/browse/SPARK-35433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350771#comment-17350771 ] Haejoon Lee commented on SPARK-35433: - I'm working on this > Move CSV data source options from Python and Scala into a single page. > -- > > Key: SPARK-35433 > URL: https://issues.apache.org/jira/browse/SPARK-35433 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Refer to https://issues.apache.org/jira/browse/SPARK-34491 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350769#comment-17350769 ] Hyukjin Kwon commented on SPARK-35504: -- Just a wild guess but: {quote} count(DISTINCT expr[, expr...]) {quote} doesn't count NULLs butL {quote} count(*) from (select distinct * from storage_datamart.olympiads) {quote} counts nulls? > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. 
> > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) > {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade > having count(*) = 1 > ) > ; -- Rows: 36901365 > {code} > > The table DDL is the following: > {code:sql} > CREATE TABLE `storage_datamart`.`olympiads` ( > `ptn_date` DATE, > `student_id` BIGINT, > `olympiad_id` STRING, > `grade` BIGINT, > `grade_type` STRING, >
[jira] [Comment Edited] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350769#comment-17350769 ] Hyukjin Kwon edited comment on SPARK-35504 at 5/25/21, 3:50 AM: Just a wild guess but: {quote} count(DISTINCT expr[, expr...]) {quote} doesn't count NULLs but: {quote} count\(*\) from (select distinct * from storage_datamart.olympiads) {quote} counts nulls? was (Author: hyukjin.kwon): Just a wild guess but: {quote} count(DISTINCT expr[, expr...]) {quote} doesn't count NULLs butL {quote} count(*) from (select distinct * from storage_datamart.olympiads) {quote} counts nulls? > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. 
> > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) > {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade >
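The NULL hypothesis in the comment above can be reproduced outside Spark. A minimal sketch using Python's built-in sqlite3 (table, columns, and values are made up; SQLite has no count(distinct *), so a single nullable column stands in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE olympiads (student_id INTEGER, grade INTEGER)")
# Three distinct rows, one of which is distinct only via a NULL column.
cur.executemany("INSERT INTO olympiads VALUES (?, ?)",
                [(1, 10), (1, None), (2, 20)])

# COUNT(DISTINCT col) ignores NULLs entirely.
distinct_col = cur.execute(
    "SELECT COUNT(DISTINCT grade) FROM olympiads").fetchone()[0]

# COUNT(*) over a DISTINCT subquery keeps rows that contain NULLs.
distinct_rows = cur.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM olympiads)").fetchone()[0]

print(distinct_col)   # 2 -- the NULL grade is not counted
print(distinct_rows)  # 3 -- the row with the NULL grade is kept
```

If the Spark table has NULLs in any column, the same asymmetry would explain why the two query forms disagree.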
[jira] [Closed] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query perf
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu closed SPARK-35500. - > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # Create a new table with an array-type column: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell and run a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but its code cannot be reused, because the ids of two variables in > the generated class change each time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has > been cached, the new code does not match the cache key and must be compiled > again, which costs time. Compilation time increases with the number of > columns; for a wide table this cost can exceed 2s. 
> {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue1; > private boolean MapObjects_loopIsNull1; > private UTF8String MapObjects_loopValue2; > private boolean MapObjects_loopIsNull2; > } > {code} > Second time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue3; > private boolean MapObjects_loopIsNull3; > private UTF8String MapObjects_loopValue4; > private boolean MapObjects_loopIsNull4; > }{code} > Expectation: > The code generated by GenerateSafeProjection can be reused if the query is > same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
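The cache-miss mechanism described in the issue can be sketched outside Spark. This is a hypothetical Python stand-in for a source-keyed compilation cache (Spark's real cache lives in CodeGenerator and compiles with Janino); it only illustrates why embedding a fresh counter in the generated source defeats the cache:

```python
import itertools

_ids = itertools.count()  # mimics MapObjects' AtomicInteger curId

def generate_source(reuse_ids: bool) -> str:
    # With reuse_ids=False, every call bakes a fresh id into the source,
    # like MapObjects_loopValue$id in the generated SpecificSafeProjection.
    i = 0 if reuse_ids else next(_ids)
    return f"class SpecificSafeProjection:\n    loop_value_{i} = None\n"

cache = {}

def compile_cached(src: str):
    # Cache keyed on the generated source text, as a codegen cache would be.
    if src not in cache:
        cache[src] = compile(src, "<generated>", "exec")
    return cache[src]

# Per-call ids: the "same" query produces different source every time,
# so the cache never hits and each call recompiles.
compile_cached(generate_source(reuse_ids=False))
compile_cached(generate_source(reuse_ids=False))
print(len(cache))  # 2

# Stable ids: identical source, so the second call is a cache hit.
cache.clear()
compile_cached(generate_source(reuse_ids=True))
compile_cached(generate_source(reuse_ids=True))
print(len(cache))  # 1
```

SPARK-27871, which the reporter identifies as the fix, makes the generated names stable so that the second case applies.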
[jira] [Created] (SPARK-35507) Move Python 3.9 installation to the docker image for GitHub Actions
Hyukjin Kwon created SPARK-35507: Summary: Move Python 3.9 installation to the docker image for GitHub Actions Key: SPARK-35507 URL: https://issues.apache.org/jira/browse/SPARK-35507 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.2.0 Reporter: Hyukjin Kwon SPARK-35506 added Python 3.9 support but it had to install it manually. The installed packages and Python versions should go into the docker image. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions
Hyukjin Kwon created SPARK-35506: Summary: Run tests with Python 3.9 in GitHub Actions Key: SPARK-35506 URL: https://issues.apache.org/jira/browse/SPARK-35506 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon We're currently running PySpark tests with Python 3.8. We should run it with Python 3.9 to verify the latest python support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35497) Enable plotly tests in pandas-on-Spark
[ https://issues.apache.org/jira/browse/SPARK-35497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35497. -- Fix Version/s: 3.2.0 Assignee: Hyukjin Kwon Resolution: Fixed Fixed in https://github.com/apache/spark/pull/32649 > Enable plotly tests in pandas-on-Spark > -- > > Key: SPARK-35497 > URL: https://issues.apache.org/jira/browse/SPARK-35497 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.2.0 > > > Currently, GitHub Actions doesn't have plotly installed and results in plotly > related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query pe
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu resolved SPARK-35500. --- Resolution: Duplicate > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # Create a new table with an array-type column: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell and run a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but its code cannot be reused, because the ids of two variables in > the generated class change each time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has > been cached, the new code does not match the cache key and must be compiled > again, which costs time. Compilation time increases with the number of > columns; for a wide table this cost can exceed 2s. 
> {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue1; > private boolean MapObjects_loopIsNull1; > private UTF8String MapObjects_loopValue2; > private boolean MapObjects_loopIsNull2; > } > {code} > Second time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue3; > private boolean MapObjects_loopIsNull3; > private UTF8String MapObjects_loopValue4; > private boolean MapObjects_loopIsNull4; > }{code} > Expectation: > The code generated by GenerateSafeProjection can be reused if the query is > same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Affects Version/s: (was: 3.1.0) 2.4.5 > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # Create a new table with an array-type column: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell and run a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but its code cannot be reused, because the ids of two variables in > the generated class change each time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has > been cached, the new code does not match the cache key and must be compiled > again, which costs time. Compilation time increases with the number of > columns; for a wide table this cost can exceed 2s. 
> {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue1; > private boolean MapObjects_loopIsNull1; > private UTF8String MapObjects_loopValue2; > private boolean MapObjects_loopIsNull2; > } > {code} > Second time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue3; > private boolean MapObjects_loopIsNull3; > private UTF8String MapObjects_loopValue4; > private boolean MapObjects_loopIsNull4; > }{code} > Expectation: > The code generated by GenerateSafeProjection can be reused if the query is > same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query p
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350746#comment-17350746 ] Yahui Liu commented on SPARK-35500: --- [~maropu] Hi, sorry for the incorrect information: I thought I was using 3.1.1, but the jars I am using are actually compiled from 2.4.5 code. That is why this issue exists; SPARK-27871 is the solution I want, so I can close this Jira. Thanks. > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # Create a new table with an array-type column: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell and run a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but its code cannot be reused, because the ids of two variables in > the generated class change each time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has > been cached, the new code does not match the cache key and must be compiled > again, which costs time. Compilation time increases with the number of > columns; for a wide table this cost can exceed 2s. 
> {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue1; > private boolean MapObjects_loopIsNull1; > private UTF8String MapObjects_loopValue2; > private boolean MapObjects_loopIsNull2; > } > {code} > Second time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue3; > private boolean MapObjects_loopIsNull3; > private UTF8String MapObjects_loopValue4; > private boolean MapObjects_loopIsNull4; > }{code} > Expectation: > The code generated by GenerateSafeProjection can be reused if the query is > same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23101) Migrate unit test sinks
[ https://issues.apache.org/jira/browse/SPARK-23101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-23101: - Labels: (was: bulk-closed) > Migrate unit test sinks > --- > > Key: SPARK-23101 > URL: https://issues.apache.org/jira/browse/SPARK-23101 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major >
[jira] [Updated] (SPARK-23100) Migrate unit test sources
[ https://issues.apache.org/jira/browse/SPARK-23100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-23100: - Labels: (was: bulk-closed) > Migrate unit test sources > - > > Key: SPARK-23100 > URL: https://issues.apache.org/jira/browse/SPARK-23100 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major >
[jira] [Updated] (SPARK-23100) Migrate unit test sources
[ https://issues.apache.org/jira/browse/SPARK-23100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-23100: - Labels: bulk-closed (was: ) > Migrate unit test sources > - > > Key: SPARK-23100 > URL: https://issues.apache.org/jira/browse/SPARK-23100 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > Labels: bulk-closed >
[jira] [Updated] (SPARK-23101) Migrate unit test sinks
[ https://issues.apache.org/jira/browse/SPARK-23101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-23101: - Labels: bulk-closed (was: ) > Migrate unit test sinks > --- > > Key: SPARK-23101 > URL: https://issues.apache.org/jira/browse/SPARK-23101 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > Labels: bulk-closed >
[jira] [Resolved] (SPARK-35498) Add an API "inheritable_thread_target" which returns a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35498. -- Fix Version/s: 3.2.0 Assignee: Weichen Xu Resolution: Fixed Fixed in https://github.com/apache/spark/pull/32644 > Add an API "inheritable_thread_target" which returns a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.2.0 > > > In PySpark, a user may create threads not via a `Thread` object, but via a > parallel helper function such as `thread_pool.imap_unordered`. > In this case, we need to create a function wrapper used in pin thread mode. Before > calling the original thread target, the wrapper inherits the > properties specific to the JVM thread, such as > `InheritableThreadLocal`; after the original thread target returns, it > garbage-collects the Python thread instance and also closes the connection, > which finishes the JVM thread correctly.
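A rough sketch, in plain Python, of what such a wrapper does (the names `wrap_target` and `inherited_props` are hypothetical illustrations, not the actual PySpark API):

```python
import threading

def wrap_target(target, inherited_props):
    """Hypothetical sketch: wrap a thread target so that properties captured
    in the parent thread are visible inside the child, then cleaned up."""
    def wrapped(*args, **kwargs):
        # Before the original target runs, install the inherited state
        # (PySpark copies JVM InheritableThreadLocal properties at this point).
        threading.current_thread().props = dict(inherited_props)
        try:
            return target(*args, **kwargs)
        finally:
            # After the target returns, drop the per-thread state
            # (PySpark additionally closes the connection to finish the JVM thread).
            del threading.current_thread().props
    return wrapped

results = []

def job(x):
    # Reads state that only exists because the wrapper installed it.
    results.append((x, threading.current_thread().props["tag"]))

t = threading.Thread(target=wrap_target(job, {"tag": "parent"}), args=(1,))
t.start()
t.join()
```

The key point is that the wrapper, not the caller, owns both the setup and the teardown, so it works even when the thread is created indirectly by a pool helper.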
[jira] [Commented] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
[ https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350706#comment-17350706 ] Apache Spark commented on SPARK-35505: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/32656 > Remove APIs that have been deprecated in Koalas. > > > Key: SPARK-35505 > URL: https://issues.apache.org/jira/browse/SPARK-35505 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some APIs that have been deprecated in Koalas. We shouldn't have > those in pandas APIs on Spark.
[jira] [Assigned] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
[ https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35505: Assignee: (was: Apache Spark) > Remove APIs that have been deprecated in Koalas. > > > Key: SPARK-35505 > URL: https://issues.apache.org/jira/browse/SPARK-35505 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some APIs that have been deprecated in Koalas. We shouldn't have > those in pandas APIs on Spark.
[jira] [Assigned] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
[ https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35505: Assignee: Apache Spark > Remove APIs that have been deprecated in Koalas. > > > Key: SPARK-35505 > URL: https://issues.apache.org/jira/browse/SPARK-35505 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > There are some APIs that have been deprecated in Koalas. We shouldn't have > those in pandas APIs on Spark.
[jira] [Commented] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
[ https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350692#comment-17350692 ] Takuya Ueshin commented on SPARK-35505: --- I'm working on this. > Remove APIs that have been deprecated in Koalas. > > > Key: SPARK-35505 > URL: https://issues.apache.org/jira/browse/SPARK-35505 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some APIs that have been deprecated in Koalas. We shouldn't have > those in pandas APIs on Spark.
[jira] [Created] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
Takuya Ueshin created SPARK-35505: - Summary: Remove APIs that have been deprecated in Koalas. Key: SPARK-35505 URL: https://issues.apache.org/jira/browse/SPARK-35505 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Takuya Ueshin There are some APIs that have been deprecated in Koalas. We shouldn't have those in pandas APIs on Spark.
[jira] [Updated] (SPARK-35452) Introduce ArrayOps, MapOps and StructOps
[ https://issues.apache.org/jira/browse/SPARK-35452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-35452: - Summary: Introduce ArrayOps, MapOps and StructOps (was: ntroduce ArrayOps, MapOps and StructOps) > Introduce ArrayOps, MapOps and StructOps > > > Key: SPARK-35452 > URL: https://issues.apache.org/jira/browse/SPARK-35452 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > StructType, ArrayType, and MapType are not accepted by DataTypeOps now. > We should handle these complex types. > Please note that arithmetic operations might be applied to these complex > types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work > the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
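The pandas behavior the ticket description cites can be checked directly in plain pandas (a small sketch, assuming pandas is installed; `summed` is an illustrative name):

```python
import pandas as pd

# Plain pandas: `+` on object-dtype Series of lists applies Python's
# list concatenation element-wise, which is what the ticket asks
# ArrayOps to reproduce for ps.Series.
summed = pd.Series([[1, 2, 3]]) + pd.Series([[4, 5, 6]])
```

Here `summed[0]` is `[1, 2, 3, 4, 5, 6]`, so an ArrayOps implementation would need to concatenate arrays element-wise rather than fail on the complex type.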
[jira] [Commented] (SPARK-33743) Change datatype mapping in JDBC mssqldialect: DATETIME to DATETIME2
[ https://issues.apache.org/jira/browse/SPARK-33743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350683#comment-17350683 ] Apache Spark commented on SPARK-33743: -- User 'luxu1-ms' has created a pull request for this issue: https://github.com/apache/spark/pull/32655 > Change datatype mapping in JDBC mssqldialect: DATETIME to DATETIME2 > --- > > Key: SPARK-33743 > URL: https://issues.apache.org/jira/browse/SPARK-33743 > Project: Spark > Issue Type: Request > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Lu Xu >Priority: Major > > *datetime v/s datetime2* > Spark's datetime type is [timestamptype|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/TimestampType.html], > which supports microsecond resolution. > > SQL Server supports two date-time types: > o *datetime* supports only millisecond resolution (0 to 999). > o *datetime2* is an extension of datetime; it is compatible with datetime and > supports 0 to 999 sub-second resolution. > Currently, [MsSQLServerDialect|https://github.com/apache/spark/blob/bfb257f078854ad587a9e2bfe548cdb7bf8786d4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala] > maps timestamptype to datetime. 
This results in errors when writing. > *+Current+* > {code:java} > override def getJDBCType(dt: DataType): Option[JdbcType] = dt match { case > TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP)) .. } > {code} > > *+Proposal+* > {code:java} > override def getJDBCType(dt: DataType): Option[JdbcType] = dt match { case > TimestampType => Some(JdbcType("DATETIME2", java.sql.Types.TIMESTAMP)) .. } > {code} >
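A minimal sketch of the precision loss being described, using plain Python `datetime` (the helper name is hypothetical; SQL Server's legacy DATETIME actually rounds to roughly 1/300 s increments, so millisecond truncation is an optimistic model):

```python
from datetime import datetime

# Hypothetical helper modeling the most a legacy DATETIME column can keep.
def truncate_to_millis(dt):
    return dt.replace(microsecond=(dt.microsecond // 1000) * 1000)

ts = datetime(2021, 5, 24, 12, 0, 0, 123456)  # Spark-side microsecond value
stored = truncate_to_millis(ts)  # what survives a DATETIME round-trip
```

`stored.microsecond` is `123000`: the trailing 456 microseconds are silently dropped, while a DATETIME2 column would keep the full `123456` and make the round-trip lossless.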
[jira] [Updated] (SPARK-35452) ntroduce ArrayOps, MapOps and StructOps
[ https://issues.apache.org/jira/browse/SPARK-35452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35452: - Summary: ntroduce ArrayOps, MapOps and StructOps (was: Introduce ComplexOps for complex types) > ntroduce ArrayOps, MapOps and StructOps > --- > > Key: SPARK-35452 > URL: https://issues.apache.org/jira/browse/SPARK-35452 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > StructType, ArrayType, and MapType are not accepted by DataTypeOps now. > We should handle these complex types. > Please note that arithmetic operations might be applied to these complex > types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work > the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350629#comment-17350629 ] L. C. Hsieh commented on SPARK-33121: - Hmm, I cannot reproduce in branch-3.1/2.4 or master branch. If you don't have Thread.sleep(5000), does it stop gracefully? > Spark Streaming 3.1.1 hangs on shutdown > --- > > Key: SPARK-33121 > URL: https://issues.apache.org/jira/browse/SPARK-33121 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 3.1.1 >Reporter: Dmitry Tverdokhleb >Priority: Major > Labels: Streaming, hang, shutdown > > Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem > in graceful shutdown. > Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". > Here is the code: > {code:java} > inputStream.foreachRDD { > rdd => > rdd.foreachPartition { > Thread.sleep(5000) > } > } > {code} > I send a SIGTERM signal to stop the spark streaming and after sleeping an > exception arises: > {noformat} > streaming-agg-tds-data_1 | java.util.concurrent.RejectedExecutionException: > Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from > java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 1] > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) > streaming-agg-tds-data_1 | at > org.apache.spark.executor.Executor.launchTask(Executor.scala:270) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) > streaming-agg-tds-data_1 | at > 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach$(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach(IterableLike.scala:74) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach$(IterableLike.scala:73) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterable.foreach(Iterable.scala:56) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > streaming-agg-tds-data_1 | at java.lang.Thread.run(Thread.java:748) > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 WARN JobGenerator - Timed > out while stopping the job generator (timeout = 1) > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 INFO JobGenerator - Waited > for jobs 
to be processed and checkpoints to be written > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 INFO JobGenerator - Stopped > JobGenerator{noformat} > After this exception and "JobGenerator - Stopped JobGenerator" log, streaming > freezes, and halts by timeout (Config parameter > "hadoop.service.shutdown.timeout"). > Besides, there is no problem with the graceful shutdown in spark 2.4.5.
[jira] [Updated] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35449: Fix Version/s: 3.1.2 > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Adam Binford >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35452) Introduce ComplexOps for complex types
[ https://issues.apache.org/jira/browse/SPARK-35452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35452: - Summary: Introduce ComplexOps for complex types (was: Introduce ComplexTypeOps) > Introduce ComplexOps for complex types > -- > > Key: SPARK-35452 > URL: https://issues.apache.org/jira/browse/SPARK-35452 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > StructType, ArrayType, and MapType are not accepted by DataTypeOps now. > We should handle these complex types. > Please note that arithmetic operations might be applied to these complex > types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work > the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35452) Introduce ComplexTypeOps
[ https://issues.apache.org/jira/browse/SPARK-35452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35452: - Description: StructType, ArrayType, and MapType are not accepted by DataTypeOps now. We should handle these complex types. Please note that arithmetic operations might be applied to these complex types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). was: StructType, ArrayType, MapType, and BinaryType are not accepted by DataTypeOps now. We should handle these complex types. Please note that arithmetic operations might be applied to these complex types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). > Introduce ComplexTypeOps > > > Key: SPARK-35452 > URL: https://issues.apache.org/jira/browse/SPARK-35452 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > StructType, ArrayType, and MapType are not accepted by DataTypeOps now. > We should handle these complex types. > Please note that arithmetic operations might be applied to these complex > types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work > the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33428) conv UDF returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-33428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350563#comment-17350563 ] Wenchen Fan commented on SPARK-33428: - AFAIK this function is from MySQL and it's better to follow the MySQL behavior. MySQL returns the max unsigned long if the input string is too big, and Spark should follow it. However, seems Spark has different behavior in two cases: # MySQL allows leading spaces but Spark does not. # If the input string is way too long, Spark fails with ArrayIndexOutOfBoundException [~angerszhu] would you like to look into it? > conv UDF returns incorrect value > > > Key: SPARK-33428 > URL: https://issues.apache.org/jira/browse/SPARK-33428 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {noformat} > spark-sql> select java_method('scala.math.BigInt', 'apply', > 'c8dcdfb41711fc9a1f17928001d7fd61', 16); > 266992441711411603393340504520074460513 > spark-sql> select conv('c8dcdfb41711fc9a1f17928001d7fd61', 16, 10); > 18446744073709551615 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
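The two numbers in the reproduction can be checked in plain Python: the hex string encodes a 128-bit value far above the unsigned 64-bit range, and a MySQL-compatible `conv` clamps such out-of-range input to the max unsigned long (variable names here are illustrative):

```python
# Reproducing the report's two results in plain Python.
value = int("c8dcdfb41711fc9a1f17928001d7fd61", 16)
max_u64 = 2**64 - 1  # 18446744073709551615, the max unsigned long

# java_method('scala.math.BigInt', 'apply', ...) returns the exact value,
# while MySQL-style conv() clamps anything above max_u64 down to it.
clamped = min(value, max_u64)
```

`value` is `266992441711411603393340504520074460513` and `clamped` is `18446744073709551615`, matching the two query outputs in the description; the open question in the comment is about leading spaces and the out-of-bounds failure on very long inputs, not this clamping itself.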
[jira] [Commented] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350555#comment-17350555 ] Nikolay Sokolov commented on SPARK-35504: - Added execution plans for two SparkSQL queries: * [^SPARK-35504_first_query_plan.log] * [^SPARK-35504_second_query_plan.log] > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. 
> > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) > {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade > having count(*) = 1 > ) > ; -- Rows: 36901365 > {code} > > The table DDL is the following: > {code:sql} > CREATE TABLE `storage_datamart`.`olympiads` ( > `ptn_date` DATE, > `student_id` BIGINT, > `olympiad_id` STRING, > `grade` BIGINT, > `grade_type` STRING, > `tour` STRING, > `created_at` TIMESTAMP, > `created_at_local`
[jira] [Updated] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Sokolov updated SPARK-35504: Attachment: SPARK-35504_second_query_plan.log SPARK-35504_first_query_plan.log > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. 
> > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) > {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade > having count(*) = 1 > ) > ; -- Rows: 36901365 > {code} > > The table DDL is the following: > {code:sql} > CREATE TABLE `storage_datamart`.`olympiads` ( > `ptn_date` DATE, > `student_id` BIGINT, > `olympiad_id` STRING, > `grade` BIGINT, > `grade_type` STRING, > `tour` STRING, > `created_at` TIMESTAMP, > `created_at_local` TIMESTAMP, > `olympiad_num` BIGINT, > `olympiad_name` STRING, >
[jira] [Updated] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Sokolov updated SPARK-35504: Description: Hi everyone, I hope you're well! Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. I still can't figure out on my own if this is a bug or a feature and I would like to share what I found. I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries. So, I have two identical Spark SQL queries from an algorithmic point of view that return different results. The first query: {code:sql} select count(distinct *) unique_amt from storage_datamart.olympiads ; -- Rows: 13437678 {code} The second query: {code:sql} select count(*) from (select distinct * from storage_datamart.olympiads) ; -- Rows: 36901430 {code} The result of the two queries is different. (But it must be the same, right!?) 
{code:sql} select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads union all select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads) ; {code} The result of the above query is the following: {code:java} The first query13437678 The second query 36901430 {code} I can easily calculate the unique number of rows in the table: {code:sql} select count(*) from ( select student_id, olympiad_id, tour, grade from storage_datamart.olympiads group by student_id, olympiad_id, tour, grade having count(*) = 1 ) ; -- Rows: 36901365 {code} The table DDL is the following: {code:sql} CREATE TABLE `storage_datamart`.`olympiads` ( `ptn_date` DATE, `student_id` BIGINT, `olympiad_id` STRING, `grade` BIGINT, `grade_type` STRING, `tour` STRING, `created_at` TIMESTAMP, `created_at_local` TIMESTAMP, `olympiad_num` BIGINT, `olympiad_name` STRING, `subject` STRING, `started_at` TIMESTAMP, `ended_at` TIMESTAMP, `region_id` BIGINT, `region_name` STRING, `municipality_name` STRING, `school_id` BIGINT, `school_name` STRING, `school_status` BOOLEAN, `oly_n_common` INT, `num_day` INT, `award_type` STRING, `new_student_legacy` INT, `segment` STRING, `total_start` TIMESTAMP, `total_end` TIMESTAMP, `year_learn` STRING, `parent_id` BIGINT, `teacher_id` BIGINT, `parallel` BIGINT, `olympiad_type` STRING) USING parquet LOCATION 's3a://uchiru-bi-dwh/storage/datamart/olympiads.parquet' ; {code} Could you please tell me why in the first Spark SQL query counting the unique number of rows using the construction `count(distinct *)` does not count correctly and why the result of the two Spark SQL queries is different?? Thanks in advance. p.s. 
I could not find a description of such behaviour of the function `count(distinct *)` in the [official Spark documentation|https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions]: {quote}count(DISTINCT expr[, expr...]) -> Returns the number of rows for which the supplied expression(s) are unique and non-null. {quote} was: Hi everyone, I hope you're well! Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. I still can't figure out on my own if this is a bug or a feature and I would like to share what I found. I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries. So, I have two identical Spark SQL queries from an algorithmic point of view that return different results. The first query: {code:sql} select count(distinct *) unique_amt from storage_datamart.olympiads ; -- Rows: 13437678 {code} The second query: {code:sql} select count(*) from (select distinct * from storage_datamart.olympiads) ; -- Rows: 36901430 {code} The result of the two queries is different. (But it must be the same, right!?) {code:sql} select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads union all select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads) ; {code} The of the above query is the following: {code:java} The first query13437678 The second query 36901430 {code} I can easily calculate the unique number of rows in the table: {code:sql} select count(*) from ( select student_id, olympiad_id, tour, grade from storage_datamart.olympiads group by student_id, olympiad_id, tour, grade having count(*) = 1 ) ; -- Rows: 36901365 {code} The table DDL is the
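The documentation excerpt above suggests one plausible (unconfirmed for this report) explanation for the mismatch: count(DISTINCT expr...) counts only rows whose listed expressions are all non-null, while SELECT DISTINCT * keeps distinct rows even when some columns are NULL. A pure-Python sketch of those two counting rules, using tuples as rows and None as SQL NULL (illustrative only, not Spark's implementation):

```python
# Each row is a tuple; None models SQL NULL.
rows = [
    (1, "a"),
    (1, "a"),   # exact duplicate of the first row
    (2, None),  # row containing a NULL
    (3, "b"),
]

# count(DISTINCT *): per the docs quoted above, rows whose expressions are
# "unique and non-null" -- rows containing any NULL are excluded before counting.
count_distinct_star = len({r for r in rows if all(v is not None for v in r)})

# count(*) over (SELECT DISTINCT *): distinct rows, NULL-containing rows kept.
count_from_distinct_subquery = len(set(rows))

print(count_distinct_star)           # 2
print(count_from_distinct_subquery)  # 3
```

Under this reading, the first query's smaller figure (13437678 vs 36901430) would be expected whenever many rows contain at least one NULL column.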
[jira] [Created] (SPARK-35504) count distinct asterisk
Nikolay Sokolov created SPARK-35504: --- Summary: count distinct asterisk Key: SPARK-35504 URL: https://issues.apache.org/jira/browse/SPARK-35504 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: {code:java} uname -a Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux {code} {code:java} lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description:Ubuntu 18.04.4 LTS Release:18.04 Codename: bionic {code} {code:java} /opt/spark/bin/spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 /_/ Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 Branch HEAD Compiled by user ubuntu on 2020-06-06T13:05:28Z Revision 3fdfce3120f307147244e5eaf46d61419a723d50 Url https://gitbox.apache.org/repos/asf/spark.git Type --help for more information. {code} {code:java} lscpu Architecture:x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s):1 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz Stepping:7 CPU MHz: 3602.011 BogoMIPS:6000.01 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache:1024K L3 cache:36608K NUMA node0 CPU(s): 0-3 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke {code} 
Reporter: Nikolay Sokolov Hi everyone, I hope you're well! Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. I still can't figure out on my own if this is a bug or a feature and I would like to share what I found. I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries. So, I have two identical Spark SQL queries from an algorithmic point of view that return different results. The first query: {code:sql} select count(distinct *) unique_amt from storage_datamart.olympiads ; -- Rows: 13437678 {code} The second query: {code:sql} select count(*) from (select distinct * from storage_datamart.olympiads) ; -- Rows: 36901430 {code} The result of the two queries is different. (But it must be the same, right!?) {code:sql} select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads union all select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads) ; {code} The result of the above query is the following: {code:java} The first query 13437678 The second query 36901430 {code} I can easily calculate the unique number of rows in the table: {code:sql} select count(*) from ( select student_id, olympiad_id, tour, grade from storage_datamart.olympiads group by student_id, olympiad_id, tour, grade having count(*) = 1 ) ; -- Rows: 36901365 {code} The table DDL is the following: {code:sql} CREATE TABLE `storage_datamart`.`olympiads` ( `ptn_date` DATE, `student_id` BIGINT, `olympiad_id` STRING, `grade` BIGINT, `grade_type` STRING, `tour` STRING, `created_at` TIMESTAMP, `created_at_local` TIMESTAMP, `olympiad_num` BIGINT, `olympiad_name` STRING, `subject` STRING, `started_at` TIMESTAMP, `ended_at` TIMESTAMP, `region_id` BIGINT, `region_name` STRING, 
`municipality_name` STRING, `school_id` BIGINT, `school_name` STRING, `school_status` BOOLEAN, `oly_n_common` INT, `num_day` INT, `award_type` STRING, `new_student_legacy` INT, `segment` STRING, `total_start` TIMESTAMP, `total_end` TIMESTAMP, `year_learn` STRING, `parent_id` BIGINT, `teacher_id` BIGINT, `parallel` BIGINT, `olympiad_type` STRING) USING parquet LOCATION 's3a://uchiru-bi-dwh/storage/datamart/olympiads.parquet' ; {code} Could you please tell me why in the
[jira] [Commented] (SPARK-35503) Change the facetFilters of Docsearch to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-35503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350526#comment-17350526 ] Apache Spark commented on SPARK-35503: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32654 > Change the facetFilters of Docsearch to 3.1.2 > - > > Key: SPARK-35503 > URL: https://issues.apache.org/jira/browse/SPARK-35503 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.1.2 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > As we are going to release 3.1.2, the search result of 3.1.2 documentation > should point to the new documentation site instead of 3.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35503) Change the facetFilters of Docsearch to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-35503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35503: Assignee: Apache Spark (was: Gengliang Wang) > Change the facetFilters of Docsearch to 3.1.2 > - > > Key: SPARK-35503 > URL: https://issues.apache.org/jira/browse/SPARK-35503 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.1.2 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > As we are going to release 3.1.2, the search result of 3.1.2 documentation > should point to the new documentation site instead of 3.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35503) Change the facetFilters of Docsearch to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-35503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35503: Assignee: Gengliang Wang (was: Apache Spark) > Change the facetFilters of Docsearch to 3.1.2 > - > > Key: SPARK-35503 > URL: https://issues.apache.org/jira/browse/SPARK-35503 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.1.2 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > As we are going to release 3.1.2, the search result of 3.1.2 documentation > should point to the new documentation site instead of 3.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35503) Change the facetFilters of Docsearch to 3.1.2
Gengliang Wang created SPARK-35503: -- Summary: Change the facetFilters of Docsearch to 3.1.2 Key: SPARK-35503 URL: https://issues.apache.org/jira/browse/SPARK-35503 Project: Spark Issue Type: Task Components: Documentation Affects Versions: 3.1.2 Reporter: Gengliang Wang Assignee: Gengliang Wang As we are going to release 3.1.2, the search result of 3.1.2 documentation should point to the new documentation site instead of 3.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35287) RemoveRedundantProjects removes non-redundant projects
[ https://issues.apache.org/jira/browse/SPARK-35287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35287. - Fix Version/s: 3.1.2 3.2.0 Resolution: Fixed Issue resolved by pull request 32606 [https://github.com/apache/spark/pull/32606] > RemoveRedundantProjects removes non-redundant projects > -- > > Key: SPARK-35287 > URL: https://issues.apache.org/jira/browse/SPARK-35287 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Chungmin >Priority: Major > Fix For: 3.2.0, 3.1.2 > > > RemoveRedundantProjects erroneously removes non-redundant projects which are > required to convert rows coming from DataSourceV2ScanExec to UnsafeRow. There > is a code for this case, but it only looks at the child. The bug occurs when > DataSourceV2ScanExec is not a child of the project, but a descendant. The > method {{isRedundant}} in {{RemoveRedundantProjects}} should be updated to > account for descendants too. > The original scenario requires Iceberg to reproduce the issue. In theory, it > should be able to reproduce the bug with Spark SQL only, and someone more > knowledgeable with Spark SQL should be able to make such a scenario. 
The > following is my reproduction scenario (Spark 3.1.1, Iceberg 0.11.1): > {code:java} > import scala.collection.JavaConverters._ > import org.apache.iceberg.{PartitionSpec, TableProperties} > import org.apache.iceberg.hadoop.HadoopTables > import org.apache.iceberg.spark.SparkSchemaUtil > import org.apache.spark.sql.{DataFrame, QueryTest, SparkSession} > import org.apache.spark.sql.internal.SQLConf > class RemoveRedundantProjectsTest extends QueryTest { > override val spark: SparkSession = SparkSession > .builder() > .master("local[4]") > .config("spark.driver.bindAddress", "127.0.0.1") > .appName(suiteName) > .getOrCreate() > test("RemoveRedundantProjects removes non-redundant projects") { > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", > SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false", > SQLConf.REMOVE_REDUNDANT_PROJECTS_ENABLED.key -> "true") { > withTempDir { dir => > val path = dir.getCanonicalPath > val data = spark.range(3).toDF > val table = new HadoopTables().create( > SparkSchemaUtil.convert(data.schema), > PartitionSpec.unpartitioned(), > Map(TableProperties.WRITE_NEW_DATA_LOCATION -> path).asJava, > path) > data.write.format("iceberg").mode("overwrite").save(path) > table.refresh() > val df = spark.read.format("iceberg").load(path) > val dfX = df.as("x") > val dfY = df.as("y") > val join = dfX.filter(dfX("id") > 0).join(dfY, "id") > join.explain("extended") > assert(join.count() == 2) > } > } > } > } > {code} > Stack trace: > {noformat} > [info] - RemoveRedundantProjects removes non-redundant projects *** FAILED *** > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 1.0 (TID 4) (xeroxms100.northamerica.corp.microsoft.com executor > driver): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > [info] at > 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226) > [info] at > org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at >
[jira] [Assigned] (SPARK-35287) RemoveRedundantProjects removes non-redundant projects
[ https://issues.apache.org/jira/browse/SPARK-35287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35287: --- Assignee: Kousuke Saruta > RemoveRedundantProjects removes non-redundant projects > -- > > Key: SPARK-35287 > URL: https://issues.apache.org/jira/browse/SPARK-35287 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Chungmin >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > RemoveRedundantProjects erroneously removes non-redundant projects which are > required to convert rows coming from DataSourceV2ScanExec to UnsafeRow. There > is a code for this case, but it only looks at the child. The bug occurs when > DataSourceV2ScanExec is not a child of the project, but a descendant. The > method {{isRedundant}} in {{RemoveRedundantProjects}} should be updated to > account for descendants too. > The original scenario requires Iceberg to reproduce the issue. In theory, it > should be able to reproduce the bug with Spark SQL only, and someone more > knowledgeable with Spark SQL should be able to make such a scenario. 
The > following is my reproduction scenario (Spark 3.1.1, Iceberg 0.11.1): > {code:java} > import scala.collection.JavaConverters._ > import org.apache.iceberg.{PartitionSpec, TableProperties} > import org.apache.iceberg.hadoop.HadoopTables > import org.apache.iceberg.spark.SparkSchemaUtil > import org.apache.spark.sql.{DataFrame, QueryTest, SparkSession} > import org.apache.spark.sql.internal.SQLConf > class RemoveRedundantProjectsTest extends QueryTest { > override val spark: SparkSession = SparkSession > .builder() > .master("local[4]") > .config("spark.driver.bindAddress", "127.0.0.1") > .appName(suiteName) > .getOrCreate() > test("RemoveRedundantProjects removes non-redundant projects") { > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", > SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false", > SQLConf.REMOVE_REDUNDANT_PROJECTS_ENABLED.key -> "true") { > withTempDir { dir => > val path = dir.getCanonicalPath > val data = spark.range(3).toDF > val table = new HadoopTables().create( > SparkSchemaUtil.convert(data.schema), > PartitionSpec.unpartitioned(), > Map(TableProperties.WRITE_NEW_DATA_LOCATION -> path).asJava, > path) > data.write.format("iceberg").mode("overwrite").save(path) > table.refresh() > val df = spark.read.format("iceberg").load(path) > val dfX = df.as("x") > val dfY = df.as("y") > val join = dfX.filter(dfX("id") > 0).join(dfY, "id") > join.explain("extended") > assert(join.count() == 2) > } > } > } > } > {code} > Stack trace: > {noformat} > [info] - RemoveRedundantProjects removes non-redundant projects *** FAILED *** > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 1.0 (TID 4) (xeroxms100.northamerica.corp.microsoft.com executor > driver): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > [info] at > 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226) > [info] at > org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at
[jira] [Commented] (SPARK-35312) Introduce new Option in Kafka source to specify minimum number of records to read per trigger
[ https://issues.apache.org/jira/browse/SPARK-35312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350505#comment-17350505 ] Apache Spark commented on SPARK-35312: -- User 'satishgopalani' has created a pull request for this issue: https://github.com/apache/spark/pull/32653 > Introduce new Option in Kafka source to specify minimum number of records to > read per trigger > - > > Key: SPARK-35312 > URL: https://issues.apache.org/jira/browse/SPARK-35312 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.1 >Reporter: Satish Gopalani >Priority: Major > > Kafka source currently provides options to set the maximum number of offsets > to read per trigger. > I would like to introduce a new option to specify the minimum number of > offsets to read per trigger, i.e. *minOffsetsPerTrigger*. > This new option will allow skipping a trigger/batch when the number of records > available in Kafka is low. This is a very useful feature in cases where we > have a sudden burst of data at certain intervals in a day and data volume is > low for the rest of the day. Tuning such jobs is difficult: decreasing the > trigger processing time increases the number of batches, and hence cluster > resource usage, and adds to small-file issues, while increasing the trigger > processing time adds consumer lag. This option will save cluster resources and also help solve > small-file issues as it runs fewer batches. > Along with this, I would like to introduce a '*maxTriggerDelay*' option, which > will help avoid an indefinite delay in scheduling a trigger: the trigger will > fire irrespective of the records available once the maxTriggerDelay > time has elapsed since the last trigger. It would be an optional parameter with a > default value of 15 mins. 
_This option will only be applicable if > minOffsetsPerTrigger is set._ > The *minOffsetsPerTrigger* option would be optional of course, but once specified > it would take precedence over *maxOffsetsPerTrigger*, which will be honored > only after *minOffsetsPerTrigger* is satisfied. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
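For illustration, the proposed options would be set like any other Kafka source option. A hypothetical PySpark sketch (the option names `minOffsetsPerTrigger` and `maxTriggerDelay` are the ones proposed in this issue and may not exist in a given Spark release; `spark`, the broker address, and the topic are placeholders; not runnable without a live cluster):

```python
# Sketch only: option names as proposed in this issue, not guaranteed
# to be available in the reader's Spark version.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("minOffsetsPerTrigger", "100000")  # skip the trigger until enough offsets accumulate
    .option("maxOffsetsPerTrigger", "500000")  # upper bound, honored once the minimum is met
    .option("maxTriggerDelay", "15m")          # fire anyway after this long since the last trigger
    .load()
)
```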
[jira] [Commented] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query p
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350458#comment-17350458 ] Takeshi Yamamuro commented on SPARK-35500: -- Which version did you use? v3.1.0 does not exist, so v3.1.1? I tried to run the queries in v3.1.1 to reproduce it, but it couldn't happen; {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.1 /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181) scala> sql("create table test_code_gen(a array)") scala> sql("insert into test_code_gen values (array(1, 1))") scala> sc.setLogLevel("debug") // The first run scala> sql("select * from test_code_gen").collect() ... 21/05/24 23:14:00 DEBUG GenerateSafeProjection: code for createexternalrow(staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(lambdavariable(MapObject, IntegerType, true, -1), lambdavariable(MapObject, IntegerType, true, -1), input[0, array, true], None).array, true, false), StructField(a,ArrayType(IntegerType,true),true)): /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private boolean resultIsNull_0; /* 010 */ private int value_MapObject_lambda_variable_1; /* 011 */ private boolean isNull_MapObject_lambda_variable_1; /* 012 */ private boolean globalIsNull_0; /* 013 */ private java.lang.Object[] mutableStateArray_0 = new java.lang.Object[1]; /* 014 */ ... // The second run scala> sql("select * from test_code_gen").collect() ... 
21/05/24 23:14:28 DEBUG GenerateSafeProjection: code for createexternalrow(staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(lambdavariable(MapObject, IntegerType, true, -1), lambdavariable(MapObject, IntegerType, true, -1), input[0, array, true], None).array, true, false), StructField(a,ArrayType(IntegerType,true),true)): /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private boolean resultIsNull_0; /* 010 */ private int value_MapObject_lambda_variable_1; /* 011 */ private boolean isNull_MapObject_lambda_variable_1; /* 012 */ private boolean globalIsNull_0; /* 013 */ private java.lang.Object[] mutableStateArray_0 = new java.lang.Object[1]; ... {code} Actually, this issue should be fixed in SPARK-27871([https://github.com/apache/spark/pull/24735]). Or, do I miss something? 
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but the code for the class cannot be reused, because the ids of two > variables in the generated class change every time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has been > cached, the new code cannot match the cache key and needs to be > compiled again, which costs some time. The compilation cost increases > with the number of columns; for a wide table, this cost can be more than 2s. > {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { >
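The cache-miss mechanism described in this report can be modeled in a few lines: if a freshly incremented id is baked into the generated source, the source text (acting as the cache key) never repeats, so every generation compiles again. A toy Python model (a hypothetical simplification; the real Spark cache keys generated Java source and compiles it with Janino):

```python
import itertools

# Toy model of a codegen cache keyed on the generated source text.
_fresh_id = itertools.count()
compile_cache = {}

def generate(reuse_friendly: bool) -> str:
    # With reuse_friendly=False, a fresh id is baked into the source on every
    # call -- mirroring MapObjects_loopValue$id from the report above.
    n = 0 if reuse_friendly else next(_fresh_id)
    source = f"MapObjects_loopValue{n} = ...; MapObjects_loopIsNull{n} = ..."
    if source not in compile_cache:
        compile_cache[source] = f"compiled({source})"  # stands in for Janino compilation
    return source

generate(False)
generate(False)
misses_with_fresh_ids = len(compile_cache)   # 2: every call is a cache miss

compile_cache.clear()
generate(True)
generate(True)
hits_with_stable_ids = len(compile_cache)    # 1: second call reuses the cached entry
```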
[jira] [Created] (SPARK-35502) Spark Executor metrics are not produced/showed
Mati created SPARK-35502:
--
Summary: Spark Executor metrics are not produced/showed
Key: SPARK-35502
URL: https://issues.apache.org/jira/browse/SPARK-35502
Project: Spark
Issue Type: Bug
Components: Spark Submit
Affects Versions: 3.0.1
Reporter: Mati
Recently we enabled the prometheusServlet configuration in order to collect Spark master, worker, driver, and executor metrics. We can see the master, worker, and driver metrics, but not the executor metrics. We are running a Spark Streaming standalone cluster, version 3.0.1, on physical servers. We took one of our jobs and added the following parameters to its configuration, but could not see executor metrics when curling both the driver and the executor workers of this job.
These are the parameters:
--conf spark.ui.prometheus.enabled=true \
--conf spark.executor.processTreeMetrics.enabled=true
Curl commands:
[00764f](root@sparktest-40005-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[00764f](root@sparktest-40005-prod-chidc2:~)#
Driver of this job - sparktest-40004:
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[e35005](root@sparktest-40004-prod-chidc2:~)#
Our UI port is 4050.
I understand that the executor Prometheus endpoint is still experimental, which may explain the inconsistent behaviour we see, but is there a plan to fix it? Are there any known issues regarding this?
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
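For reference, the setup described in the report can be sketched as a single submission plus check. The two `--conf` flags and the UI port 4050 come from the report itself; the application class and jar names are placeholders, not part of the original:

```shell
# Sketch (assumptions: placeholder job class/jar; UI port 4050 as in the report).
# Enables the experimental executor Prometheus endpoint on the Spark UI.
spark-submit \
  --conf spark.ui.prometheus.enabled=true \
  --conf spark.executor.processTreeMetrics.enabled=true \
  --class com.example.StreamingJob \
  streaming-job.jar

# Then, on the driver host, the endpoint should serve executor metrics:
curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
```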
[jira] [Commented] (SPARK-35501) Add a feature for removing pulled container image for docker integration tests
[ https://issues.apache.org/jira/browse/SPARK-35501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350419#comment-17350419 ] Apache Spark commented on SPARK-35501:
--
User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32652
> Add a feature for removing pulled container image for docker integration tests
> --
>
> Key: SPARK-35501
> URL: https://issues.apache.org/jira/browse/SPARK-35501
> Project: Spark
> Issue Type: Improvement
> Components: SQL, Tests
> Affects Versions: 3.2.0
> Reporter: Kousuke Saruta
> Assignee: Kousuke Saruta
> Priority: Minor
>
> Docker integration tests pull the corresponding container images before a test if there is no container image in the local repository, but the pulled image remains even after the test finishes.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35501) Add a feature for removing pulled container image for docker integration tests
[ https://issues.apache.org/jira/browse/SPARK-35501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35501: Assignee: Apache Spark (was: Kousuke Saruta) > Add a feature for removing pulled container image for docker integration tests > -- > > Key: SPARK-35501 > URL: https://issues.apache.org/jira/browse/SPARK-35501 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > Docker integration tests pull the corresponding container images before test > if there is no container image in the local repository. > But the pulled image remains even after test finished. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35501) Add a feature for removing pulled container image for docker integration tests
[ https://issues.apache.org/jira/browse/SPARK-35501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35501: Assignee: Kousuke Saruta (was: Apache Spark) > Add a feature for removing pulled container image for docker integration tests > -- > > Key: SPARK-35501 > URL: https://issues.apache.org/jira/browse/SPARK-35501 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > Docker integration tests pull the corresponding container images before test > if there is no container image in the local repository. > But the pulled image remains even after test finished. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impacts the query performance
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500:
--
Description:
Reproduce steps:
# Create a new table with an array column: create table test_code_gen(a array);
# Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties;
# Enter spark-shell and fire a query: spark.sql("select * from test_code_gen").collect
# Every time Dataset.collect is called, a SpecificSafeProjection class is generated, but the generated code cannot be reused because the id embedded in two variable names changes each time: MapObjects_loopValue and MapObjects_loopIsNull. So even though a previously generated class has been cached, the new code does not match the cache key and has to be compiled again, which costs time. The compile cost grows with the number of columns; for a wide table it can be more than 2s.
{code:java}
object MapObjects {
  private val curId = new java.util.concurrent.atomic.AtomicInteger()
  val id = curId.getAndIncrement()
  val loopValue = s"MapObjects_loopValue$id"
  val loopIsNull = if (elementNullable) {
    s"MapObjects_loopIsNull$id"
  } else {
    "false"
  }
{code}
First time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue1;
  private boolean MapObjects_loopIsNull1;
  private UTF8String MapObjects_loopValue2;
  private boolean MapObjects_loopIsNull2;
}
{code}
Second time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue3;
  private boolean MapObjects_loopIsNull3;
  private UTF8String MapObjects_loopValue4;
  private boolean MapObjects_loopIsNull4;
}
{code}
Expectation: the code generated by GenerateSafeProjection can be reused if the query is the same.
was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; }{code} Expectation: MapObjects_loopValue and MapObjects_loopIsNull > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: 
https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the
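The cache-miss mechanism described in the report can be illustrated outside Spark. This is a minimal sketch, not Spark's actual CodegenContext or compilation-cache code: a compile cache keyed on the generated source text can never hit when each generation embeds a fresh counter value in its variable names, which is exactly what the `MapObjects_loopValue$id` naming does.

```python
import itertools

# Stand-in for MapObjects' AtomicInteger: every generation draws a new id.
_cur_id = itertools.count()


def generate_source() -> str:
    """Mimics codegen emitting globally unique variable names."""
    i = next(_cur_id)
    return (
        "class SpecificSafeProjection {"
        f" private int MapObjects_loopValue{i};"
        f" private boolean MapObjects_loopIsNull{i}; }}"
    )


_compile_cache: dict[str, object] = {}


def compile_cached(source: str) -> bool:
    """Returns True on a cache hit; on a miss, 'compiles' (stores a dummy)."""
    if source in _compile_cache:
        return True
    _compile_cache[source] = object()  # stand-in for the compiled class
    return False


# Same logical query twice, but the embedded id makes every source string
# unique, so both lookups miss and compilation runs again each time.
assert compile_cached(generate_source()) is False
assert compile_cached(generate_source()) is False

# With stable names (the fix direction of SPARK-27871: per-class fresh names
# instead of a global counter), the second compilation would hit the cache.
stable = "class SpecificSafeProjection { private int MapObjects_loopValue; }"
assert compile_cached(stable) is False
assert compile_cached(stable) is True
```

The design point: the cache key is the full source text, so any nondeterminism in name generation silently disables reuse even though the classes are behaviorally identical.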
[jira] [Created] (SPARK-35501) Add a feature for removing pulled container image for docker integration tests
Kousuke Saruta created SPARK-35501:
--
Summary: Add a feature for removing pulled container image for docker integration tests
Key: SPARK-35501
URL: https://issues.apache.org/jira/browse/SPARK-35501
Project: Spark
Issue Type: Improvement
Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Docker integration tests pull the corresponding container images before a test if there is no container image in the local repository, but the pulled image remains even after the test finishes.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350387#comment-17350387 ] Apache Spark commented on SPARK-35449: -- User 'Kimahriman' has created a pull request for this issue: https://github.com/apache/spark/pull/32651 > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Adam Binford >Priority: Major > Fix For: 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
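The hazard described above can be shown with a plain-Python analogue. This is a sketch, not Spark's evaluation code: `udf` is a hypothetical stand-in for the user-defined function in the report, and the point is that hoisting a "common" subexpression out of conditional branches runs it even when no branch's condition would have.

```python
def udf(x: int) -> int:
    # Stand-in for a UDF that is only safe to call when x < 0,
    # mirroring when($"id" < 0, myUdf($"id")) from the report.
    if x >= 0:
        raise ValueError("udf called on a non-negative id")
    return -x


def case_when_lazy(x: int):
    # Correct semantics: udf runs only inside the branch whose condition holds.
    col = udf(x) if x < 0 else None          # when(id < 0, udf(id))
    return col if col is not None and col > 0 else None  # when(col > 0, col)


def case_when_hoisted(x: int):
    # Buggy subexpression elimination: the shared expression is evaluated
    # once, unconditionally, before any condition is checked.
    col = udf(x)  # runs even when x < 0 is False
    return col if x < 0 and col > 0 else None


assert case_when_lazy(0) is None   # fine: udf is never invoked
assert case_when_lazy(-3) == 3     # branch taken: udf runs legitimately
try:
    case_when_hoisted(0)           # udf runs although no branch needed it
    raised = False
except ValueError:
    raised = True
assert raised
```

This is why a value expression must also appear in the `elseValue` before it truly occurs in "all branches" of a CaseWhen and is safe to extract.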
[jira] [Commented] (SPARK-35279) _SUCCESS file not written when using partitionOverwriteMode=dynamic
[ https://issues.apache.org/jira/browse/SPARK-35279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350373#comment-17350373 ]

Steve Loughran commented on SPARK-35279:

Any reason for not using the S3A committers here? i.e. the ones which are safe to use on S3?

> _SUCCESS file not written when using partitionOverwriteMode=dynamic
> ---
>
> Key: SPARK-35279
> URL: https://issues.apache.org/jira/browse/SPARK-35279
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.1
> Reporter: Goran Kljajic
> Priority: Minor
>
> Steps to reproduce:
>
> {code:java}
> case class A(a: String, b: String)
> def df = List(A("a", "b")).toDF
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> val writer = df.write.mode(SaveMode.Overwrite).partitionBy("a")
> writer.parquet("s3a://some_bucket/test/")
> {code}
> When spark.sql.sources.partitionOverwriteMode is set to dynamic, the output written doesn't have the _SUCCESS file updated.
> (I have checked different versions of hadoop from 3.1.4 to 3.22 and they all behave the same, so the issue is with spark)
> This is working in spark 3.0.2
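For context, the _SUCCESS marker the report refers to is a file that Hadoop's FileOutputCommitter writes at the job root on successful commit (typically zero bytes; the S3A committers write a small JSON summary into it), and downstream consumers often gate on it. A minimal local-filesystem sketch of that contract, with java.nio.file standing in for the object store and all names illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SuccessMarker {
    // Write output files, then mark the job root with an empty _SUCCESS file,
    // mirroring what FileOutputCommitter does on successful job commit.
    static void commitJob(Path jobRoot) throws IOException {
        Files.createDirectories(jobRoot.resolve("a=1"));
        Files.write(jobRoot.resolve("a=1").resolve("part-00000.parquet"), new byte[]{1, 2, 3});
        Files.createFile(jobRoot.resolve("_SUCCESS")); // zero-byte marker
    }

    // Downstream consumers should gate on the marker, not on file listings.
    static boolean jobSucceeded(Path jobRoot) {
        return Files.exists(jobRoot.resolve("_SUCCESS"));
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("test-output");
        System.out.println(jobSucceeded(root)); // false before commit
        commitJob(root);
        System.out.println(jobSucceeded(root)); // true after commit
    }
}
```

The bug report is that with partitionOverwriteMode=dynamic the write path skips the marker step, so consumers gating on _SUCCESS never see the commit.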
[jira] [Commented] (SPARK-35299) Dataframe overwrite on S3 does not delete old files with S3 object-put to table path
[ https://issues.apache.org/jira/browse/SPARK-35299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350372#comment-17350372 ]

Steve Loughran commented on SPARK-35299:

Use a version of Spark with the Hadoop 3.1+ JARs and the S3A committer.

> Dataframe overwrite on S3 does not delete old files with S3 object-put to table path
> ---
>
> Key: SPARK-35299
> URL: https://issues.apache.org/jira/browse/SPARK-35299
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Yusheng Ding
> Priority: Major
> Labels: aws-s3, dataframe, hive, spark
>
> To reproduce:
> test_table path: s3a://test_bucket/test_table/
>
> df = spark_session.sql("SELECT * FROM test_table")
> df.count() # produces row count 1000
> ## S3 operation ##
> s3 = boto3.client("s3")
> s3.put_object(
> Bucket="test_bucket", Body="", Key=f"test_table/"
> )
> ## S3 operation ##
> df.write.insertInto(test_table, overwrite=True)
> # Same goes for df.write.save(mode="overwrite", format="parquet", path="s3a://test_bucket/test_table")
> df = spark_session.sql("SELECT * FROM test_table")
> df.count() # produces row count 2000
>
> Overwrite is not functioning correctly. Old files will not be deleted on S3.
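The doubled count (1000 to 2000 rows) is what you see when an "overwrite" only adds new files and leaves the old generation in place, so readers list both. A minimal sketch of the expected overwrite contract, using the local filesystem as a stand-in for S3 (a toy model, not Spark's or S3A's actual commit logic; names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OverwriteDemo {
    static int part = 0;

    // Overwrite semantics: clear the target before writing, so readers never
    // see a mix of old and new part files (the double count in the report).
    static void overwrite(Path dir, int rows) throws IOException {
        Files.createDirectories(dir);
        List<Path> old;
        try (Stream<Path> s = Files.list(dir)) {
            old = s.collect(Collectors.toList());
        }
        for (Path p : old) Files.delete(p); // remove the previous generation
        Files.write(dir.resolve("part-" + (part++)), new byte[rows]);
    }

    static long fileCount(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tbl");
        overwrite(dir, 10);
        overwrite(dir, 10);
        System.out.println(fileCount(dir)); // 1: the old part file was removed
    }
}
```

In the reported failure, the manually PUT zero-byte object under the table path confuses the overwrite's listing/deletion step, so the old files survive and counts double.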
[jira] [Commented] (SPARK-35406) TaskCompletionListenerException: Premature end of Content-Length delimited message body
[ https://issues.apache.org/jira/browse/SPARK-35406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350371#comment-17350371 ]

Steve Loughran commented on SPARK-35406:

This is actually triggered by garbage collection, not an AWS SDK issue. The S3A input stream keeps a reference to the stream returned by the S3Object, but somehow the outer objects can get GC'd and everything breaks. The fix is: keep a reference to that outer object.

Updating the AWS SDK is unlikely to have any direct effect; it and the required matching hadoop-* JARs may change the GC profile enough to hide it. You should be off Hadoop 2.7 for many reasons, especially when working with object stores, and likewise for Java 11: it is not supported there, so all problems will be closed as "wontfix". As well as more efficient random-access IO over HTTP connections, Hadoop 3.1+ ships with the S3A committers, which are actually designed to work with S3. The committer you are using relies on rename() being fast and, for a directory, atomic; this doesn't hold with S3. At least now that S3 is consistent, rename() will discover the correct set of files to copy.

> TaskCompletionListenerException: Premature end of Content-Length delimited message body
> ---
>
> Key: SPARK-35406
> URL: https://issues.apache.org/jira/browse/SPARK-35406
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.7
> Environment: Spark 2.4.7
> Built with Hadoop 2.7.3
> hadoop-aws jar 2.7.3
> aws-java-sdk 1.7.4
> EKS 1.18
> Reporter: Avik Aggarwal
> Priority: Major
> Labels: Kubernetes
>
> Running Spark on Kubernetes (EKS 1.18) with the below versions of the different components:
> Spark 2.4.7
> Built with Hadoop 2.7.3
> hadoop-aws jar 2.7.3
> aws-java-sdk 1.7.4
>
> I am using the s3a endpoint for reading S3 objects from a private repository, and the appropriate role has been given to the executors.
>
> I am facing the below error while reading/writing bigger files from/to S3. File sizes for which I am facing the issue: low tens of MB (10-30).
> While it works for files with KBs of size. > > Logs : - > {code:java} > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 6, 10.83.7.112, executor 2): > org.apache.spark.util.TaskCompletionListenerException: Premature end of > Content-Length delimited message body (expected: 3918825; received: 18020 > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116) > at org.apache.spark.scheduler.Task.run(Task.scala:139) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1925) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1913) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1912) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1912) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:948) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:948) > at scala.Option.foreach(Option.scala:257) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:948) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2146) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2095) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2084) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:759) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) > at
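The fix described in the comment above ("keep a reference to that outer object") can be sketched as a stream wrapper that pins the object owning the underlying connection. The class and field names below are illustrative, not the actual S3A or AWS SDK types:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

// If only the inner InputStream is kept reachable, the response object that
// owns the HTTP connection can be garbage collected mid-read, and its cleanup
// releases the connection: the reader then sees a premature end of the
// Content-Length delimited body. Pinning the owner in the wrapper prevents it.
public class PinnedStream extends FilterInputStream {
    // Strong reference whose only job is to keep the owner reachable for as
    // long as the stream itself is reachable.
    private final Object owner;

    public PinnedStream(Object owner, InputStream wrapped) {
        super(wrapped);
        this.owner = owner;
    }

    public static void main(String[] args) throws Exception {
        Object response = new Object(); // stand-in for the SDK response object
        InputStream in = new PinnedStream(response, new ByteArrayInputStream(new byte[]{42}));
        System.out.println(in.read()); // 42; the owner stays reachable throughout
    }
}
```

Reads delegate unchanged to the wrapped stream via FilterInputStream; the wrapper changes only the reachability graph, which is the whole point of the fix.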
[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
[ https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350337#comment-17350337 ] Ji Chen commented on SPARK-26385: - [~zjiash] Has this problem been solved please? > YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in > cache > --- > > Key: SPARK-26385 > URL: https://issues.apache.org/jira/browse/SPARK-26385 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Hadoop 2.6.0, Spark 2.4.0 >Reporter: T M >Priority: Major > > > Hello, > > I have Spark Structured Streaming job which is runnning on YARN(Hadoop 2.6.0, > Spark 2.4.0). After 25-26 hours, my job stops working with following error: > {code:java} > 2018-12-16 22:35:17 ERROR > org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query > TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = > a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, > realUser=, issueDate=1544903057122, maxDate=1545507857122, > sequenceNumber=10314, masterKeyId=344) can't be found in cache at > org.apache.hadoop.ipc.Client.call(Client.java:1470) at > org.apache.hadoop.ipc.Client.call(Client.java:1401) at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752) > at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at > org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at > org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at > org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at > org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at > org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166) > at >
[jira] [Commented] (SPARK-32194) Standardize exceptions in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350307#comment-17350307 ]

Apache Spark commented on SPARK-32194:
--

User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32650

> Standardize exceptions in PySpark
> ---
>
> Key: SPARK-32194
> URL: https://issues.apache.org/jira/browse/SPARK-32194
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> Currently, PySpark throws {{Exception}} or just {{RuntimeException}} in many cases. We should standardize them.
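What "standardizing" buys is sketched below: a common base type carrying a stable error identifier, so callers can catch specific types and branch on the error class instead of parsing message strings. The actual work is in PySpark; the class names and error-class scheme here are illustrative Java, not PySpark's real hierarchy:

```java
public class ExceptionDemo {
    // Base type replacing bare RuntimeException: every error carries a
    // machine-readable error class alongside the human-readable message.
    static class SparkClientException extends RuntimeException {
        final String errorClass;
        SparkClientException(String errorClass, String message) {
            super("[" + errorClass + "] " + message);
            this.errorClass = errorClass;
        }
    }

    // A specific, catchable subtype built on the same scheme.
    static class ColumnNotFoundException extends SparkClientException {
        ColumnNotFoundException(String column) {
            super("COLUMN_NOT_FOUND", "Column '" + column + "' does not exist.");
        }
    }

    public static void main(String[] args) {
        try {
            throw new ColumnNotFoundException("id");
        } catch (SparkClientException e) {
            // Callers branch on the stable identifier, not the message text.
            System.out.println(e.errorClass);
        }
    }
}
```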
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yahui Liu updated SPARK-35500:
--
Description:

Reproduce steps:
# Create a new table with an array-typed column: create table test_code_gen(a array);
# Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties;
# Enter spark-shell and fire a query: spark.sql("select * from test_code_gen").collect
# Every time Dataset.collect is called, a SpecificSafeProjection class is generated, but its code cannot be reused, because the ids of two variables in the generated class (MapObjects_loopValue and MapObjects_loopIsNull) change on every generation. So even though the previously generated class has been cached, the new code cannot match the cache key, and it has to be compiled again, which costs time. The compilation cost grows with the number of columns; for a wide table it can exceed 2s.

{code:java}
object MapObjects {
  private val curId = new java.util.concurrent.atomic.AtomicInteger()
  val id = curId.getAndIncrement()
  val loopValue = s"MapObjects_loopValue$id"
  val loopIsNull = if (elementNullable) {
    s"MapObjects_loopIsNull$id"
  } else {
    "false"
  }
{code}

First time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue1;
  private boolean MapObjects_loopIsNull1;
  private UTF8String MapObjects_loopValue2;
  private boolean MapObjects_loopIsNull2;
}
{code}

Second time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue3;
  private boolean MapObjects_loopIsNull3;
  private UTF8String MapObjects_loopValue4;
  private boolean MapObjects_loopIsNull4;
}
{code}

was:

Reproduce steps:
# create a new table with array type: create table test_code_gen(a array);
# Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties;
# Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect
# Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time.

{code:java}
object MapObjects {
  private val curId = new java.util.concurrent.atomic.AtomicInteger()
  val id = curId.getAndIncrement()
  val loopValue = s"MapObjects_loopValue$id"
  val loopIsNull = if (elementNullable) {
    s"MapObjects_loopIsNull$id"
  } else {
    "false"
  }
{code}

First time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue1;
  private boolean MapObjects_loopIsNull1;
  private UTF8String MapObjects_loopValue2;
  private boolean MapObjects_loopIsNull2;
}
{code}

Second time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue3;
  private boolean MapObjects_loopIsNull3;
  private UTF8String MapObjects_loopValue4;
  private boolean MapObjects_loopIsNull4;
}
{code}
# The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s.
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new code cannot
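The caching failure described above can be reproduced outside Spark with a toy model: when the generated source text is the cache key, a globally incrementing counter embedded in variable names guarantees a miss on every generation, while normalizing the ids before keying restores hits. This is the general shape of a fix, not Spark's actual CodeGenerator cache; all names below are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class CodegenCacheDemo {
    static final AtomicInteger curId = new AtomicInteger();
    static final Map<String, Integer> cache = new HashMap<>(); // source -> "compiled class"
    static int compilations = 0;

    // Fresh id per generation, mirroring MapObjects' curId.getAndIncrement().
    static String generate() {
        int id = curId.getAndIncrement();
        return "class P { int MapObjects_loopValue" + id + "; }";
    }

    // Rewrite counter-based ids to a canonical form before keying the cache.
    static String normalize(String source) {
        return source.replaceAll("MapObjects_loopValue\\d+", "MapObjects_loopValue");
    }

    // "Compile" only on a cache miss, counting how often that happens.
    static void compile(String key) {
        cache.computeIfAbsent(key, k -> ++compilations);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) compile(generate());
        System.out.println(compilations);  // 3: every generation is a cache miss
        compilations = 0;
        cache.clear();
        for (int i = 0; i < 3; i++) compile(normalize(generate()));
        System.out.println(compilations);  // 1: the normalized key hits the cache
    }
}
```

Any change that makes the generated source identical across calls (canonical ids, or keying the cache on a normalized form) turns the repeated >2s compilations into cache hits.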
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
# > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: # # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
# First time run: class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } Second time run: class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
# was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: # # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
# First time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue1;
  private boolean MapObjects_loopIsNull1;
  private UTF8String MapObjects_loopValue2;
  private boolean MapObjects_loopIsNull2;
}
{code}
Second time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue3;
  private boolean MapObjects_loopIsNull3;
  private UTF8String MapObjects_loopValue4;
  private boolean MapObjects_loopIsNull4;
}
{code}
was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new code cannot match the cache key so that new code need to be > compiled again which cost some time. > {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First
[jira] [Assigned] (SPARK-35497) Enable plotly tests in pandas-on-Spark
[ https://issues.apache.org/jira/browse/SPARK-35497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35497: Assignee: Apache Spark > Enable plotly tests in pandas-on-Spark > -- > > Key: SPARK-35497 > URL: https://issues.apache.org/jira/browse/SPARK-35497 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Currently, GitHub Actions doesn't have plotly installed and results in plotly > related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35497) Enable plotly tests in pandas-on-Spark
[ https://issues.apache.org/jira/browse/SPARK-35497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35497: Assignee: (was: Apache Spark) > Enable plotly tests in pandas-on-Spark > -- > > Key: SPARK-35497 > URL: https://issues.apache.org/jira/browse/SPARK-35497 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, GitHub Actions doesn't have plotly installed and results in plotly > related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35497) Enable plotly tests in pandas-on-Spark
[ https://issues.apache.org/jira/browse/SPARK-35497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350298#comment-17350298 ] Apache Spark commented on SPARK-35497: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32649 > Enable plotly tests in pandas-on-Spark > -- > > Key: SPARK-35497 > URL: https://issues.apache.org/jira/browse/SPARK-35497 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, GitHub Actions doesn't have plotly installed and results in plotly > related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. !image-2021-05-24-16-15-18-359.png!!image-2021-05-24-16-05-34-334.png! # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. !image-2021-05-24-16-11-20-841.png! 
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new code cannot match the cache key so that new code need to be > compiled again which cost some time. > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > # The time cost for compile is increasing with the growth of column number, > for wide table, this cost can more than 2s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350297#comment-17350297 ] L. C. Hsieh commented on SPARK-35449: - This issue was fixed at https://github.com/apache/spark/pull/32595. > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Adam Binford >Priority: Major > Fix For: 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-35449. - Fix Version/s: 3.2.0 Resolution: Fixed > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35449: --- Assignee: Adam Binford > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Adam Binford >Priority: Major > Fix For: 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
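The hazard described in SPARK-35449 can be reproduced without Spark. Below is a toy Python model of the two evaluation strategies, not Spark's actual subexpression-elimination code: hoisting a branch-only expression out of a conditional evaluates it even when its guarding condition is false, which is exactly what breaks a UDF that assumes its guard holds.

```python
# Toy model: a "UDF" that is only safe to call when its guard (x < 0) holds.
def my_udf(x):
    if x >= 0:
        raise ValueError("udf assumes id < 0")
    return -x

def case_when_lazy(x):
    # Correct CASE WHEN semantics: the branch value is computed only
    # when its condition is true.
    return my_udf(x) if x < 0 else None

def case_when_hoisted(x):
    # Buggy behaviour: the "common subexpression" is evaluated up front,
    # unconditionally, before the condition is checked.
    sub = my_udf(x)
    return sub if x < 0 else None

assert case_when_lazy(0) is None      # guard false: UDF never runs
assert case_when_lazy(-2) == 2        # guard true: UDF runs safely

try:
    case_when_hoisted(0)
    hoisted_ok = True
except ValueError:
    hoisted_ok = False
assert not hoisted_ok                 # eager evaluation trips the UDF
```

This mirrors the report's example, where `myUdf($"id")` was extracted as a subexpression and executed even though neither condition passed.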
[jira] [Created] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance
Yahui Liu created SPARK-35500: - Summary: GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance Key: SPARK-35500 URL: https://issues.apache.org/jira/browse/SPARK-35500 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yahui Liu Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. !image-2021-05-24-16-15-18-359.png!!image-2021-05-24-16-05-34-334.png! # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. !image-2021-05-24-16-11-20-841.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks
[ https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350290#comment-17350290 ] Myeongju Kim commented on SPARK-34927: -- Thank you for your response! I'll take a look at it and update the Jira accordingly > Support TPCDSQueryBenchmark in Benchmarks > - > > Key: SPARK-34927 > URL: https://issues.apache.org/jira/browse/SPARK-34927 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Benchmarks.scala currently does not support TPCDSQueryBenchmark. We should > make it supported. See also > https://github.com/apache/spark/pull/32015#issuecomment-89046 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35499) Apply black to pandas API on Spark codes.
[ https://issues.apache.org/jira/browse/SPARK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-35499: Description: To make static analysis easier and more efficient, we should apply `black` to the pandas API on Spark. The Koalas project uses black in its [reformatting script|https://github.com/databricks/koalas/blob/master/dev/reformat]. was: Make it easier and more efficient to static analysis, we'd better to apply `black` to the pandas API on Spark. Koalas project is using black for [reformatting script|[https://github.com/databricks/koalas/blob/master/dev/reformat].] > Apply black to pandas API on Spark codes. > - > > Key: SPARK-35499 > URL: https://issues.apache.org/jira/browse/SPARK-35499 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > To make static analysis easier and more efficient, we should apply > `black` to the pandas API on Spark. > The Koalas project uses black in its [reformatting > script|https://github.com/databricks/koalas/blob/master/dev/reformat]. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35499) Apply black to pandas API on Spark codes.
Haejoon Lee created SPARK-35499: --- Summary: Apply black to pandas API on Spark codes. Key: SPARK-35499 URL: https://issues.apache.org/jira/browse/SPARK-35499 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Haejoon Lee Make it easier and more efficient to static analysis, we'd better to apply `black` to the pandas API on Spark. Koalas project is using black for [reformatting script|[https://github.com/databricks/koalas/blob/master/dev/reformat].] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35449: Affects Version/s: 3.1.1 > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350273#comment-17350273 ] Apache Spark commented on SPARK-35498: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/32644 > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > In pyspark, user may create some threads, not via `Thread` object, but via > some parallel helper function such as: > `thread_pool.imap_unordered` > In this case, we need create a function wrapper, used in pin thread mode, The > wrapper function, before calling original thread target, it inherits the > inheritable properties specific to JVM thread such as > ``InheritableThreadLocal``, and after original thread target return, > garbage-collects the Python thread instance and also closes the connection > which finishes JVM thread correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350272#comment-17350272 ] Apache Spark commented on SPARK-35498: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/32644 > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > In pyspark, user may create some threads, not via `Thread` object, but via > some parallel helper function such as: > `thread_pool.imap_unordered` > In this case, we need create a function wrapper, used in pin thread mode, The > wrapper function, before calling original thread target, it inherits the > inheritable properties specific to JVM thread such as > ``InheritableThreadLocal``, and after original thread target return, > garbage-collects the Python thread instance and also closes the connection > which finishes JVM thread correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35498: Assignee: Apache Spark > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > > In pyspark, user may create some threads, not via `Thread` object, but via > some parallel helper function such as: > `thread_pool.imap_unordered` > In this case, we need create a function wrapper, used in pin thread mode, The > wrapper function, before calling original thread target, it inherits the > inheritable properties specific to JVM thread such as > ``InheritableThreadLocal``, and after original thread target return, > garbage-collects the Python thread instance and also closes the connection > which finishes JVM thread correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35498: Assignee: (was: Apache Spark) > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > In pyspark, user may create some threads, not via `Thread` object, but via > some parallel helper function such as: > `thread_pool.imap_unordered` > In this case, we need create a function wrapper, used in pin thread mode, The > wrapper function, before calling original thread target, it inherits the > inheritable properties specific to JVM thread such as > ``InheritableThreadLocal``, and after original thread target return, > garbage-collects the Python thread instance and also closes the connection > which finishes JVM thread correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-35498: --- Description: In PySpark, a user may create threads not via a `Thread` object but via a parallel helper function such as `thread_pool.imap_unordered`. In this case, we need to create a function wrapper for pin thread mode. Before calling the original thread target, the wrapper inherits the inheritable properties specific to the JVM thread, such as ``InheritableThreadLocal``; after the original thread target returns, it garbage-collects the Python thread instance and also closes the connection, which finishes the JVM thread correctly. was:For inheritable_thread_target > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > In PySpark, a user may create threads not via a `Thread` object but via a > parallel helper function such as `thread_pool.imap_unordered`. In this case, > we need to create a function wrapper for pin thread mode. Before calling the > original thread target, the wrapper inherits the inheritable properties > specific to the JVM thread, such as ``InheritableThreadLocal``; after the > original thread target returns, it garbage-collects the Python thread > instance and also closes the connection, which finishes the JVM thread > correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-35498: --- Description: For inheritable_thread_target > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > For inheritable_thread_target -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
Weichen Xu created SPARK-35498: -- Summary: Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode Key: SPARK-35498 URL: https://issues.apache.org/jira/browse/SPARK-35498 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Weichen Xu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
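The wrapper idea proposed in SPARK-35498 can be sketched in plain Python. This is a hypothetical illustration of the mechanism, not PySpark's actual `inheritable_thread_target` implementation: `_local.props` and `"jobGroup"` are invented names standing in for the JVM-side inheritable properties, and the `finally` cleanup stands in for closing the connection that finishes the JVM thread.

```python
import threading
from multiprocessing.pool import ThreadPool

# Stand-in for per-thread state that should be inherited by pool workers.
_local = threading.local()

def inheritable_thread_target(target):
    # Capture the "inheritable" properties NOW, on the parent thread that
    # creates the wrapper (pool workers would otherwise see nothing).
    captured = getattr(_local, "props", {}).copy()

    def wrapped(*args, **kwargs):
        _local.props = captured       # install parent-thread state in the worker
        try:
            return target(*args, **kwargs)
        finally:
            del _local.props          # cleanup after the target returns

    return wrapped

# Parent thread sets a property, then hands work to a pool via imap-style APIs.
_local.props = {"jobGroup": "g1"}

def job(x):
    return (_local.props["jobGroup"], x * 2)

with ThreadPool(2) as pool:
    results = pool.map(inheritable_thread_target(job), [1, 2])
assert results == [("g1", 2), ("g1", 4)]
```

Without the wrapper, `job` would raise inside the pool threads, since `threading.local` state is not shared across threads; the wrapper is what makes `thread_pool.imap_unordered`-style helpers usable under pin thread mode.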
[jira] [Created] (SPARK-35497) Enable plotly tests in pandas-on-Spark
Hyukjin Kwon created SPARK-35497: Summary: Enable plotly tests in pandas-on-Spark Key: SPARK-35497 URL: https://issues.apache.org/jira/browse/SPARK-35497 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Hyukjin Kwon Currently, GitHub Actions doesn't have plotly installed and results in plotly related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35381) Fix lambda variable name issues in nested DataFrame functions in R APIs
[ https://issues.apache.org/jira/browse/SPARK-35381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35381: --- Assignee: Hyukjin Kwon > Fix lambda variable name issues in nested DataFrame functions in R APIs > --- > > Key: SPARK-35381 > URL: https://issues.apache.org/jira/browse/SPARK-35381 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.1.1 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Labels: correctness > Fix For: 3.1.2, 3.2.0 > > > R's higher order functions also have the same problem with SPARK-34794: > {code} > df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters") > collect(select( > df, > array_transform("numbers", function(number) { > array_transform("letters", function(latter) { > struct(alias(number, "n"), alias(latter, "l")) > }) > }) > )) > {code} > {code} > transform(numbers, lambdafunction(transform(letters, > lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS > l), namedlambdavariable())), namedlambdavariable())) > 1 > a, a, b, b, c, c, a, a, > b, b, c, c, a, a, b, b, c, c > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
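The root cause in SPARK-35381 can be mimicked in a few lines of Python. This is a hypothetical model, not SparkR's code generation: when both nested lambdas are compiled down to the same `namedlambdavariable()` slot, the inner binding shadows the outer one, so the outer element is lost by the time the struct is built, matching the duplicated-letters output quoted above.

```python
# Toy model of nested transforms whose lambdas share ONE variable slot,
# instead of each lambda getting its own distinct variable.
env = {}

def nested_transform_shared_slot(numbers, letters):
    rows = []
    for n in numbers:
        env["lambda_var"] = n          # outer lambda binds its element
        for l in letters:
            env["lambda_var"] = l      # inner lambda clobbers the SAME slot
            # struct(number, letter): "number" resolves through the shared
            # slot, so it wrongly yields the letter.
            rows.append({"n": env["lambda_var"], "l": l})
    return rows

rows = nested_transform_shared_slot([1, 2], ["a", "b"])
# Every "n" field holds a letter instead of a number, as in the bug report.
assert [r["n"] for r in rows] == ["a", "b", "a", "b"]
assert [r["l"] for r in rows] == ["a", "b", "a", "b"]
```

The fix is the obvious inverse: give each lambda its own uniquely named variable so the inner transform cannot shadow the outer element.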
[jira] [Assigned] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35496:
------------------------------------

    Assignee: Apache Spark

> Upgrade Scala 2.13 from 2.13.5 to 2.13.6
> -----------------------------------------
>
>                 Key: SPARK-35496
>                 URL: https://issues.apache.org/jira/browse/SPARK-35496
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 3.2.0
>            Reporter: Yang Jie
>            Assignee: Apache Spark
>            Priority: Major
>
> Scala 2.13.6 released (https://github.com/scala/scala/releases/tag/v2.13.6)
[jira] [Commented] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350267#comment-17350267 ]

Apache Spark commented on SPARK-35496:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32648

> Upgrade Scala 2.13 from 2.13.5 to 2.13.6
> -----------------------------------------
>
>                 Key: SPARK-35496
>                 URL: https://issues.apache.org/jira/browse/SPARK-35496
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 3.2.0
>            Reporter: Yang Jie
>            Priority: Major
>
> Scala 2.13.6 released (https://github.com/scala/scala/releases/tag/v2.13.6)
[jira] [Assigned] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35496:
------------------------------------

    Assignee: (was: Apache Spark)

> Upgrade Scala 2.13 from 2.13.5 to 2.13.6
> -----------------------------------------
>
>                 Key: SPARK-35496
>                 URL: https://issues.apache.org/jira/browse/SPARK-35496
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 3.2.0
>            Reporter: Yang Jie
>            Priority: Major
>
> Scala 2.13.6 released (https://github.com/scala/scala/releases/tag/v2.13.6)
[jira] [Created] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
Yang Jie created SPARK-35496:
--------------------------------

             Summary: Upgrade Scala 2.13 from 2.13.5 to 2.13.6
                 Key: SPARK-35496
                 URL: https://issues.apache.org/jira/browse/SPARK-35496
             Project: Spark
          Issue Type: Sub-task
          Components: Build
    Affects Versions: 3.2.0
            Reporter: Yang Jie

Scala 2.13.6 released (https://github.com/scala/scala/releases/tag/v2.13.6)
[jira] [Commented] (SPARK-34981) Implement V2 function resolution and evaluation
[ https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350257#comment-17350257 ]

Apache Spark commented on SPARK-34981:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32647

> Implement V2 function resolution and evaluation
> ------------------------------------------------
>
>                 Key: SPARK-34981
>                 URL: https://issues.apache.org/jira/browse/SPARK-34981
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>             Fix For: 3.2.0
>
> This is a follow-up of SPARK-27658. With the FunctionCatalog API done, this aims to implement function resolution (in the analyzer) and evaluation, by wrapping the resolved functions in the corresponding expressions.
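Conceptually, the work this issue describes has two halves: the analyzer resolves a function name against a catalog and binds it to concrete argument types, and the bound function is then wrapped in an expression that evaluates it per row. The toy sketch below illustrates that flow; all names (`Catalog`, `BoundFunction`, `Invoke`) are illustrative stand-ins, not Spark's actual V2 FunctionCatalog API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class BoundFunction:
    """A function bound to concrete input types (illustrative)."""
    name: str
    input_types: List[str]
    impl: Callable

class Catalog:
    """Toy function catalog mapping a name to a registered function."""
    def __init__(self):
        self._funcs: Dict[str, BoundFunction] = {}

    def register(self, fn: BoundFunction):
        self._funcs[fn.name] = fn

    def resolve(self, name: str, arg_types: List[str]) -> BoundFunction:
        # The analyzer step: look the name up and check that the call's
        # argument types match the function's declared input types.
        fn = self._funcs.get(name)
        if fn is None or fn.input_types != arg_types:
            raise ValueError(f"cannot resolve {name}({', '.join(arg_types)})")
        return fn

@dataclass
class Invoke:
    """Expression wrapping a bound function for row-by-row evaluation."""
    fn: BoundFunction

    def eval(self, *args):
        return self.fn.impl(*args)

# Resolution followed by evaluation:
catalog = Catalog()
catalog.register(BoundFunction("strlen", ["string"], lambda s: len(s)))
expr = Invoke(catalog.resolve("strlen", ["string"]))
```

The point of the two-phase split is that type mismatches surface at analysis time (`resolve`), while `eval` stays a cheap per-row call on an already-bound function.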
[jira] [Commented] (SPARK-34981) Implement V2 function resolution and evaluation
[ https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350256#comment-17350256 ]

Apache Spark commented on SPARK-34981:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32647

> Implement V2 function resolution and evaluation
> ------------------------------------------------
>
>                 Key: SPARK-34981
>                 URL: https://issues.apache.org/jira/browse/SPARK-34981
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>             Fix For: 3.2.0
>
> This is a follow-up of SPARK-27658. With the FunctionCatalog API done, this aims to implement function resolution (in the analyzer) and evaluation, by wrapping the resolved functions in the corresponding expressions.