[jira] [Updated] (SPARK-35502) Spark Executor metrics are not produced/showed
[ https://issues.apache.org/jira/browse/SPARK-35502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mati updated SPARK-35502: - Description: Recently we enabled the prometheusServlet configuration in order to have Spark master, worker, driver and executor metrics. We can see and use the Spark master, worker and driver metrics, but we can't see the Spark executor metrics. We are running a Spark Streaming standalone cluster, version 3.0.1, on physical servers. We took one of our jobs and added the following parameters to the job configuration, but couldn't see executor metrics when curling both the driver and the executor workers of this job. These are the parameters:
--conf spark.ui.prometheus.enabled=true \
--conf spark.executor.processTreeMetrics.enabled=true
Curl commands:
[00764f](root@sparktest-40005-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[00764f](root@sparktest-40005-prod-chidc2:~)#
Driver of this job - sparktest-40004:
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
Our UI port is 4050. I understand that the executor Prometheus endpoint is still experimental, which may explain the inconsistent behaviour we see, but is there a plan to fix it? Are there any known issues regarding this?

was: Recently we enabled the prometheusServlet configuration in order to have Spark master, worker, driver and executor metrics. We can see and use the Spark master, worker and driver metrics, but we can't see the Spark executor metrics. We are running a Spark Streaming standalone cluster, version 3.0.1, on physical servers.
We took one of our jobs and added the following parameters to the job configuration, but couldn't see executor metrics when curling both the driver and the executor workers of this job. These are the parameters:
--conf spark.ui.prometheus.enabled=true \
--conf spark.executor.processTreeMetrics.enabled=true
Curl commands:
[00764f](root@sparktest-40005-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[00764f](root@sparktest-40005-prod-chidc2:~)#
Driver of this job - sparktest-40004:
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
Out UI port is on 4050. I understood that the executor Prometheus endpoint is still experimental, which may explain the inconsistent behaviour we see, but is there a plan to fix it? Are there any known issues regarding this?
Environment: metrics.properties
{code:java}
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus

## Below may be removed after finalizing the native Prometheus implementation
#
# Enable Prometheus for driver
#
driver.sink.prometheus_chidc2.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
driver.sink.prometheus_chidc2.report-instance-id=false
# Prometheus pushgateway address
driver.sink.prometheus_chidc2.pushgateway-address-protocol=http
driver.sink.prometheus_chidc2.pushgateway-address=pushgateway-spark-master-staging-pod.service.chidc2.consul:9091
driver.sink.prometheus_chidc2.period=60
driver.sink.prometheus_chidc2.pushgateway-enable-timestamp=false
driver.sink.prometheus_chidc2.labels=cluster_name=apache-test-v3,datacenter=chidc2
driver.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001-prod-chidc2.chidc2.outbrain.com
driver.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
driver.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
driver.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops
{code}
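The metrics-name-capture-regex / metrics-name-replacement pair in the sink configuration above can be sanity-checked outside Spark. A rough sketch with Python's re module against one made-up metric name (the Banzai Cloud sink's exact matching semantics may differ):

```python
import re

# First capture pattern from the sink config above; the metric name
# below is a hypothetical example in the shape the pattern expects.
capture = re.compile(r"application_(\S+)_([0-9]+)_cores")
metric = "application_myapp_1621900000_cores"

m = capture.fullmatch(metric)
# The replacement rewrites to a stable __name__ plus a start_time label,
# which is what makes the metric queryable across application restarts.
renamed = f"application_{m.group(1)}_cores"
start_time = m.group(2)
print(renamed, start_time)  # application_myapp_cores 1621900000
```

Note that the greedy `\S+` still yields the application name alone, because the regex engine backtracks until the trailing `_([0-9]+)_cores` can match.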
[jira] [Assigned] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-35506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35506: Assignee: Hyukjin Kwon (was: Apache Spark) > Run tests with Python 3.9 in GitHub Actions > --- > > Key: SPARK-35506 > URL: https://issues.apache.org/jira/browse/SPARK-35506 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We're currently running PySpark tests with Python 3.8. We should run it with > Python 3.9 to verify the latest python support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-35506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350783#comment-17350783 ] Apache Spark commented on SPARK-35506: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32657 > Run tests with Python 3.9 in GitHub Actions > --- > > Key: SPARK-35506 > URL: https://issues.apache.org/jira/browse/SPARK-35506 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We're currently running PySpark tests with Python 3.8. We should run it with > Python 3.9 to verify the latest python support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-35506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35506: Assignee: Apache Spark (was: Hyukjin Kwon) > Run tests with Python 3.9 in GitHub Actions > --- > > Key: SPARK-35506 > URL: https://issues.apache.org/jira/browse/SPARK-35506 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > We're currently running PySpark tests with Python 3.8. We should run it with > Python 3.9 to verify the latest python support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35499) Apply black to pandas API on Spark codes.
[ https://issues.apache.org/jira/browse/SPARK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-35499: Parent: SPARK-34849 Issue Type: Sub-task (was: Improvement) > Apply black to pandas API on Spark codes. > - > > Key: SPARK-35499 > URL: https://issues.apache.org/jira/browse/SPARK-35499 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > To make static analysis easier and more efficient, we should apply > `black` to the pandas API on Spark. > The Koalas project uses black in its [reformatting > script|https://github.com/databricks/koalas/blob/master/dev/reformat]. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35433) Move CSV data source options from Python and Scala into a single page.
[ https://issues.apache.org/jira/browse/SPARK-35433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350771#comment-17350771 ] Haejoon Lee commented on SPARK-35433: - I'm working on this > Move CSV data source options from Python and Scala into a single page. > -- > > Key: SPARK-35433 > URL: https://issues.apache.org/jira/browse/SPARK-35433 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Refer to https://issues.apache.org/jira/browse/SPARK-34491 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350769#comment-17350769 ] Hyukjin Kwon commented on SPARK-35504: -- Just a wild guess but: {quote} count(DISTINCT expr[, expr...]) {quote} doesn't count NULLs butL {quote} count(*) from (select distinct * from storage_datamart.olympiads) {quote} counts nulls? > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. 
> > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) > {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade > having count(*) = 1 > ) > ; -- Rows: 36901365 > {code} > > The table DDL is the following: > {code:sql} > CREATE TABLE `storage_datamart`.`olympiads` ( > `ptn_date` DATE, > `student_id` BIGINT, > `olympiad_id` STRING, > `grade` BIGINT, > `grade_type` STRING, >
[jira] [Comment Edited] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350769#comment-17350769 ] Hyukjin Kwon edited comment on SPARK-35504 at 5/25/21, 3:50 AM: Just a wild guess but: {quote} count(DISTINCT expr[, expr...]) {quote} doesn't count NULLs but: {quote} count\(*\) from (select distinct * from storage_datamart.olympiads) {quote} counts nulls? was (Author: hyukjin.kwon): Just a wild guess but: {quote} count(DISTINCT expr[, expr...]) {quote} doesn't count NULLs butL {quote} count(*) from (select distinct * from storage_datamart.olympiads) {quote} counts nulls? > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. 
> > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) > {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade >
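The NULL hypothesis in the comment above can be reproduced outside Spark. A minimal sketch using Python's built-in sqlite3 (table, columns, and values are made up; SQLite has no count(distinct *), so a single nullable column stands in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE olympiads (student_id INTEGER, grade INTEGER)")
# Three distinct rows, one of which is distinct only via a NULL column.
cur.executemany("INSERT INTO olympiads VALUES (?, ?)",
                [(1, 10), (1, None), (2, 20)])

# COUNT(DISTINCT col) ignores NULLs entirely.
distinct_col = cur.execute(
    "SELECT COUNT(DISTINCT grade) FROM olympiads").fetchone()[0]

# COUNT(*) over a DISTINCT subquery keeps rows that contain NULLs.
distinct_rows = cur.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM olympiads)").fetchone()[0]

print(distinct_col)   # 2 -- the NULL grade is not counted
print(distinct_rows)  # 3 -- the row with the NULL grade is kept
```

If the Spark table has NULLs in any column, the same asymmetry would explain why the two query forms disagree.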
[jira] [Closed] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query perf
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu closed SPARK-35500. - > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # Create a new table with an array-type column: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell and run a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but its code cannot be reused, because the ids of two variables in > the generated class change each time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has > been cached, the new code does not match the cache key and must be compiled > again, which costs time. Compilation time increases with the number of > columns; for a wide table this cost can exceed 2s. 
> {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue1; > private boolean MapObjects_loopIsNull1; > private UTF8String MapObjects_loopValue2; > private boolean MapObjects_loopIsNull2; > } > {code} > Second time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue3; > private boolean MapObjects_loopIsNull3; > private UTF8String MapObjects_loopValue4; > private boolean MapObjects_loopIsNull4; > }{code} > Expectation: > The code generated by GenerateSafeProjection can be reused if the query is > same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
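The cache-miss mechanism described in the issue can be sketched outside Spark. This is a hypothetical Python stand-in for a source-keyed compilation cache (Spark's real cache lives in CodeGenerator and compiles with Janino); it only illustrates why embedding a fresh counter in the generated source defeats the cache:

```python
import itertools

_ids = itertools.count()  # mimics MapObjects' AtomicInteger curId

def generate_source(reuse_ids: bool) -> str:
    # With reuse_ids=False, every call bakes a fresh id into the source,
    # like MapObjects_loopValue$id in the generated SpecificSafeProjection.
    i = 0 if reuse_ids else next(_ids)
    return f"class SpecificSafeProjection:\n    loop_value_{i} = None\n"

cache = {}

def compile_cached(src: str):
    # Cache keyed on the generated source text, as a codegen cache would be.
    if src not in cache:
        cache[src] = compile(src, "<generated>", "exec")
    return cache[src]

# Per-call ids: the "same" query produces different source every time,
# so the cache never hits and each call recompiles.
compile_cached(generate_source(reuse_ids=False))
compile_cached(generate_source(reuse_ids=False))
print(len(cache))  # 2

# Stable ids: identical source, so the second call is a cache hit.
cache.clear()
compile_cached(generate_source(reuse_ids=True))
compile_cached(generate_source(reuse_ids=True))
print(len(cache))  # 1
```

SPARK-27871, which the reporter identifies as the fix, makes the generated names stable so that the second case applies.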
[jira] [Created] (SPARK-35507) Move Python 3.9 installation to the docker image for GitHub Actions
Hyukjin Kwon created SPARK-35507: Summary: Move Python 3.9 installation to the docker image for GitHub Actions Key: SPARK-35507 URL: https://issues.apache.org/jira/browse/SPARK-35507 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.2.0 Reporter: Hyukjin Kwon SPARK-35506 added Python 3.9 support but it had to install it manually. The installed packages and Python versions should go into the docker image. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35506) Run tests with Python 3.9 in GitHub Actions
Hyukjin Kwon created SPARK-35506: Summary: Run tests with Python 3.9 in GitHub Actions Key: SPARK-35506 URL: https://issues.apache.org/jira/browse/SPARK-35506 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon We're currently running PySpark tests with Python 3.8. We should run it with Python 3.9 to verify the latest python support. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35497) Enable plotly tests in pandas-on-Spark
[ https://issues.apache.org/jira/browse/SPARK-35497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35497. -- Fix Version/s: 3.2.0 Assignee: Hyukjin Kwon Resolution: Fixed Fixed in https://github.com/apache/spark/pull/32649 > Enable plotly tests in pandas-on-Spark > -- > > Key: SPARK-35497 > URL: https://issues.apache.org/jira/browse/SPARK-35497 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.2.0 > > > Currently, GitHub Actions doesn't have plotly installed and results in plotly > related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query pe
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu resolved SPARK-35500. --- Resolution: Duplicate > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # Create a new table with an array-type column: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell and run a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but its code cannot be reused, because the ids of two variables in > the generated class change each time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has > been cached, the new code does not match the cache key and must be compiled > again, which costs time. Compilation time increases with the number of > columns; for a wide table this cost can exceed 2s. 
> {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue1; > private boolean MapObjects_loopIsNull1; > private UTF8String MapObjects_loopValue2; > private boolean MapObjects_loopIsNull2; > } > {code} > Second time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue3; > private boolean MapObjects_loopIsNull3; > private UTF8String MapObjects_loopValue4; > private boolean MapObjects_loopIsNull4; > }{code} > Expectation: > The code generated by GenerateSafeProjection can be reused if the query is > same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Affects Version/s: (was: 3.1.0) 2.4.5 > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # Create a new table with an array-type column: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell and run a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but its code cannot be reused, because the ids of two variables in > the generated class change each time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has > been cached, the new code does not match the cache key and must be compiled > again, which costs time. Compilation time increases with the number of > columns; for a wide table this cost can exceed 2s. 
> {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue1; > private boolean MapObjects_loopIsNull1; > private UTF8String MapObjects_loopValue2; > private boolean MapObjects_loopIsNull2; > } > {code} > Second time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue3; > private boolean MapObjects_loopIsNull3; > private UTF8String MapObjects_loopValue4; > private boolean MapObjects_loopIsNull4; > }{code} > Expectation: > The code generated by GenerateSafeProjection can be reused if the query is > same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query p
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350746#comment-17350746 ] Yahui Liu commented on SPARK-35500: --- [~maropu] Hi, sorry for the incorrect information: I thought I was using 3.1.1, but the jars I am using are actually compiled from 2.4.5 code. That is why this issue exists; SPARK-27871 is the solution I want, so I can close this Jira. Thanks. > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # Create a new table with an array-type column: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell and run a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but its code cannot be reused, because the ids of two variables in > the generated class change each time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has > been cached, the new code does not match the cache key and must be compiled > again, which costs time. Compilation time increases with the number of > columns; for a wide table this cost can exceed 2s. 
> {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue1; > private boolean MapObjects_loopIsNull1; > private UTF8String MapObjects_loopValue2; > private boolean MapObjects_loopIsNull2; > } > {code} > Second time run: > {code:java} > class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > private int MapObjects_loopValue3; > private boolean MapObjects_loopIsNull3; > private UTF8String MapObjects_loopValue4; > private boolean MapObjects_loopIsNull4; > }{code} > Expectation: > The code generated by GenerateSafeProjection can be reused if the query is > same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23101) Migrate unit test sinks
[ https://issues.apache.org/jira/browse/SPARK-23101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-23101: - Labels: (was: bulk-closed) > Migrate unit test sinks > --- > > Key: SPARK-23101 > URL: https://issues.apache.org/jira/browse/SPARK-23101 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major >
[jira] [Updated] (SPARK-23100) Migrate unit test sources
[ https://issues.apache.org/jira/browse/SPARK-23100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-23100: - Labels: (was: bulk-closed) > Migrate unit test sources > - > > Key: SPARK-23100 > URL: https://issues.apache.org/jira/browse/SPARK-23100 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major >
[jira] [Updated] (SPARK-23100) Migrate unit test sources
[ https://issues.apache.org/jira/browse/SPARK-23100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-23100: - Labels: bulk-closed (was: ) > Migrate unit test sources > - > > Key: SPARK-23100 > URL: https://issues.apache.org/jira/browse/SPARK-23100 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > Labels: bulk-closed >
[jira] [Updated] (SPARK-23101) Migrate unit test sinks
[ https://issues.apache.org/jira/browse/SPARK-23101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-23101: - Labels: bulk-closed (was: ) > Migrate unit test sinks > --- > > Key: SPARK-23101 > URL: https://issues.apache.org/jira/browse/SPARK-23101 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > Labels: bulk-closed >
[jira] [Resolved] (SPARK-35498) Add an API "inheritable_thread_target" which returns a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35498. -- Fix Version/s: 3.2.0 Assignee: Weichen Xu Resolution: Fixed Fixed in https://github.com/apache/spark/pull/32644 > Add an API "inheritable_thread_target" which returns a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.2.0 > > > In PySpark, a user may create threads not via a `Thread` object, but via a > parallel helper function such as `thread_pool.imap_unordered`. > In this case, we need to create a function wrapper used in pin thread mode. Before > calling the original thread target, the wrapper inherits the > properties specific to the JVM thread, such as > `InheritableThreadLocal`; after the original thread target returns, it > garbage-collects the Python thread instance and also closes the connection, > which finishes the JVM thread correctly.
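A rough sketch, in plain Python, of what such a wrapper does (the names `wrap_target` and `inherited_props` are hypothetical illustrations, not the actual PySpark API):

```python
import threading

def wrap_target(target, inherited_props):
    """Hypothetical sketch: wrap a thread target so that properties captured
    in the parent thread are visible inside the child, then cleaned up."""
    def wrapped(*args, **kwargs):
        # Before the original target runs, install the inherited state
        # (PySpark copies JVM InheritableThreadLocal properties at this point).
        threading.current_thread().props = dict(inherited_props)
        try:
            return target(*args, **kwargs)
        finally:
            # After the target returns, drop the per-thread state
            # (PySpark additionally closes the connection to finish the JVM thread).
            del threading.current_thread().props
    return wrapped

results = []

def job(x):
    # Reads state that only exists because the wrapper installed it.
    results.append((x, threading.current_thread().props["tag"]))

t = threading.Thread(target=wrap_target(job, {"tag": "parent"}), args=(1,))
t.start()
t.join()
```

The key point is that the wrapper, not the caller, owns both the setup and the teardown, so it works even when the thread is created indirectly by a pool helper.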
[jira] [Commented] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
[ https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350706#comment-17350706 ] Apache Spark commented on SPARK-35505: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/32656 > Remove APIs that have been deprecated in Koalas. > > > Key: SPARK-35505 > URL: https://issues.apache.org/jira/browse/SPARK-35505 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some APIs that have been deprecated in Koalas. We shouldn't have > those in pandas APIs on Spark.
[jira] [Assigned] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
[ https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35505: Assignee: (was: Apache Spark) > Remove APIs that have been deprecated in Koalas. > > > Key: SPARK-35505 > URL: https://issues.apache.org/jira/browse/SPARK-35505 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some APIs that have been deprecated in Koalas. We shouldn't have > those in pandas APIs on Spark.
[jira] [Assigned] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
[ https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35505: Assignee: Apache Spark > Remove APIs that have been deprecated in Koalas. > > > Key: SPARK-35505 > URL: https://issues.apache.org/jira/browse/SPARK-35505 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > There are some APIs that have been deprecated in Koalas. We shouldn't have > those in pandas APIs on Spark.
[jira] [Commented] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
[ https://issues.apache.org/jira/browse/SPARK-35505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350692#comment-17350692 ] Takuya Ueshin commented on SPARK-35505: --- I'm working on this. > Remove APIs that have been deprecated in Koalas. > > > Key: SPARK-35505 > URL: https://issues.apache.org/jira/browse/SPARK-35505 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some APIs that have been deprecated in Koalas. We shouldn't have > those in pandas APIs on Spark.
[jira] [Created] (SPARK-35505) Remove APIs that have been deprecated in Koalas.
Takuya Ueshin created SPARK-35505: - Summary: Remove APIs that have been deprecated in Koalas. Key: SPARK-35505 URL: https://issues.apache.org/jira/browse/SPARK-35505 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Takuya Ueshin There are some APIs that have been deprecated in Koalas. We shouldn't have those in pandas APIs on Spark.
[jira] [Updated] (SPARK-35452) Introduce ArrayOps, MapOps and StructOps
[ https://issues.apache.org/jira/browse/SPARK-35452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-35452: - Summary: Introduce ArrayOps, MapOps and StructOps (was: ntroduce ArrayOps, MapOps and StructOps) > Introduce ArrayOps, MapOps and StructOps > > > Key: SPARK-35452 > URL: https://issues.apache.org/jira/browse/SPARK-35452 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > StructType, ArrayType, and MapType are not accepted by DataTypeOps now. > We should handle these complex types. > Please note that arithmetic operations might be applied to these complex > types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work > the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
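The pandas behavior the ticket description cites can be checked directly in plain pandas (a small sketch, assuming pandas is installed; `summed` is an illustrative name):

```python
import pandas as pd

# Plain pandas: `+` on object-dtype Series of lists applies Python's
# list concatenation element-wise, which is what the ticket asks
# ArrayOps to reproduce for ps.Series.
summed = pd.Series([[1, 2, 3]]) + pd.Series([[4, 5, 6]])
```

Here `summed[0]` is `[1, 2, 3, 4, 5, 6]`, so an ArrayOps implementation would need to concatenate arrays element-wise rather than fail on the complex type.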
[jira] [Commented] (SPARK-33743) Change datatype mapping in JDBC mssqldialect: DATETIME to DATETIME2
[ https://issues.apache.org/jira/browse/SPARK-33743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350683#comment-17350683 ] Apache Spark commented on SPARK-33743: -- User 'luxu1-ms' has created a pull request for this issue: https://github.com/apache/spark/pull/32655 > Change datatype mapping in JDBC mssqldialect: DATETIME to DATETIME2 > --- > > Key: SPARK-33743 > URL: https://issues.apache.org/jira/browse/SPARK-33743 > Project: Spark > Issue Type: Request > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Lu Xu >Priority: Major > > *datetime v/s datetime2* > Spark's datetime type is [timestamptype|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/TimestampType.html], > which supports microsecond resolution. > > SQL Server supports two date-time types: > o *datetime* supports only millisecond resolution (0 to 999). > o *datetime2* is an extension of datetime; it is compatible with datetime and > supports 0 to 999 sub-second resolution. > Currently, [MsSQLServerDialect|https://github.com/apache/spark/blob/bfb257f078854ad587a9e2bfe548cdb7bf8786d4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala] > maps timestamptype to datetime. 
This results in errors when writing. > *+Current+* > {code:java} > override def getJDBCType(dt: DataType): Option[JdbcType] = dt match { case > TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP)) .. } > {code} > > *+Proposal+* > {code:java} > override def getJDBCType(dt: DataType): Option[JdbcType] = dt match { case > TimestampType => Some(JdbcType("DATETIME2", java.sql.Types.TIMESTAMP)) .. } > {code} >
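A minimal sketch of the precision loss being described, using plain Python `datetime` (the helper name is hypothetical; SQL Server's legacy DATETIME actually rounds to roughly 1/300 s increments, so millisecond truncation is an optimistic model):

```python
from datetime import datetime

# Hypothetical helper modeling the most a legacy DATETIME column can keep.
def truncate_to_millis(dt):
    return dt.replace(microsecond=(dt.microsecond // 1000) * 1000)

ts = datetime(2021, 5, 24, 12, 0, 0, 123456)  # Spark-side microsecond value
stored = truncate_to_millis(ts)  # what survives a DATETIME round-trip
```

`stored.microsecond` is `123000`: the trailing 456 microseconds are silently dropped, while a DATETIME2 column would keep the full `123456` and make the round-trip lossless.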
[jira] [Updated] (SPARK-35452) ntroduce ArrayOps, MapOps and StructOps
[ https://issues.apache.org/jira/browse/SPARK-35452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35452: - Summary: ntroduce ArrayOps, MapOps and StructOps (was: Introduce ComplexOps for complex types) > ntroduce ArrayOps, MapOps and StructOps > --- > > Key: SPARK-35452 > URL: https://issues.apache.org/jira/browse/SPARK-35452 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > StructType, ArrayType, and MapType are not accepted by DataTypeOps now. > We should handle these complex types. > Please note that arithmetic operations might be applied to these complex > types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work > the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350629#comment-17350629 ] L. C. Hsieh commented on SPARK-33121: - Hmm, I cannot reproduce in branch-3.1/2.4 or master branch. If you don't have Thread.sleep(5000), does it stop gracefully? > Spark Streaming 3.1.1 hangs on shutdown > --- > > Key: SPARK-33121 > URL: https://issues.apache.org/jira/browse/SPARK-33121 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 3.1.1 >Reporter: Dmitry Tverdokhleb >Priority: Major > Labels: Streaming, hang, shutdown > > Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem > in graceful shutdown. > Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". > Here is the code: > {code:java} > inputStream.foreachRDD { > rdd => > rdd.foreachPartition { > Thread.sleep(5000) > } > } > {code} > I send a SIGTERM signal to stop the spark streaming and after sleeping an > exception arises: > {noformat} > streaming-agg-tds-data_1 | java.util.concurrent.RejectedExecutionException: > Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from > java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 1] > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) > streaming-agg-tds-data_1 | at > org.apache.spark.executor.Executor.launchTask(Executor.scala:270) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) > streaming-agg-tds-data_1 | at > 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach$(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach(IterableLike.scala:74) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach$(IterableLike.scala:73) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterable.foreach(Iterable.scala:56) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > streaming-agg-tds-data_1 | at java.lang.Thread.run(Thread.java:748) > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 WARN JobGenerator - Timed > out while stopping the job generator (timeout = 1) > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 INFO JobGenerator - Waited > for jobs 
to be processed and checkpoints to be written > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 INFO JobGenerator - Stopped > JobGenerator{noformat} > After this exception and "JobGenerator - Stopped JobGenerator" log, streaming > freezes, and halts by timeout (Config parameter > "hadoop.service.shutdown.timeout"). > Besides, there is no problem with the graceful shutdown in spark 2.4.5.
[jira] [Updated] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35449: Fix Version/s: 3.1.2 > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Adam Binford >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35452) Introduce ComplexOps for complex types
[ https://issues.apache.org/jira/browse/SPARK-35452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35452: - Summary: Introduce ComplexOps for complex types (was: Introduce ComplexTypeOps) > Introduce ComplexOps for complex types > -- > > Key: SPARK-35452 > URL: https://issues.apache.org/jira/browse/SPARK-35452 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > StructType, ArrayType, and MapType are not accepted by DataTypeOps now. > We should handle these complex types. > Please note that arithmetic operations might be applied to these complex > types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work > the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35452) Introduce ComplexTypeOps
[ https://issues.apache.org/jira/browse/SPARK-35452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-35452: - Description: StructType, ArrayType, and MapType are not accepted by DataTypeOps now. We should handle these complex types. Please note that arithmetic operations might be applied to these complex types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). was: StructType, ArrayType, MapType, and BinaryType are not accepted by DataTypeOps now. We should handle these complex types. Please note that arithmetic operations might be applied to these complex types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). > Introduce ComplexTypeOps > > > Key: SPARK-35452 > URL: https://issues.apache.org/jira/browse/SPARK-35452 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Priority: Major > > StructType, ArrayType, and MapType are not accepted by DataTypeOps now. > We should handle these complex types. > Please note that arithmetic operations might be applied to these complex > types, for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work > the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33428) conv UDF returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-33428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350563#comment-17350563 ] Wenchen Fan commented on SPARK-33428: - AFAIK this function is from MySQL and it's better to follow the MySQL behavior. MySQL returns the max unsigned long if the input string is too big, and Spark should follow it. However, seems Spark has different behavior in two cases: # MySQL allows leading spaces but Spark does not. # If the input string is way too long, Spark fails with ArrayIndexOutOfBoundException [~angerszhu] would you like to look into it? > conv UDF returns incorrect value > > > Key: SPARK-33428 > URL: https://issues.apache.org/jira/browse/SPARK-33428 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {noformat} > spark-sql> select java_method('scala.math.BigInt', 'apply', > 'c8dcdfb41711fc9a1f17928001d7fd61', 16); > 266992441711411603393340504520074460513 > spark-sql> select conv('c8dcdfb41711fc9a1f17928001d7fd61', 16, 10); > 18446744073709551615 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
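The two numbers in the reproduction can be checked in plain Python: the hex string encodes a 128-bit value far above the unsigned 64-bit range, and a MySQL-compatible `conv` clamps such out-of-range input to the max unsigned long (variable names here are illustrative):

```python
# Reproducing the report's two results in plain Python.
value = int("c8dcdfb41711fc9a1f17928001d7fd61", 16)
max_u64 = 2**64 - 1  # 18446744073709551615, the max unsigned long

# java_method('scala.math.BigInt', 'apply', ...) returns the exact value,
# while MySQL-style conv() clamps anything above max_u64 down to it.
clamped = min(value, max_u64)
```

`value` is `266992441711411603393340504520074460513` and `clamped` is `18446744073709551615`, matching the two query outputs in the description; the open question in the comment is about leading spaces and the out-of-bounds failure on very long inputs, not this clamping itself.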
[jira] [Commented] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350555#comment-17350555 ] Nikolay Sokolov commented on SPARK-35504: - Added execution plans for two SparkSQL queries: * [^SPARK-35504_first_query_plan.log] * [^SPARK-35504_second_query_plan.log] > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. 
> > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) > {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade > having count(*) = 1 > ) > ; -- Rows: 36901365 > {code} > > The table DDL is the following: > {code:sql} > CREATE TABLE `storage_datamart`.`olympiads` ( > `ptn_date` DATE, > `student_id` BIGINT, > `olympiad_id` STRING, > `grade` BIGINT, > `grade_type` STRING, > `tour` STRING, > `created_at` TIMESTAMP, > `created_at_local`
[jira] [Updated] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Sokolov updated SPARK-35504: Attachment: SPARK-35504_second_query_plan.log SPARK-35504_first_query_plan.log > count distinct asterisk > > > Key: SPARK-35504 > URL: https://issues.apache.org/jira/browse/SPARK-35504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: {code:java} > uname -a > Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > {code} > > {code:java} > lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description: Ubuntu 18.04.4 LTS > Release: 18.04 > Codename: bionic > {code} > > {code:java} > /opt/spark/bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 > Branch HEAD > Compiled by user ubuntu on 2020-06-06T13:05:28Z > Revision 3fdfce3120f307147244e5eaf46d61419a723d50 > Url https://gitbox.apache.org/repos/asf/spark.git > Type --help for more information. 
> {code} > {code:java} > lscpu > Architecture:x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s):1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz > Stepping:7 > CPU MHz: 3602.011 > BogoMIPS:6000.01 > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache:1024K > L3 cache:36608K > NUMA node0 CPU(s): 0-3 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm > constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf > tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe > popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm > 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms > invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd > avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke > {code} > >Reporter: Nikolay Sokolov >Priority: Minor > Attachments: SPARK-35504_first_query_plan.log, > SPARK-35504_second_query_plan.log > > > Hi everyone, > I hope you're well! > > Today I came across a very interesting case when the result of the execution > of the algorithm for counting unique rows differs depending on the form > (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. > I still can't figure out on my own if this is a bug or a feature and I would > like to share what I found. > > I run Spark SQL queries through the Thrift (and not only) connecting to the > Spark cluster. I use the DBeaver app to execute Spark SQL queries. > > So, I have two identical Spark SQL queries from an algorithmic point of view > that return different results. 
> > The first query: > {code:sql} > select count(distinct *) unique_amt from storage_datamart.olympiads > ; -- Rows: 13437678 > {code} > > The second query: > {code:sql} > select count(*) from (select distinct * from storage_datamart.olympiads) > ; -- Rows: 36901430 > {code} > > The result of the two queries is different. (But it must be the same, right!?) > {code:sql} > select 'The first query' description, count(distinct *) unique_amt from > storage_datamart.olympiads > union all > select 'The second query', count(*) from (select distinct * from > storage_datamart.olympiads) > ; > {code} > > The result of the above query is the following: > {code:java} > The first query13437678 > The second query 36901430 > {code} > > I can easily calculate the unique number of rows in the table: > {code:sql} > select count(*) from ( > select student_id, olympiad_id, tour, grade > from storage_datamart.olympiads >group by student_id, olympiad_id, tour, grade > having count(*) = 1 > ) > ; -- Rows: 36901365 > {code} > > The table DDL is the following: > {code:sql} > CREATE TABLE `storage_datamart`.`olympiads` ( > `ptn_date` DATE, > `student_id` BIGINT, > `olympiad_id` STRING, > `grade` BIGINT, > `grade_type` STRING, > `tour` STRING, > `created_at` TIMESTAMP, > `created_at_local` TIMESTAMP, > `olympiad_num` BIGINT, > `olympiad_name` STRING, >
[jira] [Updated] (SPARK-35504) count distinct asterisk
[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Sokolov updated SPARK-35504: Description: Hi everyone, I hope you're well! Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. I still can't figure out on my own if this is a bug or a feature and I would like to share what I found. I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries. So, I have two identical Spark SQL queries from an algorithmic point of view that return different results. The first query: {code:sql} select count(distinct *) unique_amt from storage_datamart.olympiads ; -- Rows: 13437678 {code} The second query: {code:sql} select count(*) from (select distinct * from storage_datamart.olympiads) ; -- Rows: 36901430 {code} The result of the two queries is different. (But it must be the same, right!?) 
{code:sql} select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads union all select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads) ; {code} The result of the above query is the following: {code:java} The first query13437678 The second query 36901430 {code} I can easily calculate the unique number of rows in the table: {code:sql} select count(*) from ( select student_id, olympiad_id, tour, grade from storage_datamart.olympiads group by student_id, olympiad_id, tour, grade having count(*) = 1 ) ; -- Rows: 36901365 {code} The table DDL is the following: {code:sql} CREATE TABLE `storage_datamart`.`olympiads` ( `ptn_date` DATE, `student_id` BIGINT, `olympiad_id` STRING, `grade` BIGINT, `grade_type` STRING, `tour` STRING, `created_at` TIMESTAMP, `created_at_local` TIMESTAMP, `olympiad_num` BIGINT, `olympiad_name` STRING, `subject` STRING, `started_at` TIMESTAMP, `ended_at` TIMESTAMP, `region_id` BIGINT, `region_name` STRING, `municipality_name` STRING, `school_id` BIGINT, `school_name` STRING, `school_status` BOOLEAN, `oly_n_common` INT, `num_day` INT, `award_type` STRING, `new_student_legacy` INT, `segment` STRING, `total_start` TIMESTAMP, `total_end` TIMESTAMP, `year_learn` STRING, `parent_id` BIGINT, `teacher_id` BIGINT, `parallel` BIGINT, `olympiad_type` STRING) USING parquet LOCATION 's3a://uchiru-bi-dwh/storage/datamart/olympiads.parquet' ; {code} Could you please tell me why in the first Spark SQL query counting the unique number of rows using the construction `count(distinct *)` does not count correctly and why the result of the two Spark SQL queries is different?? Thanks in advance. p.s. 
I could not find a description of such behaviour of the function `count(distinct *)` in the [official Spark documentation|https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions]: {quote}count(DISTINCT expr[, expr...]) -> Returns the number of rows for which the supplied expression(s) are unique and non-null. {quote} was: Hi everyone, I hope you're well! Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. I still can't figure out on my own if this is a bug or a feature and I would like to share what I found. I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries. So, I have two identical Spark SQL queries from an algorithmic point of view that return different results. The first query: {code:sql} select count(distinct *) unique_amt from storage_datamart.olympiads ; -- Rows: 13437678 {code} The second query: {code:sql} select count(*) from (select distinct * from storage_datamart.olympiads) ; -- Rows: 36901430 {code} The result of the two queries is different. (But it must be the same, right!?) {code:sql} select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads union all select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads) ; {code} The of the above query is the following: {code:java} The first query13437678 The second query 36901430 {code} I can easily calculate the unique number of rows in the table: {code:sql} select count(*) from ( select student_id, olympiad_id, tour, grade from storage_datamart.olympiads group by student_id, olympiad_id, tour, grade having count(*) = 1 ) ; -- Rows: 36901365 {code} The table DDL is the
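The documentation excerpt above suggests one plausible (unconfirmed for this report) explanation for the mismatch: count(DISTINCT expr...) counts only rows whose listed expressions are all non-null, while SELECT DISTINCT * keeps distinct rows even when some columns are NULL. A pure-Python sketch of those two counting rules, using tuples as rows and None as SQL NULL (illustrative only, not Spark's implementation):

```python
# Each row is a tuple; None models SQL NULL.
rows = [
    (1, "a"),
    (1, "a"),   # exact duplicate of the first row
    (2, None),  # row containing a NULL
    (3, "b"),
]

# count(DISTINCT *): per the docs quoted above, rows whose expressions are
# "unique and non-null" -- rows containing any NULL are excluded before counting.
count_distinct_star = len({r for r in rows if all(v is not None for v in r)})

# count(*) over (SELECT DISTINCT *): distinct rows, NULL-containing rows kept.
count_from_distinct_subquery = len(set(rows))

print(count_distinct_star)           # 2
print(count_from_distinct_subquery)  # 3
```

Under this reading, the first query's smaller figure (13437678 vs 36901430) would be expected whenever many rows contain at least one NULL column.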
[jira] [Created] (SPARK-35504) count distinct asterisk
Nikolay Sokolov created SPARK-35504: --- Summary: count distinct asterisk Key: SPARK-35504 URL: https://issues.apache.org/jira/browse/SPARK-35504 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: {code:java} uname -a Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux {code} {code:java} lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description:Ubuntu 18.04.4 LTS Release:18.04 Codename: bionic {code} {code:java} /opt/spark/bin/spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 /_/ Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292 Branch HEAD Compiled by user ubuntu on 2020-06-06T13:05:28Z Revision 3fdfce3120f307147244e5eaf46d61419a723d50 Url https://gitbox.apache.org/repos/asf/spark.git Type --help for more information. {code} {code:java} lscpu Architecture:x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s):1 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz Stepping:7 CPU MHz: 3602.011 BogoMIPS:6000.01 Hypervisor vendor: KVM Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache:1024K L3 cache:36608K NUMA node0 CPU(s): 0-3 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke {code} 
Reporter: Nikolay Sokolov Hi everyone, I hope you're well! Today I came across a very interesting case when the result of the execution of the algorithm for counting unique rows differs depending on the form (count(distinct *) vs count( * ) from derived table) of the Spark SQL queries. I still can't figure out on my own if this is a bug or a feature and I would like to share what I found. I run Spark SQL queries through the Thrift (and not only) connecting to the Spark cluster. I use the DBeaver app to execute Spark SQL queries. So, I have two identical Spark SQL queries from an algorithmic point of view that return different results. The first query: {code:sql} select count(distinct *) unique_amt from storage_datamart.olympiads ; -- Rows: 13437678 {code} The second query: {code:sql} select count(*) from (select distinct * from storage_datamart.olympiads) ; -- Rows: 36901430 {code} The result of the two queries is different. (But it must be the same, right!?) {code:sql} select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads union all select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads) ; {code} The result of the above query is the following: {code:java} The first query 13437678 The second query 36901430 {code} I can easily calculate the unique number of rows in the table: {code:sql} select count(*) from ( select student_id, olympiad_id, tour, grade from storage_datamart.olympiads group by student_id, olympiad_id, tour, grade having count(*) = 1 ) ; -- Rows: 36901365 {code} The table DDL is the following: {code:sql} CREATE TABLE `storage_datamart`.`olympiads` ( `ptn_date` DATE, `student_id` BIGINT, `olympiad_id` STRING, `grade` BIGINT, `grade_type` STRING, `tour` STRING, `created_at` TIMESTAMP, `created_at_local` TIMESTAMP, `olympiad_num` BIGINT, `olympiad_name` STRING, `subject` STRING, `started_at` TIMESTAMP, `ended_at` TIMESTAMP, `region_id` BIGINT, `region_name` STRING, 
`municipality_name` STRING, `school_id` BIGINT, `school_name` STRING, `school_status` BOOLEAN, `oly_n_common` INT, `num_day` INT, `award_type` STRING, `new_student_legacy` INT, `segment` STRING, `total_start` TIMESTAMP, `total_end` TIMESTAMP, `year_learn` STRING, `parent_id` BIGINT, `teacher_id` BIGINT, `parallel` BIGINT, `olympiad_type` STRING) USING parquet LOCATION 's3a://uchiru-bi-dwh/storage/datamart/olympiads.parquet' ; {code} Could you please tell me why in the
[jira] [Commented] (SPARK-35503) Change the facetFilters of Docsearch to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-35503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350526#comment-17350526 ] Apache Spark commented on SPARK-35503: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/32654 > Change the facetFilters of Docsearch to 3.1.2 > - > > Key: SPARK-35503 > URL: https://issues.apache.org/jira/browse/SPARK-35503 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.1.2 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > As we are going to release 3.1.2, the search result of 3.1.2 documentation > should point to the new documentation site instead of 3.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35503) Change the facetFilters of Docsearch to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-35503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35503: Assignee: Apache Spark (was: Gengliang Wang) > Change the facetFilters of Docsearch to 3.1.2 > - > > Key: SPARK-35503 > URL: https://issues.apache.org/jira/browse/SPARK-35503 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.1.2 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > As we are going to release 3.1.2, the search result of 3.1.2 documentation > should point to the new documentation site instead of 3.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35503) Change the facetFilters of Docsearch to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-35503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35503: Assignee: Gengliang Wang (was: Apache Spark) > Change the facetFilters of Docsearch to 3.1.2 > - > > Key: SPARK-35503 > URL: https://issues.apache.org/jira/browse/SPARK-35503 > Project: Spark > Issue Type: Task > Components: Documentation >Affects Versions: 3.1.2 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > As we are going to release 3.1.2, the search result of 3.1.2 documentation > should point to the new documentation site instead of 3.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35503) Change the facetFilters of Docsearch to 3.1.2
Gengliang Wang created SPARK-35503: -- Summary: Change the facetFilters of Docsearch to 3.1.2 Key: SPARK-35503 URL: https://issues.apache.org/jira/browse/SPARK-35503 Project: Spark Issue Type: Task Components: Documentation Affects Versions: 3.1.2 Reporter: Gengliang Wang Assignee: Gengliang Wang As we are going to release 3.1.2, the search result of 3.1.2 documentation should point to the new documentation site instead of 3.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35287) RemoveRedundantProjects removes non-redundant projects
[ https://issues.apache.org/jira/browse/SPARK-35287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35287. - Fix Version/s: 3.1.2 3.2.0 Resolution: Fixed Issue resolved by pull request 32606 [https://github.com/apache/spark/pull/32606] > RemoveRedundantProjects removes non-redundant projects > -- > > Key: SPARK-35287 > URL: https://issues.apache.org/jira/browse/SPARK-35287 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Chungmin >Priority: Major > Fix For: 3.2.0, 3.1.2 > > > RemoveRedundantProjects erroneously removes non-redundant projects which are > required to convert rows coming from DataSourceV2ScanExec to UnsafeRow. There > is a code for this case, but it only looks at the child. The bug occurs when > DataSourceV2ScanExec is not a child of the project, but a descendant. The > method {{isRedundant}} in {{RemoveRedundantProjects}} should be updated to > account for descendants too. > The original scenario requires Iceberg to reproduce the issue. In theory, it > should be able to reproduce the bug with Spark SQL only, and someone more > knowledgeable with Spark SQL should be able to make such a scenario. 
The > following is my reproduction scenario (Spark 3.1.1, Iceberg 0.11.1): > {code:java} > import scala.collection.JavaConverters._ > import org.apache.iceberg.{PartitionSpec, TableProperties} > import org.apache.iceberg.hadoop.HadoopTables > import org.apache.iceberg.spark.SparkSchemaUtil > import org.apache.spark.sql.{DataFrame, QueryTest, SparkSession} > import org.apache.spark.sql.internal.SQLConf > class RemoveRedundantProjectsTest extends QueryTest { > override val spark: SparkSession = SparkSession > .builder() > .master("local[4]") > .config("spark.driver.bindAddress", "127.0.0.1") > .appName(suiteName) > .getOrCreate() > test("RemoveRedundantProjects removes non-redundant projects") { > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", > SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false", > SQLConf.REMOVE_REDUNDANT_PROJECTS_ENABLED.key -> "true") { > withTempDir { dir => > val path = dir.getCanonicalPath > val data = spark.range(3).toDF > val table = new HadoopTables().create( > SparkSchemaUtil.convert(data.schema), > PartitionSpec.unpartitioned(), > Map(TableProperties.WRITE_NEW_DATA_LOCATION -> path).asJava, > path) > data.write.format("iceberg").mode("overwrite").save(path) > table.refresh() > val df = spark.read.format("iceberg").load(path) > val dfX = df.as("x") > val dfY = df.as("y") > val join = dfX.filter(dfX("id") > 0).join(dfY, "id") > join.explain("extended") > assert(join.count() == 2) > } > } > } > } > {code} > Stack trace: > {noformat} > [info] - RemoveRedundantProjects removes non-redundant projects *** FAILED *** > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 1.0 (TID 4) (xeroxms100.northamerica.corp.microsoft.com executor > driver): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > [info] at > 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226) > [info] at > org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at >
[jira] [Assigned] (SPARK-35287) RemoveRedundantProjects removes non-redundant projects
[ https://issues.apache.org/jira/browse/SPARK-35287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35287: --- Assignee: Kousuke Saruta > RemoveRedundantProjects removes non-redundant projects > -- > > Key: SPARK-35287 > URL: https://issues.apache.org/jira/browse/SPARK-35287 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Chungmin >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > RemoveRedundantProjects erroneously removes non-redundant projects which are > required to convert rows coming from DataSourceV2ScanExec to UnsafeRow. There > is a code for this case, but it only looks at the child. The bug occurs when > DataSourceV2ScanExec is not a child of the project, but a descendant. The > method {{isRedundant}} in {{RemoveRedundantProjects}} should be updated to > account for descendants too. > The original scenario requires Iceberg to reproduce the issue. In theory, it > should be able to reproduce the bug with Spark SQL only, and someone more > knowledgeable with Spark SQL should be able to make such a scenario. 
The > following is my reproduction scenario (Spark 3.1.1, Iceberg 0.11.1): > {code:java} > import scala.collection.JavaConverters._ > import org.apache.iceberg.{PartitionSpec, TableProperties} > import org.apache.iceberg.hadoop.HadoopTables > import org.apache.iceberg.spark.SparkSchemaUtil > import org.apache.spark.sql.{DataFrame, QueryTest, SparkSession} > import org.apache.spark.sql.internal.SQLConf > class RemoveRedundantProjectsTest extends QueryTest { > override val spark: SparkSession = SparkSession > .builder() > .master("local[4]") > .config("spark.driver.bindAddress", "127.0.0.1") > .appName(suiteName) > .getOrCreate() > test("RemoveRedundantProjects removes non-redundant projects") { > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", > SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false", > SQLConf.REMOVE_REDUNDANT_PROJECTS_ENABLED.key -> "true") { > withTempDir { dir => > val path = dir.getCanonicalPath > val data = spark.range(3).toDF > val table = new HadoopTables().create( > SparkSchemaUtil.convert(data.schema), > PartitionSpec.unpartitioned(), > Map(TableProperties.WRITE_NEW_DATA_LOCATION -> path).asJava, > path) > data.write.format("iceberg").mode("overwrite").save(path) > table.refresh() > val df = spark.read.format("iceberg").load(path) > val dfX = df.as("x") > val dfY = df.as("y") > val join = dfX.filter(dfX("id") > 0).join(dfY, "id") > join.explain("extended") > assert(join.count() == 2) > } > } > } > } > {code} > Stack trace: > {noformat} > [info] - RemoveRedundantProjects removes non-redundant projects *** FAILED *** > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 1.0 (TID 4) (xeroxms100.northamerica.corp.microsoft.com executor > driver): java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > [info] at > 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226) > [info] at > org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > [info] at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > [info] at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > [info] at
[jira] [Commented] (SPARK-35312) Introduce new Option in Kafka source to specify minimum number of records to read per trigger
[ https://issues.apache.org/jira/browse/SPARK-35312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350505#comment-17350505 ] Apache Spark commented on SPARK-35312: -- User 'satishgopalani' has created a pull request for this issue: https://github.com/apache/spark/pull/32653 > Introduce new Option in Kafka source to specify minimum number of records to > read per trigger > - > > Key: SPARK-35312 > URL: https://issues.apache.org/jira/browse/SPARK-35312 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.1 >Reporter: Satish Gopalani >Priority: Major > > Kafka source currently provides options to set the maximum number of offsets > to read per trigger. > I would like to introduce a new option to specify the minimum number of > offsets to read per trigger, i.e. *minOffsetsPerTrigger*. > This new option will allow skipping a trigger/batch when the number of records > available in Kafka is low. This is a very useful feature in cases where we > have a sudden burst of data at certain intervals in a day and data volume is > low for the rest of the day. Tuning such jobs is difficult: decreasing the > trigger processing time increases the number of batches, and hence cluster > resource usage, and adds to small-file issues, while increasing the trigger > processing time adds consumer lag. This option will save cluster resources and also help solve > small-file issues as it runs fewer batches. > Along with this, I would like to introduce a '*maxTriggerDelay*' option, which > will help avoid an indefinite delay in scheduling a trigger: the trigger will > fire irrespective of the records available once the maxTriggerDelay > time has elapsed since the last trigger. It would be an optional parameter with a > default value of 15 mins. 
_This option will only be applicable if > minOffsetsPerTrigger is set._ > The *minOffsetsPerTrigger* option would be optional of course, but once specified > it would take precedence over *maxOffsetsPerTrigger*, which will be honored > only after *minOffsetsPerTrigger* is satisfied. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
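For illustration, the proposed options would be set like any other Kafka source option. A hypothetical PySpark sketch (the option names `minOffsetsPerTrigger` and `maxTriggerDelay` are the ones proposed in this issue and may not exist in a given Spark release; `spark`, the broker address, and the topic are placeholders; not runnable without a live cluster):

```python
# Sketch only: option names as proposed in this issue, not guaranteed
# to be available in the reader's Spark version.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("minOffsetsPerTrigger", "100000")  # skip the trigger until enough offsets accumulate
    .option("maxOffsetsPerTrigger", "500000")  # upper bound, honored once the minimum is met
    .option("maxTriggerDelay", "15m")          # fire anyway after this long since the last trigger
    .load()
)
```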
[jira] [Commented] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query p
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350458#comment-17350458 ] Takeshi Yamamuro commented on SPARK-35500: -- Which version did you use? v3.1.0 does not exist, so v3.1.1? I tried to run the queries in v3.1.1 to reproduce it, but it couldn't happen; {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.1 /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181) scala> sql("create table test_code_gen(a array)") scala> sql("insert into test_code_gen values (array(1, 1))") scala> sc.setLogLevel("debug") // The first run scala> sql("select * from test_code_gen").collect() ... 21/05/24 23:14:00 DEBUG GenerateSafeProjection: code for createexternalrow(staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(lambdavariable(MapObject, IntegerType, true, -1), lambdavariable(MapObject, IntegerType, true, -1), input[0, array, true], None).array, true, false), StructField(a,ArrayType(IntegerType,true),true)): /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private boolean resultIsNull_0; /* 010 */ private int value_MapObject_lambda_variable_1; /* 011 */ private boolean isNull_MapObject_lambda_variable_1; /* 012 */ private boolean globalIsNull_0; /* 013 */ private java.lang.Object[] mutableStateArray_0 = new java.lang.Object[1]; /* 014 */ ... // The second run scala> sql("select * from test_code_gen").collect() ... 
21/05/24 23:14:28 DEBUG GenerateSafeProjection: code for createexternalrow(staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(lambdavariable(MapObject, IntegerType, true, -1), lambdavariable(MapObject, IntegerType, true, -1), input[0, array, true], None).array, true, false), StructField(a,ArrayType(IntegerType,true),true)): /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private InternalRow mutableRow; /* 009 */ private boolean resultIsNull_0; /* 010 */ private int value_MapObject_lambda_variable_1; /* 011 */ private boolean isNull_MapObject_lambda_variable_1; /* 012 */ private boolean globalIsNull_0; /* 013 */ private java.lang.Object[] mutableStateArray_0 = new java.lang.Object[1]; ... {code} Actually, this issue should be fixed in SPARK-27871([https://github.com/apache/spark/pull/24735]). Or, do I miss something? 
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Every time Dataset.collect is called, a SpecificSafeProjection class is > generated, but the code for the class cannot be reused, because the ids of two > variables in the generated class change every time: MapObjects_loopValue > and MapObjects_loopIsNull. So even though the previously generated class has been > cached, the new code cannot match the cache key and needs to be > compiled again, which costs some time. The compilation cost increases > with the number of columns; for a wide table, this cost can be more than 2s. > {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { >
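The cache-miss mechanism described in this report can be modeled in a few lines: if a freshly incremented id is baked into the generated source, the source text (acting as the cache key) never repeats, so every generation compiles again. A toy Python model (a hypothetical simplification; the real Spark cache keys generated Java source and compiles it with Janino):

```python
import itertools

# Toy model of a codegen cache keyed on the generated source text.
_fresh_id = itertools.count()
compile_cache = {}

def generate(reuse_friendly: bool) -> str:
    # With reuse_friendly=False, a fresh id is baked into the source on every
    # call -- mirroring MapObjects_loopValue$id from the report above.
    n = 0 if reuse_friendly else next(_fresh_id)
    source = f"MapObjects_loopValue{n} = ...; MapObjects_loopIsNull{n} = ..."
    if source not in compile_cache:
        compile_cache[source] = f"compiled({source})"  # stands in for Janino compilation
    return source

generate(False)
generate(False)
misses_with_fresh_ids = len(compile_cache)   # 2: every call is a cache miss

compile_cache.clear()
generate(True)
generate(True)
hits_with_stable_ids = len(compile_cache)    # 1: second call reuses the cached entry
```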
[jira] [Created] (SPARK-35502) Spark Executor metrics are not produced/showed
Mati created SPARK-35502:
--
Summary: Spark Executor metrics are not produced/showed
Key: SPARK-35502
URL: https://issues.apache.org/jira/browse/SPARK-35502
Project: Spark
Issue Type: Bug
Components: Spark Submit
Affects Versions: 3.0.1
Reporter: Mati
Recently we enabled the prometheusServlet configuration in order to collect Spark master, worker, driver, and executor metrics. We can see the master, worker, and driver metrics, but not the executor metrics. We are running a Spark Streaming standalone cluster, version 3.0.1, on physical servers. We took one of our jobs and added the following parameters to its configuration, but could not see executor metrics when curling both the driver and the executor workers of this job.
These are the parameters:
--conf spark.ui.prometheus.enabled=true \
--conf spark.executor.processTreeMetrics.enabled=true
Curl commands:
[00764f](root@sparktest-40005-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[00764f](root@sparktest-40005-prod-chidc2:~)#
Driver of this job - sparktest-40004:
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[e35005](root@sparktest-40004-prod-chidc2:~)#
Our UI port is 4050.
I understand that the executor Prometheus endpoint is still experimental, which may explain the inconsistent behaviour we see, but is there a plan to fix it? Are there any known issues regarding this?
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
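For reference, the setup described in the report can be sketched as a single submission plus check. The two `--conf` flags and the UI port 4050 come from the report itself; the application class and jar names are placeholders, not part of the original:

```shell
# Sketch (assumptions: placeholder job class/jar; UI port 4050 as in the report).
# Enables the experimental executor Prometheus endpoint on the Spark UI.
spark-submit \
  --conf spark.ui.prometheus.enabled=true \
  --conf spark.executor.processTreeMetrics.enabled=true \
  --class com.example.StreamingJob \
  streaming-job.jar

# Then, on the driver host, the endpoint should serve executor metrics:
curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
```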
[jira] [Commented] (SPARK-35501) Add a feature for removing pulled container image for docker integration tests
[ https://issues.apache.org/jira/browse/SPARK-35501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350419#comment-17350419 ] Apache Spark commented on SPARK-35501:
--
User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32652
> Add a feature for removing pulled container image for docker integration tests
> --
>
> Key: SPARK-35501
> URL: https://issues.apache.org/jira/browse/SPARK-35501
> Project: Spark
> Issue Type: Improvement
> Components: SQL, Tests
> Affects Versions: 3.2.0
> Reporter: Kousuke Saruta
> Assignee: Kousuke Saruta
> Priority: Minor
>
> Docker integration tests pull the corresponding container images before a test if there is no container image in the local repository, but the pulled image remains even after the test finishes.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35501) Add a feature for removing pulled container image for docker integration tests
[ https://issues.apache.org/jira/browse/SPARK-35501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35501: Assignee: Apache Spark (was: Kousuke Saruta) > Add a feature for removing pulled container image for docker integration tests > -- > > Key: SPARK-35501 > URL: https://issues.apache.org/jira/browse/SPARK-35501 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > Docker integration tests pull the corresponding container images before test > if there is no container image in the local repository. > But the pulled image remains even after test finished. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35501) Add a feature for removing pulled container image for docker integration tests
[ https://issues.apache.org/jira/browse/SPARK-35501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35501: Assignee: Kousuke Saruta (was: Apache Spark) > Add a feature for removing pulled container image for docker integration tests > -- > > Key: SPARK-35501 > URL: https://issues.apache.org/jira/browse/SPARK-35501 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > Docker integration tests pull the corresponding container images before test > if there is no container image in the local repository. > But the pulled image remains even after test finished. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impacts the query performance
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500:
--
Description:
Reproduce steps:
# Create a new table with an array column: create table test_code_gen(a array);
# Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties;
# Enter spark-shell and fire a query: spark.sql("select * from test_code_gen").collect
# Every time Dataset.collect is called, a SpecificSafeProjection class is generated, but the generated code cannot be reused because the id embedded in two variable names changes each time: MapObjects_loopValue and MapObjects_loopIsNull. So even though a previously generated class has been cached, the new code does not match the cache key and has to be compiled again, which costs time. The compile cost grows with the number of columns; for a wide table it can be more than 2s.
{code:java}
object MapObjects {
  private val curId = new java.util.concurrent.atomic.AtomicInteger()
  val id = curId.getAndIncrement()
  val loopValue = s"MapObjects_loopValue$id"
  val loopIsNull = if (elementNullable) {
    s"MapObjects_loopIsNull$id"
  } else {
    "false"
  }
{code}
First time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue1;
  private boolean MapObjects_loopIsNull1;
  private UTF8String MapObjects_loopValue2;
  private boolean MapObjects_loopIsNull2;
}
{code}
Second time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue3;
  private boolean MapObjects_loopIsNull3;
  private UTF8String MapObjects_loopValue4;
  private boolean MapObjects_loopIsNull4;
}
{code}
Expectation: the code generated by GenerateSafeProjection can be reused if the query is the same.
was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; }{code} Expectation: MapObjects_loopValue and MapObjects_loopIsNull > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: 
https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the
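The cache-miss mechanism described in the report can be illustrated outside Spark. This is a minimal sketch, not Spark's actual CodegenContext or compilation-cache code: a compile cache keyed on the generated source text can never hit when each generation embeds a fresh counter value in its variable names, which is exactly what the `MapObjects_loopValue$id` naming does.

```python
import itertools

# Stand-in for MapObjects' AtomicInteger: every generation draws a new id.
_cur_id = itertools.count()


def generate_source() -> str:
    """Mimics codegen emitting globally unique variable names."""
    i = next(_cur_id)
    return (
        "class SpecificSafeProjection {"
        f" private int MapObjects_loopValue{i};"
        f" private boolean MapObjects_loopIsNull{i}; }}"
    )


_compile_cache: dict[str, object] = {}


def compile_cached(source: str) -> bool:
    """Returns True on a cache hit; on a miss, 'compiles' (stores a dummy)."""
    if source in _compile_cache:
        return True
    _compile_cache[source] = object()  # stand-in for the compiled class
    return False


# Same logical query twice, but the embedded id makes every source string
# unique, so both lookups miss and compilation runs again each time.
assert compile_cached(generate_source()) is False
assert compile_cached(generate_source()) is False

# With stable names (the fix direction of SPARK-27871: per-class fresh names
# instead of a global counter), the second compilation would hit the cache.
stable = "class SpecificSafeProjection { private int MapObjects_loopValue; }"
assert compile_cached(stable) is False
assert compile_cached(stable) is True
```

The design point: the cache key is the full source text, so any nondeterminism in name generation silently disables reuse even though the classes are behaviorally identical.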
[jira] [Created] (SPARK-35501) Add a feature for removing pulled container image for docker integration tests
Kousuke Saruta created SPARK-35501:
--
Summary: Add a feature for removing pulled container image for docker integration tests
Key: SPARK-35501
URL: https://issues.apache.org/jira/browse/SPARK-35501
Project: Spark
Issue Type: Improvement
Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Docker integration tests pull the corresponding container images before a test if there is no container image in the local repository, but the pulled image remains even after the test finishes.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350387#comment-17350387 ] Apache Spark commented on SPARK-35449: -- User 'Kimahriman' has created a pull request for this issue: https://github.com/apache/spark/pull/32651 > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Adam Binford >Priority: Major > Fix For: 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
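The hazard described above can be shown with a plain-Python analogue. This is a sketch, not Spark's evaluation code: `udf` is a hypothetical stand-in for the user-defined function in the report, and the point is that hoisting a "common" subexpression out of conditional branches runs it even when no branch's condition would have.

```python
def udf(x: int) -> int:
    # Stand-in for a UDF that is only safe to call when x < 0,
    # mirroring when($"id" < 0, myUdf($"id")) from the report.
    if x >= 0:
        raise ValueError("udf called on a non-negative id")
    return -x


def case_when_lazy(x: int):
    # Correct semantics: udf runs only inside the branch whose condition holds.
    col = udf(x) if x < 0 else None          # when(id < 0, udf(id))
    return col if col is not None and col > 0 else None  # when(col > 0, col)


def case_when_hoisted(x: int):
    # Buggy subexpression elimination: the shared expression is evaluated
    # once, unconditionally, before any condition is checked.
    col = udf(x)  # runs even when x < 0 is False
    return col if x < 0 and col > 0 else None


assert case_when_lazy(0) is None   # fine: udf is never invoked
assert case_when_lazy(-3) == 3     # branch taken: udf runs legitimately
try:
    case_when_hoisted(0)           # udf runs although no branch needed it
    raised = False
except ValueError:
    raised = True
assert raised
```

This is why a value expression must also appear in the `elseValue` before it truly occurs in "all branches" of a CaseWhen and is safe to extract.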
[jira] [Commented] (SPARK-35279) _SUCCESS file not written when using partitionOverwriteMode=dynamic
[ https://issues.apache.org/jira/browse/SPARK-35279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350373#comment-17350373 ]

Steve Loughran commented on SPARK-35279:

Any reason for not using the S3A committers here? i.e. the ones which are safe to use on S3?

> _SUCCESS file not written when using partitionOverwriteMode=dynamic
> ---
>
> Key: SPARK-35279
> URL: https://issues.apache.org/jira/browse/SPARK-35279
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.1
> Reporter: Goran Kljajic
> Priority: Minor
>
> Steps to reproduce:
>
> {code:java}
> case class A(a: String, b: String)
> def df = List(A("a", "b")).toDF
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> val writer = df.write.mode(SaveMode.Overwrite).partitionBy("a")
> writer.parquet("s3a://some_bucket/test/")
> {code}
> When spark.sql.sources.partitionOverwriteMode is set to dynamic, the output written doesn't have the _SUCCESS file updated.
> (I have checked different versions of hadoop from 3.1.4 to 3.22 and they all behave the same, so the issue is with spark)
> This is working in spark 3.0.2
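For context, the _SUCCESS marker the report refers to is a file that Hadoop's FileOutputCommitter writes at the job root on successful commit (typically zero bytes; the S3A committers write a small JSON summary into it), and downstream consumers often gate on it. A minimal local-filesystem sketch of that contract, with java.nio.file standing in for the object store and all names illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SuccessMarker {
    // Write output files, then mark the job root with an empty _SUCCESS file,
    // mirroring what FileOutputCommitter does on successful job commit.
    static void commitJob(Path jobRoot) throws IOException {
        Files.createDirectories(jobRoot.resolve("a=1"));
        Files.write(jobRoot.resolve("a=1").resolve("part-00000.parquet"), new byte[]{1, 2, 3});
        Files.createFile(jobRoot.resolve("_SUCCESS")); // zero-byte marker
    }

    // Downstream consumers should gate on the marker, not on file listings.
    static boolean jobSucceeded(Path jobRoot) {
        return Files.exists(jobRoot.resolve("_SUCCESS"));
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("test-output");
        System.out.println(jobSucceeded(root)); // false before commit
        commitJob(root);
        System.out.println(jobSucceeded(root)); // true after commit
    }
}
```

The bug report is that with partitionOverwriteMode=dynamic the write path skips the marker step, so consumers gating on _SUCCESS never see the commit.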
[jira] [Commented] (SPARK-35299) Dataframe overwrite on S3 does not delete old files with S3 object-put to table path
[ https://issues.apache.org/jira/browse/SPARK-35299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350372#comment-17350372 ]

Steve Loughran commented on SPARK-35299:

Use a version of Spark with the Hadoop 3.1+ JARs and the S3A committer.

> Dataframe overwrite on S3 does not delete old files with S3 object-put to table path
> ---
>
> Key: SPARK-35299
> URL: https://issues.apache.org/jira/browse/SPARK-35299
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Yusheng Ding
> Priority: Major
> Labels: aws-s3, dataframe, hive, spark
>
> To reproduce:
> test_table path: s3a://test_bucket/test_table/
>
> df = spark_session.sql("SELECT * FROM test_table")
> df.count() # produces row count 1000
> ## S3 operation ##
> s3 = boto3.client("s3")
> s3.put_object(
> Bucket="test_bucket", Body="", Key=f"test_table/"
> )
> ## S3 operation ##
> df.write.insertInto(test_table, overwrite=True)
> # Same goes for df.write.save(mode="overwrite", format="parquet", path="s3a://test_bucket/test_table")
> df = spark_session.sql("SELECT * FROM test_table")
> df.count() # produces row count 2000
>
> Overwrite is not functioning correctly. Old files will not be deleted on S3.
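The doubled count (1000 to 2000 rows) is what you see when an "overwrite" only adds new files and leaves the old generation in place, so readers list both. A minimal sketch of the expected overwrite contract, using the local filesystem as a stand-in for S3 (a toy model, not Spark's or S3A's actual commit logic; names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OverwriteDemo {
    static int part = 0;

    // Overwrite semantics: clear the target before writing, so readers never
    // see a mix of old and new part files (the double count in the report).
    static void overwrite(Path dir, int rows) throws IOException {
        Files.createDirectories(dir);
        List<Path> old;
        try (Stream<Path> s = Files.list(dir)) {
            old = s.collect(Collectors.toList());
        }
        for (Path p : old) Files.delete(p); // remove the previous generation
        Files.write(dir.resolve("part-" + (part++)), new byte[rows]);
    }

    static long fileCount(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tbl");
        overwrite(dir, 10);
        overwrite(dir, 10);
        System.out.println(fileCount(dir)); // 1: the old part file was removed
    }
}
```

In the reported failure, the manually PUT zero-byte object under the table path confuses the overwrite's listing/deletion step, so the old files survive and counts double.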
[jira] [Commented] (SPARK-35406) TaskCompletionListenerException: Premature end of Content-Length delimited message body
[ https://issues.apache.org/jira/browse/SPARK-35406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350371#comment-17350371 ]

Steve Loughran commented on SPARK-35406:

This is actually triggered by garbage collection, not an AWS SDK issue. The S3A input stream keeps a reference to the stream returned by the S3Object, but somehow the outer objects can get GC'd and everything breaks. The fix is: keep a reference to that outer object.

Updating the AWS SDK is unlikely to have any direct effect; it and the required matching hadoop-* JARs may change the GC profile enough to hide it. You should be off Hadoop 2.7 for many reasons, especially when working with object stores, and likewise for Java 11: it is not supported there, so all problems will be closed as "wontfix". As well as more efficient random-access IO over HTTP connections, Hadoop 3.1+ ships with the S3A committers, which are actually designed to work with S3. The committer you are using relies on rename() being fast and, for a directory, atomic; this doesn't hold with S3. At least now that S3 is consistent, rename() will discover the correct set of files to copy.

> TaskCompletionListenerException: Premature end of Content-Length delimited message body
> ---
>
> Key: SPARK-35406
> URL: https://issues.apache.org/jira/browse/SPARK-35406
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.7
> Environment: Spark 2.4.7
> Built with Hadoop 2.7.3
> hadoop-aws jar 2.7.3
> aws-java-sdk 1.7.4
> EKS 1.18
> Reporter: Avik Aggarwal
> Priority: Major
> Labels: Kubernetes
>
> Running Spark on Kubernetes (EKS 1.18) with the below versions of the different components:
> Spark 2.4.7
> Built with Hadoop 2.7.3
> hadoop-aws jar 2.7.3
> aws-java-sdk 1.7.4
>
> I am using the s3a endpoint for reading S3 objects from a private repository, and the appropriate role has been given to the executors.
>
> I am facing the below error while reading/writing bigger files from/to S3. File sizes for which I am facing the issue: low tens of MB (10-30).
> While it works for files with KBs of size. > > Logs : - > {code:java} > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 6, 10.83.7.112, executor 2): > org.apache.spark.util.TaskCompletionListenerException: Premature end of > Content-Length delimited message body (expected: 3918825; received: 18020 > at > org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116) > at org.apache.spark.scheduler.Task.run(Task.scala:139) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1925) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1913) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1912) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1912) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:948) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:948) > at scala.Option.foreach(Option.scala:257) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:948) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2146) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2095) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2084) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:759) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) > at
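The fix described in the comment above ("keep a reference to that outer object") can be sketched as a stream wrapper that pins the object owning the underlying connection. The class and field names below are illustrative, not the actual S3A or AWS SDK types:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

// If only the inner InputStream is kept reachable, the response object that
// owns the HTTP connection can be garbage collected mid-read, and its cleanup
// releases the connection: the reader then sees a premature end of the
// Content-Length delimited body. Pinning the owner in the wrapper prevents it.
public class PinnedStream extends FilterInputStream {
    // Strong reference whose only job is to keep the owner reachable for as
    // long as the stream itself is reachable.
    private final Object owner;

    public PinnedStream(Object owner, InputStream wrapped) {
        super(wrapped);
        this.owner = owner;
    }

    public static void main(String[] args) throws Exception {
        Object response = new Object(); // stand-in for the SDK response object
        InputStream in = new PinnedStream(response, new ByteArrayInputStream(new byte[]{42}));
        System.out.println(in.read()); // 42; the owner stays reachable throughout
    }
}
```

Reads delegate unchanged to the wrapped stream via FilterInputStream; the wrapper changes only the reachability graph, which is the whole point of the fix.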
[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
[ https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350337#comment-17350337 ] Ji Chen commented on SPARK-26385: - [~zjiash] Has this problem been solved please? > YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in > cache > --- > > Key: SPARK-26385 > URL: https://issues.apache.org/jira/browse/SPARK-26385 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Hadoop 2.6.0, Spark 2.4.0 >Reporter: T M >Priority: Major > > > Hello, > > I have Spark Structured Streaming job which is runnning on YARN(Hadoop 2.6.0, > Spark 2.4.0). After 25-26 hours, my job stops working with following error: > {code:java} > 2018-12-16 22:35:17 ERROR > org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query > TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = > a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, > realUser=, issueDate=1544903057122, maxDate=1545507857122, > sequenceNumber=10314, masterKeyId=344) can't be found in cache at > org.apache.hadoop.ipc.Client.call(Client.java:1470) at > org.apache.hadoop.ipc.Client.call(Client.java:1401) at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752) > at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at > org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at > org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at > org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at > org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at > org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at > org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at > org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166) > at >
[jira] [Commented] (SPARK-32194) Standardize exceptions in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350307#comment-17350307 ]

Apache Spark commented on SPARK-32194:
--

User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32650

> Standardize exceptions in PySpark
> ---
>
> Key: SPARK-32194
> URL: https://issues.apache.org/jira/browse/SPARK-32194
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> Currently, PySpark throws {{Exception}} or just {{RuntimeException}} in many cases. We should standardize them.
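What "standardizing" buys is sketched below: a common base type carrying a stable error identifier, so callers can catch specific types and branch on the error class instead of parsing message strings. The actual work is in PySpark; the class names and error-class scheme here are illustrative Java, not PySpark's real hierarchy:

```java
public class ExceptionDemo {
    // Base type replacing bare RuntimeException: every error carries a
    // machine-readable error class alongside the human-readable message.
    static class SparkClientException extends RuntimeException {
        final String errorClass;
        SparkClientException(String errorClass, String message) {
            super("[" + errorClass + "] " + message);
            this.errorClass = errorClass;
        }
    }

    // A specific, catchable subtype built on the same scheme.
    static class ColumnNotFoundException extends SparkClientException {
        ColumnNotFoundException(String column) {
            super("COLUMN_NOT_FOUND", "Column '" + column + "' does not exist.");
        }
    }

    public static void main(String[] args) {
        try {
            throw new ColumnNotFoundException("id");
        } catch (SparkClientException e) {
            // Callers branch on the stable identifier, not the message text.
            System.out.println(e.errorClass);
        }
    }
}
```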
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yahui Liu updated SPARK-35500:
--
Description:

Reproduce steps:
# Create a new table with an array-typed column: create table test_code_gen(a array);
# Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties;
# Enter spark-shell and fire a query: spark.sql("select * from test_code_gen").collect
# Every time Dataset.collect is called, a SpecificSafeProjection class is generated, but its code cannot be reused, because the ids of two variables in the generated class (MapObjects_loopValue and MapObjects_loopIsNull) change on every generation. So even though the previously generated class has been cached, the new code cannot match the cache key, and it has to be compiled again, which costs time. The compilation cost grows with the number of columns; for a wide table it can exceed 2s.

{code:java}
object MapObjects {
  private val curId = new java.util.concurrent.atomic.AtomicInteger()
  val id = curId.getAndIncrement()
  val loopValue = s"MapObjects_loopValue$id"
  val loopIsNull = if (elementNullable) {
    s"MapObjects_loopIsNull$id"
  } else {
    "false"
  }
{code}

First time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue1;
  private boolean MapObjects_loopIsNull1;
  private UTF8String MapObjects_loopValue2;
  private boolean MapObjects_loopIsNull2;
}
{code}

Second time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue3;
  private boolean MapObjects_loopIsNull3;
  private UTF8String MapObjects_loopValue4;
  private boolean MapObjects_loopIsNull4;
}
{code}

was:

Reproduce steps:
# create a new table with array type: create table test_code_gen(a array);
# Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties;
# Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect
# Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time.

{code:java}
object MapObjects {
  private val curId = new java.util.concurrent.atomic.AtomicInteger()
  val id = curId.getAndIncrement()
  val loopValue = s"MapObjects_loopValue$id"
  val loopIsNull = if (elementNullable) {
    s"MapObjects_loopIsNull$id"
  } else {
    "false"
  }
{code}

First time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue1;
  private boolean MapObjects_loopIsNull1;
  private UTF8String MapObjects_loopValue2;
  private boolean MapObjects_loopIsNull2;
}
{code}

Second time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue3;
  private boolean MapObjects_loopIsNull3;
  private UTF8String MapObjects_loopValue4;
  private boolean MapObjects_loopIsNull4;
}
{code}
# The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s.
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new code cannot
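The caching failure described above can be reproduced outside Spark with a toy model: when the generated source text is the cache key, a globally incrementing counter embedded in variable names guarantees a miss on every generation, while normalizing the ids before keying restores hits. This is the general shape of a fix, not Spark's actual CodeGenerator cache; all names below are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class CodegenCacheDemo {
    static final AtomicInteger curId = new AtomicInteger();
    static final Map<String, Integer> cache = new HashMap<>(); // source -> "compiled class"
    static int compilations = 0;

    // Fresh id per generation, mirroring MapObjects' curId.getAndIncrement().
    static String generate() {
        int id = curId.getAndIncrement();
        return "class P { int MapObjects_loopValue" + id + "; }";
    }

    // Rewrite counter-based ids to a canonical form before keying the cache.
    static String normalize(String source) {
        return source.replaceAll("MapObjects_loopValue\\d+", "MapObjects_loopValue");
    }

    // "Compile" only on a cache miss, counting how often that happens.
    static void compile(String key) {
        cache.computeIfAbsent(key, k -> ++compilations);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) compile(generate());
        System.out.println(compilations);  // 3: every generation is a cache miss
        compilations = 0;
        cache.clear();
        for (int i = 0; i < 3; i++) compile(normalize(generate()));
        System.out.println(compilations);  // 1: the normalized key hits the cache
    }
}
```

Any change that makes the generated source identical across calls (canonical ids, or keying the cache on a normalized form) turns the repeated >2s compilations into cache hits.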
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
# > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: # # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
# First time run: class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } Second time run: class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } > GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query per
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
# was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue1; private boolean MapObjects_loopIsNull1; private UTF8String MapObjects_loopValue2; private boolean MapObjects_loopIsNull2; } {code} Second time run: {code:java} class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { private int MapObjects_loopValue3; private boolean MapObjects_loopIsNull3; private UTF8String MapObjects_loopValue4; private boolean MapObjects_loopIsNull4; } {code} # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. {code:java} object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() val id = curId.getAndIncrement() val loopValue = s"MapObjects_loopValue$id" val loopIsNull = if (elementNullable) { s"MapObjects_loopIsNull$id" } else { "false" } {code} First time run: # # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
# First time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue1;
  private boolean MapObjects_loopIsNull1;
  private UTF8String MapObjects_loopValue2;
  private boolean MapObjects_loopIsNull2;
}
{code}
Second time run:
{code:java}
class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
  private int MapObjects_loopValue3;
  private boolean MapObjects_loopIsNull3;
  private UTF8String MapObjects_loopValue4;
  private boolean MapObjects_loopIsNull4;
}
{code}
was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. 
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new code cannot match the cache key so that new code need to be > compiled again which cost some time. > {code:java} > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > val id = curId.getAndIncrement() > val loopValue = s"MapObjects_loopValue$id" > val loopIsNull = if (elementNullable) { > s"MapObjects_loopIsNull$id" > } else { > "false" > } > {code} > First
[jira] [Assigned] (SPARK-35497) Enable plotly tests in pandas-on-Spark
[ https://issues.apache.org/jira/browse/SPARK-35497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35497: Assignee: Apache Spark > Enable plotly tests in pandas-on-Spark > -- > > Key: SPARK-35497 > URL: https://issues.apache.org/jira/browse/SPARK-35497 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Currently, GitHub Actions doesn't have plotly installed and results in plotly > related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35497) Enable plotly tests in pandas-on-Spark
[ https://issues.apache.org/jira/browse/SPARK-35497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35497: Assignee: (was: Apache Spark) > Enable plotly tests in pandas-on-Spark > -- > > Key: SPARK-35497 > URL: https://issues.apache.org/jira/browse/SPARK-35497 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, GitHub Actions doesn't have plotly installed and results in plotly > related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35497) Enable plotly tests in pandas-on-Spark
[ https://issues.apache.org/jira/browse/SPARK-35497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350298#comment-17350298 ] Apache Spark commented on SPARK-35497: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32649 > Enable plotly tests in pandas-on-Spark > -- > > Key: SPARK-35497 > URL: https://issues.apache.org/jira/browse/SPARK-35497 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, GitHub Actions doesn't have plotly installed and results in plotly > related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance
[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yahui Liu updated SPARK-35500: -- Description: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. object MapObjects { private val curId = new java.util.concurrent.atomic.AtomicInteger() # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. was: Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. !image-2021-05-24-16-15-18-359.png!!image-2021-05-24-16-05-34-334.png! # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. !image-2021-05-24-16-11-20-841.png! 
> GenerateSafeProjection.generate will generate SpecificSafeProjection class, > but if column is array type or map type, the code cannot be reused which > impact the query performance > - > > Key: SPARK-35500 > URL: https://issues.apache.org/jira/browse/SPARK-35500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yahui Liu >Priority: Minor > Labels: codegen > > Reproduce steps: > # create a new table with array type: create table test_code_gen(a > array); > # Add > log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator > = DEBUG to log4j.properties; > # Enter spark-shell, fire a query: spark.sql("select * from > test_code_gen").collect > # Everytime, Dataset.collect is called, SpecificSafeProjection class is > generated, but the code for the class cannot be reused because everytime the > id for two variables in the generated class is changed: MapObjects_loopValue > and MapObjects_loopIsNull. So even the class generated before has been > cached, new code cannot match the cache key so that new code need to be > compiled again which cost some time. > object MapObjects { > private val curId = new java.util.concurrent.atomic.AtomicInteger() > # The time cost for compile is increasing with the growth of column number, > for wide table, this cost can more than 2s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350297#comment-17350297 ] L. C. Hsieh commented on SPARK-35449: - This issue was fixed at https://github.com/apache/spark/pull/32595. > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Adam Binford >Priority: Major > Fix For: 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-35449. - Fix Version/s: 3.2.0 Resolution: Fixed > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35449: --- Assignee: Adam Binford > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Adam Binford >Priority: Major > Fix For: 3.2.0 > > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
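The hazard described in SPARK-35449 can be reproduced without Spark. Below is a toy Python model of the two evaluation strategies, not Spark's actual subexpression-elimination code: hoisting a branch-only expression out of a conditional evaluates it even when its guarding condition is false, which is exactly what breaks a UDF that assumes its guard holds.

```python
# Toy model: a "UDF" that is only safe to call when its guard (x < 0) holds.
def my_udf(x):
    if x >= 0:
        raise ValueError("udf assumes id < 0")
    return -x

def case_when_lazy(x):
    # Correct CASE WHEN semantics: the branch value is computed only
    # when its condition is true.
    return my_udf(x) if x < 0 else None

def case_when_hoisted(x):
    # Buggy behaviour: the "common subexpression" is evaluated up front,
    # unconditionally, before the condition is checked.
    sub = my_udf(x)
    return sub if x < 0 else None

assert case_when_lazy(0) is None      # guard false: UDF never runs
assert case_when_lazy(-2) == 2        # guard true: UDF runs safely

try:
    case_when_hoisted(0)
    hoisted_ok = True
except ValueError:
    hoisted_ok = False
assert not hoisted_ok                 # eager evaluation trips the UDF
```

This mirrors the report's example, where `myUdf($"id")` was extracted as a subexpression and executed even though neither condition passed.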
[jira] [Created] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance
Yahui Liu created SPARK-35500: - Summary: GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance Key: SPARK-35500 URL: https://issues.apache.org/jira/browse/SPARK-35500 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yahui Liu Reproduce steps: # create a new table with array type: create table test_code_gen(a array); # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties; # Enter spark-shell, fire a query: spark.sql("select * from test_code_gen").collect # Everytime, Dataset.collect is called, SpecificSafeProjection class is generated, but the code for the class cannot be reused because everytime the id for two variables in the generated class is changed: MapObjects_loopValue and MapObjects_loopIsNull. So even the class generated before has been cached, new code cannot match the cache key so that new code need to be compiled again which cost some time. !image-2021-05-24-16-15-18-359.png!!image-2021-05-24-16-05-34-334.png! # The time cost for compile is increasing with the growth of column number, for wide table, this cost can more than 2s. !image-2021-05-24-16-11-20-841.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks
[ https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350290#comment-17350290 ] Myeongju Kim commented on SPARK-34927: -- Thank you for your response! I'll take a look at it and update the Jira accordingly > Support TPCDSQueryBenchmark in Benchmarks > - > > Key: SPARK-34927 > URL: https://issues.apache.org/jira/browse/SPARK-34927 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Benchmarks.scala currently does not support TPCDSQueryBenchmark. We should > make it supported. See also > https://github.com/apache/spark/pull/32015#issuecomment-89046 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35499) Apply black to pandas API on Spark codes.
[ https://issues.apache.org/jira/browse/SPARK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-35499: Description: To make static analysis easier and more efficient, we should apply `black` to the pandas API on Spark. The Koalas project uses black in its [reformatting script|https://github.com/databricks/koalas/blob/master/dev/reformat]. was: Make it easier and more efficient to static analysis, we'd better to apply `black` to the pandas API on Spark. Koalas project is using black for [reformatting script|[https://github.com/databricks/koalas/blob/master/dev/reformat].] > Apply black to pandas API on Spark codes. > - > > Key: SPARK-35499 > URL: https://issues.apache.org/jira/browse/SPARK-35499 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > To make static analysis easier and more efficient, we should apply > `black` to the pandas API on Spark. > The Koalas project uses black in its [reformatting > script|https://github.com/databricks/koalas/blob/master/dev/reformat]. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35499) Apply black to pandas API on Spark codes.
Haejoon Lee created SPARK-35499: --- Summary: Apply black to pandas API on Spark codes. Key: SPARK-35499 URL: https://issues.apache.org/jira/browse/SPARK-35499 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Haejoon Lee Make it easier and more efficient to static analysis, we'd better to apply `black` to the pandas API on Spark. Koalas project is using black for [reformatting script|[https://github.com/databricks/koalas/blob/master/dev/reformat].] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35449) Should not extract common expressions from value expressions when elseValue is empty in CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-35449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35449: Affects Version/s: 3.1.1 > Should not extract common expressions from value expressions when elseValue > is empty in CaseWhen > > > Key: SPARK-35449 > URL: https://issues.apache.org/jira/browse/SPARK-35449 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > [https://github.com/apache/spark/pull/30245] added support for creating > subexpressions that are present in all branches of conditional statements. > However, for a statement to be in "all branches" of a CaseWhen statement, it > must also be in the elseValue. This can lead to a subexpression to be created > and run for branches of a conditional that don't pass. This can cause issues > especially with a UDF in a branch that gets executed assuming the condition > is true. For example: > {code:java} > val col = when($"id" < 0, myUdf($"id")) > spark.range(1).select(when(col > 0, col)).show() > {code} > myUdf($"id") gets extracted as a subexpression and executed even though both > conditions don't pass and it should never be executed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350273#comment-17350273 ] Apache Spark commented on SPARK-35498: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/32644 > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > In pyspark, user may create some threads, not via `Thread` object, but via > some parallel helper function such as: > `thread_pool.imap_unordered` > In this case, we need create a function wrapper, used in pin thread mode, The > wrapper function, before calling original thread target, it inherits the > inheritable properties specific to JVM thread such as > ``InheritableThreadLocal``, and after original thread target return, > garbage-collects the Python thread instance and also closes the connection > which finishes JVM thread correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350272#comment-17350272 ] Apache Spark commented on SPARK-35498: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/32644 > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > In pyspark, user may create some threads, not via `Thread` object, but via > some parallel helper function such as: > `thread_pool.imap_unordered` > In this case, we need create a function wrapper, used in pin thread mode, The > wrapper function, before calling original thread target, it inherits the > inheritable properties specific to JVM thread such as > ``InheritableThreadLocal``, and after original thread target return, > garbage-collects the Python thread instance and also closes the connection > which finishes JVM thread correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35498: Assignee: Apache Spark > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > > In pyspark, user may create some threads, not via `Thread` object, but via > some parallel helper function such as: > `thread_pool.imap_unordered` > In this case, we need create a function wrapper, used in pin thread mode, The > wrapper function, before calling original thread target, it inherits the > inheritable properties specific to JVM thread such as > ``InheritableThreadLocal``, and after original thread target return, > garbage-collects the Python thread instance and also closes the connection > which finishes JVM thread correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35498: Assignee: (was: Apache Spark) > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > In pyspark, user may create some threads, not via `Thread` object, but via > some parallel helper function such as: > `thread_pool.imap_unordered` > In this case, we need create a function wrapper, used in pin thread mode, The > wrapper function, before calling original thread target, it inherits the > inheritable properties specific to JVM thread such as > ``InheritableThreadLocal``, and after original thread target return, > garbage-collects the Python thread instance and also closes the connection > which finishes JVM thread correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-35498: --- Description: In PySpark, a user may create threads not via a `Thread` object but via a parallel helper function such as `thread_pool.imap_unordered`. In this case, we need to create a function wrapper for pin thread mode. Before calling the original thread target, the wrapper inherits the inheritable properties specific to the JVM thread, such as ``InheritableThreadLocal``; after the original thread target returns, it garbage-collects the Python thread instance and also closes the connection, which finishes the JVM thread correctly. was:For inheritable_thread_target > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > In PySpark, a user may create threads not via a `Thread` object but via a > parallel helper function such as `thread_pool.imap_unordered`. In this case, > we need to create a function wrapper for pin thread mode. Before calling the > original thread target, the wrapper inherits the inheritable properties > specific to the JVM thread, such as ``InheritableThreadLocal``; after the > original thread target returns, it garbage-collects the Python thread > instance and also closes the connection, which finishes the JVM thread > correctly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
[ https://issues.apache.org/jira/browse/SPARK-35498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-35498: --- Description: For inheritable_thread_target > Add an API "inheritable_thread_target" which return a wrapped thread target > for pyspark pin thread mode > --- > > Key: SPARK-35498 > URL: https://issues.apache.org/jira/browse/SPARK-35498 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Weichen Xu >Priority: Major > > For inheritable_thread_target -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35498) Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode
Weichen Xu created SPARK-35498: -- Summary: Add an API "inheritable_thread_target" which return a wrapped thread target for pyspark pin thread mode Key: SPARK-35498 URL: https://issues.apache.org/jira/browse/SPARK-35498 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Weichen Xu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
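The wrapper idea proposed in SPARK-35498 can be sketched in plain Python. This is a hypothetical illustration of the mechanism, not PySpark's actual `inheritable_thread_target` implementation: `_local.props` and `"jobGroup"` are invented names standing in for the JVM-side inheritable properties, and the `finally` cleanup stands in for closing the connection that finishes the JVM thread.

```python
import threading
from multiprocessing.pool import ThreadPool

# Stand-in for per-thread state that should be inherited by pool workers.
_local = threading.local()

def inheritable_thread_target(target):
    # Capture the "inheritable" properties NOW, on the parent thread that
    # creates the wrapper (pool workers would otherwise see nothing).
    captured = getattr(_local, "props", {}).copy()

    def wrapped(*args, **kwargs):
        _local.props = captured       # install parent-thread state in the worker
        try:
            return target(*args, **kwargs)
        finally:
            del _local.props          # cleanup after the target returns

    return wrapped

# Parent thread sets a property, then hands work to a pool via imap-style APIs.
_local.props = {"jobGroup": "g1"}

def job(x):
    return (_local.props["jobGroup"], x * 2)

with ThreadPool(2) as pool:
    results = pool.map(inheritable_thread_target(job), [1, 2])
assert results == [("g1", 2), ("g1", 4)]
```

Without the wrapper, `job` would raise inside the pool threads, since `threading.local` state is not shared across threads; the wrapper is what makes `thread_pool.imap_unordered`-style helpers usable under pin thread mode.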
[jira] [Created] (SPARK-35497) Enable plotly tests in pandas-on-Spark
Hyukjin Kwon created SPARK-35497: Summary: Enable plotly tests in pandas-on-Spark Key: SPARK-35497 URL: https://issues.apache.org/jira/browse/SPARK-35497 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Hyukjin Kwon Currently, GitHub Actions doesn't have plotly installed and results in plotly related tests being skipped. We should enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35381) Fix lambda variable name issues in nested DataFrame functions in R APIs
[ https://issues.apache.org/jira/browse/SPARK-35381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35381: --- Assignee: Hyukjin Kwon > Fix lambda variable name issues in nested DataFrame functions in R APIs > --- > > Key: SPARK-35381 > URL: https://issues.apache.org/jira/browse/SPARK-35381 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.1.1 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Labels: correctness > Fix For: 3.1.2, 3.2.0 > > > R's higher order functions also have the same problem with SPARK-34794: > {code} > df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters") > collect(select( > df, > array_transform("numbers", function(number) { > array_transform("letters", function(latter) { > struct(alias(number, "n"), alias(latter, "l")) > }) > }) > )) > {code} > {code} > transform(numbers, lambdafunction(transform(letters, > lambdafunction(struct(namedlambdavariable() AS n, namedlambdavariable() AS > l), namedlambdavariable())), namedlambdavariable())) > 1 > a, a, b, b, c, c, a, a, > b, b, c, c, a, a, b, b, c, c > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
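The root cause in SPARK-35381 can be mimicked in a few lines of Python. This is a hypothetical model, not SparkR's code generation: when both nested lambdas are compiled down to the same `namedlambdavariable()` slot, the inner binding shadows the outer one, so the outer element is lost by the time the struct is built, matching the duplicated-letters output quoted above.

```python
# Toy model of nested transforms whose lambdas share ONE variable slot,
# instead of each lambda getting its own distinct variable.
env = {}

def nested_transform_shared_slot(numbers, letters):
    rows = []
    for n in numbers:
        env["lambda_var"] = n          # outer lambda binds its element
        for l in letters:
            env["lambda_var"] = l      # inner lambda clobbers the SAME slot
            # struct(number, letter): "number" resolves through the shared
            # slot, so it wrongly yields the letter.
            rows.append({"n": env["lambda_var"], "l": l})
    return rows

rows = nested_transform_shared_slot([1, 2], ["a", "b"])
# Every "n" field holds a letter instead of a number, as in the bug report.
assert [r["n"] for r in rows] == ["a", "b", "a", "b"]
assert [r["l"] for r in rows] == ["a", "b", "a", "b"]
```

The fix is the obvious inverse: give each lambda its own uniquely named variable so the inner transform cannot shadow the outer element.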
[jira] [Assigned] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35496:
------------------------------------

    Assignee: Apache Spark

> Upgrade Scala 2.13 from 2.13.5 to 2.13.6
> -----------------------------------------
>
>                 Key: SPARK-35496
>                 URL: https://issues.apache.org/jira/browse/SPARK-35496
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 3.2.0
>            Reporter: Yang Jie
>            Assignee: Apache Spark
>            Priority: Major
>
> Scala 2.13.6 released (https://github.com/scala/scala/releases/tag/v2.13.6)
[jira] [Commented] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350267#comment-17350267 ]

Apache Spark commented on SPARK-35496:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32648

> Upgrade Scala 2.13 from 2.13.5 to 2.13.6
> -----------------------------------------
>
>                 Key: SPARK-35496
>                 URL: https://issues.apache.org/jira/browse/SPARK-35496
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 3.2.0
>            Reporter: Yang Jie
>            Priority: Major
>
> Scala 2.13.6 released (https://github.com/scala/scala/releases/tag/v2.13.6)
[jira] [Assigned] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35496:
------------------------------------

    Assignee: (was: Apache Spark)

> Upgrade Scala 2.13 from 2.13.5 to 2.13.6
> -----------------------------------------
>
>                 Key: SPARK-35496
>                 URL: https://issues.apache.org/jira/browse/SPARK-35496
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 3.2.0
>            Reporter: Yang Jie
>            Priority: Major
>
> Scala 2.13.6 released (https://github.com/scala/scala/releases/tag/v2.13.6)
[jira] [Created] (SPARK-35496) Upgrade Scala 2.13 from 2.13.5 to 2.13.6
Yang Jie created SPARK-35496:
--------------------------------

             Summary: Upgrade Scala 2.13 from 2.13.5 to 2.13.6
                 Key: SPARK-35496
                 URL: https://issues.apache.org/jira/browse/SPARK-35496
             Project: Spark
          Issue Type: Sub-task
          Components: Build
    Affects Versions: 3.2.0
            Reporter: Yang Jie

Scala 2.13.6 released (https://github.com/scala/scala/releases/tag/v2.13.6)
[jira] [Commented] (SPARK-34981) Implement V2 function resolution and evaluation
[ https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350257#comment-17350257 ]

Apache Spark commented on SPARK-34981:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32647

> Implement V2 function resolution and evaluation
> ------------------------------------------------
>
>                 Key: SPARK-34981
>                 URL: https://issues.apache.org/jira/browse/SPARK-34981
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>             Fix For: 3.2.0
>
> This is a follow-up of SPARK-27658. With the FunctionCatalog API done, this aims to implement function resolution (in the analyzer) and evaluation, by wrapping the resolved functions in the corresponding expressions.
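Conceptually, the work this issue describes has two halves: the analyzer resolves a function name against a catalog and binds it to concrete argument types, and the bound function is then wrapped in an expression that evaluates it per row. The toy sketch below illustrates that flow; all names (`Catalog`, `BoundFunction`, `Invoke`) are illustrative stand-ins, not Spark's actual V2 FunctionCatalog API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class BoundFunction:
    """A function bound to concrete input types (illustrative)."""
    name: str
    input_types: List[str]
    impl: Callable

class Catalog:
    """Toy function catalog mapping a name to a registered function."""
    def __init__(self):
        self._funcs: Dict[str, BoundFunction] = {}

    def register(self, fn: BoundFunction):
        self._funcs[fn.name] = fn

    def resolve(self, name: str, arg_types: List[str]) -> BoundFunction:
        # The analyzer step: look the name up and check that the call's
        # argument types match the function's declared input types.
        fn = self._funcs.get(name)
        if fn is None or fn.input_types != arg_types:
            raise ValueError(f"cannot resolve {name}({', '.join(arg_types)})")
        return fn

@dataclass
class Invoke:
    """Expression wrapping a bound function for row-by-row evaluation."""
    fn: BoundFunction

    def eval(self, *args):
        return self.fn.impl(*args)

# Resolution followed by evaluation:
catalog = Catalog()
catalog.register(BoundFunction("strlen", ["string"], lambda s: len(s)))
expr = Invoke(catalog.resolve("strlen", ["string"]))
```

The point of the two-phase split is that type mismatches surface at analysis time (`resolve`), while `eval` stays a cheap per-row call on an already-bound function.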
[jira] [Commented] (SPARK-34981) Implement V2 function resolution and evaluation
[ https://issues.apache.org/jira/browse/SPARK-34981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350256#comment-17350256 ]

Apache Spark commented on SPARK-34981:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32647

> Implement V2 function resolution and evaluation
> ------------------------------------------------
>
>                 Key: SPARK-34981
>                 URL: https://issues.apache.org/jira/browse/SPARK-34981
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>             Fix For: 3.2.0
>
> This is a follow-up of SPARK-27658. With the FunctionCatalog API done, this aims to implement function resolution (in the analyzer) and evaluation, by wrapping the resolved functions in the corresponding expressions.