[jira] [Updated] (SPARK-32489) Pass all `core` module UTs in Scala 2.13

2020-07-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32489:
--
Description: 
{code}
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13
...
Tests: succeeded 2612, failed 3, canceled 1, ignored 8, pending 0
*** 3 TESTS FAILED ***
{code}

*AFTER*
{code}
Tests: succeeded 2615, failed 0, canceled 1, ignored 8, pending 0
All tests passed.
{code}

  was:
{code}
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13
...
Tests: succeeded 2615, failed 0, canceled 1, ignored 8, pending 0
All tests passed.
{code}


> Pass all `core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32489
> URL: https://issues.apache.org/jira/browse/SPARK-32489
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> $ dev/change-scala-version.sh 2.13
> $ build/mvn test -pl core --am -Pscala-2.13
> ...
> Tests: succeeded 2612, failed 3, canceled 1, ignored 8, pending 0
> *** 3 TESTS FAILED ***
> {code}
> *AFTER*
> {code}
> Tests: succeeded 2615, failed 0, canceled 1, ignored 8, pending 0
> All tests passed.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32475) java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;

2020-07-30 Thread jinhai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167689#comment-17167689
 ] 

jinhai commented on SPARK-32475:


!image-2020-07-30-15-15-42-319.png!

> java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
> 
>
> Key: SPARK-32475
> URL: https://issues.apache.org/jira/browse/SPARK-32475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Spark-3.0.0;JDK 8
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-07-30-15-15-42-319.png
>
>
> When I compile the spark-core_2.12 module with the command below and replace 
> /jars/spark-core_2.12-3.0.0.jar with the newly built spark-core_2.12-3.0.0.jar, 
> I get an error (without making any code changes).
> command: ./build/mvn -pl :spark-core_2.12 -Pyarn -Phive -Phive-thriftserver 
> -Dhadoop.version=2.7.4 -DskipTests clean package
> version: spark-3.0.0
> jdk: 1.8



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32475) java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;

2020-07-30 Thread jinhai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jinhai updated SPARK-32475:
---
Attachment: image-2020-07-30-15-15-42-319.png

> java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
> 
>
> Key: SPARK-32475
> URL: https://issues.apache.org/jira/browse/SPARK-32475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Spark-3.0.0;JDK 8
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-07-30-15-15-42-319.png
>
>
> When I compile the spark-core_2.12 module with the command below and replace 
> /jars/spark-core_2.12-3.0.0.jar with the newly built spark-core_2.12-3.0.0.jar, 
> I get an error (without making any code changes).
> command: ./build/mvn -pl :spark-core_2.12 -Pyarn -Phive -Phive-thriftserver 
> -Dhadoop.version=2.7.4 -DskipTests clean package
> version: spark-3.0.0
> jdk: 1.8



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32481) Support truncate table to move the data to trash

2020-07-30 Thread jobit mathew (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jobit mathew updated SPARK-32481:
-
Description: *Instead of deleting the data, move it to the trash. From the trash, 
the data can then be deleted permanently based on configuration.*

> Support truncate table to move the data to trash
> 
>
> Key: SPARK-32481
> URL: https://issues.apache.org/jira/browse/SPARK-32481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> *Instead of deleting the data, move it to the trash. From the trash, the data 
> can then be deleted permanently based on configuration.*
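For context, one building block that could implement this (my assumption, not something 
stated in the ticket) is Hadoop's trash API, where retention is already governed by the 
fs.trash.interval configuration. A minimal Scala sketch; the helper name moveToTrash is 
hypothetical:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, Trash}

// Sketch: move a table's directory to the user's trash instead of deleting it.
// Expiry of trashed data is then driven by fs.trash.interval.
def moveToTrash(tablePath: Path, conf: Configuration): Boolean = {
  val fs = tablePath.getFileSystem(conf)
  // Returns false when trash is disabled, in which case the caller would fall
  // back to a plain delete.
  Trash.moveToAppropriateTrash(fs, tablePath, conf)
}
{code}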



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32475) java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;

2020-07-30 Thread jinhai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167690#comment-17167690
 ] 

jinhai commented on SPARK-32475:


My mistake. The default JDK version used by Maven is 14, not JDK 8.

> java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
> 
>
> Key: SPARK-32475
> URL: https://issues.apache.org/jira/browse/SPARK-32475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Spark-3.0.0;JDK 8
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-07-30-15-15-42-319.png
>
>
> When I compile the spark-core_2.12 module with the command below and replace 
> /jars/spark-core_2.12-3.0.0.jar with the newly built spark-core_2.12-3.0.0.jar, 
> I get an error (without making any code changes).
> command: ./build/mvn -pl :spark-core_2.12 -Pyarn -Phive -Phive-thriftserver 
> -Dhadoop.version=2.7.4 -DskipTests clean package
> version: spark-3.0.0
> jdk: 1.8



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32475) java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;

2020-07-30 Thread jinhai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jinhai resolved SPARK-32475.

Resolution: Not A Problem

> java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
> 
>
> Key: SPARK-32475
> URL: https://issues.apache.org/jira/browse/SPARK-32475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Spark-3.0.0;JDK 8
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-07-30-15-15-42-319.png
>
>
> When I compile the spark-core_2.12 module with the command below and replace 
> /jars/spark-core_2.12-3.0.0.jar with the newly built spark-core_2.12-3.0.0.jar, 
> I get an error (without making any code changes).
> command: ./build/mvn -pl :spark-core_2.12 -Pyarn -Phive -Phive-thriftserver 
> -Dhadoop.version=2.7.4 -DskipTests clean package
> version: spark-3.0.0
> jdk: 1.8



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32490) Upgrade netty-all to 4.1.51.Final

2020-07-30 Thread Yang Jie (Jira)
Yang Jie created SPARK-32490:


 Summary: Upgrade netty-all to 4.1.51.Final
 Key: SPARK-32490
 URL: https://issues.apache.org/jira/browse/SPARK-32490
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32490) Upgrade netty-all to 4.1.51.Final

2020-07-30 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-32490:
-
Description: Upgrade io.netty:netty-all to 4.1.51.Final to fix some bugs.

> Upgrade netty-all to 4.1.51.Final
> -
>
> Key: SPARK-32490
> URL: https://issues.apache.org/jira/browse/SPARK-32490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Minor
>
> Upgrade io.netty:netty-all to 4.1.51.Final to fix some bugs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32490) Upgrade netty-all to 4.1.51.Final

2020-07-30 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167691#comment-17167691
 ] 

Rohit Mishra commented on SPARK-32490:
--

[~LuciferYang], can you please add a description?

> Upgrade netty-all to 4.1.51.Final
> -
>
> Key: SPARK-32490
> URL: https://issues.apache.org/jira/browse/SPARK-32490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32490) Upgrade netty-all to 4.1.51.Final

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32490:


Assignee: (was: Apache Spark)

> Upgrade netty-all to 4.1.51.Final
> -
>
> Key: SPARK-32490
> URL: https://issues.apache.org/jira/browse/SPARK-32490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Minor
>
> Upgrade io.netty:netty-all to 4.1.51.Final to fix some bugs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32490) Upgrade netty-all to 4.1.51.Final

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167692#comment-17167692
 ] 

Apache Spark commented on SPARK-32490:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29299

> Upgrade netty-all to 4.1.51.Final
> -
>
> Key: SPARK-32490
> URL: https://issues.apache.org/jira/browse/SPARK-32490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Minor
>
> Upgrade io.netty:netty-all to 4.1.51.Final to fix some bugs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32490) Upgrade netty-all to 4.1.51.Final

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32490:


Assignee: Apache Spark

> Upgrade netty-all to 4.1.51.Final
> -
>
> Key: SPARK-32490
> URL: https://issues.apache.org/jira/browse/SPARK-32490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Upgrade io.netty:netty-all to 4.1.51.Final to fix some bugs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32491) Do not install SparkR in test-only mode in testing script

2020-07-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32491:


 Summary: Do not install SparkR in test-only mode in testing script
 Key: SPARK-32491
 URL: https://issues.apache.org/jira/browse/SPARK-32491
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


Currently, GitHub Actions builds fail as below:

{code}
ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
[error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
return code 1
##[error]Process completed with exit code 10.
{code}

https://github.com/apache/spark/runs/926437963

It looks like GitHub Actions has R 3.4.4 installed by default; however, R 3.4 
support was dropped as of SPARK-32073. When SparkR tests are not needed, GitHub 
Actions does not install R 3.6, which causes the test failure.

In fact, SparkR is installed only for running the R linter (see SPARK-8505); 
however, we don't run the linters in test-only mode in the testing script.

We can safely skip it in test-only mode in our testing script.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32491) Do not install SparkR in test-only mode in testing script

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32491:
-
Parent: SPARK-32244
Issue Type: Sub-task  (was: Improvement)

> Do not install SparkR in test-only mode in testing script
> -
>
> Key: SPARK-32491
> URL: https://issues.apache.org/jira/browse/SPARK-32491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, GitHub Actions builds fail as below:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> https://github.com/apache/spark/runs/926437963
> It looks like GitHub Actions has R 3.4.4 installed by default; however, R 3.4 
> support was dropped as of SPARK-32073. When SparkR tests are not needed, GitHub 
> Actions does not install R 3.6, which causes the test failure.
> In fact, SparkR is installed only for running the R linter (see SPARK-8505); 
> however, we don't run the linters in test-only mode in the testing script.
> We can safely skip it in test-only mode in our testing script.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32491) Do not install SparkR in test-only mode in testing script

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167711#comment-17167711
 ] 

Apache Spark commented on SPARK-32491:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29300

> Do not install SparkR in test-only mode in testing script
> -
>
> Key: SPARK-32491
> URL: https://issues.apache.org/jira/browse/SPARK-32491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, GitHub Actions builds fail as below:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> https://github.com/apache/spark/runs/926437963
> It looks like GitHub Actions has R 3.4.4 installed by default; however, R 3.4 
> support was dropped as of SPARK-32073. When SparkR tests are not needed, GitHub 
> Actions does not install R 3.6, which causes the test failure.
> In fact, SparkR is installed only for running the R linter (see SPARK-8505); 
> however, we don't run the linters in test-only mode in the testing script.
> We can safely skip it in test-only mode in our testing script.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32491) Do not install SparkR in test-only mode in testing script

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32491:


Assignee: Apache Spark

> Do not install SparkR in test-only mode in testing script
> -
>
> Key: SPARK-32491
> URL: https://issues.apache.org/jira/browse/SPARK-32491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, GitHub Actions builds fail as below:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> https://github.com/apache/spark/runs/926437963
> It looks like GitHub Actions has R 3.4.4 installed by default; however, R 3.4 
> support was dropped as of SPARK-32073. When SparkR tests are not needed, GitHub 
> Actions does not install R 3.6, which causes the test failure.
> In fact, SparkR is installed only for running the R linter (see SPARK-8505); 
> however, we don't run the linters in test-only mode in the testing script.
> We can safely skip it in test-only mode in our testing script.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32491) Do not install SparkR in test-only mode in testing script

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32491:


Assignee: (was: Apache Spark)

> Do not install SparkR in test-only mode in testing script
> -
>
> Key: SPARK-32491
> URL: https://issues.apache.org/jira/browse/SPARK-32491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, GitHub Actions builds fail as below:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> https://github.com/apache/spark/runs/926437963
> It looks like GitHub Actions has R 3.4.4 installed by default; however, R 3.4 
> support was dropped as of SPARK-32073. When SparkR tests are not needed, GitHub 
> Actions does not install R 3.6, which causes the test failure.
> In fact, SparkR is installed only for running the R linter (see SPARK-8505); 
> however, we don't run the linters in test-only mode in the testing script.
> We can safely skip it in test-only mode in our testing script.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32474) NullAwareAntiJoin multi-column support

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32474:


Assignee: Apache Spark

> NullAwareAntiJoin multi-column support
> --
>
> Key: SPARK-32474
> URL: https://issues.apache.org/jira/browse/SPARK-32474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Apache Spark
>Priority: Minor
>
> This is a follow-up improvement to SPARK-32290.
> In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
> BroadcastHashJoin, which improves the total cost from O(M*N) to O(M), but it 
> only targets the single-column case, because multi-column support is much 
> more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>  
> FYI, the test logic for the single- and multi-column cases is defined in
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> To support multiple columns, I propose the following idea; let's see whether 
> it is worth doing with some trade-offs. I would need to do some data 
> expansion in HashedRelation, and I would call this new type of 
> HashedRelation a NullAwareHashedRelation.
>  
> In NullAwareHashedRelation, keys with null columns are allowed, which is the 
> opposite of LongHashedRelation and UnsafeHashedRelation; a single key may be 
> expanded into 2^N - 1 records (N is the number of key columns). For example, 
> when the record (1, 2, 3) is about to be inserted into 
> NullAwareHashedRelation, we take the C(1,3) and C(2,3) combinations of 
> positions, copy the original key row, set the chosen positions to null, and 
> insert the results. Including the original key row, 7 key rows are inserted:
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
>  
> With the expanded data we can extract a common pattern for both the single- 
> and multi-column cases. allNull refers to an UnsafeRow whose columns are all 
> null.
>  * build side is empty input => return all rows
>  * a key with all columns null exists in the build-side input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & a match is found in 
> NullAwareHashedRelation => drop the row
>  * if streamedSideRow.allNull is false & no match is found in 
> NullAwareHashedRelation => return the row
>  
> This solution will expand the build-side data by up to 2^N - 1 times, but 
> since NAAJ normally involves only 2~3 columns in production queries, I think 
> it is acceptable to expand the build-side data to around 7x. I would also 
> limit the maximum number of columns supported for NAAJ to no more than 3. 
>  
>  
>  
>  
>  
>  
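To make the expansion above concrete, here is a small stand-alone Scala sketch 
(illustration only; the actual NullAwareHashedRelation would operate on UnsafeRow, 
and the helper name expandKey is made up):

{code:scala}
// Expand one build-side key into the 2^N - 1 variants described above: every
// proper subset of the key's positions is set to null, and the original key
// itself is kept (only the all-null variant is excluded).
def expandKey(key: Seq[Any]): Seq[Seq[Any]] = {
  val n = key.length
  key.indices.toSet.subsets().toSeq
    .filter(_.size < n) // skip the variant with every column nulled
    .map(nulled => key.indices.map(i => if (nulled(i)) null else key(i)))
}

// expandKey(Seq(1, 2, 3)) yields the 7 rows listed in the description,
// e.g. Seq(null, 2, 3), Seq(1, null, 3), ..., Seq(1, 2, 3).
{code}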



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32474) NullAwareAntiJoin multi-column support

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32474:


Assignee: (was: Apache Spark)

> NullAwareAntiJoin multi-column support
> --
>
> Key: SPARK-32474
> URL: https://issues.apache.org/jira/browse/SPARK-32474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> This is a follow-up improvement to SPARK-32290.
> In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
> BroadcastHashJoin, which improves the total cost from O(M*N) to O(M), but it 
> only targets the single-column case, because multi-column support is much 
> more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>  
> FYI, the test logic for the single- and multi-column cases is defined in
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> To support multiple columns, I propose the following idea; let's see whether 
> it is worth doing with some trade-offs. I would need to do some data 
> expansion in HashedRelation, and I would call this new type of 
> HashedRelation a NullAwareHashedRelation.
>  
> In NullAwareHashedRelation, keys with null columns are allowed, which is the 
> opposite of LongHashedRelation and UnsafeHashedRelation; a single key may be 
> expanded into 2^N - 1 records (N is the number of key columns). For example, 
> when the record (1, 2, 3) is about to be inserted into 
> NullAwareHashedRelation, we take the C(1,3) and C(2,3) combinations of 
> positions, copy the original key row, set the chosen positions to null, and 
> insert the results. Including the original key row, 7 key rows are inserted:
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
>  
> With the expanded data we can extract a common pattern for both the single- 
> and multi-column cases. allNull refers to an UnsafeRow whose columns are all 
> null.
>  * build side is empty input => return all rows
>  * a key with all columns null exists in the build-side input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & a match is found in 
> NullAwareHashedRelation => drop the row
>  * if streamedSideRow.allNull is false & no match is found in 
> NullAwareHashedRelation => return the row
>  
> This solution will expand the build-side data by up to 2^N - 1 times, but 
> since NAAJ normally involves only 2~3 columns in production queries, I think 
> it is acceptable to expand the build-side data to around 7x. I would also 
> limit the maximum number of columns supported for NAAJ to no more than 3. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32474) NullAwareAntiJoin multi-column support

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167723#comment-17167723
 ] 

Apache Spark commented on SPARK-32474:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29301

> NullAwareAntiJoin multi-column support
> --
>
> Key: SPARK-32474
> URL: https://issues.apache.org/jira/browse/SPARK-32474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> This is a follow-up improvement to SPARK-32290.
> In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
> BroadcastHashJoin, which improves the total cost from O(M*N) to O(M), but it 
> only targets the single-column case, because multi-column support is much 
> more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>  
> FYI, the test logic for the single- and multi-column cases is defined in
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> To support multiple columns, I propose the following idea; let's see whether 
> it is worth doing with some trade-offs. I would need to do some data 
> expansion in HashedRelation, and I would call this new type of 
> HashedRelation a NullAwareHashedRelation.
>  
> In NullAwareHashedRelation, keys with null columns are allowed, which is the 
> opposite of LongHashedRelation and UnsafeHashedRelation; a single key may be 
> expanded into 2^N - 1 records (N is the number of key columns). For example, 
> when the record (1, 2, 3) is about to be inserted into 
> NullAwareHashedRelation, we take the C(1,3) and C(2,3) combinations of 
> positions, copy the original key row, set the chosen positions to null, and 
> insert the results. Including the original key row, 7 key rows are inserted:
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
>  
> With the expanded data we can extract a common pattern for both the single- 
> and multi-column cases. allNull refers to an UnsafeRow whose columns are all 
> null.
>  * build side is empty input => return all rows
>  * a key with all columns null exists in the build-side input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & a match is found in 
> NullAwareHashedRelation => drop the row
>  * if streamedSideRow.allNull is false & no match is found in 
> NullAwareHashedRelation => return the row
>  
> This solution will expand the build-side data by up to 2^N - 1 times, but 
> since NAAJ normally involves only 2~3 columns in production queries, I think 
> it is acceptable to expand the build-side data to around 7x. I would also 
> limit the maximum number of columns supported for NAAJ to no more than 3. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32474) NullAwareAntiJoin multi-column support

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167722#comment-17167722
 ] 

Apache Spark commented on SPARK-32474:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29301

> NullAwareAntiJoin multi-column support
> --
>
> Key: SPARK-32474
> URL: https://issues.apache.org/jira/browse/SPARK-32474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> This is a follow-up improvement to SPARK-32290.
> In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
> BroadcastHashJoin, which improves the total cost from O(M*N) to O(M), but it 
> only targets the single-column case, because multi-column support is much 
> more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>  
> FYI, the test logic for the single- and multi-column cases is defined in
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> To support multiple columns, I propose the following idea; let's see whether 
> it is worth doing with some trade-offs. I would need to do some data 
> expansion in HashedRelation, and I would call this new type of 
> HashedRelation a NullAwareHashedRelation.
>  
> In NullAwareHashedRelation, keys with null columns are allowed, which is the 
> opposite of LongHashedRelation and UnsafeHashedRelation; a single key may be 
> expanded into 2^N - 1 records (N is the number of key columns). For example, 
> when the record (1, 2, 3) is about to be inserted into 
> NullAwareHashedRelation, we take the C(1,3) and C(2,3) combinations of 
> positions, copy the original key row, set the chosen positions to null, and 
> insert the results. Including the original key row, 7 key rows are inserted:
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
>  
> With the expanded data we can extract a common pattern for both the single- 
> and multi-column cases. allNull refers to an UnsafeRow whose columns are all 
> null.
>  * build side is empty input => return all rows
>  * a key with all columns null exists in the build-side input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & a match is found in 
> NullAwareHashedRelation => drop the row
>  * if streamedSideRow.allNull is false & no match is found in 
> NullAwareHashedRelation => return the row
>  
> This solution will expand the build-side data by up to 2^N - 1 times, but 
> since NAAJ normally involves only 2~3 columns in production queries, I think 
> it is acceptable to expand the build-side data to around 7x. I would also 
> limit the maximum number of columns supported for NAAJ to no more than 3. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32488) Use @parser::members and @lexer::members to avoid generating unused code

2020-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32488.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29296
[https://github.com/apache/spark/pull/29296]

> Use @parser::members and @lexer::members to avoid generating unused code
> 
>
> Key: SPARK-32488
> URL: https://issues.apache.org/jira/browse/SPARK-32488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.1.0
>
>
> This ticket aims to update {{SqlBase.g4}} to avoid generating unused code.
> Currently, ANTLR generates unused methods and variables; {{isValidDecimal}} 
> and {{isHint}} are only used in the generated lexer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32488) Use @parser::members and @lexer::members to avoid generating unused code

2020-07-30 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32488:
---

Assignee: Takeshi Yamamuro

> Use @parser::members and @lexer::members to avoid generating unused code
> 
>
> Key: SPARK-32488
> URL: https://issues.apache.org/jira/browse/SPARK-32488
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> This ticket aims to update {{SqlBase.g4}} to avoid generating unused code.
> Currently, ANTLR generates unused methods and variables; {{isValidDecimal}} 
> and {{isHint}} are only used in the generated lexer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32483) spark-shell: error: value topByKey is not a member of org.apache.spark.rdd.RDD[(String, (String, Double))]

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32483:
-
Priority: Major  (was: Critical)

> spark-shell: error: value topByKey is not a member of 
> org.apache.spark.rdd.RDD[(String, (String, Double))]
> --
>
> Key: SPARK-32483
> URL: https://issues.apache.org/jira/browse/SPARK-32483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
> Environment: spark-mllib version=2.4.0 
>Reporter: manley
>Priority: Major
>
> hi:
> The problem I ran into is that I get the error "value topByKey is not a member 
> of org.apache.spark.rdd.RDD[(String, (String, Double))]" in spark-shell, but 
> the same code works in a JUnit test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32483) spark-shell: error: value topByKey is not a member of org.apache.spark.rdd.RDD[(String, (String, Double))]

2020-07-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167734#comment-17167734
 ] 

Hyukjin Kwon commented on SPARK-32483:
--

Please avoid setting the Priority to Critical+ which is usually reserved for 
committers, seehttp://spark.apache.org/contributing.html. cc [~rohitmishr1484] 
FYI

> spark-shell: error: value topByKey is not a member of 
> org.apache.spark.rdd.RDD[(String, (String, Double))]
> --
>
> Key: SPARK-32483
> URL: https://issues.apache.org/jira/browse/SPARK-32483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
> Environment: spark-mllib version=2.4.0 
>Reporter: manley
>Priority: Major
>
> hi:
> The problem I ran into is that I get the error "value topByKey is not a member 
> of org.apache.spark.rdd.RDD[(String, (String, Double))]" in spark-shell, but 
> the same code works in a JUnit test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32483) spark-shell: error: value topByKey is not a member of org.apache.spark.rdd.RDD[(String, (String, Double))]

2020-07-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167734#comment-17167734
 ] 

Hyukjin Kwon edited comment on SPARK-32483 at 7/30/20, 8:00 AM:


Please avoid setting the Priority to Critical+ which is usually reserved for 
committers, see http://spark.apache.org/contributing.html. cc [~rohitmishr1484] 
FYI


was (Author: hyukjin.kwon):
Please avoid setting the Priority to Critical+ which is usually reserved for 
committers, seehttp://spark.apache.org/contributing.html. cc [~rohitmishr1484] 
FYI

> spark-shell: error: value topByKey is not a member of 
> org.apache.spark.rdd.RDD[(String, (String, Double))]
> --
>
> Key: SPARK-32483
> URL: https://issues.apache.org/jira/browse/SPARK-32483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
> Environment: spark-mllib version=2.4.0 
>Reporter: manley
>Priority: Major
>
> hi:
> The problem I ran into is that I get the error "value topByKey is not a member 
> of org.apache.spark.rdd.RDD[(String, (String, Double))]" in spark-shell, but 
> the same code works in a JUnit test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32483) spark-shell: error: value topByKey is not a member of org.apache.spark.rdd.RDD[(String, (String, Double))]

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32483.
--
Resolution: Invalid

> spark-shell: error: value topByKey is not a member of 
> org.apache.spark.rdd.RDD[(String, (String, Double))]
> --
>
> Key: SPARK-32483
> URL: https://issues.apache.org/jira/browse/SPARK-32483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
> Environment: spark-mllib version=2.4.0 
>Reporter: manley
>Priority: Major
>
> hi:
> The problem I ran into is that I get the error "value topByKey is not a member 
> of org.apache.spark.rdd.RDD[(String, (String, Double))]" in spark-shell, but 
> the same code works in a JUnit test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32483) spark-shell: error: value topByKey is not a member of org.apache.spark.rdd.RDD[(String, (String, Double))]

2020-07-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167735#comment-17167735
 ] 

Hyukjin Kwon commented on SPARK-32483:
--

Looks like [~JinxinTang]'s way works. I am resolving this.
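For readers hitting the same error: topByKey is not defined on RDD itself but is added 
by an MLlib implicit conversion, so the fix is presumably just importing it. A hedged 
spark-shell sketch, assuming that is indeed the suggested workaround:

{code:scala}
// The implicit conversion in MLPairRDDFunctions is what adds topByKey to an RDD[(K, V)].
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

val pairs = sc.parallelize(Seq(
  ("a", ("x", 1.0)), ("a", ("y", 2.0)), ("b", ("z", 3.0))))

// Without the import above, this line fails to compile with
// "value topByKey is not a member of org.apache.spark.rdd.RDD[(String, (String, Double))]".
val top = pairs.topByKey(1)
{code}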

> spark-shell: error: value topByKey is not a member of 
> org.apache.spark.rdd.RDD[(String, (String, Double))]
> --
>
> Key: SPARK-32483
> URL: https://issues.apache.org/jira/browse/SPARK-32483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
> Environment: spark-mllib version=2.4.0 
>Reporter: manley
>Priority: Major
>
> hi:
> The problem I ran into is that I get the error "value topByKey is not a member 
> of org.apache.spark.rdd.RDD[(String, (String, Double))]" in spark-shell, but 
> the same code works in a JUnit test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32475) java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;

2020-07-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167736#comment-17167736
 ] 

Hyukjin Kwon commented on SPARK-32475:
--

Looks like your Java paths are mixed up. It happens when you compile with JDK 
11 but run with JDK 8.
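For context (general JDK behavior, not something specific to this ticket): since JDK 9, 
ByteBuffer overrides flip() to return ByteBuffer instead of Buffer, so bytecode compiled 
on JDK 9+ references a method descriptor that does not exist on a JDK 8 runtime. The usual 
source-level workaround is to call flip() through the Buffer supertype; a minimal Scala 
sketch (the helper name flipCompat is made up):

{code:scala}
import java.nio.{Buffer, ByteBuffer}

// Compiled against JDK 9+, buf.flip() links to flip()Ljava/nio/ByteBuffer;,
// which JDK 8 does not have. Calling through the Buffer supertype keeps the
// JDK 8-compatible descriptor flip()Ljava/nio/Buffer;.
def flipCompat(buf: ByteBuffer): Unit = {
  (buf: Buffer).flip()
}
{code}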

> java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
> 
>
> Key: SPARK-32475
> URL: https://issues.apache.org/jira/browse/SPARK-32475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Spark-3.0.0;JDK 8
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-07-30-15-15-42-319.png
>
>
> When I compile the spark-core_2.12 module with the command below and replace 
> /jars/spark-core_2.12-3.0.0.jar with the newly built spark-core_2.12-3.0.0.jar, 
> I get an error (without making any code changes).
> command: ./build/mvn -pl :spark-core_2.12 -Pyarn -Phive -Phive-thriftserver 
> -Dhadoop.version=2.7.4 -DskipTests clean package
> version: spark-3.0.0
> jdk: 1.8



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32475) java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;

2020-07-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167737#comment-17167737
 ] 

Hyukjin Kwon commented on SPARK-32475:
--

Oh right.

> java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
> 
>
> Key: SPARK-32475
> URL: https://issues.apache.org/jira/browse/SPARK-32475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Spark-3.0.0;JDK 8
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-07-30-15-15-42-319.png
>
>
> When I compile the spark-core_2.12 module with the command below and replace 
> /jars/spark-core_2.12-3.0.0.jar with the newly built spark-core_2.12-3.0.0.jar, 
> I get an error (without making any code changes).
> command: ./build/mvn -pl :spark-core_2.12 -Pyarn -Phive -Phive-thriftserver 
> -Dhadoop.version=2.7.4 -DskipTests clean package
> version: spark-3.0.0
> jdk: 1.8



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32479) Fix the slicing logic in createDataFrame when converting pandas dataframe to arrow table

2020-07-30 Thread Liang Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167759#comment-17167759
 ] 

Liang Zhang commented on SPARK-32479:
-

This is not a bug. Spark will always create `defaultParallelism` partitions; 
there could be empty partitions. Moving to Won't Do.

> Fix the slicing logic in createDataFrame when converting pandas dataframe to 
> arrow table
> 
>
> Key: SPARK-32479
> URL: https://issues.apache.org/jira/browse/SPARK-32479
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
>
> h1. Problem:
> In 
> [https://github.com/databricks/runtime/blob/84a952313ae73e3df32f065eb00cc0bcb024af14/python/pyspark/sql/pandas/conversion.py#L418|https://github.com/databricks/runtime/blob/84a952313ae73e3df32f065eb00cc0bcb024af14/python/pyspark/sql/pandas/conversion.py#L418,]
>  , the slicing logic may result in fewer partitions than specified.
> h1. Example:
> Assume:
> {noformat}
> length = 100 -> [0, 1, ..., 99]
> num_slices = 99 = self.sparkContext.defaultParallelism{noformat}
> Old method:
> step = math.ceil(length / num_slices) = 2
>  start = i * step, end = (i + 1) * step:
>  output: [0,1] [2,3] [4,5] ... [98,99] -> 50 slices != num_slices
>  
> h1. Solution:
> We can use similar logic to 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L125]
> {code:python}
> # replace conversion.py#L418
> pdf_slices = (pdf.iloc[i * length // num_slices: (i + 1) * length // 
> num_slices] for i in xrange(0, num_slices))
> {code}
> New method:
>  start = i * length // num_slices, end = (i + 1) * length // num_slices:
>  output: [0] [1] [2] ... [98,99] -> 99 slices
>  
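For comparison, the ParallelCollectionRDD slicing referenced above computes slice 
boundaries with the same integer arithmetic; a paraphrased Scala sketch (not the exact 
source, and the helper name positions is reused informally):

{code:scala}
// Slice i of a collection of the given length covers
// [i * length / numSlices, (i + 1) * length / numSlices), so exactly numSlices
// slices are produced (some possibly empty) and no element is dropped.
def positions(length: Long, numSlices: Int): Seq[(Int, Int)] =
  (0 until numSlices).map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }

// positions(100, 99) gives (0,1), (1,2), ..., (97,98), (98,100): 99 slices.
{code}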



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32479) Fix the slicing logic in createDataFrame when converting pandas dataframe to arrow table

2020-07-30 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-32479.

Resolution: Not A Bug

> Fix the slicing logic in createDataFrame when converting pandas dataframe to 
> arrow table
> 
>
> Key: SPARK-32479
> URL: https://issues.apache.org/jira/browse/SPARK-32479
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
>
> h1. Problem:
> In 
> [https://github.com/databricks/runtime/blob/84a952313ae73e3df32f065eb00cc0bcb024af14/python/pyspark/sql/pandas/conversion.py#L418|https://github.com/databricks/runtime/blob/84a952313ae73e3df32f065eb00cc0bcb024af14/python/pyspark/sql/pandas/conversion.py#L418,]
>  , the slicing logic may result in fewer partitions than specified.
> h1. Example:
> Assume:
> {noformat}
> length = 100 -> [0, 1, ..., 99]
> num_slices = 99 = self.sparkContext.defaultParallelism{noformat}
> Old method:
> step = math.ceil(length / num_slices) = 2
>  start = i * step, end = (i + 1) * step:
>  output: [0,1] [2,3] [4,5] ... [98,99] -> 50 slices != num_slices
>  
> h1. Solution:
> We can use similar logic to 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L125]
> {code:python}
> # replace conversion.py#L418
> pdf_slices = (pdf.iloc[i * length // num_slices: (i + 1) * length // 
> num_slices] for i in xrange(0, num_slices))
> {code}
> New method:
>  start = i * length // num_slices, end = (i + 1) * length // num_slices:
>  output: [0] [1] [2] ... [98,99] -> 99 slices
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32474) NullAwareAntiJoin multi-column support

2020-07-30 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin resolved SPARK-32474.
-
Resolution: Duplicate

> NullAwareAntiJoin multi-column support
> --
>
> Key: SPARK-32474
> URL: https://issues.apache.org/jira/browse/SPARK-32474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> This is a follow-up improvement to SPARK-32290.
> In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
> BroadcastHashJoin, which improves the total cost from O(M*N) to O(M), but it 
> only targets the single-column case, because multi-column support is much 
> more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>  
> FYI, the test logic for the single- and multi-column cases is defined in
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> To support multiple columns, I propose the following idea; let's see whether 
> it is worth doing with some trade-offs. I would need to do some data 
> expansion in HashedRelation, and I would call this new type of 
> HashedRelation a NullAwareHashedRelation.
>  
> In NullAwareHashedRelation, keys with null columns are allowed, which is the 
> opposite of LongHashedRelation and UnsafeHashedRelation; a single key may be 
> expanded into 2^N - 1 records (N is the number of key columns). For example, 
> when the record (1, 2, 3) is about to be inserted into 
> NullAwareHashedRelation, we take the C(1,3) and C(2,3) combinations of 
> positions, copy the original key row, set the chosen positions to null, and 
> insert the results. Including the original key row, 7 key rows are inserted:
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
>  
> With the expanded data we can extract a common pattern for both the single- 
> and multi-column cases. allNull refers to an UnsafeRow whose columns are all 
> null.
>  * build side is empty input => return all rows
>  * a key with all columns null exists in the build-side input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & a match is found in 
> NullAwareHashedRelation => drop the row
>  * if streamedSideRow.allNull is false & no match is found in 
> NullAwareHashedRelation => return the row
>  
> This solution will expand the build-side data by up to 2^N - 1 times, but 
> since NAAJ normally involves only 2~3 columns in production queries, I think 
> it is acceptable to expand the build-side data to around 7x. I would also 
> limit the maximum number of columns supported for NAAJ to no more than 3. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32492) Fulfill missing column meta information for thrift server client tools

2020-07-30 Thread Kent Yao (Jira)
Kent Yao created SPARK-32492:


 Summary: Fulfill missing column meta information for thrift server 
client tools
 Key: SPARK-32492
 URL: https://issues.apache.org/jira/browse/SPARK-32492
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


Most fields of a column are missing, e.g. position, column-size

!image-2020-07-30-17-51-03-351.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32492) Fulfill missing column meta information for thrift server client tools

2020-07-30 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-32492:
-
Description: 
Most fields of a column are missing, e.g. position, column-size

 

  was:
Most fields of a column are missing, e.g. position, column-size

!image-2020-07-30-17-51-03-351.png!


> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Most fields of a column are missing, e.g. position, column-size
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32492) Fulfill missing column meta information for thrift server client tools

2020-07-30 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-32492:
-
Attachment: wx20200730-175...@2x.png

> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
> Attachments: wx20200730-175...@2x.png
>
>
> Most fields of a column are missing, e.g. position, column-size
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32493) Manually install R instead of using setup-r in GitHub Actions

2020-07-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32493:


 Summary: Manually install R instead of using setup-r in GitHub 
Actions
 Key: SPARK-32493
 URL: https://issues.apache.org/jira/browse/SPARK-32493
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra, SparkR
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


Looks like {{setup-r}} fails to install a specific R version. See the debugging 
logs at https://github.com/HyukjinKwon/spark/pull/15:

{code}
ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
[error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
return code 1
##[error]Process completed with exit code 10.
{code}

We should maybe just manually install R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32493) Manually install R instead of using setup-r in GitHub Actions

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32493:


Assignee: Apache Spark

> Manually install R instead of using setup-r in GitHub Actions
> -
>
> Key: SPARK-32493
> URL: https://issues.apache.org/jira/browse/SPARK-32493
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Looks like {{setup-r}} fails to install a specific R version. See the debugging 
> logs at https://github.com/HyukjinKwon/spark/pull/15:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> We should maybe just manually install R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32493) Manually install R instead of using setup-r in GitHub Actions

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32493:


Assignee: (was: Apache Spark)

> Manually install R instead of using setup-r in GitHub Actions
> -
>
> Key: SPARK-32493
> URL: https://issues.apache.org/jira/browse/SPARK-32493
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Looks like {{setup-r}} fails to install specific R version. See the debugging 
> logs at https://github.com/HyukjinKwon/spark/pull/15:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> We should maybe just manually install R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32493) Manually install R instead of using setup-r in GitHub Actions

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167785#comment-17167785
 ] 

Apache Spark commented on SPARK-32493:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29302

> Manually install R instead of using setup-r in GitHub Actions
> -
>
> Key: SPARK-32493
> URL: https://issues.apache.org/jira/browse/SPARK-32493
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Looks like {{setup-r}} fails to install specific R version. See the debugging 
> logs at https://github.com/HyukjinKwon/spark/pull/15:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> We should maybe just manually install R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32493) Manually install R instead of using setup-r in GitHub Actions

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167787#comment-17167787
 ] 

Apache Spark commented on SPARK-32493:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29302

> Manually install R instead of using setup-r in GitHub Actions
> -
>
> Key: SPARK-32493
> URL: https://issues.apache.org/jira/browse/SPARK-32493
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Looks like {{setup-r}} fails to install specific R version. See the debugging 
> logs at https://github.com/HyukjinKwon/spark/pull/15:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> We should maybe just manually install R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

2020-07-30 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32494:
---

 Summary: Null Aware Anti Join Optimize Support Multi-Column
 Key: SPARK-32494
 URL: https://issues.apache.org/jira/browse/SPARK-32494
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


In Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
managed to optimize BroadcastNestedLoopJoin into BroadcastHashJoin within the 
single-column NAAJ scenario, by using a hash lookup instead of a loop join.

It is straightforward to implement the "NOT IN" logic for a single key, but a 
multi-column NOT IN is much more complicated because of all the null-aware 
comparisons.

Hence, we propose a new type of HashedRelation: NullAwareHashedRelation.

For NullAwareHashedRelation:
 # it will not skip any-null-column keys the way LongHashedRelation and 
UnsafeHashedRelation do;
 # while building NullAwareHashedRelation, extra keys are put into the 
relation, so that the null-aware column comparison can be done in hash-lookup 
style.

The duplication factor is 2^numKeys - 1; for example, to support NAAJ with a 
3-column join key, the build side would be expanded (2^3 - 1) times, i.e. 7X 
(see the sketch below).

For example, given an UnsafeRow key (1,2,3):

In null-aware mode it should be expanded into 7 keys, adding the extra C(3,1) 
and C(3,2) combinations; within those combinations, the record is duplicated 
with null padding as follows.

Original record

(1,2,3)

Extra records to be appended into the HashedRelation

(null, 2, 3) (1, null, 3) (1, 2, null)
(null, null, 3) (null, 2, null) (1, null, null)

With the expanded data we can extract a common pattern for both the single- and 
multi-column cases. Below, allNull refers to an UnsafeRow whose columns are all 
null.
 * buildSide is empty input => return all rows
 * allNullColumnKey exists in buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row
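
For illustration only (a minimal Scala sketch, not the proposed Spark 
implementation), the null-padded key expansion described above could look like 
this:

{code}
// Sketch of the 2^numKeys - 1 expansion: emit the key itself plus every
// variant with a proper subset of its columns replaced by null. The all-null
// variant is left out, since an all-null build-side key already rejects
// every streamed row.
def expandKey(key: Seq[Any]): Seq[Seq[Any]] = {
  val n = key.length
  (0 until (1 << n))                 // every bitmask over the key's columns
    .filter(_ != (1 << n) - 1)       // drop the all-null combination
    .map { mask =>
      key.zipWithIndex.map { case (v, i) =>
        if ((mask & (1 << i)) != 0) null else v   // null out the masked columns
      }
    }
}

// expandKey(Seq(1, 2, 3)) yields (1,2,3), (null,2,3), (1,null,3), (null,null,3),
// (1,2,null), (null,2,null), (1,null,null) -- 2^3 - 1 = 7 keys in total.
{code}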

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32492) Fulfill missing column meta information for thrift server client tools

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167795#comment-17167795
 ] 

Apache Spark commented on SPARK-32492:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29303

> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
> Attachments: wx20200730-175...@2x.png
>
>
> Most fields of a column are missing, e.g. position, column-size
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32492) Fulfill missing column meta information for thrift server client tools

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32492:


Assignee: Apache Spark

> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
> Attachments: wx20200730-175...@2x.png
>
>
> Most fields of a column are missing, e.g. position, column-size
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

2020-07-30 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32494:

Description: 
In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
BroadcastHashJoin within the single-column NAAJ scenario, by using a hash 
lookup instead of a loop join.

It is straightforward to implement the "NOT IN" logic for a single key, but a 
multi-column NOT IN is much more complicated because of all the null-aware 
comparisons.

FYI, the logic for the single- and multi-column cases is defined in

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql

Hence, we propose a new type of HashedRelation: NullAwareHashedRelation.

For NullAwareHashedRelation:
 # it will not skip any-null-column keys the way LongHashedRelation and 
UnsafeHashedRelation do;
 # while building NullAwareHashedRelation, extra keys are put into the 
relation, so that the null-aware column comparison can be done in hash-lookup 
style.

The duplication factor is 2^numKeys - 1; for example, to support NAAJ with a 
3-column join key, the build side would be expanded (2^3 - 1) times, i.e. 7X.

For example, given an UnsafeRow key (1,2,3):

In null-aware mode it should be expanded into 7 keys, adding the extra C(3,1) 
and C(3,2) combinations; within those combinations, the record is duplicated 
with null padding as follows.

Original record

(1,2,3)

Extra records to be appended into the NullAwareHashedRelation

(null, 2, 3) (1, null, 3) (1, 2, null)
(null, null, 3) (null, 2, null) (1, null, null)

With the expanded data we can extract a common pattern for both the single- and 
multi-column cases. Below, allNull refers to an UnsafeRow whose columns are all 
null.
 * buildSide is empty input => return all rows
 * allNullColumnKey exists in buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

 

  was:
In Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
managed to optimize BroadcastNestedLoopJoin into BroadcastHashJoin within the 
Single-Column NAAJ scenario, by using hash lookup instead of loop join. 

It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
multi-column not in is much more complicated with all these null aware compare.

Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 

For NullAwareHashedRelation
 # it will not skip anyNullColumn key like LongHashedRelation and 
UnsafeHashedRelation
 # while building NullAwareHashedRelation, will put extra keys into the 
relation, just to make null aware columns comparison in hash lookup style.

the duplication would be 2^numKeys - 1, for example, if we are to support NAAJ 
with 3 column join key, the buildSide would be expanded into (2^3 - 1) times, 
7X.

For example, if there is a UnsafeRow key (1,2,3)


In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), C(3,2) 
combinations, within the combinations, we duplicated these record with null 
padding as following.

Original record

(1,2,3)

Extra record to be appended into HashedRelation

(null, 2, 3) (1, null, 3) (1, 2, null)
(null, null, 3) (null, 2, null) (1, null, null))

with the expanded data we can extract a common pattern for both single and 
multi column. allNull refer to a unsafeRow which has all null columns.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

 


> Null Aware Anti Join Optimize Support Multi-Column
> --
>
> Key: SPARK-32494
> URL: https://issues.apache.org/jira/browse/SPARK-32494
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
> BroadcastHashJoin within the Single-Column NAAJ scenario, by using hash 
> lookup instead of loop join. 
> It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
> multi-column not in is much more complicated with all these null aware 
> comparison.
> FYI, code logical for single and multi column is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/c

[jira] [Assigned] (SPARK-32492) Fulfill missing column meta information for thrift server client tools

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32492:


Assignee: (was: Apache Spark)

> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
> Attachments: wx20200730-175...@2x.png
>
>
> Most fields of a column are missing, e.g. position, column-size
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32492) Fulfill missing column meta information for thrift server client tools

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167797#comment-17167797
 ] 

Apache Spark commented on SPARK-32492:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29303

> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
> Attachments: wx20200730-175...@2x.png
>
>
> Most fields of a column are missing, e.g. position, column-size
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167800#comment-17167800
 ] 

Apache Spark commented on SPARK-32494:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29304

> Null Aware Anti Join Optimize Support Multi-Column
> --
>
> Key: SPARK-32494
> URL: https://issues.apache.org/jira/browse/SPARK-32494
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
> BroadcastHashJoin within the Single-Column NAAJ scenario, by using hash 
> lookup instead of loop join. 
> It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
> multi-column not in is much more complicated with all these null aware 
> comparison.
> FYI, code logical for single and multi column is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 
> For NullAwareHashedRelation
>  # it will not skip anyNullColumn key like LongHashedRelation and 
> UnsafeHashedRelation do.
>  # while building NullAwareHashedRelation, will put extra keys into the 
> relation, just to make null aware columns comparison in hash lookup style.
> the duplication would be 2^numKeys - 1 times, for example, if we are to 
> support NAAJ with 3 column join key, the buildSide would be expanded into 
> (2^3 - 1) times, 7X.
> For example, if there is a UnsafeRow key (1,2,3)
> In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), 
> C(3,2) combinations, within the combinations, we duplicated these record with 
> null padding as following.
> Original record
> (1,2,3)
> Extra record to be appended into NullAwareHashedRelation
> (null, 2, 3) (1, null, 3) (1, 2, null)
>  (null, null, 3) (null, 2, null) (1, null, null))
> with the expanded data we can extract a common pattern for both single and 
> multi column. allNull refer to a unsafeRow which has all null columns.
>  * buildSide is empty input => return all rows
>  * allNullColumnKey Exists In buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
> => drop the row
>  * if streamedSideRow.allNull is false & notFindMatch in 
> NullAwareHashedRelation => return the row
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32494:


Assignee: Apache Spark

> Null Aware Anti Join Optimize Support Multi-Column
> --
>
> Key: SPARK-32494
> URL: https://issues.apache.org/jira/browse/SPARK-32494
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Apache Spark
>Priority: Major
>
> In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
> BroadcastHashJoin within the Single-Column NAAJ scenario, by using hash 
> lookup instead of loop join. 
> It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
> multi-column not in is much more complicated with all these null aware 
> comparison.
> FYI, code logical for single and multi column is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 
> For NullAwareHashedRelation
>  # it will not skip anyNullColumn key like LongHashedRelation and 
> UnsafeHashedRelation do.
>  # while building NullAwareHashedRelation, will put extra keys into the 
> relation, just to make null aware columns comparison in hash lookup style.
> the duplication would be 2^numKeys - 1 times, for example, if we are to 
> support NAAJ with 3 column join key, the buildSide would be expanded into 
> (2^3 - 1) times, 7X.
> For example, if there is a UnsafeRow key (1,2,3)
> In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), 
> C(3,2) combinations, within the combinations, we duplicated these record with 
> null padding as following.
> Original record
> (1,2,3)
> Extra record to be appended into NullAwareHashedRelation
> (null, 2, 3) (1, null, 3) (1, 2, null)
>  (null, null, 3) (null, 2, null) (1, null, null))
> with the expanded data we can extract a common pattern for both single and 
> multi column. allNull refer to a unsafeRow which has all null columns.
>  * buildSide is empty input => return all rows
>  * allNullColumnKey Exists In buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
> => drop the row
>  * if streamedSideRow.allNull is false & notFindMatch in 
> NullAwareHashedRelation => return the row
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32494:


Assignee: (was: Apache Spark)

> Null Aware Anti Join Optimize Support Multi-Column
> --
>
> Key: SPARK-32494
> URL: https://issues.apache.org/jira/browse/SPARK-32494
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
> BroadcastHashJoin within the Single-Column NAAJ scenario, by using hash 
> lookup instead of loop join. 
> It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
> multi-column not in is much more complicated with all these null aware 
> comparison.
> FYI, code logical for single and multi column is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 
> For NullAwareHashedRelation
>  # it will not skip anyNullColumn key like LongHashedRelation and 
> UnsafeHashedRelation do.
>  # while building NullAwareHashedRelation, will put extra keys into the 
> relation, just to make null aware columns comparison in hash lookup style.
> the duplication would be 2^numKeys - 1 times, for example, if we are to 
> support NAAJ with 3 column join key, the buildSide would be expanded into 
> (2^3 - 1) times, 7X.
> For example, if there is a UnsafeRow key (1,2,3)
> In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), 
> C(3,2) combinations, within the combinations, we duplicated these record with 
> null padding as following.
> Original record
> (1,2,3)
> Extra record to be appended into NullAwareHashedRelation
> (null, 2, 3) (1, null, 3) (1, 2, null)
>  (null, null, 3) (null, 2, null) (1, null, null))
> with the expanded data we can extract a common pattern for both single and 
> multi column. allNull refer to a unsafeRow which has all null columns.
>  * buildSide is empty input => return all rows
>  * allNullColumnKey Exists In buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
> => drop the row
>  * if streamedSideRow.allNull is false & notFindMatch in 
> NullAwareHashedRelation => return the row
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32491) Do not install SparkR in test-only mode in testing script

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32491:


Assignee: Hyukjin Kwon

> Do not install SparkR in test-only mode in testing script
> -
>
> Key: SPARK-32491
> URL: https://issues.apache.org/jira/browse/SPARK-32491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently GitHub Actions builds fail as below:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> https://github.com/apache/spark/runs/926437963
> Looks like GitHub Actions has R 3.4.4 installed by default; however, R 3.4 was 
> dropped as of SPARK-32073. When SparkR tests are not needed, GitHub Actions 
> does not install R 3.6, and this caused the test failure.
> In fact, SparkR is installed for the case of running the R linter (see 
> SPARK-8505); however, we don't run the linters in test-only mode in the 
> testing script.
> We can safely skip it in test-only mode in our testing script.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32491) Do not install SparkR in test-only mode in testing script

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32491.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29300
[https://github.com/apache/spark/pull/29300]

> Do not install SparkR in test-only mode in testing script
> -
>
> Key: SPARK-32491
> URL: https://issues.apache.org/jira/browse/SPARK-32491
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently GitHub Actions builds fail as below:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> https://github.com/apache/spark/runs/926437963
> Looks like GitHub Actions has R 3.4.4 installed by default; however, R 3.4 was 
> dropped as of SPARK-32073. When SparkR tests are not needed, GitHub Actions 
> does not install R 3.6, and this caused the test failure.
> In fact, SparkR is installed for the case of running the R linter (see 
> SPARK-8505); however, we don't run the linters in test-only mode in the 
> testing script.
> We can safely skip it in test-only mode in our testing script.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32495) Update jackson versions from 2.4.6 and so on(2.4.x)

2020-07-30 Thread SHOBHIT SHUKLA (Jira)
SHOBHIT SHUKLA created SPARK-32495:
--

 Summary: Update jackson versions from 2.4.6 and so on(2.4.x)
 Key: SPARK-32495
 URL: https://issues.apache.org/jira/browse/SPARK-32495
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.4.6
Reporter: SHOBHIT SHUKLA


FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
([https://github.com/FasterXML/jackson-databind/issues/2186]). Would it be 
possible to upgrade the Jackson version to >= 2.9.8 for spark-2.4.6 and the 
following 2.4.x releases?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32496) Include GitHub Action file as the changes in testing

2020-07-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32496:


 Summary: Include GitHub Action file as the changes in testing
 Key: SPARK-32496
 URL: https://issues.apache.org/jira/browse/SPARK-32496
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


https://github.com/apache/spark/pull/26556 excluded 
`.github/workflows/master.yml`. So no tests run if the GitHub Actions 
configuration file is changed.

As of SPARK-32245, we now run the regular tests via the testing script. We 
should include it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32493) Manually install R instead of using setup-r in GitHub Actions

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32493:


Assignee: Hyukjin Kwon

> Manually install R instead of using setup-r in GitHub Actions
> -
>
> Key: SPARK-32493
> URL: https://issues.apache.org/jira/browse/SPARK-32493
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Looks like {{setup-r}} fails to install specific R version. See the debugging 
> logs at https://github.com/HyukjinKwon/spark/pull/15:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> We should maybe just manually install R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32493) Manually install R instead of using setup-r in GitHub Actions

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32493.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29302
[https://github.com/apache/spark/pull/29302]

> Manually install R instead of using setup-r in GitHub Actions
> -
>
> Key: SPARK-32493
> URL: https://issues.apache.org/jira/browse/SPARK-32493
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Looks like {{setup-r}} fails to install specific R version. See the debugging 
> logs at https://github.com/HyukjinKwon/spark/pull/15:
> {code}
> ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5
> [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received 
> return code 1
> ##[error]Process completed with exit code 10.
> {code}
> We should maybe just manually install R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32496) Include GitHub Action file as the changes in testing

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32496:


Assignee: Apache Spark

> Include GitHub Action file as the changes in testing
> 
>
> Key: SPARK-32496
> URL: https://issues.apache.org/jira/browse/SPARK-32496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> https://github.com/apache/spark/pull/26556 excluded 
> `.github/workflows/master.yml`. So no tests run if the GitHub Actions 
> configuration file is changed.
> As of SPARK-32245, we now run the regular tests via the testing script. We 
> should include it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32496) Include GitHub Action file as the changes in testing

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32496:


Assignee: (was: Apache Spark)

> Include GitHub Action file as the changes in testing
> 
>
> Key: SPARK-32496
> URL: https://issues.apache.org/jira/browse/SPARK-32496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/pull/26556 excluded 
> `.github/workflows/master.yml`. So no tests run if the GitHub Actions 
> configuration file is changed.
> As of SPARK-32245, we now run the regular tests via the testing script. We 
> should include it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32496) Include GitHub Action file as the changes in testing

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167823#comment-17167823
 ] 

Apache Spark commented on SPARK-32496:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29305

> Include GitHub Action file as the changes in testing
> 
>
> Key: SPARK-32496
> URL: https://issues.apache.org/jira/browse/SPARK-32496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/pull/26556 excluded 
> `.github/workflows/master.yml`. So no tests run if the GitHub Actions 
> configuration file is changed.
> As of SPARK-32245, we now run the regular tests via the testing script. We 
> should include it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32496) Include GitHub Action file as the changes in testing

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167826#comment-17167826
 ] 

Apache Spark commented on SPARK-32496:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29305

> Include GitHub Action file as the changes in testing
> 
>
> Key: SPARK-32496
> URL: https://issues.apache.org/jira/browse/SPARK-32496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/pull/26556 excluded 
> `.github/workflows/master.yml`. So no tests run if the GitHub Actions 
> configuration file is changed.
> As of SPARK-32245, we now run the regular tests via the testing script. We 
> should include it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32497) Installs qpdf package for CRAN check in GitHub Actions

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32497:
-
Affects Version/s: (was: 3.0.0)
   3.1.0

> Installs qpdf package for CRAN check in GitHub Actions
> --
>
> Key: SPARK-32497
> URL: https://issues.apache.org/jira/browse/SPARK-32497
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> CRAN check fails as below:
> {code}
> ...
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> ...
> Status: 1 WARNING, 1 NOTE
> See
>   ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
> for details.
> {code}
> Looks we should install {{qpdf}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32497) Installs qpdf package for CRAN check in GitHub Actions

2020-07-30 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32497:


 Summary: Installs qpdf package for CRAN check in GitHub Actions
 Key: SPARK-32497
 URL: https://issues.apache.org/jira/browse/SPARK-32497
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


CRAN check fails as below:

{code}
...
 WARNING
‘qpdf’ is needed for checks on size reduction of PDFs
...
Status: 1 WARNING, 1 NOTE
See
  ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
for details.
{code}

Looks like we should install {{qpdf}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32497) Installs qpdf package for CRAN check in GitHub Actions

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32497:
-
Parent: SPARK-32244
Issue Type: Sub-task  (was: Bug)

> Installs qpdf package for CRAN check in GitHub Actions
> --
>
> Key: SPARK-32497
> URL: https://issues.apache.org/jira/browse/SPARK-32497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> CRAN check fails as below:
> {code}
> ...
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> ...
> Status: 1 WARNING, 1 NOTE
> See
>   ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
> for details.
> {code}
> Looks we should install {{qpdf}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32496) Include GitHub Action file as the changes in testing

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32496:


Assignee: Hyukjin Kwon

> Include GitHub Action file as the changes in testing
> 
>
> Key: SPARK-32496
> URL: https://issues.apache.org/jira/browse/SPARK-32496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/pull/26556 excluded 
> `.github/workflows/master.yml`. So no tests run if the GitHub Actions 
> configuration file is changed.
> As of SPARK-32245, we now run the regular tests via the testing script. We 
> should include it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32496) Include GitHub Action file as the changes in testing

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32496.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29305
[https://github.com/apache/spark/pull/29305]

> Include GitHub Action file as the changes in testing
> 
>
> Key: SPARK-32496
> URL: https://issues.apache.org/jira/browse/SPARK-32496
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> https://github.com/apache/spark/pull/26556 excluded 
> `.github/workflows/master.yml`. So no tests run if the GitHub Actions 
> configuration file is changed.
> As of SPARK-32245, we now run the regular tests via the testing script. We 
> should include it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32497) Installs qpdf package for CRAN check in GitHub Actions

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167868#comment-17167868
 ] 

Apache Spark commented on SPARK-32497:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29306

> Installs qpdf package for CRAN check in GitHub Actions
> --
>
> Key: SPARK-32497
> URL: https://issues.apache.org/jira/browse/SPARK-32497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> CRAN check fails as below:
> {code}
> ...
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> ...
> Status: 1 WARNING, 1 NOTE
> See
>   ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
> for details.
> {code}
> Looks we should install {{qpdf}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32497) Installs qpdf package for CRAN check in GitHub Actions

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32497:


Assignee: Apache Spark

> Installs qpdf package for CRAN check in GitHub Actions
> --
>
> Key: SPARK-32497
> URL: https://issues.apache.org/jira/browse/SPARK-32497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> CRAN check fails as below:
> {code}
> ...
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> ...
> Status: 1 WARNING, 1 NOTE
> See
>   ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
> for details.
> {code}
> Looks we should install {{qpdf}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32481) Support truncate table to move the data to trash

2020-07-30 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167869#comment-17167869
 ] 

Yang Jie commented on SPARK-32481:
--

[~Udbhav Agrawal] Maybe Trash#moveToAppropriateTrash can achieve this goal; 
FileSystem#delete always skips the trash policy.
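
As an illustration only (a hedged Scala sketch, not the actual patch; the 
helper name is hypothetical), moving a directory to the Hadoop trash before 
falling back to a hard delete could look like this:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, Trash}

// Hypothetical helper: move `dir` into the user's trash when a trash interval
// is configured (fs.trash.interval > 0); otherwise fall back to a plain delete.
def moveToTrashOrDelete(hadoopConf: Configuration, dir: Path): Boolean = {
  val fs = dir.getFileSystem(hadoopConf)
  // Trash.moveToAppropriateTrash returns true only when the path was actually
  // moved into a trash checkpoint directory.
  val movedToTrash = Trash.moveToAppropriateTrash(fs, dir, hadoopConf)
  if (movedToTrash) true else fs.delete(dir, /* recursive = */ true)
}
{code}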

> Support truncate table to move the data to trash
> 
>
> Key: SPARK-32481
> URL: https://issues.apache.org/jira/browse/SPARK-32481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> *Instead of deleting the data, move it to the trash. The data can then be 
> deleted permanently from the trash based on configuration.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32497) Installs qpdf package for CRAN check in GitHub Actions

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32497:


Assignee: (was: Apache Spark)

> Installs qpdf package for CRAN check in GitHub Actions
> --
>
> Key: SPARK-32497
> URL: https://issues.apache.org/jira/browse/SPARK-32497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> CRAN check fails as below:
> {code}
> ...
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> ...
> Status: 1 WARNING, 1 NOTE
> See
>   ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
> for details.
> {code}
> Looks we should install {{qpdf}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32497) Installs qpdf package for CRAN check in GitHub Actions

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167870#comment-17167870
 ] 

Apache Spark commented on SPARK-32497:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29306

> Installs qpdf package for CRAN check in GitHub Actions
> --
>
> Key: SPARK-32497
> URL: https://issues.apache.org/jira/browse/SPARK-32497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> CRAN check fails as below:
> {code}
> ...
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> ...
> Status: 1 WARNING, 1 NOTE
> See
>   ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
> for details.
> {code}
> Looks we should install {{qpdf}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32485) RecordBinaryComparatorSuite test failures on big-endian systems

2020-07-30 Thread Michael Munday (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167881#comment-17167881
 ] 

Michael Munday commented on SPARK-32485:


This is for Linux running on the IBM Z platform (s390x), which is a big-endian 
platform. I have a PR open to fix this; I will update it with this issue number 
shortly. Thanks.

> RecordBinaryComparatorSuite test failures on big-endian systems
> ---
>
> Key: SPARK-32485
> URL: https://issues.apache.org/jira/browse/SPARK-32485
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Michael Munday
>Priority: Minor
>  Labels: endianness
>
> The fix for SPARK-29918 broke two tests on big-endian systems:
>  * testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue
>  * testBinaryComparatorWhenSubtractionCanOverflowLongValue
> These tests date from a time when subtraction was being used to do 
> multi-byte comparisons. They try to trigger old bugs by feeding specific 
> values into the comparison. However the fix for SPARK-29918 modified the 
> order in which bytes are compared when comparing 8 bytes at a time on 
> little-endian systems (to match the normal byte-by-byte comparison). This fix 
> did not affect big-endian systems. However the expected output of the tests 
> was modified for all systems regardless of endianness. So the tests broke on 
> big-endian systems.
> It is also not clear that the values compared in the tests match the original 
> intent of the tests now that the bytes in those values are compared in order 
> (equivalent to the bytes in the values being reversed).
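
As an illustrative Scala sketch only (not the test code itself), the effect 
can be seen by comparing the same 8 bytes byte-by-byte versus as a single 
little-endian long:

{code}
import java.nio.{ByteBuffer, ByteOrder}

// Unsigned byte-by-byte comparison, the ordering RecordBinaryComparator targets.
def byteWiseCompare(x: Array[Byte], y: Array[Byte]): Int =
  x.zip(y).map { case (i, j) => Integer.compare(i & 0xff, j & 0xff) }
    .find(_ != 0).getOrElse(0)

val a = Array[Byte](0, 0, 0, 0, 0, 0, 0, 1)
val b = Array[Byte](1, 0, 0, 0, 0, 0, 0, 0)

// Byte-by-byte: a < b, because the first byte 0 is smaller than 1.
val byteResult = byteWiseCompare(a, b)                              // negative

// Read as little-endian longs (an 8-bytes-at-a-time load on x86): the first
// byte becomes the least significant one, so the ordering flips.
val la = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).getLong
val lb = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getLong
val longResult = java.lang.Long.compare(la, lb)                     // positive

// On a big-endian system the 8-byte load already matches the byte order,
// which is why the expected values in these tests differ by platform.
{code}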



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32227) Bug in load-spark-env.cmd with Spark 3.0.0

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32227:


Assignee: Ihor Bobak

> Bug in load-spark-env.cmd  with Spark 3.0.0
> ---
>
> Key: SPARK-32227
> URL: https://issues.apache.org/jira/browse/SPARK-32227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
> Environment: Windows 10
>Reporter: Ihor Bobak
>Assignee: Ihor Bobak
>Priority: Major
> Attachments: load-spark-env.cmd
>
>
> spark-env.cmd  which is located in conf  is not loaded by load-spark-env.cmd.
>  
> *How to reproduce:*
> 1) download spark 3.0.0 without hadoop and extract it
> 2) put a file conf/spark-env.cmd with the following contents (paths are 
> relative to where my hadoop is - in C:\opt\hadoop\hadoop-3.2.1, you may need 
> to change):
>  
> SET JAVA_HOME=C:\opt\Java\jdk1.8.0_241
>  SET HADOOP_HOME=C:\opt\hadoop\hadoop-3.2.1
>  SET HADOOP_CONF_DIR=C:\opt\hadoop\hadoop-3.2.1\conf
>  SET 
> SPARK_DIST_CLASSPATH=C:\opt\hadoop\hadoop-3.2.1\etc\hadoop;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\common;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\common\lib*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\common*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\hdfs;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\hdfs\lib*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\hdfs*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\yarn;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\yarn\lib*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\yarn*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\mapreduce\lib*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\mapreduce*
>  
> 3) go to the bin directory and run pyspark.   You will get an error that 
> log4j can't be found, etc. (reason: the environment was not loaded indeed, it 
> doesn't see where hadoop with all its jars is).
>  
> *How to fix:*
> just take the load-spark-env.cmd  from Spark version 2.4.3, and everything 
> will work.
> [UPDATE]:  I attached a fixed version of load-spark-env.cmd  that works fine.
>  
> *What is the difference?*
> I am not a good specialist in Windows batch, but doing a function
> :LoadSparkEnv
>  if exist "%SPARK_CONF_DIR%\spark-env.cmd" (
>   call "%SPARK_CONF_DIR%\spark-env.cmd"
>  )
> and then calling it (as it was in 2.4.3) helps.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32227) Bug in load-spark-env.cmd with Spark 3.0.0

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32227.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/29044

> Bug in load-spark-env.cmd  with Spark 3.0.0
> ---
>
> Key: SPARK-32227
> URL: https://issues.apache.org/jira/browse/SPARK-32227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
> Environment: Windows 10
>Reporter: Ihor Bobak
>Assignee: Ihor Bobak
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
> Attachments: load-spark-env.cmd
>
>
> spark-env.cmd  which is located in conf  is not loaded by load-spark-env.cmd.
>  
> *How to reproduce:*
> 1) download spark 3.0.0 without hadoop and extract it
> 2) put a file conf/spark-env.cmd with the following contents (paths are 
> relative to where my hadoop is - in C:\opt\hadoop\hadoop-3.2.1, you may need 
> to change):
>  
> SET JAVA_HOME=C:\opt\Java\jdk1.8.0_241
>  SET HADOOP_HOME=C:\opt\hadoop\hadoop-3.2.1
>  SET HADOOP_CONF_DIR=C:\opt\hadoop\hadoop-3.2.1\conf
>  SET 
> SPARK_DIST_CLASSPATH=C:\opt\hadoop\hadoop-3.2.1\etc\hadoop;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\common;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\common\lib*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\common*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\hdfs;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\hdfs\lib*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\hdfs*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\yarn;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\yarn\lib*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\yarn*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\mapreduce\lib*;C:\opt\hadoop\hadoop-3.2.1\share\hadoop\mapreduce*
>  
> 3) go to the bin directory and run pyspark.   You will get an error that 
> log4j can't be found, etc. (reason: the environment was not loaded indeed, it 
> doesn't see where hadoop with all its jars is).
>  
> *How to fix:*
> just take the load-spark-env.cmd  from Spark version 2.4.3, and everything 
> will work.
> [UPDATE]:  I attached a fixed version of load-spark-env.cmd  that works fine.
>  
> *What is the difference?*
> I am not a good specialist in Windows batch, but doing a function
> :LoadSparkEnv
>  if exist "%SPARK_CONF_DIR%\spark-env.cmd" (
>   call "%SPARK_CONF_DIR%\spark-env.cmd"
>  )
> and then calling it (as it was in 2.4.3) helps.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32498) Support grant/revoke access privileges like postgresql

2020-07-30 Thread jobit mathew (Jira)
jobit mathew created SPARK-32498:


 Summary: Support grant/revoke access privileges like postgresql
 Key: SPARK-32498
 URL: https://issues.apache.org/jira/browse/SPARK-32498
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: jobit mathew


Support grant/revoke access privileges like postgresql.

[https://www.postgresql.org/docs/9.0/sql-grant.html]

Currently Spark SQL does not support GRANT/REVOKE statements; this can only be 
done in Hive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32481) Support truncate table to move the data to trash

2020-07-30 Thread Udbhav Agrawal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167976#comment-17167976
 ] 

Udbhav Agrawal commented on SPARK-32481:


[~LuciferYang] Yes, I have the patch ready; I'll raise the MR today.

> Support truncate table to move the data to trash
> 
>
> Key: SPARK-32481
> URL: https://issues.apache.org/jira/browse/SPARK-32481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> *Instead of deleting the data, move it to the trash. The data can then be 
> deleted permanently from the trash based on configuration.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29886) Add support for Spark style HashDistribution and Partitioning to V2 Datasource

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29886:


Assignee: Apache Spark

> Add support for Spark style HashDistribution and Partitioning to V2 Datasource
> --
>
> Key: SPARK-29886
> URL: https://issues.apache.org/jira/browse/SPARK-29886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Andrew K Long
>Assignee: Apache Spark
>Priority: Major
>
> Currently a V2 datasource does not have the ability to specify that its 
> Distribution is compatible with Spark's HashClusteredDistribution. We need to 
> add the appropriate class to the interface and add support in 
> DataSourcePartitioning so that EnsureRequirements is aware of the table's 
> partitioning.
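
To make the gap concrete, here is a rough sketch (not the proposed change) of 
what a connector can express today through {{SupportsReportPartitioning}} and 
the existing {{Partitioning}}/{{Distribution}} interfaces, and what it cannot:
{code:scala}
import org.apache.spark.sql.connector.read.partitioning.{ClusteredDistribution, Distribution, Partitioning}

// Illustrative only: a Partitioning a V2 scan could return from
// SupportsReportPartitioning.outputPartitioning(). It can claim the data is
// clustered on certain columns...
class MyScanPartitioning(numParts: Int, clusterCols: Set[String]) extends Partitioning {
  override def numPartitions(): Int = numParts
  override def satisfy(distribution: Distribution): Boolean = distribution match {
    case c: ClusteredDistribution => c.clusteredColumns.forall(clusterCols.contains)
    case _ => false
  }
}
// ...but it has no way to declare that the layout matches Spark's own hash
// function, so EnsureRequirements cannot treat it as satisfying
// HashClusteredDistribution and may still insert a shuffle for joins/aggregations.
{code}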



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29886) Add support for Spark style HashDistribution and Partitioning to V2 Datasource

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-29886:


Assignee: (was: Apache Spark)

> Add support for Spark style HashDistribution and Partitioning to V2 Datasource
> --
>
> Key: SPARK-29886
> URL: https://issues.apache.org/jira/browse/SPARK-29886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Andrew K Long
>Priority: Major
>
> Currently a V2 datasource does not have the ability to specify that its 
> Distribution is compatible with Spark's HashClusteredDistribution. We need to 
> add the appropriate class to the interface and add support in 
> DataSourcePartitioning so that EnsureRequirements is aware of the table's 
> partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32499) Use {} for structs and maps in show()

2020-07-30 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32499:
--

 Summary: Use {} for structs and maps in show()
 Key: SPARK-32499
 URL: https://issues.apache.org/jira/browse/SPARK-32499
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Currently, show() wraps arrays, maps and structs in []. Maps and structs should 
be wrapped in {}:
- To be consistent with ToHiveResult
- To distinguish maps/structs from arrays (see the sketch below)
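
For illustration, a rough sketch of the current behaviour in spark-shell (output 
reproduced from memory, so widths are indicative only); the proposal would print 
the struct as {1, a} and the map as {k -> 1} instead:
{code:scala}
scala> Seq(((1, "a"), Map("k" -> 1), Seq(1, 2))).toDF("struct", "map", "array").show()
+------+--------+------+
|struct|     map| array|
+------+--------+------+
|[1, a]|[k -> 1]|[1, 2]|
+------+--------+------+
{code}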



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168003#comment-17168003
 ] 

Apache Spark commented on SPARK-32083:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29307

> Unnecessary tasks are launched when input is empty with AQE
> ---
>
> Key: SPARK-32083
> URL: https://issues.apache.org/jira/browse/SPARK-32083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> [https://github.com/apache/spark/pull/28226] meant to avoid launching 
> unnecessary tasks for 0-size partitions when AQE is enabled. However, when 
> all partitions are empty, the number of partitions will be 
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) 
> unnecessary tasks are launched in this case.
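
A rough repro sketch (not taken from the ticket) of the situation described 
above, run in spark-shell:
{code:scala}
// With AQE enabled and a large initial partition number, an aggregation over an
// empty input still goes through a shuffle; per this report, the post-shuffle
// stage can launch initialPartitionNum tasks even though every partition is empty.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")

spark.range(0).selectExpr("id % 10 AS k").groupBy("k").count().collect()
{code}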



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeacgBatch in Pyspark

2020-07-30 Thread Abhishek Dixit (Jira)
Abhishek Dixit created SPARK-32500:
--

 Summary: Query and Batch Id not set for Structured Streaming Jobs 
in case of ForeacgBatch in Pyspark
 Key: SPARK-32500
 URL: https://issues.apache.org/jira/browse/SPARK-32500
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Structured Streaming
Affects Versions: 2.4.6
Reporter: Abhishek Dixit


Query Id and Batch Id information is not available for jobs started by 
structured streaming query when _foreachBatch_ API is used in PySpark.

This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
fine, and also other structured streaming sinks in pyspark work fine. I am 
attaching a screenshot of jobs pages.

 

!image-2020-07-30-21-03-22-094.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeacgBatch in Pyspark

2020-07-30 Thread Abhishek Dixit (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit updated SPARK-32500:
---
Attachment: Screen Shot 2020-07-30 at 9.04.21 PM.png

> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeacgBatch in Pyspark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png
>
>
> Query Id and Batch Id information is not available for jobs started by 
> structured streaming query when _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
> fine, and also other structured streaming sinks in pyspark work fine. I am 
> attaching a screenshot of jobs pages.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeacgBatch in Pyspark

2020-07-30 Thread Abhishek Dixit (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit updated SPARK-32500:
---
Description: 
Query Id and Batch Id information is not available for jobs started by 
structured streaming query when _foreachBatch_ API is used in PySpark.

This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
fine, and also other structured streaming sinks in pyspark work fine. I am 
attaching a screenshot of jobs pages.

 

  was:
Query Id and Batch Id information is not available for jobs started by 
structured streaming query when _foreachBatch_ API is used in PySpark.

This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
fine, and also other structured streaming sinks in pyspark work fine. I am 
attaching a screenshot of jobs pages.

 

!image-2020-07-30-21-03-22-094.png!


> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeacgBatch in Pyspark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png
>
>
> Query Id and Batch Id information is not available for jobs started by 
> structured streaming query when _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
> fine, and also other structured streaming sinks in pyspark work fine. I am 
> attaching a screenshot of jobs pages.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32499) Use {} for structs and maps in show()

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32499:


Assignee: (was: Apache Spark)

> Use {} for structs and maps in show()
> -
>
> Key: SPARK-32499
> URL: https://issues.apache.org/jira/browse/SPARK-32499
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, show() wraps arrays, maps and structs in []. Maps and structs 
> should be wrapped in {}:
> - To be consistent with ToHiveResult
> - To distinguish maps/structs from arrays 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32499) Use {} for structs and maps in show()

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32499:


Assignee: Apache Spark

> Use {} for structs and maps in show()
> -
>
> Key: SPARK-32499
> URL: https://issues.apache.org/jira/browse/SPARK-32499
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, show() wraps arrays, maps and structs in []. Maps and structs 
> should be wrapped in {}:
> - To be consistent with ToHiveResult
> - To distinguish maps/structs from arrays 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32499) Use {} for structs and maps in show()

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168007#comment-17168007
 ] 

Apache Spark commented on SPARK-32499:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29308

> Use {} for structs and maps in show()
> -
>
> Key: SPARK-32499
> URL: https://issues.apache.org/jira/browse/SPARK-32499
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, show() wraps arrays, maps and structs in []. Maps and structs 
> should be wrapped in {}:
> - To be consistent with ToHiveResult
> - To distinguish maps/structs from arrays 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeacgBatch in Pyspark

2020-07-30 Thread Abhishek Dixit (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit updated SPARK-32500:
---
Description: 
Query Id and Batch Id information is not available for jobs started by 
structured streaming query when _foreachBatch_ API is used in PySpark.

This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
fine, and also other structured streaming sinks in pyspark work fine. I am 
attaching a screenshot of jobs pages.

I think job group is not set properly when _foreachBatch_ is used via pyspark. 
I have a framework which depends on the queryId and batchId information 
available in the job properties.

 

  was:
Query Id and Batch Id information is not available for jobs started by 
structured streaming query when _foreachBatch_ API is used in PySpark.

This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
fine, and also other structured streaming sinks in pyspark work fine. I am 
attaching a screenshot of jobs pages.

 


> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeacgBatch in Pyspark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png
>
>
> Query Id and Batch Id information is not available for jobs started by 
> structured streaming query when _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
> fine, and also other structured streaming sinks in pyspark work fine. I am 
> attaching a screenshot of jobs pages.
> I think job group is not set properly when _foreachBatch_ is used via 
> pyspark. I have a framework which depends on the queryId and batchId 
> information available in the job properties.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark

2020-07-30 Thread Abhishek Dixit (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit updated SPARK-32500:
---
Summary: Query and Batch Id not set for Structured Streaming Jobs in case 
of ForeachBatch in PySpark  (was: Query and Batch Id not set for Structured 
Streaming Jobs in case of ForeacgBatch in Pyspark)

> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeachBatch in PySpark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png
>
>
> Query Id and Batch Id information is not available for jobs started by 
> structured streaming query when _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
> fine, and also other structured streaming sinks in pyspark work fine. I am 
> attaching a screenshot of jobs pages.
> I think job group is not set properly when _foreachBatch_ is used via 
> pyspark. I have a framework that depends on the _queryId_ and _batchId_ 
> information available in the job properties and so my framework doesn't work 
> for pyspark-foreachBatch use case.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeacgBatch in Pyspark

2020-07-30 Thread Abhishek Dixit (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit updated SPARK-32500:
---
Description: 
Query Id and Batch Id information is not available for jobs started by 
structured streaming query when _foreachBatch_ API is used in PySpark.

This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
fine, and also other structured streaming sinks in pyspark work fine. I am 
attaching a screenshot of jobs pages.

I think job group is not set properly when _foreachBatch_ is used via pyspark. 
I have a framework that depends on the _queryId_ and _batchId_ information 
available in the job properties and so my framework doesn't work for 
pyspark-foreachBatch use case.

 

  was:
Query Id and Batch Id information is not available for jobs started by 
structured streaming query when _foreachBatch_ API is used in PySpark.

This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
fine, and also other structured streaming sinks in pyspark work fine. I am 
attaching a screenshot of jobs pages.

I think job group is not set properly when _foreachBatch_ is used via pyspark. 
I have a framework which depends on the queryId and batchId information 
available in the job properties.

 


> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeacgBatch in Pyspark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png
>
>
> Query Id and Batch Id information is not available for jobs started by 
> structured streaming query when _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
> fine, and also other structured streaming sinks in pyspark work fine. I am 
> attaching a screenshot of jobs pages.
> I think job group is not set properly when _foreachBatch_ is used via 
> pyspark. I have a framework that depends on the _queryId_ and _batchId_ 
> information available in the job properties and so my framework doesn't work 
> for pyspark-foreachBatch use case.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32497) Installs qpdf package for CRAN check in GitHub Actions

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32497:


Assignee: Hyukjin Kwon

> Installs qpdf package for CRAN check in GitHub Actions
> --
>
> Key: SPARK-32497
> URL: https://issues.apache.org/jira/browse/SPARK-32497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> CRAN check fails as below:
> {code}
> ...
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> ...
> Status: 1 WARNING, 1 NOTE
> See
>   ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
> for details.
> {code}
> Looks like we should install {{qpdf}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32497) Installs qpdf package for CRAN check in GitHub Actions

2020-07-30 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32497.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29306
[https://github.com/apache/spark/pull/29306]

> Installs qpdf package for CRAN check in GitHub Actions
> --
>
> Key: SPARK-32497
> URL: https://issues.apache.org/jira/browse/SPARK-32497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> CRAN check fails as below:
> {code}
> ...
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
> ...
> Status: 1 WARNING, 1 NOTE
> See
>   ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’
> for details.
> {code}
> Looks like we should install {{qpdf}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29886) Add support for Spark style HashDistribution and Partitioning to V2 Datasource

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168030#comment-17168030
 ] 

Apache Spark commented on SPARK-29886:
--

User 'rahij' has created a pull request for this issue:
https://github.com/apache/spark/pull/29309

> Add support for Spark style HashDistribution and Partitioning to V2 Datasource
> --
>
> Key: SPARK-29886
> URL: https://issues.apache.org/jira/browse/SPARK-29886
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Andrew K Long
>Priority: Major
>
> Currently a V2 datasource does not have the ability to specify that its 
> Distribution is compatible with Spark's HashClusteredDistribution. We need to 
> add the appropriate class to the interface and add support in 
> DataSourcePartitioning so that EnsureRequirements is aware of the table's 
> partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32332) AQE doesn't adequately allow for Columnar Processing extension

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168031#comment-17168031
 ] 

Apache Spark commented on SPARK-32332:
--

User 'andygrove' has created a pull request for this issue:
https://github.com/apache/spark/pull/29310

> AQE doesn't adequately allow for Columnar Processing extension 
> ---
>
> Key: SPARK-32332
> URL: https://issues.apache.org/jira/browse/SPARK-32332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>
> In SPARK-27396 we added support for extended Columnar Processing. We did the 
> initial work based on what we thought was sufficient, but adaptive query 
> execution was being developed at the same time.
> We have discovered that the changes made to AQE are not sufficient for users 
> to properly extend it for columnar processing, because AQE hardcodes checks 
> for specific classes/execs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark

2020-07-30 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168115#comment-17168115
 ] 

Rohit Mishra commented on SPARK-32500:
--

[~abhishekd0907], it would be helpful if you could add environment details and 
reproducible steps (if needed). Can you please add these?

> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeachBatch in PySpark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-30 at 9.04.21 PM.png
>
>
> Query Id and Batch Id information is not available for jobs started by 
> structured streaming query when _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
> fine, and also other structured streaming sinks in pyspark work fine. I am 
> attaching a screenshot of jobs pages.
> I think job group is not set properly when _foreachBatch_ is used via 
> pyspark. I have a framework that depends on the _queryId_ and _batchId_ 
> information available in the job properties and so my framework doesn't work 
> for pyspark-foreachBatch use case.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32489) Pass all `core` module UTs in Scala 2.13

2020-07-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32489.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29298
[https://github.com/apache/spark/pull/29298]

> Pass all `core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32489
> URL: https://issues.apache.org/jira/browse/SPARK-32489
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> {code}
> $ dev/change-scala-version.sh 2.13
> $ build/mvn test -pl core --am -Pscala-2.13
> ...
> Tests: succeeded 2612, failed 3, canceled 1, ignored 8, pending 0
> *** 3 TESTS FAILED ***
> {code}
> *AFTER*
> {code}
> Tests: succeeded 2615, failed 0, canceled 1, ignored 8, pending 0
> All tests passed.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32489) Pass all `core` module UTs in Scala 2.13

2020-07-30 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32489:
-

Assignee: Dongjoon Hyun

> Pass all `core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32489
> URL: https://issues.apache.org/jira/browse/SPARK-32489
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> {code}
> $ dev/change-scala-version.sh 2.13
> $ build/mvn test -pl core --am -Pscala-2.13
> ...
> Tests: succeeded 2612, failed 3, canceled 1, ignored 8, pending 0
> *** 3 TESTS FAILED ***
> {code}
> *AFTER*
> {code}
> Tests: succeeded 2615, failed 0, canceled 1, ignored 8, pending 0
> All tests passed.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32501) Inconsistent NULL conversions to strings

2020-07-30 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32501:
--

 Summary: Inconsistent NULL conversions to strings 
 Key: SPARK-32501
 URL: https://issues.apache.org/jira/browse/SPARK-32501
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


1. It is impossible to distinguish empty string and null, for instance:
{code:scala}
scala> Seq(Seq(""), Seq(null)).toDF().show
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
{code}
2. Inconsistent NULL conversions for top-level values and nested columns, for 
instance:
{code:scala}
scala> sql("select named_struct('c', null), null").show
+---------------------+----+
|named_struct(c, NULL)|NULL|
+---------------------+----+
|                   []|null|
+---------------------+----+
{code}
3. `.show()` is different from conversions to Hive strings, and as a 
consequence its output is different from `spark-sql` (sql tests):
{code:sql}
spark-sql> select named_struct('c', null) as struct;
{"c":null}
{code}
{code:scala}
scala> sql("select named_struct('c', null) as struct").show
+------+
|struct|
+------+
|    []|
+------+
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32501) Inconsistent NULL conversions to strings

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32501:


Assignee: (was: Apache Spark)

> Inconsistent NULL conversions to strings 
> -
>
> Key: SPARK-32501
> URL: https://issues.apache.org/jira/browse/SPARK-32501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. It is impossible to distinguish empty string and null, for instance:
> {code:scala}
> scala> Seq(Seq(""), Seq(null)).toDF().show
> +-----+
> |value|
> +-----+
> |   []|
> |   []|
> +-----+
> {code}
> 2. Inconsistent NULL conversions for top-level values and nested columns, for 
> instance:
> {code:scala}
> scala> sql("select named_struct('c', null), null").show
> +---------------------+----+
> |named_struct(c, NULL)|NULL|
> +---------------------+----+
> |                   []|null|
> +---------------------+----+
> {code}
> 3. `.show()` is different from conversions to Hive strings, and as a 
> consequence its output is different from `spark-sql` (sql tests):
> {code:sql}
> spark-sql> select named_struct('c', null) as struct;
> {"c":null}
> {code}
> {code:scala}
> scala> sql("select named_struct('c', null) as struct").show
> +------+
> |struct|
> +------+
> |    []|
> +------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32501) Inconsistent NULL conversions to strings

2020-07-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168129#comment-17168129
 ] 

Apache Spark commented on SPARK-32501:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29311

> Inconsistent NULL conversions to strings 
> -
>
> Key: SPARK-32501
> URL: https://issues.apache.org/jira/browse/SPARK-32501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. It is impossible to distinguish empty string and null, for instance:
> {code:scala}
> scala> Seq(Seq(""), Seq(null)).toDF().show
> +-----+
> |value|
> +-----+
> |   []|
> |   []|
> +-----+
> {code}
> 2. Inconsistent NULL conversions for top-level values and nested columns, for 
> instance:
> {code:scala}
> scala> sql("select named_struct('c', null), null").show
> +---------------------+----+
> |named_struct(c, NULL)|NULL|
> +---------------------+----+
> |                   []|null|
> +---------------------+----+
> {code}
> 3. `.show()` is different from conversions to Hive strings, and as a 
> consequence its output is different from `spark-sql` (sql tests):
> {code:sql}
> spark-sql> select named_struct('c', null) as struct;
> {"c":null}
> {code}
> {code:scala}
> scala> sql("select named_struct('c', null) as struct").show
> +------+
> |struct|
> +------+
> |    []|
> +------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32501) Inconsistent NULL conversions to strings

2020-07-30 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32501:


Assignee: Apache Spark

> Inconsistent NULL conversions to strings 
> -
>
> Key: SPARK-32501
> URL: https://issues.apache.org/jira/browse/SPARK-32501
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> 1. It is impossible to distinguish empty string and null, for instance:
> {code:scala}
> scala> Seq(Seq(""), Seq(null)).toDF().show
> +-----+
> |value|
> +-----+
> |   []|
> |   []|
> +-----+
> {code}
> 2. Inconsistent NULL conversions for top-level values and nested columns, for 
> instance:
> {code:scala}
> scala> sql("select named_struct('c', null), null").show
> +---------------------+----+
> |named_struct(c, NULL)|NULL|
> +---------------------+----+
> |                   []|null|
> +---------------------+----+
> {code}
> 3. `.show()` is different from conversions to Hive strings, and as a 
> consequence its output is different from `spark-sql` (sql tests):
> {code:sql}
> spark-sql> select named_struct('c', null) as struct;
> {"c":null}
> {code}
> {code:scala}
> scala> sql("select named_struct('c', null) as struct").show
> +------+
> |struct|
> +------+
> |    []|
> +------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27589) Spark file source V2

2020-07-30 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168133#comment-17168133
 ] 

Thomas Graves commented on SPARK-27589:
---

Hey, I see most of the sources are still in the useV1SourceList. What is left to 
make V2 on by default? Is it just the remaining JIRAs here, or other things?
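
For reference, a minimal sketch of flipping the config mentioned above; the 
config name is taken from the comment, while the rest (app name, master, empty 
value) is illustrative only:
{code:scala}
import org.apache.spark.sql.SparkSession

// Formats listed in spark.sql.sources.useV1SourceList stay on the V1 file source
// path; an empty list opts the built-in file formats into the V2 implementations.
val spark = SparkSession.builder()
  .appName("v2-file-sources")
  .master("local[*]")
  .config("spark.sql.sources.useV1SourceList", "")
  .getOrCreate()
{code}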

> Spark file source V2
> 
>
> Key: SPARK-27589
> URL: https://issues.apache.org/jira/browse/SPARK-27589
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Re-implement file sources with data source V2 API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


