[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-32474:
--------------------------------
    Target Version/s:   (was: 3.0.0)

> NullAwareAntiJoin multi-column support
> --------------------------------------
>
>                 Key: SPARK-32474
>                 URL: https://issues.apache.org/jira/browse/SPARK-32474
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Minor
>             Fix For: 3.1.0
>
> This is a follow-up improvement to SPARK-32290.
> In SPARK-32290 we optimized NAAJ from BroadcastNestedLoopJoin to
> BroadcastHashJoin, improving the total computation from O(M*N) to O(M),
> but only for the single-column case, because multi-column support is
> much more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>
> FYI, the logic for the single- and multi-column cases is exercised by
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>
> To support multiple columns, I propose the following idea to see whether
> multi-column support is worth the trade-off. I would need to do some data
> expansion in HashedRelation, and I would call this new type of
> HashedRelation a NullAwareHashedRelation.
>
> In a NullAwareHashedRelation, keys with null columns are allowed, unlike
> in LongHashedRelation and UnsafeHashedRelation, and a single key may be
> expanded into 2^N - 1 records (N is the number of key columns). For
> example, when the record (1, 2, 3) is about to be inserted into a
> NullAwareHashedRelation, we take every C(1,3) and C(2,3) combination of
> column positions, copy the original key row, set null at the chosen
> positions, and insert the copy. Including the original key row, 7 key
> rows are inserted, as follows:
>
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
>
> With the expanded data we can extract a common pattern for both the
> single- and multi-column cases. allNull refers to an UnsafeRow whose
> columns are all null.
> * buildSide input is empty => return all rows
> * a key with all null columns exists in the buildSide input => reject all rows
> * streamedSideRow.allNull is true => drop the row
> * streamedSideRow.allNull is false and a match is found in the NullAwareHashedRelation => drop the row
> * streamedSideRow.allNull is false and no match is found in the NullAwareHashedRelation => return the row
>
> This solution expands the buildSide data by up to 2^N - 1 times, but
> since NAAJ keys in typical production queries have only 2-3 columns, I
> assume it is acceptable to expand the buildSide data to around 7x. I
> would also cap the number of supported key columns for NAAJ at 3.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
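[Editorial note: the following is not part of the ticket and the names in it are hypothetical.] The 2^N - 1 key expansion proposed above can be sketched as a bitmask enumeration, modelling a key row as a Seq[Option[A]]:

```scala
object NullAwareKeyExpansion {
  // Return every variant of `key` obtained by nulling out a subset of its
  // columns, excluding the all-null variant: 2^N - 1 rows in total,
  // including the original key row itself (mask 0).
  def expand[A](key: Seq[Option[A]]): Seq[Seq[Option[A]]] = {
    val n = key.length
    // Bit i of `mask` selects column i to be overwritten with null;
    // the all-ones mask (every column null) is deliberately skipped.
    (0 until (1 << n) - 1).map { mask =>
      key.zipWithIndex.map { case (v, i) =>
        if ((mask & (1 << i)) != 0) None else v
      }
    }
  }
}
```

For the example key (1, 2, 3), `expand(Seq(Some(1), Some(2), Some(3)))` yields exactly the 7 rows listed above.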
[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-32474:
--------------------------------
    Fix Version/s:   (was: 3.1.0)
[jira] [Resolved] (SPARK-32473) Use === instead IndexSeqView
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-32473.
-----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 29280
[https://github.com/apache/spark/pull/29280]

> Use === instead IndexSeqView
> ----------------------------
>
>                 Key: SPARK-32473
>                 URL: https://issues.apache.org/jira/browse/SPARK-32473
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, Tests
>    Affects Versions: 3.1.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>             Fix For: 3.1.0
>
> This issue aims to fix `SorterSuite` and `RadixSortSuite` on Scala 2.13 by
> using `sameElements` instead of `IndexSeqView.==`.
> Scala 2.13 reimplements `IndexSeqView`, and its behavior is different:
> - https://docs.scala-lang.org/overviews/core/collections-migration-213.html
> {code}
> Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
> Type in expressions for evaluation. Or try :help.
> scala> Seq(1,2,3).toArray.view == Seq(1,2,3).toArray.view
> res0: Boolean = true
> {code}
> {code}
> Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
> Type in expressions for evaluation. Or try :help.
> scala> Seq(1,2,3).toArray.view == Seq(1,2,3).toArray.view
> val res0: Boolean = false
> {code}
> {code}
> $ dev/change-scala-version.sh 2.13
> $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.util.collection.unsafe.sort.RadixSortSuite
> ...
> Tests: succeeded 9, failed 36, canceled 0, ignored 0, pending 0
> *** 36 TESTS FAILED ***
> $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.util.collection.SorterSuite
> ...
> Tests: succeeded 3, failed 1, canceled 0, ignored 2, pending 0
> *** 1 TEST FAILED ***
> {code}
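[Editorial note: a minimal self-contained reproduction of the behavior change quoted above, in the fix direction the ticket describes. The object name is made up; the assertion only relies on `sameElements`, which compares contents on both Scala 2.12 and 2.13.]

```scala
object ViewEqualityDemo {
  def main(args: Array[String]): Unit = {
    val a = Seq(1, 2, 3).toArray.view
    val b = Seq(1, 2, 3).toArray.view
    // On Scala 2.12 `a == b` is true; on 2.13 views no longer define
    // structural equality, so `a == b` is false there. Comparing the
    // underlying elements with sameElements holds on both versions:
    assert(a.sameElements(b))
    println(a.sameElements(b))
  }
}
```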
[jira] [Created] (SPARK-32475) java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
jinhai created SPARK-32475:
---------------------------

             Summary: java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
                 Key: SPARK-32475
                 URL: https://issues.apache.org/jira/browse/SPARK-32475
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.0
         Environment: Spark-3.0.0; JDK 8
            Reporter: jinhai

When I compile the spark-core_2.12 module with the command below and then use the resulting spark-core_2.12-3.0.0.jar in place of /jars/spark-core_2.12-3.0.0.jar, this error is reported (without any code changes having been made).

command: ./build/mvn -pl :spark-core_2.12 -Pyarn -Phive -Phive-thriftserver -Dhadoop.version=2.7.4 -DskipTests clean package
version: spark-3.0.0
jdk: 1.8
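[Editorial note: the ticket does not identify the cause; one common source of exactly this NoSuchMethodError, offered here as a hedged possibility, is compiling against JDK 9+ (where ByteBuffer.flip() gained a covariant ByteBuffer return type) and then running on JDK 8, whose class only defines Buffer.flip(). A defensive sketch that pins the call to the JDK 8-compatible signature:]

```scala
import java.nio.{Buffer, ByteBuffer}

object FlipCompat {
  def main(args: Array[String]): Unit = {
    val buf = ByteBuffer.allocate(8)
    buf.putInt(42)
    // Upcasting to Buffer makes the compiler emit a call to
    // Buffer.flip(), which exists on every JDK version, instead of
    // the JDK 9+ override ByteBuffer.flip()Ljava/nio/ByteBuffer;.
    (buf: Buffer).flip()
    println(buf.getInt()) // reads back 42 after the flip
  }
}
```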
[jira] [Updated] (SPARK-32473) Use === instead IndexSeqView
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Summary: Use === instead IndexSeqView  (was: Use === instead IndexSeqView.==)
[jira] [Assigned] (SPARK-32473) Use === instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-32473:
-------------------------------------
    Assignee: Dongjoon Hyun
[jira] [Updated] (SPARK-32473) Use === instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Summary: Use === instead IndexSeqView.==  (was: Use sameElements instead IndexSeqView.==)
[jira] [Updated] (SPARK-32473) Use sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Description: (text as quoted in the resolution email for SPARK-32473 above)
[jira] [Issue Comment Deleted] (SPARK-32473) Use sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Comment: was deleted

(was: User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29280)
[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-32474:
--------------------------------
    Description: (updated to replace the two embedded screenshots of the multi- and single-column NAAJ logic with links to the corresponding SQL test files; the resulting text is as quoted in the first SPARK-32474 email above)
[jira] [Updated] (SPARK-32473) Use sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Summary: Use sameElements instead IndexSeqView.==  (was: Fix RadixSortSuite by using sameElements instead IndexSeqView.==)
[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-32474:
--------------------------------
    Description: (updated the embedded screenshot references; text otherwise as quoted in the first SPARK-32474 email above)
[jira] [Created] (SPARK-32474) NullAwareAntiJoin multi-column support
Leanken.Lin created SPARK-32474: --- Summary: NullAwareAntiJoin multi-column support Key: SPARK-32474 URL: https://issues.apache.org/jira/browse/SPARK-32474 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin Fix For: 3.1.0 This is a follow up improvement of Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290]. In SPARK-32290, we already optimize NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improve total calculation from O(M*N) to O(M), but it's only targeting on Single Column Case, because it's much more complicate in multi column support. See. http://www.vldb.org/pvldb/vol2/vldb09-423.pdf Section 6 NAAJ multi column logical !image-2020-07-29-11-41-11-939.png! NAAJ single column logical !image-2020-07-29-11-41-03-757.png! For supporting multi column, I throw the following idea and see if is it worth to do multi-column support with some trade off. I would need to do some data expansion in HashedRelation, and i would call this new type of HashedRelation as NullAwareHashedRelation. In NullAwareHashedRelation, key with null column is allowed, which is opposite in LongHashedRelation and UnsafeHashedRelation; And single key might be expanded into 2^N - 1 records, (N refer to columnNum of the key). for example, if there is a record (1 ,2, 3) is about to insert into NullAwareHashedRelation, we take C(1,3), C(2,3) as a combination to copy origin key row, and setNull at target position, and then insert into NullAwareHashedRelation. including the origin key row, there will be 7 key row inserted as follow. (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) (1, 2, 3) with the expanded data we can extract a common pattern for both single and multi column. allNull refer to a unsafeRow which has all null columns. 
* buildSide input is empty => return all rows
* allNull column key exists in the buildSide input => reject all rows
* if streamedSideRow.allNull is true => drop the row
* if streamedSideRow.allNull is false & a match is found in NullAwareHashedRelation => drop the row
* if streamedSideRow.allNull is false & no match is found in NullAwareHashedRelation => return the row
This solution will certainly expand the buildSide data by up to 2^N - 1 times, but since NAAJ normally involves only 2~3 key columns in production queries, I suppose it is acceptable to expand the buildSide data to around 7X. I would also put a limit on the maximum number of key columns supported for NAAJ, basically no more than 3.
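To make the proposal concrete, the key expansion and the five lookup rules above can be sketched in Python. This is an illustration only: the actual implementation would operate on UnsafeRow inside a Scala HashedRelation, and the function names here are hypothetical.

```python
from itertools import combinations

def expand_key(key):
    """Expand one build-side key into the 2^N - 1 rows a
    NullAwareHashedRelation would store: the original row plus every copy
    with 1..N-1 of its positions set to None (the all-null row is excluded;
    an all-null build key is handled by the 'reject all rows' rule)."""
    n = len(key)
    rows = [tuple(key)]
    for k in range(1, n):                      # C(1,N) .. C(N-1,N) combinations
        for positions in combinations(range(n), k):
            rows.append(tuple(None if i in positions else v
                              for i, v in enumerate(key)))
    return rows

def naaj_returns_row(stream_key, relation, build_side_empty, has_all_null_key):
    """Apply the five rules above to decide whether a streamed-side row
    survives the null-aware anti join; `relation` is the expanded key set."""
    if build_side_empty:
        return True                            # empty build side => return all rows
    if has_all_null_key:
        return False                           # all-null key in build side => reject all
    if all(v is None for v in stream_key):
        return False                           # streamed row all-null => drop
    return tuple(stream_key) not in relation   # drop on match, return on no match

relation = set(expand_key((1, 2, 3)))
print(len(relation))  # 2^3 - 1 = 7
```

Note how a streamed row such as (1, null, 3) hits the pre-expanded key (1, null, 3) with a single O(1) hash lookup, which is what lets the multi-column case keep the O(M) shape of the single-column optimization.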
[jira] [Assigned] (SPARK-32283) Multiple Kryo registrators can't be used anymore
[ https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32283: --- Assignee: Lantao Jin
> Multiple Kryo registrators can't be used anymore
>
> Key: SPARK-32283
> URL: https://issues.apache.org/jira/browse/SPARK-32283
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Lorenz Bühmann
> Assignee: Lantao Jin
> Priority: Minor
>
> This is a regression in Spark 3.0, as it works in Spark 2.
> According to the docs, it should be possible to register multiple Kryo registrators via the Spark config option spark.kryo.registrator.
> In Spark 3.0 the code to parse Kryo config options has been refactored into the Scala class [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala].
> The code to parse the registrators is in [Lines 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32]:
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
>   .version("0.5.0")
>   .stringConf
>   .createOptional
> {code}
> but it should be
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
>   .version("0.5.0")
>   .stringConf
>   .toSequence
>   .createOptional
> {code}
> to split the comma-separated list.
> In Spark 2.x this was done directly in [KryoSerializer Lines 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79]:
> {code:scala}
> private val userRegistrators = conf.get("spark.kryo.registrator", "")
>   .split(',').map(_.trim)
>   .filter(!_.isEmpty)
> {code}
> Hope this helps.
[jira] [Resolved] (SPARK-32283) Multiple Kryo registrators can't be used anymore
[ https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32283. - Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 29123 [https://github.com/apache/spark/pull/29123]
> Multiple Kryo registrators can't be used anymore
>
> Key: SPARK-32283
> URL: https://issues.apache.org/jira/browse/SPARK-32283
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Lorenz Bühmann
> Assignee: Lantao Jin
> Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
> This is a regression in Spark 3.0, as it works in Spark 2.
> According to the docs, it should be possible to register multiple Kryo registrators via the Spark config option spark.kryo.registrator.
> In Spark 3.0 the code to parse Kryo config options has been refactored into the Scala class [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala].
> The code to parse the registrators is in [Lines 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32]:
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
>   .version("0.5.0")
>   .stringConf
>   .createOptional
> {code}
> but it should be
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
>   .version("0.5.0")
>   .stringConf
>   .toSequence
>   .createOptional
> {code}
> to split the comma-separated list.
> In Spark 2.x this was done directly in [KryoSerializer Lines 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79]:
> {code:scala}
> private val userRegistrators = conf.get("spark.kryo.registrator", "")
>   .split(',').map(_.trim)
>   .filter(!_.isEmpty)
> {code}
> Hope this helps.
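For illustration, the Spark 2.x parsing behavior quoted above (split the value on commas, trim whitespace, drop empty entries) can be sketched in Python:

```python
def parse_registrators(conf_value):
    """Mimic the Spark 2.x parsing of spark.kryo.registrator:
    split the comma-separated value, trim whitespace, drop empty entries."""
    return [name.strip() for name in conf_value.split(",") if name.strip()]

print(parse_registrators("com.a.RegA, com.b.RegB,"))  # ['com.a.RegA', 'com.b.RegB']
```

This is what `.toSequence` restores in Spark 3: without it, the whole comma-separated string is treated as a single registrator class name, and loading it fails.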
[jira] [Resolved] (SPARK-32401) Migrate function related commands to new framework
[ https://issues.apache.org/jira/browse/SPARK-32401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32401. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29198 [https://github.com/apache/spark/pull/29198] > Migrate function related commands to new framework > -- > > Key: SPARK-32401 > URL: https://issues.apache.org/jira/browse/SPARK-32401 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.1.0 > > > Migrate the following function related commands to the new resolution > framework: > * CREATE FUNCTION > * DESCRIBE FUNCTION > * DROP FUNCTION > * SHOW FUNCTIONS -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32401) Migrate function related commands to new framework
[ https://issues.apache.org/jira/browse/SPARK-32401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32401: --- Assignee: Terry Kim > Migrate function related commands to new framework > -- > > Key: SPARK-32401 > URL: https://issues.apache.org/jira/browse/SPARK-32401 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > Migrate the following function related commands to the new resolution > framework: > * CREATE FUNCTION > * DESCRIBE FUNCTION > * DROP FUNCTION > * SHOW FUNCTIONS -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166849#comment-17166849 ] Apache Spark commented on SPARK-32473: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/29280 > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166847#comment-17166847 ] Apache Spark commented on SPARK-32473: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/29280 > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32473: Assignee: (was: Apache Spark) > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32473: Assignee: Apache Spark > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32473: -- Summary: Fix RadixSortSuite by using sameElements instead IndexSeqView.== (was: Use sameElements instead IndexSeqView.==) > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32473) Use sameElements instead IndexSeqView.==
Dongjoon Hyun created SPARK-32473: - Summary: Use sameElements instead IndexSeqView.== Key: SPARK-32473 URL: https://issues.apache.org/jira/browse/SPARK-32473 Project: Spark Issue Type: Sub-task Components: Spark Core, Tests Affects Versions: 3.1.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
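The pitfall behind this ticket can be shown with a loose Python analogy (this is not the Scala code): comparing a lazy view to a sequence with `==` does not compare the underlying elements, which is why an element-wise comparison such as `sameElements` is needed in the suite.

```python
a = [1, 2, 3]
view = (x for x in a)            # a lazy wrapper, loosely analogous to IndexedSeqView
assert (view == a) is False      # == compares the objects, not their elements
assert list(x for x in a) == a   # materialize first, then compare element-wise
```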
[jira] [Updated] (SPARK-32470) Remove task result size check for shuffle map stage
[ https://issues.apache.org/jira/browse/SPARK-32470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-32470: Affects Version/s: (was: 3.0.0) 3.1.0 > Remove task result size check for shuffle map stage > --- > > Key: SPARK-32470 > URL: https://issues.apache.org/jira/browse/SPARK-32470 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Wei Xue >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32465) How do I get the SPARK shuffle monitoring indicator?
[ https://issues.apache.org/jira/browse/SPARK-32465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32465: -- Target Version/s: (was: 2.2.1)
> How do I get the SPARK shuffle monitoring indicator?
>
> Key: SPARK-32465
> URL: https://issues.apache.org/jira/browse/SPARK-32465
> Project: Spark
> Issue Type: Question
> Components: Shuffle
> Affects Versions: 2.2.1
> Reporter: MOBIN
> Priority: Major
> Labels: Metrics, Monitoring
>
> I want to monitor Spark tasks through graphite_export and Prometheus. I have collected executor, driver CPU, and memory-related indicators, but there are no shuffle read or shuffle write indicators. How should I configure this?
> metrics.properties:
> master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> shuffleService.source.jvm.class=org.apache.spark.metrics.source.JvmSource
[jira] [Updated] (SPARK-32465) How do I get the SPARK shuffle monitoring indicator?
[ https://issues.apache.org/jira/browse/SPARK-32465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32465: -- Fix Version/s: (was: 2.2.1)
> How do I get the SPARK shuffle monitoring indicator?
>
> Key: SPARK-32465
> URL: https://issues.apache.org/jira/browse/SPARK-32465
> Project: Spark
> Issue Type: Question
> Components: Shuffle
> Affects Versions: 2.2.1
> Reporter: MOBIN
> Priority: Major
> Labels: Metrics, Monitoring
>
> I want to monitor Spark tasks through graphite_export and Prometheus. I have collected executor, driver CPU, and memory-related indicators, but there are no shuffle read or shuffle write indicators. How should I configure this?
> metrics.properties:
> master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> shuffleService.source.jvm.class=org.apache.spark.metrics.source.JvmSource
[jira] [Assigned] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32471: Assignee: Maxim Gekk > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not described.
[jira] [Resolved] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32471. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29275 [https://github.com/apache/spark/pull/29275] > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not described.
[jira] [Reopened] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
[ https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-31525: -- Assignee: (was: Tianshi Zhu) Sorry, I was confused. Let's keep it consistent with the Scala side. Reverted at https://github.com/apache/spark/commit/5491c08bf1d3472e712c0dd88c2881d6496108c0
> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Joshua Hendinata
> Priority: Minor
> Fix For: 3.1.0
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` and the dataframe is empty, it will return *None*, but if you call `df.head(1)` and the dataframe is empty, it will return an *empty list* instead.
> This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which will throw an exception for an empty dataframe.
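The reported behaviour can be modelled in plain Python. This is a hypothetical model of the head logic, not the actual PySpark implementation:

```python
def head(rows, n=None):
    """Model of DataFrame.head: with no argument return the first row or
    None; with an explicit n return a (possibly empty) list of rows."""
    if n is None:
        taken = rows[:1]
        return taken[0] if taken else None  # empty frame -> None
    return rows[:n]                         # empty frame -> []

print(head([], None))  # None
print(head([], 1))     # []
```

With this shape, `len(head([]))` raises a TypeError because `len(None)` is undefined, which is exactly the confusion the report describes.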
[jira] [Updated] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
[ https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31525: - Fix Version/s: (was: 3.1.0)
> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Joshua Hendinata
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` and the dataframe is empty, it will return *None*, but if you call `df.head(1)` and the dataframe is empty, it will return an *empty list* instead.
> This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which will throw an exception for an empty dataframe.
[jira] [Assigned] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
[ https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31525: Assignee: Apache Spark
> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Joshua Hendinata
> Assignee: Apache Spark
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` and the dataframe is empty, it will return *None*, but if you call `df.head(1)` and the dataframe is empty, it will return an *empty list* instead.
> This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which will throw an exception for an empty dataframe.
[jira] [Assigned] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
[ https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31525: Assignee: (was: Apache Spark)
> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Joshua Hendinata
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` and the dataframe is empty, it will return *None*, but if you call `df.head(1)` and the dataframe is empty, it will return an *empty list* instead.
> This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which will throw an exception for an empty dataframe.
[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166823#comment-17166823 ] Apache Spark commented on SPARK-31418: -- User 'venkata91' has created a pull request for this issue: https://github.com/apache/spark/pull/29279
> Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0, 2.4.5
> Reporter: Venkata krishnan Sowrirajan
> Assignee: Venkata krishnan Sowrirajan
> Priority: Major
> Fix For: 3.1.0
>
> With Spark blacklisting, if a task fails on an executor, the executor gets blacklisted for the task. In order to retry the task, Spark checks whether there is an idle blacklisted executor that can be killed and replaced to retry the task; if not, it aborts the job without doing the max retries.
> In the context of dynamic allocation this can be handled better: instead of killing the blacklisted idle executor (it's possible there are no idle blacklisted executors), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below; the example should fail eventually, but it shows that the task is not retried spark.task.maxFailures times:
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are various other cases where this can fail as well.
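The proposed change to the scheduler's decision can be sketched as a small Python function. The names are hypothetical and this is a gross simplification of the actual scheduling logic:

```python
def blacklisted_task_action(idle_blacklisted_executors, dynamic_allocation):
    """Choose what to do when a task's remaining executors are all blacklisted."""
    if idle_blacklisted_executors > 0:
        return "kill-and-replace"      # existing behavior: recycle an idle executor
    if dynamic_allocation:
        return "request-new-executor"  # proposed: acquire new capacity, then retry
    return "abort-job"                 # existing behavior when nothing is idle
```

The third branch is the reported problem: with dynamic allocation there are often no idle blacklisted executors, so the job is aborted before the task has been retried spark.task.maxFailures times.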
[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166821#comment-17166821 ] Apache Spark commented on SPARK-31418: -- User 'venkata91' has created a pull request for this issue: https://github.com/apache/spark/pull/29279
> Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0, 2.4.5
> Reporter: Venkata krishnan Sowrirajan
> Assignee: Venkata krishnan Sowrirajan
> Priority: Major
> Fix For: 3.1.0
>
> With Spark blacklisting, if a task fails on an executor, the executor gets blacklisted for the task. In order to retry the task, Spark checks whether there is an idle blacklisted executor that can be killed and replaced to retry the task; if not, it aborts the job without doing the max retries.
> In the context of dynamic allocation this can be handled better: instead of killing the blacklisted idle executor (it's possible there are no idle blacklisted executors), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below; the example should fail eventually, but it shows that the task is not retried spark.task.maxFailures times:
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are various other cases where this can fail as well.
[jira] [Commented] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166806#comment-17166806 ] Apache Spark commented on SPARK-32160: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/29278
> Executors should not be able to create SparkContext.
>
> Key: SPARK-32160
> URL: https://issues.apache.org/jira/browse/SPARK-32160
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Takuya Ueshin
> Assignee: Takuya Ueshin
> Priority: Major
> Fix For: 3.0.1, 3.1.0
>
> Currently executors can create SparkContext, but shouldn't be able to create it.
> {code:scala}
> sc.range(0, 1).foreach { _ =>
>   new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
> }
> {code}
[jira] [Commented] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166807#comment-17166807 ] Apache Spark commented on SPARK-32160: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/29278
> Executors should not be able to create SparkContext.
>
> Key: SPARK-32160
> URL: https://issues.apache.org/jira/browse/SPARK-32160
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Takuya Ueshin
> Assignee: Takuya Ueshin
> Priority: Major
> Fix For: 3.0.1, 3.1.0
>
> Currently executors can create SparkContext, but shouldn't be able to create it.
> {code:scala}
> sc.range(0, 1).foreach { _ =>
>   new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
> }
> {code}
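A minimal sketch of the kind of guard this fix adds, as a hypothetical Python model (the real check lives in SparkContext's Scala constructor and consults the task context):

```python
def assert_on_driver(in_task):
    """Refuse to construct a context from inside a running task."""
    if in_task:
        raise RuntimeError(
            "SparkContext should only be created and accessed on the driver.")

class SparkContextModel:
    """Toy stand-in for SparkContext: executors pass in_task=True and fail fast."""
    def __init__(self, in_task=False):
        assert_on_driver(in_task)
```

With such a guard, the `foreach` snippet quoted above would fail immediately with a clear error instead of silently building a second context inside an executor.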
[jira] [Commented] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166801#comment-17166801 ] Felix Cheung commented on SPARK-20684: -- [https://github.com/apache/spark/pull/17941#issuecomment-301669567] [https://github.com/apache/spark/pull/19176#issuecomment-328292002] [https://github.com/apache/spark/pull/19176#issuecomment-328292789] > expose createOrReplaceGlobalTempView/createGlobalTempView and > dropGlobalTempView in SparkR > -- > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Priority: Major > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application.
[jira] [Comment Edited] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166581#comment-17166581 ] Felix Cheung edited comment on SPARK-12172 at 7/29/20, 1:20 AM: These are methods (map etc) that were never public and not supported. They were not callable unless you directly referenced the internal namespace Spark::: was (Author: felixcheung): These are methods (map etc) that were never public and not supported. > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Task > Components: SparkR >Reporter: Felix Cheung >Priority: Major >
[jira] [Comment Edited] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166581#comment-17166581 ] Felix Cheung edited comment on SPARK-12172 at 7/29/20, 1:19 AM: These are methods (map etc) that were never public and not supported. was (Author: felixcheung): These are methods (map etc) that were never public and not supported. On Tue, Jul 28, 2020 at 10:18 AM S Daniel Zafar (Jira) > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Task > Components: SparkR >Reporter: Felix Cheung >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32472) Expose confusion matrix elements by threshold in BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-32472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Moore updated SPARK-32472: Description: Currently, the only thresholded metrics available from BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly through roc()) the false positive rate. Unfortunately, you can't always compute the individual thresholded confusion matrix elements (TP, FP, TN, FN) from these quantities. You can make a system of equations out of the existing thresholded metrics and the total count, but they become underdetermined when there are no true positives. Fortunately, the individual confusion matrix elements by threshold are already computed and sitting in the confusions variable. It would be helpful to expose these elements directly. The easiest way would probably be by adding methods like {code:java} def truePositivesByThreshold(): RDD[(Double, Double)] = confusions.map{ case (t, c) => (t, c.weightedTruePositives) }{code} An alternative could be to expose the entire RDD[(Double, BinaryConfusionMatrix)] in one method, but BinaryConfusionMatrix is also currently package private. The closest issue to this I found was this one for adding new calculations to BinaryClassificationMetrics https://issues.apache.org/jira/browse/SPARK-18844, which was closed without any changes being merged. was: Currently, the only thresholded metrics available from BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly through roc()) the false positive rate. Unfortunately, you can't always compute the individual thresholded confusion matrix elements (TP, FP, TN, FN) from these quantities. You can make a system of equations out of the existing thresholded metrics and the total count, but they become underdetermined when there are no true positives. Fortunately, the individual confusion matrix elements by threshold are already computed and sitting in the `confusions` variable. 
It would be helpful to expose these elements directly. The easiest way would probably be by adding methods like {code:java} def truePositivesByThreshold(): RDD[(Double, Double)] = confusions.map{ case (t, c) => (t, c.weightedTruePositives) }{code} An alternative could be to expose the entire RDD[(Double, BinaryConfusionMatrix)] in one method, but BinaryConfusionMatrix is also currently package private. The closest issue to this I found was this one for adding new calculations to BinaryClassificationMetrics https://issues.apache.org/jira/browse/SPARK-18844, which was closed without any changes being merged. > Expose confusion matrix elements by threshold in BinaryClassificationMetrics > > > Key: SPARK-32472 > URL: https://issues.apache.org/jira/browse/SPARK-32472 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Kevin Moore >Priority: Minor > > Currently, the only thresholded metrics available from > BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly > through roc()) the false positive rate. > Unfortunately, you can't always compute the individual thresholded confusion > matrix elements (TP, FP, TN, FN) from these quantities. You can make a system > of equations out of the existing thresholded metrics and the total count, but > they become underdetermined when there are no true positives. > Fortunately, the individual confusion matrix elements by threshold are > already computed and sitting in the confusions variable. It would be helpful > to expose these elements directly. The easiest way would probably be by > adding methods like > {code:java} > def truePositivesByThreshold(): RDD[(Double, Double)] = confusions.map{ case > (t, c) => (t, c.weightedTruePositives) }{code} > An alternative could be to expose the entire RDD[(Double, > BinaryConfusionMatrix)] in one method, but BinaryConfusionMatrix is also > currently package private. 
> The closest issue to this I found was this one for adding new calculations to > BinaryClassificationMetrics > https://issues.apache.org/jira/browse/SPARK-18844, which was closed without > any changes being merged. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
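The proposal above is small enough to model outside Spark. Below is a hedged Python sketch of the idea: compute the per-threshold confusion-matrix elements directly instead of back-solving them from precision/recall, which is underdetermined when there are no true positives. A plain dict stands in for the RDD and a namedtuple for the package-private BinaryConfusionMatrix; all names here are illustrative, not Spark's API.

```python
from collections import namedtuple

# Hypothetical stand-in for Spark's package-private BinaryConfusionMatrix.
Confusion = namedtuple("Confusion", ["tp", "fp", "tn", "fn"])

# (score, label) pairs; label 1 marks a positive example.
scored = [(0.9, 1), (0.8, 0), (0.6, 1), (0.4, 0), (0.2, 0)]

def confusions_by_threshold(pairs):
    """One confusion matrix per distinct score threshold (predict positive
    when score >= t), mirroring the layout of the `confusions` variable."""
    out = {}
    for t in sorted({s for s, _ in pairs}, reverse=True):
        tp = sum(1 for s, l in pairs if s >= t and l == 1)
        fp = sum(1 for s, l in pairs if s >= t and l == 0)
        fn = sum(1 for s, l in pairs if s < t and l == 1)
        tn = sum(1 for s, l in pairs if s < t and l == 0)
        out[t] = Confusion(tp, fp, tn, fn)
    return out

def true_positives_by_threshold(pairs):
    """The accessor shape proposed in the ticket, dict standing in for RDD."""
    return {t: c.tp for t, c in confusions_by_threshold(pairs).items()}
```

With the elements exposed this way, any downstream metric (including ones the built-in API omits) becomes a one-line map over the per-threshold matrices.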
[jira] [Created] (SPARK-32472) Expose confusion matrix elements by threshold in BinaryClassificationMetrics
Kevin Moore created SPARK-32472: --- Summary: Expose confusion matrix elements by threshold in BinaryClassificationMetrics Key: SPARK-32472 URL: https://issues.apache.org/jira/browse/SPARK-32472 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 3.0.0 Reporter: Kevin Moore Currently, the only thresholded metrics available from BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly through `roc()`) the false positive rate. Unfortunately, you can't always compute the individual thresholded confusion matrix elements (TP, FP, TN, FN) from these quantities. You can make a system of equations out of the existing thresholded metrics and the total count, but they become underdetermined when there are no true positives. Fortunately, the individual confusion matrix elements by threshold are already computed and sitting in the `confusions` variable. It would be helpful to expose these elements directly. The easiest way would probably be by adding methods like `def truePositivesByThreshold(): RDD[(Double, Double)] = confusions.map\{ case (t, c) => (t, c.weightedTruePositives) }`. An alternative could be to expose the entire `RDD[(Double, BinaryConfusionMatrix)]` in one method, but `BinaryConfusionMatrix` is also currently package private. The closest issue to this I found was this one for adding new calculations to BinaryClassificationMetrics https://issues.apache.org/jira/browse/SPARK-18844, which was closed without any changes being merged. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-32140: --- Description: Add summary and training summary to FMClassificationModel. (was: Add summary and training summary to ) > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > Add summary and training summary to FMClassificationModel. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-32140: --- Description: Add summary and training summary to > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > Add summary and training summary to -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32449) Add summary to MultilayerPerceptronClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-32449: --- Description: Add summary and training summary to MultilayerPerceptronClassificationModel > Add summary to MultilayerPerceptronClassificationModel > -- > > Key: SPARK-32449 > URL: https://issues.apache.org/jira/browse/SPARK-32449 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Add summary and training summary to MultilayerPerceptronClassificationModel -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32139) Unify Classification Training Summary
[ https://issues.apache.org/jira/browse/SPARK-32139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-32139: --- Description: Add classification model summary and training summary to each of the classification algorithms. The classification model summary basically gives a summary of all the classification algorithms' evaluation metrics, such as accuracy/precision/recall. The training summary describes information about model training iterations and the objective function (scaled loss + regularization) at each iteration. This is very useful information for users. > Unify Classification Training Summary > - > > Key: SPARK-32139 > URL: https://issues.apache.org/jira/browse/SPARK-32139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add classification model summary and training summary to each of the > classification algorithms. The classification model summary basically gives a > summary of all the classification algorithms' evaluation metrics, such as > accuracy/precision/recall. The training summary describes information about > model training iterations and the objective function (scaled loss + > regularization) at each iteration. This is very useful information for > users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
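For concreteness, the evaluation metrics such a summary reports reduce to simple ratios over the binary confusion matrix. A minimal sketch with illustrative names (this is the arithmetic, not Spark's summary API):

```python
# (label, prediction) pairs for a binary classifier; data is made up.
pairs = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]

tp = sum(1 for l, p in pairs if l == 1 and p == 1)  # true positives
fp = sum(1 for l, p in pairs if l == 0 and p == 1)  # false positives
fn = sum(1 for l, p in pairs if l == 1 and p == 0)  # false negatives
tn = sum(1 for l, p in pairs if l == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)   # fraction of correct predictions
precision = tp / (tp + fp)          # of predicted positives, how many are real
recall = tp / (tp + fn)             # of real positives, how many were found
```

A unified summary object would expose these (and the per-iteration objective history for the training summary) behind one consistent interface across classifiers.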
[jira] [Commented] (SPARK-32421) Add code-gen for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1715#comment-1715 ] Apache Spark commented on SPARK-32421: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/29277 > Add code-gen for shuffled hash join > --- > > Key: SPARK-32421 > URL: https://issues.apache.org/jira/browse/SPARK-32421 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > We added shuffled hash join codegen internally in our fork, and are seeing > an obvious improvement in benchmarks compared to the current non-codegen code > path. Creating this Jira to add this support. Shuffled hash join codegen is very > similar to broadcast hash join codegen. So this is a simple change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
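For readers unfamiliar with the operator: the loop that codegen specializes is the classic build-then-probe hash join, run per shuffle partition. A minimal Python sketch of the general technique (plain lists stand in for Spark's rows and iterators; this is not Spark's generated code):

```python
from collections import defaultdict

def hash_join(build, stream):
    """Per-partition core of a shuffled hash join: build a hash table over the
    (already shuffled) build side, then probe it with each streamed row."""
    table = defaultdict(list)
    for k, a in build:            # build phase: one pass over the smaller side
        table[k].append(a)
    out = []
    for k, b in stream:           # probe phase: emit one row per matching pair
        for a in table.get(k, []):
            out.append((k, a, b))
    return out
```

Code generation flattens this loop (and the key hashing/comparison inside it) into specialized bytecode instead of going through iterator and row abstractions, which is where the benchmark win comes from.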
[jira] [Assigned] (SPARK-32421) Add code-gen for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32421: Assignee: (was: Apache Spark) > Add code-gen for shuffled hash join > --- > > Key: SPARK-32421 > URL: https://issues.apache.org/jira/browse/SPARK-32421 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > We added shuffled hash join codegen internally in our fork, and are seeing > an obvious improvement in benchmarks compared to the current non-codegen code > path. Creating this Jira to add this support. Shuffled hash join codegen is very > similar to broadcast hash join codegen. So this is a simple change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32421) Add code-gen for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32421: Assignee: Apache Spark > Add code-gen for shuffled hash join > --- > > Key: SPARK-32421 > URL: https://issues.apache.org/jira/browse/SPARK-32421 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Trivial > > We added shuffled hash join codegen internally in our fork, and are seeing > an obvious improvement in benchmarks compared to the current non-codegen code > path. Creating this Jira to add this support. Shuffled hash join codegen is very > similar to broadcast hash join codegen. So this is a simple change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30817) SparkR ML algorithms parity
[ https://issues.apache.org/jira/browse/SPARK-30817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1713#comment-1713 ] Maciej Szymkiewicz commented on SPARK-30817: [~dan_z] I think we're in-sync ATM and this specific issue should be resolved. > SparkR ML algorithms parity > > > Key: SPARK-30817 > URL: https://issues.apache.org/jira/browse/SPARK-30817 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > As of 3.0 the following algorithms are missing form SparkR > * {{LinearRegression}} > * {{FMRegressor}} (Added to ML in 3.0) > * {{FMClassifier}} (Added to ML in 3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32470) Remove task result size check for shuffle map stage
[ https://issues.apache.org/jira/browse/SPARK-32470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166643#comment-17166643 ] Apache Spark commented on SPARK-32470: -- User 'maryannxue' has created a pull request for this issue: https://github.com/apache/spark/pull/29276 > Remove task result size check for shuffle map stage > --- > > Key: SPARK-32470 > URL: https://issues.apache.org/jira/browse/SPARK-32470 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wei Xue >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32470) Remove task result size check for shuffle map stage
[ https://issues.apache.org/jira/browse/SPARK-32470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32470: Assignee: (was: Apache Spark) > Remove task result size check for shuffle map stage > --- > > Key: SPARK-32470 > URL: https://issues.apache.org/jira/browse/SPARK-32470 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wei Xue >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32470) Remove task result size check for shuffle map stage
[ https://issues.apache.org/jira/browse/SPARK-32470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32470: Assignee: Apache Spark > Remove task result size check for shuffle map stage > --- > > Key: SPARK-32470 > URL: https://issues.apache.org/jira/browse/SPARK-32470 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166638#comment-17166638 ] Apache Spark commented on SPARK-32471: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29275 > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not > described. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32471: Assignee: Apache Spark > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not > described. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32471: Assignee: (was: Apache Spark) > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not > described. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
Maxim Gekk created SPARK-32471: -- Summary: Describe JSON option `allowNonNumericNumbers` Key: SPARK-32471 URL: https://issues.apache.org/jira/browse/SPARK-32471 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The JSON datasource supports the `allowNonNumericNumbers` option, but it is not described. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
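For context on what the option governs: standard JSON has no literal for NaN or the infinities, so a strict parser must reject such tokens, while a permissive parser maps them onto IEEE-754 specials. The sketch below illustrates the target behavior with Python's json module, which happens to be permissive by default; the exact token set Spark accepts is defined by its Jackson parser, not by this sketch, and the option itself would be set at read time (e.g. `spark.read.option("allowNonNumericNumbers", "true").json(path)`).

```python
import json
import math

# What a permissive JSON number parser (allowNonNumericNumbers=true in Spark's
# terms) does with non-numeric number tokens: map them to IEEE-754 specials.
vals = json.loads('[NaN, Infinity, -Infinity]')
# vals[0] is nan, vals[1] is +inf, vals[2] is -inf
```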
[jira] [Commented] (SPARK-32397) Snapshot artifacts can have differing timestamps, making it hard to consume
[ https://issues.apache.org/jira/browse/SPARK-32397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166632#comment-17166632 ] Apache Spark commented on SPARK-32397: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/29274 > Snapshot artifacts can have differing timestamps, making it hard to consume > --- > > Key: SPARK-32397 > URL: https://issues.apache.org/jira/browse/SPARK-32397 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1, 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Since we use multiple subcomponents in building Spark, we can get into a > situation where the timestamps for these components are different. This can > make it difficult to consume Spark snapshots in an environment where someone > is running a nightly build for other folks to develop on top of. > I believe I have a small fix for this already, but am just waiting to verify it, and > then I'll open a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32397) Snapshot artifacts can have differing timestamps, making it hard to consume
[ https://issues.apache.org/jira/browse/SPARK-32397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32397: Assignee: Apache Spark (was: Holden Karau) > Snapshot artifacts can have differing timestamps, making it hard to consume > --- > > Key: SPARK-32397 > URL: https://issues.apache.org/jira/browse/SPARK-32397 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1, 3.1.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Minor > > Since we use multiple subcomponents in building Spark, we can get into a > situation where the timestamps for these components are different. This can > make it difficult to consume Spark snapshots in an environment where someone > is running a nightly build for other folks to develop on top of. > I believe I have a small fix for this already, but am just waiting to verify it, and > then I'll open a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32397) Snapshot artifacts can have differing timestamps, making it hard to consume
[ https://issues.apache.org/jira/browse/SPARK-32397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32397: Assignee: Holden Karau (was: Apache Spark) > Snapshot artifacts can have differing timestamps, making it hard to consume > --- > > Key: SPARK-32397 > URL: https://issues.apache.org/jira/browse/SPARK-32397 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1, 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Since we use multiple sub components in building Spark we can get into a > situation where the timestamps for these components is different. This can > make it difficult to consume Spark snapshots in an environment where someone > is running a nightly build for other folks to develop on top of. > I believe I have a small fix for this already, but just waiting to verify and > then I'll open a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166627#comment-17166627 ] Thomas Graves edited comment on SPARK-32429 at 7/28/20, 6:47 PM: - Yes so for this first implementation we didn't really address users selecting different types of GPUs, but I think the design is generic enough to handle but requires extra support from the cluster manager. Otherwise I think it's left to the user to discover the details on the GPU. So I think the scenario you are talking about is a Worker has multiple gpus of different types so for it to discover them we would either have to explicitly add support for a "type" (spark.executor.resource.gpu.type) or you have 2 custom resources (k80/v100), which for standalone mode would be fine because you just supply the Worker with different discovery scripts and then like you say the application would request one or the other type of resource. The application just needs to know to request the custom resources vs just "gpu". I think there are a few ways we could make this generic. One to make it completely generic is to make it a plugin that would run before launching executors and python processes. spark.worker.resource.XX.launchPlugin = someClass. You could pass the env and resources into each one and it could set whatever it needs. There are less generic ways if you want Spark to know more about CUDA. What do you think of something like this? was (Author: tgraves): Yes so for this first implementation we didn't really address users selecting different types of GPUs, but I think the design is generic enough to handle but requires extra support from the cluster manager. Otherwise I think it's left to the user to discover the details on the GPU. 
So I think the scenario you are talking about is a Worker has multiple gpus of different types so for it to discover them we would either have to explicitly add support for a "type" (spark.executor.resource.gpu.type) or you have 2 custom resources (k80/v100), which for standalone mode would be fine because you just supply the Worker with different discovery scripts and then like you say the application would request one or the other type of resource. The application just needs to know to request the custom resources vs just "gpu". I think there are a few ways we could make this generic. One to make it completely generic is to make it a plugin that would run before launching executors and python processes. spark.worker.resource.XX.launchPlugins = someClass,anotherOne. You could pass the env and resources into each one and it could set whatever it needs. There are less generic ways if you want Spark to know more about CUDA. What do you think of something like this? > Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. 
> * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
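The launch-time handling sketched in the comment above (pin the executor to its assigned GPUs, then renumber the addresses so they start from device id 0 again) could look roughly like the following. This is an illustrative sketch only: `build_launch_env` is a hypothetical helper, not a Spark API, and the `spark.worker.resource.XX.launchPlugins` setting named in the comment is a proposal, not an existing config.

```python
import os

def build_launch_env(assigned_addresses):
    """Sketch of what a worker-side launch hook might do with the GPU
    addresses assigned to one executor: restrict the process to those
    devices via CUDA_VISIBLE_DEVICES, and remap the addresses Spark
    reports, since CUDA renumbers the visible devices from 0."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(assigned_addresses)
    # Inside the launched process, physical device "2" now appears as
    # device 0, "5" as device 1, and so on.
    remapped = [str(i) for i in range(len(assigned_addresses))]
    return env, remapped

env, addresses = build_launch_env(["2", "5"])
print(env["CUDA_VISIBLE_DEVICES"], addresses)  # 2,5 ['0', '1']
```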
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166627#comment-17166627 ] Thomas Graves commented on SPARK-32429: --- Yes, for this first implementation we didn't really address users selecting different types of GPUs, but I think the design is generic enough to handle it; it just requires extra support from the cluster manager. Otherwise it's left to the user to discover the details of the GPU. In the scenario you are talking about, a Worker has multiple GPUs of different types. For it to discover them, we would either have to explicitly add support for a "type" (spark.executor.resource.gpu.type), or you would have two custom resources (k80/v100). For standalone mode that would be fine, because you just supply the Worker with different discovery scripts, and then, as you say, the application would request one or the other type of resource. The application just needs to know to request the custom resources rather than just "gpu". I think there are a few ways we could make this generic. One way to make it completely generic is a plugin that would run before launching executors and python processes: spark.worker.resource.XX.launchPlugins = someClass,anotherOne. You could pass the env and resources into each one, and it could set whatever it needs. There are less generic ways if you want Spark to know more about CUDA. What do you think of something like this? > Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. 
> * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
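The two-GPU-type scenario discussed above (k80/v100 exposed as separate custom resources, each with its own discovery script) can be sketched with the JSON shape that Spark resource discovery scripts print on stdout. The resource names and addresses here are illustrative.

```python
import json

def discovery_output(name, addresses):
    """Build the JSON a Spark resource discovery script prints: a single
    object with the resource name and its device addresses."""
    return json.dumps({"name": name, "addresses": addresses})

# One discovery script per GPU generation, so an application can request
# "k80" or "v100" as a custom resource instead of the generic "gpu".
print(discovery_output("k80", ["0"]))
print(discovery_output("v100", ["1", "2"]))
```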
[jira] [Comment Edited] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
[ https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166608#comment-17166608 ] StanislavKo edited comment on SPARK-28001 at 7/28/20, 6:04 PM: --- +1 OS: Windows 10 Python: 3.7 PySpark: (2.4.0 in requirements.txt, 2.4.3 in $SPARK_HOME) Cluster manager: Spark Standalone was (Author: stanislavko): +1 OS: Windows 10 Python: 3.7 PySpark: 2.4.3 Cluster manager: Spark Standalone > Dataframe throws 'socket.timeout: timed out' exception > -- > > Key: SPARK-28001 > URL: https://issues.apache.org/jira/browse/SPARK-28001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: Processor: Intel Core i7-7700 CPU @ 3.60Ghz > RAM: 16 GB > OS: Windows 10 Enterprise 64-bit > Python: 3.7.2 > PySpark: 3.4.3 > Cluster manager: Spark Standalone >Reporter: Marius Stanescu >Priority: Critical > > I load data from Azure Table Storage, create a DataFrame and perform a couple > of operations via two user-defined functions, then call show() to display the > results. If I load a very small batch of items, like 5, everything is working > fine, but if I load a batch grater then 10 items from Azure Table Storage > then I get the 'socket.timeout: timed out' exception. 
> Here is the code: > > {code} > import time > import json > import requests > from requests.auth import HTTPBasicAuth > from azure.cosmosdb.table.tableservice import TableService > from azure.cosmosdb.table.models import Entity > from pyspark.sql import SparkSession > from pyspark.sql.functions import udf, struct > from pyspark.sql.types import BooleanType > def main(): > batch_size = 25 > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > spark = SparkSession \ > .builder \ > .appName(agent_name) \ > .config("spark.sql.crossJoin.enabled", "true") \ > .getOrCreate() > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) > continuation_token = None > while True: > messages = table_service.query_entities( > azure_table_name, > select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp", > num_results=batch_size, > marker=continuation_token, > timeout=60) > continuation_token = messages.next_marker > messages_list = list(messages) > > if not len(messages_list): > time.sleep(5) > pass > > messages_df = spark.createDataFrame(messages_list) > > register_records_df = messages_df \ > .withColumn('Registered', register_record('RowKey', > 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp')) > > only_registered_records_df = register_records_df \ > .filter(register_records_df.Registered == True) \ > .drop(register_records_df.Registered) > > update_message_status_df = only_registered_records_df \ > .withColumn('TableEntryDeleted', delete_table_entity('RowKey', > 'PartitionKey')) > > results_df = update_message_status_df.select( > update_message_status_df.RowKey, > update_message_status_df.PartitionKey, > update_message_status_df.TableEntryDeleted) > #results_df.explain() > results_df.show(n=batch_size, truncate=False) > @udf(returnType=BooleanType()) > def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp): > # call an API > try: > url = 
'{}/data/record/{}'.format('***', rowKey) > headers = { 'Content-type': 'application/json' } > response = requests.post( > url, > headers=headers, > auth=HTTPBasicAuth('***', '***'), > data=prepare_record_data(rowKey, partitionKey, > messageId, ownerSmtp, timestamp)) > > return bool(response) > except: > return False > def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, > timestamp): > record_data = { > "Title": messageId, > "Type": '***', > "Source": '***', > "Creator": ownerSmtp, > "Publisher": '***', > "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ') > } > return json.dumps(record_data) > @udf(returnType=BooleanType()) > def delete_table_entity(row_key, partition_key): > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > try: > table_service = TableService(account_name=azure_table_account_name, >
[jira] [Created] (SPARK-32470) Remove task result size check for shuffle map stage
Wei Xue created SPARK-32470: --- Summary: Remove task result size check for shuffle map stage Key: SPARK-32470 URL: https://issues.apache.org/jira/browse/SPARK-32470 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Wei Xue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
[ https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166608#comment-17166608 ] StanislavKo commented on SPARK-28001: - +1 OS: Windows 10 Python: 3.7.2 PySpark: 2.4.3 Cluster manager: Spark Standalone > Dataframe throws 'socket.timeout: timed out' exception > -- > > Key: SPARK-28001 > URL: https://issues.apache.org/jira/browse/SPARK-28001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: Processor: Intel Core i7-7700 CPU @ 3.60GHz > RAM: 16 GB > OS: Windows 10 Enterprise 64-bit > Python: 3.7.2 > PySpark: 2.4.3 > Cluster manager: Spark Standalone >Reporter: Marius Stanescu >Priority: Critical > > I load data from Azure Table Storage, create a DataFrame and perform a couple > of operations via two user-defined functions, then call show() to display the > results. If I load a very small batch of items, like 5, everything works > fine, but if I load a batch greater than 10 items from Azure Table Storage, > then I get the 'socket.timeout: timed out' exception. 
> Here is the code: > > {code} > import time > import json > import requests > from requests.auth import HTTPBasicAuth > from azure.cosmosdb.table.tableservice import TableService > from azure.cosmosdb.table.models import Entity > from pyspark.sql import SparkSession > from pyspark.sql.functions import udf, struct > from pyspark.sql.types import BooleanType > def main(): > batch_size = 25 > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > spark = SparkSession \ > .builder \ > .appName(agent_name) \ > .config("spark.sql.crossJoin.enabled", "true") \ > .getOrCreate() > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) > continuation_token = None > while True: > messages = table_service.query_entities( > azure_table_name, > select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp", > num_results=batch_size, > marker=continuation_token, > timeout=60) > continuation_token = messages.next_marker > messages_list = list(messages) > > if not len(messages_list): > time.sleep(5) > pass > > messages_df = spark.createDataFrame(messages_list) > > register_records_df = messages_df \ > .withColumn('Registered', register_record('RowKey', > 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp')) > > only_registered_records_df = register_records_df \ > .filter(register_records_df.Registered == True) \ > .drop(register_records_df.Registered) > > update_message_status_df = only_registered_records_df \ > .withColumn('TableEntryDeleted', delete_table_entity('RowKey', > 'PartitionKey')) > > results_df = update_message_status_df.select( > update_message_status_df.RowKey, > update_message_status_df.PartitionKey, > update_message_status_df.TableEntryDeleted) > #results_df.explain() > results_df.show(n=batch_size, truncate=False) > @udf(returnType=BooleanType()) > def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp): > # call an API > try: > url = 
'{}/data/record/{}'.format('***', rowKey) > headers = { 'Content-type': 'application/json' } > response = requests.post( > url, > headers=headers, > auth=HTTPBasicAuth('***', '***'), > data=prepare_record_data(rowKey, partitionKey, > messageId, ownerSmtp, timestamp)) > > return bool(response) > except: > return False > def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, > timestamp): > record_data = { > "Title": messageId, > "Type": '***', > "Source": '***', > "Creator": ownerSmtp, > "Publisher": '***', > "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ') > } > return json.dumps(record_data) > @udf(returnType=BooleanType()) > def delete_table_entity(row_key, partition_key): > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > try: > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) > table_service.delete_entity(azure_table_name, partition_key, row_key) > return True > except: > return False > if __name__ == "__main__": >
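One factor in the report above is that `register_record` calls an external API from inside a UDF with no bound on how long the call may take, so a slow endpoint can stall the Python worker. A generic way to cap that latency is sketched below; `bounded_call` is an illustrative helper (not part of Spark or the report's code), used here with plain lambdas so the sketch runs without the Azure or requests dependencies.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def bounded_call(fn, timeout, default):
    """Run fn() but give up after `timeout` seconds, returning `default`.
    Caps how long the caller (e.g. a UDF wrapping an HTTP POST) waits;
    note the abandoned call still runs to completion in the background."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout)
    except FuturesTimeout:
        return default
    finally:
        pool.shutdown(wait=False)

print(bounded_call(lambda: True, timeout=1.0, default=False))   # True
print(bounded_call(lambda: time.sleep(2) or True, timeout=0.2,
                   default=False))                              # False
```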
[jira] [Comment Edited] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
[ https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166608#comment-17166608 ] StanislavKo edited comment on SPARK-28001 at 7/28/20, 6:02 PM: --- +1 OS: Windows 10 Python: 3.7 PySpark: 2.4.3 Cluster manager: Spark Standalone was (Author: stanislavko): +1 OS: Windows 10 Python: 3.7.2 PySpark: 2.4.3 Cluster manager: Spark Standalone > Dataframe throws 'socket.timeout: timed out' exception > -- > > Key: SPARK-28001 > URL: https://issues.apache.org/jira/browse/SPARK-28001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: Processor: Intel Core i7-7700 CPU @ 3.60Ghz > RAM: 16 GB > OS: Windows 10 Enterprise 64-bit > Python: 3.7.2 > PySpark: 3.4.3 > Cluster manager: Spark Standalone >Reporter: Marius Stanescu >Priority: Critical > > I load data from Azure Table Storage, create a DataFrame and perform a couple > of operations via two user-defined functions, then call show() to display the > results. If I load a very small batch of items, like 5, everything is working > fine, but if I load a batch grater then 10 items from Azure Table Storage > then I get the 'socket.timeout: timed out' exception. 
> Here is the code: > > {code} > import time > import json > import requests > from requests.auth import HTTPBasicAuth > from azure.cosmosdb.table.tableservice import TableService > from azure.cosmosdb.table.models import Entity > from pyspark.sql import SparkSession > from pyspark.sql.functions import udf, struct > from pyspark.sql.types import BooleanType > def main(): > batch_size = 25 > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > spark = SparkSession \ > .builder \ > .appName(agent_name) \ > .config("spark.sql.crossJoin.enabled", "true") \ > .getOrCreate() > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) > continuation_token = None > while True: > messages = table_service.query_entities( > azure_table_name, > select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp", > num_results=batch_size, > marker=continuation_token, > timeout=60) > continuation_token = messages.next_marker > messages_list = list(messages) > > if not len(messages_list): > time.sleep(5) > pass > > messages_df = spark.createDataFrame(messages_list) > > register_records_df = messages_df \ > .withColumn('Registered', register_record('RowKey', > 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp')) > > only_registered_records_df = register_records_df \ > .filter(register_records_df.Registered == True) \ > .drop(register_records_df.Registered) > > update_message_status_df = only_registered_records_df \ > .withColumn('TableEntryDeleted', delete_table_entity('RowKey', > 'PartitionKey')) > > results_df = update_message_status_df.select( > update_message_status_df.RowKey, > update_message_status_df.PartitionKey, > update_message_status_df.TableEntryDeleted) > #results_df.explain() > results_df.show(n=batch_size, truncate=False) > @udf(returnType=BooleanType()) > def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp): > # call an API > try: > url = 
'{}/data/record/{}'.format('***', rowKey) > headers = { 'Content-type': 'application/json' } > response = requests.post( > url, > headers=headers, > auth=HTTPBasicAuth('***', '***'), > data=prepare_record_data(rowKey, partitionKey, > messageId, ownerSmtp, timestamp)) > > return bool(response) > except: > return False > def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, > timestamp): > record_data = { > "Title": messageId, > "Type": '***', > "Source": '***', > "Creator": ownerSmtp, > "Publisher": '***', > "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ') > } > return json.dumps(record_data) > @udf(returnType=BooleanType()) > def delete_table_entity(row_key, partition_key): > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > try: > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) >
[jira] [Resolved] (SPARK-32458) Mismatched row access sizes in tests
[ https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32458. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29258 [https://github.com/apache/spark/pull/29258] > Mismatched row access sizes in tests > > > Key: SPARK-32458 > URL: https://issues.apache.org/jira/browse/SPARK-32458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Michael Munday >Assignee: Michael Munday >Priority: Minor > Labels: catalyst, endianness > Fix For: 3.1.0 > > > The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This > is because the test data is written into the row using unsafe operations with > one size and then read back using a different size. For example, in > UnsafeMapSuite the test data is written using putLong and then read back > using getInt. This happens to work on little-endian systems but these > differences appear to be typos and cause the tests to fail on big-endian > systems. > I have a patch that fixes the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
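The putLong/getInt size mismatch described above can be reproduced with Python's struct module: writing a small value as an 8-byte long and reading back only the first 4 bytes happens to work on a little-endian layout, but yields the high word (zero) on a big-endian one.

```python
import struct

value = 42
little = struct.pack('<q', value)  # 8-byte long, little-endian layout
big = struct.pack('>q', value)     # 8-byte long, big-endian layout

# Reading only 4 bytes back (the getInt half of the mismatch):
read_little = struct.unpack('<i', little[:4])[0]
read_big = struct.unpack('>i', big[:4])[0]

print(read_little)  # 42 -> the bug goes unnoticed on little-endian
print(read_big)     # 0  -> the test fails on big-endian
```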
[jira] [Assigned] (SPARK-32458) Mismatched row access sizes in tests
[ https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32458: - Assignee: Michael Munday > Mismatched row access sizes in tests > > > Key: SPARK-32458 > URL: https://issues.apache.org/jira/browse/SPARK-32458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Michael Munday >Assignee: Michael Munday >Priority: Minor > Labels: catalyst, endianness > > The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This > is because the test data is written into the row using unsafe operations with > one size and then read back using a different size. For example, in > UnsafeMapSuite the test data is written using putLong and then read back > using getInt. This happens to work on little-endian systems but these > differences appear to be typos and cause the tests to fail on big-endian > systems. > I have a patch that fixes the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166581#comment-17166581 ] Felix Cheung commented on SPARK-12172: -- These are methods (map etc) that were never public and not supported. On Tue, Jul 28, 2020 at 10:18 AM S Daniel Zafar (Jira) > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Task > Components: SparkR >Reporter: Felix Cheung >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166575#comment-17166575 ] Dongjoon Hyun commented on SPARK-20684: --- Actually, there were two PRs for this. - https://github.com/apache/spark/pull/17941 (mine) - https://github.com/apache/spark/pull/19176 (Yanbo's) The PR is closed because we didn't implement the feature inside R. If we have the implementation, we can expose it after that. Let's keep this open, [~dan_z]. That is my opinion as the patch author. I'm still active in this area. > expose createOrReplaceGlobalTempView/createGlobalTempView and > dropGlobalTempView in SparkR > -- > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Priority: Major > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30817) SparkR ML algorithms parity
[ https://issues.apache.org/jira/browse/SPARK-30817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166571#comment-17166571 ] S Daniel Zafar edited comment on SPARK-30817 at 7/28/20, 5:19 PM: -- I would like to work on this issue, is that all right [~hyukjin.kwon]? It would be my first. was (Author: dan_z): I would like to address this issue, is that all right [~hyukjin.kwon]? > SparkR ML algorithms parity > > > Key: SPARK-30817 > URL: https://issues.apache.org/jira/browse/SPARK-30817 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > As of 3.0 the following algorithms are missing form SparkR > * {{LinearRegression}} > * {{FMRegressor}} (Added to ML in 3.0) > * {{FMClassifier}} (Added to ML in 3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166569#comment-17166569 ] S Daniel Zafar commented on SPARK-12172: My opinion is that it makes sense to keep these methods, since they exist in PySpark. Removing basic things like `map` seems counterintuitive. The PR is closed; should we close this as well? > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Task > Components: SparkR >Reporter: Felix Cheung >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166567#comment-17166567 ] Apache Spark commented on SPARK-32469: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29273 > ApplyColumnarRulesAndInsertTransitions should be idempotent > --- > > Key: SPARK-32469 > URL: https://issues.apache.org/jira/browse/SPARK-32469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166564#comment-17166564 ] Apache Spark commented on SPARK-32469: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29273 > ApplyColumnarRulesAndInsertTransitions should be idempotent > --- > > Key: SPARK-32469 > URL: https://issues.apache.org/jira/browse/SPARK-32469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32469: Assignee: Apache Spark (was: Wenchen Fan) > ApplyColumnarRulesAndInsertTransitions should be idempotent > --- > > Key: SPARK-32469 > URL: https://issues.apache.org/jira/browse/SPARK-32469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32469: Assignee: Wenchen Fan (was: Apache Spark) > ApplyColumnarRulesAndInsertTransitions should be idempotent > --- > > Key: SPARK-32469 > URL: https://issues.apache.org/jira/browse/SPARK-32469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
Wenchen Fan created SPARK-32469: --- Summary: ApplyColumnarRulesAndInsertTransitions should be idempotent Key: SPARK-32469 URL: https://issues.apache.org/jira/browse/SPARK-32469 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
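The ticket carries no description, but the property named in its title is standard: a rule is idempotent when applying it twice yields the same plan as applying it once. A toy check of that property, with strings standing in for physical plans (illustrative only, not Spark's actual rule):

```python
def insert_transitions(plan):
    """Toy stand-in for a transition-inserting rule: wrap a bare plan
    once, and leave an already-wrapped plan untouched."""
    if plan.startswith("Transition("):
        return plan
    return f"Transition({plan})"

once = insert_transitions("ColumnarScan")
twice = insert_transitions(once)
print(once == twice)  # True: re-running the rule adds no second wrapper
```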
[jira] [Commented] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166559#comment-17166559 ] S Daniel Zafar commented on SPARK-20684: The PR ([https://github.com/apache/spark/pull/17941]) was closed. I think we can close this. > expose createOrReplaceGlobalTempView/createGlobalTempView and > dropGlobalTempView in SparkR > -- > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Priority: Major > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application.
[jira] [Comment Edited] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166532#comment-17166532 ] Xiangrui Meng edited comment on SPARK-32429 at 7/28/20, 4:30 PM: - [~tgraves] Thanks for the clarification! It makes sense to add GPU isolation at the executor level. Your prototype adds special meaning to the "gpu" resource name. I wonder if we want to make it more configurable in the final implementation. A scenario we considered previously was a cluster with two generations of GPUs: K80 and V100. I think it is safe to assume that a Spark application should only request one GPU type. Then we will need some configuration to tell, based on which resource name, the user wants to set CUDA_VISIBLE_DEVICES. Btw, we found that setting CUDA_DEVICE_ORDER=PCI_BUS_ID is necessary to have consistent device ordering between different processes even when CUDA_VISIBLE_DEVICES is set the same. Not sure if the same setting is used in YARN/k8s.
> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. > * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env.
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166532#comment-17166532 ] Xiangrui Meng commented on SPARK-32429
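The mechanism discussed above can be sketched as follows (an illustrative Python sketch, not Spark's actual standalone-worker code; the function name and return shape are made up): the launcher restricts the executor process to its allocated devices, pins a deterministic device ordering, and renumbers the resource addresses from 0 because that is how CUDA presents the visible devices inside the process.

```python
import os

def launch_env(allocated_gpus):
    """Build an executor environment for GPU isolation.

    allocated_gpus: physical GPU ids the worker assigned, e.g. ["2", "3"].
    Returns (env, remapped): env restricts CUDA to those devices, and
    remapped is the address list the executor should use, renumbered
    from 0 as the issue description notes.
    """
    env = dict(os.environ)
    # Restrict the process to the allocated devices ...
    env["CUDA_VISIBLE_DEVICES"] = ",".join(allocated_gpus)
    # ... and pin a deterministic ordering so every process agrees on
    # which physical device is which (per the comment above).
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Inside the process, CUDA renumbers the visible devices from 0.
    remapped = [str(i) for i in range(len(allocated_gpus))]
    return env, remapped

env, addrs = launch_env(["2", "3"])
assert env["CUDA_VISIBLE_DEVICES"] == "2,3"
assert addrs == ["0", "1"]
```

The same environment would presumably also be passed to any Python worker process the executor spawns, as the issue suggests.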
[jira] [Assigned] (SPARK-32339) Improve MLlib BLAS native acceleration docs
[ https://issues.apache.org/jira/browse/SPARK-32339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao reassigned SPARK-32339: -- Assignee: Xiaochang Wu > Improve MLlib BLAS native acceleration docs > --- > > Key: SPARK-32339 > URL: https://issues.apache.org/jira/browse/SPARK-32339 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 3.1.0 >Reporter: Xiaochang Wu >Assignee: Xiaochang Wu >Priority: Major > > The documentation for enabling BLAS native acceleration in the ML guide > ([https://spark.apache.org/docs/latest/ml-guide.html#dependencies]) is > incomplete and unclear to the user. > We will rewrite it into a clearer and more complete guide.
[jira] [Resolved] (SPARK-32339) Improve MLlib BLAS native acceleration docs
[ https://issues.apache.org/jira/browse/SPARK-32339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-32339. Fix Version/s: 3.1.0, 3.0.1 Resolution: Fixed Issue resolved by pull request 29139 [https://github.com/apache/spark/pull/29139]
[jira] [Commented] (SPARK-32468) Fix timeout config issue in Kafka connector tests
[ https://issues.apache.org/jira/browse/SPARK-32468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166494#comment-17166494 ] Apache Spark commented on SPARK-32468: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/29272 > Fix timeout config issue in Kafka connector tests > - > > Key: SPARK-32468 > URL: https://issues.apache.org/jira/browse/SPARK-32468 > Project: Spark > Issue Type: Bug > Components: Structured Streaming, Tests >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Minor > > While implementing SPARK-32032 I found a bug in Kafka: > https://issues.apache.org/jira/browse/KAFKA-10318. This will cause issues > only later, when it's fixed, but it would be good to fix it now because > SPARK-32032 would like to bring in AdminClient, where the code blows up with > the mentioned ConfigException. This would reduce the code changes in the > mentioned jira.
[jira] [Assigned] (SPARK-32468) Fix timeout config issue in Kafka connector tests
[ https://issues.apache.org/jira/browse/SPARK-32468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32468: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-32468) Fix timeout config issue in Kafka connector tests
[ https://issues.apache.org/jira/browse/SPARK-32468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32468: Assignee: Apache Spark
[jira] [Created] (SPARK-32468) Fix timeout config issue in Kafka connector tests
Gabor Somogyi created SPARK-32468: Summary: Fix timeout config issue in Kafka connector tests Key: SPARK-32468 URL: https://issues.apache.org/jira/browse/SPARK-32468 Project: Spark Issue Type: Bug Components: Structured Streaming, Tests Affects Versions: 3.1.0 Reporter: Gabor Somogyi
[jira] [Assigned] (SPARK-32467) Avoid encoding URL twice on https redirect
[ https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32467: Assignee: Apache Spark (was: Gengliang Wang) > Avoid encoding URL twice on https redirect > -- > > Key: SPARK-32467 > URL: https://issues.apache.org/jira/browse/SPARK-32467 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Currently, on https redirect, the original URL is encoded as an HTTPS URL. > However, the original URL could already be encoded, so the return result > of the method > UriInfo.getQueryParameters will contain encoded keys and values. For example, > a parameter > order[0][dir] will become order%255B0%255D%255Bcolumn%255D after being encoded > twice, and the decoded > key in the result of UriInfo.getQueryParameters will be > order%5B0%5D%5Bcolumn%5D. > To fix the problem, we try decoding the query parameters before encoding them. > This is to make sure we encode the URL exactly once.
[jira] [Commented] (SPARK-32467) Avoid encoding URL twice on https redirect
[ https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166477#comment-17166477 ] Apache Spark commented on SPARK-32467: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/29271
[jira] [Created] (SPARK-32467) Avoid encoding URL twice on https redirect
Gengliang Wang created SPARK-32467: Summary: Avoid encoding URL twice on https redirect Key: SPARK-32467 URL: https://issues.apache.org/jira/browse/SPARK-32467 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.1.0 Reporter: Gengliang Wang Assignee: Gengliang Wang
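The decode-then-encode fix described in this ticket can be illustrated with Python's `urllib.parse` (an illustrative sketch; Spark's actual fix lives in its Jetty-based UI code, and `encode_once` is a made-up helper name):

```python
from urllib.parse import quote, unquote

def encode_once(value):
    """Decode first, then encode, so an already-encoded value is not
    encoded a second time -- the double-encoding fix from SPARK-32467."""
    return quote(unquote(value), safe="")

raw = "order[0][dir]"
once = quote(raw, safe="")    # 'order%5B0%5D%5Bdir%5D'
twice = quote(once, safe="")  # 'order%255B0%255D%255Bdir%255D' -- the bug
assert encode_once(raw) == once
assert encode_once(once) == once  # idempotent: encoding happens exactly once
```

Encoding an already-encoded value turns each `%` into `%25`, which is how `%5B` becomes `%255B`; decoding before encoding makes the operation idempotent.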
[jira] [Assigned] (SPARK-32466) Add support to catch SparkPlan regression based on TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-32466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32466: Assignee: Apache Spark > Add support to catch SparkPlan regression based on TPC-DS queries > > > Key: SPARK-32466 > URL: https://issues.apache.org/jira/browse/SPARK-32466 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Nowadays, Spark is getting more and more complex. Any change might cause > a regression unintentionally. Spark already has benchmarks to catch > performance regressions, but it doesn't yet have a way to detect regressions > inside a SparkPlan. It would be good if we could find possible regressions > early, during the compile phase, before the runtime phase.
[jira] [Commented] (SPARK-28004) Update jquery to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-28004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166411#comment-17166411 ] Sean R. Owen commented on SPARK-28004: -- I hadn't planned to, as it's a non-trivial change, and it wasn't obvious that the CVEs (so far) affect Spark. It's not clear they don't, either, and it is valid to back-port security-related updates. I'd generally suggest people move towards Spark 3 at this point. But I'd look at a back-port PR that someone has gotten working on 2.4.x. > Update jquery to 3.4.1 > -- > > Key: SPARK-28004 > URL: https://issues.apache.org/jira/browse/SPARK-28004 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.0.0 > > > We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 > to keep up in general, but also to keep up with CVEs. In fact, we know of at > least one resolved in only 3.4.0+ > (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, > but, if the update isn't painful, maybe worthwhile in order to make future > 3.x updates easier. > jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to > maintain compatibility with 1.9+ > (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) > 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard > to evaluate each one, but the most likely area for problems is in ajax(). > However, our usage of jQuery (and plugins) is pretty simple. > I've tried updating and testing the UI, and can't see any warnings, errors, > or problematic functionality. This includes the Spark UI, master UI, worker > UI, and docs (well, I wasn't able to build R docs)
[jira] [Assigned] (SPARK-32466) Add support to catch SparkPlan regression based on TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-32466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32466: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32466) Add support to catch SparkPlan regression based on TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-32466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166410#comment-17166410 ] Apache Spark commented on SPARK-32466: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/29270
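The idea behind SPARK-32466 can be sketched as comparing freshly generated plan text against checked-in "golden" plans for each TPC-DS query (an illustrative Python sketch; the function, dict-based storage, and plan strings are made up, not Spark's actual test harness):

```python
def check_plan(query_name, actual_plan, golden_plans):
    """Compare a freshly generated plan against the approved one.

    golden_plans: dict mapping query name -> approved plan text (in
    practice these would be checked-in files). Returns None on a match,
    or a short message describing the regression.
    """
    expected = golden_plans.get(query_name)
    if expected is None:
        return f"{query_name}: no approved plan; regenerate goldens"
    if actual_plan.strip() != expected.strip():
        return (f"{query_name}: plan changed -- expected:\n{expected}\n"
                f"got:\n{actual_plan}")
    return None

goldens = {"q1": "SortMergeJoin\n  Scan A\n  Scan B"}
assert check_plan("q1", "SortMergeJoin\n  Scan A\n  Scan B", goldens) is None
assert check_plan("q1", "BroadcastHashJoin\n  Scan A\n  Scan B", goldens) is not None
```

A test like this fails at plan-generation time, so an unintended plan change (say, a join strategy flip) surfaces before any query is actually executed.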
[jira] [Commented] (SPARK-21784) Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign keys
[ https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166401#comment-17166401 ] Hamad Javed commented on SPARK-21784: - Hey [~ksunitha], I would be really interested in having this feature added to Spark. Having primary keys defined is a strong feature in a lot of traditional databases, and porting it over to Spark would be very useful - we can eliminate a whole class of problems simply by knowing and enforcing primary key constraints. > Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign > keys > -- > > Key: SPARK-21784 > URL: https://issues.apache.org/jira/browse/SPARK-21784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Suresh Thalamati >Priority: Major > > Currently Spark SQL does not have DDL support to define primary key and > foreign key constraints. This Jira is to add DDL support to define primary > key and foreign key informational constraints using ALTER TABLE syntax. These > constraints will be used in query optimization and you can find more details > about this in the spec in SPARK-19842 > *Syntax :* > {code} > ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName] > (PRIMARY KEY (col_names) | > FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)]) > [VALIDATE | NOVALIDATE] [RELY | NORELY] > {code} > Examples : > {code:sql} > ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY > ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES > employee(empno) NOVALIDATE NORELY > {code} > *Constraint name generated by the system:* > {code:sql} > ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY > ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) > VALIDATE RELY; > {code}
[jira] [Resolved] (SPARK-32382) Override table renaming in JDBC dialects
[ https://issues.apache.org/jira/browse/SPARK-32382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32382. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29237 [https://github.com/apache/spark/pull/29237] > Override table renaming in JDBC dialects > > > Key: SPARK-32382 > URL: https://issues.apache.org/jira/browse/SPARK-32382 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > SPARK-32375 adds a new method, renameTable, to JdbcDialect with the default > implementation: > {code:sql} > ALTER TABLE table_name RENAME TO new_table_name; > {code} > which is supported by Oracle, MySQL, MariaDB, PostgreSQL and SQLite, but other > dialects might not support this syntax; for instance, SQL Server uses the > stored procedure sp_rename: > {code:sql} > sp_rename 'table_name', 'new_table_name'; > {code} > The ticket aims to support table renaming in all JDBC dialects.
[jira] [Assigned] (SPARK-32382) Override table renaming in JDBC dialects
[ https://issues.apache.org/jira/browse/SPARK-32382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32382: --- Assignee: Maxim Gekk
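The dialect-override idea in SPARK-32382 boils down to per-dialect SQL generation for renames. A minimal sketch (illustrative Python; Spark's real JdbcDialect API is Scala, and only the two SQL statements below come from the ticket):

```python
def rename_table_sql(dialect, old_name, new_name):
    """Generate the rename statement for a given dialect.

    Most dialects (Oracle, MySQL, MariaDB, PostgreSQL, SQLite) accept
    the default ALTER TABLE ... RENAME TO syntax; SQL Server instead
    uses the sp_rename stored procedure, so its dialect overrides the
    default. The dialect names here are illustrative.
    """
    if dialect == "mssql":
        return f"sp_rename '{old_name}', '{new_name}';"
    return f"ALTER TABLE {old_name} RENAME TO {new_name};"

assert rename_table_sql("postgresql", "t", "t2") == "ALTER TABLE t RENAME TO t2;"
assert rename_table_sql("mssql", "t", "t2") == "sp_rename 't', 't2';"
```

In Spark the same dispatch is achieved by each dialect subclass overriding the default `renameTable` implementation rather than branching on a dialect string.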