[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-32474:
--------------------------------
    Target Version/s:   (was: 3.0.0)

> NullAwareAntiJoin multi-column support
> --------------------------------------
>
>                 Key: SPARK-32474
>                 URL: https://issues.apache.org/jira/browse/SPARK-32474
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Minor
>             Fix For: 3.1.0
>
> This is a follow-up improvement to SPARK-32290.
> In SPARK-32290 we optimized NAAJ from BroadcastNestedLoopJoin to
> BroadcastHashJoin, improving the total computation from O(M*N) to O(M),
> but only for the single-column case, because multi-column support is
> much more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>
> FYI, the logic for the single- and multi-column cases is exercised by
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>
> To support multiple columns, I propose the following idea to see whether
> multi-column support is worth the trade-off. I would need to do some data
> expansion in HashedRelation, and I would call this new type of
> HashedRelation a NullAwareHashedRelation.
>
> In a NullAwareHashedRelation, keys with null columns are allowed, unlike
> in LongHashedRelation and UnsafeHashedRelation, and a single key may be
> expanded into 2^N - 1 records (N is the number of key columns). For
> example, when the record (1, 2, 3) is about to be inserted into a
> NullAwareHashedRelation, we take every C(1,3) and C(2,3) combination of
> column positions, copy the original key row, set null at the chosen
> positions, and insert the copy. Including the original key row, 7 key
> rows are inserted, as follows:
>
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
>
> With the expanded data we can extract a common pattern for both the
> single- and multi-column cases. allNull refers to an UnsafeRow whose
> columns are all null.
> * buildSide input is empty => return all rows
> * a key with all null columns exists in the buildSide input => reject all rows
> * streamedSideRow.allNull is true => drop the row
> * streamedSideRow.allNull is false and a match is found in the NullAwareHashedRelation => drop the row
> * streamedSideRow.allNull is false and no match is found in the NullAwareHashedRelation => return the row
>
> This solution expands the buildSide data by up to 2^N - 1 times, but
> since NAAJ keys in typical production queries have only 2-3 columns, I
> assume it is acceptable to expand the buildSide data to around 7x. I
> would also cap the number of supported key columns for NAAJ at 3.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
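[Editorial note: the following is not part of the ticket and the names in it are hypothetical.] The 2^N - 1 key expansion proposed above can be sketched as a bitmask enumeration, modelling a key row as a Seq[Option[A]]:

```scala
object NullAwareKeyExpansion {
  // Return every variant of `key` obtained by nulling out a subset of its
  // columns, excluding the all-null variant: 2^N - 1 rows in total,
  // including the original key row itself (mask 0).
  def expand[A](key: Seq[Option[A]]): Seq[Seq[Option[A]]] = {
    val n = key.length
    // Bit i of `mask` selects column i to be overwritten with null;
    // the all-ones mask (every column null) is deliberately skipped.
    (0 until (1 << n) - 1).map { mask =>
      key.zipWithIndex.map { case (v, i) =>
        if ((mask & (1 << i)) != 0) None else v
      }
    }
  }
}
```

For the example key (1, 2, 3), `expand(Seq(Some(1), Some(2), Some(3)))` yields exactly the 7 rows listed above.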
[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-32474:
--------------------------------
    Fix Version/s:   (was: 3.1.0)
[jira] [Resolved] (SPARK-32473) Use === instead IndexSeqView
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-32473.
-----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 29280
[https://github.com/apache/spark/pull/29280]

> Use === instead IndexSeqView
> ----------------------------
>
>                 Key: SPARK-32473
>                 URL: https://issues.apache.org/jira/browse/SPARK-32473
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core, Tests
>    Affects Versions: 3.1.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>             Fix For: 3.1.0
>
> This issue aims to fix `SorterSuite` and `RadixSortSuite` on Scala 2.13 by
> using `sameElements` instead of `IndexSeqView.==`.
> Scala 2.13 reimplements `IndexSeqView`, and its behavior is different:
> - https://docs.scala-lang.org/overviews/core/collections-migration-213.html
> {code}
> Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
> Type in expressions for evaluation. Or try :help.
> scala> Seq(1,2,3).toArray.view == Seq(1,2,3).toArray.view
> res0: Boolean = true
> {code}
> {code}
> Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
> Type in expressions for evaluation. Or try :help.
> scala> Seq(1,2,3).toArray.view == Seq(1,2,3).toArray.view
> val res0: Boolean = false
> {code}
> {code}
> $ dev/change-scala-version.sh 2.13
> $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.util.collection.unsafe.sort.RadixSortSuite
> ...
> Tests: succeeded 9, failed 36, canceled 0, ignored 0, pending 0
> *** 36 TESTS FAILED ***
> $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.util.collection.SorterSuite
> ...
> Tests: succeeded 3, failed 1, canceled 0, ignored 2, pending 0
> *** 1 TEST FAILED ***
> {code}
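[Editorial note: a minimal self-contained reproduction of the behavior change quoted above, in the fix direction the ticket describes. The object name is made up; the assertion only relies on `sameElements`, which compares contents on both Scala 2.12 and 2.13.]

```scala
object ViewEqualityDemo {
  def main(args: Array[String]): Unit = {
    val a = Seq(1, 2, 3).toArray.view
    val b = Seq(1, 2, 3).toArray.view
    // On Scala 2.12 `a == b` is true; on 2.13 views no longer define
    // structural equality, so `a == b` is false there. Comparing the
    // underlying elements with sameElements holds on both versions:
    assert(a.sameElements(b))
    println(a.sameElements(b))
  }
}
```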
[jira] [Created] (SPARK-32475) java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
jinhai created SPARK-32475:
---------------------------

             Summary: java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
                 Key: SPARK-32475
                 URL: https://issues.apache.org/jira/browse/SPARK-32475
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.0
         Environment: Spark-3.0.0; JDK 8
            Reporter: jinhai

When I compile the spark-core_2.12 module with the command below and then use the resulting spark-core_2.12-3.0.0.jar in place of /jars/spark-core_2.12-3.0.0.jar, this error is reported (without any code changes having been made).

command: ./build/mvn -pl :spark-core_2.12 -Pyarn -Phive -Phive-thriftserver -Dhadoop.version=2.7.4 -DskipTests clean package
version: spark-3.0.0
jdk: 1.8
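[Editorial note: the ticket does not identify the cause; one common source of exactly this NoSuchMethodError, offered here as a hedged possibility, is compiling against JDK 9+ (where ByteBuffer.flip() gained a covariant ByteBuffer return type) and then running on JDK 8, whose class only defines Buffer.flip(). A defensive sketch that pins the call to the JDK 8-compatible signature:]

```scala
import java.nio.{Buffer, ByteBuffer}

object FlipCompat {
  def main(args: Array[String]): Unit = {
    val buf = ByteBuffer.allocate(8)
    buf.putInt(42)
    // Upcasting to Buffer makes the compiler emit a call to
    // Buffer.flip(), which exists on every JDK version, instead of
    // the JDK 9+ override ByteBuffer.flip()Ljava/nio/ByteBuffer;.
    (buf: Buffer).flip()
    println(buf.getInt()) // reads back 42 after the flip
  }
}
```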
[jira] [Updated] (SPARK-32473) Use === instead IndexSeqView
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Summary: Use === instead IndexSeqView  (was: Use === instead IndexSeqView.==)
[jira] [Assigned] (SPARK-32473) Use === instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-32473:
-------------------------------------
    Assignee: Dongjoon Hyun
[jira] [Updated] (SPARK-32473) Use === instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Summary: Use === instead IndexSeqView.==  (was: Use sameElements instead IndexSeqView.==)
[jira] [Updated] (SPARK-32473) Use sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Description: (text as quoted in the resolution email for SPARK-32473 above)
[jira] [Issue Comment Deleted] (SPARK-32473) Use sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Comment: was deleted

(was: User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29280)
[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-32474:
--------------------------------
    Description: (updated to replace the two embedded screenshots of the multi- and single-column NAAJ logic with links to the corresponding SQL test files; the resulting text is as quoted in the first SPARK-32474 email above)
[jira] [Updated] (SPARK-32473) Use sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-32473:
----------------------------------
    Summary: Use sameElements instead IndexSeqView.==  (was: Fix RadixSortSuite by using sameElements instead IndexSeqView.==)
[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-32474:
--------------------------------
    Description: (updated the embedded screenshot references; text otherwise as quoted in the first SPARK-32474 email above)
[jira] [Created] (SPARK-32474) NullAwareAntiJoin multi-column support
Leanken.Lin created SPARK-32474: --- Summary: NullAwareAntiJoin multi-column support Key: SPARK-32474 URL: https://issues.apache.org/jira/browse/SPARK-32474 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin Fix For: 3.1.0 This is a follow up improvement of Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290]. In SPARK-32290, we already optimize NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improve total calculation from O(M*N) to O(M), but it's only targeting on Single Column Case, because it's much more complicate in multi column support. See. http://www.vldb.org/pvldb/vol2/vldb09-423.pdf Section 6 NAAJ multi column logical !image-2020-07-29-11-41-11-939.png! NAAJ single column logical !image-2020-07-29-11-41-03-757.png! For supporting multi column, I throw the following idea and see if is it worth to do multi-column support with some trade off. I would need to do some data expansion in HashedRelation, and i would call this new type of HashedRelation as NullAwareHashedRelation. In NullAwareHashedRelation, key with null column is allowed, which is opposite in LongHashedRelation and UnsafeHashedRelation; And single key might be expanded into 2^N - 1 records, (N refer to columnNum of the key). for example, if there is a record (1 ,2, 3) is about to insert into NullAwareHashedRelation, we take C(1,3), C(2,3) as a combination to copy origin key row, and setNull at target position, and then insert into NullAwareHashedRelation. including the origin key row, there will be 7 key row inserted as follow. (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) (1, 2, 3) with the expanded data we can extract a common pattern for both single and multi column. allNull refer to a unsafeRow which has all null columns. 
* buildSide input is empty => return all rows
* allNull column key exists in the buildSide input => reject all rows
* if streamedSideRow.allNull is true => drop the row
* if streamedSideRow.allNull is false & a match is found in NullAwareHashedRelation => drop the row
* if streamedSideRow.allNull is false & no match is found in NullAwareHashedRelation => return the row
This solution will certainly expand the buildSide data by up to 2^N - 1 times, but since NAAJ normally involves only 2~3 key columns in production queries, I suppose it is acceptable to expand the buildSide data to around 7X. I would also put a limit on the maximum number of key columns supported for NAAJ, basically no more than 3.
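To make the proposal concrete, the key expansion and the five lookup rules above can be sketched in Python. This is an illustration only: the actual implementation would operate on UnsafeRow inside a Scala HashedRelation, and the function names here are hypothetical.

```python
from itertools import combinations

def expand_key(key):
    """Expand one build-side key into the 2^N - 1 rows a
    NullAwareHashedRelation would store: the original row plus every copy
    with 1..N-1 of its positions set to None (the all-null row is excluded;
    an all-null build key is handled by the 'reject all rows' rule)."""
    n = len(key)
    rows = [tuple(key)]
    for k in range(1, n):                      # C(1,N) .. C(N-1,N) combinations
        for positions in combinations(range(n), k):
            rows.append(tuple(None if i in positions else v
                              for i, v in enumerate(key)))
    return rows

def naaj_returns_row(stream_key, relation, build_side_empty, has_all_null_key):
    """Apply the five rules above to decide whether a streamed-side row
    survives the null-aware anti join; `relation` is the expanded key set."""
    if build_side_empty:
        return True                            # empty build side => return all rows
    if has_all_null_key:
        return False                           # all-null key in build side => reject all
    if all(v is None for v in stream_key):
        return False                           # streamed row all-null => drop
    return tuple(stream_key) not in relation   # drop on match, return on no match

relation = set(expand_key((1, 2, 3)))
print(len(relation))  # 2^3 - 1 = 7
```

Note how a streamed row such as (1, null, 3) hits the pre-expanded key (1, null, 3) with a single O(1) hash lookup, which is what lets the multi-column case keep the O(M) shape of the single-column optimization.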
[jira] [Assigned] (SPARK-32283) Multiple Kryo registrators can't be used anymore
[ https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32283: --- Assignee: Lantao Jin
> Multiple Kryo registrators can't be used anymore
>
> Key: SPARK-32283
> URL: https://issues.apache.org/jira/browse/SPARK-32283
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Lorenz Bühmann
> Assignee: Lantao Jin
> Priority: Minor
>
> This is a regression in Spark 3.0, as it works in Spark 2.
> According to the docs, it should be possible to register multiple Kryo registrators via the Spark config option spark.kryo.registrator.
> In Spark 3.0 the code to parse Kryo config options has been refactored into the Scala class [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala].
> The code to parse the registrators is in [Lines 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32]:
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
>   .version("0.5.0")
>   .stringConf
>   .createOptional
> {code}
> but it should be
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
>   .version("0.5.0")
>   .stringConf
>   .toSequence
>   .createOptional
> {code}
> to split the comma-separated list.
> In Spark 2.x this was done directly in [KryoSerializer Lines 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79]:
> {code:scala}
> private val userRegistrators = conf.get("spark.kryo.registrator", "")
>   .split(',').map(_.trim)
>   .filter(!_.isEmpty)
> {code}
> Hope this helps.
[jira] [Resolved] (SPARK-32283) Multiple Kryo registrators can't be used anymore
[ https://issues.apache.org/jira/browse/SPARK-32283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32283. - Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 29123 [https://github.com/apache/spark/pull/29123]
> Multiple Kryo registrators can't be used anymore
>
> Key: SPARK-32283
> URL: https://issues.apache.org/jira/browse/SPARK-32283
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Lorenz Bühmann
> Assignee: Lantao Jin
> Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
> This is a regression in Spark 3.0, as it works in Spark 2.
> According to the docs, it should be possible to register multiple Kryo registrators via the Spark config option spark.kryo.registrator.
> In Spark 3.0 the code to parse Kryo config options has been refactored into the Scala class [Kryo|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala].
> The code to parse the registrators is in [Lines 29-32|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/Kryo.scala#L29-L32]:
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
>   .version("0.5.0")
>   .stringConf
>   .createOptional
> {code}
> but it should be
> {code:scala}
> val KRYO_USER_REGISTRATORS = ConfigBuilder("spark.kryo.registrator")
>   .version("0.5.0")
>   .stringConf
>   .toSequence
>   .createOptional
> {code}
> to split the comma-separated list.
> In Spark 2.x this was done directly in [KryoSerializer Lines 77-79|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L77-L79]:
> {code:scala}
> private val userRegistrators = conf.get("spark.kryo.registrator", "")
>   .split(',').map(_.trim)
>   .filter(!_.isEmpty)
> {code}
> Hope this helps.
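For illustration, the Spark 2.x parsing behavior quoted above (split the value on commas, trim whitespace, drop empty entries) can be sketched in Python:

```python
def parse_registrators(conf_value):
    """Mimic the Spark 2.x parsing of spark.kryo.registrator:
    split the comma-separated value, trim whitespace, drop empty entries."""
    return [name.strip() for name in conf_value.split(",") if name.strip()]

print(parse_registrators("com.a.RegA, com.b.RegB,"))  # ['com.a.RegA', 'com.b.RegB']
```

This is what `.toSequence` restores in Spark 3: without it, the whole comma-separated string is treated as a single registrator class name, and loading it fails.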
[jira] [Resolved] (SPARK-32401) Migrate function related commands to new framework
[ https://issues.apache.org/jira/browse/SPARK-32401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32401. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29198 [https://github.com/apache/spark/pull/29198] > Migrate function related commands to new framework > -- > > Key: SPARK-32401 > URL: https://issues.apache.org/jira/browse/SPARK-32401 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.1.0 > > > Migrate the following function related commands to the new resolution > framework: > * CREATE FUNCTION > * DESCRIBE FUNCTION > * DROP FUNCTION > * SHOW FUNCTIONS -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32401) Migrate function related commands to new framework
[ https://issues.apache.org/jira/browse/SPARK-32401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32401: --- Assignee: Terry Kim > Migrate function related commands to new framework > -- > > Key: SPARK-32401 > URL: https://issues.apache.org/jira/browse/SPARK-32401 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > Migrate the following function related commands to the new resolution > framework: > * CREATE FUNCTION > * DESCRIBE FUNCTION > * DROP FUNCTION > * SHOW FUNCTIONS -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166849#comment-17166849 ] Apache Spark commented on SPARK-32473: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/29280 > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166847#comment-17166847 ] Apache Spark commented on SPARK-32473: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/29280 > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32473: Assignee: (was: Apache Spark) > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32473: Assignee: Apache Spark > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32473) Fix RadixSortSuite by using sameElements instead IndexSeqView.==
[ https://issues.apache.org/jira/browse/SPARK-32473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32473: -- Summary: Fix RadixSortSuite by using sameElements instead IndexSeqView.== (was: Use sameElements instead IndexSeqView.==) > Fix RadixSortSuite by using sameElements instead IndexSeqView.== > > > Key: SPARK-32473 > URL: https://issues.apache.org/jira/browse/SPARK-32473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32473) Use sameElements instead IndexSeqView.==
Dongjoon Hyun created SPARK-32473: - Summary: Use sameElements instead IndexSeqView.== Key: SPARK-32473 URL: https://issues.apache.org/jira/browse/SPARK-32473 Project: Spark Issue Type: Sub-task Components: Spark Core, Tests Affects Versions: 3.1.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
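The pitfall behind this ticket can be shown with a loose Python analogy (this is not the Scala code): comparing a lazy view to a sequence with `==` does not compare the underlying elements, which is why an element-wise comparison such as `sameElements` is needed in the suite.

```python
a = [1, 2, 3]
view = (x for x in a)            # a lazy wrapper, loosely analogous to IndexedSeqView
assert (view == a) is False      # == compares the objects, not their elements
assert list(x for x in a) == a   # materialize first, then compare element-wise
```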
[jira] [Updated] (SPARK-32470) Remove task result size check for shuffle map stage
[ https://issues.apache.org/jira/browse/SPARK-32470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-32470: Affects Version/s: (was: 3.0.0) 3.1.0 > Remove task result size check for shuffle map stage > --- > > Key: SPARK-32470 > URL: https://issues.apache.org/jira/browse/SPARK-32470 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Wei Xue >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32465) How do I get the SPARK shuffle monitoring indicator?
[ https://issues.apache.org/jira/browse/SPARK-32465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32465: -- Target Version/s: (was: 2.2.1)
> How do I get the SPARK shuffle monitoring indicator?
>
> Key: SPARK-32465
> URL: https://issues.apache.org/jira/browse/SPARK-32465
> Project: Spark
> Issue Type: Question
> Components: Shuffle
> Affects Versions: 2.2.1
> Reporter: MOBIN
> Priority: Major
> Labels: Metrics, Monitoring
>
> I want to monitor Spark tasks through graphite_export and Prometheus. I have collected executor, driver CPU, and memory-related indicators, but there are no shuffle read or shuffle write indicators. How should I configure this?
> metrics.properties:
> master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> shuffleService.source.jvm.class=org.apache.spark.metrics.source.JvmSource
[jira] [Updated] (SPARK-32465) How do I get the SPARK shuffle monitoring indicator?
[ https://issues.apache.org/jira/browse/SPARK-32465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32465: -- Fix Version/s: (was: 2.2.1)
> How do I get the SPARK shuffle monitoring indicator?
>
> Key: SPARK-32465
> URL: https://issues.apache.org/jira/browse/SPARK-32465
> Project: Spark
> Issue Type: Question
> Components: Shuffle
> Affects Versions: 2.2.1
> Reporter: MOBIN
> Priority: Major
> Labels: Metrics, Monitoring
>
> I want to monitor Spark tasks through graphite_export and Prometheus. I have collected executor, driver CPU, and memory-related indicators, but there are no shuffle read or shuffle write indicators. How should I configure this?
> metrics.properties:
> master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
> shuffleService.source.jvm.class=org.apache.spark.metrics.source.JvmSource
[jira] [Assigned] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32471: Assignee: Maxim Gekk > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not described.
[jira] [Resolved] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32471. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29275 [https://github.com/apache/spark/pull/29275] > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not described.
[jira] [Reopened] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
[ https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-31525: -- Assignee: (was: Tianshi Zhu) Sorry, I was confused. Let's keep it consistent with the Scala side. Reverted at https://github.com/apache/spark/commit/5491c08bf1d3472e712c0dd88c2881d6496108c0
> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Joshua Hendinata
> Priority: Minor
> Fix For: 3.1.0
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` and the dataframe is empty, it will return *None*, but if you call `df.head(1)` and the dataframe is empty, it will return an *empty list* instead.
> This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which will throw an exception for an empty dataframe.
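The reported behaviour can be modelled in plain Python. This is a hypothetical model of the head logic, not the actual PySpark implementation:

```python
def head(rows, n=None):
    """Model of DataFrame.head: with no argument return the first row or
    None; with an explicit n return a (possibly empty) list of rows."""
    if n is None:
        taken = rows[:1]
        return taken[0] if taken else None  # empty frame -> None
    return rows[:n]                         # empty frame -> []

print(head([], None))  # None
print(head([], 1))     # []
```

With this shape, `len(head([]))` raises a TypeError because `len(None)` is undefined, which is exactly the confusion the report describes.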
[jira] [Updated] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
[ https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31525: - Fix Version/s: (was: 3.1.0)
> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Joshua Hendinata
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` and the dataframe is empty, it will return *None*, but if you call `df.head(1)` and the dataframe is empty, it will return an *empty list* instead.
> This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which will throw an exception for an empty dataframe.
[jira] [Assigned] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
[ https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31525: Assignee: Apache Spark
> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Joshua Hendinata
> Assignee: Apache Spark
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` and the dataframe is empty, it will return *None*, but if you call `df.head(1)` and the dataframe is empty, it will return an *empty list* instead.
> This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which will throw an exception for an empty dataframe.
[jira] [Assigned] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
[ https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31525: Assignee: (was: Apache Spark)
> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Joshua Hendinata
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` and the dataframe is empty, it will return *None*, but if you call `df.head(1)` and the dataframe is empty, it will return an *empty list* instead.
> This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which will throw an exception for an empty dataframe.
[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166823#comment-17166823 ] Apache Spark commented on SPARK-31418: -- User 'venkata91' has created a pull request for this issue: https://github.com/apache/spark/pull/29279
> Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0, 2.4.5
> Reporter: Venkata krishnan Sowrirajan
> Assignee: Venkata krishnan Sowrirajan
> Priority: Major
> Fix For: 3.1.0
>
> With Spark blacklisting, if a task fails on an executor, the executor gets blacklisted for the task. In order to retry the task, Spark checks whether there is an idle blacklisted executor that can be killed and replaced to retry the task; if not, it aborts the job without doing the max retries.
> In the context of dynamic allocation this can be handled better: instead of killing the blacklisted idle executor (it's possible there are no idle blacklisted executors), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below; the example should fail eventually, but it shows that the task is not retried spark.task.maxFailures times:
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are various other cases where this can fail as well.
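The proposed change to the scheduler's decision can be sketched as a small Python function. The names are hypothetical and this is a gross simplification of the actual scheduling logic:

```python
def blacklisted_task_action(idle_blacklisted_executors, dynamic_allocation):
    """Choose what to do when a task's remaining executors are all blacklisted."""
    if idle_blacklisted_executors > 0:
        return "kill-and-replace"      # existing behavior: recycle an idle executor
    if dynamic_allocation:
        return "request-new-executor"  # proposed: acquire new capacity, then retry
    return "abort-job"                 # existing behavior when nothing is idle
```

The third branch is the reported problem: with dynamic allocation there are often no idle blacklisted executors, so the job is aborted before the task has been retried spark.task.maxFailures times.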
[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166821#comment-17166821 ] Apache Spark commented on SPARK-31418: -- User 'venkata91' has created a pull request for this issue: https://github.com/apache/spark/pull/29279
> Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0, 2.4.5
> Reporter: Venkata krishnan Sowrirajan
> Assignee: Venkata krishnan Sowrirajan
> Priority: Major
> Fix For: 3.1.0
>
> With Spark blacklisting, if a task fails on an executor, the executor gets blacklisted for the task. In order to retry the task, Spark checks whether there is an idle blacklisted executor that can be killed and replaced to retry the task; if not, it aborts the job without doing the max retries.
> In the context of dynamic allocation this can be handled better: instead of killing the blacklisted idle executor (it's possible there are no idle blacklisted executors), request an additional executor and retry the task.
> This can be easily reproduced with a simple job like the one below; the example should fail eventually, but it shows that the task is not retried spark.task.maxFailures times:
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are various other cases where this can fail as well.
[jira] [Commented] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166806#comment-17166806 ] Apache Spark commented on SPARK-32160: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/29278
> Executors should not be able to create SparkContext.
>
> Key: SPARK-32160
> URL: https://issues.apache.org/jira/browse/SPARK-32160
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Takuya Ueshin
> Assignee: Takuya Ueshin
> Priority: Major
> Fix For: 3.0.1, 3.1.0
>
> Currently executors can create SparkContext, but shouldn't be able to create it.
> {code:scala}
> sc.range(0, 1).foreach { _ =>
>   new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
> }
> {code}
[jira] [Commented] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166807#comment-17166807 ] Apache Spark commented on SPARK-32160: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/29278
> Executors should not be able to create SparkContext.
>
> Key: SPARK-32160
> URL: https://issues.apache.org/jira/browse/SPARK-32160
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Takuya Ueshin
> Assignee: Takuya Ueshin
> Priority: Major
> Fix For: 3.0.1, 3.1.0
>
> Currently executors can create SparkContext, but shouldn't be able to create it.
> {code:scala}
> sc.range(0, 1).foreach { _ =>
>   new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
> }
> {code}
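A minimal sketch of the kind of guard this fix adds, as a hypothetical Python model (the real check lives in SparkContext's Scala constructor and consults the task context):

```python
def assert_on_driver(in_task):
    """Refuse to construct a context from inside a running task."""
    if in_task:
        raise RuntimeError(
            "SparkContext should only be created and accessed on the driver.")

class SparkContextModel:
    """Toy stand-in for SparkContext: executors pass in_task=True and fail fast."""
    def __init__(self, in_task=False):
        assert_on_driver(in_task)
```

With such a guard, the `foreach` snippet quoted above would fail immediately with a clear error instead of silently building a second context inside an executor.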
[jira] [Commented] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166801#comment-17166801 ] Felix Cheung commented on SPARK-20684: -- [https://github.com/apache/spark/pull/17941#issuecomment-301669567] [https://github.com/apache/spark/pull/19176#issuecomment-328292002] [https://github.com/apache/spark/pull/19176#issuecomment-328292789] > expose createOrReplaceGlobalTempView/createGlobalTempView and > dropGlobalTempView in SparkR > -- > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Priority: Major > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application.
[jira] [Comment Edited] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166581#comment-17166581 ] Felix Cheung edited comment on SPARK-12172 at 7/29/20, 1:20 AM: These are methods (map etc) that were never public and not supported. They were not callable unless you directly referenced the internal namespace Spark::: was (Author: felixcheung): These are methods (map etc) that were never public and not supported. > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Task > Components: SparkR >Reporter: Felix Cheung >Priority: Major >
[jira] [Comment Edited] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166581#comment-17166581 ] Felix Cheung edited comment on SPARK-12172 at 7/29/20, 1:19 AM: These are methods (map etc) that were never public and not supported. was (Author: felixcheung): These are methods (map etc) that were never public and not supported. On Tue, Jul 28, 2020 at 10:18 AM S Daniel Zafar (Jira) > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Task > Components: SparkR >Reporter: Felix Cheung >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32472) Expose confusion matrix elements by threshold in BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-32472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Moore updated SPARK-32472: Description: Currently, the only thresholded metrics available from BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly through roc()) the false positive rate. Unfortunately, you can't always compute the individual thresholded confusion matrix elements (TP, FP, TN, FN) from these quantities. You can make a system of equations out of the existing thresholded metrics and the total count, but they become underdetermined when there are no true positives. Fortunately, the individual confusion matrix elements by threshold are already computed and sitting in the confusions variable. It would be helpful to expose these elements directly. The easiest way would probably be by adding methods like {code:java} def truePositivesByThreshold(): RDD[(Double, Double)] = confusions.map{ case (t, c) => (t, c.weightedTruePositives) }{code} An alternative could be to expose the entire RDD[(Double, BinaryConfusionMatrix)] in one method, but BinaryConfusionMatrix is also currently package private. The closest issue to this I found was this one for adding new calculations to BinaryClassificationMetrics https://issues.apache.org/jira/browse/SPARK-18844, which was closed without any changes being merged. was: Currently, the only thresholded metrics available from BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly through roc()) the false positive rate. Unfortunately, you can't always compute the individual thresholded confusion matrix elements (TP, FP, TN, FN) from these quantities. You can make a system of equations out of the existing thresholded metrics and the total count, but they become underdetermined when there are no true positives. Fortunately, the individual confusion matrix elements by threshold are already computed and sitting in the `confusions` variable. 
It would be helpful to expose these elements directly. The easiest way would probably be by adding methods like {code:java} def truePositivesByThreshold(): RDD[(Double, Double)] = confusions.map{ case (t, c) => (t, c.weightedTruePositives) }{code} An alternative could be to expose the entire RDD[(Double, BinaryConfusionMatrix)] in one method, but BinaryConfusionMatrix is also currently package private. The closest issue to this I found was this one for adding new calculations to BinaryClassificationMetrics https://issues.apache.org/jira/browse/SPARK-18844, which was closed without any changes being merged. > Expose confusion matrix elements by threshold in BinaryClassificationMetrics > > > Key: SPARK-32472 > URL: https://issues.apache.org/jira/browse/SPARK-32472 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Kevin Moore >Priority: Minor > > Currently, the only thresholded metrics available from > BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly > through roc()) the false positive rate. > Unfortunately, you can't always compute the individual thresholded confusion > matrix elements (TP, FP, TN, FN) from these quantities. You can make a system > of equations out of the existing thresholded metrics and the total count, but > they become underdetermined when there are no true positives. > Fortunately, the individual confusion matrix elements by threshold are > already computed and sitting in the confusions variable. It would be helpful > to expose these elements directly. The easiest way would probably be by > adding methods like > {code:java} > def truePositivesByThreshold(): RDD[(Double, Double)] = confusions.map{ case > (t, c) => (t, c.weightedTruePositives) }{code} > An alternative could be to expose the entire RDD[(Double, > BinaryConfusionMatrix)] in one method, but BinaryConfusionMatrix is also > currently package private. 
> The closest issue to this I found was this one for adding new calculations to > BinaryClassificationMetrics > https://issues.apache.org/jira/browse/SPARK-18844, which was closed without > any changes being merged. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
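The proposal above is small enough to model outside Spark. Below is a hedged Python sketch of the idea: compute the per-threshold confusion-matrix elements directly instead of back-solving them from precision/recall, which is underdetermined when there are no true positives. A plain dict stands in for the RDD and a namedtuple for the package-private BinaryConfusionMatrix; all names here are illustrative, not Spark's API.

```python
from collections import namedtuple

# Hypothetical stand-in for Spark's package-private BinaryConfusionMatrix.
Confusion = namedtuple("Confusion", ["tp", "fp", "tn", "fn"])

# (score, label) pairs; label 1 marks a positive example.
scored = [(0.9, 1), (0.8, 0), (0.6, 1), (0.4, 0), (0.2, 0)]

def confusions_by_threshold(pairs):
    """One confusion matrix per distinct score threshold (predict positive
    when score >= t), mirroring the layout of the `confusions` variable."""
    out = {}
    for t in sorted({s for s, _ in pairs}, reverse=True):
        tp = sum(1 for s, l in pairs if s >= t and l == 1)
        fp = sum(1 for s, l in pairs if s >= t and l == 0)
        fn = sum(1 for s, l in pairs if s < t and l == 1)
        tn = sum(1 for s, l in pairs if s < t and l == 0)
        out[t] = Confusion(tp, fp, tn, fn)
    return out

def true_positives_by_threshold(pairs):
    """The accessor shape proposed in the ticket, dict standing in for RDD."""
    return {t: c.tp for t, c in confusions_by_threshold(pairs).items()}
```

With the elements exposed this way, any downstream metric (including ones the built-in API omits) becomes a one-line map over the per-threshold matrices.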
[jira] [Created] (SPARK-32472) Expose confusion matrix elements by threshold in BinaryClassificationMetrics
Kevin Moore created SPARK-32472: --- Summary: Expose confusion matrix elements by threshold in BinaryClassificationMetrics Key: SPARK-32472 URL: https://issues.apache.org/jira/browse/SPARK-32472 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 3.0.0 Reporter: Kevin Moore Currently, the only thresholded metrics available from BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly through `roc()`) the false positive rate. Unfortunately, you can't always compute the individual thresholded confusion matrix elements (TP, FP, TN, FN) from these quantities. You can make a system of equations out of the existing thresholded metrics and the total count, but they become underdetermined when there are no true positives. Fortunately, the individual confusion matrix elements by threshold are already computed and sitting in the `confusions` variable. It would be helpful to expose these elements directly. The easiest way would probably be by adding methods like `def truePositivesByThreshold(): RDD[(Double, Double)] = confusions.map\{ case (t, c) => (t, c.weightedTruePositives) }`. An alternative could be to expose the entire `RDD[(Double, BinaryConfusionMatrix)]` in one method, but `BinaryConfusionMatrix` is also currently package private. The closest issue to this I found was this one for adding new calculations to BinaryClassificationMetrics https://issues.apache.org/jira/browse/SPARK-18844, which was closed without any changes being merged. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-32140: --- Description: Add summary and training summary to FMClassificationModel. (was: Add summary and training summary to ) > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > Add summary and training summary to FMClassificationModel. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32140) Add summary to FMClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-32140: --- Description: Add summary and training summary to > Add summary to FMClassificationModel > > > Key: SPARK-32140 > URL: https://issues.apache.org/jira/browse/SPARK-32140 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > Add summary and training summary to -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32449) Add summary to MultilayerPerceptronClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-32449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-32449: --- Description: Add summary and training summary to MultilayerPerceptronClassificationModel > Add summary to MultilayerPerceptronClassificationModel > -- > > Key: SPARK-32449 > URL: https://issues.apache.org/jira/browse/SPARK-32449 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Add summary and training summary to MultilayerPerceptronClassificationModel -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32139) Unify Classification Training Summary
[ https://issues.apache.org/jira/browse/SPARK-32139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-32139: --- Description: Add classification model summary and training summary to each of the classification algorithms. The classification model summary basically gives a summary of all the classification algorithms' evaluation metrics, such as accuracy/precision/recall. The training summary describes information about model training iterations and the objective function (scaled loss + regularization) at each iteration. This is very useful information for users. > Unify Classification Training Summary > - > > Key: SPARK-32139 > URL: https://issues.apache.org/jira/browse/SPARK-32139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add classification model summary and training summary to each of the > classification algorithms. The classification model summary basically gives a > summary of all the classification algorithms' evaluation metrics, such as > accuracy/precision/recall. The training summary describes information about > model training iterations and the objective function (scaled loss + > regularization) at each iteration. This is very useful information for > users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
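For concreteness, the evaluation metrics such a summary reports reduce to simple ratios over the binary confusion matrix. A minimal sketch with illustrative names (this is the arithmetic, not Spark's summary API):

```python
# (label, prediction) pairs for a binary classifier; data is made up.
pairs = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]

tp = sum(1 for l, p in pairs if l == 1 and p == 1)  # true positives
fp = sum(1 for l, p in pairs if l == 0 and p == 1)  # false positives
fn = sum(1 for l, p in pairs if l == 1 and p == 0)  # false negatives
tn = sum(1 for l, p in pairs if l == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)   # fraction of correct predictions
precision = tp / (tp + fp)          # of predicted positives, how many are real
recall = tp / (tp + fn)             # of real positives, how many were found
```

A unified summary object would expose these (and the per-iteration objective history for the training summary) behind one consistent interface across classifiers.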
[jira] [Commented] (SPARK-32421) Add code-gen for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1715#comment-1715 ] Apache Spark commented on SPARK-32421: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/29277 > Add code-gen for shuffled hash join > --- > > Key: SPARK-32421 > URL: https://issues.apache.org/jira/browse/SPARK-32421 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > We added shuffled hash join codegen internally in our fork, and are seeing > an obvious improvement in benchmarks compared to the current non-codegen code > path. Creating this Jira to add this support. Shuffled hash join codegen is very > similar to broadcast hash join codegen. So this is a simple change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
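For readers unfamiliar with the operator: the loop that codegen specializes is the classic build-then-probe hash join, run per shuffle partition. A minimal Python sketch of the general technique (plain lists stand in for Spark's rows and iterators; this is not Spark's generated code):

```python
from collections import defaultdict

def hash_join(build, stream):
    """Per-partition core of a shuffled hash join: build a hash table over the
    (already shuffled) build side, then probe it with each streamed row."""
    table = defaultdict(list)
    for k, a in build:            # build phase: one pass over the smaller side
        table[k].append(a)
    out = []
    for k, b in stream:           # probe phase: emit one row per matching pair
        for a in table.get(k, []):
            out.append((k, a, b))
    return out
```

Code generation flattens this loop (and the key hashing/comparison inside it) into specialized bytecode instead of going through iterator and row abstractions, which is where the benchmark win comes from.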
[jira] [Assigned] (SPARK-32421) Add code-gen for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32421: Assignee: (was: Apache Spark) > Add code-gen for shuffled hash join > --- > > Key: SPARK-32421 > URL: https://issues.apache.org/jira/browse/SPARK-32421 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > We added shuffled hash join codegen internally in our fork, and are seeing > an obvious improvement in benchmarks compared to the current non-codegen code > path. Creating this Jira to add this support. Shuffled hash join codegen is very > similar to broadcast hash join codegen. So this is a simple change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32421) Add code-gen for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32421: Assignee: Apache Spark > Add code-gen for shuffled hash join > --- > > Key: SPARK-32421 > URL: https://issues.apache.org/jira/browse/SPARK-32421 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Trivial > > We added shuffled hash join codegen internally in our fork, and are seeing > an obvious improvement in benchmarks compared to the current non-codegen code > path. Creating this Jira to add this support. Shuffled hash join codegen is very > similar to broadcast hash join codegen. So this is a simple change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30817) SparkR ML algorithms parity
[ https://issues.apache.org/jira/browse/SPARK-30817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1713#comment-1713 ] Maciej Szymkiewicz commented on SPARK-30817: [~dan_z] I think we're in-sync ATM and this specific issue should be resolved. > SparkR ML algorithms parity > > > Key: SPARK-30817 > URL: https://issues.apache.org/jira/browse/SPARK-30817 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > As of 3.0 the following algorithms are missing form SparkR > * {{LinearRegression}} > * {{FMRegressor}} (Added to ML in 3.0) > * {{FMClassifier}} (Added to ML in 3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32470) Remove task result size check for shuffle map stage
[ https://issues.apache.org/jira/browse/SPARK-32470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166643#comment-17166643 ] Apache Spark commented on SPARK-32470: -- User 'maryannxue' has created a pull request for this issue: https://github.com/apache/spark/pull/29276 > Remove task result size check for shuffle map stage > --- > > Key: SPARK-32470 > URL: https://issues.apache.org/jira/browse/SPARK-32470 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wei Xue >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32470) Remove task result size check for shuffle map stage
[ https://issues.apache.org/jira/browse/SPARK-32470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32470: Assignee: (was: Apache Spark) > Remove task result size check for shuffle map stage > --- > > Key: SPARK-32470 > URL: https://issues.apache.org/jira/browse/SPARK-32470 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wei Xue >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32470) Remove task result size check for shuffle map stage
[ https://issues.apache.org/jira/browse/SPARK-32470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32470: Assignee: Apache Spark > Remove task result size check for shuffle map stage > --- > > Key: SPARK-32470 > URL: https://issues.apache.org/jira/browse/SPARK-32470 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166638#comment-17166638 ] Apache Spark commented on SPARK-32471: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29275 > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not > described. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32471: Assignee: Apache Spark > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not > described. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
[ https://issues.apache.org/jira/browse/SPARK-32471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32471: Assignee: (was: Apache Spark) > Describe JSON option `allowNonNumericNumbers` > - > > Key: SPARK-32471 > URL: https://issues.apache.org/jira/browse/SPARK-32471 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > > The JSON datasource supports the `allowNonNumericNumbers` option, but it is not > described. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32471) Describe JSON option `allowNonNumericNumbers`
Maxim Gekk created SPARK-32471: -- Summary: Describe JSON option `allowNonNumericNumbers` Key: SPARK-32471 URL: https://issues.apache.org/jira/browse/SPARK-32471 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The JSON datasource supports the `allowNonNumericNumbers` option, but it is not described. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
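For context on what the option governs: standard JSON has no literal for NaN or the infinities, so a strict parser must reject such tokens, while a permissive parser maps them onto IEEE-754 specials. The sketch below illustrates the target behavior with Python's json module, which happens to be permissive by default; the exact token set Spark accepts is defined by its Jackson parser, not by this sketch, and the option itself would be set at read time (e.g. `spark.read.option("allowNonNumericNumbers", "true").json(path)`).

```python
import json
import math

# What a permissive JSON number parser (allowNonNumericNumbers=true in Spark's
# terms) does with non-numeric number tokens: map them to IEEE-754 specials.
vals = json.loads('[NaN, Infinity, -Infinity]')
# vals[0] is nan, vals[1] is +inf, vals[2] is -inf
```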
[jira] [Commented] (SPARK-32397) Snapshot artifacts can have differing timestamps, making it hard to consume
[ https://issues.apache.org/jira/browse/SPARK-32397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166632#comment-17166632 ] Apache Spark commented on SPARK-32397: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/29274 > Snapshot artifacts can have differing timestamps, making it hard to consume > --- > > Key: SPARK-32397 > URL: https://issues.apache.org/jira/browse/SPARK-32397 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1, 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Since we use multiple subcomponents in building Spark, we can get into a > situation where the timestamps for these components are different. This can > make it difficult to consume Spark snapshots in an environment where someone > is running a nightly build for other folks to develop on top of. > I believe I have a small fix for this already, but am just waiting to verify it, and > then I'll open a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32397) Snapshot artifacts can have differing timestamps, making it hard to consume
[ https://issues.apache.org/jira/browse/SPARK-32397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32397: Assignee: Apache Spark (was: Holden Karau) > Snapshot artifacts can have differing timestamps, making it hard to consume > --- > > Key: SPARK-32397 > URL: https://issues.apache.org/jira/browse/SPARK-32397 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1, 3.1.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Minor > > Since we use multiple subcomponents in building Spark, we can get into a > situation where the timestamps for these components are different. This can > make it difficult to consume Spark snapshots in an environment where someone > is running a nightly build for other folks to develop on top of. > I believe I have a small fix for this already, but am just waiting to verify it, and > then I'll open a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32397) Snapshot artifacts can have differing timestamps, making it hard to consume
[ https://issues.apache.org/jira/browse/SPARK-32397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32397: Assignee: Holden Karau (was: Apache Spark) > Snapshot artifacts can have differing timestamps, making it hard to consume > --- > > Key: SPARK-32397 > URL: https://issues.apache.org/jira/browse/SPARK-32397 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.1, 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > Since we use multiple sub components in building Spark we can get into a > situation where the timestamps for these components is different. This can > make it difficult to consume Spark snapshots in an environment where someone > is running a nightly build for other folks to develop on top of. > I believe I have a small fix for this already, but just waiting to verify and > then I'll open a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166627#comment-17166627 ] Thomas Graves edited comment on SPARK-32429 at 7/28/20, 6:47 PM: - Yes so for this first implementation we didn't really address users selecting different types of GPUs, but I think the design is generic enough to handle but requires extra support from the cluster manager. Otherwise I think it's left to the user to discover the details on the GPU. So I think the scenario you are talking about is a Worker has multiple gpus of different types so for it to discover them we would either have to explicitly add support for a "type" (spark.executor.resource.gpu.type) or you have 2 custom resources (k80/v100), which for standalone mode would be fine because you just supply the Worker with different discovery scripts and then like you say the application would request one or the other type of resource. The application just needs to know to request the custom resources vs just "gpu". I think there are a few ways we could make this generic. One to make it completely generic is to make it a plugin that would run before launching executors and python processes. spark.worker.resource.XX.launchPlugin = someClass. You could pass the env and resources into each one and it could set whatever it needs. There are less generic ways if you want Spark to know more about CUDA. What do you think of something like this? was (Author: tgraves): Yes so for this first implementation we didn't really address users selecting different types of GPUs, but I think the design is generic enough to handle but requires extra support from the cluster manager. Otherwise I think it's left to the user to discover the details on the GPU. 
So I think the scenario you are talking about is a Worker has multiple gpus of different types so for it to discover them we would either have to explicitly add support for a "type" (spark.executor.resource.gpu.type) or you have 2 custom resources (k80/v100), which for standalone mode would be fine because you just supply the Worker with different discovery scripts and then like you say the application would request one or the other type of resource. The application just needs to know to request the custom resources vs just "gpu". I think there are a few ways we could make this generic. One to make it completely generic is to make it a plugin that would run before launching executors and python processes. spark.worker.resource.XX.launchPlugins = someClass,anotherOne. You could pass the env and resources into each one and it could set whatever it needs. There are less generic ways if you want Spark to know more about CUDA. What do you think of something like this? > Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. 
> * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
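The launch-time handling sketched in the comment above (pin the executor to its assigned GPUs, then renumber the addresses so they start from device id 0 again) could look roughly like the following. This is an illustrative sketch only: `build_launch_env` is a hypothetical helper, not a Spark API, and the `spark.worker.resource.XX.launchPlugins` setting named in the comment is a proposal, not an existing config.

```python
import os

def build_launch_env(assigned_addresses):
    """Sketch of what a worker-side launch hook might do with the GPU
    addresses assigned to one executor: restrict the process to those
    devices via CUDA_VISIBLE_DEVICES, and remap the addresses Spark
    reports, since CUDA renumbers the visible devices from 0."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(assigned_addresses)
    # Inside the launched process, physical device "2" now appears as
    # device 0, "5" as device 1, and so on.
    remapped = [str(i) for i in range(len(assigned_addresses))]
    return env, remapped

env, addresses = build_launch_env(["2", "5"])
print(env["CUDA_VISIBLE_DEVICES"], addresses)  # 2,5 ['0', '1']
```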
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166627#comment-17166627 ] Thomas Graves commented on SPARK-32429: --- Yes, for this first implementation we didn't really address users selecting different types of GPUs, but I think the design is generic enough to handle it; it just requires extra support from the cluster manager. Otherwise it's left to the user to discover the details of the GPU. In the scenario you are talking about, a Worker has multiple GPUs of different types. For it to discover them, we would either have to explicitly add support for a "type" (spark.executor.resource.gpu.type), or you would have two custom resources (k80/v100). For standalone mode that would be fine, because you just supply the Worker with different discovery scripts, and then, as you say, the application would request one or the other type of resource. The application just needs to know to request the custom resources rather than just "gpu". I think there are a few ways we could make this generic. One way to make it completely generic is a plugin that would run before launching executors and python processes: spark.worker.resource.XX.launchPlugins = someClass,anotherOne. You could pass the env and resources into each one, and it could set whatever it needs. There are less generic ways if you want Spark to know more about CUDA. What do you think of something like this? > Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. 
> * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
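The two-GPU-type scenario discussed above (k80/v100 exposed as separate custom resources, each with its own discovery script) can be sketched with the JSON shape that Spark resource discovery scripts print on stdout. The resource names and addresses here are illustrative.

```python
import json

def discovery_output(name, addresses):
    """Build the JSON a Spark resource discovery script prints: a single
    object with the resource name and its device addresses."""
    return json.dumps({"name": name, "addresses": addresses})

# One discovery script per GPU generation, so an application can request
# "k80" or "v100" as a custom resource instead of the generic "gpu".
print(discovery_output("k80", ["0"]))
print(discovery_output("v100", ["1", "2"]))
```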
[jira] [Comment Edited] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
[ https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166608#comment-17166608 ] StanislavKo edited comment on SPARK-28001 at 7/28/20, 6:04 PM: --- +1 OS: Windows 10 Python: 3.7 PySpark: (2.4.0 in requirements.txt, 2.4.3 in $SPARK_HOME) Cluster manager: Spark Standalone was (Author: stanislavko): +1 OS: Windows 10 Python: 3.7 PySpark: 2.4.3 Cluster manager: Spark Standalone > Dataframe throws 'socket.timeout: timed out' exception > -- > > Key: SPARK-28001 > URL: https://issues.apache.org/jira/browse/SPARK-28001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: Processor: Intel Core i7-7700 CPU @ 3.60Ghz > RAM: 16 GB > OS: Windows 10 Enterprise 64-bit > Python: 3.7.2 > PySpark: 3.4.3 > Cluster manager: Spark Standalone >Reporter: Marius Stanescu >Priority: Critical > > I load data from Azure Table Storage, create a DataFrame and perform a couple > of operations via two user-defined functions, then call show() to display the > results. If I load a very small batch of items, like 5, everything is working > fine, but if I load a batch grater then 10 items from Azure Table Storage > then I get the 'socket.timeout: timed out' exception. 
> Here is the code: > > {code} > import time > import json > import requests > from requests.auth import HTTPBasicAuth > from azure.cosmosdb.table.tableservice import TableService > from azure.cosmosdb.table.models import Entity > from pyspark.sql import SparkSession > from pyspark.sql.functions import udf, struct > from pyspark.sql.types import BooleanType > def main(): > batch_size = 25 > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > spark = SparkSession \ > .builder \ > .appName(agent_name) \ > .config("spark.sql.crossJoin.enabled", "true") \ > .getOrCreate() > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) > continuation_token = None > while True: > messages = table_service.query_entities( > azure_table_name, > select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp", > num_results=batch_size, > marker=continuation_token, > timeout=60) > continuation_token = messages.next_marker > messages_list = list(messages) > > if not len(messages_list): > time.sleep(5) > pass > > messages_df = spark.createDataFrame(messages_list) > > register_records_df = messages_df \ > .withColumn('Registered', register_record('RowKey', > 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp')) > > only_registered_records_df = register_records_df \ > .filter(register_records_df.Registered == True) \ > .drop(register_records_df.Registered) > > update_message_status_df = only_registered_records_df \ > .withColumn('TableEntryDeleted', delete_table_entity('RowKey', > 'PartitionKey')) > > results_df = update_message_status_df.select( > update_message_status_df.RowKey, > update_message_status_df.PartitionKey, > update_message_status_df.TableEntryDeleted) > #results_df.explain() > results_df.show(n=batch_size, truncate=False) > @udf(returnType=BooleanType()) > def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp): > # call an API > try: > url = 
'{}/data/record/{}'.format('***', rowKey) > headers = { 'Content-type': 'application/json' } > response = requests.post( > url, > headers=headers, > auth=HTTPBasicAuth('***', '***'), > data=prepare_record_data(rowKey, partitionKey, > messageId, ownerSmtp, timestamp)) > > return bool(response) > except: > return False > def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, > timestamp): > record_data = { > "Title": messageId, > "Type": '***', > "Source": '***', > "Creator": ownerSmtp, > "Publisher": '***', > "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ') > } > return json.dumps(record_data) > @udf(returnType=BooleanType()) > def delete_table_entity(row_key, partition_key): > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > try: > table_service = TableService(account_name=azure_table_account_name, >
[jira] [Created] (SPARK-32470) Remove task result size check for shuffle map stage
Wei Xue created SPARK-32470: --- Summary: Remove task result size check for shuffle map stage Key: SPARK-32470 URL: https://issues.apache.org/jira/browse/SPARK-32470 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Wei Xue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
[ https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166608#comment-17166608 ] StanislavKo commented on SPARK-28001: - +1 OS: Windows 10 Python: 3.7.2 PySpark: 2.4.3 Cluster manager: Spark Standalone > Dataframe throws 'socket.timeout: timed out' exception > -- > > Key: SPARK-28001 > URL: https://issues.apache.org/jira/browse/SPARK-28001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: Processor: Intel Core i7-7700 CPU @ 3.60GHz > RAM: 16 GB > OS: Windows 10 Enterprise 64-bit > Python: 3.7.2 > PySpark: 2.4.3 > Cluster manager: Spark Standalone >Reporter: Marius Stanescu >Priority: Critical > > I load data from Azure Table Storage, create a DataFrame and perform a couple > of operations via two user-defined functions, then call show() to display the > results. If I load a very small batch of items, like 5, everything works > fine, but if I load a batch greater than 10 items from Azure Table Storage, > then I get the 'socket.timeout: timed out' exception. 
> Here is the code: > > {code} > import time > import json > import requests > from requests.auth import HTTPBasicAuth > from azure.cosmosdb.table.tableservice import TableService > from azure.cosmosdb.table.models import Entity > from pyspark.sql import SparkSession > from pyspark.sql.functions import udf, struct > from pyspark.sql.types import BooleanType > def main(): > batch_size = 25 > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > spark = SparkSession \ > .builder \ > .appName(agent_name) \ > .config("spark.sql.crossJoin.enabled", "true") \ > .getOrCreate() > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) > continuation_token = None > while True: > messages = table_service.query_entities( > azure_table_name, > select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp", > num_results=batch_size, > marker=continuation_token, > timeout=60) > continuation_token = messages.next_marker > messages_list = list(messages) > > if not len(messages_list): > time.sleep(5) > pass > > messages_df = spark.createDataFrame(messages_list) > > register_records_df = messages_df \ > .withColumn('Registered', register_record('RowKey', > 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp')) > > only_registered_records_df = register_records_df \ > .filter(register_records_df.Registered == True) \ > .drop(register_records_df.Registered) > > update_message_status_df = only_registered_records_df \ > .withColumn('TableEntryDeleted', delete_table_entity('RowKey', > 'PartitionKey')) > > results_df = update_message_status_df.select( > update_message_status_df.RowKey, > update_message_status_df.PartitionKey, > update_message_status_df.TableEntryDeleted) > #results_df.explain() > results_df.show(n=batch_size, truncate=False) > @udf(returnType=BooleanType()) > def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp): > # call an API > try: > url = 
'{}/data/record/{}'.format('***', rowKey) > headers = { 'Content-type': 'application/json' } > response = requests.post( > url, > headers=headers, > auth=HTTPBasicAuth('***', '***'), > data=prepare_record_data(rowKey, partitionKey, > messageId, ownerSmtp, timestamp)) > > return bool(response) > except: > return False > def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, > timestamp): > record_data = { > "Title": messageId, > "Type": '***', > "Source": '***', > "Creator": ownerSmtp, > "Publisher": '***', > "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ') > } > return json.dumps(record_data) > @udf(returnType=BooleanType()) > def delete_table_entity(row_key, partition_key): > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > try: > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) > table_service.delete_entity(azure_table_name, partition_key, row_key) > return True > except: > return False > if __name__ == "__main__": >
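One factor in the report above is that `register_record` calls an external API from inside a UDF with no bound on how long the call may take, so a slow endpoint can stall the Python worker. A generic way to cap that latency is sketched below; `bounded_call` is an illustrative helper (not part of Spark or the report's code), used here with plain lambdas so the sketch runs without the Azure or requests dependencies.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def bounded_call(fn, timeout, default):
    """Run fn() but give up after `timeout` seconds, returning `default`.
    Caps how long the caller (e.g. a UDF wrapping an HTTP POST) waits;
    note the abandoned call still runs to completion in the background."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout)
    except FuturesTimeout:
        return default
    finally:
        pool.shutdown(wait=False)

print(bounded_call(lambda: True, timeout=1.0, default=False))   # True
print(bounded_call(lambda: time.sleep(2) or True, timeout=0.2,
                   default=False))                              # False
```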
[jira] [Comment Edited] (SPARK-28001) Dataframe throws 'socket.timeout: timed out' exception
[ https://issues.apache.org/jira/browse/SPARK-28001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166608#comment-17166608 ] StanislavKo edited comment on SPARK-28001 at 7/28/20, 6:02 PM: --- +1 OS: Windows 10 Python: 3.7 PySpark: 2.4.3 Cluster manager: Spark Standalone was (Author: stanislavko): +1 OS: Windows 10 Python: 3.7.2 PySpark: 2.4.3 Cluster manager: Spark Standalone > Dataframe throws 'socket.timeout: timed out' exception > -- > > Key: SPARK-28001 > URL: https://issues.apache.org/jira/browse/SPARK-28001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: Processor: Intel Core i7-7700 CPU @ 3.60Ghz > RAM: 16 GB > OS: Windows 10 Enterprise 64-bit > Python: 3.7.2 > PySpark: 3.4.3 > Cluster manager: Spark Standalone >Reporter: Marius Stanescu >Priority: Critical > > I load data from Azure Table Storage, create a DataFrame and perform a couple > of operations via two user-defined functions, then call show() to display the > results. If I load a very small batch of items, like 5, everything is working > fine, but if I load a batch grater then 10 items from Azure Table Storage > then I get the 'socket.timeout: timed out' exception. 
> Here is the code: > > {code} > import time > import json > import requests > from requests.auth import HTTPBasicAuth > from azure.cosmosdb.table.tableservice import TableService > from azure.cosmosdb.table.models import Entity > from pyspark.sql import SparkSession > from pyspark.sql.functions import udf, struct > from pyspark.sql.types import BooleanType > def main(): > batch_size = 25 > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > spark = SparkSession \ > .builder \ > .appName(agent_name) \ > .config("spark.sql.crossJoin.enabled", "true") \ > .getOrCreate() > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) > continuation_token = None > while True: > messages = table_service.query_entities( > azure_table_name, > select="RowKey, PartitionKey, messageId, ownerSmtp, Timestamp", > num_results=batch_size, > marker=continuation_token, > timeout=60) > continuation_token = messages.next_marker > messages_list = list(messages) > > if not len(messages_list): > time.sleep(5) > pass > > messages_df = spark.createDataFrame(messages_list) > > register_records_df = messages_df \ > .withColumn('Registered', register_record('RowKey', > 'PartitionKey', 'messageId', 'ownerSmtp', 'Timestamp')) > > only_registered_records_df = register_records_df \ > .filter(register_records_df.Registered == True) \ > .drop(register_records_df.Registered) > > update_message_status_df = only_registered_records_df \ > .withColumn('TableEntryDeleted', delete_table_entity('RowKey', > 'PartitionKey')) > > results_df = update_message_status_df.select( > update_message_status_df.RowKey, > update_message_status_df.PartitionKey, > update_message_status_df.TableEntryDeleted) > #results_df.explain() > results_df.show(n=batch_size, truncate=False) > @udf(returnType=BooleanType()) > def register_record(rowKey, partitionKey, messageId, ownerSmtp, timestamp): > # call an API > try: > url = 
'{}/data/record/{}'.format('***', rowKey) > headers = { 'Content-type': 'application/json' } > response = requests.post( > url, > headers=headers, > auth=HTTPBasicAuth('***', '***'), > data=prepare_record_data(rowKey, partitionKey, > messageId, ownerSmtp, timestamp)) > > return bool(response) > except: > return False > def prepare_record_data(rowKey, partitionKey, messageId, ownerSmtp, > timestamp): > record_data = { > "Title": messageId, > "Type": '***', > "Source": '***', > "Creator": ownerSmtp, > "Publisher": '***', > "Date": timestamp.strftime('%Y-%m-%dT%H:%M:%SZ') > } > return json.dumps(record_data) > @udf(returnType=BooleanType()) > def delete_table_entity(row_key, partition_key): > azure_table_account_name = '***' > azure_table_account_key = '***' > azure_table_name = '***' > try: > table_service = TableService(account_name=azure_table_account_name, > account_key=azure_table_account_key) >
[jira] [Resolved] (SPARK-32458) Mismatched row access sizes in tests
[ https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32458. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29258 [https://github.com/apache/spark/pull/29258] > Mismatched row access sizes in tests > > > Key: SPARK-32458 > URL: https://issues.apache.org/jira/browse/SPARK-32458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Michael Munday >Assignee: Michael Munday >Priority: Minor > Labels: catalyst, endianness > Fix For: 3.1.0 > > > The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This > is because the test data is written into the row using unsafe operations with > one size and then read back using a different size. For example, in > UnsafeMapSuite the test data is written using putLong and then read back > using getInt. This happens to work on little-endian systems but these > differences appear to be typos and cause the tests to fail on big-endian > systems. > I have a patch that fixes the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
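The putLong/getInt size mismatch described above can be reproduced with Python's struct module: writing a small value as an 8-byte long and reading back only the first 4 bytes happens to work on a little-endian layout, but yields the high word (zero) on a big-endian one.

```python
import struct

value = 42
little = struct.pack('<q', value)  # 8-byte long, little-endian layout
big = struct.pack('>q', value)     # 8-byte long, big-endian layout

# Reading only 4 bytes back (the getInt half of the mismatch):
read_little = struct.unpack('<i', little[:4])[0]
read_big = struct.unpack('>i', big[:4])[0]

print(read_little)  # 42 -> the bug goes unnoticed on little-endian
print(read_big)     # 0  -> the test fails on big-endian
```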
[jira] [Assigned] (SPARK-32458) Mismatched row access sizes in tests
[ https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32458: - Assignee: Michael Munday > Mismatched row access sizes in tests > > > Key: SPARK-32458 > URL: https://issues.apache.org/jira/browse/SPARK-32458 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Michael Munday >Assignee: Michael Munday >Priority: Minor > Labels: catalyst, endianness > > The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This > is because the test data is written into the row using unsafe operations with > one size and then read back using a different size. For example, in > UnsafeMapSuite the test data is written using putLong and then read back > using getInt. This happens to work on little-endian systems but these > differences appear to be typos and cause the tests to fail on big-endian > systems. > I have a patch that fixes the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166581#comment-17166581 ] Felix Cheung commented on SPARK-12172: -- These are methods (map etc) that were never public and not supported. On Tue, Jul 28, 2020 at 10:18 AM S Daniel Zafar (Jira) > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Task > Components: SparkR >Reporter: Felix Cheung >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166575#comment-17166575 ] Dongjoon Hyun commented on SPARK-20684: --- Actually, there were two PRs for this. - https://github.com/apache/spark/pull/17941 (mine) - https://github.com/apache/spark/pull/19176 (Yanbo's) The PR is closed because we didn't implement the feature inside R. If we have the implementation, we can expose it after that. Let's keep this open, [~dan_z]. That is my opinion as the patch author. I'm still active in this area. > expose createOrReplaceGlobalTempView/createGlobalTempView and > dropGlobalTempView in SparkR > -- > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Priority: Major > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30817) SparkR ML algorithms parity
[ https://issues.apache.org/jira/browse/SPARK-30817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166571#comment-17166571 ] S Daniel Zafar edited comment on SPARK-30817 at 7/28/20, 5:19 PM: -- I would like to work on this issue, is that all right [~hyukjin.kwon]? It would be my first. was (Author: dan_z): I would like to address this issue, is that all right [~hyukjin.kwon]? > SparkR ML algorithms parity > > > Key: SPARK-30817 > URL: https://issues.apache.org/jira/browse/SPARK-30817 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > As of 3.0 the following algorithms are missing form SparkR > * {{LinearRegression}} > * {{FMRegressor}} (Added to ML in 3.0) > * {{FMClassifier}} (Added to ML in 3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166569#comment-17166569 ] S Daniel Zafar commented on SPARK-12172: My opinion is that it makes sense to keep these methods, since they exist in PySpark. Removing basic things like `map` seems counterintuitive. The PR is closed; should we close this as well? > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Task > Components: SparkR >Reporter: Felix Cheung >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166567#comment-17166567 ] Apache Spark commented on SPARK-32469: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29273 > ApplyColumnarRulesAndInsertTransitions should be idempotent > --- > > Key: SPARK-32469 > URL: https://issues.apache.org/jira/browse/SPARK-32469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166564#comment-17166564 ] Apache Spark commented on SPARK-32469: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29273 > ApplyColumnarRulesAndInsertTransitions should be idempotent > --- > > Key: SPARK-32469 > URL: https://issues.apache.org/jira/browse/SPARK-32469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32469: Assignee: Apache Spark (was: Wenchen Fan) > ApplyColumnarRulesAndInsertTransitions should be idempotent > --- > > Key: SPARK-32469 > URL: https://issues.apache.org/jira/browse/SPARK-32469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
[ https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32469: Assignee: Wenchen Fan (was: Apache Spark) > ApplyColumnarRulesAndInsertTransitions should be idempotent > --- > > Key: SPARK-32469 > URL: https://issues.apache.org/jira/browse/SPARK-32469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent
Wenchen Fan created SPARK-32469: --- Summary: ApplyColumnarRulesAndInsertTransitions should be idempotent Key: SPARK-32469 URL: https://issues.apache.org/jira/browse/SPARK-32469 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
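The ticket carries no description, but the property named in its title is standard: a rule is idempotent when applying it twice yields the same plan as applying it once. A toy check of that property, with strings standing in for physical plans (illustrative only, not Spark's actual rule):

```python
def insert_transitions(plan):
    """Toy stand-in for a transition-inserting rule: wrap a bare plan
    once, and leave an already-wrapped plan untouched."""
    if plan.startswith("Transition("):
        return plan
    return f"Transition({plan})"

once = insert_transitions("ColumnarScan")
twice = insert_transitions(once)
print(once == twice)  # True: re-running the rule adds no second wrapper
```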
[jira] [Commented] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166559#comment-17166559 ] S Daniel Zafar commented on SPARK-20684: The PR ([https://github.com/apache/spark/pull/17941]) was closed. I think we can close this. > expose createOrReplaceGlobalTempView/createGlobalTempView and > dropGlobalTempView in SparkR > -- > > Key: SPARK-20684 > URL: https://issues.apache.org/jira/browse/SPARK-20684 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Priority: Major > > This is a useful API that is not exposed in SparkR. It will help with moving > data between languages on a single Spark application.
[jira] [Comment Edited] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166532#comment-17166532 ] Xiangrui Meng edited comment on SPARK-32429 at 7/28/20, 4:30 PM: - [~tgraves] Thanks for the clarification! It makes sense to add GPU isolation at the executor level. Your prototype adds special meaning to the "gpu" resource name. I wonder if we want to make it more configurable in the final implementation. A scenario we considered previously was a cluster with two generations of GPUs: K80 and V100. I think it is safe to assume that a Spark application should only request one GPU type. Then we will need some configuration to tell, based on which resource name, the user wants to set CUDA_VISIBLE_DEVICES. Btw, we found that setting CUDA_DEVICE_ORDER=PCI_BUS_ID is necessary to have consistent device ordering between different processes even when CUDA_VISIBLE_DEVICES is set the same. Not sure if the same setting is used in YARN/k8s.
> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. > * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env.
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166532#comment-17166532 ] Xiangrui Meng commented on SPARK-32429
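The mechanism discussed above can be sketched as follows (an illustrative Python sketch, not Spark's actual standalone-worker code; the function name and return shape are made up): the launcher restricts the executor process to its allocated devices, pins a deterministic device ordering, and renumbers the resource addresses from 0 because that is how CUDA presents the visible devices inside the process.

```python
import os

def launch_env(allocated_gpus):
    """Build an executor environment for GPU isolation.

    allocated_gpus: physical GPU ids the worker assigned, e.g. ["2", "3"].
    Returns (env, remapped): env restricts CUDA to those devices, and
    remapped is the address list the executor should use, renumbered
    from 0 as the issue description notes.
    """
    env = dict(os.environ)
    # Restrict the process to the allocated devices ...
    env["CUDA_VISIBLE_DEVICES"] = ",".join(allocated_gpus)
    # ... and pin a deterministic ordering so every process agrees on
    # which physical device is which (per the comment above).
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Inside the process, CUDA renumbers the visible devices from 0.
    remapped = [str(i) for i in range(len(allocated_gpus))]
    return env, remapped

env, addrs = launch_env(["2", "3"])
assert env["CUDA_VISIBLE_DEVICES"] == "2,3"
assert addrs == ["0", "1"]
```

The same environment would presumably also be passed to any Python worker process the executor spawns, as the issue suggests.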
[jira] [Assigned] (SPARK-32339) Improve MLlib BLAS native acceleration docs
[ https://issues.apache.org/jira/browse/SPARK-32339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao reassigned SPARK-32339: -- Assignee: Xiaochang Wu > Improve MLlib BLAS native acceleration docs > --- > > Key: SPARK-32339 > URL: https://issues.apache.org/jira/browse/SPARK-32339 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 3.1.0 >Reporter: Xiaochang Wu >Assignee: Xiaochang Wu >Priority: Major > > The documentation for enabling BLAS native acceleration in the ML guide > ([https://spark.apache.org/docs/latest/ml-guide.html#dependencies]) is > incomplete and unclear to the user. > We will rewrite it into a clearer and more complete guide.
[jira] [Resolved] (SPARK-32339) Improve MLlib BLAS native acceleration docs
[ https://issues.apache.org/jira/browse/SPARK-32339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-32339. Fix Version/s: 3.1.0, 3.0.1 Resolution: Fixed Issue resolved by pull request 29139 [https://github.com/apache/spark/pull/29139]
[jira] [Commented] (SPARK-32468) Fix timeout config issue in Kafka connector tests
[ https://issues.apache.org/jira/browse/SPARK-32468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166494#comment-17166494 ] Apache Spark commented on SPARK-32468: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/29272 > Fix timeout config issue in Kafka connector tests > - > > Key: SPARK-32468 > URL: https://issues.apache.org/jira/browse/SPARK-32468 > Project: Spark > Issue Type: Bug > Components: Structured Streaming, Tests >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Minor > > While implementing SPARK-32032 I found a bug in Kafka: > https://issues.apache.org/jira/browse/KAFKA-10318. This will cause issues > only later, when it's fixed, but it would be good to fix it now because > SPARK-32032 would like to bring in AdminClient, where the code blows up with > the mentioned ConfigException. This would reduce the code changes in the > mentioned jira.
[jira] [Assigned] (SPARK-32468) Fix timeout config issue in Kafka connector tests
[ https://issues.apache.org/jira/browse/SPARK-32468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32468: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-32468) Fix timeout config issue in Kafka connector tests
[ https://issues.apache.org/jira/browse/SPARK-32468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32468: Assignee: Apache Spark
[jira] [Created] (SPARK-32468) Fix timeout config issue in Kafka connector tests
Gabor Somogyi created SPARK-32468: Summary: Fix timeout config issue in Kafka connector tests Key: SPARK-32468 URL: https://issues.apache.org/jira/browse/SPARK-32468 Project: Spark Issue Type: Bug Components: Structured Streaming, Tests Affects Versions: 3.1.0 Reporter: Gabor Somogyi
[jira] [Assigned] (SPARK-32467) Avoid encoding URL twice on https redirect
[ https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32467: Assignee: Apache Spark (was: Gengliang Wang) > Avoid encoding URL twice on https redirect > -- > > Key: SPARK-32467 > URL: https://issues.apache.org/jira/browse/SPARK-32467 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Currently, on https redirect, the original URL is encoded as an HTTPS URL. > However, the original URL could already be encoded, so the return result > of the method > UriInfo.getQueryParameters will contain encoded keys and values. For example, > a parameter > order[0][dir] will become order%255B0%255D%255Bcolumn%255D after being encoded > twice, and the decoded > key in the result of UriInfo.getQueryParameters will be > order%5B0%5D%5Bcolumn%5D. > To fix the problem, we try decoding the query parameters before encoding them. > This is to make sure we encode the URL exactly once.
[jira] [Commented] (SPARK-32467) Avoid encoding URL twice on https redirect
[ https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166477#comment-17166477 ] Apache Spark commented on SPARK-32467: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/29271
[jira] [Created] (SPARK-32467) Avoid encoding URL twice on https redirect
Gengliang Wang created SPARK-32467: Summary: Avoid encoding URL twice on https redirect Key: SPARK-32467 URL: https://issues.apache.org/jira/browse/SPARK-32467 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.1.0 Reporter: Gengliang Wang Assignee: Gengliang Wang
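The decode-then-encode fix described in this ticket can be illustrated with Python's `urllib.parse` (an illustrative sketch; Spark's actual fix lives in its Jetty-based UI code, and `encode_once` is a made-up helper name):

```python
from urllib.parse import quote, unquote

def encode_once(value):
    """Decode first, then encode, so an already-encoded value is not
    encoded a second time -- the double-encoding fix from SPARK-32467."""
    return quote(unquote(value), safe="")

raw = "order[0][dir]"
once = quote(raw, safe="")    # 'order%5B0%5D%5Bdir%5D'
twice = quote(once, safe="")  # 'order%255B0%255D%255Bdir%255D' -- the bug
assert encode_once(raw) == once
assert encode_once(once) == once  # idempotent: encoding happens exactly once
```

Encoding an already-encoded value turns each `%` into `%25`, which is how `%5B` becomes `%255B`; decoding before encoding makes the operation idempotent.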
[jira] [Assigned] (SPARK-32466) Add support to catch SparkPlan regression based on TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-32466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32466: Assignee: Apache Spark > Add support to catch SparkPlan regression based on TPC-DS queries > > > Key: SPARK-32466 > URL: https://issues.apache.org/jira/browse/SPARK-32466 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Nowadays, Spark is getting more and more complex. Any change might cause > a regression unintentionally. Spark already has benchmarks to catch > performance regressions, but it doesn't yet have a way to detect regressions > inside a SparkPlan. It would be good if we could find possible regressions > early, during the compile phase, before the runtime phase.
[jira] [Commented] (SPARK-28004) Update jquery to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-28004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166411#comment-17166411 ] Sean R. Owen commented on SPARK-28004: -- I hadn't planned to, as it's a non-trivial change, and it wasn't obvious that the CVEs (so far) affect Spark. It's not clear they don't, either, and it is valid to back-port security-related updates. I'd generally suggest people move towards Spark 3 at this point. But I'd look at a back-port PR that someone has gotten working on 2.4.x. > Update jquery to 3.4.1 > -- > > Key: SPARK-28004 > URL: https://issues.apache.org/jira/browse/SPARK-28004 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > Fix For: 3.0.0 > > > We're using an old-ish jQuery, 1.12.4, and should probably update for Spark 3 > to keep up in general, but also to keep up with CVEs. In fact, we know of at > least one resolved in only 3.4.0+ > (https://nvd.nist.gov/vuln/detail/CVE-2019-11358). They may not affect Spark, > but, if the update isn't painful, maybe worthwhile in order to make future > 3.x updates easier. > jQuery 1 -> 2 doesn't sound like a breaking change, as 2.0 is supposed to > maintain compatibility with 1.9+ > (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) > 2 -> 3 has breaking changes: https://jquery.com/upgrade-guide/3.0/. It's hard > to evaluate each one, but the most likely area for problems is in ajax(). > However, our usage of jQuery (and plugins) is pretty simple. > I've tried updating and testing the UI, and can't see any warnings, errors, > or problematic functionality. This includes the Spark UI, master UI, worker > UI, and docs (well, I wasn't able to build R docs)
[jira] [Assigned] (SPARK-32466) Add support to catch SparkPlan regression based on TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-32466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32466: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32466) Add support to catch SparkPlan regression based on TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-32466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166410#comment-17166410 ] Apache Spark commented on SPARK-32466: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/29270
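The idea behind SPARK-32466 can be sketched as comparing freshly generated plan text against checked-in "golden" plans for each TPC-DS query (an illustrative Python sketch; the function, dict-based storage, and plan strings are made up, not Spark's actual test harness):

```python
def check_plan(query_name, actual_plan, golden_plans):
    """Compare a freshly generated plan against the approved one.

    golden_plans: dict mapping query name -> approved plan text (in
    practice these would be checked-in files). Returns None on a match,
    or a short message describing the regression.
    """
    expected = golden_plans.get(query_name)
    if expected is None:
        return f"{query_name}: no approved plan; regenerate goldens"
    if actual_plan.strip() != expected.strip():
        return (f"{query_name}: plan changed -- expected:\n{expected}\n"
                f"got:\n{actual_plan}")
    return None

goldens = {"q1": "SortMergeJoin\n  Scan A\n  Scan B"}
assert check_plan("q1", "SortMergeJoin\n  Scan A\n  Scan B", goldens) is None
assert check_plan("q1", "BroadcastHashJoin\n  Scan A\n  Scan B", goldens) is not None
```

A test like this fails at plan-generation time, so an unintended plan change (say, a join strategy flip) surfaces before any query is actually executed.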
[jira] [Commented] (SPARK-21784) Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign keys
[ https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166401#comment-17166401 ] Hamad Javed commented on SPARK-21784: - Hey [~ksunitha], I would be really interested in having this feature added to Spark. Having primary keys defined is a strong feature in a lot of traditional databases, and porting it over to Spark would be very useful - we can eliminate a whole class of problems simply by knowing and enforcing primary key constraints. > Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign > keys > -- > > Key: SPARK-21784 > URL: https://issues.apache.org/jira/browse/SPARK-21784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Suresh Thalamati >Priority: Major > > Currently Spark SQL does not have DDL support to define primary key and > foreign key constraints. This Jira is to add DDL support to define primary > key and foreign key informational constraints using ALTER TABLE syntax. These > constraints will be used in query optimization and you can find more details > about this in the spec in SPARK-19842 > *Syntax :* > {code} > ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName] > (PRIMARY KEY (col_names) | > FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)]) > [VALIDATE | NOVALIDATE] [RELY | NORELY] > {code} > Examples : > {code:sql} > ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY > ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES > employee(empno) NOVALIDATE NORELY > {code} > *Constraint name generated by the system:* > {code:sql} > ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY > ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) > VALIDATE RELY; > {code}
[jira] [Resolved] (SPARK-32382) Override table renaming in JDBC dialects
[ https://issues.apache.org/jira/browse/SPARK-32382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32382. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29237 [https://github.com/apache/spark/pull/29237] > Override table renaming in JDBC dialects > > > Key: SPARK-32382 > URL: https://issues.apache.org/jira/browse/SPARK-32382 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > SPARK-32375 adds a new method, renameTable, to JdbcDialect with the default > implementation: > {code:sql} > ALTER TABLE table_name RENAME TO new_table_name; > {code} > which is supported by Oracle, MySQL, MariaDB, PostgreSQL and SQLite, but other > dialects might not support this syntax; for instance, SQL Server uses the > stored procedure sp_rename: > {code:sql} > sp_rename 'table_name', 'new_table_name'; > {code} > The ticket aims to support table renaming in all JDBC dialects.
[jira] [Assigned] (SPARK-32382) Override table renaming in JDBC dialects
[ https://issues.apache.org/jira/browse/SPARK-32382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32382: --- Assignee: Maxim Gekk
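The dialect-override idea in SPARK-32382 boils down to per-dialect SQL generation for renames. A minimal sketch (illustrative Python; Spark's real JdbcDialect API is Scala, and only the two SQL statements below come from the ticket):

```python
def rename_table_sql(dialect, old_name, new_name):
    """Generate the rename statement for a given dialect.

    Most dialects (Oracle, MySQL, MariaDB, PostgreSQL, SQLite) accept
    the default ALTER TABLE ... RENAME TO syntax; SQL Server instead
    uses the sp_rename stored procedure, so its dialect overrides the
    default. The dialect names here are illustrative.
    """
    if dialect == "mssql":
        return f"sp_rename '{old_name}', '{new_name}';"
    return f"ALTER TABLE {old_name} RENAME TO {new_name};"

assert rename_table_sql("postgresql", "t", "t2") == "ALTER TABLE t RENAME TO t2;"
assert rename_table_sql("mssql", "t", "t2") == "sp_rename 't', 't2';"
```

In Spark the same dispatch is achieved by each dialect subclass overriding the default `renameTable` implementation rather than branching on a dialect string.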