[jira] [Commented] (SPARK-36175) Support TimestampNTZ in Avro data source

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391385#comment-17391385
 ] 

Apache Spark commented on SPARK-36175:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/33607

> Support TimestampNTZ in Avro data source 
> -
>
> Key: SPARK-36175
> URL: https://issues.apache.org/jira/browse/SPARK-36175
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> As per the Avro spec 
> https://avro.apache.org/docs/1.10.2/spec.html#Local+timestamp+%28microsecond+precision%29,
>  Spark can convert TimestampNTZ type from/to Avro's Local timestamp type.
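For illustration (my addition, not from the ticket): a minimal round-trip sketch,
assuming Spark 3.3.0+ with the built-in Avro data source on the classpath and an
active {{spark}} session; the output path is hypothetical.

{code:scala}
import java.time.LocalDateTime
import spark.implicits._

// LocalDateTime values are encoded as TimestampNTZ (no time zone attached).
val df = Seq(LocalDateTime.parse("2021-08-01T12:34:56")).toDF("ts")

// On write, TimestampNTZ is expected to map to Avro's
// local-timestamp-micros logical type.
df.write.format("avro").save("/tmp/ntz_avro")

// On read, the local-timestamp logical type maps back to TimestampNTZ.
spark.read.format("avro").load("/tmp/ntz_avro").printSchema()
// root
//  |-- ts: timestamp_ntz (nullable = true)
{code}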



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Commented] (SPARK-35815) Allow delayThreshold for watermark to be represented as ANSI day-time/year-month interval literals

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391373#comment-17391373
 ] 

Apache Spark commented on SPARK-35815:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33606

> Allow delayThreshold for watermark to be represented as ANSI 
> day-time/year-month interval literals
> --
>
> Key: SPARK-35815
> URL: https://issues.apache.org/jira/browse/SPARK-35815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> delayThreshold parameter of DataFrame.withWatermark should handle ANSI 
> day-time/year-month interval literals.
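For illustration (my addition, not from the ticket): a sketch of the intended
usage with a rate source; the column name follows the rate source's schema.

{code:scala}
// delayThreshold expressed as an ANSI day-time interval literal instead of the
// classic "10 seconds" style string.
val events = spark.readStream.format("rate").load()

val watermarked = events.withWatermark("timestamp", "INTERVAL '10' SECOND")
{code}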



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36379) Null at root level of a JSON array causes the parsing failure (w/ permissive mode)

2021-08-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36379:
-
Issue Type: Bug  (was: Improvement)

> Null at root level of a JSON array causes the parsing failure (w/ permissive 
> mode)
> --
>
> Key: SPARK-36379
> URL: https://issues.apache.org/jira/browse/SPARK-36379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> scala> spark.read.json(Seq("""[{"a": "str"}, null, {"a": 
> "str"}]""").toDS).collect()
> {code}
> {code}
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
> (TID 1) (172.30.3.20 executor driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> {code}
> Since the mode (by default) is permissive, we shouldn't just fail like above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36379) Null at root level of a JSON array causes the parsing failure (w/ permissive mode)

2021-08-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36379:
-
Priority: Minor  (was: Major)

> Null at root level of a JSON array causes the parsing failure (w/ permissive 
> mode)
> --
>
> Key: SPARK-36379
> URL: https://issues.apache.org/jira/browse/SPARK-36379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> scala> spark.read.json(Seq("""[{"a": "str"}, null, {"a": 
> "str"}]""").toDS).collect()
> {code}
> {code}
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 
> (TID 1) (172.30.3.20 executor driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> {code}
> Since the mode (by default) is permissive, we shouldn't just fail like above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36379) Null at root level of a JSON array causes the parsing failure (w/ permissive mode)

2021-08-01 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36379:


 Summary: Null at root level of a JSON array causes the parsing 
failure (w/ permissive mode)
 Key: SPARK-36379
 URL: https://issues.apache.org/jira/browse/SPARK-36379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2, 3.2.0, 3.3.0
Reporter: Hyukjin Kwon



{code}
scala> spark.read.json(Seq("""[{"a": "str"}, null, {"a": 
"str"}]""").toDS).collect()
{code}

{code}
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 
1) (172.30.3.20 executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
{code}

Since the mode (by default) is permissive, we shouldn't just fail like above.
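For reference (my illustration, not from the report): once fixed, permissive mode
should turn the unparsable root-level null into a null/corrupt row instead of
throwing, along these lines:

{code:scala}
import spark.implicits._

val ds = Seq("""[{"a": "str"}, null, {"a": "str"}]""").toDS

// Expected (illustrative) behavior after the fix: the bad record is captured in
// the corrupt-record column rather than failing the query with an NPE.
spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(ds)
  .show()
{code}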



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35917) Disable push-based shuffle until the feature is complete

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-35917:

Fix Version/s: (was: 3.2.0)

> Disable push-based shuffle until the feature is complete
> 
>
> Key: SPARK-35917
> URL: https://issues.apache.org/jira/browse/SPARK-35917
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> Push-based shuffle is partially merged in Apache master, but some of the tasks 
> are still incomplete. Since 3.2 is going to be cut soon, we will not be able 
> to get the pending tasks reviewed and merged. A few of the pending tasks make 
> protocol changes to the push-based shuffle protocols, so we would like to 
> prevent users from enabling push-based shuffle on both the client and the 
> server until the push-based shuffle implementation is complete.
> We can prevent push-based shuffle from being used by throwing 
> {{UnsupportedOperationException}} (or something like that) on both the client 
> and the server when the user tries to enable it.
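For illustration only (my sketch; names other than the config key are
hypothetical, not the actual patch):

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical guard of the kind described above; the real change would live in
// config validation on both the client and the shuffle server.
def checkPushBasedShuffleDisabled(conf: SparkConf): Unit = {
  if (conf.getBoolean("spark.shuffle.push.enabled", defaultValue = false)) {
    throw new UnsupportedOperationException(
      "Push-based shuffle is not yet fully implemented and cannot be enabled.")
  }
}
{code}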



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35917) Disable push-based shuffle until the feature is complete

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-35917:

Fix Version/s: 3.2.0

> Disable push-based shuffle until the feature is complete
> 
>
> Key: SPARK-35917
> URL: https://issues.apache.org/jira/browse/SPARK-35917
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
> Fix For: 3.2.0
>
>
> Push-based shuffle is partially merged in Apache master, but some of the tasks 
> are still incomplete. Since 3.2 is going to be cut soon, we will not be able 
> to get the pending tasks reviewed and merged. A few of the pending tasks make 
> protocol changes to the push-based shuffle protocols, so we would like to 
> prevent users from enabling push-based shuffle on both the client and the 
> server until the push-based shuffle implementation is complete.
> We can prevent push-based shuffle from being used by throwing 
> {{UnsupportedOperationException}} (or something like that) on both the client 
> and the server when the user tries to enable it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35917) Disable push-based shuffle until the feature is complete

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-35917:
---

Assignee: (was: Mridul Muralidharan)

> Disable push-based shuffle until the feature is complete
> 
>
> Key: SPARK-35917
> URL: https://issues.apache.org/jira/browse/SPARK-35917
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> Push-based shuffle is partially merged in Apache master, but some of the tasks 
> are still incomplete. Since 3.2 is going to be cut soon, we will not be able 
> to get the pending tasks reviewed and merged. A few of the pending tasks make 
> protocol changes to the push-based shuffle protocols, so we would like to 
> prevent users from enabling push-based shuffle on both the client and the 
> server until the push-based shuffle implementation is complete.
> We can prevent push-based shuffle from being used by throwing 
> {{UnsupportedOperationException}} (or something like that) on both the client 
> and the server when the user tries to enable it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36306) Refactor seventeenth set of 20 query execution errors to use error classes

2021-08-01 Thread PengLei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391362#comment-17391362
 ] 

PengLei commented on SPARK-36306:
-

working on this

> Refactor seventeenth set of 20 query execution errors to use error classes
> --
>
> Key: SPARK-36306
> URL: https://issues.apache.org/jira/browse/SPARK-36306
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the seventeenth set of 20.
> {code:java}
> legacyCheckpointDirectoryExistsError
> subprocessExitedError
> outputDataTypeUnsupportedByNodeWithoutSerdeError
> invalidStartIndexError
> concurrentModificationOnExternalAppendOnlyUnsafeRowArrayError
> doExecuteBroadcastNotImplementedError
> databaseNameConflictWithSystemPreservedDatabaseError
> commentOnTableUnsupportedError
> unsupportedUpdateColumnNullabilityError
> renameColumnUnsupportedForOlderMySQLError
> failedToExecuteQueryError
> nestedFieldUnsupportedError
> transformationsAndActionsNotInvokedByDriverError
> repeatedPivotsUnsupportedError
> pivotNotAfterGroupByUnsupportedError
> {code}
> For more detail, see the parent ticket SPARK-36094.
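For context (my illustration; the error-class name and exception type are only
indicative of the pattern, see SPARK-36094 for the real framework):

{code:scala}
// Before: a free-form message constructed inline.
def repeatedPivotsUnsupportedError(): Throwable = {
  new UnsupportedOperationException("repeated pivots are not supported")
}

// After (sketch): the message is looked up via an error class defined in
// error-classes.json, making errors uniform, documented, and testable.
def repeatedPivotsUnsupportedError(): Throwable = {
  new SparkUnsupportedOperationException(
    errorClass = "UNSUPPORTED_FEATURE",
    messageParameters = Array("Repeated pivots."))
}
{code}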



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36305) Refactor sixteenth set of 20 query execution errors to use error classes

2021-08-01 Thread PengLei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391361#comment-17391361
 ] 

PengLei commented on SPARK-36305:
-

working on this

> Refactor sixteenth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36305
> URL: https://issues.apache.org/jira/browse/SPARK-36305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the sixteenth set of 20.
> {code:java}
> cannotDropMultiPartitionsOnNonatomicPartitionTableError
> truncateMultiPartitionUnsupportedError
> overwriteTableByUnsupportedExpressionError
> dynamicPartitionOverwriteUnsupportedByTableError
> failedMergingSchemaError
> cannotBroadcastTableOverMaxTableRowsError
> cannotBroadcastTableOverMaxTableBytesError
> notEnoughMemoryToBuildAndBroadcastTableError
> executeCodePathUnsupportedError
> cannotMergeClassWithOtherClassError
> continuousProcessingUnsupportedByDataSourceError
> failedToReadDataError
> failedToGenerateEpochMarkerError
> foreachWriterAbortedDueToTaskFailureError
> integerOverflowError
> failedToReadDeltaFileError
> failedToReadSnapshotFileError
> cannotPurgeAsBreakInternalStateError
> cleanUpSourceFilesUnsupportedError
> latestOffsetNotCalledError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36304) Refactor fifteenth set of 20 query execution errors to use error classes

2021-08-01 Thread PengLei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391360#comment-17391360
 ] 

PengLei commented on SPARK-36304:
-

working on this

> Refactor fifteenth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36304
> URL: https://issues.apache.org/jira/browse/SPARK-36304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the fifteenth set of 20.
> {code:java}
> unsupportedOperationExceptionError
> nullLiteralsCannotBeCastedError
> notUserDefinedTypeError
> cannotLoadUserDefinedTypeError
> timeZoneIdNotSpecifiedForTimestampTypeError
> notPublicClassError
> primitiveTypesNotSupportedError
> fieldIndexOnRowWithoutSchemaError
> valueIsNullError
> onlySupportDataSourcesProvidingFileFormatError
> failToSetOriginalPermissionBackError
> failToSetOriginalACLBackError
> multiFailuresInStageMaterializationError
> unrecognizedCompressionSchemaTypeIDError
> getParentLoggerNotImplementedError
> cannotCreateParquetConverterForTypeError
> cannotCreateParquetConverterForDecimalTypeError
> cannotCreateParquetConverterForDataTypeError
> cannotAddMultiPartitionsOnNonatomicPartitionTableError
> userSpecifiedSchemaUnsupportedByDataSourceError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-36378:
---

Assignee: (was: Mridul Muralidharan)

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Priority: Major
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-30602:
---

Shepherd: Mridul Muralidharan
Assignee: (was: Mridul Muralidharan)

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When running Spark on YARN at large scale, people usually enable the 
> Spark external shuffle service and store the intermediate shuffle files on 
> HDD. Because the number of blocks generated for a particular shuffle grows 
> quadratically with the size of the shuffled data (# mappers and # reducers 
> grow linearly with the size of the shuffled data, but # blocks is # mappers * 
> # reducers), one general trend we have observed is that the more data a Spark 
> application processes, the smaller the block size becomes. In a few production 
> clusters we have seen, the average shuffle block size is only tens of KBs. 
> Because of the inefficiency of performing random reads on HDD for small 
> amounts of data, the overall efficiency of the Spark external shuffle services 
> serving the shuffle blocks degrades as we see an increasing # of Spark 
> applications processing an increasing amount of data. In addition, because the 
> Spark external shuffle service is a shared service in a multi-tenancy cluster, 
> the inefficiency of one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers, and blocks get pre-merged 
> and moved towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark's existing 
> shuffle Netty protocol and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle to 
> Spark without incurring the dependency or overhead of either a specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]
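To make the quadratic growth concrete (illustrative numbers, not from the ticket):

{code:scala}
// # blocks = # mappers * # reducers, so the average block shrinks as data grows.
val mappers  = 5000L
val reducers = 5000L
val shuffledBytes = 500L * 1024 * 1024 * 1024  // 500 GiB of shuffle data
val blocks = mappers * reducers                // 25,000,000 blocks
val avgBlockBytes = shuffledBytes / blocks     // ~21 KiB per block
{code}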



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-30602:

Fix Version/s: 3.2.0

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.2.0
>
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When running Spark on YARN at large scale, people usually enable the 
> Spark external shuffle service and store the intermediate shuffle files on 
> HDD. Because the number of blocks generated for a particular shuffle grows 
> quadratically with the size of the shuffled data (# mappers and # reducers 
> grow linearly with the size of the shuffled data, but # blocks is # mappers * 
> # reducers), one general trend we have observed is that the more data a Spark 
> application processes, the smaller the block size becomes. In a few production 
> clusters we have seen, the average shuffle block size is only tens of KBs. 
> Because of the inefficiency of performing random reads on HDD for small 
> amounts of data, the overall efficiency of the Spark external shuffle services 
> serving the shuffle blocks degrades as we see an increasing # of Spark 
> applications processing an increasing amount of data. In addition, because the 
> Spark external shuffle service is a shared service in a multi-tenancy cluster, 
> the inefficiency of one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers, and blocks get pre-merged 
> and moved towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark's existing 
> shuffle Netty protocol and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle to 
> Spark without incurring the dependency or overhead of either a specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36363) AKS Spark UI does not have executor tab showing up

2021-08-01 Thread Koushik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391349#comment-17391349
 ] 

Koushik commented on SPARK-36363:
-

Hi Kwon,

It's harder for us to migrate to a newer version, as most of the apps in our 
project run with Spark 2.2. Don't we have a fix for the Spark 2.2 version?

> AKS Spark UI does not have executor tab showing up
> --
>
> Key: SPARK-36363
> URL: https://issues.apache.org/jira/browse/SPARK-36363
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Koushik
>Priority: Major
>
> The Spark UI Executor tab shows up blank, and I see the below error in the 
> network tab:
> https://keplerfnet-aks-prod.az.3pc.att.com/proxy:10.128.0.76:4043/executors/
> Failed to load resource: the server responded with a status of 404 ()
> DevTools failed to load source map: Could not load content for 
> https://keplerfnet-aks-prod.az.3pc.att.com/proxy:10.128.0.76:4043/static/vis.map:
>  HTTP error: status code 502, net::ERR_HTTP_RESPONSE_CODE_FAILURE



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS

2021-08-01 Thread Yu Gan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391259#comment-17391259
 ] 

Yu Gan edited comment on SPARK-20415 at 8/2/21, 6:01 AM:
-

Did you find the root cause? I came across the same issue in our Azure 
environment. 

org.apache.spark.unsafe.Platform.copyMemory

...

org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask

...

 

BTW, Spark version: 3.1.1


was (Author: gyustorm):
Did you find the root cause? I came across the same issue in our azure 
environment. 

org.apache.spark.unsafe.Platform.copyMemory

...

org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask

...

> SPARK job hangs while writing DataFrame to HDFS
> ---
>
> Key: SPARK-20415
> URL: https://issues.apache.org/jira/browse/SPARK-20415
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, YARN
>Affects Versions: 2.1.0
> Environment: EMR 5.4.0
>Reporter: P K
>Priority: Major
>  Labels: bulk-closed
>
> We are in the POC phase with Spark. One of the steps is reading compressed 
> json files that come from sources, "exploding" them into tabular format and 
> then writing them to HDFS. This worked for about three weeks until, a few 
> days ago, for a particular dataset, the writer just started hanging. I logged 
> in to the worker machines and see this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 
> tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000]
>java.lang.Thread.State: RUNNABLE
> at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6

[jira] [Commented] (SPARK-32923) Add support to properly handle different type of stage retries

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391333#comment-17391333
 ] 

Apache Spark commented on SPARK-32923:
--

User 'venkata91' has created a pull request for this issue:
https://github.com/apache/spark/pull/33605

> Add support to properly handle different type of stage retries
> --
>
> Key: SPARK-32923
> URL: https://issues.apache.org/jira/browse/SPARK-32923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.2.0
>
>
> In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was 
> introduced, which would be handled differently if retried.
> Since this was added to address a data correctness issue, we should also add 
> support for it in push-based shuffle, so that we would be able to roll back 
> the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE 
> stage.
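For illustration (my sketch; all names are hypothetical and only outline the
required behavior):

{code:scala}
// An INDETERMINATE stage may produce different output on re-execution, so any
// partially merged shuffle partitions from the previous attempt must be
// discarded before the retry, or old and new map output would be mixed.
trait ShuffleMapStageLike { def shuffleId: Int; def isIndeterminate: Boolean }
trait MergeCoordinator { def unregisterMergeResults(shuffleId: Int): Unit }

def onStageRetry(stage: ShuffleMapStageLike, merges: MergeCoordinator): Unit = {
  if (stage.isIndeterminate) {
    merges.unregisterMergeResults(stage.shuffleId)
  }
}
{code}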



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-08-01 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391318#comment-17391318
 ] 

Mridul Muralidharan edited comment on SPARK-30602 at 8/2/21, 5:30 AM:
--

The only pending task here is documentation.
[~vsowrirajan] has opened a PR for that, but I am marking this jira as Resolved 
for 3.2 as all code changes have been merged to master and branch-3.2.


was (Author: mridulm80):
The only pending task here is documentation.
[~vsowrirajan] has opened a PR for that - but marking this jira as Resolved for 
3.2 as all changes have been merged to master and branch-3.2

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When running Spark on YARN at large scale, people usually enable the 
> Spark external shuffle service and store the intermediate shuffle files on 
> HDD. Because the number of blocks generated for a particular shuffle grows 
> quadratically with the size of the shuffled data (# mappers and # reducers 
> grow linearly with the size of the shuffled data, but # blocks is # mappers * 
> # reducers), one general trend we have observed is that the more data a Spark 
> application processes, the smaller the block size becomes. In a few production 
> clusters we have seen, the average shuffle block size is only tens of KBs. 
> Because of the inefficiency of performing random reads on HDD for small 
> amounts of data, the overall efficiency of the Spark external shuffle services 
> serving the shuffle blocks degrades as we see an increasing # of Spark 
> applications processing an increasing amount of data. In addition, because the 
> Spark external shuffle service is a shared service in a multi-tenancy cluster, 
> the inefficiency of one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers, and blocks get pre-merged 
> and moved towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark's existing 
> shuffle Netty protocol and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle to 
> Spark without incurring the dependency or overhead of either a specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Min Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen reopened SPARK-36378:
--

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Min Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated SPARK-36378:
-
Parent: SPARK-33235
Issue Type: Sub-task  (was: Bug)

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-08-01 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391324#comment-17391324
 ] 

Mridul Muralidharan commented on SPARK-30602:
-

Thanks for all the work in getting this feature into Apache Spark.
In no particular order:

Thanks to [~mshen], [~csingh], [~vsowrirajan], [~zhouyejoe] for all the work in 
getting this in!
Thanks to all the reviewers: [~Ngone51], [~attilapiros], [~tgraves], 
[~dongjoon], [~hyukjin.kwon], [~Gengliang.Wang].

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When running Spark on YARN at large scale, people usually enable the 
> Spark external shuffle service and store the intermediate shuffle files on 
> HDD. Because the number of blocks generated for a particular shuffle grows 
> quadratically with the size of the shuffled data (# mappers and # reducers 
> grow linearly with the size of the shuffled data, but # blocks is # mappers * 
> # reducers), one general trend we have observed is that the more data a Spark 
> application processes, the smaller the block size becomes. In a few production 
> clusters we have seen, the average shuffle block size is only tens of KBs. 
> Because of the inefficiency of performing random reads on HDD for small 
> amounts of data, the overall efficiency of the Spark external shuffle services 
> serving the shuffle blocks degrades as we see an increasing # of Spark 
> applications processing an increasing amount of data. In addition, because the 
> Spark external shuffle service is a shared service in a multi-tenancy cluster, 
> the inefficiency of one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers, and blocks get pre-merged 
> and moved towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark's existing 
> shuffle Netty protocol and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle to 
> Spark without incurring the dependency or overhead of either a specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Min Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated SPARK-36378:
-
Parent: (was: SPARK-30602)
Issue Type: Bug  (was: Sub-task)

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Min Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391322#comment-17391322
 ] 

Min Shen commented on SPARK-36378:
--

If moving this outside of the SPIP is preferred, then I will move it under 
SPARK-33235 and reopen.

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36266) Rename classes in shuffle RPC used for block push operations

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-36266:
---

Assignee: Min Shen

> Rename classes in shuffle RPC used for block push operations
> 
>
> Key: SPARK-36266
> URL: https://issues.apache.org/jira/browse/SPARK-36266
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Min Shen
>Priority: Major
> Fix For: 3.2.0
>
>
> In the current implementation of push-based shuffle, we are reusing certain 
> code between both block fetch and block push.
> This is generally good except that certain classes that are meant to be used 
> for both block fetch and block push now have names that indicate they are 
> only for block fetches, which is confusing.
> This ticket renames these classes to be more generic to be reused across both 
> block fetch and block push.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Min Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391320#comment-17391320
 ] 

Min Shen commented on SPARK-36378:
--

I would prefer to merge this in if possible.

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-30602.
-
Resolution: Fixed

The only pending task here is documentation.
[~vsowrirajan] has opened a PR for that, but I am marking this jira as Resolved 
for 3.2 as all changes have been merged to master and branch-3.2.

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When running Spark on YARN at large scale, people usually enable the 
> Spark external shuffle service and store the intermediate shuffle files on 
> HDD. Because the number of blocks generated for a particular shuffle grows 
> quadratically with the size of the shuffled data (# mappers and # reducers 
> grow linearly with the size of the shuffled data, but # blocks is # mappers * 
> # reducers), one general trend we have observed is that the more data a Spark 
> application processes, the smaller the block size becomes. In a few production 
> clusters we have seen, the average shuffle block size is only tens of KBs. 
> Because of the inefficiency of performing random reads on HDD for small 
> amounts of data, the overall efficiency of the Spark external shuffle services 
> serving the shuffle blocks degrades as we see an increasing # of Spark 
> applications processing an increasing amount of data. In addition, because the 
> Spark external shuffle service is a shared service in a multi-tenancy cluster, 
> the inefficiency of one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers, and blocks get pre-merged 
> and moved towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark's existing 
> shuffle Netty protocol and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle to 
> Spark without incurring the dependency or overhead of either a specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-30602:
---

Assignee: Mridul Muralidharan

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When running Spark on YARN at large scale, people usually enable the 
> Spark external shuffle service and store the intermediate shuffle files on 
> HDD. Because the number of blocks generated for a particular shuffle grows 
> quadratically with the size of the shuffled data (# mappers and # reducers 
> grow linearly with the size of the shuffled data, but # blocks is # mappers * 
> # reducers), one general trend we have observed is that the more data a Spark 
> application processes, the smaller the block size becomes. In a few production 
> clusters we have seen, the average shuffle block size is only tens of KBs. 
> Because of the inefficiency of performing random reads on HDD for small 
> amounts of data, the overall efficiency of the Spark external shuffle services 
> serving the shuffle blocks degrades as we see an increasing # of Spark 
> applications processing an increasing amount of data. In addition, because the 
> Spark external shuffle service is a shared service in a multi-tenancy cluster, 
> the inefficiency of one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers, and blocks get pre-merged 
> and moved towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark's existing 
> shuffle Netty protocol and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle to 
> Spark without incurring the dependency or overhead of either a specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32923) Add support to properly handle different type of stage retries

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-32923:

Fix Version/s: 3.2.0

> Add support to properly handle different type of stage retries
> --
>
> Key: SPARK-32923
> URL: https://issues.apache.org/jira/browse/SPARK-32923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.2.0
>
>
> In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was 
> introduced, which would be handled differently if retried.
> Since this was added to address a data correctness issue, we should also add 
> support for it in push-based shuffle, so that we would be able to roll back 
> the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE 
> stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32923) Add support to properly handle different type of stage retries

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32923:
---

Assignee: Venkata krishnan Sowrirajan

> Add support to properly handle different type of stage retries
> --
>
> Key: SPARK-32923
> URL: https://issues.apache.org/jira/browse/SPARK-32923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Venkata krishnan Sowrirajan
>Priority: Major
>
> In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was 
> introduced, which would be handled differently if retried.
> Since this was added to address a data correctness issue, we should also add 
> support for it in push-based shuffle, so that we would be able to roll back 
> the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE 
> stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32923) Add support to properly handle different type of stage retries

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-32923:

Comment: was deleted

(was: This has been handled by SPARK-32923)

> Add support to properly handle different type of stage retries
> --
>
> Key: SPARK-32923
> URL: https://issues.apache.org/jira/browse/SPARK-32923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was 
> introduced, which would be handled differently if retried.
> Since this was added to address a data correctness issue, we should also add 
> support for these in push-based shuffle, so that we would be able to rollback 
> the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE 
> stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32923) Add support to properly handle different type of stage retries

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32923.
-
Resolution: Fixed

This has been handled by SPARK-32923

> Add support to properly handle different type of stage retries
> --
>
> Key: SPARK-32923
> URL: https://issues.apache.org/jira/browse/SPARK-32923
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was 
> introduced, which would be handled differently if retried.
> Since this was added to address a data correctness issue, we should also add 
> support for these in push-based shuffle, so that we would be able to rollback 
> the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE 
> stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-36378:
---

Assignee: Mridul Muralidharan

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-36378.
-
Resolution: Won't Fix

Let us move this outside of the SPIP and into individual JIRAs and follow-up 
work.

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Assignee: Mridul Muralidharan
>Priority: Major
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35917) Disable push-based shuffle until the feature is complete

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-35917:
---

Assignee: Mridul Muralidharan

> Disable push-based shuffle until the feature is complete
> 
>
> Key: SPARK-35917
> URL: https://issues.apache.org/jira/browse/SPARK-35917
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Mridul Muralidharan
>Priority: Major
>
> Push-based shuffle is partially merged in Apache master, but some of the 
> tasks are still incomplete. Since the 3.2 branch is going to be cut soon, we 
> will not be able to get the pending tasks reviewed and merged. A few of the 
> pending tasks make protocol changes to the push-based shuffle protocols, so we 
> would like to prevent users from enabling push-based shuffle both on the 
> client and the server until the push-based shuffle implementation is complete. 
> We can prevent push-based shuffle from being used by throwing 
> {{UnsupportedOperationException}} (or something like that) both on the client 
> and the server when the user tries to enable it.
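
A minimal sketch of the kind of guard proposed here, assuming a hypothetical validation hook ({{validatePushBasedShuffle}}) rather than Spark's actual code path:

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical guard: reject push-based shuffle until the feature is complete.
// Spark's real check lives elsewhere; this only illustrates the proposal.
def validatePushBasedShuffle(conf: SparkConf): Unit = {
  if (conf.getBoolean("spark.shuffle.push.enabled", defaultValue = false)) {
    throw new UnsupportedOperationException(
      "Push-based shuffle is not yet complete and cannot be enabled.")
  }
}
{code}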



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35917) Disable push-based shuffle until the feature is complete

2021-08-01 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-35917.
-
Resolution: Won't Fix

> Disable push-based shuffle until the feature is complete
> 
>
> Key: SPARK-35917
> URL: https://issues.apache.org/jira/browse/SPARK-35917
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> Push-based shuffle is partially merged in Apache master, but some of the 
> tasks are still incomplete. Since the 3.2 branch is going to be cut soon, we 
> will not be able to get the pending tasks reviewed and merged. A few of the 
> pending tasks make protocol changes to the push-based shuffle protocols, so we 
> would like to prevent users from enabling push-based shuffle both on the 
> client and the server until the push-based shuffle implementation is complete. 
> We can prevent push-based shuffle from being used by throwing 
> {{UnsupportedOperationException}} (or something like that) both on the client 
> and the server when the user tries to enable it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35917) Disable push-based shuffle until the feature is complete

2021-08-01 Thread Mridul Muralidharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391313#comment-17391313
 ] 

Mridul Muralidharan commented on SPARK-35917:
-

Closing this JIRA, as push-based shuffle has been merged.

Thanks for the ping, [~Gengliang.Wang]!

> Disable push-based shuffle until the feature is complete
> 
>
> Key: SPARK-35917
> URL: https://issues.apache.org/jira/browse/SPARK-35917
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> Push-based shuffle is partially merged in Apache master, but some of the 
> tasks are still incomplete. Since the 3.2 branch is going to be cut soon, we 
> will not be able to get the pending tasks reviewed and merged. A few of the 
> pending tasks make protocol changes to the push-based shuffle protocols, so we 
> would like to prevent users from enabling push-based shuffle both on the 
> client and the server until the push-based shuffle implementation is complete. 
> We can prevent push-based shuffle from being used by throwing 
> {{UnsupportedOperationException}} (or something like that) both on the client 
> and the server when the user tries to enable it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Min Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated SPARK-36378:
-
Description: 
With the SPIP ticket close to being finished, we have done some performance 
evaluations to compare the performance of push-based shuffle in upstream Spark 
with the production version we have internally at LinkedIn.

The evaluations have revealed a few regressions and also some additional perf 
improvement opportunities.

 

> Minor changes to address a few identified server side inefficiencies
> 
>
> Key: SPARK-36378
> URL: https://issues.apache.org/jira/browse/SPARK-36378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Min Shen
>Priority: Major
>
> With the SPIP ticket close to being finished, we have done some performance 
> evaluations to compare the performance of push-based shuffle in upstream 
> Spark with the production version we have internally at LinkedIn.
> The evaluations have revealed a few regressions and also some additional perf 
> improvement opportunities.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36377) Fix documentation in spark-env.sh.template

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391298#comment-17391298
 ] 

Apache Spark commented on SPARK-36377:
--

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/33604

> Fix documentation in spark-env.sh.template
> --
>
> Key: SPARK-36377
> URL: https://issues.apache.org/jira/browse/SPARK-36377
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Submit
>Affects Versions: 3.1.2
>Reporter: Yuto Akutsu
>Priority: Major
>
> Some options in the "Options read in YARN client/cluster mode" section in 
> spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, 
> SPARK_EXECUTOR_CORES, etc.), so we should re-document them to help users 
> distinguish what's only read in YARN mode from what's not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35917) Disable push-based shuffle until the feature is complete

2021-08-01 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391299#comment-17391299
 ] 

Gengliang Wang commented on SPARK-35917:


[~csingh] Shall we mark this one as "Won't Do" since 
https://github.com/apache/spark/pull/33034 is merged?
cc [~mridul]

> Disable push-based shuffle until the feature is complete
> 
>
> Key: SPARK-35917
> URL: https://issues.apache.org/jira/browse/SPARK-35917
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> Push-based shuffle is partially merged in Apache master, but some of the 
> tasks are still incomplete. Since the 3.2 branch is going to be cut soon, we 
> will not be able to get the pending tasks reviewed and merged. A few of the 
> pending tasks make protocol changes to the push-based shuffle protocols, so we 
> would like to prevent users from enabling push-based shuffle both on the 
> client and the server until the push-based shuffle implementation is complete. 
> We can prevent push-based shuffle from being used by throwing 
> {{UnsupportedOperationException}} (or something like that) both on the client 
> and the server when the user tries to enable it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36377) Fix documentation in spark-env.sh.template

2021-08-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36377:


Assignee: Apache Spark

> Fix documentation in spark-env.sh.template
> --
>
> Key: SPARK-36377
> URL: https://issues.apache.org/jira/browse/SPARK-36377
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Submit
>Affects Versions: 3.1.2
>Reporter: Yuto Akutsu
>Assignee: Apache Spark
>Priority: Major
>
> Some options in the "Options read in YARN client/cluster mode" section in 
> spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, 
> SPARK_EXECUTOR_CORES, etc.), so we should re-document them to help users 
> distinguish what's only read in YARN mode from what's not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36377) Fix documentation in spark-env.sh.template

2021-08-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36377:


Assignee: (was: Apache Spark)

> Fix documentation in spark-env.sh.template
> --
>
> Key: SPARK-36377
> URL: https://issues.apache.org/jira/browse/SPARK-36377
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Submit
>Affects Versions: 3.1.2
>Reporter: Yuto Akutsu
>Priority: Major
>
> Some options in the "Options read in YARN client/cluster mode" section in 
> spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, 
> SPARK_EXECUTOR_CORES, etc.), so we should re-document them to help users 
> distinguish what's only read in YARN mode from what's not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36377) Fix documentation in spark-env.sh.template

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391297#comment-17391297
 ] 

Apache Spark commented on SPARK-36377:
--

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/33604

> Fix documentation in spark-env.sh.template
> --
>
> Key: SPARK-36377
> URL: https://issues.apache.org/jira/browse/SPARK-36377
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Submit
>Affects Versions: 3.1.2
>Reporter: Yuto Akutsu
>Priority: Major
>
> Some options in the "Options read in YARN client/cluster mode" section in 
> spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, 
> SPARK_EXECUTOR_CORES, etc.), so we should re-document them to help users 
> distinguish what's only read in YARN mode from what's not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36378) Minor changes to address a few identified server side inefficiencies

2021-08-01 Thread Min Shen (Jira)
Min Shen created SPARK-36378:


 Summary: Minor changes to address a few identified server side 
inefficiencies
 Key: SPARK-36378
 URL: https://issues.apache.org/jira/browse/SPARK-36378
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Affects Versions: 3.2.0
Reporter: Min Shen






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes

2021-08-01 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391294#comment-17391294
 ] 

dgd_contributor commented on SPARK-36303:
-

working on this

> Refactor fourteenth set of 20 query execution errors to use error classes
> -
>
> Key: SPARK-36303
> URL: https://issues.apache.org/jira/browse/SPARK-36303
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the fourteenth set of 20.
> {code:java}
> cannotGetEventTimeWatermarkError
> cannotSetTimeoutTimestampError
> batchMetadataFileNotFoundError
> multiStreamingQueriesUsingPathConcurrentlyError
> addFilesWithAbsolutePathUnsupportedError
> microBatchUnsupportedByDataSourceError
> cannotExecuteStreamingRelationExecError
> invalidStreamingOutputModeError
> catalogPluginClassNotFoundError
> catalogPluginClassNotImplementedError
> catalogPluginClassNotFoundForCatalogError
> catalogFailToFindPublicNoArgConstructorError
> catalogFailToCallPublicNoArgConstructorError
> cannotInstantiateAbstractCatalogPluginClassError
> failedToInstantiateConstructorForCatalogError
> noSuchElementExceptionError
> noSuchElementExceptionError
> cannotMutateReadOnlySQLConfError
> cannotCloneOrCopyReadOnlySQLConfError
> cannotGetSQLConfInSchedulerEventLoopThreadError
> {code}
> For more detail, see the parent ticket SPARK-36094.
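
For a sense of what this refactoring looks like, here is a hedged before/after sketch. The error-class name, the {{SparkUnsupportedOperationException}} stand-in, and its constructor below are illustrative assumptions, not the actual Spark code.

{code:scala}
// Hypothetical stand-in for Spark's error-class-aware exception types.
class SparkUnsupportedOperationException(
    errorClass: String,
    messageParameters: Array[String])
  extends UnsupportedOperationException(
    s"[$errorClass] ${messageParameters.mkString(", ")}")

// Before: an ad-hoc, hard-coded message (wording here is illustrative).
def cannotGetEventTimeWatermarkErrorOld(): Throwable =
  new UnsupportedOperationException("Cannot get event time watermark")

// After (sketch): the message is keyed by an error class, so it can be
// defined, documented, and tested centrally (e.g. in error-classes.json).
def cannotGetEventTimeWatermarkError(): Throwable =
  new SparkUnsupportedOperationException(
    errorClass = "CANNOT_GET_EVENT_TIME_WATERMARK",
    messageParameters = Array.empty[String])
{code}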



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36302) Refactor thirteenth set of 20 query execution errors to use error classes

2021-08-01 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391283#comment-17391283
 ] 

dgd_contributor commented on SPARK-36302:
-

working on this.

> Refactor thirteenth set of 20 query execution errors to use error classes
> -
>
> Key: SPARK-36302
> URL: https://issues.apache.org/jira/browse/SPARK-36302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the thirteenth set of 20.
> {code:java}
> serDeInterfaceNotFoundError
> convertHiveTableToCatalogTableError
> cannotRecognizeHiveTypeError
> getTablesByTypeUnsupportedByHiveVersionError
> dropTableWithPurgeUnsupportedError
> alterTableWithDropPartitionAndPurgeUnsupportedError
> invalidPartitionFilterError
> getPartitionMetadataByFilterError
> unsupportedHiveMetastoreVersionError
> loadHiveClientCausesNoClassDefFoundError
> cannotFetchTablesOfDatabaseError
> illegalLocationClauseForViewPartitionError
> renamePathAsExistsPathError
> renameAsExistsPathError
> renameSrcPathNotFoundError
> failedRenameTempFileError
> legacyMetadataPathExistsError
> partitionColumnNotFoundInSchemaError
> stateNotDefinedOrAlreadyRemovedError
> cannotSetTimeoutDurationError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36301) Refactor twelfth set of 20 query execution errors to use error classes

2021-08-01 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391279#comment-17391279
 ] 

dgd_contributor commented on SPARK-36301:
-

working on this

> Refactor twelfth set of 20 query execution errors to use error classes
> --
>
> Key: SPARK-36301
> URL: https://issues.apache.org/jira/browse/SPARK-36301
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the twelfth set of 20.
> {code:java}
> cannotRewriteDomainJoinWithConditionsError
> decorrelateInnerQueryThroughPlanUnsupportedError
> methodCalledInAnalyzerNotAllowedError
> cannotSafelyMergeSerdePropertiesError
> pairUnsupportedAtFunctionError
> onceStrategyIdempotenceIsBrokenForBatchError[TreeType
> structuralIntegrityOfInputPlanIsBrokenInClassError
> structuralIntegrityIsBrokenAfterApplyingRuleError
> ruleIdNotFoundForRuleError
> cannotCreateArrayWithElementsExceedLimitError
> indexOutOfBoundsOfArrayDataError
> malformedRecordsDetectedInRecordParsingError
> remoteOperationsUnsupportedError
> invalidKerberosConfigForHiveServer2Error
> parentSparkUIToAttachTabNotFoundError
> inferSchemaUnsupportedForHiveError
> requestedPartitionsMismatchTablePartitionsError
> dynamicPartitionKeyNotAmongWrittenPartitionPathsError
> cannotRemovePartitionDirError
> cannotCreateStagingDirError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36298) Refactor ninth set of 20 query execution errors to use error classes

2021-08-01 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391277#comment-17391277
 ] 

dgd_contributor commented on SPARK-36298:
-

working on this.

> Refactor ninth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36298
> URL: https://issues.apache.org/jira/browse/SPARK-36298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the ninth set of 20.
> {code:java}
> unscaledValueTooLargeForPrecisionError
> decimalPrecisionExceedsMaxPrecisionError
> outOfDecimalTypeRangeError
> unsupportedArrayTypeError
> unsupportedJavaTypeError
> failedParsingStructTypeError
> failedMergingFieldsError
> cannotMergeDecimalTypesWithIncompatiblePrecisionAndScaleError
> cannotMergeDecimalTypesWithIncompatiblePrecisionError
> cannotMergeDecimalTypesWithIncompatibleScaleError
> cannotMergeIncompatibleDataTypesError
> exceedMapSizeLimitError
> duplicateMapKeyFoundError
> mapDataKeyArrayLengthDiffersFromValueArrayLengthError
> fieldDiffersFromDerivedLocalDateError
> failToParseDateTimeInNewParserError
> failToFormatDateTimeInNewFormatterError
> failToRecognizePatternAfterUpgradeError
> failToRecognizePatternError
> cannotCastUTF8StringToDataTypeError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36299) Refactor tenth set of 20 query execution errors to use error classes

2021-08-01 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391278#comment-17391278
 ] 

dgd_contributor commented on SPARK-36299:
-

working on this.

> Refactor tenth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36299
> URL: https://issues.apache.org/jira/browse/SPARK-36299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR only focuses on 
> the tenth set of 20.
> {code:java}
> registeringStreamingQueryListenerError
> concurrentQueryInstanceError
> cannotParseJsonArraysAsStructsError
> cannotParseStringAsDataTypeError
> failToParseEmptyStringForDataTypeError
> failToParseValueForDataTypeError
> rootConverterReturnNullError
> cannotHaveCircularReferencesInBeanClassError
> cannotHaveCircularReferencesInClassError
> cannotUseInvalidJavaIdentifierAsFieldNameError
> cannotFindEncoderForTypeError
> attributesForTypeUnsupportedError
> schemaForTypeUnsupportedError
> cannotFindConstructorForTypeError
> paramExceedOneCharError
> paramIsNotIntegerError
> paramIsNotBooleanValueError
> foundNullValueForNotNullableFieldError
> malformedCSVRecordError
> elementsOfTupleExceedLimitError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-08-01 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391276#comment-17391276
 ] 

Gengliang Wang commented on SPARK-30602:


[~zhouyejoe] Great, thanks!

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grow linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amounts of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle Netty protocol, and the behaviors of Spark mappers, reducers, and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either a specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]
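
The quadratic growth in block count is easy to quantify; the job size below is hypothetical but consistent with the 10s-of-KBs average block size described above.

{code:scala}
// Back-of-the-envelope: #blocks = #mappers * #reducers, so the average block
// size shrinks as a job scales up even though mappers and reducers only grow
// linearly. Hypothetical job:
val shuffledBytes = 1L << 40                 // 1 TiB of shuffled data
val mappers = 5000
val reducers = 5000
val blocks = mappers.toLong * reducers       // 25 million blocks
val avgBlockBytes = shuffledBytes / blocks   // ~44 KB per block
println(s"average shuffle block size: $avgBlockBytes bytes")
{code}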



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36377) Fix documentation in spark-env.sh.template

2021-08-01 Thread Yuto Akutsu (Jira)
Yuto Akutsu created SPARK-36377:
---

 Summary: Fix documentation in spark-env.sh.template
 Key: SPARK-36377
 URL: https://issues.apache.org/jira/browse/SPARK-36377
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Spark Submit
Affects Versions: 3.1.2
Reporter: Yuto Akutsu


Some options in the "Options read in YARN client/cluster mode" section in 
spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, 
SPARK_EXECUTOR_CORES, etc.), so we should re-document them to help users 
distinguish what's only read in YARN mode from what's not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36376) Collapse repartitions if there is a project between them

2021-08-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36376:


Assignee: (was: Apache Spark)

> Collapse repartitions if there is a project between them
> 
>
> Key: SPARK-36376
> URL: https://issues.apache.org/jira/browse/SPARK-36376
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:scala}
> testRelation.distribute('a, 'b)(10).select('a).distribute('a)(20)
> {code}
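
In DataFrame API terms, the plan shape above corresponds to something like the sketch below (an illustration, assuming this extends the existing {{CollapseRepartition}} optimizer rule, which merges adjacent repartitions; whether the two can be merged depends on the projection preserving the partitioning columns).

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A repartition, a projection, then another repartition. The proposal is to
// collapse through the intervening Project so only the outer repartition
// survives in the optimized plan.
val df = spark.range(0, 100).selectExpr("id AS a", "id % 10 AS b")
val shaped = df.repartition(10, $"a", $"b").select($"a").repartition(20, $"a")
shaped.explain(true)  // inspect the optimized plan
{code}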



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36376) Collapse repartitions if there is a project between them

2021-08-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36376:


Assignee: Apache Spark

> Collapse repartitions if there is a project between them
> 
>
> Key: SPARK-36376
> URL: https://issues.apache.org/jira/browse/SPARK-36376
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> For example:
> {code:scala}
> testRelation.distribute('a, 'b)(10).select('a).distribute('a)(20)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36376) Collapse repartitions if there is a project between them

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391269#comment-17391269
 ] 

Apache Spark commented on SPARK-36376:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/33603

> Collapse repartitions if there is a project between them
> 
>
> Key: SPARK-36376
> URL: https://issues.apache.org/jira/browse/SPARK-36376
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:scala}
> testRelation.distribute('a, 'b)(10).select('a).distribute('a)(20)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36375) Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 'multi-stage job'

2021-08-01 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391268#comment-17391268
 ] 

wuyi commented on SPARK-36375:
--

[~hyukjin.kwon] I'd like to take a look first. Thanks for the ping.

> Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 
> 'multi-stage job'
> -
>
> Key: SPARK-36375
> URL: https://issues.apache.org/jira/browse/SPARK-36375
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/3216286546
> {code}
> [info] BasicSchedulerIntegrationSuite:
> [info] - super simple job *** FAILED *** (56 milliseconds)
> [info]   Map() did not equal Map(0 -> 42, 5 -> 42, 1 -> 42, 6 -> 42, 9 -> 42, 
> 2 -> 42, 7 -> 42, 3 -> 42, 8 -> 42, 4 -> 42) 
> (SchedulerIntegrationSuite.scala:545)
> [info]   Analysis:
> [info]   HashMap(0: -> 42, 1: -> 42, 2: -> 42, 3: -> 42, 4: -> 42, 5: -> 42, 
> 6: -> 42, 7: -> 42, 8: -> 42, 9: -> 42)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.scheduler.BasicSchedulerIntegrationSuite.$anonfun$new$1(SchedulerIntegrationSuite.scala:545)
> [info]   at 
> org.apache.spark.scheduler.SchedulerIntegrationSuite.$anonfun$testScheduler$1(SchedulerIntegrationSuite.scala:98)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
> [info]   at org.scalatest.Suite.run(Suite.scala:1112)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1094)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.

[jira] [Created] (SPARK-36376) Collapse repartitions if there is a project between them

2021-08-01 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36376:
---

 Summary: Collapse repartitions if there is a project between them
 Key: SPARK-36376
 URL: https://issues.apache.org/jira/browse/SPARK-36376
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yuming Wang


For example:
{code:scala}
testRelation.distribute('a, 'b)(10).select('a).distribute('a)(20)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36375) Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 'multi-stage job'

2021-08-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391264#comment-17391264
 ] 

Hyukjin Kwon commented on SPARK-36375:
--

[~wuyi] do you have any idea on this?

> Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 
> 'multi-stage job'
> -
>
> Key: SPARK-36375
> URL: https://issues.apache.org/jira/browse/SPARK-36375
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/3216286546
> {code}
> [info] BasicSchedulerIntegrationSuite:
> [info] - super simple job *** FAILED *** (56 milliseconds)
> [info]   Map() did not equal Map(0 -> 42, 5 -> 42, 1 -> 42, 6 -> 42, 9 -> 42, 
> 2 -> 42, 7 -> 42, 3 -> 42, 8 -> 42, 4 -> 42) 
> (SchedulerIntegrationSuite.scala:545)
> [info]   Analysis:
> [info]   HashMap(0: -> 42, 1: -> 42, 2: -> 42, 3: -> 42, 4: -> 42, 5: -> 42, 
> 6: -> 42, 7: -> 42, 8: -> 42, 9: -> 42)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.scheduler.BasicSchedulerIntegrationSuite.$anonfun$new$1(SchedulerIntegrationSuite.scala:545)
> [info]   at 
> org.apache.spark.scheduler.SchedulerIntegrationSuite.$anonfun$testScheduler$1(SchedulerIntegrationSuite.scala:98)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
> [info]   at org.scalatest.Suite.run(Suite.scala:1112)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1094)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> [

[jira] [Updated] (SPARK-36375) Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 'multi-stage job'

2021-08-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36375:
-
Issue Type: Test  (was: Improvement)

> Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 
> 'multi-stage job'
> -
>
> Key: SPARK-36375
> URL: https://issues.apache.org/jira/browse/SPARK-36375
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/apache/spark/runs/3216286546
> {code}
> [info] BasicSchedulerIntegrationSuite:
> [info] - super simple job *** FAILED *** (56 milliseconds)
> [info]   Map() did not equal Map(0 -> 42, 5 -> 42, 1 -> 42, 6 -> 42, 9 -> 42, 
> 2 -> 42, 7 -> 42, 3 -> 42, 8 -> 42, 4 -> 42) 
> (SchedulerIntegrationSuite.scala:545)
> [info]   Analysis:
> [info]   HashMap(0: -> 42, 1: -> 42, 2: -> 42, 3: -> 42, 4: -> 42, 5: -> 42, 
> 6: -> 42, 7: -> 42, 8: -> 42, 9: -> 42)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
> [info]   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
> [info]   at 
> org.apache.spark.scheduler.BasicSchedulerIntegrationSuite.$anonfun$new$1(SchedulerIntegrationSuite.scala:545)
> [info]   at 
> org.apache.spark.scheduler.SchedulerIntegrationSuite.$anonfun$testScheduler$1(SchedulerIntegrationSuite.scala:98)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> [info]   at scala.collection.immutable.List.foreach(List.scala:431)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
> [info]   at org.scalatest.Suite.run(Suite.scala:1112)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1094)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
> [info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
> [info]   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> [info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeA

[jira] [Created] (SPARK-36375) Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 'multi-stage job'

2021-08-01 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36375:


 Summary: Flaky Test: BasicSchedulerIntegrationSuite - 'super 
simple job' and 'multi-stage job'
 Key: SPARK-36375
 URL: https://issues.apache.org/jira/browse/SPARK-36375
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


https://github.com/apache/spark/runs/3216286546

{code}
[info] BasicSchedulerIntegrationSuite:
[info] - super simple job *** FAILED *** (56 milliseconds)
[info]   Map() did not equal Map(0 -> 42, 5 -> 42, 1 -> 42, 6 -> 42, 9 -> 42, 2 
-> 42, 7 -> 42, 3 -> 42, 8 -> 42, 4 -> 42) (SchedulerIntegrationSuite.scala:545)
[info]   Analysis:
[info]   HashMap(0: -> 42, 1: -> 42, 2: -> 42, 3: -> 42, 4: -> 42, 5: -> 42, 6: 
-> 42, 7: -> 42, 8: -> 42, 9: -> 42)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at 
org.apache.spark.scheduler.BasicSchedulerIntegrationSuite.$anonfun$new$1(SchedulerIntegrationSuite.scala:545)
[info]   at 
org.apache.spark.scheduler.SchedulerIntegrationSuite.$anonfun$testScheduler$1(SchedulerIntegrationSuite.scala:98)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
[info]   at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Suite.run(Suite.scala:1112)
[info]   at org.scalatest.Suite.run$(Suite.scala:1094)
[info]   at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
[info]   at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
[info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at 
jav

[jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS

2021-08-01 Thread Yu Gan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391259#comment-17391259
 ] 

Yu Gan commented on SPARK-20415:


Did you find the root cause? I came across the same issue in our Azure 
environment.

org.apache.spark.unsafe.Platform.copyMemory

...

org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask

...

> SPARK job hangs while writing DataFrame to HDFS
> ---
>
> Key: SPARK-20415
> URL: https://issues.apache.org/jira/browse/SPARK-20415
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, YARN
>Affects Versions: 2.1.0
> Environment: EMR 5.4.0
>Reporter: P K
>Priority: Major
>  Labels: bulk-closed
>
> We are in a POC phase with Spark. One of the steps reads compressed JSON 
> files that come from sources, "explodes" them into tabular format, and then 
> writes them to HDFS. This worked for about three weeks until, a few days ago, 
> for a particular dataset, the writer started to just hang. I logged in to the 
> worker machines and saw this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 
> tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000]
>java.lang.Thread.State: RUNNABLE
> at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> The last messages ever printed in stderr before the hang are:
> 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at 
> NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
> 17/04/18 01:41:14 INFO DAGSc
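
For context, the job pattern described in the report above is roughly the 
following; a minimal sketch, assuming hypothetical HDFS paths and column names 
(none are taken from the original report):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().appName("explode-and-write").getOrCreate()

// Compressed JSON is decompressed transparently based on the file extension.
val raw = spark.read.json("hdfs:///incoming/*.json.gz")

// "Explode" a nested array column ("events" is a hypothetical name) into
// one row per element, then flatten the struct fields into top-level columns.
val tabular = raw
  .withColumn("event", explode(col("events")))
  .select(col("id"), col("event.*"))

// The write phase is the step the report describes as hanging.
tabular.write.mode("overwrite").parquet("hdfs:///output/")
{code}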

[jira] [Assigned] (SPARK-36372) ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for v2 command

2021-08-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36372:


Assignee: (was: Apache Spark)

> ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for 
> v2 command
> 
>
> Key: SPARK-36372
> URL: https://issues.apache.org/jira/browse/SPARK-36372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> ALTER TABLE ADD COLUMNS currently doesn't check for duplicates among the 
> specified columns for the v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36372) ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for v2 command

2021-08-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36372:


Assignee: Apache Spark

> ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for 
> v2 command
> 
>
> Key: SPARK-36372
> URL: https://issues.apache.org/jira/browse/SPARK-36372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> ALTER TABLE ADD COLUMNS currently doesn't check for duplicates among the 
> specified columns for the v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36372) ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for v2 command

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391258#comment-17391258
 ] 

Apache Spark commented on SPARK-36372:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/33600

> ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for 
> v2 command
> 
>
> Key: SPARK-36372
> URL: https://issues.apache.org/jira/browse/SPARK-36372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> ALTER TABLE ADD COLUMNS currently doesn't check for duplicates among the 
> specified columns for the v2 command.
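
To make the gap concrete, a hedged sketch (the catalog, table, and column 
names are hypothetical, not from the report); the second statement is 
currently accepted by the v2 command even though both new columns are named 
{{data}}:
{code:scala}
// Assumes a session with a DataSource V2 catalog registered as "testcat".
spark.sql("CREATE TABLE testcat.ns.t (id INT) USING foo")

// Both specified columns share the name "data"; this should fail with a
// duplicate-column error, but the v2 ALTER TABLE ADD COLUMNS doesn't check it.
spark.sql("ALTER TABLE testcat.ns.t ADD COLUMNS (data STRING, data INT)")
{code}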



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36374) Push-based shuffle documentation

2021-08-01 Thread Venkata krishnan Sowrirajan (Jira)
Venkata krishnan Sowrirajan created SPARK-36374:
---

 Summary: Push-based shuffle documentation
 Key: SPARK-36374
 URL: https://issues.apache.org/jira/browse/SPARK-36374
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.2.0
Reporter: Venkata krishnan Sowrirajan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36329) show api of Dataset should get as input the output method

2021-08-01 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391159#comment-17391159
 ] 

Izek Greenfield commented on SPARK-36329:
-

# If you do it like that, you have to write it again and again.
 # It is very handy to have it in the same format as show.

> show api of Dataset should get as input the output method
> -
>
> Key: SPARK-36329
> URL: https://issues.apache.org/jira/browse/SPARK-36329
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Izek Greenfield
>Priority: Major
>
> Currently, show is:
> {code:scala}
> def show(numRows: Int, truncate: Boolean): Unit = if (truncate) {
>     println(showString(numRows, truncate = 20))
>   } else {
>     println(showString(numRows, truncate = 0))
>   }
> {code}
> It could be turned into:
> {code:scala}
> def show(numRows: Int, truncate: Boolean, out: String => Unit = println): Unit = if (truncate) {
>     out(showString(numRows, truncate = 20))
>   } else {
>     out(showString(numRows, truncate = 0))
>   }
> {code}
> so the user will be able to send that to a file/log...
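
A hedged usage sketch of the proposed signature (the logger setup is 
illustrative and not part of the proposal):
{code:scala}
// With the extra `out` parameter, the rendered table can be routed anywhere
// that accepts a String, instead of always going to stdout.
val logger = org.slf4j.LoggerFactory.getLogger("ShowExample")

df.show(20, truncate = true, out = s => logger.info(s)) // route to a log
df.show(20, truncate = true)                            // default: println
{code}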



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36373) DecimalPrecision only add necessary cast

2021-08-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36373:


Assignee: (was: Apache Spark)

> DecimalPrecision only add necessary cast
> 
>
> Key: SPARK-36373
> URL: https://issues.apache.org/jira/browse/SPARK-36373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {noformat}
> EqualTo(AttributeReference("d1", DecimalType(5, 2))(), 
> AttributeReference("d2", DecimalType(2, 1))())
> {noformat}
> It will add a useless cast to {{d1}}:
> {noformat}
> (cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2)))
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36373) DecimalPrecision only add necessary cast

2021-08-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36373:


Assignee: Apache Spark

> DecimalPrecision only add necessary cast
> 
>
> Key: SPARK-36373
> URL: https://issues.apache.org/jira/browse/SPARK-36373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> For example:
> {noformat}
> EqualTo(AttributeReference("d1", DecimalType(5, 2))(), 
> AttributeReference("d2", DecimalType(2, 1))())
> {noformat}
> It will add a useless cast to {{d1}}:
> {noformat}
> (cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2)))
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36373) DecimalPrecision only add necessary cast

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391150#comment-17391150
 ] 

Apache Spark commented on SPARK-36373:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/33602

> DecimalPrecision only add necessary cast
> 
>
> Key: SPARK-36373
> URL: https://issues.apache.org/jira/browse/SPARK-36373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {noformat}
> EqualTo(AttributeReference("d1", DecimalType(5, 2))(), 
> AttributeReference("d2", DecimalType(2, 1))())
> {noformat}
> It will add a useless cast to {{d1}}:
> {noformat}
> (cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2)))
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36092) Migrate to GitHub Actions Codecov from Jenkins

2021-08-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36092.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33591
[https://github.com/apache/spark/pull/33591]

> Migrate to GitHub Actions Codecov from Jenkins
> --
>
> Key: SPARK-36092
> URL: https://issues.apache.org/jira/browse/SPARK-36092
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> We currently use the manual Codecov site to work around our Jenkins CI security 
> issue. Now that we use GitHub Actions, we can leverage Codecov to report the 
> coverage for PySpark.
> See also https://github.com/codecov/codecov-action



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36092) Migrate to GitHub Actions Codecov from Jenkins

2021-08-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36092:


Assignee: Hyukjin Kwon

> Migrate to GitHub Actions Codecov from Jenkins
> --
>
> Key: SPARK-36092
> URL: https://issues.apache.org/jira/browse/SPARK-36092
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We currently use the manual Codecov site to work around our Jenkins CI security 
> issue. Now that we use GitHub Actions, we can leverage Codecov to report the 
> coverage for PySpark.
> See also https://github.com/codecov/codecov-action



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36373) DecimalPrecision only add necessary cast

2021-08-01 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36373:
---

 Summary: DecimalPrecision only add necessary cast
 Key: SPARK-36373
 URL: https://issues.apache.org/jira/browse/SPARK-36373
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yuming Wang


For example:
{noformat}
EqualTo(AttributeReference("d1", DecimalType(5, 2))(), AttributeReference("d2", 
DecimalType(2, 1))())
{noformat}
It will add a useless cast to {{d1}}:
{noformat}
(cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2)))
{noformat}
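
A hedged way to observe this from SQL (same column names as above; the 
double-sided cast is what the analyzer currently produces):
{code:scala}
val df = spark.sql(
  "SELECT CAST(1.23 AS DECIMAL(5,2)) AS d1, CAST(1.2 AS DECIMAL(2,1)) AS d2")

// The analyzed plan renders the comparison with casts on both sides:
//   (cast(d1 as decimal(5,2)) = cast(d2 as decimal(5,2)))
// d1 is already decimal(5,2), so the cast on the left is a no-op.
df.filter("d1 = d2").explain(true)
{code}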



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36362) Omnibus Java code static analyzer warning fixes

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1739#comment-1739
 ] 

Apache Spark commented on SPARK-36362:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33601

> Omnibus Java code static analyzer warning fixes
> ---
>
> Key: SPARK-36362
> URL: https://issues.apache.org/jira/browse/SPARK-36362
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.3.0
>
>
> Inspired by a recent Java code touch-up, I wanted to fix in one pass several 
> lingering non-trivial issues with the Java code that a static analyzer turns 
> up. Only a few of these have material effects, but some do, and I figured we 
> could avoid taking N PRs over time to address them.
> * Some int*int multiplications whose results widen to long but could overflow 
> first (see the sketch after this list)
> * Unnecessarily non-static inner classes
> * Some tests that "catch (AssertionError)" and do nothing
> * Manual array iteration vs. the very slightly faster/simpler foreach
> * Incorrect generic types that just happen to not cause a runtime error
> * Missed opportunities for try-with-resources
> * Mutable enums which shouldn't be
> * .. and a few other minor things
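
For the first bullet, a minimal Scala sketch of the overflow pattern (the 
actual fixes are in Java; the values here are illustrative):
{code:scala}
val numBlocks: Int = 100000
val blockSize: Int = 65536

// Bug: the multiplication is evaluated in Int and wraps around before the
// result is widened to Long, so `wrong` ends up negative.
val wrong: Long = numBlocks * blockSize

// Fix: widen one operand first so the multiplication is evaluated in Long.
val right: Long = numBlocks.toLong * blockSize // 6553600000
{code}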



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36362) Omnibus Java code static analyzer warning fixes

2021-08-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391110#comment-17391110
 ] 

Apache Spark commented on SPARK-36362:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33601

> Omnibus Java code static analyzer warning fixes
> ---
>
> Key: SPARK-36362
> URL: https://issues.apache.org/jira/browse/SPARK-36362
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.3.0
>
>
> Inspired by a recent Java code touch-up, I wanted to fix in one pass several 
> lingering non-trivial issues with the Java code that a static analyzer turns 
> up. Only a few of these have material effects, but some do, and I figured we 
> could avoid taking N PRs over time to address them.
> * Some int*int multiplications whose results widen to long but could overflow 
> first
> * Unnecessarily non-static inner classes
> * Some tests that "catch (AssertionError)" and do nothing
> * Manual array iteration vs. the very slightly faster/simpler foreach
> * Incorrect generic types that just happen to not cause a runtime error
> * Missed opportunities for try-with-resources
> * Mutable enums which shouldn't be
> * .. and a few other minor things



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org