[jira] [Commented] (SPARK-36175) Support TimestampNTZ in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-36175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391385#comment-17391385 ] Apache Spark commented on SPARK-36175: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/33607 > Support TimestampNTZ in Avro data source > - > > Key: SPARK-36175 > URL: https://issues.apache.org/jira/browse/SPARK-36175 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > As per the Avro spec > https://avro.apache.org/docs/1.10.2/spec.html#Local+timestamp+%28microsecond+precision%29, > Spark can convert TimestampNTZ type from/to Avro's Local timestamp type. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
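For context, here is a minimal sketch of the conversion this ticket describes, assuming Spark 3.3+ with the external spark-avro module on the classpath; the path and column names are illustrative only, not taken from the ticket or the PR.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("ntz-avro-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// A TimestampNTZ column is expected to map to Avro's local-timestamp-micros logical type on
// write, and to be read back as TimestampNTZ rather than a session-zone-adjusted TIMESTAMP.
val df = Seq("2021-08-02 05:30:00").toDF("s")
  .select(col("s").cast("timestamp_ntz").as("ts_ntz"))

df.write.mode("overwrite").format("avro").save("/tmp/ntz_avro_sketch")
spark.read.format("avro").load("/tmp/ntz_avro_sketch").printSchema()
{code}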
[jira] [Commented] (SPARK-35815) Allow delayThreshold for watermark to be represented as ANSI day-time/year-month interval literals
[ https://issues.apache.org/jira/browse/SPARK-35815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391373#comment-17391373 ] Apache Spark commented on SPARK-35815: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/33606 > Allow delayThreshold for watermark to be represented as ANSI > day-time/year-month interval literals > -- > > Key: SPARK-35815 > URL: https://issues.apache.org/jira/browse/SPARK-35815 > Project: Spark > Issue Type: Sub-task > Components: SQL, Structured Streaming >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > delayThreshold parameter of DataFrame.withWatermark should handle ANSI > day-time/year-month interval literals. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
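A short sketch of what the change allows, assuming a build that includes this patch; the rate source is used only to provide a timestamp column to watermark on.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("watermark-interval-sketch").master("local[*]").getOrCreate()

// The rate source emits a `timestamp` column we can attach a watermark to.
val events = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

// Existing CalendarInterval string form:
val byLegacyString = events.withWatermark("timestamp", "10 seconds")

// ANSI day-time interval literal form that this ticket accepts for delayThreshold:
val byAnsiLiteral = events.withWatermark("timestamp", "INTERVAL '10' SECOND")
{code}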
[jira] [Updated] (SPARK-36379) Null at root level of a JSON array causes the parsing failure (w/ permissive mode)
[ https://issues.apache.org/jira/browse/SPARK-36379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36379: - Issue Type: Bug (was: Improvement) > Null at root level of a JSON array causes the parsing failure (w/ permissive > mode) > -- > > Key: SPARK-36379 > URL: https://issues.apache.org/jira/browse/SPARK-36379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > scala> spark.read.json(Seq("""[{"a": "str"}, null, {"a": > "str"}]""").toDS).collect() > {code} > {code} > ... > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1) (172.30.3.20 executor driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > {code} > Since the mode (by default) is permissive, we shouldn't just fail like above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36379) Null at root level of a JSON array causes the parsing failure (w/ permissive mode)
[ https://issues.apache.org/jira/browse/SPARK-36379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36379: - Priority: Minor (was: Major) > Null at root level of a JSON array causes the parsing failure (w/ permissive > mode) > -- > > Key: SPARK-36379 > URL: https://issues.apache.org/jira/browse/SPARK-36379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > {code} > scala> spark.read.json(Seq("""[{"a": "str"}, null, {"a": > "str"}]""").toDS).collect() > {code} > {code} > ... > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 > (TID 1) (172.30.3.20 executor driver): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) > {code} > Since the mode (by default) is permissive, we shouldn't just fail like above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36379) Null at root level of a JSON array causes the parsing failure (w/ permissive mode)
Hyukjin Kwon created SPARK-36379: Summary: Null at root level of a JSON array causes the parsing failure (w/ permissive mode) Key: SPARK-36379 URL: https://issues.apache.org/jira/browse/SPARK-36379 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2, 3.2.0, 3.3.0 Reporter: Hyukjin Kwon {code} scala> spark.read.json(Seq("""[{"a": "str"}, null, {"a": "str"}]""").toDS).collect() {code} {code} ... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (172.30.3.20 executor driver): java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) {code} Since the mode (by default) is permissive, we shouldn't just fail like above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
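For clarity, a hedged sketch of the behavior the ticket asks for under the default PERMISSIVE mode (the desired outcome, not what the affected versions currently produce): the null root-level entry should surface as a null or corrupt-record row instead of failing the whole job.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-null-root-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq("""[{"a": "str"}, null, {"a": "str"}]""").toDS

val parsed = spark.read
  .option("mode", "PERMISSIVE")                          // the default parsing mode
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(ds)

// Expected per this ticket: the two valid records plus a placeholder row, not an NPE.
parsed.show(truncate = false)
{code}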
[jira] [Updated] (SPARK-35917) Disable push-based shuffle until the feature is complete
[ https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-35917: Fix Version/s: (was: 3.2.0) > Disable push-based shuffle until the feature is complete > > > Key: SPARK-35917 > URL: https://issues.apache.org/jira/browse/SPARK-35917 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Major > > Push-based shuffle is partially merged in apache master but some of the tasks > are still incomplete. Since 3.2 is going to cut soon, we will not be able to > get the pending tasks reviewed and merged. Few of the pending tasks make > protocol changes to the push-based shuffle protocols, so we would like to > prevent users from enabling push-based shuffle both on the client and the > server until push-based shuffle implementation is complete. > We can prevent push-based shuffle to be used by throwing > {{UnsupportedOperationException}} (or something like that) both on the client > and the server when the user tries to enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
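A minimal sketch of the kind of guard the description proposes; the helper name is hypothetical, and only the configuration key spark.shuffle.push.enabled is taken from the actual feature.

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical helper: fail fast if push-based shuffle is enabled before the feature is complete.
def assertPushBasedShuffleDisabled(conf: SparkConf): Unit = {
  if (conf.getBoolean("spark.shuffle.push.enabled", defaultValue = false)) {
    throw new UnsupportedOperationException(
      "Push-based shuffle is not yet fully implemented; please unset spark.shuffle.push.enabled.")
  }
}
{code}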
[jira] [Updated] (SPARK-35917) Disable push-based shuffle until the feature is complete
[ https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-35917: Fix Version/s: 3.2.0 > Disable push-based shuffle until the feature is complete > > > Key: SPARK-35917 > URL: https://issues.apache.org/jira/browse/SPARK-35917 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Major > Fix For: 3.2.0 > > > Push-based shuffle is partially merged in apache master but some of the tasks > are still incomplete. Since 3.2 is going to cut soon, we will not be able to > get the pending tasks reviewed and merged. Few of the pending tasks make > protocol changes to the push-based shuffle protocols, so we would like to > prevent users from enabling push-based shuffle both on the client and the > server until push-based shuffle implementation is complete. > We can prevent push-based shuffle to be used by throwing > {{UnsupportedOperationException}} (or something like that) both on the client > and the server when the user tries to enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35917) Disable push-based shuffle until the feature is complete
[ https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-35917: --- Assignee: (was: Mridul Muralidharan) > Disable push-based shuffle until the feature is complete > > > Key: SPARK-35917 > URL: https://issues.apache.org/jira/browse/SPARK-35917 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Major > > Push-based shuffle is partially merged in apache master but some of the tasks > are still incomplete. Since 3.2 is going to cut soon, we will not be able to > get the pending tasks reviewed and merged. Few of the pending tasks make > protocol changes to the push-based shuffle protocols, so we would like to > prevent users from enabling push-based shuffle both on the client and the > server until push-based shuffle implementation is complete. > We can prevent push-based shuffle to be used by throwing > {{UnsupportedOperationException}} (or something like that) both on the client > and the server when the user tries to enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36306) Refactor seventeenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391362#comment-17391362 ] PengLei commented on SPARK-36306: - working on this > Refactor seventeenth set of 20 query execution errors to use error classes > -- > > Key: SPARK-36306 > URL: https://issues.apache.org/jira/browse/SPARK-36306 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the seventeenth set of 20. > {code:java} > legacyCheckpointDirectoryExistsError > subprocessExitedError > outputDataTypeUnsupportedByNodeWithoutSerdeError > invalidStartIndexError > concurrentModificationOnExternalAppendOnlyUnsafeRowArrayError > doExecuteBroadcastNotImplementedError > databaseNameConflictWithSystemPreservedDatabaseError > commentOnTableUnsupportedError > unsupportedUpdateColumnNullabilityError > renameColumnUnsupportedForOlderMySQLError > failedToExecuteQueryError > nestedFieldUnsupportedError > transformationsAndActionsNotInvokedByDriverError > repeatedPivotsUnsupportedError > pivotNotAfterGroupByUnsupportedError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36305) Refactor sixteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391361#comment-17391361 ] PengLei commented on SPARK-36305: - working on this > Refactor sixteenth set of 20 query execution errors to use error classes > > > Key: SPARK-36305 > URL: https://issues.apache.org/jira/browse/SPARK-36305 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the sixteenth set of 20. > {code:java} > cannotDropMultiPartitionsOnNonatomicPartitionTableError > truncateMultiPartitionUnsupportedError > overwriteTableByUnsupportedExpressionError > dynamicPartitionOverwriteUnsupportedByTableError > failedMergingSchemaError > cannotBroadcastTableOverMaxTableRowsError > cannotBroadcastTableOverMaxTableBytesError > notEnoughMemoryToBuildAndBroadcastTableError > executeCodePathUnsupportedError > cannotMergeClassWithOtherClassError > continuousProcessingUnsupportedByDataSourceError > failedToReadDataError > failedToGenerateEpochMarkerError > foreachWriterAbortedDueToTaskFailureError > integerOverflowError > failedToReadDeltaFileError > failedToReadSnapshotFileError > cannotPurgeAsBreakInternalStateError > cleanUpSourceFilesUnsupportedError > latestOffsetNotCalledError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36304) Refactor fifteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391360#comment-17391360 ] PengLei commented on SPARK-36304: - working on this > Refactor fifteenth set of 20 query execution errors to use error classes > -- > > Key: SPARK-36304 > URL: https://issues.apache.org/jira/browse/SPARK-36304 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the fifteenth set of 20. > {code:java} > unsupportedOperationExceptionError > nullLiteralsCannotBeCastedError > notUserDefinedTypeError > cannotLoadUserDefinedTypeError > timeZoneIdNotSpecifiedForTimestampTypeError > notPublicClassError > primitiveTypesNotSupportedError > fieldIndexOnRowWithoutSchemaError > valueIsNullError > onlySupportDataSourcesProvidingFileFormatError > failToSetOriginalPermissionBackError > failToSetOriginalACLBackError > multiFailuresInStageMaterializationError > unrecognizedCompressionSchemaTypeIDError > getParentLoggerNotImplementedError > cannotCreateParquetConverterForTypeError > cannotCreateParquetConverterForDecimalTypeError > cannotCreateParquetConverterForDataTypeError > cannotAddMultiPartitionsOnNonatomicPartitionTableError > userSpecifiedSchemaUnsupportedByDataSourceError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
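To illustrate the refactoring pattern shared by the three tickets above, a rough self-contained sketch follows; the exception class, error-class name, and helper are stand-ins for illustration and are not the real Spark API.

{code:scala}
// Hypothetical error-class-aware exception: the stable error class and its parameters are
// carried explicitly instead of being baked into a free-form message string.
class SparkUnsupportedOperationError(val errorClass: String, val params: Seq[String])
  extends UnsupportedOperationException(s"[$errorClass] ${params.mkString(", ")}")

// A QueryExecutionErrors-style helper rewritten against the sketch above.
def commentOnTableUnsupportedError(): Throwable =
  new SparkUnsupportedOperationError("UNSUPPORTED_FEATURE", Seq("COMMENT ON TABLE"))
{code}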
[jira] [Assigned] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-36378: --- Assignee: (was: Mridul Muralidharan) > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Priority: Major > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-30602: --- Shepherd: Mridul Muralidharan Assignee: (was: Mridul Muralidharan) > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > Labels: release-notes > Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, > vldb_magnet_final.pdf > > > In a large deployment of a Spark compute infrastructure, Spark shuffle is > becoming a potential scaling bottleneck and a source of inefficiency in the > cluster. When doing Spark on YARN for a large-scale deployment, people > usually enable Spark external shuffle service and store the intermediate > shuffle files on HDD. Because the number of blocks generated for a particular > shuffle grows quadratically compared to the size of shuffled data (# mappers > and reducers grows linearly with the size of shuffled data, but # blocks is # > mappers * # reducers), one general trend we have observed is that the more > data a Spark application processes, the smaller the block size becomes. In a > few production clusters we have seen, the average shuffle block size is only > 10s of KBs. Because of the inefficiency of performing random reads on HDD for > small amount of data, the overall efficiency of the Spark external shuffle > services serving the shuffle blocks degrades as we see an increasing # of > Spark applications processing an increasing amount of data. In addition, > because Spark external shuffle service is a shared service in a multi-tenancy > cluster, the inefficiency with one Spark application could propagate to other > applications as well. > In this ticket, we propose a solution to improve Spark shuffle efficiency in > above mentioned environments with push-based shuffle. With push-based > shuffle, shuffle is performed at the end of mappers and blocks get pre-merged > and move towards reducers. In our prototype implementation, we have seen > significant efficiency improvements when performing large shuffles. We take a > Spark-native approach to achieve this, i.e., extending Spark’s existing > shuffle netty protocol, and the behaviors of Spark mappers, reducers and > drivers. This way, we can bring the benefits of more efficient shuffle in > Spark without incurring the dependency or overhead of either specialized > storage layer or external infrastructure pieces. > > Link to dev mailing list discussion: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
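As a back-of-the-envelope illustration of the block-count argument in the description (the numbers are invented, not from the ticket):

{code:scala}
val mappers       = 10000L
val reducers      = 10000L
val shuffledBytes = 200L * 1024 * 1024 * 1024      // 200 GiB of shuffled data in total

val blocks        = mappers * reducers             // 100,000,000 shuffle blocks
val avgBlockBytes = shuffledBytes / blocks         // roughly 2 KiB per block

// Tiny blocks turn shuffle serving into mostly random HDD reads, which is the inefficiency
// that pre-merging blocks on the server side (push-based shuffle) is meant to avoid.
println(s"blocks=$blocks, avgBlockBytes=$avgBlockBytes")
{code}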
[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-30602: Fix Version/s: 3.2.0 > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > Labels: release-notes > Fix For: 3.2.0 > > Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, > vldb_magnet_final.pdf > > > In a large deployment of a Spark compute infrastructure, Spark shuffle is > becoming a potential scaling bottleneck and a source of inefficiency in the > cluster. When doing Spark on YARN for a large-scale deployment, people > usually enable Spark external shuffle service and store the intermediate > shuffle files on HDD. Because the number of blocks generated for a particular > shuffle grows quadratically compared to the size of shuffled data (# mappers > and reducers grows linearly with the size of shuffled data, but # blocks is # > mappers * # reducers), one general trend we have observed is that the more > data a Spark application processes, the smaller the block size becomes. In a > few production clusters we have seen, the average shuffle block size is only > 10s of KBs. Because of the inefficiency of performing random reads on HDD for > small amount of data, the overall efficiency of the Spark external shuffle > services serving the shuffle blocks degrades as we see an increasing # of > Spark applications processing an increasing amount of data. In addition, > because Spark external shuffle service is a shared service in a multi-tenancy > cluster, the inefficiency with one Spark application could propagate to other > applications as well. > In this ticket, we propose a solution to improve Spark shuffle efficiency in > above mentioned environments with push-based shuffle. With push-based > shuffle, shuffle is performed at the end of mappers and blocks get pre-merged > and move towards reducers. In our prototype implementation, we have seen > significant efficiency improvements when performing large shuffles. We take a > Spark-native approach to achieve this, i.e., extending Spark’s existing > shuffle netty protocol, and the behaviors of Spark mappers, reducers and > drivers. This way, we can bring the benefits of more efficient shuffle in > Spark without incurring the dependency or overhead of either specialized > storage layer or external infrastructure pieces. > > Link to dev mailing list discussion: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36363) AKS Spark UI does not have executor tab showing up
[ https://issues.apache.org/jira/browse/SPARK-36363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391349#comment-17391349 ] Koushik commented on SPARK-36363: - hi kwon, its harder for us to migrate to newer version as most of the apps in our project running with spark 2.2. don't we have fix with spark 2.2 version? > AKS SPark UI does not have executor tab showing up > -- > > Key: SPARK-36363 > URL: https://issues.apache.org/jira/browse/SPARK-36363 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Koushik >Priority: Major > > Spark UI Executor tab showing blank and i see the below error in the network > tab : > https://keplerfnet-aks-prod.az.3pc.att.com/proxy:10.128.0.76:4043/executors/ > Failed to load resource: the server responded with a status of 404 () > DevTools failed to load source map: Could not load content for > [https://keplerfnet-aks-prod.az.3pc.att.com/proxy:10.128.0.76:4043/static/vis.map|https://ind01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fkeplerfnet-aks-prod.az.3pc.att.com%2Fproxy%3A10.128.0.76%3A4043%2Fstatic%2Fvis.map&data=04%7C01%7CKoushik.Gopal%40TechMahindra.com%7C71ec48c8fa8d4ecc123908d95388dd8e%7Cedf442f5b9944c86a131b42b03a16c95%7C0%7C0%7C637632669559893674%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=UyrYVdDO4vfzwq4%2Fl4GHN6Gm8QC%2FMrvrGMl50FUCGrI%3D&reserved=0]: > HTTP error: status code 502, net::ERR_HTTP_RESPONSE_CODE_FAILURE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391259#comment-17391259 ] Yu Gan edited comment on SPARK-20415 at 8/2/21, 6:01 AM: - Did you find the root cause? I came across the same issue in our azure environment. org.apache.spark.unsafe.Platform.copyMemory ... org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask ... BTW, spark version: 3.1.1 was (Author: gyustorm): Did you find the root cause? I came across the same issue in our azure environment. org.apache.spark.unsafe.Platform.copyMemory ... org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask ... > SPARK job hangs while writing DataFrame to HDFS > --- > > Key: SPARK-20415 > URL: https://issues.apache.org/jira/browse/SPARK-20415 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, YARN >Affects Versions: 2.1.0 > Environment: EMR 5.4.0 >Reporter: P K >Priority: Major > Labels: bulk-closed > > We are in POC phase with Spark. One of the Steps is reading compressed json > files that come from sources, "explode" them into tabular format and then > write them to HDFS. This worked for about three weeks until a few days ago, > for a particular dataset, the writer just hangs. I logged in to the worker > machines and see this stack trace: > "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 > tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000] >java.lang.Thread.State: RUNNABLE > at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210) > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
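For reference, a hedged reconstruction of the kind of pipeline the report describes (read compressed JSON, explode to tabular form, write out from the cluster); the paths, the array column name, and the Parquet output format are assumptions.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().appName("explode-and-write-sketch").getOrCreate()

// Spark reads .gz JSON transparently; each gzip file becomes a single non-splittable partition.
val raw = spark.read.json("hdfs:///data/incoming/*.json.gz")

// Explode a nested array column into one row per element, then flatten its fields.
val flat = raw.select(explode(col("records")).as("r")).select("r.*")

flat.write.mode("overwrite").parquet("hdfs:///data/output/")
{code}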
[jira] [Commented] (SPARK-32923) Add support to properly handle different type of stage retries
[ https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391333#comment-17391333 ] Apache Spark commented on SPARK-32923: -- User 'venkata91' has created a pull request for this issue: https://github.com/apache/spark/pull/33605 > Add support to properly handle different type of stage retries > -- > > Key: SPARK-32923 > URL: https://issues.apache.org/jira/browse/SPARK-32923 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.2.0 > > > In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was > introduced, which would be handled differently if retried. > Since these was added to address a data correctness issue, we should also add > support for these in push-based shuffle, so that we would be able to rollback > the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391318#comment-17391318 ] Mridul Muralidharan edited comment on SPARK-30602 at 8/2/21, 5:30 AM: -- The only pending task here is documentation. [~vsowrirajan] has opened a PR for that - but marking this jira as Resolved for 3.2 as all code changes have been merged to master and branch-3.2 was (Author: mridulm80): The only pending task here is documentation. [~vsowrirajan] has opened a PR for that - but marking this jira as Resolved for 3.2 as all changes have been merged to master and branch-3.2 > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > Labels: release-notes > Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, > vldb_magnet_final.pdf > > > In a large deployment of a Spark compute infrastructure, Spark shuffle is > becoming a potential scaling bottleneck and a source of inefficiency in the > cluster. When doing Spark on YARN for a large-scale deployment, people > usually enable Spark external shuffle service and store the intermediate > shuffle files on HDD. Because the number of blocks generated for a particular > shuffle grows quadratically compared to the size of shuffled data (# mappers > and reducers grows linearly with the size of shuffled data, but # blocks is # > mappers * # reducers), one general trend we have observed is that the more > data a Spark application processes, the smaller the block size becomes. In a > few production clusters we have seen, the average shuffle block size is only > 10s of KBs. Because of the inefficiency of performing random reads on HDD for > small amount of data, the overall efficiency of the Spark external shuffle > services serving the shuffle blocks degrades as we see an increasing # of > Spark applications processing an increasing amount of data. In addition, > because Spark external shuffle service is a shared service in a multi-tenancy > cluster, the inefficiency with one Spark application could propagate to other > applications as well. > In this ticket, we propose a solution to improve Spark shuffle efficiency in > above mentioned environments with push-based shuffle. With push-based > shuffle, shuffle is performed at the end of mappers and blocks get pre-merged > and move towards reducers. In our prototype implementation, we have seen > significant efficiency improvements when performing large shuffles. We take a > Spark-native approach to achieve this, i.e., extending Spark’s existing > shuffle netty protocol, and the behaviors of Spark mappers, reducers and > drivers. This way, we can bring the benefits of more efficient shuffle in > Spark without incurring the dependency or overhead of either specialized > storage layer or external infrastructure pieces. > > Link to dev mailing list discussion: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Shen reopened SPARK-36378: -- > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Shen updated SPARK-36378: - Parent: SPARK-33235 Issue Type: Sub-task (was: Bug) > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391324#comment-17391324 ] Mridul Muralidharan commented on SPARK-30602: - Thanks for all the work in getting this feature into Apache Spark. In no particular order: [~mshen], [~csingh], [~vsowrirajan], [~zhouyejoe] for all the work in getting this in ! Thanks to all the reviewers: [~Ngone51], [~attilapiros], [~tgraves], [~dongjoon], [~hyukjin.kwon], [~Gengliang.Wang]. > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > Labels: release-notes > Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, > vldb_magnet_final.pdf > > > In a large deployment of a Spark compute infrastructure, Spark shuffle is > becoming a potential scaling bottleneck and a source of inefficiency in the > cluster. When doing Spark on YARN for a large-scale deployment, people > usually enable Spark external shuffle service and store the intermediate > shuffle files on HDD. Because the number of blocks generated for a particular > shuffle grows quadratically compared to the size of shuffled data (# mappers > and reducers grows linearly with the size of shuffled data, but # blocks is # > mappers * # reducers), one general trend we have observed is that the more > data a Spark application processes, the smaller the block size becomes. In a > few production clusters we have seen, the average shuffle block size is only > 10s of KBs. Because of the inefficiency of performing random reads on HDD for > small amount of data, the overall efficiency of the Spark external shuffle > services serving the shuffle blocks degrades as we see an increasing # of > Spark applications processing an increasing amount of data. In addition, > because Spark external shuffle service is a shared service in a multi-tenancy > cluster, the inefficiency with one Spark application could propagate to other > applications as well. > In this ticket, we propose a solution to improve Spark shuffle efficiency in > above mentioned environments with push-based shuffle. With push-based > shuffle, shuffle is performed at the end of mappers and blocks get pre-merged > and move towards reducers. In our prototype implementation, we have seen > significant efficiency improvements when performing large shuffles. We take a > Spark-native approach to achieve this, i.e., extending Spark’s existing > shuffle netty protocol, and the behaviors of Spark mappers, reducers and > drivers. This way, we can bring the benefits of more efficient shuffle in > Spark without incurring the dependency or overhead of either specialized > storage layer or external infrastructure pieces. > > Link to dev mailing list discussion: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Shen updated SPARK-36378: - Parent: (was: SPARK-30602) Issue Type: Bug (was: Sub-task) > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391322#comment-17391322 ] Min Shen commented on SPARK-36378: -- If moving this outside of the SPIP is preferred, then will move this to under SPARK-33235 and reopen. > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36266) Rename classes in shuffle RPC used for block push operations
[ https://issues.apache.org/jira/browse/SPARK-36266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-36266: --- Assignee: Min Shen > Rename classes in shuffle RPC used for block push operations > > > Key: SPARK-36266 > URL: https://issues.apache.org/jira/browse/SPARK-36266 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Min Shen >Priority: Major > Fix For: 3.2.0 > > > In the current implementation of push-based shuffle, we are reusing certain > code between both block fetch and block push. > This is generally good except that certain classes that are meant to be used > for both block fetch and block push now have names that indicate they are > only for block fetches, which is confusing. > This ticket renames these classes to be more generic to be reused across both > block fetch and block push. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391320#comment-17391320 ] Min Shen commented on SPARK-36378: -- Would prefer to merge this in if possible. > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-30602. - Resolution: Fixed The only pending task here is documentation. [~vsowrirajan] has opened a PR for that - but marking this jira as Resolved for 3.2 as all changes have been merged to master and branch-3.2 > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > Labels: release-notes > Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, > vldb_magnet_final.pdf > > > In a large deployment of a Spark compute infrastructure, Spark shuffle is > becoming a potential scaling bottleneck and a source of inefficiency in the > cluster. When doing Spark on YARN for a large-scale deployment, people > usually enable Spark external shuffle service and store the intermediate > shuffle files on HDD. Because the number of blocks generated for a particular > shuffle grows quadratically compared to the size of shuffled data (# mappers > and reducers grows linearly with the size of shuffled data, but # blocks is # > mappers * # reducers), one general trend we have observed is that the more > data a Spark application processes, the smaller the block size becomes. In a > few production clusters we have seen, the average shuffle block size is only > 10s of KBs. Because of the inefficiency of performing random reads on HDD for > small amount of data, the overall efficiency of the Spark external shuffle > services serving the shuffle blocks degrades as we see an increasing # of > Spark applications processing an increasing amount of data. In addition, > because Spark external shuffle service is a shared service in a multi-tenancy > cluster, the inefficiency with one Spark application could propagate to other > applications as well. > In this ticket, we propose a solution to improve Spark shuffle efficiency in > above mentioned environments with push-based shuffle. With push-based > shuffle, shuffle is performed at the end of mappers and blocks get pre-merged > and move towards reducers. In our prototype implementation, we have seen > significant efficiency improvements when performing large shuffles. We take a > Spark-native approach to achieve this, i.e., extending Spark’s existing > shuffle netty protocol, and the behaviors of Spark mappers, reducers and > drivers. This way, we can bring the benefits of more efficient shuffle in > Spark without incurring the dependency or overhead of either specialized > storage layer or external infrastructure pieces. > > Link to dev mailing list discussion: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-30602: --- Assignee: Mridul Muralidharan > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > Labels: release-notes > Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, > vldb_magnet_final.pdf > > > In a large deployment of a Spark compute infrastructure, Spark shuffle is > becoming a potential scaling bottleneck and a source of inefficiency in the > cluster. When doing Spark on YARN for a large-scale deployment, people > usually enable Spark external shuffle service and store the intermediate > shuffle files on HDD. Because the number of blocks generated for a particular > shuffle grows quadratically compared to the size of shuffled data (# mappers > and reducers grows linearly with the size of shuffled data, but # blocks is # > mappers * # reducers), one general trend we have observed is that the more > data a Spark application processes, the smaller the block size becomes. In a > few production clusters we have seen, the average shuffle block size is only > 10s of KBs. Because of the inefficiency of performing random reads on HDD for > small amount of data, the overall efficiency of the Spark external shuffle > services serving the shuffle blocks degrades as we see an increasing # of > Spark applications processing an increasing amount of data. In addition, > because Spark external shuffle service is a shared service in a multi-tenancy > cluster, the inefficiency with one Spark application could propagate to other > applications as well. > In this ticket, we propose a solution to improve Spark shuffle efficiency in > above mentioned environments with push-based shuffle. With push-based > shuffle, shuffle is performed at the end of mappers and blocks get pre-merged > and move towards reducers. In our prototype implementation, we have seen > significant efficiency improvements when performing large shuffles. We take a > Spark-native approach to achieve this, i.e., extending Spark’s existing > shuffle netty protocol, and the behaviors of Spark mappers, reducers and > drivers. This way, we can bring the benefits of more efficient shuffle in > Spark without incurring the dependency or overhead of either specialized > storage layer or external infrastructure pieces. > > Link to dev mailing list discussion: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32923) Add support to properly handle different type of stage retries
[ https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-32923: Fix Version/s: 3.2.0 > Add support to properly handle different type of stage retries > -- > > Key: SPARK-32923 > URL: https://issues.apache.org/jira/browse/SPARK-32923 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.2.0 > > > In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was > introduced, which would be handled differently if retried. > Since these was added to address a data correctness issue, we should also add > support for these in push-based shuffle, so that we would be able to rollback > the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32923) Add support to properly handle different type of stage retries
[ https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-32923: --- Assignee: Venkata krishnan Sowrirajan > Add support to properly handle different type of stage retries > -- > > Key: SPARK-32923 > URL: https://issues.apache.org/jira/browse/SPARK-32923 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Venkata krishnan Sowrirajan >Priority: Major > > In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was > introduced, which would be handled differently if retried. > Since these was added to address a data correctness issue, we should also add > support for these in push-based shuffle, so that we would be able to rollback > the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-32923) Add support to properly handle different type of stage retries
[ https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-32923: Comment: was deleted (was: This has been handled by SPARK-32923) > Add support to properly handle different type of stage retries > -- > > Key: SPARK-32923 > URL: https://issues.apache.org/jira/browse/SPARK-32923 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > > In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was > introduced, which would be handled differently if retried. > Since these was added to address a data correctness issue, we should also add > support for these in push-based shuffle, so that we would be able to rollback > the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32923) Add support to properly handle different type of stage retries
[ https://issues.apache.org/jira/browse/SPARK-32923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-32923. - Resolution: Fixed This has been handled by SPARK-32923 > Add support to properly handle different type of stage retries > -- > > Key: SPARK-32923 > URL: https://issues.apache.org/jira/browse/SPARK-32923 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > > In SPARK-23243 and SPARK-25341, the concept of an INDETERMINATE stage was > introduced, which would be handled differently if retried. > Since these was added to address a data correctness issue, we should also add > support for these in push-based shuffle, so that we would be able to rollback > the merged shuffle partitions of a shuffle map stage if it's an INDETERMINATE > stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-36378: --- Assignee: Mridul Muralidharan > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-36378. - Resolution: Won't Fix Let us move this outside of the SPIP and into individual jira's and follow up work. > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Assignee: Mridul Muralidharan >Priority: Major > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunity. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35917) Disable push-based shuffle until the feature is complete
[ https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-35917: --- Assignee: Mridul Muralidharan > Disable push-based shuffle until the feature is complete > > > Key: SPARK-35917 > URL: https://issues.apache.org/jira/browse/SPARK-35917 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Assignee: Mridul Muralidharan >Priority: Major > > Push-based shuffle is partially merged in Apache master but some of the tasks > are still incomplete. Since 3.2 is going to be cut soon, we will not be able to > get the pending tasks reviewed and merged. A few of the pending tasks make > protocol changes to the push-based shuffle protocols, so we would like to > prevent users from enabling push-based shuffle both on the client and the > server until the push-based shuffle implementation is complete. > We can prevent push-based shuffle from being used by throwing > {{UnsupportedOperationException}} (or something like that) both on the client > and the server when the user tries to enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
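As a rough illustration of the guard described in SPARK-35917, the sketch below rejects the feature flag at validation time. This is a hypothetical sketch, not the actual patch: the config key and the error message are assumptions.

{code:scala}
// Hypothetical sketch only: the config key and message are assumptions,
// not the actual Spark change for SPARK-35917.
object PushBasedShuffleGuard {
  def checkNotEnabled(conf: Map[String, String]): Unit = {
    val enabled = conf.getOrElse("spark.shuffle.push.enabled", "false").toBoolean
    if (enabled) {
      // Fail fast, on both the client and the server side, until the feature is complete.
      throw new UnsupportedOperationException(
        "Push-based shuffle is not yet complete; set spark.shuffle.push.enabled=false")
    }
  }
}
{code}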
[jira] [Resolved] (SPARK-35917) Disable push-based shuffle until the feature is complete
[ https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-35917. - Resolution: Won't Fix > Disable push-based shuffle until the feature is complete > > > Key: SPARK-35917 > URL: https://issues.apache.org/jira/browse/SPARK-35917 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Major > > Push-based shuffle is partially merged in apache master but some of the tasks > are still incomplete. Since 3.2 is going to cut soon, we will not be able to > get the pending tasks reviewed and merged. Few of the pending tasks make > protocol changes to the push-based shuffle protocols, so we would like to > prevent users from enabling push-based shuffle both on the client and the > server until push-based shuffle implementation is complete. > We can prevent push-based shuffle to be used by throwing > {{UnsupportedOperationException}} (or something like that) both on the client > and the server when the user tries to enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35917) Disable push-based shuffle until the feature is complete
[ https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391313#comment-17391313 ] Mridul Muralidharan commented on SPARK-35917: - Closing this Jira - as push based shuffle has been merged. Thanks for the ping [~Gengliang.Wang] ! > Disable push-based shuffle until the feature is complete > > > Key: SPARK-35917 > URL: https://issues.apache.org/jira/browse/SPARK-35917 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Major > > Push-based shuffle is partially merged in apache master but some of the tasks > are still incomplete. Since 3.2 is going to cut soon, we will not be able to > get the pending tasks reviewed and merged. Few of the pending tasks make > protocol changes to the push-based shuffle protocols, so we would like to > prevent users from enabling push-based shuffle both on the client and the > server until push-based shuffle implementation is complete. > We can prevent push-based shuffle to be used by throwing > {{UnsupportedOperationException}} (or something like that) both on the client > and the server when the user tries to enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
[ https://issues.apache.org/jira/browse/SPARK-36378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Shen updated SPARK-36378: - Description: With the SPIP ticket close to being finished, we have done some performance evaluations to compare the performance of push-based shuffle in upstream Spark with the production version we have internally at LinkedIn. The evaluations have revealed a few regressions and also some additional perf improvement opportunities. > Minor changes to address a few identified server side inefficiencies > > > Key: SPARK-36378 > URL: https://issues.apache.org/jira/browse/SPARK-36378 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Min Shen >Priority: Major > > With the SPIP ticket close to being finished, we have done some performance > evaluations to compare the performance of push-based shuffle in upstream > Spark with the production version we have internally at LinkedIn. > The evaluations have revealed a few regressions and also some additional perf > improvement opportunities. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36377) Fix documentation in spark-env.sh.template
[ https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391298#comment-17391298 ] Apache Spark commented on SPARK-36377: -- User 'yutoacts' has created a pull request for this issue: https://github.com/apache/spark/pull/33604 > Fix documentation in spark-env.sh.template > -- > > Key: SPARK-36377 > URL: https://issues.apache.org/jira/browse/SPARK-36377 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Submit >Affects Versions: 3.1.2 >Reporter: Yuto Akutsu >Priority: Major > > Some options in the "Options read in YARN client/cluster mode" section in > spark-env.sh.template is read by other modes too (e.g. SPARK_CONF_DIR, > SPARK_EXECUTOR_CORES, etc.), so we should re-document it to help users > distinguish what's only read by YARN mode from what's not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35917) Disable push-based shuffle until the feature is complete
[ https://issues.apache.org/jira/browse/SPARK-35917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391299#comment-17391299 ] Gengliang Wang commented on SPARK-35917: [~csingh] Shall we mark this one as won't do since https://github.com/apache/spark/pull/33034 is merged? cc [~mridul] > Disable push-based shuffle until the feature is complete > > > Key: SPARK-35917 > URL: https://issues.apache.org/jira/browse/SPARK-35917 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Chandni Singh >Priority: Major > > Push-based shuffle is partially merged in apache master but some of the tasks > are still incomplete. Since 3.2 is going to cut soon, we will not be able to > get the pending tasks reviewed and merged. Few of the pending tasks make > protocol changes to the push-based shuffle protocols, so we would like to > prevent users from enabling push-based shuffle both on the client and the > server until push-based shuffle implementation is complete. > We can prevent push-based shuffle to be used by throwing > {{UnsupportedOperationException}} (or something like that) both on the client > and the server when the user tries to enable it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36377) Fix documentation in spark-env.sh.template
[ https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36377: Assignee: Apache Spark > Fix documentation in spark-env.sh.template > -- > > Key: SPARK-36377 > URL: https://issues.apache.org/jira/browse/SPARK-36377 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Submit >Affects Versions: 3.1.2 >Reporter: Yuto Akutsu >Assignee: Apache Spark >Priority: Major > > Some options in the "Options read in YARN client/cluster mode" section in > spark-env.sh.template is read by other modes too (e.g. SPARK_CONF_DIR, > SPARK_EXECUTOR_CORES, etc.), so we should re-document it to help users > distinguish what's only read by YARN mode from what's not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36377) Fix documentation in spark-env.sh.template
[ https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36377: Assignee: (was: Apache Spark) > Fix documentation in spark-env.sh.template > -- > > Key: SPARK-36377 > URL: https://issues.apache.org/jira/browse/SPARK-36377 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Submit >Affects Versions: 3.1.2 >Reporter: Yuto Akutsu >Priority: Major > > Some options in the "Options read in YARN client/cluster mode" section in > spark-env.sh.template is read by other modes too (e.g. SPARK_CONF_DIR, > SPARK_EXECUTOR_CORES, etc.), so we should re-document it to help users > distinguish what's only read by YARN mode from what's not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36377) Fix documentation in spark-env.sh.template
[ https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391297#comment-17391297 ] Apache Spark commented on SPARK-36377: -- User 'yutoacts' has created a pull request for this issue: https://github.com/apache/spark/pull/33604 > Fix documentation in spark-env.sh.template > -- > > Key: SPARK-36377 > URL: https://issues.apache.org/jira/browse/SPARK-36377 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Submit >Affects Versions: 3.1.2 >Reporter: Yuto Akutsu >Priority: Major > > Some options in the "Options read in YARN client/cluster mode" section in > spark-env.sh.template is read by other modes too (e.g. SPARK_CONF_DIR, > SPARK_EXECUTOR_CORES, etc.), so we should re-document it to help users > distinguish what's only read by YARN mode from what's not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36378) Minor changes to address a few identified server side inefficiencies
Min Shen created SPARK-36378: Summary: Minor changes to address a few identified server side inefficiencies Key: SPARK-36378 URL: https://issues.apache.org/jira/browse/SPARK-36378 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Affects Versions: 3.2.0 Reporter: Min Shen -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36303) Refactor fourteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391294#comment-17391294 ] dgd_contributor commented on SPARK-36303: - working on this > Refactor fourteenth set of 20 query execution errors to use error classes > - > > Key: SPARK-36303 > URL: https://issues.apache.org/jira/browse/SPARK-36303 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the fourteenth set of 20. > {code:java} > cannotGetEventTimeWatermarkError > cannotSetTimeoutTimestampError > batchMetadataFileNotFoundError > multiStreamingQueriesUsingPathConcurrentlyError > addFilesWithAbsolutePathUnsupportedError > microBatchUnsupportedByDataSourceError > cannotExecuteStreamingRelationExecError > invalidStreamingOutputModeError > catalogPluginClassNotFoundError > catalogPluginClassNotImplementedError > catalogPluginClassNotFoundForCatalogError > catalogFailToFindPublicNoArgConstructorError > catalogFailToCallPublicNoArgConstructorError > cannotInstantiateAbstractCatalogPluginClassError > failedToInstantiateConstructorForCatalogError > noSuchElementExceptionError > noSuchElementExceptionError > cannotMutateReadOnlySQLConfError > cannotCloneOrCopyReadOnlySQLConfError > cannotGetSQLConfInSchedulerEventLoopThreadError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
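For readers unfamiliar with the refactoring pattern these sub-tasks follow, the sketch below shows the general before/after shape. It is a hypothetical, self-contained illustration: the exception class and its constructor are assumptions, not the actual error-class API used by QueryExecutionErrors.

{code:scala}
// Hypothetical illustration of moving an error message behind an error class.
// The exception class is defined here for the example; it is not Spark's API.
class ErrorClassException(errorClass: String, parameters: Array[String])
  extends RuntimeException(s"[$errorClass] " + parameters.mkString(", "))

object ExampleErrors {
  // Before: an ad-hoc, free-form message constructed at the call site.
  def cannotSetTimeoutTimestampErrorOld(): Throwable =
    new UnsupportedOperationException("Cannot set timeout timestamp for this query")

  // After: the error is keyed by an error class, so the wording can be
  // maintained, documented, and tested in one central place.
  def cannotSetTimeoutTimestampErrorNew(): Throwable =
    new ErrorClassException("UNSUPPORTED_OPERATION", Array("set timeout timestamp"))
}
{code}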
[jira] [Commented] (SPARK-36302) Refactor thirteenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391283#comment-17391283 ] dgd_contributor commented on SPARK-36302: - working on this. > Refactor thirteenth set of 20 query execution errors to use error classes > - > > Key: SPARK-36302 > URL: https://issues.apache.org/jira/browse/SPARK-36302 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the thirteenth set of 20. > {code:java} > serDeInterfaceNotFoundError > convertHiveTableToCatalogTableError > cannotRecognizeHiveTypeError > getTablesByTypeUnsupportedByHiveVersionError > dropTableWithPurgeUnsupportedError > alterTableWithDropPartitionAndPurgeUnsupportedError > invalidPartitionFilterError > getPartitionMetadataByFilterError > unsupportedHiveMetastoreVersionError > loadHiveClientCausesNoClassDefFoundError > cannotFetchTablesOfDatabaseError > illegalLocationClauseForViewPartitionError > renamePathAsExistsPathError > renameAsExistsPathError > renameSrcPathNotFoundError > failedRenameTempFileError > legacyMetadataPathExistsError > partitionColumnNotFoundInSchemaError > stateNotDefinedOrAlreadyRemovedError > cannotSetTimeoutDurationError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36301) Refactor twelfth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391279#comment-17391279 ] dgd_contributor commented on SPARK-36301: - working on this > Refactor twelfth set of 20 query execution errors to use error classes > -- > > Key: SPARK-36301 > URL: https://issues.apache.org/jira/browse/SPARK-36301 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the twelfth set of 20. > {code:java} > cannotRewriteDomainJoinWithConditionsError > decorrelateInnerQueryThroughPlanUnsupportedError > methodCalledInAnalyzerNotAllowedError > cannotSafelyMergeSerdePropertiesError > pairUnsupportedAtFunctionError > onceStrategyIdempotenceIsBrokenForBatchError[TreeType > structuralIntegrityOfInputPlanIsBrokenInClassError > structuralIntegrityIsBrokenAfterApplyingRuleError > ruleIdNotFoundForRuleError > cannotCreateArrayWithElementsExceedLimitError > indexOutOfBoundsOfArrayDataError > malformedRecordsDetectedInRecordParsingError > remoteOperationsUnsupportedError > invalidKerberosConfigForHiveServer2Error > parentSparkUIToAttachTabNotFoundError > inferSchemaUnsupportedForHiveError > requestedPartitionsMismatchTablePartitionsError > dynamicPartitionKeyNotAmongWrittenPartitionPathsError > cannotRemovePartitionDirError > cannotCreateStagingDirError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36298) Refactor ninth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391277#comment-17391277 ] dgd_contributor commented on SPARK-36298: - working on this. > Refactor ninth set of 20 query execution errors to use error classes > > > Key: SPARK-36298 > URL: https://issues.apache.org/jira/browse/SPARK-36298 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the ninth set of 20. > {code:java} > unscaledValueTooLargeForPrecisionError > decimalPrecisionExceedsMaxPrecisionError > outOfDecimalTypeRangeError > unsupportedArrayTypeError > unsupportedJavaTypeError > failedParsingStructTypeError > failedMergingFieldsError > cannotMergeDecimalTypesWithIncompatiblePrecisionAndScaleError > cannotMergeDecimalTypesWithIncompatiblePrecisionError > cannotMergeDecimalTypesWithIncompatibleScaleError > cannotMergeIncompatibleDataTypesError > exceedMapSizeLimitError > duplicateMapKeyFoundError > mapDataKeyArrayLengthDiffersFromValueArrayLengthError > fieldDiffersFromDerivedLocalDateError > failToParseDateTimeInNewParserError > failToFormatDateTimeInNewFormatterError > failToRecognizePatternAfterUpgradeError > failToRecognizePatternError > cannotCastUTF8StringToDataTypeError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36299) Refactor tenth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391278#comment-17391278 ] dgd_contributor commented on SPARK-36299: - working on this. > Refactor tenth set of 20 query execution errors to use error classes > > > Key: SPARK-36299 > URL: https://issues.apache.org/jira/browse/SPARK-36299 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the tenth set of 20. > {code:java} > registeringStreamingQueryListenerError > concurrentQueryInstanceError > cannotParseJsonArraysAsStructsError > cannotParseStringAsDataTypeError > failToParseEmptyStringForDataTypeError > failToParseValueForDataTypeError > rootConverterReturnNullError > cannotHaveCircularReferencesInBeanClassError > cannotHaveCircularReferencesInClassError > cannotUseInvalidJavaIdentifierAsFieldNameError > cannotFindEncoderForTypeError > attributesForTypeUnsupportedError > schemaForTypeUnsupportedError > cannotFindConstructorForTypeError > paramExceedOneCharError > paramIsNotIntegerError > paramIsNotBooleanValueError > foundNullValueForNotNullableFieldError > malformedCSVRecordError > elementsOfTupleExceedLimitError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391276#comment-17391276 ] Gengliang Wang commented on SPARK-30602: [~zhouyejoe] Great, thanks! > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > Labels: release-notes > Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, > vldb_magnet_final.pdf > > > In a large deployment of a Spark compute infrastructure, Spark shuffle is > becoming a potential scaling bottleneck and a source of inefficiency in the > cluster. When doing Spark on YARN for a large-scale deployment, people > usually enable Spark external shuffle service and store the intermediate > shuffle files on HDD. Because the number of blocks generated for a particular > shuffle grows quadratically compared to the size of shuffled data (# mappers > and reducers grows linearly with the size of shuffled data, but # blocks is # > mappers * # reducers), one general trend we have observed is that the more > data a Spark application processes, the smaller the block size becomes. In a > few production clusters we have seen, the average shuffle block size is only > 10s of KBs. Because of the inefficiency of performing random reads on HDD for > small amount of data, the overall efficiency of the Spark external shuffle > services serving the shuffle blocks degrades as we see an increasing # of > Spark applications processing an increasing amount of data. In addition, > because Spark external shuffle service is a shared service in a multi-tenancy > cluster, the inefficiency with one Spark application could propagate to other > applications as well. > In this ticket, we propose a solution to improve Spark shuffle efficiency in > above mentioned environments with push-based shuffle. With push-based > shuffle, shuffle is performed at the end of mappers and blocks get pre-merged > and move towards reducers. In our prototype implementation, we have seen > significant efficiency improvements when performing large shuffles. We take a > Spark-native approach to achieve this, i.e., extending Spark’s existing > shuffle netty protocol, and the behaviors of Spark mappers, reducers and > drivers. This way, we can bring the benefits of more efficient shuffle in > Spark without incurring the dependency or overhead of either specialized > storage layer or external infrastructure pieces. > > Link to dev mailing list discussion: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
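The quadratic-growth argument in the SPIP description above can be made concrete with a small back-of-the-envelope calculation; the numbers below (1 TiB shuffled, 5000 mappers, 5000 reducers) are illustrative assumptions, not measurements from the ticket.

{code:scala}
// Back-of-the-envelope sketch of why average shuffle block size shrinks:
// the block count grows as mappers * reducers while data size grows linearly.
object ShuffleBlockSize {
  def main(args: Array[String]): Unit = {
    val shuffledBytes = 1024L * 1024 * 1024 * 1024  // assume 1 TiB of shuffled data
    val mappers = 5000                              // scales linearly with data size
    val reducers = 5000                             // scales linearly with data size
    val blocks = mappers.toLong * reducers          // 25,000,000 blocks (quadratic)
    val avgBlockBytes = shuffledBytes / blocks      // ~43 KB per block
    println(s"blocks = $blocks, avg block = ${avgBlockBytes / 1024} KiB")
  }
}
{code}

At tens of kilobytes per block, each fetch becomes a small random read, which is exactly the HDD access pattern the description says degrades the external shuffle service.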
[jira] [Created] (SPARK-36377) Fix documentation in spark-env.sh.template
Yuto Akutsu created SPARK-36377: --- Summary: Fix documentation in spark-env.sh.template Key: SPARK-36377 URL: https://issues.apache.org/jira/browse/SPARK-36377 Project: Spark Issue Type: Documentation Components: Documentation, Spark Submit Affects Versions: 3.1.2 Reporter: Yuto Akutsu Some options in the "Options read in YARN client/cluster mode" section in spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, SPARK_EXECUTOR_CORES, etc.), so we should re-document the section to help users distinguish what's only read by YARN mode from what's not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36376) Collapse repartitions if there is a project between them
[ https://issues.apache.org/jira/browse/SPARK-36376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36376: Assignee: (was: Apache Spark) > Collapse repartitions if there is a project between them > > > Key: SPARK-36376 > URL: https://issues.apache.org/jira/browse/SPARK-36376 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > For example: > {code:scala} > testRelation.distribute('a, 'b)(10).select('a).distribute('a)(20) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36376) Collapse repartitions if there is a project between them
[ https://issues.apache.org/jira/browse/SPARK-36376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36376: Assignee: Apache Spark > Collapse repartitions if there is a project between them > > > Key: SPARK-36376 > URL: https://issues.apache.org/jira/browse/SPARK-36376 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > For example: > {code:scala} > testRelation.distribute('a, 'b)(10).select('a).distribute('a)(20) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36376) Collapse repartitions if there is a project between them
[ https://issues.apache.org/jira/browse/SPARK-36376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391269#comment-17391269 ] Apache Spark commented on SPARK-36376: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/33603 > Collapse repartitions if there is a project between them > > > Key: SPARK-36376 > URL: https://issues.apache.org/jira/browse/SPARK-36376 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > For example: > {code:scala} > testRelation.distribute('a, 'b)(10).select('a).distribute('a)(20) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36375) Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 'multi-stage job'
[ https://issues.apache.org/jira/browse/SPARK-36375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391268#comment-17391268 ] wuyi commented on SPARK-36375: -- [~hyukjin.kwon] I'd like to take a look first. Thanks for the ping. > Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and > 'multi-stage job' > - > > Key: SPARK-36375 > URL: https://issues.apache.org/jira/browse/SPARK-36375 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/runs/3216286546 > {code} > [info] BasicSchedulerIntegrationSuite: > [info] - super simple job *** FAILED *** (56 milliseconds) > [info] Map() did not equal Map(0 -> 42, 5 -> 42, 1 -> 42, 6 -> 42, 9 -> 42, > 2 -> 42, 7 -> 42, 3 -> 42, 8 -> 42, 4 -> 42) > (SchedulerIntegrationSuite.scala:545) > [info] Analysis: > [info] HashMap(0: -> 42, 1: -> 42, 2: -> 42, 3: -> 42, 4: -> 42, 5: -> 42, > 6: -> 42, 7: -> 42, 8: -> 42, 9: -> 42) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.scheduler.BasicSchedulerIntegrationSuite.$anonfun$new$1(SchedulerIntegrationSuite.scala:545) > [info] at > org.apache.spark.scheduler.SchedulerIntegrationSuite.$anonfun$testScheduler$1(SchedulerIntegrationSuite.scala:98) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at 
org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) > [info] at org.scalatest.Suite.run(Suite.scala:1112) > [info] at org.scalatest.Suite.run$(Suite.scala:1094) > [info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.
[jira] [Created] (SPARK-36376) Collapse repartitions if there is a project between them
Yuming Wang created SPARK-36376: --- Summary: Collapse repartitions if there is a project between them Key: SPARK-36376 URL: https://issues.apache.org/jira/browse/SPARK-36376 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yuming Wang For example: {code:scala} testRelation.distribute('a, 'b)(10).select('a).distribute('a)(20) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
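A DataFrame-level sketch of the pattern SPARK-36376 targets is below. The data and column names are made up for illustration, and the "collapsed" form is what one would expect the rule to produce, not output from the actual implementation.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "x"), (2, "y")).toDF("a", "b")

// Today this plans two shuffles even though only the last repartitioning matters.
val current = df.repartition(10, col("a"), col("b")).select("a").repartition(20, col("a"))

// If the repartitions were collapsed through the intervening Project,
// the plan would be equivalent to a single shuffle:
val collapsed = df.select("a").repartition(20, col("a"))

current.explain()    // expect two Exchange nodes before the change
collapsed.explain()  // a single Exchange node
{code}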
[jira] [Commented] (SPARK-36375) Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 'multi-stage job'
[ https://issues.apache.org/jira/browse/SPARK-36375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391264#comment-17391264 ] Hyukjin Kwon commented on SPARK-36375: -- [~wuyi] do you have any idea on this? > Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and > 'multi-stage job' > - > > Key: SPARK-36375 > URL: https://issues.apache.org/jira/browse/SPARK-36375 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/runs/3216286546 > {code} > [info] BasicSchedulerIntegrationSuite: > [info] - super simple job *** FAILED *** (56 milliseconds) > [info] Map() did not equal Map(0 -> 42, 5 -> 42, 1 -> 42, 6 -> 42, 9 -> 42, > 2 -> 42, 7 -> 42, 3 -> 42, 8 -> 42, 4 -> 42) > (SchedulerIntegrationSuite.scala:545) > [info] Analysis: > [info] HashMap(0: -> 42, 1: -> 42, 2: -> 42, 3: -> 42, 4: -> 42, 5: -> 42, > 6: -> 42, 7: -> 42, 8: -> 42, 9: -> 42) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.scheduler.BasicSchedulerIntegrationSuite.$anonfun$new$1(SchedulerIntegrationSuite.scala:545) > [info] at > org.apache.spark.scheduler.SchedulerIntegrationSuite.$anonfun$testScheduler$1(SchedulerIntegrationSuite.scala:98) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > 
[info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) > [info] at org.scalatest.Suite.run(Suite.scala:1112) > [info] at org.scalatest.Suite.run$(Suite.scala:1094) > [info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [
[jira] [Updated] (SPARK-36375) Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 'multi-stage job'
[ https://issues.apache.org/jira/browse/SPARK-36375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36375: - Issue Type: Test (was: Improvement) > Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and > 'multi-stage job' > - > > Key: SPARK-36375 > URL: https://issues.apache.org/jira/browse/SPARK-36375 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/runs/3216286546 > {code} > [info] BasicSchedulerIntegrationSuite: > [info] - super simple job *** FAILED *** (56 milliseconds) > [info] Map() did not equal Map(0 -> 42, 5 -> 42, 1 -> 42, 6 -> 42, 9 -> 42, > 2 -> 42, 7 -> 42, 3 -> 42, 8 -> 42, 4 -> 42) > (SchedulerIntegrationSuite.scala:545) > [info] Analysis: > [info] HashMap(0: -> 42, 1: -> 42, 2: -> 42, 3: -> 42, 4: -> 42, 5: -> 42, > 6: -> 42, 7: -> 42, 8: -> 42, 9: -> 42) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > [info] at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > [info] at > org.apache.spark.scheduler.BasicSchedulerIntegrationSuite.$anonfun$new$1(SchedulerIntegrationSuite.scala:545) > [info] at > org.apache.spark.scheduler.SchedulerIntegrationSuite.$anonfun$testScheduler$1(SchedulerIntegrationSuite.scala:98) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) > [info] at org.scalatest.Suite.run(Suite.scala:1112) > [info] at org.scalatest.Suite.run$(Suite.scala:1094) > [info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeA
[jira] [Created] (SPARK-36375) Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 'multi-stage job'
Hyukjin Kwon created SPARK-36375: Summary: Flaky Test: BasicSchedulerIntegrationSuite - 'super simple job' and 'multi-stage job' Key: SPARK-36375 URL: https://issues.apache.org/jira/browse/SPARK-36375 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Hyukjin Kwon https://github.com/apache/spark/runs/3216286546 {code} [info] BasicSchedulerIntegrationSuite: [info] - super simple job *** FAILED *** (56 milliseconds) [info] Map() did not equal Map(0 -> 42, 5 -> 42, 1 -> 42, 6 -> 42, 9 -> 42, 2 -> 42, 7 -> 42, 3 -> 42, 8 -> 42, 4 -> 42) (SchedulerIntegrationSuite.scala:545) [info] Analysis: [info] HashMap(0: -> 42, 1: -> 42, 2: -> 42, 3: -> 42, 4: -> 42, 5: -> 42, 6: -> 42, 7: -> 42, 8: -> 42, 9: -> 42) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.scheduler.BasicSchedulerIntegrationSuite.$anonfun$new$1(SchedulerIntegrationSuite.scala:545) [info] at org.apache.spark.scheduler.SchedulerIntegrationSuite.$anonfun$testScheduler$1(SchedulerIntegrationSuite.scala:98) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) [info] at org.scalatest.Suite.run$(Suite.scala:1094) 
[info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at jav
[jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391259#comment-17391259 ] Yu Gan commented on SPARK-20415: Did you find the root cause? I came across the same issue in our azure environment. org.apache.spark.unsafe.Platform.copyMemory ... org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask ... > SPARK job hangs while writing DataFrame to HDFS > --- > > Key: SPARK-20415 > URL: https://issues.apache.org/jira/browse/SPARK-20415 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, YARN >Affects Versions: 2.1.0 > Environment: EMR 5.4.0 >Reporter: P K >Priority: Major > Labels: bulk-closed > > We are in POC phase with Spark. One of the Steps is reading compressed json > files that come from sources, "explode" them into tabular format and then > write them to HDFS. This worked for about three weeks until a few days ago, > for a particular dataset, the writer just hangs. I logged in to the worker > machines and see this stack trace: > "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 > tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000] >java.lang.Thread.State: RUNNABLE > at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210) > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > The last messages ever printed in stderr before the hang are: > 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at > NativeMethodAccessorImpl.java:0) > 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List() > 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List() > 17/04/18 01:41:14 INFO DAGSc
[jira] [Assigned] (SPARK-36372) ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for v2 command
[ https://issues.apache.org/jira/browse/SPARK-36372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36372: Assignee: (was: Apache Spark) > ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for > v2 command > > > Key: SPARK-36372 > URL: https://issues.apache.org/jira/browse/SPARK-36372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE ADD COLUMNS currently doesn't check duplicates for the specified > columns for v2 command. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36372) ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for v2 command
[ https://issues.apache.org/jira/browse/SPARK-36372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36372: Assignee: Apache Spark > ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for > v2 command > > > Key: SPARK-36372 > URL: https://issues.apache.org/jira/browse/SPARK-36372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Major > > ALTER TABLE ADD COLUMNS currently doesn't check duplicates for the specified > columns for v2 command. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36372) ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for v2 command
[ https://issues.apache.org/jira/browse/SPARK-36372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391258#comment-17391258 ] Apache Spark commented on SPARK-36372: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/33600 > ALTER TABLE ADD COLUMNS should check duplicates for the specified columns for > v2 command > > > Key: SPARK-36372 > URL: https://issues.apache.org/jira/browse/SPARK-36372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE ADD COLUMNS currently doesn't check duplicates for the specified > columns for v2 command. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36374) Push-based shuffle documentation
Venkata krishnan Sowrirajan created SPARK-36374: --- Summary: Push-based shuffle documentation Key: SPARK-36374 URL: https://issues.apache.org/jira/browse/SPARK-36374 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 3.2.0 Reporter: Venkata krishnan Sowrirajan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36329) show api of Dataset should get as input the output method
[ https://issues.apache.org/jira/browse/SPARK-36329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391159#comment-17391159 ] Izek Greenfield commented on SPARK-36329: - # If you do it like that, you have to write it again and again. # It is very handy to have it in the same format as show. > show api of Dataset should get as input the output method > - > > Key: SPARK-36329 > URL: https://issues.apache.org/jira/browse/SPARK-36329 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Izek Greenfield >Priority: Major > > Currently, show is: > {code:scala} > def show(numRows: Int, truncate: Boolean): Unit = if (truncate) { > println(showString(numRows, truncate = 20)) > } else { > println(showString(numRows, truncate = 0)) > } > {code} > it can be turned into: > {code:scala} > def show(numRows: Int, truncate: Boolean, out: String => Unit = println): > Unit = if (truncate) { > out(showString(numRows, truncate = 20)) > } else { > out(showString(numRows, truncate = 0)) > } > {code} > so the user will be able to send the output to a file/log... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
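If the proposed overload were adopted, redirecting the rendered table would look roughly like the snippet below. The out parameter does not exist in the current Dataset API (it is exactly what this ticket proposes), and df and the logger name are placeholders used only to illustrate the call site.

{code:scala}
// Hypothetical usage of the proposed signature; "out" is not in the current Dataset API.
import org.slf4j.LoggerFactory

val logger = LoggerFactory.getLogger("query-output")

// With the proposed default (out = println) existing callers keep their behavior, while a caller
// that wants the same formatted table in a log or a file just passes a different sink.
df.show(20, truncate = true, out = (s: String) => logger.info(s))
{code}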
[jira] [Assigned] (SPARK-36373) DecimalPrecision only add necessary cast
[ https://issues.apache.org/jira/browse/SPARK-36373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36373: Assignee: (was: Apache Spark) > DecimalPrecision only add necessary cast > > > Key: SPARK-36373 > URL: https://issues.apache.org/jira/browse/SPARK-36373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > For example: > {noformat} > EqualTo(AttributeReference("d1", DecimalType(5, 2))(), > AttributeReference("d2", DecimalType(2, 1))()) > {noformat} > It will add a useless cast to {{d1}}: > {noformat} > (cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2))) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
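The reasoning behind the ticket: the widened comparison type for decimal(5,2) and decimal(2,1) is decimal(5,2), as the quoted plan fragment shows, so d1 already has the target type and wrapping it in a cast is redundant; only d2 actually needs one. One way to observe this is to look at the analyzed plan of such a predicate; the table and column definitions below are placeholders chosen to match the ticket's example, not code from the patch.

{code:scala}
// A minimal sketch with placeholder names; it only shows where the redundant cast appears.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("CREATE TABLE t (d1 DECIMAL(5,2), d2 DECIMAL(2,1)) USING parquet")

// In the analyzed plan the predicate is widened to decimal(5,2); before the change both sides are
// wrapped in a cast even though d1 is already decimal(5,2), which is the cast this ticket removes.
spark.sql("SELECT * FROM t WHERE d1 = d2").explain(true)
{code}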
[jira] [Assigned] (SPARK-36373) DecimalPrecision only add necessary cast
[ https://issues.apache.org/jira/browse/SPARK-36373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36373: Assignee: Apache Spark > DecimalPrecision only add necessary cast > > > Key: SPARK-36373 > URL: https://issues.apache.org/jira/browse/SPARK-36373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > For example: > {noformat} > EqualTo(AttributeReference("d1", DecimalType(5, 2))(), > AttributeReference("d2", DecimalType(2, 1))()) > {noformat} > It will add a useless cast to {{d1}}: > {noformat} > (cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2))) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36373) DecimalPrecision only add necessary cast
[ https://issues.apache.org/jira/browse/SPARK-36373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391150#comment-17391150 ] Apache Spark commented on SPARK-36373: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/33602 > DecimalPrecision only add necessary cast > > > Key: SPARK-36373 > URL: https://issues.apache.org/jira/browse/SPARK-36373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > For example: > {noformat} > EqualTo(AttributeReference("d1", DecimalType(5, 2))(), > AttributeReference("d2", DecimalType(2, 1))()) > {noformat} > It will add a useless cast to {{d1}}: > {noformat} > (cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2))) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36092) Migrate to GitHub Actions Codecov from Jenkins
[ https://issues.apache.org/jira/browse/SPARK-36092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36092. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33591 [https://github.com/apache/spark/pull/33591] > Migrate to GitHub Actions Codecov from Jenkins > -- > > Key: SPARK-36092 > URL: https://issues.apache.org/jira/browse/SPARK-36092 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.2.0 > > > We currently use a manual Codecov site to work around our Jenkins CI security > issue. Now we use GitHub Actions, so we can leverage Codecov to report the > coverage for PySpark. > See also https://github.com/codecov/codecov-action -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36092) Migrate to GitHub Actions Codecov from Jenkins
[ https://issues.apache.org/jira/browse/SPARK-36092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36092: Assignee: Hyukjin Kwon > Migrate to GitHub Actions Codecov from Jenkins > -- > > Key: SPARK-36092 > URL: https://issues.apache.org/jira/browse/SPARK-36092 > Project: Spark > Issue Type: Improvement > Components: Project Infra, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > We currently use a manual Codecov site to work around our Jenkins CI security > issue. Now we use GitHub Actions, so we can leverage Codecov to report the > coverage for PySpark. > See also https://github.com/codecov/codecov-action -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36373) DecimalPrecision only add necessary cast
Yuming Wang created SPARK-36373: --- Summary: DecimalPrecision only add necessary cast Key: SPARK-36373 URL: https://issues.apache.org/jira/browse/SPARK-36373 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yuming Wang For example: {noformat} EqualTo(AttributeReference("d1", DecimalType(5, 2))(), AttributeReference("d2", DecimalType(2, 1))()) {noformat} It will add a useless cast to {{d1}}: {noformat} (cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2))) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36362) Omnibus Java code static analyzer warning fixes
[ https://issues.apache.org/jira/browse/SPARK-36362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391110#comment-17391110 ] Apache Spark commented on SPARK-36362: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/33601 > Omnibus Java code static analyzer warning fixes > --- > > Key: SPARK-36362 > URL: https://issues.apache.org/jira/browse/SPARK-36362 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL, Tests >Affects Versions: 3.2.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.3.0 > > > Inspired by a recent Java code touch-up, I wanted to fix in one pass several > lingering non-trivial issues with the Java code that a static analyzer turns > up. Only a few of these have material effects, but some do, and figured we > could avoid taking N PRs over time to address these. > * Some int*int multiplications that widen to long maybe could overflow > * Unnecessarily non-static inner classes > * Some tests "catch (AssertionError)" and do nothing > * Manual array iteration vs very slightly faster/simpler foreach > * Incorrect generic types that just happen to not cause a runtime error > * Missed opportunities for try-close > * Mutable enums which shouldn't be > * .. and a few other minor things -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
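Of the patterns listed above, the int*int multiplication that only widens to long after the fact is the one most likely to have a material effect, so it is worth spelling out. The snippet below is an illustrative rendering of the pattern in Scala (the actual fixes are in Spark's Java sources), with made-up variable names and sizes.

{code:scala}
// Illustrative only; not code from the Spark patch. Both operands are Int, so the multiplication
// is performed in 32-bit arithmetic and can overflow before the result is widened to Long.
val numRecords: Int = 100000
val recordSize: Int = 64 * 1024

val overflowed: Long = numRecords * recordSize        // the true product 6,553,600,000 wraps around in Int first
val correct: Long = numRecords.toLong * recordSize    // widening one operand makes the product a Long
{code}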