[jira] [Resolved] (SPARK-39624) Support coalesce partition through cartesianProduct
[ https://issues.apache.org/jira/browse/SPARK-39624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl resolved SPARK-39624.
Resolution: Duplicate

> Support coalesce partition through cartesianProduct
> ---
>
> Key: SPARK-39624
> URL: https://issues.apache.org/jira/browse/SPARK-39624
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0, 3.3.0
> Reporter: senmiao
> Priority: Minor
> Attachments: 屏幕截图 2022-06-28 114256.jpg
>
> `CoalesceShufflePartitions` cannot optimize CartesianProductExec, and the resulting number of partitions would be `left partitions * right partitions`, which can be quite large.
>
> It would be better to support partially optimizing `CartesianProduct`.
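A rough illustration of the partition blow-up the report describes; the partition counts are invented for the example:

{code:scala}
// CartesianProductExec pairs every left partition with every right
// partition, so its output partition count is the product of the two,
// and CoalesceShufflePartitions cannot coalesce through it.
val leftPartitions = 1000
val rightPartitions = 1000
val cartesianPartitions = leftPartitions * rightPartitions
println(s"CartesianProductExec output partitions: $cartesianPartitions") // 1000000
{code}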
[jira] [Created] (SPARK-49509) Use Platform.allocateDirectBuffer instead of ByteBuffer.allocateDirect
dzcxzl created SPARK-49509:
--
Summary: Use Platform.allocateDirectBuffer instead of ByteBuffer.allocateDirect
Key: SPARK-49509
URL: https://issues.apache.org/jira/browse/SPARK-49509
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 4.0.0
Reporter: dzcxzl
[jira] [Created] (SPARK-49502) Avoid NPE in SparkEnv.get.shuffleManager.unregisterShuffle
dzcxzl created SPARK-49502:
--
Summary: Avoid NPE in SparkEnv.get.shuffleManager.unregisterShuffle
Key: SPARK-49502
URL: https://issues.apache.org/jira/browse/SPARK-49502
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 4.0.0
Reporter: dzcxzl
[jira] [Created] (SPARK-49445) Support show tooltip in the progress bar of UI
dzcxzl created SPARK-49445:
--
Summary: Support show tooltip in the progress bar of UI
Key: SPARK-49445
URL: https://issues.apache.org/jira/browse/SPARK-49445
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 4.0.0
Reporter: dzcxzl
[jira] [Created] (SPARK-49386) Add memory based thresholds for shuffle spill
dzcxzl created SPARK-49386:
--
Summary: Add memory based thresholds for shuffle spill
Key: SPARK-49386
URL: https://issues.apache.org/jira/browse/SPARK-49386
Project: Spark
Issue Type: Improvement
Components: Spark Core, SQL
Affects Versions: 4.0.0
Reporter: dzcxzl

We can only determine the number of spills by configuring {{spark.shuffle.spill.numElementsForceSpillThreshold}}. In some scenarios, the size of a single row in memory may be very large.
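A minimal sketch of the idea, assuming a size-based threshold is added alongside the existing element-count one; the name of the size threshold below is hypothetical, not an actual Spark config:

{code:scala}
// Force a spill when either the element count or the estimated in-memory
// size crosses its threshold, so a handful of very large rows cannot blow
// up memory before the element-count limit is reached.
def shouldForceSpill(
    numElements: Long,
    usedBytes: Long,
    numElementsForceSpillThreshold: Long, // mirrors the existing config
    sizeForceSpillThresholdBytes: Long    // hypothetical new config
): Boolean = {
  numElements >= numElementsForceSpillThreshold ||
    usedBytes >= sizeForceSpillThresholdBytes
}
{code}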
[jira] [Updated] (SPARK-49217) Support separate buffer size configuration in UnsafeShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-49217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-49217:
---
Description:
{{UnsafeShuffleWriter#mergeSpillsWithFileStream}} uses {{spark.shuffle.file.buffer}} as the buffer for reading spill files, and this buffer is an off-heap buffer.
In the spill process, we want the buffer size to be larger, but once there are too many spill files, {{UnsafeShuffleWriter#mergeSpillsWithFileStream}} needs to allocate a lot of off-heap memory, which makes the executor easily killed by YARN.
[https://github.com/apache/spark/blob/e72d21c299a450e48b3cf6e5d36b8f3e9a568088/core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java#L372-L375]
{code:java}
for (int i = 0; i < spills.length; i++) {
  spillInputStreams[i] = new NioBufferedFileInputStream(
    spills[i].file,
    inputBufferSizeInBytes);
}
{code}

> Support separate buffer size configuration in UnsafeShuffleWriter
> -
>
> Key: SPARK-49217
> URL: https://issues.apache.org/jira/browse/SPARK-49217
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: dzcxzl
> Assignee: dzcxzl
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
> {{UnsafeShuffleWriter#mergeSpillsWithFileStream}} uses {{spark.shuffle.file.buffer}} as the buffer for reading spill files, and this buffer is an off-heap buffer.
> In the spill process, we want the buffer size to be larger, but once there are too many spill files, {{UnsafeShuffleWriter#mergeSpillsWithFileStream}} needs to allocate a lot of off-heap memory, which makes the executor easily killed by YARN.
>
> [https://github.com/apache/spark/blob/e72d21c299a450e48b3cf6e5d36b8f3e9a568088/core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java#L372-L375]
>
> {code:java}
> for (int i = 0; i < spills.length; i++) {
>   spillInputStreams[i] = new NioBufferedFileInputStream(
>     spills[i].file,
>     inputBufferSizeInBytes);
> }
> {code}
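Back-of-the-envelope arithmetic for the problem described above; the spill count and buffer size are invented for the example:

{code:scala}
// Each spill file is opened with its own NioBufferedFileInputStream, so
// the direct (off-heap) memory needed by the merge scales linearly with
// the number of spill files.
val numSpills = 2000                  // hypothetical spill file count
val inputBufferSizeInBytes = 1L << 20 // e.g. spark.shuffle.file.buffer = 1m
val directBytes = numSpills * inputBufferSizeInBytes
println(s"direct memory for merge: ${directBytes >> 20} MiB") // 2000 MiB
{code}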
[jira] [Created] (SPARK-49217) Support separate buffer size configuration in UnsafeShuffleWriter
dzcxzl created SPARK-49217:
--
Summary: Support separate buffer size configuration in UnsafeShuffleWriter
Key: SPARK-49217
URL: https://issues.apache.org/jira/browse/SPARK-49217
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 4.0.0
Reporter: dzcxzl
[jira] [Created] (SPARK-49039) Reset checkbox when executor metrics are loaded in the Stages tab
dzcxzl created SPARK-49039:
--
Summary: Reset checkbox when executor metrics are loaded in the Stages tab
Key: SPARK-49039
URL: https://issues.apache.org/jira/browse/SPARK-49039
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 3.2.0, 3.1.0
Reporter: dzcxzl
[jira] [Created] (SPARK-48540) Avoid ivy output loading settings to stdout
dzcxzl created SPARK-48540:
--
Summary: Avoid ivy output loading settings to stdout
Key: SPARK-48540
URL: https://issues.apache.org/jira/browse/SPARK-48540
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.1.0
Reporter: dzcxzl
[jira] [Created] (SPARK-48218) TransportClientFactory.createClient may NPE cause FetchFailedException
dzcxzl created SPARK-48218:
--
Summary: TransportClientFactory.createClient may NPE cause FetchFailedException
Key: SPARK-48218
URL: https://issues.apache.org/jira/browse/SPARK-48218
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 4.0.0
Reporter: dzcxzl

{code:java}
org.apache.spark.shuffle.FetchFailedException
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1180)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:913)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
Caused by: java.lang.NullPointerException
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:178)
    at org.apache.spark.network.shuffle.ExternalBlockStoreClient.lambda$fetchBlocks$0(ExternalBlockStoreClient.java:128)
    at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
    at org.apache.spark.network.shuffle.RetryingBlockTransferor.start(RetryingBlockTransferor.java:133)
    at org.apache.spark.network.shuffle.ExternalBlockStoreClient.fetchBlocks(ExternalBlockStoreClient.java:139)
{code}
[jira] [Created] (SPARK-48070) Support AdaptiveQueryExecSuite to skip check results
dzcxzl created SPARK-48070:
--
Summary: Support AdaptiveQueryExecSuite to skip check results
Key: SPARK-48070
URL: https://issues.apache.org/jira/browse/SPARK-48070
Project: Spark
Issue Type: Improvement
Components: Tests
Affects Versions: 4.0.0
Reporter: dzcxzl
[jira] [Updated] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data
[ https://issues.apache.org/jira/browse/SPARK-48037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-48037:
---
Affects Version/s: 3.3.0
(was: 3.1.0)
(was: 3.0.1)

> SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data
> --
>
> Key: SPARK-48037
> URL: https://issues.apache.org/jira/browse/SPARK-48037
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 3.3.0
> Reporter: dzcxzl
> Priority: Major
> Labels: pull-request-available
>
[jira] [Created] (SPARK-48037) SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data
dzcxzl created SPARK-48037:
--
Summary: SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data
Key: SPARK-48037
URL: https://issues.apache.org/jira/browse/SPARK-48037
Project: Spark
Issue Type: Bug
Components: Spark Core, SQL
Affects Versions: 3.1.0, 3.0.1
Reporter: dzcxzl
[jira] [Created] (SPARK-47799) Preserve parameter information when using SBT package jar
dzcxzl created SPARK-47799:
--
Summary: Preserve parameter information when using SBT package jar
Key: SPARK-47799
URL: https://issues.apache.org/jira/browse/SPARK-47799
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 3.5.1
Reporter: dzcxzl
[jira] [Created] (SPARK-47456) Support ORC Brotli codec
dzcxzl created SPARK-47456:
--
Summary: Support ORC Brotli codec
Key: SPARK-47456
URL: https://issues.apache.org/jira/browse/SPARK-47456
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: dzcxzl
[jira] [Updated] (SPARK-46943) Support for configuring ShuffledHashJoin plan size Threshold
[ https://issues.apache.org/jira/browse/SPARK-46943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-46943:
---
Description:
When we enable `spark.sql.join.preferSortMergeJoin=false`, we may get the following error.
{code:java}
org.apache.spark.SparkException: Can't acquire 1073741824 bytes memory to build hash relation, got 478549889 bytes
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotAcquireMemoryToBuildLongHashedRelationError(QueryExecutionErrors.scala:795)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.ensureAcquireMemory(HashedRelation.scala:581)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:813)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:761)
    at org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1064)
    at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:153)
    at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec.buildHashedRelation(ShuffledHashJoinExec.scala:75)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.init(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6(WholeStageCodegenExec.scala:775)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6$adapted(WholeStageCodegenExec.scala:771)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915){code}
This is because, when converting SMJ to SHJ, the planner only checks whether the size of the plan is smaller than `conf.autoBroadcastJoinThreshold * conf.numShufflePartitions`. When the configured `numShufflePartitions` is large enough, the conversion to SHJ happens easily, and the executor then fails to build the hash relation due to insufficient memory.
[https://github.com/apache/spark/blob/223afea9960c7ef1a4c8654e043e860f6c248185/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L505-L513]

was:
When we enable `spark.sql.join.preferSortMergeJoin=false`, we may get the following error.
{code:java}
org.apache.spark.SparkException: Can't acquire 1073741824 bytes memory to build hash relation, got 478549889 bytes
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotAcquireMemoryToBuildLongHashedRelationError(QueryExecutionErrors.scala:795)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.ensureAcquireMemory(HashedRelation.scala:581)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:813)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:761)
    at org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1064)
    at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:153)
    at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec.buildHashedRelation(ShuffledHashJoinExec.scala:75)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.init(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6(WholeStageCodegenExec.scala:775)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6$adapted(WholeStageCodegenExec.scala:771)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
{code}
This is because, when converting SMJ to SHJ, the planner only checks whether the size of the plan is smaller than `conf.autoBroadcastJoinThreshold * conf.numShufflePartitions`.
When the configured `numShufflePartitions` is large enough, the conversion to SHJ happens easily, and the executor then fails to build the hash relation due to insufficient memory.
https://github.com/apache/spark/blob/223afea9960c7ef1a4c8654e043e860f6c248185/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L505-L513

> Support for configuring ShuffledHashJoin plan size Threshold
> 
>
> Key: SPARK-46943
> URL: https://issues.apache.org/jira/browse/SPARK-46943
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: dzcxzl
> Priority: Minor
> Labels: pull-request-available
>
> When we enable `spark.sql.join.preferSortMergeJoin=false`, we may get the following error.
>
> {code:java}
> org.apache.spark.SparkException: Can't acquire 1073741824 bytes memory to build hash relation, got 478549889 bytes
> at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotAcquireMemoryToBuildLongHashedRelationError(QueryExecutionErrors.scala:795)
> at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.ensureAcquireMemory(HashedRelation.scala:581)
> at org.apache.spark.sql.
[jira] [Updated] (SPARK-46943) Support for configuring ShuffledHashJoin plan size Threshold
[ https://issues.apache.org/jira/browse/SPARK-46943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-46943:
---
Description:
When we enable `spark.sql.join.preferSortMergeJoin=false`, we may get the following error.
{code:java}
org.apache.spark.SparkException: Can't acquire 1073741824 bytes memory to build hash relation, got 478549889 bytes
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotAcquireMemoryToBuildLongHashedRelationError(QueryExecutionErrors.scala:795)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.ensureAcquireMemory(HashedRelation.scala:581)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:813)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:761)
    at org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1064)
    at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:153)
    at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec.buildHashedRelation(ShuffledHashJoinExec.scala:75)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.init(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6(WholeStageCodegenExec.scala:775)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6$adapted(WholeStageCodegenExec.scala:771)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
{code}
This is because, when converting SMJ to SHJ, the planner only checks whether the size of the plan is smaller than `conf.autoBroadcastJoinThreshold * conf.numShufflePartitions`. When the configured `numShufflePartitions` is large enough, the conversion to SHJ happens easily, and the executor then fails to build the hash relation due to insufficient memory.
https://github.com/apache/spark/blob/223afea9960c7ef1a4c8654e043e860f6c248185/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L505-L513

was:
When we enable `spark.sql.join.preferSortMergeJoin=false`, we may get the following error.
{code:java}
org.apache.spark.SparkException: Can't acquire 1073741824 bytes memory to build hash relation, got 478549889 bytes
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotAcquireMemoryToBuildLongHashedRelationError(QueryExecutionErrors.scala:795)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.ensureAcquireMemory(HashedRelation.scala:581)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:813)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:761)
    at org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1064)
    at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:153)
    at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec.buildHashedRelation(ShuffledHashJoinExec.scala:75)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.init(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6(WholeStageCodegenExec.scala:775)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6$adapted(WholeStageCodegenExec.scala:771)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915){code}
This is because, when converting SMJ to SHJ, the planner only checks whether the size of the plan is smaller than `conf.autoBroadcastJoinThreshold * conf.numShufflePartitions`. When the configured `numShufflePartitions` is large enough, the conversion to SHJ happens easily, and the executor then fails to build the hash relation due to insufficient memory.
[https://github.com/apache/spark/blob/223afea9960c7ef1a4c8654e043e860f6c248185/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L505-L513]

> Support for configuring ShuffledHashJoin plan size Threshold
> 
>
> Key: SPARK-46943
> URL: https://issues.apache.org/jira/browse/SPARK-46943
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: dzcxzl
> Priority: Minor
> Labels: pull-request-available
>
> When we enable `spark.sql.join.preferSortMergeJoin=false`, we may get the following error.
>
> {code:java}
> org.apache.spark.SparkException: Can't acquire 1073741824 bytes memory to build hash relation, got 478549889 bytes
> at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotAcquireMemoryToBuildLongHashedRelationError(QueryExecutionErrors.scala:795)
> at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.ensureAcquireMemory(HashedRelation.scala:581)
> at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:813)
> at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:761)
> at org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1064)
> at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:153)
> at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec.buildHashedRelation(ShuffledHashJoinExec.scala:75)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.init(Unknown Source)
> at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6(WholeStageCodegenExec.scala:775)
> at org.apache.spark.sql.execution.WholeStageCodegenExec.$anonfun$doExecute$6$adapted(WholeStageCodegenExec.scala:771)
> at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
> {code}
>
> This is because, when converting SMJ to SHJ, the planner only checks whether the size of the plan is smaller than `conf.autoBroadcastJoinThreshold * conf.numShufflePartitions`.
> When the configured `numShufflePartitions` is large enough, the conversion to SHJ happens easily, and the executor then fails to build the hash relation due to insufficient memory.
>
> https://github.com/apache/spark/blob/223afea9960c7ef1a4c8654e043e860f6c248185/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L505-L513
>
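A hedged sketch of the size check this ticket describes, simplified from the planner logic the link points to; names and the exact comparison are approximate, not the actual source:

{code:scala}
// SMJ -> SHJ conversion only requires the plan to be small enough to
// build a local hash map, and that bound scales with the number of
// shuffle partitions, so a large numShufflePartitions makes almost any
// plan qualify.
def canBuildLocalHashMap(
    planSizeInBytes: BigInt,
    autoBroadcastJoinThreshold: Long,
    numShufflePartitions: Int): Boolean = {
  planSizeInBytes <= BigInt(autoBroadcastJoinThreshold) * numShufflePartitions
}

// e.g. with a 10 MiB broadcast threshold and 2000 shuffle partitions,
// the bound is ~19.5 GiB, so even a 10 GiB plan passes the check.
{code}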
[jira] [Created] (SPARK-46943) Support for configuring ShuffledHashJoin plan size Threshold
dzcxzl created SPARK-46943:
--
Summary: Support for configuring ShuffledHashJoin plan size Threshold
Key: SPARK-46943
URL: https://issues.apache.org/jira/browse/SPARK-46943
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.5.0
Reporter: dzcxzl
[jira] [Commented] (SPARK-33458) Hive partition pruning support Contains, StartsWith and EndsWith predicate
[ https://issues.apache.org/jira/browse/SPARK-33458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769889#comment-17769889 ] dzcxzl commented on SPARK-33458:
----
After [HIVE-22900|https://issues.apache.org/jira/browse/HIVE-22900] (HMS 4.0), LIKE partition filters are supported via direct SQL. Spark currently uses the `.*` pattern, which may cause incorrect results, because `.*` is the syntax for a JDO query while direct SQL must use `%`.

> Hive partition pruning support Contains, StartsWith and EndsWith predicate
> --
>
> Key: SPARK-33458
> URL: https://issues.apache.org/jira/browse/SPARK-33458
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Fix For: 3.1.0
>
> Hive partition pruning can support Contains, StartsWith and EndsWith predicate:
> https://github.com/apache/hive/blob/0c2c8a7f57330880f156466526bc0fdc94681035/metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L1074-L1075
> https://github.com/apache/hive/commit/0c2c8a7f57330880f156466526bc0fdc94681035#diff-b1200d4259fafd48d7bbd0050e89772218813178f68461a2e82551c52319b282
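An illustration of the wildcard mismatch the comment describes; the filter strings are hypothetical and only contrast the two syntaxes, they are not actual metastore API calls:

{code:scala}
// A "starts with 2023" partition filter: the JDO matcher expects a
// regex-style wildcard, while the direct-SQL path expects a SQL LIKE
// wildcard. Sending the `.*` form down the direct-SQL path can therefore
// match the wrong rows.
val startsWith = "2023"
val jdoStylePattern = s"$startsWith.*" // JDO query matching
val directSqlPattern = s"$startsWith%" // direct SQL LIKE
{code}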
[jira] [Created] (SPARK-44650) `spark.executor.defaultJavaOptions` Check illegal java options
dzcxzl created SPARK-44650:
--
Summary: `spark.executor.defaultJavaOptions` Check illegal java options
Key: SPARK-44650
URL: https://issues.apache.org/jira/browse/SPARK-44650
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.4.1
Reporter: dzcxzl
[jira] [Created] (SPARK-44583) `spark.*.io.connectionCreationTimeout` parameter documentation
dzcxzl created SPARK-44583:
--
Summary: `spark.*.io.connectionCreationTimeout` parameter documentation
Key: SPARK-44583
URL: https://issues.apache.org/jira/browse/SPARK-44583
Project: Spark
Issue Type: Improvement
Components: Documentation
Affects Versions: 3.4.1
Reporter: dzcxzl
[jira] [Created] (SPARK-44556) Reuse `OrcTail` when enable vectorizedReader
dzcxzl created SPARK-44556:
--
Summary: Reuse `OrcTail` when enable vectorizedReader
Key: SPARK-44556
URL: https://issues.apache.org/jira/browse/SPARK-44556
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.1
Reporter: dzcxzl
[jira] [Updated] (SPARK-44497) Show task partition id in Task table
[ https://issues.apache.org/jira/browse/SPARK-44497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44497:
---
Description: In SPARK-37831, the partition id was added to TaskInfo, but the task partition id cannot be seen directly in the UI. (was: In [SPARK-37831|https://issues.apache.org/jira/browse/SPARK-37831], the partition id was added to TaskInfo, but the task partition id cannot be seen directly in the UI)

> Show task partition id in Task table
> 
>
> Key: SPARK-44497
> URL: https://issues.apache.org/jira/browse/SPARK-44497
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.4.1
> Reporter: dzcxzl
> Priority: Minor
>
> In SPARK-37831, the partition id was added to TaskInfo, but the task partition id cannot be seen directly in the UI.
[jira] [Created] (SPARK-44497) Show task partition id in Task table
dzcxzl created SPARK-44497:
--
Summary: Show task partition id in Task table
Key: SPARK-44497
URL: https://issues.apache.org/jira/browse/SPARK-44497
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 3.4.1
Reporter: dzcxzl

In [SPARK-37831|https://issues.apache.org/jira/browse/SPARK-37831], the partition id was added to TaskInfo, but the task partition id cannot be seen directly in the UI.
[jira] [Updated] (SPARK-44490) Remove TaskPagedTable in StagePage
[ https://issues.apache.org/jira/browse/SPARK-44490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44490:
---
Description: In [SPARK-21809|https://issues.apache.org/jira/browse/SPARK-21809], we introduced stagespage-template.html to show the running status of a stage. TaskPagedTable no longer takes effect, but there are still many PRs updating the related code.

> Remove TaskPagedTable in StagePage
> --
>
> Key: SPARK-44490
> URL: https://issues.apache.org/jira/browse/SPARK-44490
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.4.1
> Reporter: dzcxzl
> Priority: Minor
>
> In [SPARK-21809|https://issues.apache.org/jira/browse/SPARK-21809], we introduced stagespage-template.html to show the running status of a stage. TaskPagedTable no longer takes effect, but there are still many PRs updating the related code.
[jira] [Created] (SPARK-44490) Remove TaskPagedTable in StagePage
dzcxzl created SPARK-44490:
--
Summary: Remove TaskPagedTable in StagePage
Key: SPARK-44490
URL: https://issues.apache.org/jira/browse/SPARK-44490
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 3.4.1
Reporter: dzcxzl
[jira] [Created] (SPARK-44454) HiveShim getTablesByType support fallback
dzcxzl created SPARK-44454:
--
Summary: HiveShim getTablesByType support fallback
Key: SPARK-44454
URL: https://issues.apache.org/jira/browse/SPARK-44454
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.1
Reporter: dzcxzl

When we use a newer Hive client to communicate with an older Hive metastore, we may encounter Invalid method name: 'get_tables_by_type'.
{code:java}
23/07/17 12:45:24,391 [main] DEBUG SparkSqlParser: Parsing command: show views
23/07/17 12:45:24,489 [main] ERROR log: Got exception: org.apache.thrift.TApplicationException Invalid method name: 'get_tables_by_type'
org.apache.thrift.TApplicationException: Invalid method name: 'get_tables_by_type'
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_tables_by_type(ThriftHiveMetastore.java:1433)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_tables_by_type(ThriftHiveMetastore.java:1418)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTables(HiveMetaStoreClient.java:1411)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
    at com.sun.proxy.$Proxy23.getTables(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2344)
    at com.sun.proxy.$Proxy23.getTables(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTablesByType(Hive.java:1427)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.sql.hive.client.Shim_v2_3.getTablesByType(HiveShim.scala:1408)
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$listTablesByType$1(HiveClientImpl.scala:789)
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:274)
    at org.apache.spark.sql.hive.client.HiveClientImpl.listTablesByType(HiveClientImpl.scala:785)
    at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listViews$1(HiveExternalCatalog.scala:895)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:108)
    at org.apache.spark.sql.hive.HiveExternalCatalog.listViews(HiveExternalCatalog.scala:893)
    at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listViews(ExternalCatalogWithListener.scala:158)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listViews(SessionCatalog.scala:1040)
    at org.apache.spark.sql.execution.command.ShowViewsCommand.$anonfun$run$5(views.scala:407)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.execution.command.ShowViewsCommand.run(views.scala:407)
{code}
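A hedged sketch of the fallback idea, not the actual HiveShim patch: try the newer HMS RPC first, and on the "Invalid method name" thrift error fall back to an older call every HMS version supports. The three function parameters are hypothetical stand-ins for the RPC and the client-side filtering:

{code:scala}
// fetchByType stands in for the get_tables_by_type RPC; fetchAll plus
// isView stand in for the older list-then-filter path.
def listViewsWithFallback(
    fetchByType: () => Seq[String],
    fetchAll: () => Seq[String],
    isView: String => Boolean): Seq[String] = {
  try {
    fetchByType()
  } catch {
    // Older metastores answer the unknown RPC with this thrift error.
    case e: Exception if Option(e.getMessage)
        .exists(_.contains("Invalid method name: 'get_tables_by_type'")) =>
      fetchAll().filter(isView)
  }
}
{code}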
[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results
[ https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44240:
---
Description:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
!topKSortFallbackThreshold.png!
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
select min(id) from (select id from range(9) order by id desc limit 1) a; {code}
!topKSortFallbackThresholdDesc.png!

was:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
!topKSortFallbackThreshold.png!

> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: dzcxzl
> Priority: Minor
> Attachments: topKSortFallbackThreshold.png, topKSortFallbackThresholdDesc.png
>
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
> If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
> TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
> !topKSortFallbackThreshold.png!
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> select min(id) from (select id from range(9) order by id desc limit 1) a; {code}
> !topKSortFallbackThresholdDesc.png!
>
[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results
[ https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44240:
---
Description:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
!topKSortFallbackThreshold.png!

was:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
!topKSortFallbackThreshold.png!
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
select min(id) from (select id from range(9) order by id desc limit 1) a; {code}
!topKSortFallbackThresholdDesc.png!

> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: dzcxzl
> Priority: Minor
> Attachments: topKSortFallbackThreshold.png, topKSortFallbackThresholdDesc.png
>
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
> If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
> TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
> !topKSortFallbackThreshold.png!
>
[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results
[ https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44240:
---
Description:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
!topKSortFallbackThreshold.png!
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
select min(id) from (select id from range(9) order by id desc limit 1) a; {code}
!topKSortFallbackThresholdDesc.png!

was:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
!topKSortFallbackThreshold.png!

> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: dzcxzl
> Priority: Minor
> Attachments: topKSortFallbackThreshold.png, topKSortFallbackThresholdDesc.png
>
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
> If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
> TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
> !topKSortFallbackThreshold.png!
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> select min(id) from (select id from range(9) order by id desc limit 1) a; {code}
> !topKSortFallbackThresholdDesc.png!
[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results
[ https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44240:
---
Attachment: topKSortFallbackThresholdDesc.png

> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: dzcxzl
> Priority: Minor
> Attachments: topKSortFallbackThreshold.png, topKSortFallbackThresholdDesc.png
>
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
> If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
> TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
> !topKSortFallbackThreshold.png!
>
[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results
[ https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44240:
---
Description:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
!topKSortFallbackThreshold.png!

was:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
!topKSortFallbackThreshold.png!

> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: dzcxzl
> Priority: Minor
> Attachments: topKSortFallbackThreshold.png
>
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
> If GlobalLimitExec is not the final operator and follows a sort operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
> TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
> !topKSortFallbackThreshold.png!
>
[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results
[ https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44240:
---
Description:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
!topKSortFallbackThreshold.png!

was:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.

> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: dzcxzl
> Priority: Minor
> Attachments: topKSortFallbackThreshold.png
>
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
> If GlobalLimitExec is not the final operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
> TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
>
> !topKSortFallbackThreshold.png!
>
[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results
[ https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44240:
---
Attachment: topKSortFallbackThreshold.png

> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: dzcxzl
> Priority: Minor
> Attachments: topKSortFallbackThreshold.png
>
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
> If GlobalLimitExec is not the final operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
> TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
>
[jira] [Updated] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results
[ https://issues.apache.org/jira/browse/SPARK-44240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-44240:
---
Description:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.

was:
{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.

> Setting the topKSortFallbackThreshold value may lead to inaccurate results
> --
>
> Key: SPARK-44240
> URL: https://issues.apache.org/jira/browse/SPARK-44240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: dzcxzl
> Priority: Minor
>
> {code:java}
> set spark.sql.execution.topKSortFallbackThreshold=1;
> SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
> If GlobalLimitExec is not the final operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
> TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
>
[jira] [Created] (SPARK-44240) Setting the topKSortFallbackThreshold value may lead to inaccurate results
dzcxzl created SPARK-44240:
--
Summary: Setting the topKSortFallbackThreshold value may lead to inaccurate results
Key: SPARK-44240
URL: https://issues.apache.org/jira/browse/SPARK-44240
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0, 3.3.0, 3.2.0, 3.1.0, 3.0.0, 2.4.0
Reporter: dzcxzl

{code:java}
set spark.sql.execution.topKSortFallbackThreshold=1;
SELECT min(id) FROM ( SELECT id FROM range(9) ORDER BY id LIMIT 1) a; {code}
If GlobalLimitExec is not the final operator, the shuffle read does not guarantee ordering, so the data read by the limit may be random.
TakeOrderedAndProjectExec preserves ordering, so there is no such problem.
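A hedged sketch of the planner choice behind the repro above; the condition is paraphrased and its names and exact comparison are approximate, not the actual strategy code:

{code:scala}
// With topKSortFallbackThreshold = 1, a "LIMIT 1 over ORDER BY" no longer
// qualifies for TakeOrderedAndProjectExec (which preserves ordering) and
// instead plans Sort + GlobalLimitExec, where the shuffle read gives no
// ordering guarantee, hence the random result.
val limit = 1
val topKSortFallbackThreshold = 1
val usesTakeOrderedAndProject = limit < topKSortFallbackThreshold // false -> fallback
{code}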
[jira] [Resolved] (SPARK-37605) Support the configuration of the initial number of scan partitions when executing a take on a query
[ https://issues.apache.org/jira/browse/SPARK-37605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl resolved SPARK-37605.
Resolution: Duplicate

> Support the configuration of the initial number of scan partitions when executing a take on a query
> ---
>
> Key: SPARK-37605
> URL: https://issues.apache.org/jira/browse/SPARK-37605
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: dzcxzl
> Priority: Trivial
>
> Currently, the initial number of partitions scanned when executing a take on a query is 1 by default.
> This number is not configurable.
> Sometimes the first task runs slowly; if we had this configuration, we could increase the initial parallelism.
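A minimal sketch of the ramp-up this ticket targets, assuming the usual take() pattern of scanning one partition first and growing by a scale-up factor; the function and its parameters are illustrative, not the actual SparkPlan.executeTake code:

{code:scala}
// Decide how many more partitions to scan in the next round of a take():
// the first round scans a hard-coded single partition, which is the value
// the ticket wants to make configurable.
def nextPartitionsToScan(
    rowsCollected: Long, rowsNeeded: Long,
    partsScanned: Int, totalParts: Int,
    scaleUpFactor: Int): Int = {
  if (rowsCollected >= rowsNeeded || partsScanned >= totalParts) 0
  else if (partsScanned == 0) 1 // hard-coded initial parallelism
  else math.min(partsScanned * scaleUpFactor, totalParts - partsScanned)
}
{code}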
[jira] [Updated] (SPARK-43301) BlockStoreClient getHostLocalDirs RPC supports IOException retry
[ https://issues.apache.org/jira/browse/SPARK-43301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-43301:
---
Summary: BlockStoreClient getHostLocalDirs RPC supports IOException retry (was: BlockStoreClient getHostLocalDirs RPC supports IOexception retry)

> BlockStoreClient getHostLocalDirs RPC supports IOException retry
> 
>
> Key: SPARK-43301
> URL: https://issues.apache.org/jira/browse/SPARK-43301
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: dzcxzl
> Priority: Minor
>
> The BlockStoreClient#getHostLocalDirs RPC did not retry when an IOException occurred, and then FetchFailedException was thrown.
>
> {code:java}
> 23/04/24 01:24:55,158 [shuffle-client-7-1] WARN ExternalBlockStoreClient: Error while trying to get the host local dirs for [148]
> 23/04/24 01:24:55,158 [shuffle-client-7-1] ERROR ShuffleBlockFetcherIterator: Error occurred while fetching host local blocks
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
> at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
> at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:745) {code}
[jira] [Created] (SPARK-43301) BlockStoreClient getHostLocalDirs RPC supports IOexception retry
dzcxzl created SPARK-43301:
--
Summary: BlockStoreClient getHostLocalDirs RPC supports IOexception retry
Key: SPARK-43301
URL: https://issues.apache.org/jira/browse/SPARK-43301
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.0.0
Reporter: dzcxzl

The BlockStoreClient#getHostLocalDirs RPC did not retry when an IOException occurred, and then FetchFailedException was thrown.
{code:java}
23/04/24 01:24:55,158 [shuffle-client-7-1] WARN ExternalBlockStoreClient: Error while trying to get the host local dirs for [148]
23/04/24 01:24:55,158 [shuffle-client-7-1] ERROR ShuffleBlockFetcherIterator: Error occurred while fetching host local blocks
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:745)
{code}
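A hedged sketch of the retry idea: treat an IOException from the getHostLocalDirs RPC as retryable instead of failing the fetch outright. `maxRetries`, `waitMs`, and `rpc` are hypothetical stand-ins, not the actual BlockStoreClient API:

{code:scala}
import java.io.IOException

// Retry the wrapped RPC up to maxRetries times on IOException, backing
// off between attempts; any other failure (or retry exhaustion)
// propagates to the caller.
def withIoRetry[T](maxRetries: Int, waitMs: Long)(rpc: () => T): T = {
  var attempt = 0
  while (true) {
    try {
      return rpc()
    } catch {
      case _: IOException if attempt < maxRetries =>
        attempt += 1
        Thread.sleep(waitMs) // back off before the next attempt
    }
  }
  throw new IllegalStateException("unreachable")
}
{code}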
[jira] [Created] (SPARK-42808) Avoid getting availableProcessors every time in MapOutputTrackerMaster#getStatistics
dzcxzl created SPARK-42808:
--
Summary: Avoid getting availableProcessors every time in MapOutputTrackerMaster#getStatistics
Key: SPARK-42808
URL: https://issues.apache.org/jira/browse/SPARK-42808
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.3.2
Reporter: dzcxzl
[jira] [Created] (SPARK-42807) Apply custom log URL pattern for yarn-client AM log URL in SHS
dzcxzl created SPARK-42807:
--
Summary: Apply custom log URL pattern for yarn-client AM log URL in SHS
Key: SPARK-42807
URL: https://issues.apache.org/jira/browse/SPARK-42807
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.3.2
Reporter: dzcxzl
[jira] [Updated] (SPARK-42366) Log shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-42366:
---
Summary: Log shuffle data corruption diagnose cause (was: Log output shuffle data corruption diagnose cause)

> Log shuffle data corruption diagnose cause
> --
>
> Key: SPARK-42366
> URL: https://issues.apache.org/jira/browse/SPARK-42366
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 3.2.0
> Reporter: dzcxzl
> Priority: Minor
>
[jira] [Updated] (SPARK-42366) Log output shuffle data corruption diagnose cause
[ https://issues.apache.org/jira/browse/SPARK-42366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-42366:
---
Summary: Log output shuffle data corruption diagnose cause (was: Log output shuffle data corruption diagnose causes)

> Log output shuffle data corruption diagnose cause
> -
>
> Key: SPARK-42366
> URL: https://issues.apache.org/jira/browse/SPARK-42366
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 3.2.0
> Reporter: dzcxzl
> Priority: Minor
>
[jira] [Created] (SPARK-42366) Log output shuffle data corruption diagnose causes
dzcxzl created SPARK-42366: -- Summary: Log output shuffle data corruption diagnose causes Key: SPARK-42366 URL: https://issues.apache.org/jira/browse/SPARK-42366 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.2.0 Reporter: dzcxzl -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35744) Performance degradation in avro SpecificRecordBuilders
[ https://issues.apache.org/jira/browse/SPARK-35744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654965#comment-17654965 ] dzcxzl commented on SPARK-35744: This problem should be solved by upgrading to Avro 1.11.0 ([AVRO-3186|https://issues.apache.org/jira/browse/AVRO-3186]) through [SPARK-37206|https://issues.apache.org/jira/browse/SPARK-37206]; we should be able to close this ticket. > Performance degradation in avro SpecificRecordBuilders > -- > > Key: SPARK-35744 > URL: https://issues.apache.org/jira/browse/SPARK-35744 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Steven Aerts >Priority: Minor > > Creating this bug to let you know that when we tested out spark 3.2.0 we saw > a significant performance degradation where our code was handling Avro > Specific Record objects. This slowed down some of our jobs by a factor of 4. > Spark 3.2.0 bumps the avro version from 1.8.2 to 1.10.2. > The degradation was caused by a change introduced in avro 1.9.0. This change > degrades performance when creating avro specific records in certain > classloader topologies, like the ones used in spark. > We notified and [proposed|https://github.com/apache/avro/pull/1253] a simple > fix upstream in the avro project. (Links contain more details) > It is unclear for us how many other projects are using avro specific records > in a spark context and will be impacted by this degradation. > Feel free to close this issue if you think this issue is too much of a > corner case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41003) BHJ LeftAnti does not update numOutputRows when codegen is disabled
dzcxzl created SPARK-41003: -- Summary: BHJ LeftAnti does not update numOutputRows when codegen is disabled Key: SPARK-41003 URL: https://issues.apache.org/jira/browse/SPARK-41003 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: dzcxzl -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40987) Avoid creating a directory when deleting a block, causing DAGScheduler to not work
dzcxzl created SPARK-40987: -- Summary: Avoid creating a directory when deleting a block, causing DAGScheduler to not work Key: SPARK-40987 URL: https://issues.apache.org/jira/browse/SPARK-40987 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.1, 3.2.2 Reporter: dzcxzl When the driver submits a job, DAGScheduler calls sc.broadcast(taskBinaryBytes). TorrentBroadcast#writeBlocks may fail due to disk problems during blockManager#putBytes. BlockManager#doPut then calls BlockManager#removeBlockInternal to clean up the block, and BlockManager#removeBlockInternal calls DiskStore#remove to clean up the block on disk. DiskStore#remove tries to create the directory because it does not exist, and an exception is thrown at this point, so the block info and lock in BlockInfoManager#blockInfoWrappers are not removed. The catch block in TorrentBroadcast#writeBlocks then calls blockManager.removeBroadcast to clean up the broadcast. Because the block lock in BlockInfoManager#blockInfoWrappers is never released, the dag-scheduler-event-loop thread of DAGScheduler waits forever. {code:java} 22/11/01 18:27:48 WARN BlockManager: Putting block broadcast_0_piece0 failed due to exception java.io.IOException: X. 22/11/01 18:27:48 ERROR TorrentBroadcast: Store broadcast broadcast_0 fail, remove all pieces of the broadcast {code} {code:java} "dag-scheduler-event-loop" #54 daemon prio=5 os_prio=31 tid=0x7fc98e3fa800 nid=0x7203 waiting on condition [0x78c1e000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0007add3d8c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at org.apache.spark.storage.BlockInfoManager.$anonfun$acquireLock$1(BlockInfoManager.scala:221) at org.apache.spark.storage.BlockInfoManager.$anonfun$acquireLock$1$adapted(BlockInfoManager.scala:214) at org.apache.spark.storage.BlockInfoManager$$Lambda$3038/1307533457.apply(Unknown Source) at org.apache.spark.storage.BlockInfoWrapper.withLock(BlockInfoManager.scala:105) at org.apache.spark.storage.BlockInfoManager.acquireLock(BlockInfoManager.scala:214) at org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:293) at org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1979) at org.apache.spark.storage.BlockManager.$anonfun$removeBroadcast$3(BlockManager.scala:1970) at org.apache.spark.storage.BlockManager.$anonfun$removeBroadcast$3$adapted(BlockManager.scala:1970) at org.apache.spark.storage.BlockManager$$Lambda$3092/1241801156.apply(Unknown Source) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1970) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:179) at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78) at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1538) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1520) at 
org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1539) at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1355) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1297) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2929) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2921) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2910) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
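The essence of the fix is to avoid the directory-creating lookup path when the only goal is deletion. A minimal sketch of that idea using a plain java.io.File lookup (the actual patch works inside DiskStore/DiskBlockManager):
{code:scala}
import java.io.File

// Sketch only: locate the block file without re-creating its parent
// directory. The bug was that the lookup used during cleanup tried to
// mkdir on a bad disk, threw, and left the block lock held forever.
def removeBlockFile(localDir: File, blockFileName: String): Boolean = {
  val file = new File(localDir, blockFileName) // plain lookup, no mkdirs
  if (file.exists()) file.delete() else true   // nothing on disk counts as removed
}
{code}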
[jira] [Created] (SPARK-40312) Add missing configuration documentation in Spark History Server
dzcxzl created SPARK-40312: -- Summary: Add missing configuration documentation in Spark History Server Key: SPARK-40312 URL: https://issues.apache.org/jira/browse/SPARK-40312 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.3.0 Reporter: dzcxzl -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569385#comment-17569385 ] dzcxzl commented on SPARK-39830: cc @[~dongjoon] > Reading ORC table that requires type promotion may throw AIOOBE > --- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: dzcxzl >Priority: Trivial > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-39830: --- Description: We can add a UT to test the scenario after the ORC-1205 release. bin/spark-shell {code:java} spark.sql("set orc.stripe.size=10240") spark.sql("set orc.rows.between.memory.checks=1") spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") val df = spark.range(1, 1+512, 1, 1).map { i => if( i == 1 ){ (i, Array.fill[Byte](5 * 1024 * 1024)('X')) } else { (i,Array.fill[Byte](1)('X')) } }.toDF("c1","c2") df.write.format("orc").save("file:///tmp/test_table_orc_t1") spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) location 'file:///tmp/test_table_orc_t1' stored as orc ") spark.sql("select * from test_table_orc_t1").show() {code} Querying this table will get the following exception {code:java} java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) {code} was: {code:java} spark.sql("set orc.stripe.size=10240") spark.sql("set orc.rows.between.memory.checks=1") spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") val df = spark.range(1, 1+512, 1, 1).map { i => if( i == 1 ){ (i, Array.fill[Byte](5 * 1024 * 1024)('X')) } else { (i,Array.fill[Byte](1)('X')) } }.toDF("c1","c2") df.write.format("orc").save("file:///tmp/test_table_orc_t1") spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) location 'file:///tmp/test_table_orc_t1' stored as orc ") spark.sql("select * from test_table_orc_t1").show() {code} Querying this table will get the following exception {code:java} java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) {code} We can add a UT to test the scenario after the [ORC-1205|https://issues.apache.org/jira/browse/ORC-1205] release > Reading ORC table that requires type promotion may throw AIOOBE > --- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: dzcxzl >Priority: Trivial > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external
[jira] [Created] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
dzcxzl created SPARK-39830: -- Summary: Reading ORC table that requires type promotion may throw AIOOBE Key: SPARK-39830 URL: https://issues.apache.org/jira/browse/SPARK-39830 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: dzcxzl {code:java} spark.sql("set orc.stripe.size=10240") spark.sql("set orc.rows.between.memory.checks=1") spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") val df = spark.range(1, 1+512, 1, 1).map { i => if( i == 1 ){ (i, Array.fill[Byte](5 * 1024 * 1024)('X')) } else { (i,Array.fill[Byte](1)('X')) } }.toDF("c1","c2") df.write.format("orc").save("file:///tmp/test_table_orc_t1") spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) location 'file:///tmp/test_table_orc_t1' stored as orc ") spark.sql("select * from test_table_orc_t1").show() {code} Querying this table will get the following exception {code:java} java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) {code} We can add a UT to test the scenario after the [ORC-1205|https://issues.apache.org/jira/browse/ORC-1205] release -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39628) Fix race condition when handling IdleStateEvent again
dzcxzl created SPARK-39628: -- Summary: Fix race condition when handling IdleStateEvent again Key: SPARK-39628 URL: https://issues.apache.org/jira/browse/SPARK-39628 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Created: 28/Jun/22 10:26 Priority: Minor Assignee: Unassigned Reporter: dzcxzl SPARK-27073 fixed a race condition in the handling of IdleStateEvent, but SPARK-37462 changed the call order, which may lead to a regression.
[jira] [Updated] (SPARK-39355) Single column uses quoted to construct UnresolvedAttribute
[ https://issues.apache.org/jira/browse/SPARK-39355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-39355: --- Summary: Single column uses quoted to construct UnresolvedAttribute (was: Avoid UnresolvedAttribute.apply throwing ParseException) > Single column uses quoted to construct UnresolvedAttribute > -- > > Key: SPARK-39355 > URL: https://issues.apache.org/jira/browse/SPARK-39355 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Trivial > > > {code:java} > select * from (select '2022-06-01' as c1 ) a where c1 in (select > date_add('2022-06-01',0)); {code} > {code:java} > Error in query: > mismatched input '(' expecting {, '.', '-'}(line 1, pos 8) > == SQL == > date_add(2022-06-01, 0) > ^^^ {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39415) Local mode supports HadoopDelegationTokenManager
[ https://issues.apache.org/jira/browse/SPARK-39415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl resolved SPARK-39415. Resolution: Duplicate > Local mode supports HadoopDelegationTokenManager > > > Key: SPARK-39415 > URL: https://issues.apache.org/jira/browse/SPARK-39415 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: dzcxzl >Priority: Minor > > Now in the kerberos environment, using spark-submit --master=local > --proxy-user xxx cannot access Hive Meta Store, and using --keytab will not > automatically relogin. > {code:java} > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:483) > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39415) Local mode supports HadoopDelegationTokenManager
[ https://issues.apache.org/jira/browse/SPARK-39415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-39415: --- Summary: Local mode supports HadoopDelegationTokenManager (was: Local mode supports delegationTokenManager) > Local mode supports HadoopDelegationTokenManager > > > Key: SPARK-39415 > URL: https://issues.apache.org/jira/browse/SPARK-39415 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: dzcxzl >Priority: Minor > > Now in the kerberos environment, using spark-submit --master=local > --proxy-user xxx cannot access Hive Meta Store, and using --keytab will not > automatically relogin. > {code:java} > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:483) > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39415) Local mode supports delegationTokenManager
dzcxzl created SPARK-39415: -- Summary: Local mode supports delegationTokenManager Key: SPARK-39415 URL: https://issues.apache.org/jira/browse/SPARK-39415 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.1 Reporter: dzcxzl Now in the kerberos environment, using spark-submit --master=local --proxy-user xxx cannot access Hive Meta Store, and using --keytab will not automatically relogin. {code:java} javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:483) {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39382) UI show the duration of the failed task when the executor lost
[ https://issues.apache.org/jira/browse/SPARK-39382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-39382: --- Summary: UI show the duration of the failed task when the executor lost (was: UI show the duartion of the failed task when the executor lost) > UI show the duration of the failed task when the executor lost > -- > > Key: SPARK-39382 > URL: https://issues.apache.org/jira/browse/SPARK-39382 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: dzcxzl >Priority: Trivial > > When the executor is lost due to OOM or other reasons, the metrics of these > failed tasks do not have executorRunTime, so the duration cannot be displayed > in the UI. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39387) Upgrade hive-storage-api to 2.7.3
[ https://issues.apache.org/jira/browse/SPARK-39387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-39387: --- Description: HIVE-25190: Fix many small allocations in BytesColumnVector {code:java} Caused by: java.lang.RuntimeException: Overflow of newLength. smallBuffer.length=1073741824, nextElemLength=408101 at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.increaseBufferSpace(BytesColumnVector.java:311) at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.setVal(BytesColumnVector.java:182) at org.apache.hadoop.hive.ql.io.orc.WriterImpl.setColumn(WriterImpl.java:179) at org.apache.hadoop.hive.ql.io.orc.WriterImpl.setColumn(WriterImpl.java:268) at org.apache.hadoop.hive.ql.io.orc.WriterImpl.setColumn(WriterImpl.java:223) at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:294) at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:105) at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:157) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:176) at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:86) at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:93) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:312) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1534) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:319) {code} was:[HIVE-25190|https://issues.apache.org/jira/browse/HIVE-25190]: Fix many small allocations in BytesColumnVector > Upgrade hive-storage-api to 2.7.3 > - > > Key: SPARK-39387 > URL: https://issues.apache.org/jira/browse/SPARK-39387 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1 >Reporter: dzcxzl >Priority: Minor > > HIVE-25190: Fix many small allocations in BytesColumnVector > > {code:java} > Caused by: java.lang.RuntimeException: Overflow of newLength. 
> smallBuffer.length=1073741824, nextElemLength=408101 > at > org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.increaseBufferSpace(BytesColumnVector.java:311) > at > org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.setVal(BytesColumnVector.java:182) > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.setColumn(WriterImpl.java:179) > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.setColumn(WriterImpl.java:268) > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.setColumn(WriterImpl.java:223) > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:294) > at > org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:105) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:157) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:176) > at > org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:86) > at > org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:93) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:312) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1534) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:319) > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
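For context, the "Overflow of newLength" in the trace comes from doubling a buffer that is already larger than Integer.MAX_VALUE / 2. An illustration of the arithmetic (illustration only, not the Hive code itself):
{code:scala}
// Doubling-based growth overflows Int once the buffer passes 1 GiB;
// the guard below reproduces the error message seen in the stack trace.
def grownLength(currentLength: Int, nextElemLength: Int): Int = {
  val target = math.max(currentLength.toLong * 2, currentLength.toLong + nextElemLength)
  if (target > Int.MaxValue) {
    throw new RuntimeException(
      s"Overflow of newLength. smallBuffer.length=$currentLength, nextElemLength=$nextElemLength")
  }
  target.toInt
}
{code}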
[jira] [Created] (SPARK-39387) Upgrade hive-storage-api to 2.7.3
dzcxzl created SPARK-39387: -- Summary: Upgrade hive-storage-api to 2.7.3 Key: SPARK-39387 URL: https://issues.apache.org/jira/browse/SPARK-39387 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.1 Reporter: dzcxzl [HIVE-25190|https://issues.apache.org/jira/browse/HIVE-25190]: Fix many small allocations in BytesColumnVector -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39382) UI show the duartion of the failed task when the executor lost
dzcxzl created SPARK-39382: -- Summary: UI show the duartion of the failed task when the executor lost Key: SPARK-39382 URL: https://issues.apache.org/jira/browse/SPARK-39382 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.1 Reporter: dzcxzl When the executor is lost due to OOM or other reasons, the metrics of these failed tasks do not have executorRunTime, so the duration cannot be displayed in the UI. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39381) Make vectorized orc columnar writer batch size configurable
[ https://issues.apache.org/jira/browse/SPARK-39381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-39381: --- Description: Now vectorized columnar orc writer batch size is default 1024. (was: Now vectorized columnar orc writer batch size is default 1024) > Make vectorized orc columnar writer batch size configurable > -- > > Key: SPARK-39381 > URL: https://issues.apache.org/jira/browse/SPARK-39381 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: dzcxzl >Priority: Minor > > Now vectorized columnar orc writer batch size is default 1024. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39381) Make vectorized orc columnar writer batch size configurable
dzcxzl created SPARK-39381: -- Summary: Make vectorized orc columnar writer batch size configurable Key: SPARK-39381 URL: https://issues.apache.org/jira/browse/SPARK-39381 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: dzcxzl Now vectorized columnar orc writer batch size is default 1024 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
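The conf this ticket proposes is the one already exercised in the SPARK-39830 repro earlier in this digest, so a bin/spark-shell usage sketch is simply:
{code:scala}
// Usage sketch in bin/spark-shell: tune the ORC columnar writer batch size
// instead of relying on the hard-coded default of 1024.
spark.sql("set spark.sql.orc.columnarWriterBatchSize=128")
spark.range(0, 1024).toDF("c1").write.format("orc").save("file:///tmp/orc_batch_demo")
{code}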
[jira] [Updated] (SPARK-39355) Avoid UnresolvedAttribute.apply throwing ParseException
[ https://issues.apache.org/jira/browse/SPARK-39355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-39355: --- Summary: Avoid UnresolvedAttribute.apply throwing ParseException (was: UnresolvedAttribute should only use CatalystSqlParser if name contains dot) > Avoid UnresolvedAttribute.apply throwing ParseException > --- > > Key: SPARK-39355 > URL: https://issues.apache.org/jira/browse/SPARK-39355 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Trivial > > > {code:java} > select * from (select '2022-06-01' as c1 ) a where c1 in (select > date_add('2022-06-01',0)); {code} > {code:java} > Error in query: > mismatched input '(' expecting {, '.', '-'}(line 1, pos 8) > == SQL == > date_add(2022-06-01, 0) > ^^^ {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39355) UnresolvedAttribute should only use CatalystSqlParser if name contains dot
dzcxzl created SPARK-39355: -- Summary: UnresolvedAttribute should only use CatalystSqlParser if name contains dot Key: SPARK-39355 URL: https://issues.apache.org/jira/browse/SPARK-39355 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: dzcxzl {code:java} select * from (select '2022-06-01' as c1 ) a where c1 in (select date_add('2022-06-01',0)); {code} {code:java} Error in query: mismatched input '(' expecting {, '.', '-'}(line 1, pos 8) == SQL == date_add(2022-06-01, 0) ^^^ {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
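The later rename of this ticket ("Single column uses quoted to construct UnresolvedAttribute", see the update above) points at the fix: build the attribute with UnresolvedAttribute.quoted so the generated alias is not re-parsed. A minimal sketch of the API difference:
{code:scala}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// apply() runs the name through the parser and fails on an alias like
// "date_add(2022-06-01, 0)"; quoted() treats the whole string as a single
// column name and never parses it.
val alias = "date_add('2022-06-01', 0)"
val attr = UnresolvedAttribute.quoted(alias)
// UnresolvedAttribute(alias) would throw the ParseException shown above
{code}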
[jira] [Created] (SPARK-38979) Improve error log readability in OrcUtils.requestedColumnIds
dzcxzl created SPARK-38979: -- Summary: Improve error log readability in OrcUtils.requestedColumnIds Key: SPARK-38979 URL: https://issues.apache.org/jira/browse/SPARK-38979 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: dzcxzl OrcUtils#requestedColumnIds sometimes fails because orcFieldNames.length > dataSchema.length, and the resulting log message is not very clear. {code:java} java.lang.AssertionError: assertion failed: The given data schema struct has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read. {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38951) Aggregate aliases override field names in ResolveAggregateFunctions
dzcxzl created SPARK-38951: -- Summary: Aggregate aliases override field names in ResolveAggregateFunctions Key: SPARK-38951 URL: https://issues.apache.org/jira/browse/SPARK-38951 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: dzcxzl Spark versions before 3.1.x can run the following SQL, but later versions fail with the error below: {code:java} select sum(id) as id from range(10) group by id order by sum(id);{code} {code:java} Error in query: Resolved attribute(s) id#0L missing from id#1L in operator !Aggregate [id#1L], [sum(id#1L) AS id#0L, sum(id#0L) AS sum(id#0L)#4L]. Attribute(s) with the same name appear in the operation: id. Please check if the right attribute(s) are used.; Project [id#0L] +- Sort [sum(id#0L)#4L ASC NULLS FIRST], true +- !Aggregate [id#1L], [sum(id#1L) AS id#0L, sum(id#0L) AS sum(id#0L)#4L] +- Range (0, 10, step=1, splits=None) {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38936) Script transform feed thread should have name
dzcxzl created SPARK-38936: -- Summary: Script transform feed thread should have name Key: SPARK-38936 URL: https://issues.apache.org/jira/browse/SPARK-38936 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1, 3.1.1 Reporter: dzcxzl The feed thread name was lost after the SPARK-32105 refactoring. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
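Restoring the name is a one-liner wherever the feed thread is constructed; a generic sketch (the thread name below is illustrative, not necessarily the one the patch restores):
{code:scala}
// Sketch: give the script-transformation feed thread a descriptive name so
// jstack output and logs stay readable after the refactoring.
val feedThread = new Thread(new Runnable {
  override def run(): Unit = {
    // feed input rows to the external script process here
  }
})
feedThread.setName("Thread-ScriptTransformation-Feed") // illustrative name
feedThread.setDaemon(true)
feedThread.start()
{code}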
[jira] [Created] (SPARK-37605) Support the configuration of the initial number of scan partitions when executing a take on a query
dzcxzl created SPARK-37605: -- Summary: Support the configuration of the initial number of scan partitions when executing a take on a query Key: SPARK-37605 URL: https://issues.apache.org/jira/browse/SPARK-37605 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: dzcxzl Currently, the initial number of scanned partitions is 1 by default when executing a take on a query, and this number is not configurable. Sometimes the first task runs slowly; with such a configuration we could increase the initial parallelism. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
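Assuming the change lands as a SQL conf (the name below is an assumption based on this ticket, not quoted from it), usage in bin/spark-shell would look like:
{code:scala}
// Hypothetical conf name: raise the initial scan parallelism for take()
// so a slow first task does not dominate the query latency.
spark.sql("set spark.sql.limit.initialNumPartitions=10")
val firstRows = spark.range(0, 1000000, 1, 200).toDF("id").take(5)
{code}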
[jira] [Updated] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-37561: --- Attachment: getDelegationToken_load_functions.png > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Trivial > Attachments: getDelegationToken_load_functions.png > > > At present, when obtaining the delegationToken of hive, all functions will be > loaded. > This is unnecessary, it takes time to load the function, and it also > increases the burden on the hive meta store. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
[ https://issues.apache.org/jira/browse/SPARK-37561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-37561: --- Description: At present, when obtaining the delegationToken of hive, all functions will be loaded. This is unnecessary, it takes time to load the function, and it also increases the burden on the hive meta store. !getDelegationToken_load_functions.png! was: At present, when obtaining the delegationToken of hive, all functions will be loaded. This is unnecessary, it takes time to load the function, and it also increases the burden on the hive meta store. > Avoid loading all functions when obtaining hive's DelegationToken > - > > Key: SPARK-37561 > URL: https://issues.apache.org/jira/browse/SPARK-37561 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Trivial > Attachments: getDelegationToken_load_functions.png > > > At present, when obtaining the delegationToken of hive, all functions will be > loaded. > This is unnecessary, it takes time to load the function, and it also > increases the burden on the hive meta store. > > !getDelegationToken_load_functions.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37561) Avoid loading all functions when obtaining hive's DelegationToken
dzcxzl created SPARK-37561: -- Summary: Avoid loading all functions when obtaining hive's DelegationToken Key: SPARK-37561 URL: https://issues.apache.org/jira/browse/SPARK-37561 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: dzcxzl At present, when obtaining the delegationToken of hive, all functions will be loaded. This is unnecessary; loading the functions takes time and also increases the burden on the Hive Meta Store. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36799) Pass queryExecution name in CLI
[ https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-36799: --- Summary: Pass queryExecution name in CLI (was: Pass queryExecution name in CLI when only select query) > Pass queryExecution name in CLI > --- > > Key: SPARK-36799 > URL: https://issues.apache.org/jira/browse/SPARK-36799 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > > Currently, in the spark-sql CLI, QueryExecutionListener receives commands but > not select queries, because the queryExecution name is not passed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37217) The number of dynamic partitions should early check when writing to external tables
[ https://issues.apache.org/jira/browse/SPARK-37217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-37217: --- Summary: The number of dynamic partitions should early check when writing to external tables (was: Dynamic partitions should fail quickly when writing to external tables to prevent data deletion) > The number of dynamic partitions should early check when writing to external > tables > --- > > Key: SPARK-37217 > URL: https://issues.apache.org/jira/browse/SPARK-37217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Trivial > > [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a > mechanism whereby writes to external tables use dynamic partitioning, and the > data in the target partitions is deleted first. > Assuming that 1001 partitions are written, the data of the 1001 partitions > will be deleted first, but because hive.exec.max.dynamic.partitions is 1000 > by default, loadDynamicPartitions will fail at this point, by which time the > data of the 1001 partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37217) Dynamic partitions should fail quickly when writing to external tables to prevent data deletion
dzcxzl created SPARK-37217: -- Summary: Dynamic partitions should fail quickly when writing to external tables to prevent data deletion Key: SPARK-37217 URL: https://issues.apache.org/jira/browse/SPARK-37217 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: dzcxzl [SPARK-29295|https://issues.apache.org/jira/browse/SPARK-29295] introduces a mechanism whereby writes to external tables use dynamic partitioning, and the data in the target partitions is deleted first. Assuming that 1001 partitions are written, the data of the 1001 partitions will be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by default, loadDynamicPartitions will fail at this point, by which time the data of the 1001 partitions has already been deleted. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
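The proposed early check amounts to comparing the number of dynamic partitions against the Hive limit before any target-partition data is deleted. A minimal sketch (names are illustrative):
{code:scala}
// Sketch of the early check: fail fast, before deleting any existing
// partition data, when the write would exceed the Hive limit anyway.
def checkDynamicPartitionLimit(numPartitionsToWrite: Int, maxDynamicPartitions: Int): Unit = {
  if (numPartitionsToWrite > maxDynamicPartitions) {
    throw new IllegalArgumentException(
      s"Writing $numPartitionsToWrite dynamic partitions exceeds " +
      s"hive.exec.max.dynamic.partitions=$maxDynamicPartitions; " +
      "aborting before any existing partition data is deleted.")
  }
}
{code}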
[jira] [Created] (SPARK-36799) Pass queryExecution name in CLI when only select query
dzcxzl created SPARK-36799: -- Summary: Pass queryExecution name in CLI when only select query Key: SPARK-36799 URL: https://issues.apache.org/jira/browse/SPARK-36799 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: dzcxzl Currently, in the spark-sql CLI, QueryExecutionListener receives commands but not select queries, because the queryExecution name is not passed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36616) Unrecognized connection property 'url' when using Presto JDBC
[ https://issues.apache.org/jira/browse/SPARK-36616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407128#comment-17407128 ] dzcxzl commented on SPARK-36616: You can use the JdbcConnectionProvider interface provided by SPARK-32001 to create a jdbc connection. > Unrecognized connection property 'url' when using Presto JDBC > - > > Key: SPARK-36616 > URL: https://issues.apache.org/jira/browse/SPARK-36616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: Rajkumar Gunasekaran >Priority: Blocker > > Hi, Here is my spark sql code, where I am trying to read a presto table > based on this guide; > [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html] > {code:scala} > val df = spark.read > .format("jdbc") > .option("driver", "com.facebook.presto.jdbc.PrestoDriver") > .option("url", "jdbc:presto://localhost:8889/mycatalog") > .option("query", "select * from mydb.mytable limit 1") > .option("user", "myuserid") > .load() > {code} > > I am getting the following exception: *_unrecognized connection property > 'url'_* > {code:java} > Exception in thread "main" java.sql.SQLException: Unrecognized connection > property 'url' > at > com.facebook.presto.jdbc.PrestoDriverUri.validateConnectionProperties(PrestoDriverUri.java:345) > at com.facebook.presto.jdbc.PrestoDriverUri.(PrestoDriverUri.java:102) > at com.facebook.presto.jdbc.PrestoDriverUri.(PrestoDriverUri.java:92) > at com.facebook.presto.jdbc.PrestoDriver.connect(PrestoDriver.java:87) > at > org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49) > at > org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226) > at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:341) > > {code} > Seems like this issue is related to > [https://github.com/prestodb/presto/issues/9254] where the property `url` is > not a recognized property in Presto and looks like the fix needs to be done > on the Spark side? > Our development is blocked on this exception and would appreciate any > guidance. Thanks! > PS: > presto-jdbc version: 0.245 / 0.260 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
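To make the suggested workaround concrete: the SPARK-32001 developer API lets a custom provider build the connection itself and drop the "url" entry that the Presto driver rejects. A sketch against the Spark 3.1 API (which properties Presto tolerates is an assumption here):
{code:scala}
import java.sql.{Connection, Driver}
import java.util.Properties
import org.apache.spark.sql.jdbc.JdbcConnectionProvider

// Registered via META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider.
class PrestoConnectionProvider extends JdbcConnectionProvider {
  override val name: String = "presto"

  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.getOrElse("url", "").startsWith("jdbc:presto:")

  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    val props = new Properties()
    // Forward only properties the Presto driver recognizes; in particular,
    // never forward Spark's own "url" option as a connection property.
    options.filterKeys(Set("user", "password")).foreach { case (k, v) =>
      props.setProperty(k, v)
    }
    driver.connect(options("url"), props)
  }

  override def modifiesSecurityContext(driver: Driver, options: Map[String, String]): Boolean =
    false
}
{code}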
[jira] [Updated] (SPARK-36550) Propagation cause when UDF reflection fails
[ https://issues.apache.org/jira/browse/SPARK-36550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-36550: --- Description: Currently, when UDF reflection fails, an InvocationTargetException is thrown, which does not expose the specific cause. {code:java} Error in query: No handler for Hive UDF 'XXX': java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) {code} was:Currently, when UDF reflection fails, an InvocationTargetException is thrown, which does not expose the specific cause. > Propagation cause when UDF reflection fails > --- > > Key: SPARK-36550 > URL: https://issues.apache.org/jira/browse/SPARK-36550 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: dzcxzl >Priority: Trivial > > Currently, when UDF reflection fails, an InvocationTargetException is thrown, > which does not expose the specific cause. > {code:java} > Error in query: No handler for Hive UDF 'XXX': > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36550) Propagation cause when UDF reflection fails
dzcxzl created SPARK-36550: -- Summary: Propagation cause when UDF reflection fails Key: SPARK-36550 URL: https://issues.apache.org/jira/browse/SPARK-36550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: dzcxzl Currently, when UDF reflection fails, an InvocationTargetException is thrown, which does not expose the specific cause. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36451) Ivy skips looking for source and doc pom
[ https://issues.apache.org/jira/browse/SPARK-36451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-36451: --- Description: Since SPARK-35863 upgraded Ivy to 2.5.0, Ivy supports skipping the lookup of the source and javadoc poms, but at present the remote repo is still queried. org.apache.ivy.plugins.parser.m2.PomModuleDescriptorParser#addSourcesAndJavadocArtifactsIfPresent {code:java} boolean sourcesLookup = !"false" .equals(ivySettings.getVariable("ivy.maven.lookup.sources")); boolean javadocLookup = !"false" .equals(ivySettings.getVariable("ivy.maven.lookup.javadoc")); if (!sourcesLookup && !javadocLookup) { Message.debug("Sources and javadocs lookup disabled"); return; } {code} was:Since SPARK-35863 upgraded Ivy to 2.5.0, Ivy supports skipping the lookup of the source and javadoc poms, but at present the remote repo is still queried. > Ivy skips looking for source and doc pom > > > Key: SPARK-36451 > URL: https://issues.apache.org/jira/browse/SPARK-36451 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 3.2.0 >Reporter: dzcxzl >Priority: Trivial > > Since SPARK-35863 upgraded Ivy to 2.5.0, Ivy supports skipping the lookup of > the source and javadoc poms, but at present the remote repo is still queried. > > org.apache.ivy.plugins.parser.m2.PomModuleDescriptorParser#addSourcesAndJavadocArtifactsIfPresent > {code:java} > boolean sourcesLookup = !"false" > .equals(ivySettings.getVariable("ivy.maven.lookup.sources")); > boolean javadocLookup = !"false" > .equals(ivySettings.getVariable("ivy.maven.lookup.javadoc")); > if (!sourcesLookup && !javadocLookup) { > Message.debug("Sources and javadocs lookup disabled"); > return; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36451) Ivy skips looking for source and doc pom
dzcxzl created SPARK-36451: -- Summary: Ivy skips looking for source and doc pom Key: SPARK-36451 URL: https://issues.apache.org/jira/browse/SPARK-36451 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 3.2.0 Reporter: dzcxzl Since SPARK-35863 upgraded Ivy to 2.5.0, Ivy supports skipping the lookup of the source and javadoc poms, but at present the remote repo is still queried. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
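In terms of the Ivy 2.5.0 API quoted above, enabling the skip is just two settings variables; where Spark sets them is the substance of the patch and is not shown here:
{code:scala}
import org.apache.ivy.core.settings.IvySettings

// The two variables checked in PomModuleDescriptorParser: setting them to
// "false" disables the extra remote lookups for sources and javadoc poms.
val ivySettings = new IvySettings()
ivySettings.setVariable("ivy.maven.lookup.sources", "false")
ivySettings.setVariable("ivy.maven.lookup.javadoc", "false")
{code}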
[jira] [Updated] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-35437: --- Summary: Use expressions to filter Hive partitions at client side (was: Hive partition filtering client optimization) > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Priority: Minor > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36390) Replace SessionState.close with SessionState.detachSession
dzcxzl created SPARK-36390: -- Summary: Replace SessionState.close with SessionState.detachSession Key: SPARK-36390 URL: https://issues.apache.org/jira/browse/SPARK-36390 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: dzcxzl SPARK-35286 (https://issues.apache.org/jira/browse/SPARK-35286) replaced SessionState.start with SessionState.setCurrentSessionState, but SessionState.close will create a HiveMetaStoreClient, connect to the Hive Meta Store Server, and then load all functions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32467) Avoid encoding URL twice on https redirect
[ https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368005#comment-17368005 ] dzcxzl edited comment on SPARK-32467 at 7/6/21, 1:16 PM: - YARN-3239. WebAppProxy does not support a final tracking url which has query fragments and params. If the Yarn cluster does not use the YARN-3217 YARN-3239 patch, the running spark job still encounters the NPE problem when accessing the task page. Does spark need to do URL decode twice to avoid NPE? was (Author: dzcxzl): YARN-3239. WebAppProxy does not support a final tracking url which has query fragments and params. If the Yarn cluster does not use the YARN-3239 patch, the running spark job still encounters the NPE problem when accessing the task page. Does spark need to do URL decode twice to avoid NPE? > Avoid encoding URL twice on https redirect > -- > > Key: SPARK-32467 > URL: https://issues.apache.org/jira/browse/SPARK-32467 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, on https redirect, the original URL is encoded as an HTTPS URL. > However, the original URL could be encoded already, so that the return result > of method > UriInfo.getQueryParameters will contain encoded keys and values. For example, > a parameter > order[0][dir] will become order%255B0%255D%255Bcolumn%255D after encoded > twice, and the decoded > key in the result of UriInfo.getQueryParameters will be > order%5B0%5D%5Bcolumn%5D. > To fix the problem, we try decoding the query parameters before encoding it. > This is to make sure we encode the URL exactly once. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34632) Can we create 'SessionState' with a username in 'HiveClientImpl'
[ https://issues.apache.org/jira/browse/SPARK-34632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373410#comment-17373410 ] dzcxzl commented on SPARK-34632: You can use the default Authenticator to get the username through ugi. hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator > Can we create 'SessionState' with a username in 'HiveClientImpl' > > > Key: SPARK-34632 > URL: https://issues.apache.org/jira/browse/SPARK-34632 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: HonglunChen >Priority: Minor > > [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L165] > Like this: > val state = new SessionState(hiveConf, userName) > We can then easily use the Hive Authorization through the user information in > the 'SessionState'. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35913) Create hive permanent function with owner name
dzcxzl created SPARK-35913: -- Summary: Create hive permanent function with owner name Key: SPARK-35913 URL: https://issues.apache.org/jira/browse/SPARK-35913 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: dzcxzl Currently, creating a Hive permanent function sets no owner name; the value is null:
{code:java}
private def toHiveFunction(f: CatalogFunction, db: String): HiveFunction = {
  val resourceUris = f.resources.map { resource =>
    new ResourceUri(ResourceType.valueOf(
      resource.resourceType.resourceType.toUpperCase(Locale.ROOT)), resource.uri)
  }
  new HiveFunction(
    f.identifier.funcName,
    db,
    f.className,
    null,
    PrincipalType.USER,
    (System.currentTimeMillis / 1000).toInt,
    FunctionType.JAVA,
    resourceUris.asJava)
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
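A hedged sketch of the direction the summary implies (the assumption here is that the owner comes from the current UGI user; this is not the merged patch):

{code:scala}
import org.apache.hadoop.hive.metastore.api.{Function => HiveFunction, FunctionType, PrincipalType}
import org.apache.hadoop.security.UserGroupInformation

// Same construction as above, but with the session user as the owner
// instead of null (f, db and resourceUris as in the snippet above):
def toHiveFunctionWithOwner(): Unit = {
  val owner = UserGroupInformation.getCurrentUser.getShortUserName
  new HiveFunction(
    f.identifier.funcName,
    db,
    f.className,
    owner, // previously null
    PrincipalType.USER,
    (System.currentTimeMillis / 1000).toInt,
    FunctionType.JAVA,
    resourceUris.asJava)
}
{code}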
[jira] [Commented] (SPARK-32467) Avoid encoding URL twice on https redirect
[ https://issues.apache.org/jira/browse/SPARK-32467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368005#comment-17368005 ] dzcxzl commented on SPARK-32467: YARN-3239. WebAppProxy does not support a final tracking url which has query fragments and params. If the YARN cluster does not have the YARN-3239 patch, a running Spark job still encounters the NPE problem when accessing the task page. Does Spark need to URL-decode twice to avoid the NPE? > Avoid encoding URL twice on https redirect > -- > > Key: SPARK-32467 > URL: https://issues.apache.org/jira/browse/SPARK-32467 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, on https redirect, the original URL is encoded as an HTTPS URL. > However, the original URL could already be encoded, so the result of > UriInfo.getQueryParameters will contain encoded keys and values. For example, > a parameter order[0][dir] will become order%255B0%255D%255Bdir%255D after being > encoded twice, and the decoded key in the result of UriInfo.getQueryParameters > will be order%5B0%5D%5Bdir%5D. > To fix the problem, we try decoding the query parameters before encoding them. > This makes sure we encode the URL exactly once. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35437) Hive partition filtering client optimization
dzcxzl created SPARK-35437: -- Summary: Hive partition filtering client optimization Key: SPARK-35437 URL: https://issues.apache.org/jira/browse/SPARK-35437 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.1 Reporter: dzcxzl When a table has a lot of partitions and the filter cannot be evaluated on the MetaStore Server, we currently fetch all the partition details and filter them on the client side. This is slow and puts a lot of pressure on the MetaStore Server. Instead, we can first pull only the partition names, filter them by the expressions, and then obtain detailed information for just the matching partitions from the MetaStore Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
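A rough sketch of that strategy, assuming the Hive IMetaStoreClient API (listPartitionNames / getPartitionsByNames); the predicate helper is hypothetical and stands in for the real expression evaluation:

{code:scala}
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.IMetaStoreClient
import org.apache.hadoop.hive.metastore.api.Partition

// Hypothetical predicate: evaluates the pushed-down filter expressions
// against a partition name such as "dt=2021-05-18/hour=03".
def matchesFilters(partitionName: String): Boolean = ???

def prunedPartitions(client: IMetaStoreClient,
                     db: String,
                     table: String): java.util.List[Partition] = {
  // Step 1: fetch only the partition names (cheap on the MetaStore side;
  // -1 means no limit).
  val names = client.listPartitionNames(db, table, (-1).toShort).asScala
  // Step 2: filter by the expressions on the client.
  val matched = names.filter(matchesFilters)
  // Step 3: fetch full metadata only for the surviving partitions.
  client.getPartitionsByNames(db, table, matched.asJava)
}
{code}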
[jira] [Comment Edited] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
[ https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265724#comment-17265724 ] dzcxzl edited comment on SPARK-33790 at 1/15/21, 4:28 PM: -- [https://github.com/scala/bug/issues/10436] was (Author: dzcxzl): Thread stack when it is not working. PID 117049 0x1c939 [^top.png] [^jstack.png] [https://github.com/scala/bug/issues/10436] > Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader > > > Key: SPARK-33790 > URL: https://issues.apache.org/jira/browse/SPARK-33790 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 3.0.2, 3.2.0, 3.1.1 > > > FsHistoryProvider#checkForLogs already has a FileStatus when constructing > SingleFileEventLogFileReader, so there is no need to get the FileStatus > again in SingleFileEventLogFileReader#fileSizeForLastIndex. > This can eliminate a lot of RPC calls and improve the speed of the history > server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
[ https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265724#comment-17265724 ] dzcxzl edited comment on SPARK-33790 at 1/15/21, 4:27 PM: -- Thread stack when it is not working. PID 117049 0x1c939 [^top.png] [^jstack.png] [https://github.com/scala/bug/issues/10436] was (Author: dzcxzl): Thread stack when it is not working. PID 117049 0x1c939 !top.png! !jstack.png! [https://github.com/scala/bug/issues/10436] > Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader > > > Key: SPARK-33790 > URL: https://issues.apache.org/jira/browse/SPARK-33790 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 3.0.2, 3.2.0, 3.1.1 > > > FsHistoryProvider#checkForLogs already has a FileStatus when constructing > SingleFileEventLogFileReader, so there is no need to get the FileStatus > again in SingleFileEventLogFileReader#fileSizeForLastIndex. > This can eliminate a lot of RPC calls and improve the speed of the history > server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
[ https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265724#comment-17265724 ] dzcxzl edited comment on SPARK-33790 at 1/15/21, 4:26 PM: -- Thread stack when it is not working. PID 117049 0x1c939 !top.png! !jstack.png! [https://github.com/scala/bug/issues/10436] was (Author: dzcxzl): Thread stack when it is not working. PID 117049 0x1c939 !top.png! !jstack.png! [https://github.com/scala/bug/issues/10436] > Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader > > > Key: SPARK-33790 > URL: https://issues.apache.org/jira/browse/SPARK-33790 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 3.0.2, 3.2.0, 3.1.1 > > > FsHistoryProvider#checkForLogs already has a FileStatus when constructing > SingleFileEventLogFileReader, so there is no need to get the FileStatus > again in SingleFileEventLogFileReader#fileSizeForLastIndex. > This can eliminate a lot of RPC calls and improve the speed of the history > server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
[ https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265724#comment-17265724 ] dzcxzl edited comment on SPARK-33790 at 1/15/21, 4:25 PM: -- Thread stack when it is not working. PID 117049 0x1c939 !top.png! !jstack.png! [https://github.com/scala/bug/issues/10436] was (Author: dzcxzl): Thread stack when it is not working !http://git.dev.sh.ctripcorp.com/framework-di/spark-2.2.0/uploads/9cfa9662f563ac64f77f4d4ee6fd9243/image.png! [https://github.com/scala/bug/issues/10436] > Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader > > > Key: SPARK-33790 > URL: https://issues.apache.org/jira/browse/SPARK-33790 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 3.0.2, 3.2.0, 3.1.1 > > > FsHistoryProvider#checkForLogs already has a FileStatus when constructing > SingleFileEventLogFileReader, so there is no need to get the FileStatus > again in SingleFileEventLogFileReader#fileSizeForLastIndex. > This can eliminate a lot of RPC calls and improve the speed of the history > server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
[ https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265812#comment-17265812 ] dzcxzl commented on SPARK-33790: OK, I opened a JIRA: [SPARK-34125|https://issues.apache.org/jira/browse/SPARK-34125] > Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader > > > Key: SPARK-33790 > URL: https://issues.apache.org/jira/browse/SPARK-33790 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 3.2.0 > > > FsHistoryProvider#checkForLogs already has a FileStatus when constructing > SingleFileEventLogFileReader, so there is no need to get the FileStatus > again in SingleFileEventLogFileReader#fileSizeForLastIndex. > This can eliminate a lot of RPC calls and improve the speed of the history > server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34125) Make EventLoggingListener.codecMap thread-safe
[ https://issues.apache.org/jira/browse/SPARK-34125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-34125: --- Description: In the 2.x history server, EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe. This can cause the history server to suddenly get stuck and stop working. In 3.x, this was changed to EventLogFileReader.codecMap of type ConcurrentHashMap, so the problem does not exist there. (-SPARK-28869-) PID 117049 0x1c939 !top.png! !jstack.png! was: In the 2.x history server, EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe. This can cause the history server to suddenly get stuck and stop working. In 3.x, this was changed to EventLogFileReader.codecMap of type ConcurrentHashMap, so the problem does not exist there. ([SPARK-28869|https://issues.apache.org/jira/browse/SPARK-28869]) PID 117049 0x1c939 !top.png! !jstack.png! > Make EventLoggingListener.codecMap thread-safe > -- > > Key: SPARK-34125 > URL: https://issues.apache.org/jira/browse/SPARK-34125 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: dzcxzl >Priority: Trivial > Attachments: jstack.png, top.png > > > In the 2.x history server, > EventLoggingListener.codecMap is of type mutable.HashMap, which is not > thread-safe. > This can cause the history server to suddenly get stuck and stop working. > In 3.x, this was changed to EventLogFileReader.codecMap of type > ConcurrentHashMap, so the problem does not exist there. (-SPARK-28869-) > PID 117049 0x1c939 > !top.png! > > !jstack.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
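A compact illustration of the race described above, as a sketch with String values standing in for the real codec type (this is not the actual Spark source):

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

object CodecMapSketch {
  // 2.x style: mutable.HashMap is not thread-safe. Concurrent updates from
  // several replay threads can corrupt the internal table and make readers
  // spin forever, which matches the "stuck history server" symptom.
  val unsafe: mutable.HashMap[String, String] = mutable.HashMap.empty

  // 3.x style: ConcurrentHashMap tolerates concurrent access; computeIfAbsent
  // performs the atomic get-or-create that the codec cache needs.
  val safe = new ConcurrentHashMap[String, String]()
  safe.computeIfAbsent("lz4", shortName => s"codec-for-$shortName")
}
{code}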
[jira] [Updated] (SPARK-34125) Make EventLoggingListener.codecMap thread-safe
[ https://issues.apache.org/jira/browse/SPARK-34125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-34125: --- Description: In the 2.x history server, EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe. This can cause the history server to suddenly get stuck and stop working. In 3.x, this was changed to EventLogFileReader.codecMap of type ConcurrentHashMap, so the problem does not exist there. ([SPARK-28869|https://issues.apache.org/jira/browse/SPARK-28869]) PID 117049 0x1c939 !top.png! !jstack.png! was: In the 2.x history server, EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe. This can cause the history server to suddenly get stuck and stop working. In 3.x, this was changed to EventLogFileReader.codecMap of type ConcurrentHashMap, so the problem does not exist there. PID 117049 0x1c939 !top.png! !jstack.png! > Make EventLoggingListener.codecMap thread-safe > -- > > Key: SPARK-34125 > URL: https://issues.apache.org/jira/browse/SPARK-34125 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: dzcxzl >Priority: Trivial > Attachments: jstack.png, top.png > > > In the 2.x history server, > EventLoggingListener.codecMap is of type mutable.HashMap, which is not > thread-safe. > This can cause the history server to suddenly get stuck and stop working. > In 3.x, this was changed to EventLogFileReader.codecMap of type > ConcurrentHashMap, so the problem does not exist there. > ([SPARK-28869|https://issues.apache.org/jira/browse/SPARK-28869]) > PID 117049 0x1c939 > !top.png! > > !jstack.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34125) Make EventLoggingListener.codecMap thread-safe
[ https://issues.apache.org/jira/browse/SPARK-34125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-34125: --- Description: In the 2.x history server, EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe. This can cause the history server to suddenly get stuck and stop working. In 3.x, this was changed to EventLogFileReader.codecMap of type ConcurrentHashMap, so the problem does not exist there. PID 117049 0x1c939 !top.png! !jstack.png! was: In the 2.x history server, EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe. This can cause the history server to suddenly get stuck and stop working. In 3.x, this was changed to EventLogFileReader.codecMap of type ConcurrentHashMap, so the problem does not exist there. > Make EventLoggingListener.codecMap thread-safe > -- > > Key: SPARK-34125 > URL: https://issues.apache.org/jira/browse/SPARK-34125 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: dzcxzl >Priority: Trivial > Attachments: jstack.png, top.png > > > In the 2.x history server, > EventLoggingListener.codecMap is of type mutable.HashMap, which is not > thread-safe. > This can cause the history server to suddenly get stuck and stop working. > In 3.x, this was changed to EventLogFileReader.codecMap of type > ConcurrentHashMap, so the problem does not exist there. > PID 117049 0x1c939 > !top.png! > > !jstack.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34125) Make EventLoggingListener.codecMap thread-safe
[ https://issues.apache.org/jira/browse/SPARK-34125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-34125: --- Attachment: jstack.png > Make EventLoggingListener.codecMap thread-safe > -- > > Key: SPARK-34125 > URL: https://issues.apache.org/jira/browse/SPARK-34125 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: dzcxzl >Priority: Trivial > Attachments: jstack.png, top.png > > > In the 2.x history server, > EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe. > This can cause the history server to suddenly get stuck and stop working. > In 3.x, this was changed to EventLogFileReader.codecMap of type > ConcurrentHashMap, so the problem does not exist there. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34125) Make EventLoggingListener.codecMap thread-safe
dzcxzl created SPARK-34125: -- Summary: Make EventLoggingListener.codecMap thread-safe Key: SPARK-34125 URL: https://issues.apache.org/jira/browse/SPARK-34125 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.7 Reporter: dzcxzl Attachments: jstack.png, top.png In the 2.x history server, EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe. This can cause the history server to suddenly get stuck and stop working. In 3.x, this was changed to EventLogFileReader.codecMap of type ConcurrentHashMap, so the problem does not exist there. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34125) Make EventLoggingListener.codecMap thread-safe
[ https://issues.apache.org/jira/browse/SPARK-34125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-34125: --- Attachment: top.png > Make EventLoggingListener.codecMap thread-safe > -- > > Key: SPARK-34125 > URL: https://issues.apache.org/jira/browse/SPARK-34125 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7 >Reporter: dzcxzl >Priority: Trivial > Attachments: jstack.png, top.png > > > In the 2.x history server, > EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe. > This can cause the history server to suddenly get stuck and stop working. > In 3.x, this was changed to EventLogFileReader.codecMap of type > ConcurrentHashMap, so the problem does not exist there. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
[ https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265724#comment-17265724 ] dzcxzl commented on SPARK-33790: Thread stack when it is not working: !http://git.dev.sh.ctripcorp.com/framework-di/spark-2.2.0/uploads/9cfa9662f563ac64f77f4d4ee6fd9243/image.png! [https://github.com/scala/bug/issues/10436] > Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader > > > Key: SPARK-33790 > URL: https://issues.apache.org/jira/browse/SPARK-33790 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 3.2.0 > > > FsHistoryProvider#checkForLogs already has a FileStatus when constructing > SingleFileEventLogFileReader, so there is no need to get the FileStatus > again in SingleFileEventLogFileReader#fileSizeForLastIndex. > This can eliminate a lot of RPC calls and improve the speed of the history > server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
[ https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265691#comment-17265691 ] dzcxzl commented on SPARK-33790: This is indeed a performance regression problem. The following is my case: in 2.x, EventLoggingListener.codecMap is of type mutable.HashMap, which is not thread-safe and may hang, so the 2.x history server may stop working. In 3.x, this was changed to EventLogFileReader.codecMap of type ConcurrentHashMap. I tried the 3.x version and found that a scan round slowed down a lot: 7 min rose to about 23 min. In addition, do I need to fix the thread-safety issue in version 2.x? [~kabhwan] > Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader > > > Key: SPARK-33790 > URL: https://issues.apache.org/jira/browse/SPARK-33790 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Critical > Fix For: 3.2.0 > > > FsHistoryProvider#checkForLogs already has a FileStatus when constructing > SingleFileEventLogFileReader, so there is no need to get the FileStatus > again in SingleFileEventLogFileReader#fileSizeForLastIndex. > This can eliminate a lot of RPC calls and improve the speed of the history > server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
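For context, a minimal sketch of the optimization this ticket describes, with hypothetical names (the real classes live in Spark's history-server code): reuse the FileStatus that checkForLogs already fetched during its directory listing instead of issuing another getFileStatus RPC per event log file.

{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical reader mirroring SingleFileEventLogFileReader: the optional
// cached status comes from the listing that FsHistoryProvider#checkForLogs
// already performed, so the extra getFileStatus RPC is skipped when present.
class EventLogReaderSketch(
    fs: FileSystem,
    path: Path,
    cachedStatus: Option[FileStatus] = None) {

  private lazy val status: FileStatus =
    cachedStatus.getOrElse(fs.getFileStatus(path)) // RPC only on cache miss

  def fileSizeForLastIndex: Long = status.getLen
}
{code}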