[jira] [Created] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results
mcdull_zhang created SPARK-46037: Summary: When Left Join build Left, ShuffledHashJoinExec may result in incorrect results Key: SPARK-46037 URL: https://issues.apache.org/jira/browse/SPARK-46037 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: mcdull_zhang When a left join builds its left side and codegen is turned off, ShuffledHashJoinExec may produce incorrect results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
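The report above does not include a reproduction. A minimal sketch of the configuration it describes, assuming a small left table so the left side becomes the build side (the data and join shape here are assumptions, not a confirmed trigger for the bug):

{code:scala}
// Sketch only: disable whole-stage codegen and discourage sort-merge join
// so the planner can choose ShuffledHashJoinExec. With the smaller table on
// the left of a LEFT JOIN, the left side is the candidate build side.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

val left = spark.range(0, 10).withColumnRenamed("id", "k")    // small build side
val right = spark.range(0, 1000).withColumnRenamed("id", "k") // large stream side

left.join(right, Seq("k"), "left").collect()
{code}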
[jira] [Updated] (SPARK-43911) Use toSet to deduplicate the iterator data to prevent the creation of large Array
[ https://issues.apache.org/jira/browse/SPARK-43911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-43911: - Summary: Use toSet to deduplicate the iterator data to prevent the creation of large Array (was: Directly use Set to consume iterator data to deduplicate, thereby reducing memory usage) > Use toSet to deduplicate the iterator data to prevent the creation of large > Array > - > > Key: SPARK-43911 > URL: https://issues.apache.org/jira/browse/SPARK-43911 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: mcdull_zhang >Priority: Minor > > When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for > dynamic partition pruning, it puts all the keys in an Array and then calls > distinct on the Array to remove duplicates. > In general, a broadcast HashedRelation may have many rows, and these keys are > often highly repetitive. Doing so causes the Array to occupy a large amount of > memory (memory that is not managed by the MemoryManager), which may > trigger OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43911) Directly use Set to consume iterator data to deduplicate, thereby reducing memory usage
mcdull_zhang created SPARK-43911: Summary: Directly use Set to consume iterator data to deduplicate, thereby reducing memory usage Key: SPARK-43911 URL: https://issues.apache.org/jira/browse/SPARK-43911 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: mcdull_zhang When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for dynamic partition pruning, it puts all the keys in an Array and then calls distinct on the Array to remove duplicates. In general, a broadcast HashedRelation may have many rows, and these keys are often highly repetitive. Doing so causes the Array to occupy a large amount of memory (memory that is not managed by the MemoryManager), which may trigger OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
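The memory argument is easy to see with a self-contained sketch in plain Scala (hypothetical data, not Spark internals): deduplicating through an intermediate Array materializes every duplicate at once, while converting the iterator to a Set only ever holds the distinct values.

{code:scala}
object DedupSketch {
  def main(args: Array[String]): Unit = {
    // An iterator with a high repetition rate, like the keys described above.
    def keys: Iterator[Int] = Iterator.tabulate(1000000)(_ % 10)

    val viaArray = keys.toArray.distinct // materializes all 1,000,000 elements first
    val viaSet   = keys.toSet            // never holds more than the ~10 distinct keys

    assert(viaArray.sorted.sameElements(viaSet.toArray.sorted))
  }
}
{code}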
[jira] [Created] (SPARK-41361) Invalid call toAttribute on unresolved object exception caused by WidenSetOperationTypes
mcdull_zhang created SPARK-41361: Summary: Invalid call toAttribute on unresolved object exception caused by WidenSetOperationTypes Key: SPARK-41361 URL: https://issues.apache.org/jira/browse/SPARK-41361 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: mcdull_zhang The problem can be reproduced in the following way: {code:java} spark-sql> CREATE OR REPLACE TEMPORARY VIEW t1 AS VALUES (1, 'a'), (2, 'b') tbl(c1, c2); spark-sql> CREATE OR REPLACE TEMPORARY VIEW t2 AS VALUES (1.0, 1), (2.0, 4) tbl(c1, c2); spark-sql> SELECT > TRANSFORM(*) USING 'cat' AS (a) > FROM > ( > SELECT c2 AS c from t2 > UNION > SELECT c2 AS c from t1); Invalid call to toAttribute on unresolved object{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41191) Cache Table is not working while nested caches exist
mcdull_zhang created SPARK-41191: Summary: Cache Table is not working while nested caches exist Key: SPARK-41191 URL: https://issues.apache.org/jira/browse/SPARK-41191 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.1 Reporter: mcdull_zhang For example, the following statements: {code:java} cache table t1 as select a from testData3 group by a; cache table t2 as select a,b from testData2 where a in (select a from t1); select key,value,b from testData t3 join t2 on t3.key=t2.a;{code} The cached t2 is not used in the third statement. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40076) Support number-only column names in ORC data sources when orc impl is hive
mcdull_zhang created SPARK-40076: Summary: Support number-only column names in ORC data sources when orc impl is hive Key: SPARK-40076 URL: https://issues.apache.org/jira/browse/SPARK-40076 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.2 Reporter: mcdull_zhang This problem is similar to SPARK-36663: both occur because *CatalystSqlParser.parseDataType* fails to parse a column name (or nested field) that consists only of numbers. SPARK-36663 solves the problem when spark.sql.orc.impl=native (the default), but when spark.sql.orc.impl=hive it still throws an error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39126) After eliminating join to one side, that side should take advantage of LocalShuffleRead optimization
mcdull_zhang created SPARK-39126: Summary: After eliminating join to one side, that side should take advantage of LocalShuffleRead optimization Key: SPARK-39126 URL: https://issues.apache.org/jira/browse/SPARK-39126 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: mcdull_zhang PropagateEmptyRelation can simplify a Join. For example, if the right side of a LEFT JOIN is empty, the join can be eliminated and replaced by its left side. If the left side contains a shuffle, that shuffle becomes unnecessary and can be optimized with LocalShuffleRead. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38867) Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin codegen
mcdull_zhang created SPARK-38867: Summary: Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin codegen Key: SPARK-38867 URL: https://issues.apache.org/jira/browse/SPARK-38867 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: mcdull_zhang WholeStageCodegenExec is wrapped in BufferedRowIterator, which uses a LinkedList to hold the output of WholeStageCodegenExec. When the parent of SortMergeJoin cannot codegen, SortMergeJoin has to append its output to this LinkedList. SortMergeJoin processes one record of the streamedPlan at a time; if every record in the bufferedPlan matches that record, all of the bufferedPlan's records end up saved in the LinkedList, resulting in OOM. This situation is very common in our internal use, so it would be best to add a configurable guard to the generated code: once the LinkedList holds enough rows, SortMergeJoin should pause and let the parent consume them first (see the sketch below). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
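A rough sketch of such a guard (the class, names, and threshold below are hypothetical; the real change would live in the generated BufferedRowIterator code, not a helper class):

{code:scala}
import java.util.LinkedList
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical bounded buffer: append() reports when the buffer is "full",
// so the produce loop could pause and let the parent consume first.
class BoundedAppendBuffer(threshold: Int = 4096) {
  private val buffer = new LinkedList[InternalRow]()

  /** Returns true when the caller should stop producing and yield. */
  def append(row: InternalRow): Boolean = {
    buffer.add(row)
    buffer.size() >= threshold
  }

  def poll(): InternalRow = buffer.poll()
  def isEmpty: Boolean = buffer.isEmpty
}
{code}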
[jira] [Updated] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-38570: - Description: The return value of Literal.references is an empty AttributeSet, so Literal is mistaken for a partition column. org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: {code:java} val srcInfo: Option[(Expression, LogicalPlan)] = findExpressionAndTrackLineageDown(a, plan) srcInfo.flatMap { case (resExp, l: LogicalRelation) => l.relation match { case fs: HadoopFsRelation => val partitionColumns = AttributeSet( l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver)) // When resExp is a Literal, Literal is considered a partition column. if (resExp.references.subsetOf(partitionColumns)) { return Some(l) } else { None } case _ => None } {code} was: The return value of Literal.references is an empty AttributeSet, so Literal is mistaken for a partition column. org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: {code:java} val srcInfo: Option[(Expression, LogicalPlan)] = findExpressionAndTrackLineageDown(a, plan) srcInfo.flatMap { case (resExp, l: LogicalRelation) => l.relation match { case fs: HadoopFsRelation => val partitionColumns = AttributeSet( l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver)) // When resExp is a Literal, Literal is considered a partition column. if (resExp.references.subsetOf(partitionColumns)) { return Some(l) } else { None } case _ => None } {code} > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
mcdull_zhang created SPARK-38570: Summary: Incorrect DynamicPartitionPruning caused by Literal Key: SPARK-38570 URL: https://issues.apache.org/jira/browse/SPARK-38570 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: mcdull_zhang The return value of Literal.references is an empty AttributeSet, so Literal is mistaken for a partition column. org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: {code:java} val srcInfo: Option[(Expression, LogicalPlan)] = findExpressionAndTrackLineageDown(a, plan) srcInfo.flatMap { case (resExp, l: LogicalRelation) => l.relation match { case fs: HadoopFsRelation => val partitionColumns = AttributeSet( l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver)) // When resExp is a Literal, Literal is considered a partition column. if (resExp.references.subsetOf(partitionColumns)) { return Some(l) } else { None } case _ => None } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
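The subtlety is easy to verify in isolation: Literal.references is an empty AttributeSet, and an empty set is a subset of every set, so the subsetOf check above always passes for a Literal. A minimal sketch (assumes spark-catalyst on the classpath):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Literal}

// A Literal has no attribute references, so its references set is empty...
val refs = Literal(1).references
// ...and the empty set is a subset of anything, even another empty set,
// which is why the partition-column check mistakenly accepts it.
assert(refs.subsetOf(AttributeSet.empty))
{code}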
[jira] [Updated] (SPARK-38542) UnsafeHashedRelation should serialize numKeys out
[ https://issues.apache.org/jira/browse/SPARK-38542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-38542: - Description: At present, UnsafeHashedRelation does not write out numKeys during serialization, so the numKeys of UnsafeHashedRelation obtained by deserialization is equal to 0. The numFields of UnsafeRows returned by UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect data. For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is called. {code:java} val broadcastRelation = child.executeBroadcast[HashedRelation]().value val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) { (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index)) } else { (broadcastRelation.keys(), BoundReference(index, buildKeys(index).dataType, buildKeys(index).nullable)) }{code} was: At present, UnsafeHashedRelation does not write out numKeys during serialization, so the numKeys of UnsafeHashedRelation obtained by deserialization is equal to 0. The numFields of UnsafeRows returned by UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect data. For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is called. {code:java} val broadcastRelation = child.executeBroadcast[HashedRelation]().value val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) { (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index)) } else { (broadcastRelation.keys(), BoundReference(index, buildKeys(index).dataType, buildKeys(index).nullable)) }{code} > UnsafeHashedRelation should serialize numKeys out > - > > Key: SPARK-38542 > URL: https://issues.apache.org/jira/browse/SPARK-38542 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Critical > > At present, UnsafeHashedRelation does not write out numKeys during > serialization, so the numKeys of UnsafeHashedRelation obtained by > deserialization is equal to 0. The numFields of UnsafeRows returned by > UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect > data. > > For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is > called. > {code:java} > val broadcastRelation = child.executeBroadcast[HashedRelation]().value > val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) { > (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index)) > } else { > (broadcastRelation.keys(), > BoundReference(index, buildKeys(index).dataType, > buildKeys(index).nullable)) > }{code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38542) UnsafeHashedRelation should serialize numKeys out
mcdull_zhang created SPARK-38542: Summary: UnsafeHashedRelation should serialize numKeys out Key: SPARK-38542 URL: https://issues.apache.org/jira/browse/SPARK-38542 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: mcdull_zhang At present, UnsafeHashedRelation does not write out numKeys during serialization, so the numKeys of UnsafeHashedRelation obtained by deserialization is equal to 0. The numFields of UnsafeRows returned by UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect data. For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is called. {code:java} val broadcastRelation = child.executeBroadcast[HashedRelation]().value val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) { (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index)) } else { (broadcastRelation.keys(), BoundReference(index, buildKeys(index).dataType, buildKeys(index).nullable)) }{code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37652) Support optimize skewed join through union
[ https://issues.apache.org/jira/browse/SPARK-37652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-37652: - Description: The `OptimizeSkewedJoin` rule only takes effect when the plan has two ShuffleQueryStageExec nodes. With `Union`, that assumption might break. For example, consider the following plans: *scenario 1* {noformat} Union SMJ ShuffleQueryStage ShuffleQueryStage SMJ ShuffleQueryStage ShuffleQueryStage {noformat} *scenario 2* {noformat} Union SMJ ShuffleQueryStage ShuffleQueryStage HashAggregate {noformat} When one or more of the SMJs in the above plans has skewed data, it currently cannot be optimized. It would be better to support partial optimization under Union. was: `OptimizeSkewedJoin` rule will take effect only when the plan has two ShuffleQueryStageExec. With `Union`, it might break the assumption. For example, the following plans {code: none} Union SMJ SMJ {code} > Support optimize skewed join through union > -- > > Key: SPARK-37652 > URL: https://issues.apache.org/jira/browse/SPARK-37652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > The `OptimizeSkewedJoin` rule only takes effect when the plan has two > ShuffleQueryStageExec nodes. > With `Union`, that assumption might break. For example, consider the following plans: > *scenario 1* > {noformat} > Union > SMJ > ShuffleQueryStage > ShuffleQueryStage > SMJ > ShuffleQueryStage > ShuffleQueryStage > {noformat} > *scenario 2* > {noformat} > Union > SMJ > ShuffleQueryStage > ShuffleQueryStage > HashAggregate > {noformat} > When one or more of the SMJs in the above plans has skewed data, it currently > cannot be optimized. > It would be better to support partial optimization under Union. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37652) Support optimize skewed join through union
[ https://issues.apache.org/jira/browse/SPARK-37652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-37652: - Description: `OptimizeSkewedJoin` rule will take effect only when the plan has two ShuffleQueryStageExec. With `Union`, it might break the assumption. For example, the following plans {code: none} Union SMJ SMJ {code} was: `OptimizeSkewedJoin` rule will take effect only when the plan has two ShuffleQueryStageExec. With `Union`, it might break the assumption. For example, the following plans {code:tex} // Some comments here public String getFoo() { return foo; } {code} > Support optimize skewed join through union > -- > > Key: SPARK-37652 > URL: https://issues.apache.org/jira/browse/SPARK-37652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > `OptimizeSkewedJoin` rule will take effect only when the plan has two > ShuffleQueryStageExec. > With `Union`, it might break the assumption. For example, the following plans > {code: none} > Union > SMJ > SMJ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37652) Support optimize skewed join through union
mcdull_zhang created SPARK-37652: Summary: Support optimize skewed join through union Key: SPARK-37652 URL: https://issues.apache.org/jira/browse/SPARK-37652 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: mcdull_zhang `OptimizeSkewedJoin` rule will take effect only when the plan has two ShuffleQueryStageExec. With `Union`, it might break the assumption. For example, the following plans {panel:title=scenes1:} Union SMJ SMJ {panel} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37652) Support optimize skewed join through union
[ https://issues.apache.org/jira/browse/SPARK-37652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-37652: - Description: `OptimizeSkewedJoin` rule will take effect only when the plan has two ShuffleQueryStageExec. With `Union`, it might break the assumption. For example, the following plans {code:tex} // Some comments here public String getFoo() { return foo; } {code} was: `OptimizeSkewedJoin` rule will take effect only when the plan has two ShuffleQueryStageExec. With `Union`, it might break the assumption. For example, the following plans {panel:title=scenes1:} Union SMJ SMJ {panel} > Support optimize skewed join through union > -- > > Key: SPARK-37652 > URL: https://issues.apache.org/jira/browse/SPARK-37652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > `OptimizeSkewedJoin` rule will take effect only when the plan has two > ShuffleQueryStageExec. > With `Union`, it might break the assumption. For example, the following plans > {code:tex} > // Some comments here > public String getFoo() > { > return foo; > } > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37301) ConcurrentModificationException caused by CollectionAccumulator serialization in the heartbeat thread
[ https://issues.apache.org/jira/browse/SPARK-37301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-37301: - Description: In our production environment, you can use the following code to reproduce the problem: {code:scala} val acc = sc.collectionAccumulator[String]("test_acc") sc.parallelize(Array(0)).foreach(_ => { var i = 0 var stop = false val start = System.currentTimeMillis() while (!stop) { acc.add(i.toString) if (i % 1 == 0) { acc.reset() if ((System.currentTimeMillis() - start) / 1000 > 120) { stop = true } } i = i + 1 } }) sc.stop() {code} This code can make the executor fail to send heartbeats, even more than the default 60 times, and then the executor exits. {noformat} 21/11/11 21:00:23 WARN Executor: Issue communicating with driver in heartbeater org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103) at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1007) at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019) at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Caused by: java.util.ConcurrentModificationException at java.util.ArrayList.writeObject(ArrayList.java:766) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) at org.apache.spark.rpc.netty.RequestMessage.serialize(NettyRpcEnv.scala:601) at org.apache.spark.rpc.netty.NettyRpcEnv.askAbortable(NettyRpcEnv.scala:244) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.askAbortable(NettyRpcEnv.scala:555) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:559)
[jira] [Updated] (SPARK-37301) ConcurrentModificationException caused by CollectionAccumulator serialization in the heartbeat thread
[ https://issues.apache.org/jira/browse/SPARK-37301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-37301: - Description: In our production environment, you can use the following code to reproduce the problem: {code:scala} val acc = sc.collectionAccumulator[String]("test_acc") sc.parallelize(Array(0)).foreach(_ => { var i = 0 var stop = false val start = System.currentTimeMillis() while (!stop) { acc.add(i.toString) if (i % 1 == 0) { acc.reset() if ((System.currentTimeMillis() - start) / 1000 > 120) { stop = true } } i = i + 1 } }) sc.stop() {code} This code can make the executor fail to send heartbeats, even more than the default 60 times, and then the executor exits. ```tex 21/11/11 21:00:23 WARN Executor: Issue communicating with driver in heartbeater org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103) at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1007) at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019) at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Caused by: java.util.ConcurrentModificationException at java.util.ArrayList.writeObject(ArrayList.java:766) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) at org.apache.spark.rpc.netty.RequestMessage.serialize(NettyRpcEnv.scala:601) at org.apache.spark.rpc.netty.NettyRpcEnv.askAbortable(NettyRpcEnv.scala:244) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.askAbortable(NettyRpcEnv.scala:555) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:559) a
[jira] [Created] (SPARK-37301) ConcurrentModificationException caused by CollectionAccumulator serialization in the heartbeat thread
mcdull_zhang created SPARK-37301: Summary: ConcurrentModificationException caused by CollectionAccumulator serialization in the heartbeat thread Key: SPARK-37301 URL: https://issues.apache.org/jira/browse/SPARK-37301 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: mcdull_zhang In our production environment, you can use the following code to reproduce the problem: ```scala val acc = sc.collectionAccumulator[String]("test_acc") sc.parallelize(Array(0)).foreach(_ => { var i = 0 var stop = false val start = System.currentTimeMillis() while (!stop) { acc.add(i.toString) if (i % 1 == 0) { acc.reset() if ((System.currentTimeMillis() - start) / 1000 > 120) { stop = true } } i = i + 1 } }) sc.stop() ``` This code can make the executor fail to send heartbeats, even more than the default 60 times, and then the executor exits. ```tex 21/11/11 21:00:23 WARN Executor: Issue communicating with driver in heartbeater org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103) at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1007) at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019) at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Caused by: java.util.ConcurrentModificationException at java.util.ArrayList.writeObject(ArrayList.java:766) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) at org.apache.spark.rpc.netty.RequestMessage.serialize(NettyRpcEnv.scala:601) at org.apache.spark.rpc.
[jira] [Commented] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409492#comment-17409492 ] mcdull_zhang commented on SPARK-36663: -- cc [~hyukjin.kwon] [~cloud_fan] > When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: mcdull_zhang >Priority: Critical > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that Spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. > {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = { > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I currently think of: > # Modify the syntax analysis of SparkSQL to identify this kind of schema > # The TypeDescription.toString method should add the quote symbol to the > numeric column name, because the following syntax is supported: > {quote}CatalystSqlParser.parseDataType("`100`:bigint") > {quote} > But currently TypeDescription does not support changing the UNQUOTED_NAMES > variable; should we first submit a PR to the orc project to support the > configuration of this variable? > !image-2021-09-03-20-56-28-846.png! > > What do Spark members think about this issue? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-36663: - Description: You can use the following methods to reproduce the problem: {quote}val path = "file:///tmp/test_orc" spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) spark.read.orc(path) {quote} The error message is like this: {quote}org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '100' expecting {'ADD', 'AFTER' == SQL == struct<100:bigint> ---^^^ {quote} The error is actually issued by this line of code: {quote}CatalystSqlParser.parseDataType("100:bigint") {quote} The specific background is that Spark calls the above code in the process of converting the schema of the orc file into the catalyst schema. {quote}// code in OrcUtils private def toCatalystSchema(schema: TypeDescription): StructType = { CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) }{quote} There are two solutions I currently think of: # Modify the syntax analysis of SparkSQL to identify this kind of schema # The TypeDescription.toString method should add the quote symbol to the numeric column name, because the following syntax is supported: {quote}CatalystSqlParser.parseDataType("`100`:bigint") {quote} But currently TypeDescription does not support changing the UNQUOTED_NAMES variable; should we first submit a PR to the orc project to support the configuration of this variable? !image-2021-09-03-20-56-28-846.png! What do Spark members think about this issue? was: You can use the following methods to reproduce the problem: {quote}val path = "file:///tmp/test_orc" spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) spark.read.orc(path) {quote} The error message is like this: {quote}org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '100' expecting {'ADD', 'AFTER' == SQL == struct<100:bigint> ---^^^ {quote} The error is actually issued by this line of code: {quote}CatalystSqlParser.parseDataType("100:bigint") {quote} The specific background is that spark calls the above code in the process of converting the schema of the orc file into the catalyst schema. {quote}// code in OrcUtils private def toCatalystSchema(schema: TypeDescription): StructType = { CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) }{quote} There are two solutions I currently think of: # Modify the syntax analysis of SparkSQL to identify this kind of schema # The TypeDescription.toString method should add the quote symbol to the numeric column name, because the following syntax is supported: {quote}CatalystSqlParser.parseDataType("`100`:bigint"){quote} But currently TypeDescription does not support changing the UNQUOTED_NAMES variable, should we first submit a pr to the orc project to support the configuration of this variable? !image-2021-09-03-20-53-35-626.png! How do spark members think about this issue? 
> When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: mcdull_zhang >Priority: Critical > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that Spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. > {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = { > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I currently think of: > # Modify the syntax analysis of SparkSQL to identify this kind of schema > # The TypeDescription.toString method should add the quote symbol to the > numeric column name, because the following syntax is supp
[jira] [Updated] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
[ https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-36663: - Attachment: image-2021-09-03-20-56-28-846.png > When the existing field name is a number, an error will be reported when > reading the orc file > - > > Key: SPARK-36663 > URL: https://issues.apache.org/jira/browse/SPARK-36663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2 >Reporter: mcdull_zhang >Priority: Critical > Attachments: image-2021-09-03-20-56-28-846.png > > > You can use the following methods to reproduce the problem: > {quote}val path = "file:///tmp/test_orc" > spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) > spark.read.orc(path) > {quote} > The error message is like this: > {quote}org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '100' expecting {'ADD', 'AFTER' > == SQL == > struct<100:bigint> > ---^^^ > {quote} > The error is actually issued by this line of code: > {quote}CatalystSqlParser.parseDataType("100:bigint") > {quote} > > The specific background is that spark calls the above code in the process of > converting the schema of the orc file into the catalyst schema. > {quote}// code in OrcUtils > private def toCatalystSchema(schema: TypeDescription): StructType = { > > CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) > }{quote} > There are two solutions I currently think of: > # Modify the syntax analysis of SparkSQL to identify this kind of schema > # The TypeDescription.toString method should add the quote symbol to the > numeric column name, because the following syntax is supported: > {quote}CatalystSqlParser.parseDataType("`100`:bigint"){quote} > But currently TypeDescription does not support changing the UNQUOTED_NAMES > variable, should we first submit a pr to the orc project to support the > configuration of this variable? > !image-2021-09-03-20-53-35-626.png! > > How do spark members think about this issue? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file
mcdull_zhang created SPARK-36663: Summary: When the existing field name is a number, an error will be reported when reading the orc file Key: SPARK-36663 URL: https://issues.apache.org/jira/browse/SPARK-36663 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2, 3.0.3 Reporter: mcdull_zhang You can use the following methods to reproduce the problem: {quote}val path = "file:///tmp/test_orc" spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path) spark.read.orc(path) {quote} The error message is like this: {quote}org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '100' expecting {'ADD', 'AFTER' == SQL == struct<100:bigint> ---^^^ {quote} The error is actually issued by this line of code: {quote}CatalystSqlParser.parseDataType("100:bigint") {quote} The specific background is that spark calls the above code in the process of converting the schema of the orc file into the catalyst schema. {quote}// code in OrcUtils private def toCatalystSchema(schema: TypeDescription): StructType = { CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType]) }{quote} There are two solutions I currently think of: # Modify the syntax analysis of SparkSQL to identify this kind of schema # The TypeDescription.toString method should add the quote symbol to the numeric column name, because the following syntax is supported: {quote}CatalystSqlParser.parseDataType("`100`:bigint"){quote} But currently TypeDescription does not support changing the UNQUOTED_NAMES variable, should we first submit a pr to the orc project to support the configuration of this variable? !image-2021-09-03-20-53-35-626.png! How do spark members think about this issue? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
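The quoting behavior can be checked against the parser directly; a small sketch (the struct<...> wrapper is added here to form a complete type string):

{code:scala}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Throws ParseException: a bare numeric field name is not a valid identifier.
// CatalystSqlParser.parseDataType("struct<100:bigint>")

// Parses fine once the numeric name is backtick-quoted.
val parsed = CatalystSqlParser.parseDataType("struct<`100`:bigint>")
{code}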
[jira] [Created] (SPARK-36612) Support left outer join build left or right outer join build right in shuffled hash join
mcdull_zhang created SPARK-36612: Summary: Support left outer join build left or right outer join build right in shuffled hash join Key: SPARK-36612 URL: https://issues.apache.org/jira/browse/SPARK-36612 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: mcdull_zhang Currently Spark SQL does not support building the left side for a left outer join (or the right side for a right outer join). However, in our production environment there are a large number of scenarios where a small table is left-joined to a large table, and the large table often has data skew (which AQE currently can't handle). Inspired by SPARK-32399, we can use a similar idea to implement left outer join build left. I think this would be very valuable; how do members feel about it? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36082) when the right side is small enough to use SingleColumn Null Aware Anti Join
mcdull_zhang created SPARK-36082: Summary: when the right side is small enough to use SingleColumn Null Aware Anti Join Key: SPARK-36082 URL: https://issues.apache.org/jira/browse/SPARK-36082 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0, 3.1.3 Reporter: mcdull_zhang Fix For: 3.2.0 NULL-aware ANTI join (https://issues.apache.org/jira/browse/SPARK-32290) builds the right side into a HashMap. Code in SparkStrategy: {code:java} case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys) => Seq(joins.BroadcastHashJoinExec(leftKeys, rightKeys, LeftAnti, BuildRight, None, planLater(j.left), planLater(j.right), isNullAwareAntiJoin = true)){code} We should add a size condition so that this optimization is only used when the right side is small enough. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
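A hedged sketch of what such a condition might look like in the strategy above (reusing the broadcast threshold as the cutoff is an assumption, not the actual patch):

{code:scala}
// Sketch only: gate the null-aware anti join on the build side's size estimate.
case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys)
    if j.right.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold =>
  Seq(joins.BroadcastHashJoinExec(leftKeys, rightKeys, LeftAnti, BuildRight,
    None, planLater(j.left), planLater(j.right), isNullAwareAntiJoin = true))
{code}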
[jira] [Created] (SPARK-31459) When using the insert overwrite directory syntax, if the target path is an existing file, the final run result is incorrect
mcdull_zhang created SPARK-31459: Summary: When using the insert overwrite directory syntax, if the target path is an existing file, the final run result is incorrect Key: SPARK-31459 URL: https://issues.apache.org/jira/browse/SPARK-31459 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 3 Environment: spark2.4.5 Reporter: mcdull_zhang When using the insert overwrite directory syntax, if the target path is an existing file, the final result is incorrect. At present, Spark does not delete the existing file; after the computation completes, one of the result files is renamed onto the target path. This differs from Hive's behavior: Hive deletes the existing target file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
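A hedged repro sketch of the described setup (the path, data, and file format are hypothetical): create a plain file at the target path, then run INSERT OVERWRITE DIRECTORY against it.

{code:scala}
import java.nio.file.{Files, Paths}

// Create a *file* (not a directory) at the target path.
val target = "/tmp/insert_overwrite_target"
Files.write(Paths.get(target), "pre-existing file".getBytes("UTF-8"))

// Per the report, Spark renames one result file onto the path without
// deleting the existing file first, whereas Hive deletes the target first.
spark.sql(s"INSERT OVERWRITE DIRECTORY '$target' USING parquet SELECT 1")
{code}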