[jira] [Created] (SPARK-46037) When Left Join build Left, ShuffledHashJoinExec may result in incorrect results

2023-11-21 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-46037:


 Summary: When Left Join build Left, ShuffledHashJoinExec may 
result in incorrect results
 Key: SPARK-46037
 URL: https://issues.apache.org/jira/browse/SPARK-46037
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: mcdull_zhang


When a Left Join builds on the left side and codegen is turned off, 
ShuffledHashJoinExec may produce incorrect results.
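A minimal reproduction sketch of the setup (my own construction, not taken from the report), assuming the build-left support for left outer joins added in SPARK-36612: turn whole-stage codegen off and request a shuffled hash join that builds the small left side.
{code:scala}
// Hypothetical repro sketch (assumed, not from the report): disable codegen and
// force a shuffled hash join that builds the left side of a LEFT JOIN.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

val small = spark.range(0, 10).toDF("k")        // left side, requested as the build side
val big   = spark.range(0, 1000000).toDF("k")   // right side, streamed

small.hint("shuffle_hash")                      // ask for ShuffledHashJoinExec building `small`
  .join(big, Seq("k"), "left_outer")
  .count()
{code}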



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43911) Use toSet to deduplicate the iterator data to prevent the creation of large Array

2023-06-02 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-43911:
-
Summary: Use toSet to deduplicate the iterator data to prevent the creation 
of large Array  (was: Directly use Set to consume iterator data to deduplicate, 
thereby reducing memory usage)

> Use toSet to deduplicate the iterator data to prevent the creation of large 
> Array
> -
>
> Key: SPARK-43911
> URL: https://issues.apache.org/jira/browse/SPARK-43911
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: mcdull_zhang
>Priority: Minor
>
> When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for 
> dynamic partition pruning, it puts all the keys into an Array and then calls 
> distinct on that Array to remove duplicates.
> In practice, a broadcast HashedRelation may have many rows and the keys are 
> often highly duplicated, so the Array can occupy a large amount of memory 
> (memory that is not managed by the MemoryManager) and may trigger an OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43911) Directly use Set to consume iterator data to deduplicate, thereby reducing memory usage

2023-06-01 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-43911:


 Summary: Directly use Set to consume iterator data to deduplicate, 
thereby reducing memory usage
 Key: SPARK-43911
 URL: https://issues.apache.org/jira/browse/SPARK-43911
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: mcdull_zhang


When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for 
dynamic partition pruning, it puts all the keys into an Array and then calls 
distinct on that Array to remove duplicates.

In practice, a broadcast HashedRelation may have many rows and the keys are 
often highly duplicated, so the Array can occupy a large amount of memory 
(memory that is not managed by the MemoryManager) and may trigger an OOM.
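As a rough illustration of the idea (a simplified sketch, not the actual patch), the deduplication can happen while the key iterator is consumed instead of after buffering everything into an Array:
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow

// `rows` stands in for the key iterator returned by HashedRelation.keys().

// Current approach: buffers every (possibly highly duplicated) key first.
def dedupWithArray(rows: Iterator[InternalRow]): Array[InternalRow] =
  rows.map(_.copy()).toArray.distinct

// Proposed approach: deduplicates while consuming the iterator, so the full
// duplicated key set is never materialized at once.
def dedupWithSet(rows: Iterator[InternalRow]): Set[InternalRow] =
  rows.map(_.copy()).toSet
{code}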



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41361) Invalid call toAttribute on unresolved object exception caused by WidenSetOperationTypes

2022-12-02 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-41361:


 Summary: Invalid call toAttribute on unresolved object exception 
caused by WidenSetOperationTypes
 Key: SPARK-41361
 URL: https://issues.apache.org/jira/browse/SPARK-41361
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: mcdull_zhang


The problem can be reproduced in the following way:
{code:java}
spark-sql> CREATE OR REPLACE TEMPORARY VIEW t1 AS VALUES (1, 'a'), (2, 'b') 
tbl(c1, c2);

spark-sql> CREATE OR REPLACE TEMPORARY VIEW t2 AS VALUES (1.0, 1), (2.0, 4) 
tbl(c1, c2); 

spark-sql> SELECT
         > TRANSFORM(*) USING 'cat' AS (a)
         > FROM
         > (
         > SELECT c2 AS c from t2
         > UNION
         > SELECT c2 AS c from t1);
Invalid call to toAttribute on unresolved object{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41191) Cache Table is not working while nested caches exist

2022-11-17 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-41191:


 Summary: Cache Table is not working while nested caches exist
 Key: SPARK-41191
 URL: https://issues.apache.org/jira/browse/SPARK-41191
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.1
Reporter: mcdull_zhang


For example, the following statements:
{code:java}
cache table t1 as select a from testData3 group by a;
cache table t2 as select a,b from testData2 where a in (select a from t1);
select key,value,b from testData t3 join t2 on t3.key=t2.a;{code}
The cached t2 is not used in the third statement.
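One hypothetical way to observe this (assumed, not from the report) is to look at the physical plan of the third statement; if the cache were picked up, it would contain an InMemoryTableScan over t2:
{code:scala}
spark.sql("cache table t1 as select a from testData3 group by a")
spark.sql("cache table t2 as select a, b from testData2 where a in (select a from t1)")

// If the cache were used, an InMemoryTableScan node over t2 would appear in this plan.
spark.sql("select key, value, b from testData t3 join t2 on t3.key = t2.a").explain()
{code}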



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40076) Support number-only column names in ORC data sources when orc impl is hive

2022-08-14 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-40076:


 Summary: Support number-only column names in ORC data sources when 
orc impl is hive
 Key: SPARK-40076
 URL: https://issues.apache.org/jira/browse/SPARK-40076
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.2
Reporter: mcdull_zhang


This problem is similar to SPARK-36663: both are caused by 
*CatalystSqlParser.parseDataType* failing to parse when a column name (or nested 
field name) consists only of numbers.

SPARK-36663 fixed the problem for spark.sql.orc.impl=native (the default), but 
with spark.sql.orc.impl=hive the same error is still thrown.
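A reproduction sketch, adapted from the SPARK-36663 report (the explicit hive setting is my assumption about how to hit this code path):
{code:scala}
// Same repro as SPARK-36663, but with the hive ORC reader selected.
spark.conf.set("spark.sql.orc.impl", "hive")

val path = "file:///tmp/test_orc"
spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)
spark.read.orc(path).show()   // still fails while parsing struct<100:bigint>
{code}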



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39126) After eliminating join to one side, that side should take advantage of LocalShuffleRead optimization

2022-05-08 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-39126:


 Summary: After eliminating join to one side, that side should take 
advantage of LocalShuffleRead optimization
 Key: SPARK-39126
 URL: https://issues.apache.org/jira/browse/SPARK-39126
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: mcdull_zhang


PropagateEmptyRelation can simplify joins. For example, if the right side of a 
LEFT JOIN is empty, the join can be eliminated and replaced by its left side.

If the left side contains a shuffle, that shuffle is now meaningless and can be 
optimized into a local shuffle read (LocalRead).
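A hypothetical illustration of the scenario (my own sketch, not from the report): the right side turns out to be empty at runtime, the join collapses onto its left child, and the repartition shuffle on the left could then be read locally:
{code:scala}
import org.apache.spark.sql.functions.col

val left  = spark.range(0, 1000000).toDF("k").repartition(col("k"))  // introduces a shuffle
val right = spark.range(0, 100).toDF("k").filter("k < 0")            // empty at runtime

// With AQE, PropagateEmptyRelation can eliminate the join; the left-side shuffle
// becomes pointless and could be consumed with a local shuffle read.
left.join(right, Seq("k"), "left_outer").count()
{code}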



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38867) Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin codegen

2022-04-11 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-38867:


 Summary: Avoid OOM when bufferedPlan has a lot of duplicate keys 
in SortMergeJoin codegen
 Key: SPARK-38867
 URL: https://issues.apache.org/jira/browse/SPARK-38867
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: mcdull_zhang


WholeStageCodegenExec's generated code is wrapped in a BufferedRowIterator.

BufferedRowIterator uses a LinkedList to hold the output of 
WholeStageCodegenExec.

When the parent of the SortMergeJoin cannot be code-generated, the SortMergeJoin 
has to append its output to this LinkedList.

SortMergeJoin processes one record from the streamedPlan at a time. If every 
record in the bufferedPlan matches that record, all bufferedPlan records end up 
held in the LinkedList, resulting in an OOM.

This situation is very common in our internal use, so it would be best to add a 
configuration to the generated code: once the LinkedList holds enough rows, 
pause the SortMergeJoin and let the parent consume them first.
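A rough sketch of the query shape that triggers this (assumed, not taken from the report): one streamed key matches every buffered row, so all of them are appended to the LinkedList at once whenever the join's parent does not support codegen.
{code:scala}
// Every streamed row with k = 0 matches all 10 million buffered rows. If the
// SortMergeJoin's parent operator cannot be code-generated, those matches are
// buffered in BufferedRowIterator's LinkedList before the parent consumes them.
val streamed = spark.range(0, 10).selectExpr("0 as k", "id as s")
val buffered = spark.range(0, 10000000).selectExpr("0 as k", "id as b")

streamed.join(buffered.hint("merge"), "k")   // the "merge" hint forces a sort-merge join
  .write.format("noop").mode("overwrite").save()
{code}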



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal

2022-03-16 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-38570:
-
Description: 
The return value of Literal.references is an empty AttributeSet, so Literal is 
mistaken for a partition column.

 

org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan:
{code:java}
val srcInfo: Option[(Expression, LogicalPlan)] =
  findExpressionAndTrackLineageDown(a, plan)
srcInfo.flatMap {
  case (resExp, l: LogicalRelation) =>
    l.relation match {
      case fs: HadoopFsRelation =>
        val partitionColumns = AttributeSet(
          l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver))
        // When resExp is a Literal, the Literal is considered a partition column.
        if (resExp.references.subsetOf(partitionColumns)) {
          return Some(l)
        } else {
          None
        }
      case _ => None
    } {code}

  was:
The return value of Literal.references is an empty AttributeSet, so Literal is 
mistaken for a partition column.

 

org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan:
{code:java}
val srcInfo: Option[(Expression, LogicalPlan)] = 
findExpressionAndTrackLineageDown(a, plan)
srcInfo.flatMap {
  case (resExp, l: LogicalRelation) =>
l.relation match {
  case fs: HadoopFsRelation =>
val partitionColumns = AttributeSet(
  l.resolve(fs.partitionSchema, 
fs.sparkSession.sessionState.analyzer.resolver))
// When resExp is a Literal, Literal is considered a partition column.  
       if (resExp.references.subsetOf(partitionColumns)) {
  return Some(l)
} else {
  None
}
  case _ => None
} {code}


> Incorrect DynamicPartitionPruning caused by Literal
> ---
>
> Key: SPARK-38570
> URL: https://issues.apache.org/jira/browse/SPARK-38570
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Priority: Minor
>
> The return value of Literal.references is an empty AttributeSet, so Literal 
> is mistaken for a partition column.
>  
> org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan:
> {code:java}
> val srcInfo: Option[(Expression, LogicalPlan)] =
>   findExpressionAndTrackLineageDown(a, plan)
> srcInfo.flatMap {
>   case (resExp, l: LogicalRelation) =>
>     l.relation match {
>       case fs: HadoopFsRelation =>
>         val partitionColumns = AttributeSet(
>           l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver))
>         // When resExp is a Literal, the Literal is considered a partition column.
>         if (resExp.references.subsetOf(partitionColumns)) {
>           return Some(l)
>         } else {
>           None
>         }
>       case _ => None
>     } {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal

2022-03-16 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-38570:


 Summary: Incorrect DynamicPartitionPruning caused by Literal
 Key: SPARK-38570
 URL: https://issues.apache.org/jira/browse/SPARK-38570
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: mcdull_zhang


The return value of Literal.references is an empty AttributeSet, so Literal is 
mistaken for a partition column.

 

org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan:
{code:java}
val srcInfo: Option[(Expression, LogicalPlan)] =
  findExpressionAndTrackLineageDown(a, plan)
srcInfo.flatMap {
  case (resExp, l: LogicalRelation) =>
    l.relation match {
      case fs: HadoopFsRelation =>
        val partitionColumns = AttributeSet(
          l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver))
        // When resExp is a Literal, the Literal is considered a partition column.
        if (resExp.references.subsetOf(partitionColumns)) {
          return Some(l)
        } else {
          None
        }
      case _ => None
    } {code}
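The mistake is easy to see in isolation (a hedged illustration, not part of the original report): a Literal has no references, and an empty AttributeSet is a subset of any set, so the check above accepts it.
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Literal}

// Literal.references is empty, and an empty set is a subset of anything, so
// resExp.references.subsetOf(partitionColumns) is true even for a constant.
val partitionColumns = AttributeSet.empty      // stand-in; the real contents don't matter
val isSubset = Literal(1).references.subsetOf(partitionColumns)
// isSubset == true, so the Literal is treated as a partition column
{code}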



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38542) UnsafeHashedRelation should serialize numKeys out

2022-03-13 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-38542:
-
Description: 
At present, UnsafeHashedRelation does not write out numKeys during 
serialization, so the numKeys of UnsafeHashedRelation obtained by 
deserialization is equal to 0. The numFields of UnsafeRows returned by 
UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect 
data.

 

For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is 
called.
{code:java}
val broadcastRelation = child.executeBroadcast[HashedRelation]().value
val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) {
  (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index))
} else {
  (broadcastRelation.keys(),
BoundReference(index, buildKeys(index).dataType, buildKeys(index).nullable))
}{code}
 

  was:
At present, UnsafeHashedRelation does not write out numKeys during 
serialization, so the numKeys of UnsafeHashedRelation obtained by 
deserialization is equal to 0. The numFields of UnsafeRows returned by 
UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect 
data.

 

For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is 
called.
{code:java}
val broadcastRelation = child.executeBroadcast[HashedRelation]().value
val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) {
  (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index))
} else {
  (broadcastRelation.keys(),
BoundReference(index, buildKeys(index).dataType, buildKeys(index).nullable))
}{code}
 

 

 

 


> UnsafeHashedRelation should serialize numKeys out
> -
>
> Key: SPARK-38542
> URL: https://issues.apache.org/jira/browse/SPARK-38542
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Priority: Critical
>
> At present, UnsafeHashedRelation does not write out numKeys during 
> serialization, so the numKeys of UnsafeHashedRelation obtained by 
> deserialization is equal to 0. The numFields of UnsafeRows returned by 
> UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect 
> data.
>  
> For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is 
> called.
> {code:java}
> val broadcastRelation = child.executeBroadcast[HashedRelation]().value
> val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) {
>   (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index))
> } else {
>   (broadcastRelation.keys(),
> BoundReference(index, buildKeys(index).dataType, 
> buildKeys(index).nullable))
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38542) UnsafeHashedRelation should serialize numKeys out

2022-03-13 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-38542:
-
Description: 
At present, UnsafeHashedRelation does not write out numKeys during 
serialization, so the numKeys of UnsafeHashedRelation obtained by 
deserialization is equal to 0. The numFields of UnsafeRows returned by 
UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect 
data.

 

For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is 
called.
{code:java}
val broadcastRelation = child.executeBroadcast[HashedRelation]().value
val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) {
  (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index))
} else {
  (broadcastRelation.keys(),
BoundReference(index, buildKeys(index).dataType, buildKeys(index).nullable))
}{code}
 

 

 

 

  was:
At present, UnsafeHashedRelation does not write out numKeys during 
serialization, so the numKeys of UnsafeHashedRelation obtained by 
deserialization is equal to 0. The numFields of UnsafeRows returned by 
UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect 
data.

 

For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is 
called.
{code:java}
val broadcastRelation = child.executeBroadcast[HashedRelation]().value
val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) {
  (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index))
} else {
  (broadcastRelation.keys(),
BoundReference(index, buildKeys(index).dataType, buildKeys(index).nullable))
}{code}
 

 

 


> UnsafeHashedRelation should serialize numKeys out
> -
>
> Key: SPARK-38542
> URL: https://issues.apache.org/jira/browse/SPARK-38542
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Priority: Critical
>
> At present, UnsafeHashedRelation does not write out numKeys during 
> serialization, so the numKeys of UnsafeHashedRelation obtained by 
> deserialization is equal to 0. The numFields of UnsafeRows returned by 
> UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect 
> data.
>  
> For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is 
> called.
> {code:java}
> val broadcastRelation = child.executeBroadcast[HashedRelation]().value
> val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) {
>   (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index))
> } else {
>   (broadcastRelation.keys(),
> BoundReference(index, buildKeys(index).dataType, 
> buildKeys(index).nullable))
> }{code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38542) UnsafeHashedRelation should serialize numKeys out

2022-03-13 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-38542:


 Summary: UnsafeHashedRelation should serialize numKeys out
 Key: SPARK-38542
 URL: https://issues.apache.org/jira/browse/SPARK-38542
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: mcdull_zhang


At present, UnsafeHashedRelation does not write out numKeys during 
serialization, so the numKeys of UnsafeHashedRelation obtained by 
deserialization is equal to 0. The numFields of UnsafeRows returned by 
UnsafeHashedRelation.keys() are all 0, which can lead to missing or incorrect 
data.

 

For example, in SubqueryBroadcastExec, the HashedRelation.keys() function is 
called.
{code:java}
val broadcastRelation = child.executeBroadcast[HashedRelation]().value
val (iter, expr) = if (broadcastRelation.isInstanceOf[LongHashedRelation]) {
  (broadcastRelation.keys(), HashJoin.extractKeyExprAt(buildKeys, index))
} else {
  (broadcastRelation.keys(),
BoundReference(index, buildKeys(index).dataType, buildKeys(index).nullable))
}{code}
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37652) Support optimize skewed join through union

2021-12-15 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-37652:
-
Description: 
The `OptimizeSkewedJoin` rule takes effect only when the plan has two 
ShuffleQueryStageExec nodes.

With `Union`, this assumption may be broken. For example, consider the following plans:
*scenario 1*
{noformat}
Union
  SMJ
    ShuffleQueryStage
    ShuffleQueryStage
  SMJ
    ShuffleQueryStage
    ShuffleQueryStage
{noformat}
*scenario 2*
{noformat}
Union
  SMJ
    ShuffleQueryStage
    ShuffleQueryStage
  HashAggregate
{noformat}
When the data of one or more of the SMJs in the above plans is skewed, it 
currently cannot be optimized.

It would be better to support partially optimizing skew under a Union.

  was:
`OptimizeSkewedJoin` rule will take effect only when the plan has two 
ShuffleQueryStageExec。

With `Union`, it might break the assumption. For example, the following plans

{code: none}
Union
SMJ
SMJ
{code}





> Support optimize skewed join through union
> --
>
> Key: SPARK-37652
> URL: https://issues.apache.org/jira/browse/SPARK-37652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Priority: Minor
>
> The `OptimizeSkewedJoin` rule takes effect only when the plan has two 
> ShuffleQueryStageExec nodes.
> With `Union`, this assumption may be broken. For example, consider the following plans:
> *scenario 1*
> {noformat}
> Union
>   SMJ
>     ShuffleQueryStage
>     ShuffleQueryStage
>   SMJ
>     ShuffleQueryStage
>     ShuffleQueryStage
> {noformat}
> *scenario 2*
> {noformat}
> Union
>   SMJ
>     ShuffleQueryStage
>     ShuffleQueryStage
>   HashAggregate
> {noformat}
> When the data of one or more of the SMJs in the above plans is skewed, it 
> currently cannot be optimized.
> It would be better to support partially optimizing skew under a Union.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37652) Support optimize skewed join through union

2021-12-15 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-37652:
-
Description: 
`OptimizeSkewedJoin` rule will take effect only when the plan has two 
ShuffleQueryStageExec。

With `Union`, it might break the assumption. For example, the following plans

{code: none}
Union
SMJ
SMJ
{code}




  was:
`OptimizeSkewedJoin` rule will take effect only when the plan has two 
ShuffleQueryStageExec。

With `Union`, it might break the assumption. For example, the following plans

{code:tex}
// Some comments here
public String getFoo()
{
return foo;
}
{code}





> Support optimize skewed join through union
> --
>
> Key: SPARK-37652
> URL: https://issues.apache.org/jira/browse/SPARK-37652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Priority: Minor
>
> `OptimizeSkewedJoin` rule will take effect only when the plan has two 
> ShuffleQueryStageExec。
> With `Union`, it might break the assumption. For example, the following plans
> {code: none}
> Union
> SMJ
> SMJ
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37652) Support optimize skewed join through union

2021-12-15 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-37652:


 Summary: Support optimize skewed join through union
 Key: SPARK-37652
 URL: https://issues.apache.org/jira/browse/SPARK-37652
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: mcdull_zhang


The `OptimizeSkewedJoin` rule takes effect only when the plan has two 
ShuffleQueryStageExec nodes.

With `Union`, this assumption may be broken. For example, consider the following plan:

{panel:title=scenes1:}
Union
  SMJ
  SMJ
{panel}





--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37652) Support optimize skewed join through union

2021-12-15 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-37652:
-
Description: 
`OptimizeSkewedJoin` rule will take effect only when the plan has two 
ShuffleQueryStageExec。

With `Union`, it might break the assumption. For example, the following plans

{code:tex}
// Some comments here
public String getFoo()
{
return foo;
}
{code}




  was:
`OptimizeSkewedJoin` rule will take effect only when the plan has two 
ShuffleQueryStageExec。

With `Union`, it might break the assumption. For example, the following plans

{panel:title=scenes1:}
Union
SMJ
SMJ
{panel}




> Support optimize skewed join through union
> --
>
> Key: SPARK-37652
> URL: https://issues.apache.org/jira/browse/SPARK-37652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Priority: Minor
>
> `OptimizeSkewedJoin` rule will take effect only when the plan has two 
> ShuffleQueryStageExec。
> With `Union`, it might break the assumption. For example, the following plans
> {code:tex}
> // Some comments here
> public String getFoo()
> {
> return foo;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37301) ConcurrentModificationException caused by CollectionAccumulator serialization in the heartbeat thread

2021-11-12 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-37301:
-
Description: 
In our production environment, you can use the following code to reproduce the 
problem:


{code:scala}
val acc = sc.collectionAccumulator[String]("test_acc")
    
sc.parallelize(Array(0)).foreach(_ => {
  var i = 0
  var stop = false
  val start = System.currentTimeMillis()
  while (!stop) {
    acc.add(i.toString)
    if (i % 1 == 0) {
      acc.reset()
      if ((System.currentTimeMillis() - start) / 1000 > 120) {
        stop = true
      }
    }
    i = i + 1
  }
})
sc.stop()
{code}

This code can make the executor fail to send heartbeats more than the default 
limit of 60 consecutive times, after which the executor exits.


{noformat}
21/11/11 21:00:23 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
    at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1007)
    at 
org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
    at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.ConcurrentModificationException
    at java.util.ArrayList.writeObject(ArrayList.java:766)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at 
org.apache.spark.rpc.netty.RequestMessage.serialize(NettyRpcEnv.scala:601)
    at 
org.apache.spark.rpc.netty.NettyRpcEnv.askAbortable(NettyRpcEnv.scala:244)
    at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.askAbortable(NettyRpcEnv.scala:555)
    at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:559)

[jira] [Updated] (SPARK-37301) ConcurrentModificationException caused by CollectionAccumulator serialization in the heartbeat thread

2021-11-12 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-37301:
-
Description: 
In our production environment, you can use the following code to reproduce the 
problem:


{code:scala}
val acc = sc.collectionAccumulator[String]("test_acc")
    
sc.parallelize(Array(0)).foreach(_ => {
  var i = 0
  var stop = false
  val start = System.currentTimeMillis()
  while (!stop) {
    acc.add(i.toString)
    if (i % 1 == 0) {
      acc.reset()
      if ((System.currentTimeMillis() - start) / 1000 > 120) {
        stop = true
      }
    }
    i = i + 1
  }
})
sc.stop()
{code}

This code can make the executor fail to send heartbeats more than the default 
limit of 60 consecutive times, after which the executor exits.

```tex
21/11/11 21:00:23 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
    at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1007)
    at 
org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
    at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.ConcurrentModificationException
    at java.util.ArrayList.writeObject(ArrayList.java:766)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at 
org.apache.spark.rpc.netty.RequestMessage.serialize(NettyRpcEnv.scala:601)
    at 
org.apache.spark.rpc.netty.NettyRpcEnv.askAbortable(NettyRpcEnv.scala:244)
    at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.askAbortable(NettyRpcEnv.scala:555)
    at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:559)
    a

[jira] [Created] (SPARK-37301) ConcurrentModificationException caused by CollectionAccumulator serialization in the heartbeat thread

2021-11-12 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-37301:


 Summary: ConcurrentModificationException caused by 
CollectionAccumulator serialization in the heartbeat thread
 Key: SPARK-37301
 URL: https://issues.apache.org/jira/browse/SPARK-37301
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: mcdull_zhang


In our production environment, you can use the following code to reproduce the 
problem:

```scala
val acc = sc.collectionAccumulator[String]("test_acc")
    
sc.parallelize(Array(0)).foreach(_ => {
  var i = 0
  var stop = false
  val start = System.currentTimeMillis()
  while (!stop) {
    acc.add(i.toString)
    if (i % 1 == 0) {
      acc.reset()
      if ((System.currentTimeMillis() - start) / 1000 > 120) {
        stop = true
      }
    }
    i = i + 1
  }
})
sc.stop()
```

This code can make the executor fail to send heartbeats more than the default 
limit of 60 consecutive times, after which the executor exits.

```tex
21/11/11 21:00:23 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
    at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1007)
    at 
org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
    at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.ConcurrentModificationException
    at java.util.ArrayList.writeObject(ArrayList.java:766)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
    at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at 
org.apache.spark.rpc.netty.RequestMessage.serialize(NettyRpcEnv.scala:601)
    at 
org.apache.spark.rpc.

[jira] [Commented] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file

2021-09-03 Thread mcdull_zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17409492#comment-17409492
 ] 

mcdull_zhang commented on SPARK-36663:
--

cc  [~hyukjin.kwon]      [~cloud_fan]

> When the existing field name is a number, an error will be reported when 
> reading the orc file
> -
>
> Key: SPARK-36663
> URL: https://issues.apache.org/jira/browse/SPARK-36663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2
>Reporter: mcdull_zhang
>Priority: Critical
> Attachments: image-2021-09-03-20-56-28-846.png
>
>
> You can use the following methods to reproduce the problem:
> {quote}val path = "file:///tmp/test_orc"
> spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)
> spark.read.orc(path)
> {quote}
> The error message is like this:
> {quote}org.apache.spark.sql.catalyst.parser.ParseException:
>  mismatched input '100' expecting {'ADD', 'AFTER'
> == SQL ==
>  struct<100:bigint>
>  ---^^^
> {quote}
> The error is actually issued by this line of code:
> {quote}CatalystSqlParser.parseDataType("100:bigint")
> {quote}
>  
> The specific background is that spark calls the above code in the process of 
> converting the schema of the orc file into the catalyst schema.
> {quote}// code in OrcUtils
> private def toCatalystSchema(schema: TypeDescription): StructType = {
>   CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
> }{quote}
> There are two solutions I currently think of:
>  # Modify the syntax analysis of SparkSQL to identify this kind of schema
>  # The TypeDescription.toString method should add the quote symbol to the 
> numeric column name, because the following syntax is supported:
> {quote}CatalystSqlParser.parseDataType("`100`:bigint")
> {quote}
> But currently TypeDescription does not support changing the UNQUOTED_NAMES 
> variable, should we first submit a pr to the orc project to support the 
> configuration of this variable。
> !image-2021-09-03-20-56-28-846.png!
>  
> How do spark members think about this issue?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file

2021-09-03 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-36663:
-
Description: 
You can use the following methods to reproduce the problem:
{quote}val path = "file:///tmp/test_orc"

spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)

spark.read.orc(path)
{quote}
The error message is like this:
{quote}org.apache.spark.sql.catalyst.parser.ParseException:
 mismatched input '100' expecting {'ADD', 'AFTER'

== SQL ==
 struct<100:bigint>
 ---^^^
{quote}
The error is actually issued by this line of code:
{quote}CatalystSqlParser.parseDataType("100:bigint")
{quote}
 

The specific background is that spark calls the above code in the process of 
converting the schema of the orc file into the catalyst schema.
{quote}// code in OrcUtils
private def toCatalystSchema(schema: TypeDescription): StructType = {
  CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
}{quote}
There are two solutions I currently think of:
 # Modify the syntax analysis of SparkSQL to identify this kind of schema
 # The TypeDescription.toString method should add the quote symbol to the 
numeric column name, because the following syntax is supported:
{quote}CatalystSqlParser.parseDataType("`100`:bigint")
{quote}

However, TypeDescription currently does not allow changing the UNQUOTED_NAMES 
variable, so should we first submit a PR to the ORC project to make this 
variable configurable?

!image-2021-09-03-20-56-28-846.png!

 

What do the Spark members think about this issue?

 

  was:
You can use the following methods to reproduce the problem:
{quote}val path = "file:///tmp/test_orc"

spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)

spark.read.orc(path)
{quote}
The error message is like this:
{quote}org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '100' expecting {'ADD', 'AFTER'

== SQL ==
struct<100:bigint>
---^^^
{quote}
The error is actually issued by this line of code:
{quote}CatalystSqlParser.parseDataType("100:bigint")
{quote}
 

The specific background is that spark calls the above code in the process of 
converting the schema of the orc file into the catalyst schema.
{quote}// code in OrcUtils
private def toCatalystSchema(schema: TypeDescription): StructType = {
 
CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
}{quote}
There are two solutions I currently think of:
 # Modify the syntax analysis of SparkSQL to identify this kind of schema
 # The TypeDescription.toString method should add the quote symbol to the 
numeric column name, because the following syntax is supported:
{quote}CatalystSqlParser.parseDataType("`100`:bigint"){quote}

But currently TypeDescription does not support changing the UNQUOTED_NAMES 
variable, should we first submit a pr to the orc project to support the 
configuration of this variable。

!image-2021-09-03-20-53-35-626.png!

 

How do spark members think about this issue?

 


> When the existing field name is a number, an error will be reported when 
> reading the orc file
> -
>
> Key: SPARK-36663
> URL: https://issues.apache.org/jira/browse/SPARK-36663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2
>Reporter: mcdull_zhang
>Priority: Critical
> Attachments: image-2021-09-03-20-56-28-846.png
>
>
> You can use the following methods to reproduce the problem:
> {quote}val path = "file:///tmp/test_orc"
> spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)
> spark.read.orc(path)
> {quote}
> The error message is like this:
> {quote}org.apache.spark.sql.catalyst.parser.ParseException:
>  mismatched input '100' expecting {'ADD', 'AFTER'
> == SQL ==
>  struct<100:bigint>
>  ---^^^
> {quote}
> The error is actually issued by this line of code:
> {quote}CatalystSqlParser.parseDataType("100:bigint")
> {quote}
>  
> The specific background is that spark calls the above code in the process of 
> converting the schema of the orc file into the catalyst schema.
> {quote}// code in OrcUtils
> private def toCatalystSchema(schema: TypeDescription): StructType = {
>   CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
> }{quote}
> There are two solutions I currently think of:
>  # Modify the syntax analysis of SparkSQL to identify this kind of schema
>  # The TypeDescription.toString method should add the quote symbol to the 
> numeric column name, because the following syntax is supp

[jira] [Updated] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file

2021-09-03 Thread mcdull_zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mcdull_zhang updated SPARK-36663:
-
Attachment: image-2021-09-03-20-56-28-846.png

> When the existing field name is a number, an error will be reported when 
> reading the orc file
> -
>
> Key: SPARK-36663
> URL: https://issues.apache.org/jira/browse/SPARK-36663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2
>Reporter: mcdull_zhang
>Priority: Critical
> Attachments: image-2021-09-03-20-56-28-846.png
>
>
> You can use the following methods to reproduce the problem:
> {quote}val path = "file:///tmp/test_orc"
> spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)
> spark.read.orc(path)
> {quote}
> The error message is like this:
> {quote}org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '100' expecting {'ADD', 'AFTER'
> == SQL ==
> struct<100:bigint>
> ---^^^
> {quote}
> The error is actually issued by this line of code:
> {quote}CatalystSqlParser.parseDataType("100:bigint")
> {quote}
>  
> The specific background is that spark calls the above code in the process of 
> converting the schema of the orc file into the catalyst schema.
> {quote}// code in OrcUtils
> private def toCatalystSchema(schema: TypeDescription): StructType = {
>  
> CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
> }{quote}
> There are two solutions I currently think of:
>  # Modify the syntax analysis of SparkSQL to identify this kind of schema
>  # The TypeDescription.toString method should add the quote symbol to the 
> numeric column name, because the following syntax is supported:
> {quote}CatalystSqlParser.parseDataType("`100`:bigint"){quote}
> But currently TypeDescription does not support changing the UNQUOTED_NAMES 
> variable, should we first submit a pr to the orc project to support the 
> configuration of this variable。
> !image-2021-09-03-20-53-35-626.png!
>  
> How do spark members think about this issue?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file

2021-09-03 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-36663:


 Summary: When the existing field name is a number, an error will 
be reported when reading the orc file
 Key: SPARK-36663
 URL: https://issues.apache.org/jira/browse/SPARK-36663
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2, 3.0.3
Reporter: mcdull_zhang


You can use the following methods to reproduce the problem:
{quote}val path = "file:///tmp/test_orc"

spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)

spark.read.orc(path)
{quote}
The error message is like this:
{quote}org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '100' expecting {'ADD', 'AFTER'

== SQL ==
struct<100:bigint>
---^^^
{quote}
The error is actually issued by this line of code:
{quote}CatalystSqlParser.parseDataType("100:bigint")
{quote}
 

The specific background is that spark calls the above code in the process of 
converting the schema of the orc file into the catalyst schema.
{quote}// code in OrcUtils
private def toCatalystSchema(schema: TypeDescription): StructType = {
 
CharVarcharUtils.replaceCharVarcharWithStringInSchema(CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
}{quote}
There are two solutions I currently think of:
 # Modify the syntax analysis of SparkSQL to identify this kind of schema
 # The TypeDescription.toString method should add the quote symbol to the 
numeric column name, because the following syntax is supported:
{quote}CatalystSqlParser.parseDataType("`100`:bigint"){quote}

However, TypeDescription currently does not allow changing the UNQUOTED_NAMES 
variable, so should we first submit a PR to the ORC project to make this 
variable configurable?

!image-2021-09-03-20-53-35-626.png!

 

What do the Spark members think about this issue?

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36612) Support left outer join build left or right outer join build right in shuffled hash join

2021-08-30 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-36612:


 Summary: Support left outer join build left or right outer join 
build right in shuffled hash join
 Key: SPARK-36612
 URL: https://issues.apache.org/jira/browse/SPARK-36612
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: mcdull_zhang


Currently Spark SQL does not support building the left side for a left outer 
join (or building the right side for a right outer join).

However, in our production environment there are many scenarios where a small 
table is left-joined to a large table, and the large table often has data skew 
(which AQE currently cannot handle).

Inspired by SPARK-32399, we can use a similar idea to support building the left 
side of a left outer join.

I think this improvement is very meaningful; what do the members think about it?
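A hedged sketch of how this could be used once supported (the table names and join key are made up for illustration):
{code:scala}
// Small dimension table left-joined to a large, possibly skewed fact table.
// The shuffle_hash hint on the small (left) side asks the planner to build it,
// which is currently rejected for LEFT OUTER joins.
val smallDim  = spark.table("small_dim")     // hypothetical table names
val largeFact = spark.table("large_fact")

smallDim.hint("shuffle_hash")
  .join(largeFact, Seq("key"), "left_outer")
{code}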



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36082) when the right side is small enough to use SingleColumn Null Aware Anti Join

2021-07-10 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-36082:


 Summary: when the right side is small enough to use SingleColumn 
Null Aware Anti Join
 Key: SPARK-36082
 URL: https://issues.apache.org/jira/browse/SPARK-36082
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0, 3.1.3
Reporter: mcdull_zhang
 Fix For: 3.2.0


A NULL-aware ANTI join (https://issues.apache.org/jira/browse/SPARK-32290) 
builds the right side into a HashMap.

Code in SparkStrategies:

{code:java}
case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys) =>
  Seq(joins.BroadcastHashJoinExec(leftKeys, rightKeys, LeftAnti, BuildRight,
    None, planLater(j.left), planLater(j.right), isNullAwareAntiJoin = true)){code}
We should add a size condition and apply this optimization only when the right 
side is small enough.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31459) When using the insert overwrite directory syntax, if the target path is an existing file, the final run result is incorrect

2020-04-16 Thread mcdull_zhang (Jira)
mcdull_zhang created SPARK-31459:


 Summary: When using the insert overwrite directory syntax, if the 
target path is an existing file, the final run result is incorrect
 Key: SPARK-31459
 URL: https://issues.apache.org/jira/browse/SPARK-31459
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 3
 Environment: spark2.4.5
Reporter: mcdull_zhang


When using the INSERT OVERWRITE DIRECTORY syntax, if the target path already 
exists as a file, the final result is incorrect.
At present, Spark does not delete the existing file; after the computation 
completes, one of the result files is simply renamed to the target path.
This differs from Hive's behavior: Hive deletes the existing target file first.
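A hedged illustration of the syntax involved (the output path and source table are made up for the example):
{code:scala}
// If /tmp/out already exists as a plain file rather than a directory, Spark
// does not remove it first, so the content written here can end up wrong.
spark.sql("""
  INSERT OVERWRITE DIRECTORY '/tmp/out'
  USING parquet
  SELECT * FROM src
""")
{code}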



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org