[jira] [Resolved] (SPARK-39153) When we look at spark UI or History, we can see the failed tasks first

2023-03-28 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong resolved SPARK-39153.
-
Resolution: Not A Problem

> When we look at spark UI or History, we can see the failed tasks first
> --
>
> Key: SPARK-39153
> URL: https://issues.apache.org/jira/browse/SPARK-39153
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
> Environment: spark 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
> Fix For: 3.2.0
>
>
> When a task fails, users care most about which tasks failed and why. The current 
> Spark UI and History Server sort the task table by "Index" rather than by 
> "Errors", so when there are many tasks you have to wait for the table to be 
> re-sorted before the failed tasks surface. Sorting by the "Errors" column by 
> default would show the cause of failures immediately and improve the user 
> experience.
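As a rough illustration of the requested default ordering (not the actual Spark UI code; the TaskRow case class and its fields below are hypothetical), a Scala sketch that puts tasks with errors first:

{code:java}
// Hypothetical, simplified task rows; the real UI renders TaskData from the status store.
case class TaskRow(index: Int, status: String, error: Option[String])

// Sort so that tasks with a recorded error come first, then by index,
// which is what "sort by the Errors column by default" would amount to.
def sortForDisplay(tasks: Seq[TaskRow]): Seq[TaskRow] =
  tasks.sortBy(t => (t.error.isEmpty, t.index))

// Example: failed task 7 is listed before successful tasks 0 and 3.
val rows = Seq(
  TaskRow(0, "SUCCESS", None),
  TaskRow(7, "FAILED", Some("java.io.IOException: ...")),
  TaskRow(3, "SUCCESS", None))
sortForDisplay(rows).foreach(println)
{code}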






[jira] [Resolved] (SPARK-39967) Instead of using the scalar tasksSuccessful, use the successful array to calculate whether the task is completed

2023-03-28 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong resolved SPARK-39967.
-
Resolution: Fixed

Not reproducible in the new version.

 

> Instead of using the scalar tasksSuccessful, use the successful array to 
> calculate whether the task is completed
> 
>
> Key: SPARK-39967
> URL: https://issues.apache.org/jira/browse/SPARK-39967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3, 2.4.6
>Reporter: jingxiong zhong
>Priority: Critical
> Attachments: spark1-1.png, spark2.png, spark3-1.png
>
>
> When Spark counts the number of successful tasks in a stage, it relies on the 
> scalar counter `tasksSuccessful`, but the actual success or failure of each task 
> is recorded in the `successful` array. Logging I added shows that the count kept 
> in `tasksSuccessful` can become inconsistent with the state stored in the 
> `successful` array. The `successful` array should be treated as the source of 
> truth.
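A minimal Scala sketch of the distinction (the class and field names are simplified stand-ins for Spark's scheduler internals, not the actual implementation):

{code:java}
// Deriving completion from the per-task state array instead of trusting a
// separately maintained scalar counter.
class TaskSetState(numTasks: Int) {
  private val successful = new Array[Boolean](numTasks) // per-task ground truth
  private var tasksSuccessful = 0                        // scalar counter that can drift

  def markSuccessful(index: Int): Unit = {
    if (!successful(index)) {
      successful(index) = true
      tasksSuccessful += 1
    }
  }

  // Proposed check: count directly from the array rather than the counter.
  def isCompleted: Boolean = successful.count(b => b) == numTasks
}
{code}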






[jira] [Created] (SPARK-42392) Add a new case of TriggeredByExecutorDecommissionInfo to remove unnecessary param

2023-02-09 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-42392:
---

 Summary: Add a new case of TriggeredByExecutorDecommissionInfo to 
remove unnecessary param
 Key: SPARK-42392
 URL: https://issues.apache.org/jira/browse/SPARK-42392
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: jingxiong zhong


Add a new case of TriggeredByExecutorDecommissionInfo so that callers do not need 
to pass an unnecessary extra parameter.






[jira] [Updated] (SPARK-42336) Use getOrElse() instead of contains() in

2023-02-05 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-42336:

Summary: Use getOrElse() instead of contains() in  (was: Use OpenHashMap 
instead of HashMap)

> Use getOrElse() instead of contains() in  
> --
>
> Key: SPARK-42336
> URL: https://issues.apache.org/jira/browse/SPARK-42336
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: jingxiong zhong
>Priority: Minor
>
> In ResourceAllocator, we can use `.getOrElse(address, throw new 
> SparkException(...))` instead of a separate `contains` check, which avoids one 
> extra map lookup and gives slightly better performance.
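A minimal Scala sketch of the pattern (the map contents, method names, and error message are illustrative, not the actual ResourceAllocator code):

{code:java}
import scala.collection.mutable
import org.apache.spark.SparkException

val addressAvailability = mutable.HashMap("gpu-0" -> 1, "gpu-1" -> 0)

// Before: two lookups on the same key (contains, then apply).
def acquireBefore(address: String): Int = {
  if (!addressAvailability.contains(address)) {
    throw new SparkException(s"Try to acquire an address that doesn't exist: $address")
  }
  addressAvailability(address)
}

// After: a single lookup; the exception is only constructed if the key is missing,
// because getOrElse takes its default argument by name.
def acquireAfter(address: String): Int =
  addressAvailability.getOrElse(address,
    throw new SparkException(s"Try to acquire an address that doesn't exist: $address"))
{code}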






[jira] [Updated] (SPARK-42336) Use getOrElse() instead of contains() in ResourceAllocator

2023-02-05 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-42336:

Summary: Use getOrElse() instead of contains() in  ResourceAllocator  (was: 
Use getOrElse() instead of contains() in  )

> Use getOrElse() instead of contains() in  ResourceAllocator
> ---
>
> Key: SPARK-42336
> URL: https://issues.apache.org/jira/browse/SPARK-42336
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: jingxiong zhong
>Priority: Minor
>
> In ResourceAllocator, we can use `.getOrElse(address, throw new 
> SparkException(...))` instead of a separate `contains` check, which avoids one 
> extra map lookup and gives slightly better performance.






[jira] [Updated] (SPARK-42336) Use OpenHashMap instead of HashMap

2023-02-05 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-42336:

Description: In ResourceAllocator, we can use `.getOrElse(address, throw new 
SparkException(...))` instead of a separate `contains` check, which avoids one 
extra map lookup and gives slightly better performance.  (was: In ResourceAllocator, 
we can use OpenHashMap instead of HashMap, which can gain better performance.)

> Use OpenHashMap instead of HashMap
> --
>
> Key: SPARK-42336
> URL: https://issues.apache.org/jira/browse/SPARK-42336
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: jingxiong zhong
>Priority: Minor
>
> In ResourceAllocator, we can use `.getOrElse(address, throw new 
> SparkException(...))` instead of a separate `contains` check, which avoids one 
> extra map lookup and gives slightly better performance.






[jira] [Created] (SPARK-42336) Use OpenHashMap instead of HashMap

2023-02-03 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-42336:
---

 Summary: Use OpenHashMap instead of HashMap
 Key: SPARK-42336
 URL: https://issues.apache.org/jira/browse/SPARK-42336
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: jingxiong zhong


In ResourceAllocator, we can use OpenHashMap instead of HashMap, which should give 
better performance.
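A rough sketch of the suggested swap, assuming the code lives inside Spark itself (OpenHashMap is a package-private utility in org.apache.spark.util.collection); the variable names are illustrative:

{code:java}
// Note: OpenHashMap is private[spark], so this only compiles inside the
// org.apache.spark package tree (e.g. within Spark's own source).
import org.apache.spark.util.collection.OpenHashMap

// Current style:
//   val addressAvailability = new scala.collection.mutable.HashMap[String, Int]()
// OpenHashMap uses open addressing with specialized primitive values, so there is
// no per-entry wrapper object and less GC pressure for large maps.
val addressAvailability = new OpenHashMap[String, Int]()
addressAvailability.update("gpu-0", 1)
addressAvailability.update("gpu-1", 0)

val slots = if (addressAvailability.contains("gpu-0")) addressAvailability("gpu-0") else 0
{code}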






[jira] [Commented] (SPARK-41982) When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`

2023-01-11 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17672320#comment-17672320
 ] 

jingxiong zhong commented on SPARK-41982:
-

cc [~cloud_fan] [~gurwls223] I want to know your opinion.

> When the inserted partition type is of string type, similar `dt=01` will be 
> converted to `dt=1`
> ---
>
> Key: SPARK-41982
> URL: https://issues.apache.org/jira/browse/SPARK-41982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: jingxiong zhong
>Priority: Critical
>
> While upgrading from Spark 2.4 to Spark 3.2, we read the migration guide 
> carefully and found a case that it does not cover:
> {code:java}
> create table if not exists test_90(a string, b string) partitioned by (dt 
> string);
> desc formatted test_90;
> // case1
> insert into table test_90 partition (dt=05) values("1","2");
> // case2
> insert into table test_90 partition (dt='05') values("1","2");
> drop table test_90;{code}
> In Spark 2.4.3, this generates a single path:
> {code:java}
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05 
> //result
> spark-sql> select * from test_90;
> 1       2       05
> 1       2       05
> Time taken: 1.316 seconds, Fetched 2 row(s)
> spark-sql> show partitions test_90; 
> dt=05 
> Time taken: 0.201 seconds, Fetched 1 row(s)
> spark-sql> select * from test_90 where dt='05';
> 1       2       05
> 1   2       05
> Time taken: 0.212 seconds, Fetched 2 row(s)
> spark-sql> explain insert into table test_90 partition (dt=05) 
> values("1","2");
> == Physical Plan ==
> Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, 
> [a, b]
> +- LocalTableScan [a#116, b#117]
> Time taken: 1.145 seconds, Fetched 1 row(s){code}
> In Spark 3.2.0, it generates two paths:
> {code:java}
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05 
> hdfs://test5/user/hive/db1/test_90/dt=5 
> // result
> spark-sql> select * from test_90;
> 1       2       05
> 1       2       5
> Time taken: 2.119 seconds, Fetched 2 row(s)
> spark-sql> show partitions test_90;
> dt=05
> dt=5
> Time taken: 0.161 seconds, Fetched 2 row(s)
> spark-sql> select * from test_90 where dt='05';
> 1       2       05
> Time taken: 0.252 seconds, Fetched 1 row(s)
> spark-sql> explain insert into table test_90 partition (dt=05) 
> values("1","2");
> plan
> == Physical Plan ==
> Execute InsertIntoHiveTable `db1`.`test_90`, 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
> +- LocalTableScan [a#109, b#110]{code}
> This causes problems when reading the data after a user switches to Spark 3. The 
> root cause is that during partition-spec resolution, Spark 3 forcibly converts 
> the partition value, which causes `05` to lose its leading `0`.
> So I think we have two solutions: one is to document the risk clearly in the 
> migration guide, and the other is to fix this case so that a string-typed 
> partition column keeps its value as a string, regardless of whether single or 
> double quotation marks are used.
>  
>  






[jira] [Updated] (SPARK-41982) When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`

2023-01-11 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41982:

Description: 
While upgrading from Spark 2.4 to Spark 3.2, we read the migration guide carefully 
and found a case that it does not cover:
{code:java}
create table if not exists test_90(a string, b string) partitioned by (dt 
string);
desc formatted test_90;
// case1
insert into table test_90 partition (dt=05) values("1","2");
// case2
insert into table test_90 partition (dt='05') values("1","2");
drop table test_90;{code}
In Spark 2.4.3, this generates a single path:
{code:java}
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 

//result
spark-sql> select * from test_90;
1       2       05
1       2       05
Time taken: 1.316 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90; 
dt=05 
Time taken: 0.201 seconds, Fetched 1 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
1   2       05
Time taken: 0.212 seconds, Fetched 2 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
== Physical Plan ==
Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, 
[a, b]
+- LocalTableScan [a#116, b#117]
Time taken: 1.145 seconds, Fetched 1 row(s){code}
In Spark 3.2.0, it generates two paths:
{code:java}
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 
hdfs://test5/user/hive/db1/test_90/dt=5 

// result
spark-sql> select * from test_90;
1       2       05
1       2       5
Time taken: 2.119 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90;
dt=05
dt=5
Time taken: 0.161 seconds, Fetched 2 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
Time taken: 0.252 seconds, Fetched 1 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
plan
== Physical Plan ==
Execute InsertIntoHiveTable `db1`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
+- LocalTableScan [a#109, b#110]{code}
This causes problems when reading the data after a user switches to Spark 3. The 
root cause is that during partition-spec resolution, Spark 3 forcibly converts the 
partition value, which causes `05` to lose its leading `0`.

So I think we have two solutions:

One is to document the risk clearly in the migration guide; the other is to fix 
this case so that a string-typed partition column keeps its value as a string, 
regardless of whether single or double quotation marks are used.
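A tiny Scala illustration of the leading-zero loss described above (just the string/number round trip, not Spark's actual partition-resolution code): once the unquoted partition value is treated as a number, the original text cannot be recovered.

{code:java}
// The unquoted literal dt=05 is parsed as a number and rendered back as a string.
val raw = "05"
val asNumber = raw.toInt          // 5
val rendered = asNumber.toString  // "5" -- the leading zero is gone
assert(rendered != raw)           // "5" != "05", so a second partition path appears
{code}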

 

 

  was:
While upgrading from Spark 2.4 to Spark 3.2, we read the migration guide carefully 
and found a case that it does not cover:

 
{code:java}
// code placeholder

create table if not exists test_90(a string, b string) partitioned by (dt 
string);
desc formatted test_90;
// case1
insert into table test_90 partition (dt=05) values("1","2");
// case2
insert into table test_90 partition (dt='05') values("1","2");
drop table test_90;{code}
In Spark 2.4.3, this generates a single path:

 

 
{code:java}
// code placeholder
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 

//result
spark-sql> select * from test_90;
1       2       05
1       2       05
Time taken: 1.316 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90; 
dt=05 
Time taken: 0.201 seconds, Fetched 1 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
1   2       05
Time taken: 0.212 seconds, Fetched 2 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
== Physical Plan ==
Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, 
[a, b]
+- LocalTableScan [a#116, b#117]
Time taken: 1.145 seconds, Fetched 1 row(s){code}
In Spark 3.2.0, it generates two paths:
{code:java}
// code placeholder
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 
hdfs://test5/user/hive/db1/test_90/dt=5 

// result
spark-sql> select * from test_90;
1       2       05
1       2       5
Time taken: 2.119 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90;
dt=05
dt=5
Time taken: 0.161 seconds, Fetched 2 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
Time taken: 0.252 seconds, Fetched 1 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
plan
== Physical Plan ==
Execute InsertIntoHiveTable `db1`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
+- LocalTableScan [a#109, b#110]{code}
This causes problems when reading the data after a user switches to Spark 3. The 
root cause is that during partition-spec resolution, Spark 3 forcibly converts the 
partition value, which causes `05` to lose its leading `0`.

[jira] [Updated] (SPARK-41982) When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`

2023-01-11 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41982:

Description: 
While upgrading from Spark 2.4 to Spark 3.2, we read the migration guide carefully 
and found a case that it does not cover:

 
{code:java}
// code placeholder

create table if not exists test_90(a string, b string) partitioned by (dt 
string);
desc formatted test_90;
// case1
insert into table test_90 partition (dt=05) values("1","2");
// case2
insert into table test_90 partition (dt='05') values("1","2");
drop table test_90;{code}
In Spark 2.4.3, this generates a single path:

 

 
{code:java}
// code placeholder
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 

//result
spark-sql> select * from test_90;
1       2       05
1       2       05
Time taken: 1.316 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90; 
dt=05 
Time taken: 0.201 seconds, Fetched 1 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
1   2       05
Time taken: 0.212 seconds, Fetched 2 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
== Physical Plan ==
Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, 
[a, b]
+- LocalTableScan [a#116, b#117]
Time taken: 1.145 seconds, Fetched 1 row(s){code}
In Spark 3.2.0, it generates two paths:
{code:java}
// code placeholder
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 
hdfs://test5/user/hive/db1/test_90/dt=5 

// result
spark-sql> select * from test_90;
1       2       05
1       2       5
Time taken: 2.119 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90;
dt=05
dt=5
Time taken: 0.161 seconds, Fetched 2 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
Time taken: 0.252 seconds, Fetched 1 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
plan
== Physical Plan ==
Execute InsertIntoHiveTable `db1`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
+- LocalTableScan [a#109, b#110]{code}
This causes problems when reading the data after a user switches to Spark 3. The 
root cause is that during partition-spec resolution, Spark 3 forcibly converts the 
partition value, which causes `05` to lose its leading `0`.

So I think we have two solutions:

One is to document the risk clearly in the migration guide; the other is to fix 
this case so that a string-typed partition column keeps its value as a string, 
regardless of whether single or double quotation marks are used.

 

 

  was:
While upgrading from Spark 2.4 to Spark 3.2, we read the migration guide carefully 
and found a case that it does not cover:

 
{code:java}
// code placeholder

create table if not exists test_90(a string, b string) partitioned by (dt 
string);
desc formatted test_90;
// case1
insert into table test_90 partition (dt=05) values("1","2");
// case2
insert into table test_90 partition (dt='05') values("1","2");
drop table test_90;{code}
In Spark 2.4.3, this generates a single path:

 

 
{code:java}
// code placeholder
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 

//result
spark-sql> select * from test_90;
1       2       05
1       2       05
Time taken: 1.316 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90; 
dt=05 
Time taken: 0.201 seconds, Fetched 1 row(s)

spark-sql> select * from bigdata_qa.test_90 where dt='05';
1       2       05
1   2       05
Time taken: 0.212 seconds, Fetched 2 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
== Physical Plan ==
Execute InsertIntoHiveTable InsertIntoHiveTable `bigdata_qa`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, 
[a, b]
+- LocalTableScan [a#116, b#117]
Time taken: 1.145 seconds, Fetched 1 row(s){code}
In Spark 3.2.0, it generates two paths:
{code:java}
// code placeholder
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 
hdfs://test5/user/hive/db1/test_90/dt=5 

// result
spark-sql> select * from test_90;
1       2       05
1       2       5
Time taken: 2.119 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90;
dt=05
dt=5
Time taken: 0.161 seconds, Fetched 2 row(s)

spark-sql> select * from bigdata_qa.test_90 where dt='05';
1       2       05
Time taken: 0.252 seconds, Fetched 1 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
plan
== Physical Plan ==
Execute InsertIntoHiveTable `bigdata_qa`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
+- LocalTableScan [a#109, b#110]{code}
This causes problems when reading the data after a user switches to Spark 3. The 
root cause is that during partition-spec resolution, Spark 3 forcibly converts the 
partition value, which causes `05` to lose its leading `0`.

[jira] [Updated] (SPARK-41982) When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`

2023-01-11 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41982:

Description: 
While upgrading from Spark 2.4 to Spark 3.2, we read the migration guide carefully 
and found a case that it does not cover:

 
{code:java}
// code placeholder

create table if not exists test_90(a string, b string) partitioned by (dt 
string);
desc formatted test_90;
// case1
insert into table test_90 partition (dt=05) values("1","2");
// case2
insert into table test_90 partition (dt='05') values("1","2");
drop table test_90;{code}
In Spark 2.4.3, this generates a single path:

 

 
{code:java}
// code placeholder
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 

//result
spark-sql> select * from test_90;
1       2       05
1       2       05
Time taken: 1.316 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90; 
dt=05 
Time taken: 0.201 seconds, Fetched 1 row(s)

spark-sql> select * from bigdata_qa.test_90 where dt='05';
1       2       05
1   2       05
Time taken: 0.212 seconds, Fetched 2 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
== Physical Plan ==
Execute InsertIntoHiveTable InsertIntoHiveTable `bigdata_qa`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, 
[a, b]
+- LocalTableScan [a#116, b#117]
Time taken: 1.145 seconds, Fetched 1 row(s){code}
In Spark 3.2.0, it generates two paths:
{code:java}
// code placeholder
// the path
hdfs://test5/user/hive/db1/test_90/dt=05 
hdfs://test5/user/hive/db1/test_90/dt=5 

// result
spark-sql> select * from test_90;
1       2       05
1       2       5
Time taken: 2.119 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90;
dt=05
dt=5
Time taken: 0.161 seconds, Fetched 2 row(s)

spark-sql> select * from bigdata_qa.test_90 where dt='05';
1       2       05
Time taken: 0.252 seconds, Fetched 1 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
plan
== Physical Plan ==
Execute InsertIntoHiveTable `bigdata_qa`.`test_90`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
+- LocalTableScan [a#109, b#110]{code}
This causes problems when reading the data after a user switches to Spark 3. The 
root cause is that during partition-spec resolution, Spark 3 forcibly converts the 
partition value, which causes `05` to lose its leading `0`.

So I think we have two solutions:

One is to document the risk clearly in the migration guide; the other is to fix 
this case so that a string-typed partition column keeps its value as a string, 
regardless of whether single or double quotation marks are used.

 

 

> When the inserted partition type is of string type, similar `dt=01` will be 
> converted to `dt=1`
> ---
>
> Key: SPARK-41982
> URL: https://issues.apache.org/jira/browse/SPARK-41982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: jingxiong zhong
>Priority: Critical
>
> While upgrading from Spark 2.4 to Spark 3.2, we read the migration guide 
> carefully and found a case that it does not cover:
>  
> {code:java}
> // code placeholder
> create table if not exists test_90(a string, b string) partitioned by (dt 
> string);
> desc formatted test_90;
> // case1
> insert into table test_90 partition (dt=05) values("1","2");
> // case2
> insert into table test_90 partition (dt='05') values("1","2");
> drop table test_90;{code}
> In Spark 2.4.3, this generates a single path:
>  
>  
> {code:java}
> // code placeholder
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05 
> //result
> spark-sql> select * from test_90;
> 1       2       05
> 1       2       05
> Time taken: 1.316 seconds, Fetched 2 row(s)
> spark-sql> show partitions test_90; 
> dt=05 
> Time taken: 0.201 seconds, Fetched 1 row(s)
> spark-sql> select * from bigdata_qa.test_90 where dt='05';
> 1       2       05
> 1   2       05
> Time taken: 0.212 seconds, Fetched 2 row(s)
> spark-sql> explain insert into table test_90 partition (dt=05) 
> values("1","2");
> == Physical Plan ==
> Execute InsertIntoHiveTable InsertIntoHiveTable `bigdata_qa`.`test_90`, 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, 
> [a, b]
> +- LocalTableScan [a#116, b#117]
> Time taken: 1.145 seconds, Fetched 1 row(s){code}
> In Spark 3.2.0, it generates two paths:
> {code:java}
> // code placeholder
> // the path
> hdfs://test5/user/hive/db1/test_90/dt=05 
> hdfs://test5/user/hive/db1/test_90/dt=5 
> // result
> spark-sql> select * from test_90;
> 1       2       05
> 1       2       5
> Time taken: 2.119 seconds, Fetched 2 row(s)
> spark-sql> 

[jira] [Created] (SPARK-41982) When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`

2023-01-11 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-41982:
---

 Summary: When the inserted partition type is of string type, 
similar `dt=01` will be converted to `dt=1`
 Key: SPARK-41982
 URL: https://issues.apache.org/jira/browse/SPARK-41982
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: jingxiong zhong









[jira] [Created] (SPARK-41943) Use java api to create files and grant permissions in DiskBlockManager

2023-01-08 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-41943:
---

 Summary: Use java api to create files and grant permissions in 
DiskBlockManager
 Key: SPARK-41943
 URL: https://issues.apache.org/jira/browse/SPARK-41943
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: jingxiong zhong


For the method {{createDirWithPermission770}}, use the Java API to create the 
directory and grant permissions instead of calling shell commands.
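A minimal Scala sketch of what this could look like with the standard java.nio.file API (a sketch under the assumption of a POSIX filesystem, not the actual DiskBlockManager change):

{code:java}
import java.nio.file.{Files, Path, Paths}
import java.nio.file.attribute.PosixFilePermissions

// Create a directory and set rwxrwx--- (770) without shelling out to chmod.
def createDirWithPermission770(dir: Path): Path = {
  val created = Files.createDirectories(dir)
  // Set permissions explicitly after creation so the process umask cannot mask them out.
  Files.setPosixFilePermissions(created, PosixFilePermissions.fromString("rwxrwx---"))
  created
}

val dir = createDirWithPermission770(Paths.get("/tmp/blockmgr-example"))
{code}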






[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2023-01-02 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653733#comment-17653733
 ] 

jingxiong zhong commented on SPARK-37677:
-

I have fixed this on the Hadoop side for version 3.3.5, but that release is not 
out yet. Spark will need to upgrade its Hadoop dependency to pick up the 
fix. [~valux] 

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> In cluster mode I have another problem: after python3.6.6.zip is unpacked in the 
> pod, the Python binary has no execute permission. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}






[jira] [Resolved] (SPARK-37521) insert overwrite table but the partition information stored in Metastore was not changed

2023-01-02 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong resolved SPARK-37521.
-
Resolution: Won't Fix

> insert overwrite table but the partition information stored in Metastore was 
> not changed
> 
>
> Key: SPARK-37521
> URL: https://issues.apache.org/jira/browse/SPARK-37521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hive2.3.9
> metastore2.3.9
>Reporter: jingxiong zhong
>Priority: Major
>
> I create a partitioned table in Spark SQL, insert a few rows, add a regular 
> column, and finally insert new rows into the partitions. The query works fine in 
> Spark SQL, but in Hive 2.3.9 the newly added column is returned as NULL.
> For example:
> create table updata_col_test1(a int) partitioned by (dt string); 
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
> insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
> alter table updata_col_test1 add columns (b int);
> insert overwrite table updata_col_test1 partition(dt) values(1, 2, 
> '20200101'); -- fails
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 
> 2); -- fails
> insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 
> 2); -- succeeds






[jira] [Resolved] (SPARK-41769) Remove useless semicolons

2023-01-01 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong resolved SPARK-41769.
-
Resolution: Won't Fix

> Remove useless semicolons
> -
>
> Key: SPARK-41769
> URL: https://issues.apache.org/jira/browse/SPARK-41769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jingxiong zhong
>Priority: Trivial
>







[jira] [Updated] (SPARK-41769) Remove useless semicolons

2022-12-29 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41769:

Component/s: (was: Spark Core)

> Remove useless semicolons
> -
>
> Key: SPARK-41769
> URL: https://issues.apache.org/jira/browse/SPARK-41769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jingxiong zhong
>Priority: Trivial
>







[jira] [Created] (SPARK-41769) Remove useless semicolons

2022-12-29 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-41769:
---

 Summary: Remove useless semicolons
 Key: SPARK-41769
 URL: https://issues.apache.org/jira/browse/SPARK-41769
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.4.0
Reporter: jingxiong zhong









[jira] [Commented] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-24 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638532#comment-17638532
 ] 

jingxiong zhong commented on SPARK-41236:
-

I think you can raise a PR for it [~huldar] 

> The renamed field name cannot be recognized after group filtering
> -
>
> Key: SPARK-41236
> URL: https://issues.apache.org/jira/browse/SPARK-41236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> {code:java}
> select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 
> {code}
> A simple SQL query: it works in Spark 2.4 but fails in Spark 3.2.0.
> Is this a bug or a new standard behavior?
> h3. *like this:*
> {code:sql}
> create table db1.table1(age int, name string);
> insert into db1.table1 values(1, 'a');
> insert into db1.table1 values(2, 'b');
> insert into db1.table1 values(3, 'c');
> --then run sql like this 
> select collect_set(age) as age from db1.table1 group by name having size(age) 
> > 1 ;
> {code}
> h3. Stack Information
> org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
> columns: [age]; line 4 pos 12;
> 'Filter (size('age, true) > 1)
> +- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
>+- SubqueryAlias spark_catalog.db1.table1
>   +- HiveTableRelation [`db1`.`table1`, 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
> Partition Cols: []]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
>   at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
>   at 
> 

[jira] [Comment Edited] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-24 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638532#comment-17638532
 ] 

jingxiong zhong edited comment on SPARK-41236 at 11/25/22 7:14 AM:
---

I think you can raise a pr for it [~huldar] 


was (Author: JIRAUSER281124):
I think you can a pr for it [~huldar] 

> The renamed field name cannot be recognized after group filtering
> -
>
> Key: SPARK-41236
> URL: https://issues.apache.org/jira/browse/SPARK-41236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> {code:java}
> select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 
> {code}
> A simple SQL query: it works in Spark 2.4 but fails in Spark 3.2.0.
> Is this a bug or a new standard behavior?
> h3. *like this:*
> {code:sql}
> create table db1.table1(age int, name string);
> insert into db1.table1 values(1, 'a');
> insert into db1.table1 values(2, 'b');
> insert into db1.table1 values(3, 'c');
> --then run sql like this 
> select collect_set(age) as age from db1.table1 group by name having size(age) 
> > 1 ;
> {code}
> h3. Stack Information
> org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
> columns: [age]; line 4 pos 12;
> 'Filter (size('age, true) > 1)
> +- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
>+- SubqueryAlias spark_catalog.db1.table1
>   +- HiveTableRelation [`db1`.`table1`, 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
> Partition Cols: []]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
>   at 
> org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
>   at 
> 

[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-24 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Description: 
{code:java}
select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 
{code}

A simple SQL query: it works in Spark 2.4 but fails in Spark 3.2.0.
Is this a bug or a new standard behavior?

h3. *like this:*

{code:sql}
create table db1.table1(age int, name string);

insert into db1.table1 values(1, 'a');

insert into db1.table1 values(2, 'b');

insert into db1.table1 values(3, 'c');

--then run sql like this 
select collect_set(age) as age from db1.table1 group by name having size(age) > 
1 ;

{code}
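Not part of the original report, but for reference: one possible workaround is to repeat the aggregate in the HAVING clause (`having size(collect_set(age)) > 1`); another is the DataFrame API, where the aggregated alias can be referenced in a later filter. A sketch under the assumption that the db1.table1 table from the reproduction exists in the session catalog:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, size}

val spark = SparkSession.builder()
  .appName("having-alias-workaround")
  .enableHiveSupport()
  .getOrCreate()

val result = spark.table("db1.table1")
  .groupBy("name")
  .agg(collect_set(col("age")).as("age"))  // alias the collected set as "age"
  .where(size(col("age")) > 1)             // the alias resolves against the aggregated output

result.show()
{code}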



h3. Stack Information
org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
columns: [age]; line 4 pos 12;
'Filter (size('age, true) > 1)
+- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
   +- SubqueryAlias spark_catalog.db1.table1
  +- HiveTableRelation [`db1`.`table1`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
Partition Cols: []]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:172)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:196)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:192)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88)
at 

[jira] [Updated] (SPARK-41229) When using `db_ name.temp_ table_name`, an exception will be thrown

2022-11-24 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41229:

Description: 
SQL1:
{code:java}
with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1; {code}
 

It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
db1.table_hive1;`, but Spark 2.4.3 works fine.

SQL2:
{code:java}
with table_hive1 as(select * from db1.table_hive)
select * from table_hive1; {code}
It works fine.
I'm a little confused: is qualifying the CTE name with a database name no longer 
supported?

you can run like this:
{code:java}
create table db1.table_hive(age int, name string);
insert into db1.table_hive values(1, 'a');
insert into db1.table_hive values(2, 'b');
insert into db1.table_hive values(3, 'c');
with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1; {code}

  was:
SQL1:
```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```

It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
db1.table_hive1;`, but Spark 2.4.3 works fine.

SQL2:
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```

It works fine.
I'm a little confused: is qualifying the CTE name with a database name no longer 
supported?

you can run like this:

create table db1.table_hive(age int, name string);

insert into db1.table_hive values(1, 'a');

insert into db1.table_hive values(2, 'b');

insert into db1.table_hive values(3, 'c');

with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;


> When using `db_ name.temp_ table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Major
>
> SQL1:
> {code:java}
> with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1; {code}
>  
> It throws `org.apache.spark.sql.AnalysisException: Table or view not found: 
> db1.table_hive1;`, but Spark 2.4.3 works fine.
> SQL2:
> {code:java}
> with table_hive1 as(select * from db1.table_hive)
> select * from table_hive1; {code}
> It works fine.
> I'm a little confused: is qualifying the CTE name with a database name no longer 
> supported?
> you can run like this:
> {code:java}
> create table db1.table_hive(age int, name string);
> insert into db1.table_hive values(1, 'a');
> insert into db1.table_hive values(2, 'b');
> insert into db1.table_hive values(3, 'c');
> with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1; {code}






[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-24 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Description: 
{code:java}
select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 
{code}

A simple SQL query: it works in Spark 2.4 but fails in Spark 3.2.0.
Is this a bug or a new standard behavior?

h3. *like this:*

{code:sql}
create table db1.table1(age int, name string);

insert into db1.table1 values(1, 'a');

insert into db1.table1 values(2, 'b');

insert into db1.table1 values(3, 'c');

--then run sql like this 
select collect_set(age) as age from db1.table1 group by name having size(age) > 
1 ;

{code}



h3. Stack Information
org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
columns: [age]; line 4 pos 12;
'Filter (size('age, true) > 1)
+- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
   +- SubqueryAlias spark_catalog.bigdata_qa.table1
  +- HiveTableRelation [`bigdata_qa`.`table1`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
Partition Cols: []]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:172)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:196)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:192)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88)
at 

[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-24 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Description: 
{code:java}
select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 
{code}

A simple SQL query: it works in Spark 2.4 but fails in Spark 3.2.0.
Is this a bug or a new standard behavior?

h3. *like this:*

{code:sql}
create table db1.table1(age int, name string);

insert into db1.table1 values(1, 'a');

insert into db1.table1 values(2, 'b');

insert into db1.table1 values(3, 'c');

{code}

then run sql like this `select collect_set(age) as age from db1.table1 group by 
name having size(age) > 1 ;`

h3. Stack Information
org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
columns: [age]; line 4 pos 12;
'Filter (size('age, true) > 1)
+- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
   +- SubqueryAlias spark_catalog.bigdata_qa.table1
  +- HiveTableRelation [`bigdata_qa`.`table1`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
Partition Cols: []]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:172)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:196)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:192)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88)
at 

[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-24 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Description: 

{code:sql}
select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 
{code}

This simple SQL works well in Spark 2.4 but doesn't work in Spark 3.2.0.
Is it a bug or a new standard?

h3. *like this:*
spark-sql> create table db1.table1(age int, name string);
Time taken: 1.709 seconds
spark-sql> insert into db1.table1 values(1, 'a');
Time taken: 2.114 seconds
spark-sql> insert into db1.table1 values(2, 'b');
Time taken: 10.208 seconds
spark-sql> insert into db1.table1 values(3, 'c');
Time taken: 0.673 seconds
then run sql like this `select collect_set(age) as age from db1.table1 group by 
name having size(age) > 1 ;`

h3. Stack Information
org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
columns: [age]; line 4 pos 12;
'Filter (size('age, true) > 1)
+- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
   +- SubqueryAlias spark_catalog.bigdata_qa.table1
  +- HiveTableRelation [`bigdata_qa`.`table1`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
Partition Cols: []]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:172)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:196)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:192)
at 

[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-24 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Description: 
`select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 `
This simple SQL works well in Spark 2.4 but doesn't work in Spark 3.2.0.
Is it a bug or a new standard?

h3. *like this:*
spark-sql> create table db1.table1(age int, name string);
Time taken: 1.709 seconds
spark-sql> insert into db1.table1 values(1, 'a');
Time taken: 2.114 seconds
spark-sql> insert into db1.table1 values(2, 'b');
Time taken: 10.208 seconds
spark-sql> insert into db1.table1 values(3, 'c');
Time taken: 0.673 seconds
then run sql like this `select collect_set(age) as age from db1.table1 group by 
name having size(age) > 1 ;`

h3. Stack Information
org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
columns: [age]; line 4 pos 12;
'Filter (size('age, true) > 1)
+- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
   +- SubqueryAlias spark_catalog.bigdata_qa.table1
  +- HiveTableRelation [`bigdata_qa`.`table1`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
Partition Cols: []]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:172)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:196)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:192)
at 

[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-23 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Description: 
`select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 `
This simple SQL works well in Spark 2.4 but doesn't work in Spark 3.2.0.
Is it a bug or a new standard?

like this:
spark-sql> create table db1.table1(age int, name string);
Time taken: 1.709 seconds
spark-sql> insert into db1.table1 values(1, 'a');
Time taken: 2.114 seconds
spark-sql> insert into db1.table1 values(2, 'b');
Time taken: 10.208 seconds
spark-sql> insert into db1.table1 values(3, 'c');
Time taken: 0.673 seconds
then run sql like this `select collect_set(age) as age from db1.table1 group by 
name having size(age) > 1 ;`

Stack Information
org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input 
columns: [age]; line 4 pos 12;
'Filter (size('age, true) > 1)
+- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
   +- SubqueryAlias spark_catalog.bigdata_qa.table1
  +- HiveTableRelation [`bigdata_qa`.`table1`, 
org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], 
Partition Cols: []]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
at 
org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:172)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:196)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:192)
at 

[jira] [Commented] (SPARK-41229) When using `db_ name.temp_ table_name`, an exception will be thrown

2022-11-23 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638148#comment-17638148
 ] 

jingxiong zhong commented on SPARK-41229:
-

Sorry, :( I don't know what a self-contained reproducer is.

I added SQL above that can reproduce such errors.

[~hyukjin.kwon] 

> When using `db_ name.temp_ table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Major
>
> SQL1:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1;```
> It will throw exception `org.apache.spark.sql.AnalysisException: Table or 
> view not found: db1.table_hive1;`but spark in 2.4.3 work well.
> SQL2:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from table_hive1;```
> It work well.
> I'm a little confused. Is this syntax with database name not supported.
> you can run like this:
> create db1.table_hive(age int, name string);
> insert into db1.table_hive values(1, 'a');
> insert into db1.table_hive values(2, 'b');
> insert into db1.table_hive values(3, 'c');
> with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41229) When using `db_ name.temp_ table_name`, an exception will be thrown

2022-11-23 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41229:

Description: 
SQL1:
```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```

It throws the exception `org.apache.spark.sql.AnalysisException: Table or view 
not found: db1.table_hive1;`, but Spark 2.4.3 works well.

SQL2:
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```

It works well.
I'm a little confused. Is this syntax with a database name not supported?

you can run like this:

create table db1.table_hive(age int, name string);

insert into db1.table_hive values(1, 'a');

insert into db1.table_hive values(2, 'b');

insert into db1.table_hive values(3, 'c');

with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;

  was:
SQL1:
```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```

It will throw exception `org.apache.spark.sql.AnalysisException: Table or view 
not found: db1.table_hive1;`but spark in 2.4.3 work well.

SQL2:
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```

It work well.
I'm a little confused. Is this syntax with database name not supported.

you can run like this:

create db1.table_hive(age int, name string);

insert into db1.table_hive values(1, 'a');

insert into db1.table_hive values(2, 'b');

insert into db1.table_hive values(3, 'c');

`with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;`


> When using `db_ name.temp_ table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Major
>
> SQL1:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1;```
> It will throw exception `org.apache.spark.sql.AnalysisException: Table or 
> view not found: db1.table_hive1;`but spark in 2.4.3 work well.
> SQL2:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from table_hive1;```
> It work well.
> I'm a little confused. Is this syntax with database name not supported.
> you can run like this:
> create db1.table_hive(age int, name string);
> insert into db1.table_hive values(1, 'a');
> insert into db1.table_hive values(2, 'b');
> insert into db1.table_hive values(3, 'c');
> with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41229) When using `db_ name.temp_ table_name`, an exception will be thrown

2022-11-23 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41229:

Description: 
SQL1:
```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```

It throws the exception `org.apache.spark.sql.AnalysisException: Table or view 
not found: db1.table_hive1;`, but Spark 2.4.3 works well.

SQL2:
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```

It works well.
I'm a little confused. Is this syntax with a database name not supported?

you can run like this:

create table db1.table_hive(age int, name string);

insert into db1.table_hive values(1, 'a');

insert into db1.table_hive values(2, 'b');

insert into db1.table_hive values(3, 'c');

`with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;`

  was:
SQL1:
```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```

It will throw exception `org.apache.spark.sql.AnalysisException: Table or view 
not found: db1.table_hive1;`but spark in 2.4.3 work well.

SQL2:
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```

It work well.
I'm a little confused. Is this syntax with database name not supported.




> When using `db_ name.temp_ table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Major
>
> SQL1:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1;```
> It will throw exception `org.apache.spark.sql.AnalysisException: Table or 
> view not found: db1.table_hive1;`but spark in 2.4.3 work well.
> SQL2:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from table_hive1;```
> It work well.
> I'm a little confused. Is this syntax with database name not supported.
> you can run like this:
> create db1.table_hive(age int, name string);
> insert into db1.table_hive values(1, 'a');
> insert into db1.table_hive values(2, 'b');
> insert into db1.table_hive values(3, 'c');
> `with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1;`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-23 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Description: 
`select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 `
This simple SQL works well in Spark 2.4 but doesn't work in Spark 3.2.0.
Is it a bug or a new standard?

like this:
spark-sql> create table db1.table1(age int, name string);
Time taken: 1.709 seconds
spark-sql> insert into db1.table1 values(1, 'a');
Time taken: 2.114 seconds
spark-sql> insert into db1.table1 values(2, 'b');
Time taken: 10.208 seconds
spark-sql> insert into db1.table1 values(3, 'c');
Time taken: 0.673 seconds
then run sql like this `select collect_set(age) as age from db1.table1 group by 
name having size(age) > 1 ;`


  was:
`select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 `
a simple sql, it work well in spark2.4, but doesn't work in spark3.2.0
Is it a bug or a new standard?

like this:
spark-sql> create db1.table1(age int, name string);
Time taken: 1.709 seconds
spark-sql> insert into db1.table1 values(1, 'a');
Time taken: 2.114 seconds
spark-sql> insert into db1.table1 values(2, 'b');
Time taken: 10.208 seconds
spark-sql> insert into db1.table1 values(3, 'c');
Time taken: 0.673 seconds
spark-sql> select collect_set(age) as age
 > from db1.table1
 > group by name
 > having size(age) > 1 ;



> The renamed field name cannot be recognized after group filtering
> -
>
> Key: SPARK-41236
> URL: https://issues.apache.org/jira/browse/SPARK-41236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> `select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 `
> a simple sql, it work well in spark2.4, but doesn't work in spark3.2.0
> Is it a bug or a new standard?
> like this:
> spark-sql> create db1.table1(age int, name string);
> Time taken: 1.709 seconds
> spark-sql> insert into db1.table1 values(1, 'a');
> Time taken: 2.114 seconds
> spark-sql> insert into db1.table1 values(2, 'b');
> Time taken: 10.208 seconds
> spark-sql> insert into db1.table1 values(3, 'c');
> Time taken: 0.673 seconds
> then run sql like this `select collect_set(age) as age from db1.table1 group 
> by name having size(age) > 1 ;`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-23 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Description: 
`select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 `
This simple SQL works well in Spark 2.4 but doesn't work in Spark 3.2.0.
Is it a bug or a new standard?

like this:
spark-sql> create table db1.table1(age int, name string);
Time taken: 1.709 seconds
spark-sql> insert into db1.table1 values(1, 'a');
Time taken: 2.114 seconds
spark-sql> insert into db1.table1 values(2, 'b');
Time taken: 10.208 seconds
spark-sql> insert into db1.table1 values(3, 'c');
Time taken: 0.673 seconds
spark-sql> select collect_set(age) as age
 > from db1.table1
 > group by name
 > having size(age) > 1 ;


  was:
`select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 `
a simple sql, it work well in spark2.4, but doesn't work in spark3.2.0
Is it a bug or a new standard?

like this:
spark-sql> create db1.table1(age int, name string);
Time taken: 1.709 seconds
spark-sql> insert into db1.table1 values(1, 'a');
Time taken: 2.114 seconds
spark-sql> insert into db1.table1 values(2, 'b');
Time taken: 10.208 seconds
spark-sql> insert into db1.table1 values(3, 'c');
Time taken: 0.673 seconds
spark-sql> select collect_set(age) as age
 > from db1.table1
 > group by name
 > having size(age) > 1 ;
Time taken: 3.022 seconds


> The renamed field name cannot be recognized after group filtering
> -
>
> Key: SPARK-41236
> URL: https://issues.apache.org/jira/browse/SPARK-41236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> `select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 `
> a simple sql, it work well in spark2.4, but doesn't work in spark3.2.0
> Is it a bug or a new standard?
> like this:
> spark-sql> create db1.table1(age int, name string);
> Time taken: 1.709 seconds
> spark-sql> insert into db1.table1 values(1, 'a');
> Time taken: 2.114 seconds
> spark-sql> insert into db1.table1 values(2, 'b');
> Time taken: 10.208 seconds
> spark-sql> insert into db1.table1 values(3, 'c');
> Time taken: 0.673 seconds
> spark-sql> select collect_set(age) as age
>  > from db1.table1
>  > group by name
>  > having size(age) > 1 ;



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-23 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638142#comment-17638142
 ] 

jingxiong zhong commented on SPARK-41236:
-

Thanks a lot. Is this the right way to write the specific case? [~hyukjin.kwon]

> The renamed field name cannot be recognized after group filtering
> -
>
> Key: SPARK-41236
> URL: https://issues.apache.org/jira/browse/SPARK-41236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> `select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 `
> a simple sql, it work well in spark2.4, but doesn't work in spark3.2.0
> Is it a bug or a new standard?
> like this:
> spark-sql> create db1.table1(age int, name string);
> Time taken: 1.709 seconds
> spark-sql> insert into db1.table1 values(1, 'a');
> Time taken: 2.114 seconds
> spark-sql> insert into db1.table1 values(2, 'b');
> Time taken: 10.208 seconds
> spark-sql> insert into db1.table1 values(3, 'c');
> Time taken: 0.673 seconds
> spark-sql> select collect_set(age) as age
>  > from db1.table1
>  > group by name
>  > having size(age) > 1 ;
> Time taken: 3.022 seconds



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-23 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Description: 
`select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 `
This simple SQL works well in Spark 2.4 but doesn't work in Spark 3.2.0.
Is it a bug or a new standard?

like this:
spark-sql> create table db1.table1(age int, name string);
Time taken: 1.709 seconds
spark-sql> insert into db1.table1 values(1, 'a');
Time taken: 2.114 seconds
spark-sql> insert into db1.table1 values(2, 'b');
Time taken: 10.208 seconds
spark-sql> insert into db1.table1 values(3, 'c');
Time taken: 0.673 seconds
spark-sql> select collect_set(age) as age
 > from db1.table1
 > group by name
 > having size(age) > 1 ;
Time taken: 3.022 seconds

  was:
`select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 `
a simple sql, it work well in spark2.4, but doesn't work in spark3.2.0
Is it a bug or a new standard?


> The renamed field name cannot be recognized after group filtering
> -
>
> Key: SPARK-41236
> URL: https://issues.apache.org/jira/browse/SPARK-41236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> `select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 `
> a simple sql, it work well in spark2.4, but doesn't work in spark3.2.0
> Is it a bug or a new standard?
> like this:
> spark-sql> create db1.table1(age int, name string);
> Time taken: 1.709 seconds
> spark-sql> insert into db1.table1 values(1, 'a');
> Time taken: 2.114 seconds
> spark-sql> insert into db1.table1 values(2, 'b');
> Time taken: 10.208 seconds
> spark-sql> insert into db1.table1 values(3, 'c');
> Time taken: 0.673 seconds
> spark-sql> select collect_set(age) as age
>  > from db1.table1
>  > group by name
>  > having size(age) > 1 ;
> Time taken: 3.022 seconds



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41236) The renamed field name cannot be recognized after group filtering

2022-11-23 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41236:

Summary: The renamed field name cannot be recognized after group filtering  
(was: The renamed field name cannot be recognized after group filtering, but it 
is the same as the original field name)

> The renamed field name cannot be recognized after group filtering
> -
>
> Key: SPARK-41236
> URL: https://issues.apache.org/jira/browse/SPARK-41236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Blocker
>
> `select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 `
> a simple sql, it work well in spark2.4, but doesn't work in spark3.2.0
> Is it a bug or a new standard?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41236) The renamed field name cannot be recognized after group filtering, but it is the same as the original field name

2022-11-23 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-41236:
---

 Summary: The renamed field name cannot be recognized after group 
filtering, but it is the same as the original field name
 Key: SPARK-41236
 URL: https://issues.apache.org/jira/browse/SPARK-41236
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: jingxiong zhong


`select collect_set(age) as age
from db_table.table1
group by name
having size(age) > 1 `
This simple SQL works well in Spark 2.4 but doesn't work in Spark 3.2.0.
Is it a bug or a new standard?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41229) When using `db_ name.temp_ table_name`, an exception will be thrown

2022-11-22 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41229:

Description: 
SQL1:
```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```

It throws the exception `org.apache.spark.sql.AnalysisException: Table or view 
not found: db1.table_hive1;`, but Spark 2.4.3 works well.

SQL2:
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```

It works well.
I'm a little confused. Is this syntax with a database name not supported?



  was:
SQL1:
```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```

It will throw exception `org.apache.spark.sql.AnalysisException: Table or view 
not found: bigdata_qa.zjx_hive1;`but spark in 2.4.3 work well.

SQL2:
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```

It work well.
I'm a little confused. Is this syntax with database name not supported.




> When using `db_ name.temp_ table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Blocker
>
> SQL1:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1;```
> It will throw exception `org.apache.spark.sql.AnalysisException: Table or 
> view not found: db1.table_hive1;`but spark in 2.4.3 work well.
> SQL2:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from table_hive1;```
> It work well.
> I'm a little confused. Is this syntax with database name not supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41229) When using `db_ name.temp_ table_name`, an exception will be thrown

2022-11-22 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637595#comment-17637595
 ] 

jingxiong zhong commented on SPARK-41229:
-

[~cloud_fan] Could you help me with this?

> When using `db_ name.temp_ table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Blocker
>
> SQL1:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1;```
> It will throw exception `org.apache.spark.sql.AnalysisException: Table or 
> view not found: bigdata_qa.zjx_hive1;`but spark in 2.4.3 work well.
> SQL2:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from table_hive1;```
> It work well.
> I'm a little confused. Is this syntax with database name not supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41229) When using `db_ name.temp_ table_name`, an exception will be thrown

2022-11-22 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-41229:

Description: 
SQL1:
```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```

It throws the exception `org.apache.spark.sql.AnalysisException: Table or view 
not found: bigdata_qa.zjx_hive1;`, but Spark 2.4.3 works well.

SQL2:
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```

It works well.
I'm a little confused. Is this syntax with a database name not supported?



  was:
```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```
It will throw exception `org.apache.spark.sql.AnalysisException: Table or view 
not found: bigdata_qa.zjx_hive1;`but spark in 2.4.3 work well.
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```
It work well.
I'm a little confused. Is this syntax with database name not supported.




> When using `db_ name.temp_ table_name`, an exception will be thrown
> ---
>
> Key: SPARK-41229
> URL: https://issues.apache.org/jira/browse/SPARK-41229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive-ms 2.3.9
>Reporter: jingxiong zhong
>Priority: Blocker
>
> SQL1:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from db1.table_hive1;```
> It will throw exception `org.apache.spark.sql.AnalysisException: Table or 
> view not found: bigdata_qa.zjx_hive1;`but spark in 2.4.3 work well.
> SQL2:
> ```with table_hive1 as(select * from db1.table_hive)
> select * from table_hive1;```
> It work well.
> I'm a little confused. Is this syntax with database name not supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41229) When using `db_ name.temp_ table_name`, an exception will be thrown

2022-11-22 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-41229:
---

 Summary: When using `db_ name.temp_ table_name`, an exception will 
be thrown
 Key: SPARK-41229
 URL: https://issues.apache.org/jira/browse/SPARK-41229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
 Environment: spark3.2.0
hadoop2.7.3
hive-ms 2.3.9
Reporter: jingxiong zhong


```with table_hive1 as(select * from db1.table_hive)
select * from db1.table_hive1;```
It throws the exception `org.apache.spark.sql.AnalysisException: Table or view 
not found: bigdata_qa.zjx_hive1;`, but Spark 2.4.3 works well.
```with table_hive1 as(select * from db1.table_hive)
select * from table_hive1;```
It works well.
I'm a little confused. Is this syntax with a database name not supported?
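
For reference, a sketch (not from this ticket) of forms that do resolve: a CTE name is query-scoped, so referencing it without a database prefix works, and if a database-qualified name is really needed, a global temporary view (always registered under the global_temp database) is one option:

{code:sql}
-- Reference the CTE without a database prefix
with table_hive1 as (select * from db1.table_hive)
select * from table_hive1;

-- Alternative: a global temporary view, which lives in the global_temp database
create global temporary view table_hive1 as select * from db1.table_hive;
select * from global_temp.table_hive1;
{code}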





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40916) udf could not filter null value cause npe

2022-10-26 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong resolved SPARK-40916.
-
Resolution: Fixed

 Resolved by adding --conf spark.sql.subexpressionElimination.enabled=false.
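
A sketch of applying the same workaround from a SQL session (assuming the property is honoured as a session-level setting in the deployed version):

{code:sql}
-- Disable common subexpression elimination for the current session only
SET spark.sql.subexpressionElimination.enabled=false;
{code}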

> udf could not filter null value cause npe
> -
>
> Key: SPARK-40916
> URL: https://issues.apache.org/jira/browse/SPARK-40916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hadoop2.7.3
> hive2.3.9
>Reporter: jingxiong zhong
>Priority: Critical
>
> ```
> select
> t22.uid,
> from
> (
> SELECT
> code,
> count(distinct uid) cnt
> FROM
> (
> SELECT
> uid,
> code,
> lng,
> lat
> FROM
> (
> select
>  
> riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)
>  as code,
> uid,
> lng,
> lat,
> dt as event_time 
> from
> (
> select
> param['timestamp'] as dt,
> 
> get_json_object(get_json_object(param['input'],'$.baseInfo'),'$.uid') uid,
> 
> get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lng') lng,
> 
> get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lat') lat 
> from manhattan_ods.ods_log_manhattan_fbi_workflow_result_log
> and 
> get_json_object(get_json_object(param['input'],'$.bizExtents'),'$.productId')='2001'
>  
> )a
> and lng is not null
> and lat is not null
> ) t2
> group by uid,code,lng,lat
> ) t1
> GROUP BY code having count(DISTINCT uid)>=10
> )t11
> join
> (
> SELECT
> uid,
> code,
> lng,
> lat
> FROM
> (
> select
> 
> riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)
>  as code,
> uid,
> lng,
> lat,
> dt as event_time
> from
> (
> select
> param['timestamp'] as dt,
> 
> get_json_object(get_json_object(param['input'],'$.baseInfo'),'$.uid') uid,
> 
> get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lng') lng, 
> 
> get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lat') lat 
> from manhattan_ods.ods_log_manhattan_fbi_workflow_result_log 
> and 
> get_json_object(get_json_object(param['input'],'$.bizExtents'),'$.productId')='2001'
>  
> )a
> and lng is not null
> and lat is not null
> ) t2
> where substr(code,0,6)<>'wx4ey3'
> group by uid,code,lng,lat
> ) t22 on t11.code=t22.code
> group by t22.uid
> ```
> this sql can't run because 
> `riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)`
>  will throw npe(`Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Unable to execute method public java.lang.String 
> com.xiaoju.automarket.GeohashEncode.evaluate(java.lang.Double,java.lang.Double,java.lang.Integer)
>  with arguments {null,null,8}:null`), but I have filter null in my condition, 
> the udf of manhattan_dw.aes_decode will return null if lng or lat is null, 
> *but after I remove `where substr(code,0,6)<>'wx4ey3' `this condition, it can 
> run normally.* 
> complete :
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to 
> execute method public java.lang.String 
> com.xiaoju.automarket.GeohashEncode.evaluate(java.lang.Double,java.lang.Double,java.lang.Integer)
>  with arguments {null,null,8}:null
>   at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1049)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.eval(hiveUDFs.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.subExpr_3$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3(basicPhysicalOperators.scala:275)
>   at 
> org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3$adapted(basicPhysicalOperators.scala:274)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:515)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
>  Source)
>   at 
> 

[jira] [Created] (SPARK-40916) udf could not filter null value cause npe

2022-10-26 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-40916:
---

 Summary: udf could not filter null value cause npe
 Key: SPARK-40916
 URL: https://issues.apache.org/jira/browse/SPARK-40916
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
 Environment: spark3.2.0
hadoop2.7.3
hive2.3.9
Reporter: jingxiong zhong


```
select
t22.uid,
from
(
SELECT
code,
count(distinct uid) cnt
FROM
(
SELECT
uid,
code,
lng,
lat
FROM
(
select
 
riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)
 as code,
uid,
lng,
lat,
dt as event_time 
from
(
select
param['timestamp'] as dt,

get_json_object(get_json_object(param['input'],'$.baseInfo'),'$.uid') uid,

get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lng') lng,

get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lat') lat 
from manhattan_ods.ods_log_manhattan_fbi_workflow_result_log
and 
get_json_object(get_json_object(param['input'],'$.bizExtents'),'$.productId')='2001'
 
)a
and lng is not null
and lat is not null
) t2
group by uid,code,lng,lat
) t1
GROUP BY code having count(DISTINCT uid)>=10
)t11
join
(
SELECT
uid,
code,
lng,
lat
FROM
(
select

riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)
 as code,
uid,
lng,
lat,
dt as event_time
from
(
select
param['timestamp'] as dt,

get_json_object(get_json_object(param['input'],'$.baseInfo'),'$.uid') uid,

get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lng') lng, 

get_json_object(get_json_object(param['input'],'$.envInfo'),'$.lat') lat 
from manhattan_ods.ods_log_manhattan_fbi_workflow_result_log 
and 
get_json_object(get_json_object(param['input'],'$.bizExtents'),'$.productId')='2001'
 
)a
and lng is not null
and lat is not null
) t2
where substr(code,0,6)<>'wx4ey3'
group by uid,code,lng,lat
) t22 on t11.code=t22.code
group by t22.uid
```
This SQL can't run because 
`riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),manhattan_dw.aes_decode(lat),8)`
 throws an NPE (`Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
Unable to execute method public java.lang.String 
com.xiaoju.automarket.GeohashEncode.evaluate(java.lang.Double,java.lang.Double,java.lang.Integer)
 with arguments {null,null,8}:null`), even though I filter out nulls in my 
conditions; the UDF manhattan_dw.aes_decode returns null if lng or lat is null. 
*However, after I remove the condition `where substr(code,0,6)<>'wx4ey3'`, it runs 
normally.*
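
Purely as an illustrative sketch (not verified against this environment), the projection can also be written so the geohash UDF is never invoked with null inputs; here `t` is a hypothetical stand-in for the inner subquery labelled `a` above:

{code:sql}
-- Guard the UDF call so it only runs when both decoded coordinates are non-null
select
  uid, lng, lat,
  case
    when manhattan_dw.aes_decode(lng) is not null
     and manhattan_dw.aes_decode(lat) is not null
    then riskmanage_dw.GEOHASH_ENCODE(manhattan_dw.aes_decode(lng),
                                      manhattan_dw.aes_decode(lat), 8)
  end as code
from t
where lng is not null
  and lat is not null;
{code}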


Complete stack trace:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute 
method public java.lang.String 
com.xiaoju.automarket.GeohashEncode.evaluate(java.lang.Double,java.lang.Double,java.lang.Integer)
 with arguments {null,null,8}:null
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1049)
at org.apache.spark.sql.hive.HiveSimpleUDF.eval(hiveUDFs.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.subExpr_3$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3(basicPhysicalOperators.scala:275)
at 
org.apache.spark.sql.execution.FilterExec.$anonfun$doExecute$3$adapted(basicPhysicalOperators.scala:274)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:515)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)









--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39967) Instead of using the scalar tasksSuccessful, use the successful array to calculate whether the task is completed

2022-08-03 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-39967:

Description: When counting the number of successful tasks in the stage of 
spark, spark uses the indicator of `tasksSuccessful`, but in fact, the success 
or failure of tasks is based on the array of `successful`. Through the log I 
added, it is found that the number of failed tasks counted by `tasksSuccessful` 
is inconsistent with the number of failures stored in the array of 
`successful`. We should take `successful` as the standard.  (was: When counting 
the number of successful tasks in the stage of spark, spark uses the indicator 
of `tasksSuccessful`, but in fact, the success or failure of tasks is based on 
the array of `successful`. Through the log, it is found that the number of 
failed tasks counted by `tasksSuccessful` is inconsistent with the number of 
failures stored in the array of `successful`. We should take `successful` as 
the standard.)

> Instead of using the scalar tasksSuccessful, use the successful array to 
> calculate whether the task is completed
> 
>
> Key: SPARK-39967
> URL: https://issues.apache.org/jira/browse/SPARK-39967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3, 2.4.6
>Reporter: jingxiong zhong
>Priority: Critical
> Attachments: spark1-1.png, spark2.png, spark3-1.png
>
>
> When counting the number of successful tasks in the stage of spark, spark 
> uses the indicator of `tasksSuccessful`, but in fact, the success or failure 
> of tasks is based on the array of `successful`. Through the log I added, it 
> is found that the number of failed tasks counted by `tasksSuccessful` is 
> inconsistent with the number of failures stored in the array of `successful`. 
> We should take `successful` as the standard.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39967) Instead of using the scalar tasksSuccessful, use the successful array to calculate whether the task is completed

2022-08-03 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-39967:

Attachment: spark1-1.png
spark2.png
spark3-1.png

> Instead of using the scalar tasksSuccessful, use the successful array to 
> calculate whether the task is completed
> 
>
> Key: SPARK-39967
> URL: https://issues.apache.org/jira/browse/SPARK-39967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3, 2.4.6
>Reporter: jingxiong zhong
>Priority: Critical
> Attachments: spark1-1.png, spark2.png, spark3-1.png
>
>
> When counting the number of successful tasks in the stage of spark, spark 
> uses the indicator of `tasksSuccessful`, but in fact, the success or failure 
> of tasks is based on the array of `successful`. Through the log, it is found 
> that the number of failed tasks counted by `tasksSuccessful` is inconsistent 
> with the number of failures stored in the array of `successful`. We should 
> take `successful` as the standard.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39967) Instead of using the scalar tasksSuccessful, use the successful array to calculate whether the task is completed

2022-08-03 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-39967:
---

 Summary: Instead of using the scalar tasksSuccessful, use the 
successful array to calculate whether the task is completed
 Key: SPARK-39967
 URL: https://issues.apache.org/jira/browse/SPARK-39967
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.6, 2.4.3
Reporter: jingxiong zhong


When counting the number of successful tasks in the stage of spark, spark uses 
the indicator of `tasksSuccessful`, but in fact, the success or failure of 
tasks is based on the array of `successful`. Through the log, it is found that 
the number of failed tasks counted by `tasksSuccessful` is inconsistent with 
the number of failures stored in the array of `successful`. We should take 
`successful` as the standard.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39153) When we look at spark UI or History, we can see the failed tasks first

2022-05-11 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-39153:
---

 Summary: When we look at spark UI or History, we can see the 
failed tasks first
 Key: SPARK-39153
 URL: https://issues.apache.org/jira/browse/SPARK-39153
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
 Environment: spark 3.2.0
Reporter: jingxiong zhong
 Fix For: 3.2.0


When a task fails, users are more concerned about the causes of failed tasks 
and failed tasks. The Current Spark UI and History are sorted according to 
"Index" rather than "Errors". When a large number of tasks are sorted, you need 
to wait a certain period for tasks to be sorted. In order to find the cause of 
Errors for failed tasks, we can improve the user experience by specifying 
sorting by the "Errors" column at the beginning



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2022-01-22 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480381#comment-17480381
 ] 

jingxiong zhong commented on SPARK-37677:
-

[~hyukjin.kwon] Hey sir, I brought up another attempt for this issue: it uses a
shell command to do the decompression.
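
For reference, a quick local check (assuming an Info-ZIP zip/unzip pair, as
found on most Linux images) of why the shell route helps: unzip restores the
Unix mode bits stored in the archive, which extraction through plain
java.util.zip does not:

{code:sh}
# Local demonstration: the executable bit survives a zip/unzip round trip.
mkdir -p demo/bin
printf '#!/bin/sh\necho hello\n' > demo/bin/run
chmod +x demo/bin/run
zip -r demo.zip demo
rm -rf demo
unzip demo.zip
ls -l demo/bin/run   # the x bit is restored by unzip, unlike java.util.zip extraction
{code}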

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the
> pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2022-01-20 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong resolved SPARK-37708.
-
Resolution: Not A Problem

> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Major
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> {code:sh}
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> {code}
> This can't run my job successfully; it throws an error:
> {code:sh}
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> {code}
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes

2022-01-12 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475135#comment-17475135
 ] 

jingxiong zhong edited comment on SPARK-35715 at 1/13/22, 7:00 AM:
---

It seems that Spark 3 does not support the local:// scheme for this path. You
can try file:///etc/xattr.conf instead.
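
For reference, the reporter's Example 1 with only the scheme changed (the API
server address and image are placeholders, and this variant is untested here):

{code:sh}
# <apiserver> and <image> are placeholders for your API server and Spark image.
$SPARK_HOME/bin/spark-submit --master k8s://https://<apiserver> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.container.image=<image> \
--conf spark.kubernetes.driver.pod.name=sparkdriverpod \
--files file:///etc/xattr.conf \
local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100
{code}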


was (Author: JIRAUSER281124):
t seems that spark 3 does not support the schema using local as the path. You 
can try file:///etc/xattr.conf

> Option "--files" with local:// prefix is not honoured for Spark on kubernetes
> -
>
> Key: SPARK-35715
> URL: https://issues.apache.org/jira/browse/SPARK-35715
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2, 3.1.2
>Reporter: Pardhu Madipalli
>Priority: Major
>
> When we provide a local file as a dependency using "--files" option, the file 
> is not getting copied to work directories of executors.
> h5. Example 1:
>  
> {code:java}
> $SPARK_HOME/bin/spark-submit --master k8s://https:// \ 
> --deploy-mode cluster \ 
> --name spark-pi \ 
> --class org.apache.spark.examples.SparkPi \ 
> --conf spark.executor.instances=1 \ 
> --conf spark.kubernetes.container.image= \ 
> --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ 
> --files local:///etc/xattr.conf \ 
> local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100
> {code}
>  
> h6. Content of Spark Executor work-dir:
>  
> {code:java}
> ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls 
> /opt/spark/work-dir/
> spark-examples_2.12-3.1.2.jar
> {code}
>  
> We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to  
> _/opt/spark/work-dir/ ._
>  
> 
>  
> {{Instead of using "--files", if we use "--jars" option the file is getting 
> copied as expected.}}
> h5. Example 2:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master k8s://https:// \ 
> --deploy-mode cluster \ 
> --name spark-pi \ 
> --class org.apache.spark.examples.SparkPi \ 
> --conf spark.executor.instances=1 \ 
> --conf spark.kubernetes.container.image= \ 
> --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ 
> --jars local:///etc/xattr.conf \ 
> local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100
> {code}
> h6. Content of Spark Executor work-dir:
>  
> {code:java}
> ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls 
> /opt/spark/work-dir/
> spark-examples_2.12-3.1.2.jar
> xattr.conf
> {code}
> We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to  
> _/opt/spark/work-dir/ ._
>  
> I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way 
> in both cases.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35715) Option "--files" with local:// prefix is not honoured for Spark on kubernetes

2022-01-12 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475135#comment-17475135
 ] 

jingxiong zhong commented on SPARK-35715:
-

It seems that Spark 3 does not support the local:// scheme for this path. You
can try file:///etc/xattr.conf instead.

> Option "--files" with local:// prefix is not honoured for Spark on kubernetes
> -
>
> Key: SPARK-35715
> URL: https://issues.apache.org/jira/browse/SPARK-35715
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2, 3.1.2
>Reporter: Pardhu Madipalli
>Priority: Major
>
> When we provide a local file as a dependency using "--files" option, the file 
> is not getting copied to work directories of executors.
> h5. Example 1:
>  
> {code:java}
> $SPARK_HOME/bin/spark-submit --master k8s://https:// \ 
> --deploy-mode cluster \ 
> --name spark-pi \ 
> --class org.apache.spark.examples.SparkPi \ 
> --conf spark.executor.instances=1 \ 
> --conf spark.kubernetes.container.image= \ 
> --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ 
> --files local:///etc/xattr.conf \ 
> local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100
> {code}
>  
> h6. Content of Spark Executor work-dir:
>  
> {code:java}
> ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls 
> /opt/spark/work-dir/
> spark-examples_2.12-3.1.2.jar
> {code}
>  
> We can notice here that the file _/etc/xattr.conf_ is *NOT* copied to  
> _/opt/spark/work-dir/ ._
>  
> 
>  
> {{Instead of using "--files", if we use "--jars" option the file is getting 
> copied as expected.}}
> h5. Example 2:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master k8s://https:// \ 
> --deploy-mode cluster \ 
> --name spark-pi \ 
> --class org.apache.spark.examples.SparkPi \ 
> --conf spark.executor.instances=1 \ 
> --conf spark.kubernetes.container.image= \ 
> --conf spark.kubernetes.driver.pod.name=sparkdriverpod \ 
> --jars local:///etc/xattr.conf \ 
> local:///opt/spark/examples/jars/spark-examples_2.12-3.1.2.jar 100
> {code}
> h6. Content of Spark Executor work-dir:
>  
> {code:java}
> ~$ kubectl exec -n default spark-pi-22de6279f6bec01c-exec-1 ls 
> /opt/spark/work-dir/
> spark-examples_2.12-3.1.2.jar
> xattr.conf
> {code}
> We can notice here that the file _/etc/xattr.conf_ *IS COPIED* to  
> _/opt/spark/work-dir/ ._
>  
> I tested this with versions *3.1.2* and *3.0.2*. It is behaving the same way 
> in both cases.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2022-01-07 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17471014#comment-17471014
 ] 

jingxiong zhong commented on SPARK-37708:
-

[~hyukjin.kwon] In the end, we found that the operating system was different and
that Python would not run in the image. If we use a CentOS-based image, it works
normally.
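
For what it's worth, a minimal sketch of one way to check and patch the image
(assumption: a Debian 10 "buster" based Spark image, where the libffi6 package
provides libffi.so.6; newer Debian releases ship libffi7 instead, so the
missing-library check should drive the choice):

{code:sh}
# Inside the Spark image: list the shared libraries the shipped interpreter cannot
# resolve, then install the matching package (Debian 10 "buster" shown).
ldd python3.6.9/lib/python3.6/lib-dynload/_ctypes*.so | grep "not found"
apt-get update && apt-get install -y libffi6
{code}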

> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Major
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> {code:sh}
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> {code}
> This can't run my job successfully; it throws an error:
> {code:sh}
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> {code}
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2021-12-31 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17467251#comment-17467251
 ] 

jingxiong zhong commented on SPARK-37677:
-

I found that the unzip step is implemented by decompressing files through Java's
IO. When the output file object is created, it gets the default permissions, so
the permissions of unzip's output files simply follow the file-creation defaults
and do not carry over what was stored in the archive. Therefore, I am trying to
set the permissions after the output file is created; alternatively, the
permission information could be preserved while reading the zip entries, but
that requires modifying the zip-handling code here.
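
As a stopgap that avoids the zip code path entirely, the interpreter can be
shipped as a tar.gz, since tar keeps the executable bit and --archives also
accepts .tar.gz. This is only a sketch reusing the paths from the description
(note the doubled python3.6.6/python3.6.6 level of the zip layout goes away):

{code:sh}
# Repackage the interpreter as tar.gz so permissions survive extraction on the pod.
cd python3.6.6 && tar -czf ../python3.6.6.tar.gz . && cd ..
spark-submit \
--archives ./python3.6.6.tar.gz#python3.6.6 \
--conf "spark.pyspark.python=python3.6.6/bin/python3" \
--conf "spark.pyspark.driver.python=python3.6.6/bin/python3" \
--conf spark.kubernetes.container.image.pullPolicy=Always \
./examples/src/main/python/pi.py 100
{code}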

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the
> pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2021-12-24 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465060#comment-17465060
 ] 

jingxiong zhong edited comment on SPARK-37708 at 12/24/21, 5:28 PM:


[~hyukjin.kwon] I found some packages downloaded for Python 3.9, such as pandas
and NLTK. They would conflict because the operating system is different. Can I
change the Dockerfile's default operating system from Debian to CentOS 6/7?


was (Author: JIRAUSER281124):
[~hyukjin.kwon] I found some packages downloaded, such as pandas, NLTK.That 
would be conflict, because the operating system is different. Can I change the 
default operating system debian of dockerFile to Centos6/7?

> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Major
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> {code:sh}
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> {code}
> This can't run my job successfully; it throws an error:
> {code:sh}
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> {code}
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2021-12-24 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465060#comment-17465060
 ] 

jingxiong zhong edited comment on SPARK-37708 at 12/24/21, 5:24 PM:


[~hyukjin.kwon] I found some downloaded packages, such as pandas and NLTK, that
would conflict because the operating system is different. Can I change the
Dockerfile's default operating system from Debian to CentOS 6/7?


was (Author: JIRAUSER281124):
[~hyukjin.kwon] I found some packages downloaded, such as pandas, NLTK. Can I 
change the default operating system debian of dockerFile to Centos6/7?

> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Major
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> {code:sh}
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> {code}
> This can't run my job successfully; it throws an error:
> {code:sh}
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> {code}
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2021-12-24 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465060#comment-17465060
 ] 

jingxiong zhong commented on SPARK-37708:
-

[~hyukjin.kwon] I found some downloaded packages, such as pandas and NLTK. Can I
change the Dockerfile's default operating system from Debian to CentOS 6/7?
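
Another option, if rebasing the image is not possible, is the conda-pack route
from the PySpark documentation: build and pack the environment on the same OS
family as the Spark image so native packages like pandas link against matching
system libraries. A rough sketch (environment and file names are made up):

{code:sh}
# Build and pack the environment on the same OS family as the Spark image.
conda create -y -n pyspark_env -c conda-forge python=3.9 pandas nltk conda-pack
conda activate pyspark_env
conda pack -f -o pyspark_env.tar.gz
spark-submit \
--archives pyspark_env.tar.gz#environment \
--conf "spark.pyspark.python=environment/bin/python" \
--conf "spark.pyspark.driver.python=environment/bin/python" \
./examples/src/main/python/pi.py 10
{code}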

> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Major
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> {code:sh}
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> {code}
> This can't run my job successfully; it throws an error:
> {code:sh}
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> {code}
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2021-12-24 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17465058#comment-17465058
 ] 

jingxiong zhong commented on SPARK-37708:
-

I used wget to download and compile the source code, but it seems Python 3.6 is
not supported by Spark 3.2.


> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Major
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> {code:sh}
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> {code}
> This can't run my job successfully; it throws an error:
> {code:sh}
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> {code}
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2021-12-21 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-37708:

Component/s: Kubernetes

> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Critical
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> {code:sh}
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> {code}
> This can't run my job successfully; it throws an error:
> {code:sh}
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> {code}
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2021-12-21 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-37708:

Description: 
I have a question about how to add my Python dependencies to a Spark job, as
follows:


{code:sh}
spark-submit \
--archives s3a://path/python3.6.9.tgz#python3.6.9 \
--conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
--conf "spark.pyspark.python=python3.6.9/bin/python3" \
--name "piroottest" \
./examples/src/main/python/pi.py 10
{code}

This can't run my job successfully; it throws an error:


{code:sh}
Traceback (most recent call last):
  File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 

from pyspark.sql import SparkSession
  File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 

async def _ag():
  File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", line 
7, in 
from _ctypes import Union, Structure, Array
ImportError: libffi.so.6: cannot open shared object file: No such file or 
directory
{code}

Or is there another way to add Python dependencies?

  was:
I have a question about that how do I add my Python dependencies to Spark Job, 
as following

spark-submit \
--archives s3a://path/python3.6.9.tgz#python3.6.9 \
--conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
--conf "spark.pyspark.python=python3.6.9/bin/python3" \
--name "piroottest" \
./examples/src/main/python/pi.py 10
this can't run my job sucessfully,it throw error

Traceback (most recent call last):
  File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 

from pyspark.sql import SparkSession
  File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 

async def _ag():
  File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", line 
7, in 
from _ctypes import Union, Structure, Array
ImportError: libffi.so.6: cannot open shared object file: No such file or 
directory
Or is there another way to add Python dependencies?


> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Critical
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> {code:sh}
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> {code}
> This can't run my job successfully; it throws an error:
> {code:sh}
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> {code}
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2021-12-21 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-37708:

Description: 
I have a question about how to add my Python dependencies to a Spark job, as
follows:

spark-submit \
--archives s3a://path/python3.6.9.tgz#python3.6.9 \
--conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
--conf "spark.pyspark.python=python3.6.9/bin/python3" \
--name "piroottest" \
./examples/src/main/python/pi.py 10
This can't run my job successfully; it throws an error:

Traceback (most recent call last):
  File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 

from pyspark.sql import SparkSession
  File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 

async def _ag():
  File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", line 
7, in 
from _ctypes import Union, Structure, Array
ImportError: libffi.so.6: cannot open shared object file: No such file or 
directory
Or is there another way to add Python dependencies?

> pyspark adding third-party Dependencies on k8s
> --
>
> Key: SPARK-37708
> URL: https://issues.apache.org/jira/browse/SPARK-37708
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: pyspark3.2
>Reporter: jingxiong zhong
>Priority: Critical
>
> I have a question about how to add my Python dependencies to a Spark job, as
> follows:
> spark-submit \
> --archives s3a://path/python3.6.9.tgz#python3.6.9 \
> --conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
> --conf "spark.pyspark.python=python3.6.9/bin/python3" \
> --name "piroottest" \
> ./examples/src/main/python/pi.py 10
> This can't run my job successfully; it throws an error:
> Traceback (most recent call last):
>   File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 
> 
> from pyspark.sql import SparkSession
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 
> 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, 
> in 
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 
> 
> async def _ag():
>   File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", 
> line 7, in 
> from _ctypes import Union, Structure, Array
> ImportError: libffi.so.6: cannot open shared object file: No such file or 
> directory
> Or is there another way to add Python dependencies?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37708) pyspark adding third-party Dependencies on k8s

2021-12-21 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-37708:
---

 Summary: pyspark adding third-party Dependencies on k8s
 Key: SPARK-37708
 URL: https://issues.apache.org/jira/browse/SPARK-37708
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
 Environment: pyspark3.2
Reporter: jingxiong zhong






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26404) set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster mode.

2021-12-21 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463384#comment-17463384
 ] 

jingxiong zhong edited comment on SPARK-26404 at 12/21/21, 5:52 PM:


@gollum999 Tim Sanders, hey sir, I have a question about how to add my Python
dependencies to a Spark job, as follows:
{code:sh}
spark-submit \
--archives s3a://path/python3.6.9.tgz#python3.6.9 \
--conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
--conf "spark.pyspark.python=python3.6.9/bin/python3" \
--name "piroottest" \
./examples/src/main/python/pi.py 10
{code}
This can't run my job successfully; it throws an error:

{code:sh}
Traceback (most recent call last):
  File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 

from pyspark.sql import SparkSession
  File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 

async def _ag():
  File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", line 
7, in 
from _ctypes import Union, Structure, Array
ImportError: libffi.so.6: cannot open shared object file: No such file or 
directory
{code}

Or is there another way to add Python dependencies?



was (Author: JIRAUSER281124):
@gollum999Tim Sanders,hey sir,  I have a question about that how can I add my 
python dependency into spark job, as following 
{code:sh}
spark-submit \
--archives s3a://path/python3.6.9.tgz#python3.6.9 \
--conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
--conf "spark.pyspark.python=python3.6.9/bin/python3" \
--name "piroottest" \
./examples/src/main/python/pi.py 10
{code}
this can't run my job sucessfully,it throw error

{code:sh}
Traceback (most recent call last):
  File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 

from pyspark.sql import SparkSession
  File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 

async def _ag():
  File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", line 
7, in 
from _ctypes import Union, Structure, Array
ImportError: libffi.so.6: cannot open shared object file: No such file or 
directory
{code}

Or is there another way to add Python dependencies?


> set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster 
> mode.
> ---
>
> Key: SPARK-26404
> URL: https://issues.apache.org/jira/browse/SPARK-26404
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Dongqing  Liu
>Priority: Major
>
> Neither
>    conf.set("spark.executorEnv.PYSPARK_PYTHON", "/opt/pythonenvs/bin/python")
> nor 
>   conf.set("spark.pyspark.python", "/opt/pythonenvs/bin/python") 
> works. 
> Looks like the executor always picks python from PATH.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26404) set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster mode.

2021-12-21 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463384#comment-17463384
 ] 

jingxiong zhong commented on SPARK-26404:
-

@gollum999 Tim Sanders, hey sir, I have a question about how I can add my Python
dependencies to a Spark job, as follows:
{code:sh}
spark-submit \
--archives s3a://path/python3.6.9.tgz#python3.6.9 \
--conf "spark.pyspark.driver.python=python3.6.9/bin/python3" \
--conf "spark.pyspark.python=python3.6.9/bin/python3" \
--name "piroottest" \
./examples/src/main/python/pi.py 10
{code}
This can't run my job successfully; it throws an error:

{code:sh}
Traceback (most recent call last):
  File "/tmp/spark-63b77184-6e89-4121-bc32-6a1b793e0c85/pi.py", line 21, in 

from pyspark.sql import SparkSession
  File "/opt/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 121, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/__init__.py", line 42, in 

  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 27, in 

async def _ag():
  File "/opt/spark/work-dir/python3.6.9/lib/python3.6/ctypes/__init__.py", line 
7, in 
from _ctypes import Union, Structure, Array
ImportError: libffi.so.6: cannot open shared object file: No such file or 
directory
{code}

Or is there another way to add Python dependencies?


> set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster 
> mode.
> ---
>
> Key: SPARK-26404
> URL: https://issues.apache.org/jira/browse/SPARK-26404
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Dongqing  Liu
>Priority: Major
>
> Neither
>    conf.set("spark.executorEnv.PYSPARK_PYTHON", "/opt/pythonenvs/bin/python")
> nor 
>   conf.set("spark.pyspark.python", "/opt/pythonenvs/bin/python") 
> works. 
> Looks like the executor always picks python from PATH.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2021-12-17 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461800#comment-17461800
 ] 

jingxiong zhong commented on SPARK-37677:
-

:D

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the
> pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2021-12-17 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461799#comment-17461799
 ] 

jingxiong zhong commented on SPARK-37677:
-

Let me have a try.

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the
> pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2021-12-17 Thread jingxiong zhong (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37677 ]


jingxiong zhong deleted comment on SPARK-37677:
-

was (Author: JIRAUSER281124):
:D

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the
> pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2021-12-17 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461797#comment-17461797
 ] 

jingxiong zhong commented on SPARK-37677:
-

Can we fix it by modifying the file permissions when the file is decompressed?

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the
> pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36088) 'spark.archives' does not extract the archive file into the driver under client mode

2021-12-17 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461405#comment-17461405
 ] 

jingxiong zhong commented on SPARK-36088:
-

@hyukjin.kwon I made an issue at 
https://issues.apache.org/jira/browse/SPARK-37677; I think I can fix it.

> 'spark.archives' does not extract the archive file into the driver under 
> client mode
> 
>
> Key: SPARK-36088
> URL: https://issues.apache.org/jira/browse/SPARK-36088
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.1.2
>Reporter: rickcheng
>Priority: Major
>
> When running spark in the k8s cluster, there are 2 deploy modes: cluster and 
> client. After my test, in the cluster mode, *spark.archives* can extract the 
> archive file to the working directory of the executors and driver. But in 
> client mode, *spark.archives* can only extract the archive file to the 
> working directory of the executors.
>  
> However, I need *spark.archives* to send the virtual environment tar file 
> packaged by conda to both the driver and executors under client mode (So that 
> the executor and the driver have the same python environment).
>  
> Why *spark.archives* does not extract the archive file into the working 
> directory of the driver under client mode?
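
A possible workaround for the client-mode gap, assuming the driver runs on the
host where spark-submit is launched (the file names and API server address are
illustrative): unpack the environment locally for the driver and let --archives
handle the executors:

{code:sh}
# Driver side (client mode): unpack the packed environment next to the client process.
mkdir -p driver_env && tar -xzf pyspark_env.tar.gz -C driver_env
spark-submit \
--master k8s://https://<apiserver> \
--deploy-mode client \
--archives pyspark_env.tar.gz#environment \
--conf "spark.pyspark.python=environment/bin/python" \
--conf "spark.pyspark.driver.python=driver_env/bin/python" \
app.py
{code}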



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2021-12-17 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-37677:
---

 Summary: spark on k8s, when the user want to push python3.6.6.zip 
to the pod , but no permission to execute
 Key: SPARK-37677
 URL: https://issues.apache.org/jira/browse/SPARK-37677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
 Environment: spark-3.2.0
Reporter: jingxiong zhong


In cluster mode, I have another question: when I unzip python3.6.6.zip in the
pod, there is no permission to execute it. My submit command is as follows:


{code:sh}
spark-submit \
--archives ./python3.6.6.zip#python3.6.6 \
--conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
--conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
--conf spark.kubernetes.container.image.pullPolicy=Always \
./examples/src/main/python/pi.py 100
{code}





--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37521) insert overwrite table but the partition information stored in Metastore was not changed

2021-12-17 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452521#comment-17452521
 ] 

jingxiong zhong edited comment on SPARK-37521 at 12/17/21, 10:45 AM:
-

The updated partition schema in the metastore was not found in Hive.

when you execute


{code:sql}
create table updata_col_test1(a int) partitioned by (dt string); 
insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
alter table  updata_col_test1 add columns (b int);
insert overwrite table updata_col_test1 partition(dt) values(1, 2, '20200101'); 
{code}

Results from the two engines:

HIVE:

hive> select * from bigdata_qa.updata_col_test1;
OK
updata_col_test1.a    updata_col_test1.b    updata_col_test1.dt
1    NULL    20200101
1    NULL    20200102
1    NULL    20200103
Time taken: 2.985 seconds, Fetched: 3 row(s)
hive>  desc bigdata_qa.updata_col_test1 partition(dt='20200101');
OK
col_name    data_type    comment
a                       int
dt                      string

# Partition Information
# col_name                data_type               comment

dt                      string
Time taken: 6.469 seconds, Fetched: 7 row(s)

SPARK:

spark-sql> select * from bigdata_qa.updata_col_test1;

a    b    dt
1    2    20200101
1    NULL    20200102
1    NULL    20200103
Time taken: 0.357 seconds, Fetched 3 row(s)
spark-sql> desc bigdata_qa.updata_col_test1 partition(dt='20200101');
col_name    data_type    comment
a                       int
b                       int
dt                      string
# Partition Information
# col_name              data_type               comment
dt                      string
Time taken: 0.196 seconds, Fetched 6 row(s)
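
If the cause is that the existing partitions kept their old column list in the
metastore, one thing that might help (an assumption to verify, since this is
Hive DDL and may have to be run from the Hive CLI rather than Spark SQL) is
adding the column at the partition level so the partition schema catches up
with the table schema:

{code:sh}
# Bring the partition schema in line with the table schema (Hive CLI).
hive -e "ALTER TABLE bigdata_qa.updata_col_test1 PARTITION (dt='20200101') ADD COLUMNS (b int);"
{code}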

 


was (Author: JIRAUSER281124):
The schema of metasotre's updated partition was not found in Hive

when you execute

'create table updata_col_test1(a int) partitioned by (dt string); 
insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
insert overwrite table updata_col_test1 partition(dt='20200103') values(1);

alter table  updata_col_test1 add columns (b int);

insert overwrite table updata_col_test1 partition(dt) values(1, 2, '20200101'); 
'

result from two engine

HIVE:

hive> select * from bigdata_qa.updata_col_test1;
OK
updata_col_test1.a    updata_col_test1.b    updata_col_test1.dt
1    NULL    20200101
1    NULL    20200102
1    NULL    20200103
Time taken: 2.985 seconds, Fetched: 3 row(s)
hive>  desc bigdata_qa.updata_col_test1 partition(dt='20200101');
OK
col_name    data_type    comment
a                       int
dt                      string

# Partition Information
# col_name                data_type               comment

dt                      string
Time taken: 6.469 seconds, Fetched: 7 row(s)

SPARK:

spark-sql> select * from bigdata_qa.updata_col_test1;

a    b    dt
1    2    20200101
1    NULL    20200102
1    NULL    20200103
Time taken: 0.357 seconds, Fetched 3 row(s)
spark-sql> desc bigdata_qa.updata_col_test1 partition(dt='20200101');
col_name    data_type    comment
a                       int
b                       int
dt                      string
# Partition Information
# col_name              data_type               comment
dt                      string
Time taken: 0.196 seconds, Fetched 6 row(s)

 

> insert overwrite table but the partition information stored in Metastore was 
> not changed
> 
>
> Key: SPARK-37521
> URL: https://issues.apache.org/jira/browse/SPARK-37521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hive2.3.9
> metastore2.3.9
>Reporter: jingxiong zhong
>Priority: Major
>
> I create a partitioned table in SparkSQL, insert a data entry, add a regular 
> field, and finally insert a new data entry into the partition. The query is 
> normal in SparkSQL, but the return value of the newly inserted field is NULL 
> in Hive 2.3.9
> for example
> create table updata_col_test1(a int) partitioned by (dt string); 
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
> insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
> alter table  updata_col_test1 add columns (b int);
> insert overwrite table updata_col_test1 partition(dt) values(1, 2, 
> '20200101'); fail
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 
> 2); fail
> insert overwrite table updata_col_test1 

[jira] [Comment Edited] (SPARK-36088) 'spark.archives' does not extract the archive file into the driver under client mode

2021-12-16 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461230#comment-17461230
 ] 

jingxiong zhong edited comment on SPARK-36088 at 12/17/21, 6:53 AM:


In cluster mode, I have another question: when I unzip python3.6.6.zip in the
pod, there is no permission to execute it. My submit command is as follows:

{code:sh}
spark-submit \
--archives ./python3.6.6.zip#python3.6.6 \
--conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
--conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
--conf spark.kubernetes.container.image.pullPolicy=Always \
./examples/src/main/python/pi.py 100
{code}



was (Author: JIRAUSER281124):
In cluster mode, I hava another question that when I unzip python3.6.6.zip in 
pod , but no permission to execute, my execute operation as follows:

{code:shell}
spark-submit \
--archives ./python3.6.6.zip#python3.6.6 \
--conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
--conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
--conf spark.kubernetes.container.image.pullPolicy=Always \
./examples/src/main/python/pi.py 100
{code}


> 'spark.archives' does not extract the archive file into the driver under 
> client mode
> 
>
> Key: SPARK-36088
> URL: https://issues.apache.org/jira/browse/SPARK-36088
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.1.2
>Reporter: rickcheng
>Priority: Major
>
> When running spark in the k8s cluster, there are 2 deploy modes: cluster and 
> client. After my test, in the cluster mode, *spark.archives* can extract the 
> archive file to the working directory of the executors and driver. But in 
> client mode, *spark.archives* can only extract the archive file to the 
> working directory of the executors.
>  
> However, I need *spark.archives* to send the virtual environment tar file 
> packaged by conda to both the driver and executors under client mode (So that 
> the executor and the driver have the same python environment).
>  
> Why *spark.archives* does not extract the archive file into the working 
> directory of the driver under client mode?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36088) 'spark.archives' does not extract the archive file into the driver under client mode

2021-12-16 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461230#comment-17461230
 ] 

jingxiong zhong commented on SPARK-36088:
-

In cluster mode, I have another question: when I unzip python3.6.6.zip in the
pod, there is no permission to execute it. My submit command is as follows:

{code:shell}
spark-submit \
--archives ./python3.6.6.zip#python3.6.6 \
--conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
--conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
--conf spark.kubernetes.container.image.pullPolicy=Always \
./examples/src/main/python/pi.py 100
{code}


> 'spark.archives' does not extract the archive file into the driver under 
> client mode
> 
>
> Key: SPARK-36088
> URL: https://issues.apache.org/jira/browse/SPARK-36088
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Submit
>Affects Versions: 3.1.2
>Reporter: rickcheng
>Priority: Major
>
> When running spark in the k8s cluster, there are 2 deploy modes: cluster and 
> client. After my test, in the cluster mode, *spark.archives* can extract the 
> archive file to the working directory of the executors and driver. But in 
> client mode, *spark.archives* can only extract the archive file to the 
> working directory of the executors.
>  
> However, I need *spark.archives* to send the virtual environment tar file 
> packaged by conda to both the driver and executors under client mode (So that 
> the executor and the driver have the same python environment).
>  
> Why *spark.archives* does not extract the archive file into the working 
> directory of the driver under client mode?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37521) insert overwrite table but the partition information stored in Metastore was not changed

2021-12-04 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-37521:

Issue Type: Bug  (was: New Bugzilla Project)

> insert overwrite table but the partition information stored in Metastore was 
> not changed
> 
>
> Key: SPARK-37521
> URL: https://issues.apache.org/jira/browse/SPARK-37521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hive2.3.9
> metastore2.3.9
>Reporter: jingxiong zhong
>Priority: Major
>
> I create a partitioned table in SparkSQL, insert a data entry, add a regular 
> field, and finally insert a new data entry into the partition. The query is 
> normal in SparkSQL, but the return value of the newly inserted field is NULL 
> in Hive 2.3.9
> for example
> create table updata_col_test1(a int) partitioned by (dt string); 
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
> insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
> alter table  updata_col_test1 add columns (b int);
> insert overwrite table updata_col_test1 partition(dt) values(1, 2, 
> '20200101'); fail
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 
> 2); fail
> insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 
> 2); successfully



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37521) insert overwrite table but the partition information stored in Metastore was not changed

2021-12-04 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-37521:

Issue Type: New Bugzilla Project  (was: Question)

> insert overwrite table but the partition information stored in Metastore was 
> not changed
> 
>
> Key: SPARK-37521
> URL: https://issues.apache.org/jira/browse/SPARK-37521
> Project: Spark
>  Issue Type: New Bugzilla Project
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hive2.3.9
> metastore2.3.9
>Reporter: jingxiong zhong
>Priority: Major
>
> I created a partitioned table in Spark SQL, inserted a row, added a regular 
> column, and finally inserted a new row into an existing partition. The query 
> works in Spark SQL, but the newly added column comes back as NULL in Hive 
> 2.3.9.
> For example:
> create table updata_col_test1(a int) partitioned by (dt string); 
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
> insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
> alter table updata_col_test1 add columns (b int);
> insert overwrite table updata_col_test1 partition(dt) values(1, 2, 
> '20200101'); fails
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 
> 2); fails
> insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 
> 2); succeeds



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37521) insert overwrite table but the partition information stored in Metastore was not changed

2021-12-02 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-37521:

Summary: insert overwrite table but the partition information stored in 
Metastore was not changed  (was: insert overwrite table but didn't change the 
message of metastore)

> insert overwrite table but the partition information stored in Metastore was 
> not changed
> 
>
> Key: SPARK-37521
> URL: https://issues.apache.org/jira/browse/SPARK-37521
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hive2.3.9
> metastore2.3.9
>Reporter: jingxiong zhong
>Priority: Blocker
>
> I created a partitioned table in Spark SQL, inserted a row, added a regular 
> column, and finally inserted a new row into an existing partition. The query 
> works in Spark SQL, but the newly added column comes back as NULL in Hive 
> 2.3.9.
> For example:
> create table updata_col_test1(a int) partitioned by (dt string); 
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
> insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
> alter table updata_col_test1 add columns (b int);
> insert overwrite table updata_col_test1 partition(dt) values(1, 2, 
> '20200101'); fails
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 
> 2); fails
> insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 
> 2); succeeds



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37521) insert overwrite table but didn't change the message of metastore

2021-12-02 Thread jingxiong zhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452521#comment-17452521
 ] 

jingxiong zhong commented on SPARK-37521:
-

The updated partition schema in the metastore is not visible from Hive.

When you execute:

create table updata_col_test1(a int) partitioned by (dt string); 
insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
insert overwrite table updata_col_test1 partition(dt='20200103') values(1);

alter table  updata_col_test1 add columns (b int);

insert overwrite table updata_col_test1 partition(dt) values(1, 2, '20200101');

Results from the two engines:

HIVE:

hive> select * from bigdata_qa.updata_col_test1;
OK
updata_col_test1.a    updata_col_test1.b    updata_col_test1.dt
1    NULL    20200101
1    NULL    20200102
1    NULL    20200103
Time taken: 2.985 seconds, Fetched: 3 row(s)
hive>  desc bigdata_qa.updata_col_test1 partition(dt='20200101');
OK
col_name    data_type    comment
a                       int
dt                      string

# Partition Information
# col_name                data_type               comment

dt                      string
Time taken: 6.469 seconds, Fetched: 7 row(s)

SPARK:

spark-sql> select * from bigdata_qa.updata_col_test1;

a    b    dt
1    2    20200101
1    NULL    20200102
1    NULL    20200103
Time taken: 0.357 seconds, Fetched 3 row(s)
spark-sql> desc bigdata_qa.updata_col_test1 partition(dt='20200101');
col_name    data_type    comment
a                       int
b                       int
dt                      string
# Partition Information
# col_name              data_type               comment
dt                      string
Time taken: 0.196 seconds, Fetched 6 row(s)
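
One possible workaround, sketched here under the assumption that the Hive CLI is available and that the Hive version supports CASCADE on ALTER TABLE ... CHANGE COLUMN (added in Hive 1.1, so 2.3.9 should qualify), is to propagate the new column into the metadata of the existing partitions from the Hive side; I have not verified this against this exact cluster:

{code:shell}
# CHANGE COLUMN ... CASCADE rewrites the column metadata of every existing
# partition, so partitions created before the ALTER should also pick up b.
hive -e "
USE bigdata_qa;
ALTER TABLE updata_col_test1 CHANGE COLUMN b b int CASCADE;
"

# Afterwards, describing the old partition from Hive should list column b,
# and the value Spark wrote for b should no longer read back as NULL.
hive -e "desc bigdata_qa.updata_col_test1 partition(dt='20200101');"
{code}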

 

> insert overwrite table but didn't change the message of metastore
> -
>
> Key: SPARK-37521
> URL: https://issues.apache.org/jira/browse/SPARK-37521
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hive2.3.9
> metastore2.3.9
>Reporter: jingxiong zhong
>Priority: Blocker
>
> I created a partitioned table in Spark SQL, inserted a row, added a regular 
> column, and finally inserted a new row into an existing partition. The query 
> works in Spark SQL, but the newly added column comes back as NULL in Hive 
> 2.3.9, whatever.
> For example:
> create table updata_col_test1(a int) partitioned by (dt string); 
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
> insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
> alter table updata_col_test1 add columns (b int);
> insert overwrite table updata_col_test1 partition(dt) values(1, 2, 
> '20200101'); fails
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 
> 2); fails
> insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 
> 2); succeeds



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37521) insert overwrite table but didn't change the message of metastore

2021-12-02 Thread jingxiong zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jingxiong zhong updated SPARK-37521:

Description: 
I created a partitioned table in Spark SQL, inserted a row, added a regular 
column, and finally inserted a new row into an existing partition. The query 
works in Spark SQL, but the newly added column comes back as NULL in Hive 
2.3.9.

For example:

create table updata_col_test1(a int) partitioned by (dt string); 
insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
insert overwrite table updata_col_test1 partition(dt='20200103') values(1);

alter table updata_col_test1 add columns (b int);

insert overwrite table updata_col_test1 partition(dt) values(1, 2, '20200101'); 
fails
insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 2); 
fails
insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 2); 
succeeds

  was:
I created a partitioned table in Spark SQL, inserted a row, added a regular 
column, and finally inserted a new row into an existing partition. The query 
works in Spark SQL, but the newly added column comes back as NULL in Hive 
2.3.9, whatever.

For example:

create table updata_col_test1(a int) partitioned by (dt string); 
insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
insert overwrite table updata_col_test1 partition(dt='20200103') values(1);

alter table updata_col_test1 add columns (b int);

insert overwrite table updata_col_test1 partition(dt) values(1, 2, '20200101'); 
fails
insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 2); 
fails
insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 2); 
succeeds


> insert overwrite table but didn't change the message of metastore
> -
>
> Key: SPARK-37521
> URL: https://issues.apache.org/jira/browse/SPARK-37521
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.2.0
> Environment: spark3.2.0
> hive2.3.9
> metastore2.3.9
>Reporter: jingxiong zhong
>Priority: Blocker
>
> I created a partitioned table in Spark SQL, inserted a row, added a regular 
> column, and finally inserted a new row into an existing partition. The query 
> works in Spark SQL, but the newly added column comes back as NULL in Hive 
> 2.3.9.
> For example:
> create table updata_col_test1(a int) partitioned by (dt string); 
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
> insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
> insert overwrite table updata_col_test1 partition(dt='20200103') values(1);
> alter table updata_col_test1 add columns (b int);
> insert overwrite table updata_col_test1 partition(dt) values(1, 2, 
> '20200101'); fails
> insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 
> 2); fails
> insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 
> 2); succeeds



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37521) insert overwrite table but didn't change the message of metastore

2021-12-02 Thread jingxiong zhong (Jira)
jingxiong zhong created SPARK-37521:
---

 Summary: insert overwrite table but didn't change the message of 
metastore
 Key: SPARK-37521
 URL: https://issues.apache.org/jira/browse/SPARK-37521
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 3.2.0
 Environment: spark3.2.0

hive2.3.9

metastore2.3.9
Reporter: jingxiong zhong


I created a partitioned table in Spark SQL, inserted a row, added a regular 
column, and finally inserted a new row into an existing partition. The query 
works in Spark SQL, but the newly added column comes back as NULL in Hive 
2.3.9, whatever.

For example:

create table updata_col_test1(a int) partitioned by (dt string); 
insert overwrite table updata_col_test1 partition(dt='20200101') values(1); 
insert overwrite table updata_col_test1 partition(dt='20200102') values(1);
insert overwrite table updata_col_test1 partition(dt='20200103') values(1);

alter table updata_col_test1 add columns (b int);

insert overwrite table updata_col_test1 partition(dt) values(1, 2, '20200101'); 
fails
insert overwrite table updata_col_test1 partition(dt='20200101') values(1, 2); 
fails
insert overwrite table updata_col_test1 partition(dt='20200104') values(1, 2); 
succeeds



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org