[jira] [Comment Edited] (SPARK-41556) input_file_position

2022-12-16 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648856#comment-17648856
 ] 

gabrywu edited comment on SPARK-41556 at 12/17/22 5:18 AM:
---

[~yumwang] [~petertoth]  What do you think of it?


was (Author: gabry.wu):
[~yumwang] [~ptoth] What do you think of it?

> input_file_position
> --
>
> Key: SPARK-41556
> URL: https://issues.apache.org/jira/browse/SPARK-41556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: gabrywu
>Priority: Trivial
>
> As of now, we have three built-in UDFs related to input files and blocks. So 
> can we provide a new UDF that returns the current record position within a file 
> or block? Sometimes it is useful: we could treat this position (called ROWID 
> in Oracle) as a physical primary key.
>  
> |input_file_block_length()|Returns the length of the block being read, or -1 if not available.|
> |input_file_block_start()|Returns the start offset of the block being read, or -1 if not available.|
> |input_file_name()|Returns the name of the file being read, or empty string if not available.|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41556) input_file_position

2022-12-16 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648856#comment-17648856
 ] 

gabrywu commented on SPARK-41556:
-

[~yumwang] [~ptoth] What do you think of it?

> input_file_position
> --
>
> Key: SPARK-41556
> URL: https://issues.apache.org/jira/browse/SPARK-41556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: gabrywu
>Priority: Trivial
>
> As of now, we have three built-in UDFs related to input files and blocks. So 
> can we provide a new UDF that returns the current record position within a file 
> or block? Sometimes it is useful: we could treat this position (called ROWID 
> in Oracle) as a physical primary key.
>  
> |input_file_block_length()|Returns the length of the block being read, or -1 if not available.|
> |input_file_block_start()|Returns the start offset of the block being read, or -1 if not available.|
> |input_file_name()|Returns the name of the file being read, or empty string if not available.|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41556) input_file_position

2022-12-16 Thread gabrywu (Jira)
gabrywu created SPARK-41556:
---

 Summary: input_file_position
 Key: SPARK-41556
 URL: https://issues.apache.org/jira/browse/SPARK-41556
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.1
Reporter: gabrywu


As of now, we have three built-in UDFs related to input files and blocks. So can 
we provide a new UDF that returns the current record position within a file or 
block? Sometimes it is useful: we could treat this position (called ROWID in 
Oracle) as a physical primary key.

 
|input_file_block_length()|Returns the length of the block being read, or -1 if not available.|
|input_file_block_start()|Returns the start offset of the block being read, or -1 if not available.|
|input_file_name()|Returns the name of the file being read, or empty string if not available.|
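
For illustration, here is a hedged sketch of how the existing functions are used today and how the proposed function could fit in; the path is a placeholder and input_file_position() is purely hypothetical (it does not exist yet):
{code:sql}
-- Existing built-in functions: file name plus block offsets for each record's source split.
SELECT
  input_file_name()         AS file_name,
  input_file_block_start()  AS block_start,
  input_file_block_length() AS block_length,
  value
FROM text.`/tmp/events`;

-- Hypothetical usage of the proposed function: a per-record position that,
-- combined with input_file_name(), would act like Oracle's ROWID.
-- SELECT input_file_name()     AS file_name,
--        input_file_position() AS row_position,  -- proposed, not implemented
--        value
-- FROM text.`/tmp/events`;
{code}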



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query

2022-12-16 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648616#comment-17648616
 ] 

gabrywu commented on SPARK-24497:
-

This is a useful feature. When will it be merged into the main branch?

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" 
> represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
>     id INTEGER PRIMARY KEY,  -- department ID
>     parent_department INTEGER REFERENCES department, -- upper department ID
>     name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |    |
> --      |    +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
>   -- non-recursive term
>   SELECT * FROM department WHERE name = 'A'
>   UNION ALL
>   -- recursive term
>   SELECT d.*
>   FROM
>     department AS d
>   JOIN
>     subdepartment AS sd
>     ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37872) [SQL] Some classes were moved from org.codehaus.janino:janino to org.codehaus.janino:commons-compiler after version 3.1.x

2022-07-07 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563599#comment-17563599
 ] 

gabrywu commented on SPARK-37872:
-

Yes, janino 3.0.16 is out of date, and it is not compatible with the newer versions.

> [SQL] Some classes were moved from org.codehaus.janino:janino to 
> org.codehaus.janino:commons-compiler after version 3.1.x 
> ---
>
> Key: SPARK-37872
> URL: https://issues.apache.org/jira/browse/SPARK-37872
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.2.0
>Reporter: Jin Shen
>Priority: Major
>
> Here is the code:
>  
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L32]
>  
> ByteArrayClassLoader and InternalCompilerException were moved to 
> org.codehaus.janino:commons-compiler
>  
> [https://github.com/janino-compiler/janino/blob/3.1.6/commons-compiler/src/main/java/org/codehaus/commons/compiler/util/reflect/ByteArrayClassLoader.java]
>  
> [https://github.com/janino-compiler/janino/blob/3.1.6/commons-compiler/src/main/java/org/codehaus/commons/compiler/InternalCompilerException.java]
>  
> The last working version of janino is 3.0.16, but it is out of date.
> Can we change this and upgrade to a newer version of janino and 
> commons-compiler?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39036) Support Alter Table/Partition Concatenate command

2022-05-05 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532564#comment-17532564
 ] 

gabrywu commented on SPARK-39036:
-

[~hyukjin.kwon] Do you know anything about this? Is anyone working on merging 
small files?

> Support Alter Table/Partition Concatenate command
> -
>
> Key: SPARK-39036
> URL: https://issues.apache.org/jira/browse/SPARK-39036
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: gabrywu
>Priority: Major
>
> Hi, folks, 
> In Hive, we can use the following command to merge small files; however, there 
> is no corresponding command in Spark SQL. 
> I believe it's useful, and AQE alone is not enough. Is anyone working 
> on this? If not, I'd like to create a PR to implement it.
>  
> {code:java}
> ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, 
> ...])] CONCATENATE;{code}
>  
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39036) Support Alter Table/Partition Concatenate command

2022-05-05 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-39036:

Description: 
Hi, folks, 

In Hive, we can use the following command to merge small files; however, there 
is no corresponding command in Spark SQL. 

I believe it's useful, and AQE alone is not enough. Is anyone working on 
this? If not, I'd like to create a PR to implement it.

 
{code:java}
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] 
CONCATENATE;{code}
 

[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]

 

  was:
Hi, folks, 

In Hive, we can use following command to merge small files, however, there is 
not a corresponding command to do that. 

I believe it's useful and it's not enough only using AQE.  Is anyone working on 
this to merge small files? If not, I want to create a PR to implement it

 
{code:java}
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] 
CONCATENATE;{code}
 

[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]

 


> Support Alter Table/Partition Concatenate command
> -
>
> Key: SPARK-39036
> URL: https://issues.apache.org/jira/browse/SPARK-39036
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: gabrywu
>Priority: Major
>
> Hi, folks, 
> In Hive, we can use the following command to merge small files; however, there 
> is no corresponding command in Spark SQL. 
> I believe it's useful, and AQE alone is not enough. Is anyone working 
> on this? If not, I'd like to create a PR to implement it.
>  
> {code:java}
> ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, 
> ...])] CONCATENATE;{code}
>  
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39036) support Alter Table/Partition Concatenate command

2022-04-27 Thread gabrywu (Jira)
gabrywu created SPARK-39036:
---

 Summary: support Alter Table/Partition Concatenate command
 Key: SPARK-39036
 URL: https://issues.apache.org/jira/browse/SPARK-39036
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Affects Versions: 3.3.0
Reporter: gabrywu


Hi, folks, 

In Hive, we can use the following command to merge small files; however, there 
is no corresponding command to do that. 

I believe it's useful, and AQE alone is not enough. Is anyone working on 
this? If not, I'd like to create a PR to implement it.

 
{code:java}
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] 
CONCATENATE;{code}
 

[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]
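
Until such a command exists, one commonly used workaround is to rewrite the data with a coalesce/repartition hint so that fewer, larger files are produced. A hedged sketch, with placeholder table names (this is the workaround, not the proposed command):
{code:sql}
-- db.events holds many small files; db.events_compacted is a placeholder target table.
-- The COALESCE hint limits the number of output files written by the rewrite.
CREATE TABLE db.events_compacted
USING parquet
AS
SELECT /*+ COALESCE(8) */ *
FROM db.events;
{code}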

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39003) make AppHistoryServerPlugin a public API for developers

2022-04-23 Thread gabrywu (Jira)
gabrywu created SPARK-39003:
---

 Summary: make AppHistoryServerPlugin a public API for developers
 Key: SPARK-39003
 URL: https://issues.apache.org/jira/browse/SPARK-39003
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 3.1.0, 2.4.0, 2.3.0
Reporter: gabrywu


The history server has an interface called {{AppHistoryServerPlugin}}, which is 
loaded via SPI. However, it is only accessible within the spark package, so can 
we change it into a public interface? With that, developers could extend the 
application history.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0

2022-04-02 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516421#comment-17516421
 ] 

gabrywu commented on SPARK-38769:
-

No matter which UDF it works together with, I believe we should not change its 
behavior, right?

For example, the following JSON contains a field ato_long_v2; however, it may 
also be ato_long_v3, ato_long_v4, etc. We want to extract the version string as 
v2, v3, v4, and schema_of_json is used here:
{code:java}
{
  "tt_v1": 165
  "tt_long_v2": 474
  "ato_long_v2": 431
  "tt_short_v2": 338
  "ato_v1": 408
  "ato_short_v2": 358
  "sf_long_v3": 400
  "sf_short_v3": 498
}{code}
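
For reference, a hedged sketch of how schema_of_json behaves with a foldable (literal) argument in Spark 3.x; the sample JSON string is only an illustration:
{code:sql}
-- schema_of_json accepts a literal JSON string and returns a DDL-style schema string.
SELECT schema_of_json('{"tt_v1": 165, "ato_long_v2": 431}');
-- e.g. STRUCT<ato_long_v2: BIGINT, tt_v1: BIGINT>

-- The schema (again as a literal) can then be passed to from_json.
SELECT from_json('{"tt_v1": 165, "ato_long_v2": 431}',
                 'tt_v1 BIGINT, ato_long_v2 BIGINT') AS parsed;
{code}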

> [SQL] behavior of schema_of_json not same with 2.4.0
> 
>
> Key: SPARK-38769
> URL: https://issues.apache.org/jira/browse/SPARK-38769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: gabrywu
>Priority: Minor
>
> When I switch to spark 3.1.1 from spark 2.4.0, I found a built-in function 
> throw errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 
> 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due 
> to data type mismatch: The input json should be a foldable string expression 
> and not null; however, got get_json_object(`adtnl_info_txt`, 
> '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0, So, is it a bug, or a new feature, 
> which doesn't support non-Literal expressions?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0

2022-04-02 Thread gabrywu (Jira)


[ https://issues.apache.org/jira/browse/SPARK-38769 ]


gabrywu deleted comment on SPARK-38769:
-

was (Author: gabry.wu):
[~maxgekk] 

> [SQL] behavior of schema_of_json not same with 2.4.0
> 
>
> Key: SPARK-38769
> URL: https://issues.apache.org/jira/browse/SPARK-38769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: gabrywu
>Priority: Minor
>
> When I switch to spark 3.1.1 from spark 2.4.0, I found a built-in function 
> throw errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 
> 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due 
> to data type mismatch: The input json should be a foldable string expression 
> and not null; however, got get_json_object(`adtnl_info_txt`, 
> '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0, So, is it a bug, or a new feature, 
> which doesn't support non-Literal expressions?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0

2022-04-02 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516421#comment-17516421
 ] 

gabrywu edited comment on SPARK-38769 at 4/3/22 2:20 AM:
-

[~hyukjin.kwon] No matter which UDF it works together with, I believe we should not 
change its behavior, right?

For example, the following JSON contains a field ato_long_v2; however, it may 
also be ato_long_v3, ato_long_v4, etc. We want to extract the version string as 
v2, v3, v4, and schema_of_json is used here:
{code:java}
{
  "tt_v1": 165
  "tt_long_v2": 474
  "ato_long_v2": 431
  "tt_short_v2": 338
  "ato_v1": 408
  "ato_short_v2": 358
  "sf_long_v3": 400
  "sf_short_v3": 498
}{code}


was (Author: gabry.wu):
nomatter which UDF to work together, I believe we should not change its 
behavior, right?

For example, following json contains a field ato_long_v2, however, it will be 
ato_long_v3, and ato_long_v4, etc. We want to extract the version string as 
v2,v3,v4, and schema_of_json is used here
{code:java}
{
  "tt_v1": 165
  "tt_long_v2": 474
  "ato_long_v2": 431
  "tt_short_v2": 338
  "ato_v1": 408
  "ato_short_v2": 358
  "sf_long_v3": 400
  "sf_short_v3": 498
}{code}

> [SQL] behavior of schema_of_json not same with 2.4.0
> 
>
> Key: SPARK-38769
> URL: https://issues.apache.org/jira/browse/SPARK-38769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: gabrywu
>Priority: Minor
>
> When I switch to spark 3.1.1 from spark 2.4.0, I found a built-in function 
> throw errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 
> 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due 
> to data type mismatch: The input json should be a foldable string expression 
> and not null; however, got get_json_object(`adtnl_info_txt`, 
> '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0, So, is it a bug, or a new feature, 
> which doesn't support non-Literal expressions?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0

2022-04-01 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516219#comment-17516219
 ] 

gabrywu commented on SPARK-38769:
-

[~maxgekk] 

> [SQL] behavior of schema_of_json not same with 2.4.0
> 
>
> Key: SPARK-38769
> URL: https://issues.apache.org/jira/browse/SPARK-38769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: gabrywu
>Priority: Minor
>
> When I switch to spark 3.1.1 from spark 2.4.0, I found a built-in function 
> throw errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 
> 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due 
> to data type mismatch: The input json should be a foldable string expression 
> and not null; however, got get_json_object(`adtnl_info_txt`, 
> '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0, So, is it a bug, or a new feature, 
> which doesn't support non-Literal expressions?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38769) [SQL] behavior schema_of_json not same with 2.4.0

2022-04-01 Thread gabrywu (Jira)
gabrywu created SPARK-38769:
---

 Summary: [SQL] behavior schema_of_json not same with 2.4.0
 Key: SPARK-38769
 URL: https://issues.apache.org/jira/browse/SPARK-38769
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: gabrywu


After switching from Spark 2.4.0 to Spark 3.1.1, I found that a built-in function 
throws an error:
|== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 
'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due 
to data type mismatch: The input json should be a foldable string expression 
and not null; however, got get_json_object(`adtnl_info_txt`, 
'$.all_model_scores').; line 3 pos 2; |

But schema_of_json worked well in 2.4.0. So is this a bug, or a new behavior that 
no longer supports non-literal expressions?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0

2022-04-01 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38769:

Summary: [SQL] behavior of schema_of_json not same with 2.4.0  (was: [SQL] 
behavior schema_of_json not same with 2.4.0)

> [SQL] behavior of schema_of_json not same with 2.4.0
> 
>
> Key: SPARK-38769
> URL: https://issues.apache.org/jira/browse/SPARK-38769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: gabrywu
>Priority: Minor
>
> When I switch to spark 3.1.1 from spark 2.4.0, I found a built-in function 
> throw errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 
> 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due 
> to data type mismatch: The input json should be a foldable string expression 
> and not null; however, got get_json_object(`adtnl_info_txt`, 
> '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0, So, is it a bug, or a new feature, 
> which doesn't support non-Literal expressions?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-03-05 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table and column statistics are very important to the Spark SQL 
optimizer; however, we have to collect and update them manually using 
{code:java}
analyze table tableName compute statistics{code}
That is a little inconvenient, so why can't we *collect and update statistics 
automatically* when a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update the 
corresponding table statistics using SQL metrics, and subsequent queries can then 
be optimized by Spark SQL with these statistics.

It's a common case that we run daily batches of Spark SQL, so the same SQL runs 
every day while the SQL and its corresponding tables' data change slowly. That 
means we can use statistics updated yesterday to optimize today's SQL, and we 
can also adjust important configs such as spark.sql.shuffle.partitions.

So we should add a mechanism to store every stage's statistics somewhere and 
reuse them in new SQL queries, not just collect statistics after a stage finishes.

Of course, we should also *add a version number to the statistics* in case they 
become stale.

https://docs.google.com/document/d/1L48Dovynboi_ARu-OqQNJCOQqeVUTutLu8fo-w_ZPPA/edit#

  was:
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#ff}collect & update 
statistics automatically{color} when a spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metrics. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQLs, of course can also adjust the important configs, such as 
spark.sql.shuffle.partitions

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

Of course, we'd better {color:#ff}add a version number to statistics{color} 
in case of losing efficacy


> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#ff}collect & update 
> statistics automatically{color} when a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metrics. And in following queries, 
> spark sql optimizer can use these statistics.
> As we all know, it's a common case that we run daily batches using Spark 
> SQLs, so a same SQL can run every day, and the SQL and its corresponding 
> tables data change slowly. That means we can use statistics updated on 
> yesterday to optimize current SQLs, of course can also adjust the important 
> configs, such as spark.sql.shuffle.partitions
> So we'd better add a mechanism to store every stage's statistics somewhere, 
> and use it in new SQLs. Not just collect statistics after a stage finishes.
> Of course, we'd better {color:#ff}add a version number to 
> statistics{color} in case of losing efficacy
>  
> https://docs.google.com/document/d/1L48Dovynboi_ARu-OqQNJCOQqeVUTutLu8fo-w_ZPPA/edit#



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-03-05 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501864#comment-17501864
 ] 

gabrywu commented on SPARK-38258:
-

[~yumwang] What do you think of it?

> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#ff}collect & update 
> statistics automatically{color} when a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metrics. And in following queries, 
> spark sql optimizer can use these statistics.
> As we all know, it's a common case that we run daily batches using Spark 
> SQLs, so a same SQL can run every day, and the SQL and its corresponding 
> tables data change slowly. That means we can use statistics updated on 
> yesterday to optimize current SQLs, of course can also adjust the important 
> configs, such as spark.sql.shuffle.partitions
> So we'd better add a mechanism to store every stage's statistics somewhere, 
> and use it in new SQLs. Not just collect statistics after a stage finishes.
> Of course, we'd better {color:#ff}add a version number to 
> statistics{color} in case of losing efficacy



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-03-05 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#ff}collect & update 
statistics automatically{color} when a spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metrics. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQLs, of course can also adjust the important configs, such as 
spark.sql.shuffle.partitions

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

Of course, we'd better {color:#ff}add a version number to statistics{color} 
in case of losing efficacy

  was:
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#ff}collect & update 
statistics automatically{color} when a spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metrics. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQLs.

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

Of course, we'd better {color:#ff}add a version number to statistics{color} 
in case of losing efficacy


> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#ff}collect & update 
> statistics automatically{color} when a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metrics. And in following queries, 
> spark sql optimizer can use these statistics.
> As we all know, it's a common case that we run daily batches using Spark 
> SQLs, so a same SQL can run every day, and the SQL and its corresponding 
> tables data change slowly. That means we can use statistics updated on 
> yesterday to optimize current SQLs, of course can also adjust the important 
> configs, such as spark.sql.shuffle.partitions
> So we'd better add a mechanism to store every stage's statistics somewhere, 
> and use it in new SQLs. Not just collect statistics after a stage finishes.
> Of course, we'd better {color:#ff}add a version number to 
> statistics{color} in case of losing efficacy



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-03-04 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#ff}collect & update 
statistics automatically{color} when a spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metrics. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQLs.

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

Of course, we'd better {color:#ff}add a version number to statistics{color} 
in case of losing efficacy

  was:
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we {color:#FF}collect & update 
statistics automatically{color} when a spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metrics. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQLs.

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

Of course, we'd better {color:#FF}add a version number to statistics{color} 
in case of losing efficacy


> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#ff}collect & update 
> statistics automatically{color} when a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metrics. And in following queries, 
> spark sql optimizer can use these statistics.
> As we all know, it's a common case that we run daily batches using Spark 
> SQLs, so a same SQL can run every day, and the SQL and its corresponding 
> tables data change slowly. That means we can use statistics updated on 
> yesterday to optimize current SQLs.
> So we'd better add a mechanism to store every stage's statistics somewhere, 
> and use it in new SQLs. Not just collect statistics after a stage finishes.
> Of course, we'd better {color:#ff}add a version number to 
> statistics{color} in case of losing efficacy



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-03-04 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we {color:#FF}collect & update 
statistics automatically{color} when a spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metrics. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQLs.

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

Of course, we'd better {color:#FF}add a version number to statistics{color} 
in case of losing efficacy

  was:
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we collect & update statistics when a 
spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metrics. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQLs.

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

 


> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
>  
> It's a little inconvenient, so why can't we {color:#FF}collect & update 
> statistics automatically{color} when a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metrics. And in following queries, 
> spark sql optimizer can use these statistics.
> As we all know, it's a common case that we run daily batches using Spark 
> SQLs, so a same SQL can run every day, and the SQL and its corresponding 
> tables data change slowly. That means we can use statistics updated on 
> yesterday to optimize current SQLs.
> So we'd better add a mechanism to store every stage's statistics somewhere, 
> and use it in new SQLs. Not just collect statistics after a stage finishes.
> Of course, we'd better {color:#FF}add a version number to 
> statistics{color} in case of losing efficacy



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we collect & update statistics when a 
spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metrics. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQLs.

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

 

  was:
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we collect & update statistics when a 
spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metric. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQL.

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

 


> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
>  
> It's a little inconvenient, so why can't we collect & update statistics when 
> a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metrics. And in following queries, 
> spark sql optimizer can use these statistics.
> As we all know, it's a common case that we run daily batches using Spark 
> SQLs, so a same SQL can run every day, and the SQL and its corresponding 
> tables data change slowly. That means we can use statistics updated on 
> yesterday to optimize current SQLs.
> So we'd better add a mechanism to store every stage's statistics somewhere, 
> and use it in new SQLs. Not just collect statistics after a stage finishes.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we collect & update statistics when a 
spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metric. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use statistics updated on yesterday to 
optimize current SQL.

So we'd better add a mechanism to store every stage's statistics somewhere, and 
use it in new SQLs. Not just collect statistics after a stage finishes.

 

  was:
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we collect & update statistics when a 
spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metric. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use sta

 


> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
>  
> It's a little inconvenient, so why can't we collect & update statistics when 
> a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metric. And in following queries, 
> spark sql optimizer can use these statistics.
> As we all know, it's a common case that we run daily batches using Spark 
> SQLs, so a same SQL can run every day, and the SQL and its corresponding 
> tables data change slowly. That means we can use statistics updated on 
> yesterday to optimize current SQL.
> So we'd better add a mechanism to store every stage's statistics somewhere, 
> and use it in new SQLs. Not just collect statistics after a stage finishes.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we collect & update statistics when a 
spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metric. And in following queries, 
spark sql optimizer can use these statistics.

As we all know, it's a common case that we run daily batches using Spark SQLs, 
so a same SQL can run every day, and the SQL and its corresponding tables data 
change slowly. That means we can use sta

 

  was:
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we collect & update statistics when a 
spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metric. And in following queries, 
spark sql optimizer can use these statistics.

So what do you think of it?[~yumwang] , it it reasonable?


> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
>  
> It's a little inconvenient, so why can't we collect & update statistics when 
> a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metric. And in following queries, 
> spark sql optimizer can use these statistics.
> As we all know, it's a common case that we run daily batches using Spark 
> SQLs, so a same SQL can run every day, and the SQL and its corresponding 
> tables data change slowly. That means we can use sta
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Affects Version/s: 2.4.0

> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
>  
> It's a little inconvenient, so why can't we collect & update statistics when 
> a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metric. And in following queries, 
> spark sql optimizer can use these statistics.
> So what do you think of it?[~yumwang] , it it reasonable?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we collect & update statistics when a 
spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metric. And in following queries, 
spark sql optimizer can use these statistics.

So what do you think of it?[~yumwang] , it it reasonable?

  was:
As we all know, table & column statistics are very important to spark SQL 
optimizer, however we have to collect & update them using 
{code:java}
analyze table tableName compute statistics{code}
 

It's a little inconvenient, so why can't we collect & update statistics when a 
spark stage runs and finishes?

For example, when a insert overwrite table statement finishes, we can update a 
corresponding table statistics using SQL metric. And in next queries, spark sql 
optimizer can use these statistics.

So what do you think of it?[~yumwang] 


> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to spark SQL 
> optimizer, however we have to collect & update them using 
> {code:java}
> analyze table tableName compute statistics{code}
>  
> It's a little inconvenient, so why can't we collect & update statistics when 
> a spark stage runs and finishes?
> For example, when a insert overwrite table statement finishes, we can update 
> a corresponding table statistics using SQL metric. And in following queries, 
> spark sql optimizer can use these statistics.
> So what do you think of it?[~yumwang] , it it reasonable?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)
gabrywu created SPARK-38258:
---

 Summary: [proposal] collect & update statistics automatically when 
spark SQL is running
 Key: SPARK-38258
 URL: https://issues.apache.org/jira/browse/SPARK-38258
 Project: Spark
  Issue Type: Wish
  Components: Spark Core, SQL
Affects Versions: 3.2.0, 3.1.0, 3.0.0
Reporter: gabrywu


As we all know, table and column statistics are very important to the Spark SQL 
optimizer; however, we have to collect and update them manually using 
{code:java}
analyze table tableName compute statistics{code}
 

That is a little inconvenient, so why can't we collect and update statistics when 
a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update the 
corresponding table statistics using SQL metrics, and the next queries can then 
be optimized by Spark SQL with these statistics.

So what do you think of it, [~yumwang]?
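
For context, the manual flow today looks something like the following (db.sales and its columns are placeholders); the proposal is to make this step unnecessary:
{code:sql}
-- Manually collect table-level and column-level statistics for a placeholder table.
ANALYZE TABLE db.sales COMPUTE STATISTICS;
ANALYZE TABLE db.sales COMPUTE STATISTICS FOR COLUMNS id, amount;

-- Inspect what was collected (e.g. row count, size in bytes, column stats).
DESCRIBE EXTENDED db.sales;
{code}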



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28329) SELECT INTO syntax

2022-02-14 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491884#comment-17491884
 ] 

gabrywu commented on SPARK-28329:
-

[~smilegator] Is there a plan to support this, i.e., SELECT INTO a scalar variable? I 
think it's useful for optimizing SQL like this:
{code:SQL}
select max(id) into ${max_id} from db.tableA;
select * from db.tableB where id >= ${max_id};
{code}
It's better than the following SQL because the filter id >= ${max_id} can be 
pushed down:
{code:SQL}
select * from db.tableB where id >= (select max(id) from db.tableA);
{code}

> SELECT INTO syntax
> --
>
> Key: SPARK-28329
> URL: https://issues.apache.org/jira/browse/SPARK-28329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. Synopsis
> {noformat}
> [ WITH [ RECURSIVE ] with_query [, ...] ]
> SELECT [ ALL | DISTINCT [ ON ( expression [, ...] ) ] ]
> * | expression [ [ AS ] output_name ] [, ...]
> INTO [ TEMPORARY | TEMP | UNLOGGED ] [ TABLE ] new_table
> [ FROM from_item [, ...] ]
> [ WHERE condition ]
> [ GROUP BY expression [, ...] ]
> [ HAVING condition [, ...] ]
> [ WINDOW window_name AS ( window_definition ) [, ...] ]
> [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
> [ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | 
> LAST } ] [, ...] ]
> [ LIMIT { count | ALL } ]
> [ OFFSET start [ ROW | ROWS ] ]
> [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
> [ FOR { UPDATE | SHARE } [ OF table_name [, ...] ] [ NOWAIT ] [...] ]
> {noformat}
> h2. Description
> {{SELECT INTO}} creates a new table and fills it with data computed by a 
> query. The data is not returned to the client, as it is with a normal 
> {{SELECT}}. The new table's columns have the names and data types associated 
> with the output columns of the {{SELECT}}.
>  
> {{CREATE TABLE AS}} offers a superset of the functionality offered by 
> {{SELECT INTO}}.
> [https://www.postgresql.org/docs/11/sql-selectinto.html]
>  [https://www.postgresql.org/docs/11/sql-createtableas.html]
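
For comparison, a hedged sketch of the CREATE TABLE AS form that Spark SQL already supports and that covers the same use case; table and column names are placeholders:
{code:sql}
-- Equivalent of SELECT ... INTO new_table: materialize a query result as a new table.
CREATE TABLE db.new_table AS
SELECT id, name
FROM db.source_table
WHERE id > 100;
{code}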



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org