[jira] [Comment Edited] (SPARK-41556) input_file_position
[ https://issues.apache.org/jira/browse/SPARK-41556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648856#comment-17648856 ]

gabrywu edited comment on SPARK-41556 at 12/17/22 5:18 AM:
-----------------------------------------------------------

[~yumwang] [~petertoth] What do you think of it?

was (Author: gabry.wu):
[~yumwang] [~ptoth] What do you think of it?

> input_file_position
> -------------------
>
>                 Key: SPARK-41556
>                 URL: https://issues.apache.org/jira/browse/SPARK-41556
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.3.1
>            Reporter: gabrywu
>            Priority: Trivial
>
> As of now, we have 3 built-in UDFs related to input files and blocks. So can we provide a new UDF that returns the current record position within a file or block? Sometimes it's useful, and we could treat this position (called ROWID in Oracle) as a physical primary key.
>
> |input_file_block_length()|Returns the length of the block being read, or -1 if not available.|
> |input_file_block_start()|Returns the start offset of the block being read, or -1 if not available.|
> |input_file_name()|Returns the name of the file being read, or empty string if not available.|

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41556) input_file_position
[ https://issues.apache.org/jira/browse/SPARK-41556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648856#comment-17648856 ]

gabrywu commented on SPARK-41556:
---------------------------------

[~yumwang] [~ptoth] What do you think of it?

> input_file_position
> -------------------
>
>                 Key: SPARK-41556
>                 URL: https://issues.apache.org/jira/browse/SPARK-41556
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.3.1
>            Reporter: gabrywu
>            Priority: Trivial
>
> As of now, we have 3 built-in UDFs related to input files and blocks. So can we provide a new UDF that returns the current record position within a file or block? Sometimes it's useful, and we could treat this position (called ROWID in Oracle) as a physical primary key.
>
> |input_file_block_length()|Returns the length of the block being read, or -1 if not available.|
> |input_file_block_start()|Returns the start offset of the block being read, or -1 if not available.|
> |input_file_name()|Returns the name of the file being read, or empty string if not available.|
[jira] [Created] (SPARK-41556) input_file_position
gabrywu created SPARK-41556:
-------------------------------

             Summary: input_file_position
                 Key: SPARK-41556
                 URL: https://issues.apache.org/jira/browse/SPARK-41556
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.3.1
            Reporter: gabrywu


As of now, we have 3 built-in UDFs related to input files and blocks. So can we provide a new UDF that returns the current record position within a file or block? Sometimes it's useful, and we could treat this position (called ROWID in Oracle) as a physical primary key.

|input_file_block_length()|Returns the length of the block being read, or -1 if not available.|
|input_file_block_start()|Returns the start offset of the block being read, or -1 if not available.|
|input_file_name()|Returns the name of the file being read, or empty string if not available.|
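To make the proposal concrete, here is a sketch of how the three existing functions are used, alongside the proposed one. Note that input_file_position() does not exist in Spark; it is the hypothetical UDF this issue asks for, and some_table is a placeholder.

{code:sql}
-- Existing built-in functions related to input files and blocks:
SELECT
  input_file_name(),         -- file being read, or '' if not available
  input_file_block_start(),  -- start offset of the block, or -1
  input_file_block_length()  -- length of the block, or -1
FROM some_table;

-- Proposed (hypothetical) addition: the current record's position, which
-- together with input_file_name() could serve as a physical "ROWID":
SELECT input_file_name(), input_file_position()
FROM some_table;
{code}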
[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648616#comment-17648616 ]

gabrywu commented on SPARK-24497:
---------------------------------

This is a useful feature; when will it be merged to the main branch?

> ANSI SQL: Recursive query
> -------------------------
>
>                 Key: SPARK-24497
>                 URL: https://issues.apache.org/jira/browse/SPARK-24497
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> h3. *Examples*
> Here is an example of {{WITH RECURSIVE}} clause usage. Table "department" represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
>   id INTEGER PRIMARY KEY,  -- department ID
>   parent_department INTEGER REFERENCES department, -- upper department ID
>   name TEXT -- department name
> );
>
> INSERT INTO department (id, parent_department, "name")
> VALUES
>   (0, NULL, 'ROOT'),
>   (1, 0, 'A'),
>   (2, 1, 'B'),
>   (3, 2, 'C'),
>   (4, 2, 'D'),
>   (5, 0, 'E'),
>   (6, 4, 'F'),
>   (7, 5, 'G');
>
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |        |
> --      |        +->D-+->F
> --      +->E-+->G
> {code}
>
> To extract all departments under A, you can use the following recursive query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
>   -- non-recursive term
>   SELECT * FROM department WHERE name = 'A'
>   UNION ALL
>   -- recursive term
>   SELECT d.*
>   FROM
>     department AS d
>   JOIN
>     subdepartment AS sd
>     ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
[jira] [Commented] (SPARK-37872) [SQL] Some classes are moved from org.codehaus.janino:janino to org.codehaus.janino:commons-compiler after version 3.1.x
[ https://issues.apache.org/jira/browse/SPARK-37872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563599#comment-17563599 ]

gabrywu commented on SPARK-37872:
---------------------------------

Yes, janino 3.0.16 is out of date and not compatible with higher versions.

> [SQL] Some classes are moved from org.codehaus.janino:janino to org.codehaus.janino:commons-compiler after version 3.1.x
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37872
>                 URL: https://issues.apache.org/jira/browse/SPARK-37872
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5, 3.2.0
>            Reporter: Jin Shen
>            Priority: Major
>
> Here is the code:
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L32]
>
> ByteArrayClassLoader and InternalCompilerException were moved to org.codehaus.janino:commons-compiler:
> [https://github.com/janino-compiler/janino/blob/3.1.6/commons-compiler/src/main/java/org/codehaus/commons/compiler/util/reflect/ByteArrayClassLoader.java]
> [https://github.com/janino-compiler/janino/blob/3.1.6/commons-compiler/src/main/java/org/codehaus/commons/compiler/InternalCompilerException.java]
>
> The last working version of janino is 3.0.16, but it is out of date. Can we make a change and upgrade to new versions of janino and commons-compiler?
[jira] [Commented] (SPARK-39036) Support Alter Table/Partition Concatenate command
[ https://issues.apache.org/jira/browse/SPARK-39036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17532564#comment-17532564 ]

gabrywu commented on SPARK-39036:
---------------------------------

[~hyukjin.kwon] What do you think about that? Is anyone working on this to merge small files?

> Support Alter Table/Partition Concatenate command
> -------------------------------------------------
>
>                 Key: SPARK-39036
>                 URL: https://issues.apache.org/jira/browse/SPARK-39036
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0
>            Reporter: gabrywu
>            Priority: Major
>
> Hi, folks,
> In Hive, we can use the following command to merge small files; however, there is no corresponding command to do that in Spark SQL. I believe it's useful, and using AQE alone is not enough. Is anyone working on this to merge small files? If not, I want to create a PR to implement it.
> {code:sql}
> ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;{code}
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]
[jira] [Updated] (SPARK-39036) Support Alter Table/Partition Concatenate command
[ https://issues.apache.org/jira/browse/SPARK-39036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gabrywu updated SPARK-39036:
----------------------------
    Description: 
Hi, folks,

In Hive, we can use the following command to merge small files; however, there is no corresponding command to do that in Spark SQL. I believe it's useful, and using AQE alone is not enough. Is anyone working on this to merge small files? If not, I want to create a PR to implement it.
{code:sql}
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;{code}
[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]

  was:
Hi, folks,

In Hive, we can use the following command to merge small files; however, there is no corresponding command to do that. I believe it's useful, and using AQE alone is not enough. Is anyone working on this to merge small files? If not, I want to create a PR to implement it.
{code:sql}
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;{code}
[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]

> Support Alter Table/Partition Concatenate command
> -------------------------------------------------
>
>                 Key: SPARK-39036
>                 URL: https://issues.apache.org/jira/browse/SPARK-39036
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core, SQL
>    Affects Versions: 3.3.0
>            Reporter: gabrywu
>            Priority: Major
>
> Hi, folks,
> In Hive, we can use the following command to merge small files; however, there is no corresponding command to do that in Spark SQL. I believe it's useful, and using AQE alone is not enough. Is anyone working on this to merge small files? If not, I want to create a PR to implement it.
> {code:sql}
> ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;{code}
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]
[jira] [Created] (SPARK-39036) support Alter Table/Partition Concatenate command
gabrywu created SPARK-39036:
-------------------------------

             Summary: support Alter Table/Partition Concatenate command
                 Key: SPARK-39036
                 URL: https://issues.apache.org/jira/browse/SPARK-39036
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core, SQL
    Affects Versions: 3.3.0
            Reporter: gabrywu


Hi, folks,

In Hive, we can use the following command to merge small files; however, there is no corresponding command to do that. I believe it's useful, and using AQE alone is not enough. Is anyone working on this to merge small files? If not, I want to create a PR to implement it.
{code:sql}
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;{code}
[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]
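Until such a command exists, the usual workaround is to rewrite the data with fewer files. A rough sketch for a non-partitioned table (table_name is a placeholder; the REPARTITION hint is available since Spark 2.4, and the target file count is approximate, not guaranteed):

{code:sql}
-- Hive's small-file merge command, which this issue proposes for Spark SQL:
ALTER TABLE table_name CONCATENATE;

-- Approximate Spark SQL workaround: rewrite the table into roughly 10 files.
INSERT OVERWRITE TABLE table_name
SELECT /*+ REPARTITION(10) */ * FROM table_name;
{code}

Unlike CONCATENATE, the workaround is a full rewrite of the data (decode and re-encode), which is why a native merge command would still be valuable.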
[jira] [Created] (SPARK-39003) make AppHistoryServerPlugin public api for developer
gabrywu created SPARK-39003:
-------------------------------

             Summary: make AppHistoryServerPlugin public api for developer
                 Key: SPARK-39003
                 URL: https://issues.apache.org/jira/browse/SPARK-39003
             Project: Spark
          Issue Type: Wish
          Components: Spark Core
    Affects Versions: 3.1.0, 2.4.0, 2.3.0
            Reporter: gabrywu


For the history server, there is an interface called {{AppHistoryServerPlugin}}, which is loaded via SPI. However, it is only accessible within the Spark package, so can we change it to a public interface? With that, developers could extend the application history.
[jira] [Commented] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516421#comment-17516421 ]

gabrywu commented on SPARK-38769:
---------------------------------

No matter which UDF it works together with, I believe we should not change its behavior, right?

For example, the following JSON contains a field ato_long_v2; however, in other records it will be ato_long_v3, ato_long_v4, etc. We want to extract the version string as v2, v3, v4, and schema_of_json is used here:
{code:java}
{
  "tt_v1": 165,
  "tt_long_v2": 474,
  "ato_long_v2": 431,
  "tt_short_v2": 338,
  "ato_v1": 408,
  "ato_short_v2": 358,
  "sf_long_v3": 400,
  "sf_short_v3": 498
}{code}

> [SQL] behavior of schema_of_json not same with 2.4.0
> ----------------------------------------------------
>
>                 Key: SPARK-38769
>                 URL: https://issues.apache.org/jira/browse/SPARK-38769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: gabrywu
>            Priority: Minor
>
> When I switched to Spark 3.1.1 from Spark 2.4.0, I found a built-in function throws errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due to data type mismatch: The input json should be a foldable string expression and not null; however, got get_json_object(`adtnl_info_txt`, '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0. So, is it a bug, or a new feature that doesn't support non-literal expressions?
[jira] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-38769 ]

gabrywu deleted comment on SPARK-38769:
---------------------------------------

was (Author: gabry.wu):
[~maxgekk]

> [SQL] behavior of schema_of_json not same with 2.4.0
> ----------------------------------------------------
>
>                 Key: SPARK-38769
>                 URL: https://issues.apache.org/jira/browse/SPARK-38769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: gabrywu
>            Priority: Minor
>
> When I switched to Spark 3.1.1 from Spark 2.4.0, I found a built-in function throws errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due to data type mismatch: The input json should be a foldable string expression and not null; however, got get_json_object(`adtnl_info_txt`, '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0. So, is it a bug, or a new feature that doesn't support non-literal expressions?
[jira] [Comment Edited] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516421#comment-17516421 ]

gabrywu edited comment on SPARK-38769 at 4/3/22 2:20 AM:
---------------------------------------------------------

[~hyukjin.kwon] No matter which UDF it works together with, I believe we should not change its behavior, right?

For example, the following JSON contains a field ato_long_v2; however, in other records it will be ato_long_v3, ato_long_v4, etc. We want to extract the version string as v2, v3, v4, and schema_of_json is used here:
{code:java}
{
  "tt_v1": 165,
  "tt_long_v2": 474,
  "ato_long_v2": 431,
  "tt_short_v2": 338,
  "ato_v1": 408,
  "ato_short_v2": 358,
  "sf_long_v3": 400,
  "sf_short_v3": 498
}{code}

was (Author: gabry.wu):
No matter which UDF it works together with, I believe we should not change its behavior, right?

For example, the following JSON contains a field ato_long_v2; however, in other records it will be ato_long_v3, ato_long_v4, etc. We want to extract the version string as v2, v3, v4, and schema_of_json is used here:
{code:java}
{
  "tt_v1": 165,
  "tt_long_v2": 474,
  "ato_long_v2": 431,
  "tt_short_v2": 338,
  "ato_v1": 408,
  "ato_short_v2": 358,
  "sf_long_v3": 400,
  "sf_short_v3": 498
}{code}

> [SQL] behavior of schema_of_json not same with 2.4.0
> ----------------------------------------------------
>
>                 Key: SPARK-38769
>                 URL: https://issues.apache.org/jira/browse/SPARK-38769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: gabrywu
>            Priority: Minor
>
> When I switched to Spark 3.1.1 from Spark 2.4.0, I found a built-in function throws errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due to data type mismatch: The input json should be a foldable string expression and not null; however, got get_json_object(`adtnl_info_txt`, '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0. So, is it a bug, or a new feature that doesn't support non-literal expressions?
[jira] [Commented] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516219#comment-17516219 ]

gabrywu commented on SPARK-38769:
---------------------------------

[~maxgekk]

> [SQL] behavior of schema_of_json not same with 2.4.0
> ----------------------------------------------------
>
>                 Key: SPARK-38769
>                 URL: https://issues.apache.org/jira/browse/SPARK-38769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: gabrywu
>            Priority: Minor
>
> When I switched to Spark 3.1.1 from Spark 2.4.0, I found a built-in function throws errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due to data type mismatch: The input json should be a foldable string expression and not null; however, got get_json_object(`adtnl_info_txt`, '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0. So, is it a bug, or a new feature that doesn't support non-literal expressions?
[jira] [Created] (SPARK-38769) [SQL] behavior schema_of_json not same with 2.4.0
gabrywu created SPARK-38769:
-------------------------------

             Summary: [SQL] behavior schema_of_json not same with 2.4.0
                 Key: SPARK-38769
                 URL: https://issues.apache.org/jira/browse/SPARK-38769
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.1
            Reporter: gabrywu


When I switched to Spark 3.1.1 from Spark 2.4.0, I found a built-in function throws errors:
|== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due to data type mismatch: The input json should be a foldable string expression and not null; however, got get_json_object(`adtnl_info_txt`, '$.all_model_scores').; line 3 pos 2; |

But schema_of_json worked well in 2.4.0. So, is it a bug, or a new feature that doesn't support non-literal expressions?
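The behavior difference can be reproduced as follows; adtnl_info_txt and some_table come from the report above, and the schema string shown is illustrative (exact formatting varies by Spark version):

{code:sql}
-- Works in both 2.4 and 3.x: the argument is a foldable string literal,
-- so the result schema is known at analysis time.
SELECT schema_of_json('{"a": 1, "b": [0.5]}');
-- returns the schema as a DDL-formatted string, e.g. struct<a:bigint,b:array<double>>

-- Fails analysis in 3.x: the argument is a per-row column value, so the
-- analyzer cannot fold it into a constant schema.
SELECT schema_of_json(get_json_object(adtnl_info_txt, '$.all_model_scores'))
FROM some_table;
{code}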
[jira] [Updated] (SPARK-38769) [SQL] behavior of schema_of_json not same with 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-38769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gabrywu updated SPARK-38769:
----------------------------
    Summary: [SQL] behavior of schema_of_json not same with 2.4.0  (was: [SQL] behavior schema_of_json not same with 2.4.0)

> [SQL] behavior of schema_of_json not same with 2.4.0
> ----------------------------------------------------
>
>                 Key: SPARK-38769
>                 URL: https://issues.apache.org/jira/browse/SPARK-38769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: gabrywu
>            Priority: Minor
>
> When I switched to Spark 3.1.1 from Spark 2.4.0, I found a built-in function throws errors:
> |== Physical Plan == org.apache.spark.sql.AnalysisException: cannot resolve 'schema_of_json(get_json_object(`adtnl_info_txt`, '$.all_model_scores'))' due to data type mismatch: The input json should be a foldable string expression and not null; however, got get_json_object(`adtnl_info_txt`, '$.all_model_scores').; line 3 pos 2; |
> But schema_of_json worked well in 2.4.0. So, is it a bug, or a new feature that doesn't support non-literal expressions?
[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running
[ https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gabrywu updated SPARK-38258:
----------------------------
    Description: 
As we all know, table & column statistics are very important to the Spark SQL optimizer; however, we have to collect & update them using
{code:sql}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#FF0000}collect & update statistics automatically{color} when a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update the corresponding table statistics using SQL metrics, and in subsequent queries the Spark SQL optimizer can use these statistics.

It's a common case that we run daily batches using Spark SQL, so the same SQL runs every day, and the SQL and its corresponding tables' data change slowly. That means we can use the statistics updated yesterday to optimize today's SQLs, and of course also adjust important configs such as spark.sql.shuffle.partitions.

So we'd better add a mechanism to store every stage's statistics somewhere and use them in new SQLs, not just collect statistics after a stage finishes. Of course, we'd better {color:#FF0000}add a version number to the statistics{color} in case they become stale.

https://docs.google.com/document/d/1L48Dovynboi_ARu-OqQNJCOQqeVUTutLu8fo-w_ZPPA/edit#

  was:
As we all know, table & column statistics are very important to the Spark SQL optimizer; however, we have to collect & update them using
{code:sql}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#FF0000}collect & update statistics automatically{color} when a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update the corresponding table statistics using SQL metrics, and in subsequent queries the Spark SQL optimizer can use these statistics.

It's a common case that we run daily batches using Spark SQL, so the same SQL runs every day, and the SQL and its corresponding tables' data change slowly. That means we can use the statistics updated yesterday to optimize today's SQLs, and of course also adjust important configs such as spark.sql.shuffle.partitions.

So we'd better add a mechanism to store every stage's statistics somewhere and use them in new SQLs, not just collect statistics after a stage finishes. Of course, we'd better {color:#FF0000}add a version number to the statistics{color} in case they become stale.

> [proposal] collect & update statistics automatically when spark SQL is running
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-38258
>                 URL: https://issues.apache.org/jira/browse/SPARK-38258
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>            Reporter: gabrywu
>            Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark SQL optimizer; however, we have to collect & update them using
> {code:sql}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#FF0000}collect & update statistics automatically{color} when a Spark stage runs and finishes?
> For example, when an insert overwrite table statement finishes, we can update the corresponding table statistics using SQL metrics, and in subsequent queries the Spark SQL optimizer can use these statistics.
> It's a common case that we run daily batches using Spark SQL, so the same SQL runs every day, and the SQL and its corresponding tables' data change slowly. That means we can use the statistics updated yesterday to optimize today's SQLs, and of course also adjust important configs such as spark.sql.shuffle.partitions.
> So we'd better add a mechanism to store every stage's statistics somewhere and use them in new SQLs, not just collect statistics after a stage finishes. Of course, we'd better {color:#FF0000}add a version number to the statistics{color} in case they become stale.
> https://docs.google.com/document/d/1L48Dovynboi_ARu-OqQNJCOQqeVUTutLu8fo-w_ZPPA/edit#
[jira] [Commented] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running
[ https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501864#comment-17501864 ]

gabrywu commented on SPARK-38258:
---------------------------------

[~yumwang] What do you think of it?

> [proposal] collect & update statistics automatically when spark SQL is running
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-38258
>                 URL: https://issues.apache.org/jira/browse/SPARK-38258
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>            Reporter: gabrywu
>            Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark SQL optimizer; however, we have to collect & update them using
> {code:sql}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#FF0000}collect & update statistics automatically{color} when a Spark stage runs and finishes?
> For example, when an insert overwrite table statement finishes, we can update the corresponding table statistics using SQL metrics, and in subsequent queries the Spark SQL optimizer can use these statistics.
> It's a common case that we run daily batches using Spark SQL, so the same SQL runs every day, and the SQL and its corresponding tables' data change slowly. That means we can use the statistics updated yesterday to optimize today's SQLs, and of course also adjust important configs such as spark.sql.shuffle.partitions.
> So we'd better add a mechanism to store every stage's statistics somewhere and use them in new SQLs, not just collect statistics after a stage finishes. Of course, we'd better {color:#FF0000}add a version number to the statistics{color} in case they become stale.
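For reference, this is the manual flow the proposal wants to automate (some_table and the column names are placeholders):

{code:sql}
-- Collect table-level statistics (row count, size in bytes):
ANALYZE TABLE some_table COMPUTE STATISTICS;

-- Collect column-level statistics (distinct count, min/max, null count),
-- which cost-based optimization (spark.sql.cbo.enabled) relies on:
ANALYZE TABLE some_table COMPUTE STATISTICS FOR COLUMNS id, name;

-- Inspect the statistics the optimizer will use:
DESCRIBE EXTENDED some_table;
{code}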
[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running
[ https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gabrywu updated SPARK-38258:
----------------------------
    Description: 
As we all know, table & column statistics are very important to the Spark SQL optimizer; however, we have to collect & update them using
{code:sql}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#FF0000}collect & update statistics automatically{color} when a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update the corresponding table statistics using SQL metrics, and in subsequent queries the Spark SQL optimizer can use these statistics.

It's a common case that we run daily batches using Spark SQL, so the same SQL runs every day, and the SQL and its corresponding tables' data change slowly. That means we can use the statistics updated yesterday to optimize today's SQLs, and of course also adjust important configs such as spark.sql.shuffle.partitions.

So we'd better add a mechanism to store every stage's statistics somewhere and use them in new SQLs, not just collect statistics after a stage finishes. Of course, we'd better {color:#FF0000}add a version number to the statistics{color} in case they become stale.

  was:
As we all know, table & column statistics are very important to the Spark SQL optimizer; however, we have to collect & update them using
{code:sql}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#FF0000}collect & update statistics automatically{color} when a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update the corresponding table statistics using SQL metrics, and in subsequent queries the Spark SQL optimizer can use these statistics.

It's a common case that we run daily batches using Spark SQL, so the same SQL runs every day, and the SQL and its corresponding tables' data change slowly. That means we can use the statistics updated yesterday to optimize today's SQLs.

So we'd better add a mechanism to store every stage's statistics somewhere and use them in new SQLs, not just collect statistics after a stage finishes. Of course, we'd better {color:#FF0000}add a version number to the statistics{color} in case they become stale.

> [proposal] collect & update statistics automatically when spark SQL is running
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-38258
>                 URL: https://issues.apache.org/jira/browse/SPARK-38258
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>            Reporter: gabrywu
>            Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark SQL optimizer; however, we have to collect & update them using
> {code:sql}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#FF0000}collect & update statistics automatically{color} when a Spark stage runs and finishes?
> For example, when an insert overwrite table statement finishes, we can update the corresponding table statistics using SQL metrics, and in subsequent queries the Spark SQL optimizer can use these statistics.
> It's a common case that we run daily batches using Spark SQL, so the same SQL runs every day, and the SQL and its corresponding tables' data change slowly. That means we can use the statistics updated yesterday to optimize today's SQLs, and of course also adjust important configs such as spark.sql.shuffle.partitions.
> So we'd better add a mechanism to store every stage's statistics somewhere and use them in new SQLs, not just collect statistics after a stage finishes. Of course, we'd better {color:#FF0000}add a version number to the statistics{color} in case they become stale.
[jira] [Created] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running
gabrywu created SPARK-38258:
---
Summary: [proposal] collect & update statistics automatically when spark SQL is running
Key: SPARK-38258
URL: https://issues.apache.org/jira/browse/SPARK-38258
Project: Spark
Issue Type: Wish
Components: Spark Core, SQL
Affects Versions: 3.2.0, 3.1.0, 3.0.0
Reporter: gabrywu

Table & column statistics are very important to the Spark SQL optimizer; however, we have to collect & update them manually using

{code:java}
analyze table tableName compute statistics{code}

It's a little inconvenient, so why can't we collect & update statistics when a Spark stage runs and finishes? For example, when an insert overwrite table statement finishes, we can update the corresponding table statistics using SQL metrics, and in subsequent queries the Spark SQL optimizer can use these statistics.

So what do you think of it, [~yumwang]?
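For reference, the manual flow this proposal would automate looks like the following (db.tableA and its columns are placeholders):

{code:sql}
-- Collect table-level and column-level statistics by hand:
ANALYZE TABLE db.tableA COMPUTE STATISTICS;
ANALYZE TABLE db.tableA COMPUTE STATISTICS FOR COLUMNS id, name;

-- Inspect the statistics the optimizer will see (row count, size in bytes):
DESCRIBE EXTENDED db.tableA;
{code}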
[jira] [Commented] (SPARK-28329) SELECT INTO syntax
[ https://issues.apache.org/jira/browse/SPARK-28329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491884#comment-17491884 ] gabrywu commented on SPARK-28329:
-

[~smilegator] Is there a plan to support SELECT INTO a scalar variable? I think it's useful for optimizing SQL like this:

{code:SQL}
select max(id) into ${max_id} from db.tableA;
select * from db.tableB where id >= ${max_id};
{code}

It's better than the following SQL, because the filter id >= ${max_id} can be pushed down:

{code:SQL}
select * from db.tableB where id >= (select max(id) from db.tableA);
{code}

> SELECT INTO syntax
> --
>
> Key: SPARK-28329
> URL: https://issues.apache.org/jira/browse/SPARK-28329
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Priority: Major
>
> h2. Synopsis
> {noformat}
> [ WITH [ RECURSIVE ] with_query [, ...] ]
> SELECT [ ALL | DISTINCT [ ON ( expression [, ...] ) ] ]
> * | expression [ [ AS ] output_name ] [, ...]
> INTO [ TEMPORARY | TEMP | UNLOGGED ] [ TABLE ] new_table
> [ FROM from_item [, ...] ]
> [ WHERE condition ]
> [ GROUP BY expression [, ...] ]
> [ HAVING condition [, ...] ]
> [ WINDOW window_name AS ( window_definition ) [, ...] ]
> [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
> [ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
> [ LIMIT { count | ALL } ]
> [ OFFSET start [ ROW | ROWS ] ]
> [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
> [ FOR { UPDATE | SHARE } [ OF table_name [, ...] ] [ NOWAIT ] [...] ]
> {noformat}
> h2. Description
> {{SELECT INTO}} creates a new table and fills it with data computed by a query. The data is not returned to the client, as it is with a normal {{SELECT}}. The new table's columns have the names and data types associated with the output columns of the {{SELECT}}.
>
> {{CREATE TABLE AS}} offers a superset of the functionality offered by {{SELECT INTO}}.
> [https://www.postgresql.org/docs/11/sql-selectinto.html]
> [https://www.postgresql.org/docs/11/sql-createtableas.html]
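As the quoted description notes, {{CREATE TABLE AS}} is a superset of SELECT INTO's table-creating form, and Spark SQL already supports CTAS; it is only the scalar-variable form above that has no direct equivalent. A sketch of the CTAS form (table names are placeholders):

{code:SQL}
-- PostgreSQL-style: SELECT * INTO new_table FROM db.tableB WHERE id >= 100;
-- Spark SQL equivalent via CTAS:
CREATE TABLE db.new_table AS
SELECT * FROM db.tableB WHERE id >= 100;
{code}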