[jira] [Commented] (SPARK-44817) SPIP: Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827739#comment-17827739 ] Rakesh Raushan commented on SPARK-44817: [~cloud_fan] [~dongjoon] What do you think about the proposal? Does this sound useful? > SPIP: Incremental Stats Collection > -- > > Key: SPARK-44817 > URL: https://issues.apache.org/jira/browse/SPARK-44817 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.5.0, 4.0.0 > Reporter: Rakesh Raushan > Priority: Major > > Spark's Cost Based Optimizer depends on table and column statistics. After every DML query, table and column stats are invalidated unless automatic stats collection is turned on. Keeping stats up to date requires running the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive; it is not feasible to run it after every DML query. Instead, we can incrementally update the stats during each DML query run itself. That way table and column stats stay fresh at all times and CBO benefits can be applied. Initially we can update only table-level stats, and gradually start updating column-level stats as well. > *Pros:* > 1. Optimizes queries over tables that are updated frequently. > 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE STATISTICS` for updating stats. > [SPIP Document|https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
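The core idea of the proposal can be sketched outside Spark: instead of recomputing stats with a full table scan, merge the stats of the data written by a DML operation into the existing table-level stats. The `TableStats` class and `merge_after_insert` helper below are hypothetical illustrations of that bookkeeping, not Spark's actual catalog API.

```python
from dataclasses import dataclass

@dataclass
class TableStats:
    row_count: int
    size_in_bytes: int

def merge_after_insert(current: TableStats, inserted: TableStats) -> TableStats:
    # An INSERT only appends data, so table-level stats can be derived
    # from the written files alone -- no full `ANALYZE TABLE` scan needed.
    return TableStats(
        row_count=current.row_count + inserted.row_count,
        size_in_bytes=current.size_in_bytes + inserted.size_in_bytes,
    )

stats = TableStats(row_count=1_000_000, size_in_bytes=256 * 1024 * 1024)
batch = TableStats(row_count=10_000, size_in_bytes=2 * 1024 * 1024)
stats = merge_after_insert(stats, batch)
print(stats.row_count)  # 1010000
```

Column-level stats are harder, which is presumably why the proposal stages them later: counts and min/max merge cheaply, but distinct counts need a sketch such as HyperLogLog, and DELETE/UPDATE cannot be folded in exactly without rescanning.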
[jira] [Updated] (SPARK-44817) SPIP: Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-44817: --- Summary: SPIP: Incremental Stats Collection (was: Incremental Stats Collection)
[jira] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817 ] Rakesh Raushan deleted comment on SPARK-44817: was (Author: rakson): [~gurwls223] [~cloud_fan] [~dongjoon] I have added an SPIP document. Does this feature seem useful to you?
[jira] [Commented] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766213#comment-17766213 ] Rakesh Raushan commented on SPARK-44817: [~gurwls223] [~cloud_fan] [~dongjoon] I have added an SPIP document. Does this feature seem useful to you?
[jira] [Comment Edited] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759229#comment-17759229 ] Rakesh Raushan edited comment on SPARK-44817 at 8/26/23 9:02 AM: - [~gurwls223] [~cloud_fan] Added SPIP Document. Link for the document: [https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing] was (Author: rakson): Added SPIP Document. Link for the document: https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
[jira] [Updated] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-44817: --- Description: removed the stray trailing word "added" after the [SPIP Document|https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing] link; the description is otherwise unchanged.
[jira] [Updated] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-44817: --- Description: appended the [SPIP Document|https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing] link; the description is otherwise unchanged.
[jira] [Commented] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759229#comment-17759229 ] Rakesh Raushan commented on SPARK-44817: Added SPIP Document. Link for the document: https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing
[jira] [Commented] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757539#comment-17757539 ] Rakesh Raushan commented on SPARK-44817: Sure. I will try to come up with an SPIP by this weekend.
[jira] [Updated] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-44817: --- Affects Version/s: 3.5.0
[jira] [Comment Edited] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754694#comment-17754694 ] Rakesh Raushan edited comment on SPARK-44817 at 8/16/23 1:08 PM: - [~cloud_fan] [~gurwls223] [~maxgekk] [~dongjoon] What are your thoughts on this? If this looks promising, I can work on raising a PR for it. was (Author: rakson): [~cloud_fan] [~gurwls223] [~maxgekk] What are your thoughts over this ? If this looks promising, i can work on raising PR for this.
[jira] [Commented] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754694#comment-17754694 ] Rakesh Raushan commented on SPARK-44817: [~cloud_fan] [~gurwls223] [~maxgekk] What are your thoughts on this? If this looks promising, I can work on raising a PR for it.
[jira] [Updated] (SPARK-44817) Incremental Stats Collection
[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-44817: --- Description: added the sentence "Initially, we can only update table level stats and gradually start updating column level stats as well."; the description is otherwise unchanged.
[jira] [Created] (SPARK-44817) Incremental Stats Collection
Rakesh Raushan created SPARK-44817: -- Summary: Incremental Stats Collection Key: SPARK-44817 URL: https://issues.apache.org/jira/browse/SPARK-44817 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Rakesh Raushan
[jira] [Comment Edited] (SPARK-37840) Dynamically update the loaded Hive UDF JAR
[ https://issues.apache.org/jira/browse/SPARK-37840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472537#comment-17472537 ] Rakesh Raushan edited comment on SPARK-37840 at 1/11/22, 8:08 AM: -- [~cutiechi] The problem is with `jarClassLoader`: it needs to be rebuilt after the updated jar is added. was (Author: rakson): The problem is with `jarClassLoader`. We need to update our `jarClassLoader` after updated jar is added. > Dynamically update the loaded Hive UDF JAR > -- > > Key: SPARK-37840 > URL: https://issues.apache.org/jira/browse/SPARK-37840 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.3.0 > Reporter: melin > Priority: Major > > In the production environment, Spark ThriftServer needs to be restarted if jar files are updated after the UDF jars have been loaded.
[jira] [Commented] (SPARK-37840) Dynamically update the loaded Hive UDF JAR
[ https://issues.apache.org/jira/browse/SPARK-37840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472537#comment-17472537 ] Rakesh Raushan commented on SPARK-37840: The problem is with `jarClassLoader`. We need to update our `jarClassLoader` after the updated jar is added.
[jira] [Commented] (SPARK-37840) Dynamically update the loaded Hive UDF JAR
[ https://issues.apache.org/jira/browse/SPARK-37840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470793#comment-17470793 ] Rakesh Raushan commented on SPARK-37840: We can dynamically update our UDF jars after loading them. I will try to raise a PR soon for this.
[jira] [Comment Edited] (SPARK-32924) Web UI sort on duration is wrong
[ https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210977#comment-17210977 ] Rakesh Raushan edited comment on SPARK-32924 at 10/9/20, 1:49 PM: -- I think it's due to string sorting. A similar issue was fixed in SPARK-31983 was (Author: rakson): I thinking its due to string sorting. One similar issue is fixed here [SPARK-31983|https://issues.apache.org/jira/browse/SPARK-31983] > Web UI sort on duration is wrong > > > Key: SPARK-32924 > URL: https://issues.apache.org/jira/browse/SPARK-32924 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.6 >Reporter: t oo >Priority: Major > Attachments: ui_sort.png > > > See attachment, 9 s(econds) is showing as larger than 8.1min -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32924) Web UI sort on duration is wrong
[ https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210977#comment-17210977 ] Rakesh Raushan commented on SPARK-32924: I think it's due to string sorting. A similar issue was fixed in [SPARK-31983|https://issues.apache.org/jira/browse/SPARK-31983] > Web UI sort on duration is wrong > > > Key: SPARK-32924 > URL: https://issues.apache.org/jira/browse/SPARK-32924 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.6 >Reporter: t oo >Priority: Major > Attachments: ui_sort.png > > > See attachment, 9 s(econds) is showing as larger than 8.1min -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32694) Pushdown cast to data sources
[ https://issues.apache.org/jira/browse/SPARK-32694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183719#comment-17183719 ] Rakesh Raushan commented on SPARK-32694: One of the proposed [solutions|https://github.com/apache/spark/pull/27648] for a similar issue. > Pushdown cast to data sources > - > > Key: SPARK-32694 > URL: https://issues.apache.org/jira/browse/SPARK-32694 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Major > > Currently we don't support pushing down cast to data source (see > [link|http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSql-Casting-of-Predicate-Literals-tp29956p30035.html] > for a discussion). For instance, in the following code snippet: > {code} > scala> case class Person(name: String, age: Short) > scala> Seq(Person("John", 32), Person("David", 25), Person("Mike", > 18)).toDS().write.parquet("/tmp/person.parquet") > scala> val personDS = spark.read.parquet("/tmp/person.parquet") > scala> personDS.createOrReplaceTempView("person") > scala> spark.sql("SELECT * FROM person where age < 30") > {code} > The predicate won't be pushed down to Parquet data source because in > {{DataSourceStrategy}}, {{PushableColumnBase}} only handles a few limited > cases such as {{Attribute}} and {{GetStructField}}. Potentially we can handle > {{Cast}} here as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
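The quoted description explains why the predicate stops at {{PushableColumnBase}}: translation only recognizes a few expression shapes, and the analyzer wraps the short column in a cast. A toy Python model of that dispatch (class and function names here are illustrative, not Spark's actual API) shows what "handling {{Cast}}" would add:

```python
# Toy model of predicate translation in DataSourceStrategy: a filter is
# pushed down only when the column side of the comparison has a shape the
# translator recognizes. Names are illustrative, not Spark's real classes.
from dataclasses import dataclass

@dataclass
class Attribute:          # a plain column reference, e.g. `age`
    name: str

@dataclass
class Cast:               # a cast wrapped around an expression
    child: object
    target_type: str

def pushable_column(expr, handle_cast=False):
    """Return the column name if `expr` is a pushable column, else None."""
    if isinstance(expr, Attribute):
        return expr.name
    if handle_cast and isinstance(expr, Cast) and isinstance(expr.child, Attribute):
        # The proposal: look through the cast (safe casts only, in practice).
        return expr.child.name
    return None

# `age < 30` with age: SHORT is analyzed as Cast(age, "int") < 30.
predicate_column = Cast(Attribute("age"), "int")
print(pushable_column(predicate_column))                    # None -> no pushdown
print(pushable_column(predicate_column, handle_cast=True))  # 'age' -> pushdown
```

This is only a sketch of the dispatch shape; the real change also has to decide which casts are safe to look through so the pushed filter does not change query semantics.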
[jira] [Created] (SPARK-31983) Tables of structured streaming tab show wrong result for duration column
Rakesh Raushan created SPARK-31983: -- Summary: Tables of structured streaming tab show wrong result for duration column Key: SPARK-31983 URL: https://issues.apache.org/jira/browse/SPARK-31983 Project: Spark Issue Type: Bug Components: SQL, Web UI Affects Versions: 3.0.0 Reporter: Rakesh Raushan The sorting result for the duration column in the tables of the structured streaming tab is sometimes wrong, because we sort on string values. Consider "3ms" and "12ms": lexicographically, "12ms" sorts before "3ms". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
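The bug class described here is easy to demonstrate outside Spark: sorting duration labels as strings compares characters, not magnitudes. A minimal sketch (the unit table is illustrative, not the UI's actual formatting code):

```python
# Durations rendered as strings, the way the Web UI tables display them.
durations = ["3ms", "12ms", "9 s", "8.1 min"]

# Lexicographic sort: "12ms" sorts before "3ms" because '1' < '3', and
# "8.1 min" sorts before "9 s" even though it is far longer.
print(sorted(durations))  # ['12ms', '3ms', '8.1 min', '9 s']

# The fix is to sort on the underlying magnitude, e.g. in milliseconds.
units = {"ms": 1, "s": 1_000, "min": 60_000}

def to_millis(label: str) -> float:
    # Try the longest suffix first so "ms" is not mistaken for "s".
    for suffix in sorted(units, key=len, reverse=True):
        if label.endswith(suffix):
            return float(label[: -len(suffix)].strip()) * units[suffix]
    raise ValueError(f"unrecognized duration: {label}")

print(sorted(durations, key=to_millis))  # ['3ms', '12ms', '9 s', '8.1 min']
```

The actual fix in Spark's UI sorts on the raw duration value and only formats the human-readable label for display.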
[jira] [Commented] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120335#comment-17120335 ] Rakesh Raushan commented on SPARK-31873: Yeah. This is a problem with 2.4.5.
{code:java}
scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1299|
+-------------------+-------------------+--------+
{code}
[~hyukjin.kwon] Does this need to be fixed in 2.4.5? If so, I can check this. > Spark Sql Function year does not extract year from date/timestamp > - > > Key: SPARK-31873 > URL: https://issues.apache.org/jira/browse/SPARK-31873 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Deepak Shingavi >Priority: Major > > There is a Spark SQL function > org.apache.spark.sql.functions.year which fails in below case > > {code:java} > // Code to extract year from Timestamp > val df = Seq( > ("1300-01-03 00:00:00") > ).toDF("date_val") > .withColumn("date_val_ts", to_timestamp(col("date_val"))) > .withColumn("year_val", year(to_timestamp(col("date_val" > df.show() > //Output of the above code > +---+---++ > | date_val|date_val_ts|year_val| > +---+---++ > |1300-01-03 00:00:00|1300-01-03 00:00:00|1299| > +---+---++ > {code} > > The above code works perfectly for all the years greater than 1300 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120301#comment-17120301 ] Rakesh Raushan edited comment on SPARK-31873 at 5/30/20, 4:28 PM: --
{code:java}
scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1300|
+-------------------+-------------------+--------+
{code}
This works fine with the master branch. was (Author: rakson): scala> val df = Seq(("1300-01-03 00:00:00") ).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val" df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field] scala> df.show +---+---++ | date_val| date_val_ts|year_val| +---+---++ |1300-01-03 00:00:00|1300-01-03 00:00:00| 1300| +---+---++ This works fine with master branch. 
> Spark Sql Function year does not extract year from date/timestamp > - > > Key: SPARK-31873 > URL: https://issues.apache.org/jira/browse/SPARK-31873 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Deepak Shingavi >Priority: Major > > There is a Spark SQL function > org.apache.spark.sql.functions.year which fails in below case > > {code:java} > // Code to extract year from Timestamp > val df = Seq( > ("1300-01-03 00:00:00") > ).toDF("date_val") > .withColumn("date_val_ts", to_timestamp(col("date_val"))) > .withColumn("year_val", year(to_timestamp(col("date_val" > df.show() > //Output of the above code > +---+---++ > | date_val|date_val_ts|year_val| > +---+---++ > |1300-01-03 00:00:00|1300-01-03 00:00:00|1299| > +---+---++ > {code} > > The above code works perfectly for all the years greater than 1300 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120301#comment-17120301 ] Rakesh Raushan commented on SPARK-31873:
scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1300|
+-------------------+-------------------+--------+

This works fine with the master branch. > Spark Sql Function year does not extract year from date/timestamp > - > > Key: SPARK-31873 > URL: https://issues.apache.org/jira/browse/SPARK-31873 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Deepak Shingavi >Priority: Major > > There is a Spark SQL function > org.apache.spark.sql.functions.year which fails in below case > > {code:java} > // Code to extract year from Timestamp > val df = Seq( > ("1300-01-03 00:00:00") > ).toDF("date_val") > .withColumn("date_val_ts", to_timestamp(col("date_val"))) > .withColumn("year_val", year(to_timestamp(col("date_val" > df.show() > //Output of the above code > +---+---++ > | date_val|date_val_ts|year_val| > +---+---++ > |1300-01-03 00:00:00|1300-01-03 00:00:00|1299| > +---+---++ > {code} > > The above code works perfectly for all the years greater than 1300 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
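A plausible explanation for the 1299 result on 2.4.5 (an inference, not stated in this thread) is the calendar switch: Spark 2.4 processed timestamps through Java's hybrid Julian/Gregorian calendar, while Spark 3.0 uses the proleptic Gregorian calendar. In early 1300 the Julian calendar lags the proleptic Gregorian calendar by about a week, so the same physical day carries a Julian label in the previous year:

```python
from datetime import date, timedelta

# Proleptic Gregorian reading (java.time, Spark 3.x): the year is simply 1300.
gregorian = date(1300, 1, 3)

# Under the hybrid Julian/Gregorian calendar (java.util, Spark 2.4.x),
# dates before the 1582 cutover are Julian. In January 1300 the Julian
# calendar is 7 days behind the proleptic Gregorian calendar (the offset
# is hardcoded here for illustration), so the same day is labelled:
julian_label = gregorian - timedelta(days=7)
print(julian_label)       # 1299-12-27
print(julian_label.year)  # 1299 -- matching the value year() returned on 2.4.5
```

This also matches the reporter's observation that years after 1582 (well past the cutover) behave correctly.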
[jira] [Commented] (SPARK-31763) DataFrame.inputFiles() not Available
[ https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116738#comment-17116738 ] Rakesh Raushan commented on SPARK-31763: Shall I open a PR for this? > DataFrame.inputFiles() not Available > > > Key: SPARK-31763 > URL: https://issues.apache.org/jira/browse/SPARK-31763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Priority: Major > > I have been trying to list inputFiles that compose my DataSet by using > *PySpark* > spark_session.read > .format(sourceFileFormat) > .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix) > *.inputFiles();* > but I get an exception saying inputFiles attribute not present. But I was > able to get this functionality with Spark Java. > *So is this something missing in PySpark?* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31688) Refactor pagination framework for spark web UI pages
Rakesh Raushan created SPARK-31688: -- Summary: Refactor pagination framework for spark web UI pages Key: SPARK-31688 URL: https://issues.apache.org/jira/browse/SPARK-31688 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.1.0 Reporter: Rakesh Raushan Currently, a large chunk of code is copied when we implement pagination using the current pagination framework. We also embed a lot of HTML, which decreases code readability. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31104) Add documentation for all new Json Functions
[ https://issues.apache.org/jira/browse/SPARK-31104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102438#comment-17102438 ] Rakesh Raushan commented on SPARK-31104: [~hyukjin.kwon] We can mark this as resolved as this task has already been completed. > Add documentation for all new Json Functions > > > Key: SPARK-31104 > URL: https://issues.apache.org/jira/browse/SPARK-31104 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31470) Introduce SORTED BY clause in CREATE TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-31470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102437#comment-17102437 ] Rakesh Raushan commented on SPARK-31470: If this is required by the community and [~yumwang] has not started working on it, I can work on this. [~yumwang] What do you say? > Introduce SORTED BY clause in CREATE TABLE statement > > > Key: SPARK-31470 > URL: https://issues.apache.org/jira/browse/SPARK-31470 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We usually sort on frequently filtered columns when writing data to improve > query performance. But this info is not present in the table information. > > {code:sql} > CREATE TABLE t(day INT, hour INT, year INT, month INT) > USING parquet > PARTITIONED BY (year, month) > SORTED BY (day, hour); > {code} > > Impala, Oracle and redshift support this clause: > https://issues.apache.org/jira/browse/IMPALA-4166 > https://docs.oracle.com/database/121/DWHSG/attcluster.htm#GUID-DAECFBC5-FD1A-45A5-8C2C-DC9884D0857B > https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31642) Support pagination for spark structured streaming tab
[ https://issues.apache.org/jira/browse/SPARK-31642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099961#comment-17099961 ] Rakesh Raushan commented on SPARK-31642: I am working on it > Support pagination for spark structured streaming tab > -- > > Key: SPARK-31642 > URL: https://issues.apache.org/jira/browse/SPARK-31642 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.1.0 >Reporter: jobit mathew >Priority: Minor > > Support pagination for spark structured streaming tab -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31638) Clean code for pagination for all pages
[ https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31638: --- Description: Clean code for pagination for different pages of spark webUI (was: Clean code for pagination for different pages of spark web) > Clean code for pagination for all pages > --- > > Key: SPARK-31638 > URL: https://issues.apache.org/jira/browse/SPARK-31638 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Minor > > Clean code for pagination for different pages of spark webUI -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31638) Clean code for pagination for all pages
[ https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31638: --- Description: Clean code for pagination for different pages of spark web > Clean code for pagination for all pages > --- > > Key: SPARK-31638 > URL: https://issues.apache.org/jira/browse/SPARK-31638 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Minor > > Clean code for pagination for different pages of spark web -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31638) Clean code for pagination for all pages
Rakesh Raushan created SPARK-31638: -- Summary: Clean code for pagination for all pages Key: SPARK-31638 URL: https://issues.apache.org/jira/browse/SPARK-31638 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.1.0 Environment: Clean code for pagination for different pages of webUI Reporter: Rakesh Raushan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31638) Clean code for pagination for all pages
[ https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31638: --- Environment: (was: Clean code for pagination for different pages of webUI) > Clean code for pagination for all pages > --- > > Key: SPARK-31638 > URL: https://issues.apache.org/jira/browse/SPARK-31638 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31478) Executors Stop() method is not executed when they are killed
Rakesh Raushan created SPARK-31478: -- Summary: Executors Stop() method is not executed when they are killed Key: SPARK-31478 URL: https://issues.apache.org/jira/browse/SPARK-31478 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: Rakesh Raushan In dynamic allocation, when executors are killed, the executors' stop() method is never called, so the executors never shut down properly. In SPARK-29152, a shutdown hook was added to stop the executors properly. Instead of forcing a shutdown hook, we should ask executors to stop themselves before killing them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079268#comment-17079268 ] Rakesh Raushan commented on SPARK-31397: The same functionality can be achieved using to/from_json, and the performance will be almost equivalent, so we do not need to implement this new function. > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
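The composition the comment refers to can be sketched as follows. In Spark SQL it would look roughly like to_json(collect_list(...)) after a groupBy (my phrasing of the equivalence, not code from the ticket); the plain-Python model below shows the shape of the result a json_arrayAgg function would produce:

```python
import json
from collections import defaultdict

# Sample (group_key, value) rows of a hypothetical table.
rows = [("a", 1), ("a", 2), ("b", 3)]

# json_arrayAgg would aggregate each group's values into a JSON array.
# In Spark SQL the same result is expressible with existing functions,
# roughly: df.groupBy("k").agg(to_json(collect_list("v"))) -- a sketch,
# not tested Spark code.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

result = {key: json.dumps(values) for key, values in groups.items()}
print(result)  # {'a': '[1, 2]', 'b': '[3]'}
```

Since collecting then serializing is a single pass over each group, the performance is comparable to what a dedicated aggregate would do, which is the argument made in the comment.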
[jira] [Issue Comment Deleted] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31397: --- Comment: was deleted (was: I am working on it.) > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079267#comment-17079267 ] Rakesh Raushan commented on SPARK-31396: We can achieve the same functionality using to/from_json. So we do not need this new function. > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs by aggregating the > key-values of set of Objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31396: --- Comment: was deleted (was: I am working on it.) > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs by aggregating the > key-values of set of Objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31397) Support json_arrayAgg
[ https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079226#comment-17079226 ] Rakesh Raushan commented on SPARK-31397: I am working on it. > Support json_arrayAgg > - > > Key: SPARK-31397 > URL: https://issues.apache.org/jira/browse/SPARK-31397 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON array by aggregating all the JSON arrays from a set of JSON > arrays, or by aggregating the values of a Column. > Some of the Databases supporting this aggregate function are: > * MySQL > * PostgreSQL > * Maria_DB > * Sqlite > * IBM Db2 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31397) Support json_arrayAgg
Rakesh Raushan created SPARK-31397: -- Summary: Support json_arrayAgg Key: SPARK-31397 URL: https://issues.apache.org/jira/browse/SPARK-31397 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Returns a JSON array by aggregating all the JSON arrays from a set of JSON arrays, or by aggregating the values of a Column. Some of the Databases supporting this aggregate function are: * MySQL * PostgreSQL * Maria_DB * Sqlite * IBM Db2 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31396) Support json_objectAgg function
[ https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079198#comment-17079198 ] Rakesh Raushan commented on SPARK-31396: I am working on it. > Support json_objectAgg function > --- > > Key: SPARK-31396 > URL: https://issues.apache.org/jira/browse/SPARK-31396 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Returns a JSON object containing the key-value pairs by aggregating the > key-values of set of Objects or columns. > > This aggregate function is supported by: > * MySQL > * PostgreSQL > * IBM Db2 > * Maria_DB > * Sqlite -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31396) Support json_objectAgg function
Rakesh Raushan created SPARK-31396: -- Summary: Support json_objectAgg function Key: SPARK-31396 URL: https://issues.apache.org/jira/browse/SPARK-31396 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Returns a JSON object containing the key-value pairs by aggregating the key-values of set of Objects or columns. This aggregate function is supported by: * MySQL * PostgreSQL * IBM Db2 * Maria_DB * Sqlite -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31106) Support is_json function
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31106: --- Description: This function will allow users to verify whether the given string is valid JSON or not. It returns `true` for valid JSON and `false` for invalid JSON. `NULL` is returned for `NULL` input. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB * Amazon Redshift * IBM Db2 was: Currently, null is returned when we come across invalid json. We should either throw an exception for invalid json or false should be returned, like in other DBMSs. Like in `json_array_length` function we need to return NULL for null array. So this might confuse users. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB * Amazon Redshift > Support is_json function > > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will allow users to verify whether the given string is valid > JSON or not. It returns `true` for valid JSON and `false` for invalid JSON. > `NULL` is returned for `NULL` input. > DBMSs supporting this functions are : > * MySQL > * SQL Server > * Sqlite > * MariaDB > * Amazon Redshift > * IBM Db2 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
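The semantics in the updated description (`true` for valid JSON, `false` for invalid JSON, `NULL` for `NULL` input) can be modelled in a few lines. This is a plain-Python sketch of the proposed behavior, not Spark's implementation:

```python
import json
from typing import Optional

def is_json(s: Optional[str]) -> Optional[bool]:
    """True for valid JSON, False for invalid JSON, None for NULL input."""
    if s is None:
        return None
    try:
        json.loads(s)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

print(is_json('{"a": 1}'))  # True
print(is_json('{a: 1}'))    # False
print(is_json(None))        # None
```

Returning NULL for NULL input (rather than false) follows the usual SQL convention for null propagation, which is the behavior the description settles on.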
[jira] [Updated] (SPARK-31106) Support is_json function
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31106: --- Summary: Support is_json function (was: Support IS_JSON) > Support is_json function > > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Currently, null is returned when we come across invalid json. We should > either throw an exception for invalid json or false should be returned, like > in other DBMSs. Like in `json_array_length` function we need to return NULL > for null array. So this might confuse users. > > DBMSs supporting this functions are : > * MySQL > * SQL Server > * Sqlite > * MariaDB > * Amazon Redshift -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31106) Support IS_JSON
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31106: --- Description: Currently, null is returned when we come across invalid json. We should either throw an exception for invalid json or false should be returned, like in other DBMSs. Like in `json_array_length` function we need to return NULL for null array. So this might confuse users. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB * Amazon Redshift was: Currently, null is returned when we come across invalid json. We should either throw an exception for invalid json or false should be returned, like in other DBMSs. Like in `json_array_length` function we need to return NULL for null array. So this might confuse users. DBMSs supporting this functions are : * MySQL * SQL Server * Sqlite * MariaDB > Support IS_JSON > --- > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Currently, null is returned when we come across invalid json. We should > either throw an exception for invalid json or false should be returned, like > in other DBMSs. Like in `json_array_length` function we need to return NULL > for null array. So this might confuse users. > > DBMSs supporting this functions are : > * MySQL > * SQL Server > * Sqlite > * MariaDB > * Amazon Redshift -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31369) Add Documentation for JSON functions
[ https://issues.apache.org/jira/browse/SPARK-31369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076930#comment-17076930 ] Rakesh Raushan commented on SPARK-31369: I am working on it. > Add Documentation for JSON functions > > > Key: SPARK-31369 > URL: https://issues.apache.org/jira/browse/SPARK-31369 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31369) Add Documentation for JSON functions
[ https://issues.apache.org/jira/browse/SPARK-31369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31369: --- Parent: SPARK-28588 Issue Type: Sub-task (was: Documentation) > Add Documentation for JSON functions > > > Key: SPARK-31369 > URL: https://issues.apache.org/jira/browse/SPARK-31369 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31369) Add Documentation for JSON functions
Rakesh Raushan created SPARK-31369: -- Summary: Add Documentation for JSON functions Key: SPARK-31369 URL: https://issues.apache.org/jira/browse/SPARK-31369 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31104) Add documentation for all new Json Functions
[ https://issues.apache.org/jira/browse/SPARK-31104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31104: --- Summary: Add documentation for all new Json Functions (was: Add documentation for all the Json Functions) > Add documentation for all new Json Functions > > > Key: SPARK-31104 > URL: https://issues.apache.org/jira/browse/SPARK-31104 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31104) Add documentation for all the Json Functions
[ https://issues.apache.org/jira/browse/SPARK-31104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056584#comment-17056584 ] Rakesh Raushan commented on SPARK-31104: I am working on it. > Add documentation for all the Json Functions > > > Key: SPARK-31104 > URL: https://issues.apache.org/jira/browse/SPARK-31104 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31106) Support IS_JSON
[ https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055757#comment-17055757 ] Rakesh Raushan commented on SPARK-31106: I am working on it. > Support IS_JSON > --- > > Key: SPARK-31106 > URL: https://issues.apache.org/jira/browse/SPARK-31106 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > Currently, NULL is returned when we encounter invalid JSON. We should > either throw an exception for invalid JSON or return false, as other DBMSs > do. In functions like `json_array_length` we need to return NULL for a null > array, so returning NULL for invalid JSON as well might confuse users. > > DBMSs supporting this function include: > * MySQL > * SQL Server > * SQLite > * MariaDB -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31106) Support IS_JSON
Rakesh Raushan created SPARK-31106: -- Summary: Support IS_JSON Key: SPARK-31106 URL: https://issues.apache.org/jira/browse/SPARK-31106 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Currently, NULL is returned when we encounter invalid JSON. We should either throw an exception for invalid JSON or return false, as other DBMSs do. In functions like `json_array_length` we need to return NULL for a null array, so returning NULL for invalid JSON as well might confuse users. DBMSs supporting this function include: * MySQL * SQL Server * SQLite * MariaDB -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31104) Add documentation for all the Json Functions
Rakesh Raushan created SPARK-31104: -- Summary: Add documentation for all the Json Functions Key: SPARK-31104 URL: https://issues.apache.org/jira/browse/SPARK-31104 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31103) Extend Support for useful JSON Functions
Rakesh Raushan created SPARK-31103: -- Summary: Extend Support for useful JSON Functions Key: SPARK-31103 URL: https://issues.apache.org/jira/browse/SPARK-31103 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.1.0 Reporter: Rakesh Raushan Currently, Spark supports only a few JSON functions, while many other common utility functions are supported by other popular DBMSs. Supporting these functions would make the transition easier for prospective users. Functions like `json_array_length` and `json_object_keys` are also more intuitive, which would make life much simpler for new users. I have listed some JSON functions that I am working on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31009: --- Affects Version/s: (was: 3.0.0) 3.1.0 > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will return all the keys from the outer JSON object. > > PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > MySQL -> > [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] > MariaDB -> [https://mariadb.com/kb/en/json-functions/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31008) Support json_array_length function
[ https://issues.apache.org/jira/browse/SPARK-31008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31008: --- Affects Version/s: (was: 3.0.0) 3.1.0 > Support json_array_length function > -- > > Key: SPARK-31008 > URL: https://issues.apache.org/jira/browse/SPARK-31008 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > At the moment we don't support the json_array_length function in Spark. > This function is supported by: > a.) PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > b.) Presto -> [https://prestodb.io/docs/current/functions/json.html] > c.) Redshift -> > [https://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html] > > This allows new users to directly get the array length with a well-defined > JSON function. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31009: --- Description: This function will return all the keys from the outer JSON object. PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] MySQL -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] MariaDB -> [https://mariadb.com/kb/en/json-functions/] was: This function will return all the keys from the outer JSON object. PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] MySQL -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will return all the keys from the outer JSON object. > > PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > MySQL -> > [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] > MariaDB -> [https://mariadb.com/kb/en/json-functions/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31009: --- Description: This function will return all the keys from the outer JSON object. PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] MySQL -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] was: This function will return all the keys from the outer JSON object. PostgreSQL supports this function -> [https://www.postgresql.org/docs/9.3/functions-json.html] > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will return all the keys from the outer JSON object. > > PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > MySQL -> > [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049914#comment-17049914 ] Rakesh Raushan commented on SPARK-31009: [~hyukjin.kwon] I updated the description. PostgreSQL supports this function. We can use `from_json` to convert the JSON to a `MapType` and then extract the keys, but that won't be optimal. Apart from this one, there are other JSON functions supported by PostgreSQL, Presto, Redshift, and Teradata. Maybe we can discuss supporting some of them. I have already raised a PR for `json_array_length`, which is supported by all of the above-mentioned DBMSs. > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will return all the keys from the outer JSON object. > > PostgreSQL supports this function -> > [https://www.postgresql.org/docs/9.3/functions-json.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
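The behavior the comment above describes (extract the top-level keys, with `from_json` into a `MapType` as the suboptimal workaround) can be sketched in plain Python (this is an illustration of the proposed semantics, not Spark code; the NULL-for-non-object and NULL-for-invalid behavior are assumptions):

```python
import json

def json_object_keys(s):
    # Return the top-level keys of the outer JSON object.
    # NULL (None) for NULL input, invalid JSON, or any value
    # that is not a JSON object.
    if s is None:
        return None
    try:
        parsed = json.loads(s)
    except ValueError:
        return None
    if not isinstance(parsed, dict):
        return None
    return list(parsed.keys())

print(json_object_keys('{"a": 1, "b": {"c": 2}}'))  # ['a', 'b'] -- outer keys only
print(json_object_keys('[1, 2, 3]'))                # None: not an object
```

A dedicated function avoids both the full `MapType` conversion and the need for users to know the value schema up front, which is why the workaround "won't be optimal".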
[jira] [Updated] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31009: --- Description: This function will return all the keys from the outer JSON object. PostgreSQL supports this function -> [https://www.postgresql.org/docs/9.3/functions-json.html] was: This function will return all the keys from the outer JSON object. > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will return all the keys from the outer JSON object. > > PostgreSQL supports this function -> > [https://www.postgresql.org/docs/9.3/functions-json.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28427) Support more Postgres JSON functions
[ https://issues.apache.org/jira/browse/SPARK-28427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049367#comment-17049367 ] Rakesh Raushan commented on SPARK-28427: I think we should add some of Postgres JSON functions to Spark. > Support more Postgres JSON functions > > > Key: SPARK-28427 > URL: https://issues.apache.org/jira/browse/SPARK-28427 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > Postgres features a number of JSON functions that are missing in Spark: > https://www.postgresql.org/docs/9.3/functions-json.html > Redshift's JSON functions > (https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html) have > partial overlap with the Postgres list. > Some of these functions can be expressed in terms of compositions of existing > Spark functions. For example, I think that {{json_array_length}} can be > expressed with {{cardinality}} and {{from_json}}, but there's a caveat > related to legacy Hive compatibility (see the demo notebook at > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5796212617691211/45530874214710/4901752417050771/latest.html > for more details). > I'm filing this ticket so that we can triage the list of Postgres JSON > features and decide which ones make sense to support in Spark. After we've > done that, we can create individual tickets for specific functions and > features. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049116#comment-17049116 ] Rakesh Raushan commented on SPARK-31009: I am working on this. > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will return all the keys from the outer JSON object. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31009) Support json_object_keys function
Rakesh Raushan created SPARK-31009: -- Summary: Support json_object_keys function Key: SPARK-31009 URL: https://issues.apache.org/jira/browse/SPARK-31009 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan This function will return all the keys from the outer JSON object. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31008) Support json_array_length function
[ https://issues.apache.org/jira/browse/SPARK-31008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049097#comment-17049097 ] Rakesh Raushan commented on SPARK-31008: I will raise a PR soon. > Support json_array_length function > -- > > Key: SPARK-31008 > URL: https://issues.apache.org/jira/browse/SPARK-31008 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > > At the moment we don't support the json_array_length function in Spark. > This function is supported by: > a.) PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > b.) Presto -> [https://prestodb.io/docs/current/functions/json.html] > c.) Redshift -> > [https://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html] > > This allows new users to directly get the array length with a well-defined > JSON function. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31008) Support json_array_length function
Rakesh Raushan created SPARK-31008: -- Summary: Support json_array_length function Key: SPARK-31008 URL: https://issues.apache.org/jira/browse/SPARK-31008 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan At the moment we don't support the json_array_length function in Spark. This function is supported by: a.) PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] b.) Presto -> [https://prestodb.io/docs/current/functions/json.html] c.) Redshift -> [https://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html] This allows new users to directly get the array length with a well-defined JSON function. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
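A plain-Python sketch of the json_array_length semantics (an illustration, not Spark code; returning NULL for invalid JSON or a non-array value mirrors the Redshift function linked above, and should be treated as an assumption):

```python
import json

def json_array_length(s):
    # Length of the outermost JSON array. NULL (None) for NULL input,
    # invalid JSON, or a value that is not an array.
    if s is None:
        return None
    try:
        parsed = json.loads(s)
    except ValueError:
        return None
    if not isinstance(parsed, list):
        return None
    return len(parsed)

print(json_array_length('[1, 2, [3, 4]]'))  # 3 -- a nested array counts as one element
print(json_array_length('{"a": 1}'))        # None: not an array
```

Only the outermost array is counted, which is what distinguishes this from a generic `cardinality(from_json(...))` composition.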
[jira] [Commented] (SPARK-30917) The behaviour of UnaryMinus should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041774#comment-17041774 ] Rakesh Raushan commented on SPARK-30917: I am working on this. > The behaviour of UnaryMinus should not depend on SQLConf.get > > > Key: SPARK-30917 > URL: https://issues.apache.org/jira/browse/SPARK-30917 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30917) The behaviour of UnaryMinus should not depend on SQLConf.get
Rakesh Raushan created SPARK-30917: -- Summary: The behaviour of UnaryMinus should not depend on SQLConf.get Key: SPARK-30917 URL: https://issues.apache.org/jira/browse/SPARK-30917 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-30896: --- Comment: was deleted (was: I am working on this.) > The behavior of JsonToStructs should not depend on SQLConf.get > -- > > Key: SPARK-30896 > URL: https://issues.apache.org/jira/browse/SPARK-30896 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041682#comment-17041682 ] Rakesh Raushan commented on SPARK-30896: I am working on this. > The behavior of JsonToStructs should not depend on SQLConf.get > -- > > Key: SPARK-30896 > URL: https://issues.apache.org/jira/browse/SPARK-30896 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30852) Use Long instead of Int as argument type in Dataset limit method
[ https://issues.apache.org/jira/browse/SPARK-30852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039031#comment-17039031 ] Rakesh Raushan commented on SPARK-30852: Ahh. In that case we cannot allow long values. `tail` also gives only an array. Can we mark this issue as won't-fix then? > Use Long instead of Int as argument type in Dataset limit method > > > Key: SPARK-30852 > URL: https://issues.apache.org/jira/browse/SPARK-30852 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Damianos Christophides >Priority: Minor > > The Dataset limit method takes an input of type Int, which is a 32-bit > integer. The numerical upper limit of this type is 2,147,483,647. In my > work I needed to apply a limit to a Dataset higher than that, which gives an > error: > "py4j.Py4JException: Method limit([class java.lang.Long]) does not exist" > > Could the input type of the limit method be changed to a Long (64-bit)? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
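The cap discussed above comes from the JVM's 32-bit signed Int. A quick plain-Python illustration of the bound, reinterpreting the next value as a 32-bit signed integer to show the wraparound (illustration only, not Spark code):

```python
import struct

# 2,147,483,647: the largest value a JVM Int (and hence limit()) can hold.
INT_MAX = 2**31 - 1

# Reinterpret INT_MAX + 1 as a 32-bit signed int: it wraps to the minimum.
wrapped = struct.unpack('<i', struct.pack('<I', INT_MAX + 1))[0]

print(INT_MAX)   # 2147483647
print(wrapped)   # -2147483648
```

This is why a count above 2,147,483,647 cannot simply be passed through the existing `limit(Int)` signature.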
[jira] [Comment Edited] (SPARK-30852) Use Long instead of Int as argument type in Dataset limit method
[ https://issues.apache.org/jira/browse/SPARK-30852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038304#comment-17038304 ] Rakesh Raushan edited comment on SPARK-30852 at 2/17/20 12:33 PM: -- [~cloud_fan] [~dongjoon] A long value can be used as the limit expression in Presto and PostgreSQL. I think Spark should also allow long values for the limit expression. was (Author: rakson): [~cloud_fan] [~dongjoon] A long value can be used as the limit expression in Presto. I think Spark should also allow long values for the limit expression. > Use Long instead of Int as argument type in Dataset limit method > > > Key: SPARK-30852 > URL: https://issues.apache.org/jira/browse/SPARK-30852 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Damianos Christophides >Priority: Minor > > The Dataset limit method takes an input of type Int, which is a 32-bit > integer. The numerical upper limit of this type is 2,147,483,647. In my > work I needed to apply a limit to a Dataset higher than that, which gives an > error: > "py4j.Py4JException: Method limit([class java.lang.Long]) does not exist" > > Could the input type of the limit method be changed to a Long (64-bit)? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30852) Use Long instead of Int as argument type in Dataset limit method
[ https://issues.apache.org/jira/browse/SPARK-30852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038304#comment-17038304 ] Rakesh Raushan commented on SPARK-30852: [~cloud_fan] [~dongjoon] A long value can be used as the limit expression in Presto. I think Spark should also allow long values for the limit expression. > Use Long instead of Int as argument type in Dataset limit method > > > Key: SPARK-30852 > URL: https://issues.apache.org/jira/browse/SPARK-30852 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Damianos Christophides >Priority: Minor > > The Dataset limit method takes an input of type Int, which is a 32-bit > integer. The numerical upper limit of this type is 2,147,483,647. In my > work I needed to apply a limit to a Dataset higher than that, which gives an > error: > "py4j.Py4JException: Method limit([class java.lang.Long]) does not exist" > > Could the input type of the limit method be changed to a Long (64-bit)? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30852) Use Long instead of Int as argument type in Dataset limit method
[ https://issues.apache.org/jira/browse/SPARK-30852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038278#comment-17038278 ] Rakesh Raushan commented on SPARK-30852: I will check the issue. > Use Long instead of Int as argument type in Dataset limit method > > > Key: SPARK-30852 > URL: https://issues.apache.org/jira/browse/SPARK-30852 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Damianos Christophides >Priority: Minor > > The Dataset limit method takes an input of type Int, which is a 32-bit > integer. The numerical upper limit of this type is 2,147,483,647. In my > work I needed to apply a limit to a Dataset higher than that, which gives an > error: > "py4j.Py4JException: Method limit([class java.lang.Long]) does not exist" > > Could the input type of the limit method be changed to a Long (64-bit)? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27619) MapType should be prohibited in hash expressions
[ https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036831#comment-17036831 ] Rakesh Raushan commented on SPARK-27619: I am working on this. > MapType should be prohibited in hash expressions > > > Key: SPARK-27619 > URL: https://issues.apache.org/jira/browse/SPARK-27619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Josh Rosen >Priority: Blocker > Labels: correctness > > Spark currently allows MapType expressions to be used as input to hash > expressions, but I think that this should be prohibited because Spark SQL > does not support map equality. > Currently, Spark SQL's map hashcodes are sensitive to the insertion order of > map elements: > {code:java} > val a = spark.createDataset(Map(1->1, 2->2) :: Nil) > val b = spark.createDataset(Map(2->2, 1->1) :: Nil) > // Demonstration of how Scala Map equality is unaffected by insertion order: > assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) > assert(Map(1->1, 2->2) == Map(2->2, 1->1)) > assert(a.first() == b.first()) > // In contrast, this will print two different hashcodes: > println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} > This behavior might be surprising to Scala developers. > I think there's precedence for banning the use of MapType here because we > already prohibit MapType in aggregation / joins / equality comparisons > (SPARK-9415) and set operations (SPARK-19893). > If we decide that we want this to be an error then it might also be a good > idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the > old and buggy behavior (in case applications were relying on it in cases > where it just so happens to be safe-by-accident (e.g. maps which only have > one entry)). > Alternatively, we could support hashing here if we implemented support for > comparable map types (SPARK-18134). 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
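The Scala snippet in the ticket above has a direct analogue in plain Python: equal maps must hash equally, which is exactly the property order-sensitive map hashcodes violate. A small sketch of an insertion-order-insensitive hash (the frozenset trick is an illustration of the required property, not Spark's implementation):

```python
# Two dicts built in different insertion orders, mirroring the Scala maps above.
a = {1: 1, 2: 2}
b = {2: 2, 1: 1}

# Map equality ignores insertion order...
assert a == b

# ...so any map hash must ignore it too. Hashing the entries as an
# unordered collection (here via frozenset) is order-insensitive:
assert hash(frozenset(a.items())) == hash(frozenset(b.items()))

# Hashing the entries in iteration order instead would tie the result
# to insertion order, which is the bug described in the ticket.
print("order-insensitive:", hash(frozenset(a.items())) == hash(frozenset(b.items())))
```

Until the engine has an order-insensitive map hash (or comparable map types per SPARK-18134), prohibiting MapType in hash expressions is the safe default.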
[jira] [Commented] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034479#comment-17034479 ] Rakesh Raushan commented on SPARK-27545: Please assign this to me. Thanks > Update the Documentation for CACHE TABLE and UNCACHE TABLE > -- > > Key: SPARK-27545 > URL: https://issues.apache.org/jira/browse/SPARK-27545 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: hantiantian >Assignee: hantiantian >Priority: Major > Fix For: 3.0.0 > > > spark-sql> cache table v1 as select * from a; > spark-sql> uncache table v1; > spark-sql> cache table v1 as select * from a; > 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: > 0: get_table : db=apachespark tbl=a > 2019-04-23 14:50:09,038 INFO > org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root > ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a > Error in query: Temporary view 'v1' already exists; > we should document it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30790) The datatype of map() should be map&lt;string,string&gt;
[ https://issues.apache.org/jira/browse/SPARK-30790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034461#comment-17034461 ] Rakesh Raushan commented on SPARK-30790: Should I expose a legacy configuration for MapType as well? [~hyukjin.kwon] > The datatype of map() should be map&lt;string,string&gt; > -- > > Key: SPARK-30790 > URL: https://issues.apache.org/jira/browse/SPARK-30790 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Minor > > Currently, > spark.sql("select map()") gives {}. > To be consistent with the changes made in SPARK-29462, it should return > map&lt;string,string&gt;. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30790) The datatype of map() should be map&lt;string,string&gt;
Rakesh Raushan created SPARK-30790: -- Summary: The datatype of map() should be map&lt;string,string&gt; Key: SPARK-30790 URL: https://issues.apache.org/jira/browse/SPARK-30790 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan Currently, spark.sql("select map()") gives {}. To be consistent with the changes made in SPARK-29462, it should return map&lt;string,string&gt;. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
[ https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027330#comment-17027330 ] Rakesh Raushan commented on SPARK-30688: I will check this issue. > Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF > -- > > Key: SPARK-30688 > URL: https://issues.apache.org/jira/browse/SPARK-30688 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Rajkumar Singh >Priority: Major > > > {code:java} > scala> spark.sql("select unix_timestamp('20201', 'ww')").show(); > +-+ > |unix_timestamp(20201, ww)| > +-+ > | null| > +-+ > > scala> spark.sql("select unix_timestamp('20202', 'ww')").show(); > -+ > |unix_timestamp(20202, ww)| > +-+ > | 1578182400| > +-+ > > {code} > > > This seems to happen only for leap years. I dug deeper into it, and it seems > that Spark uses java.text.SimpleDateFormat to try to parse the > expression here > [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652] > {code:java} > formatter.parse( > t.asInstanceOf[UTF8String].toString).getTime / 1000L{code} > but it fails: SimpleDateFormat is unable to parse the date and throws an > Unparseable exception, which Spark handles silently, returning NULL. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30415) Improve Readability of SQLConf Doc
[ https://issues.apache.org/jira/browse/SPARK-30415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007975#comment-17007975 ] Rakesh Raushan commented on SPARK-30415: I didn't know that earlier. From now on I will use [MINOR]. > Improve Readability of SQLConf Doc > -- > > Key: SPARK-30415 > URL: https://issues.apache.org/jira/browse/SPARK-30415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Trivial > > Improve Readability of SQLConf Doc -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30415) Improve Readability of SQLConf Doc
Rakesh Raushan created SPARK-30415: -- Summary: Improve Readability of SQLConf Doc Key: SPARK-30415 URL: https://issues.apache.org/jira/browse/SPARK-30415 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan Improve Readability of SQLConf Doc -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30363) Add Documentation for Refresh Resources
[ https://issues.apache.org/jira/browse/SPARK-30363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003887#comment-17003887 ] Rakesh Raushan commented on SPARK-30363: I am working on it. > Add Documentation for Refresh Resources > --- > > Key: SPARK-30363 > URL: https://issues.apache.org/jira/browse/SPARK-30363 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Minor > > Refresh Resources is not documented in the docs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30363) Add Documentation for Refresh Resources
Rakesh Raushan created SPARK-30363: -- Summary: Add Documentation for Refresh Resources Key: SPARK-30363 URL: https://issues.apache.org/jira/browse/SPARK-30363 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan Refresh Resources is not documented in the docs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30342) Update LIST JAR/FILE command
Rakesh Raushan created SPARK-30342: -- Summary: Update LIST JAR/FILE command Key: SPARK-30342 URL: https://issues.apache.org/jira/browse/SPARK-30342 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan LIST FILE/JAR command is not documented properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30288) Failed to write valid Parquet files when column names contains special characters like spaces
[ https://issues.apache.org/jira/browse/SPARK-30288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999349#comment-16999349 ] Rakesh Raushan commented on SPARK-30288: I am working on it. I will raise the PR soon. > Failed to write valid Parquet files when column names contains special > characters like spaces > - > > Key: SPARK-30288 > URL: https://issues.apache.org/jira/browse/SPARK-30288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jingyuan Wang >Priority: Major > > When I tried to write Parquet files using PySpark with columns containing > some special characters in their names, it threw the following exception: > {code} > org.apache.spark.sql.AnalysisException: Attribute name "col 1" contains > invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:583) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:570) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:111) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > a
[jira] [Created] (SPARK-30292) Throw Exception when invalid string is cast to decimal in ANSI mode
Rakesh Raushan created SPARK-30292: -- Summary: Throw Exception when invalid string is cast to decimal in ANSI mode Key: SPARK-30292 URL: https://issues.apache.org/jira/browse/SPARK-30292 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan When spark.sql.ansi.enabled is set and we run select cast('str' as decimal), spark-sql outputs NULL. The ANSI SQL standard requires throwing an exception when an invalid string is cast to a number. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
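A minimal sketch of the behaviour change SPARK-30292 proposes, using Python's decimal module as a stand-in (the `ansi_enabled` flag and `cast_to_decimal` helper are illustrative, not a real Spark API):

```python
from decimal import Decimal, InvalidOperation

def cast_to_decimal(value: str, ansi_enabled: bool = False):
    # Current behaviour: an unparseable string silently becomes NULL (None).
    # Proposed ANSI behaviour: the same input raises an error instead.
    try:
        return Decimal(value)
    except InvalidOperation:
        if ansi_enabled:
            raise ValueError(f"invalid input when casting {value!r} to decimal")
        return None
```

With the flag off this mirrors `select cast('str' as decimal)` returning NULL; with it on, the cast fails loudly, as the ANSI standard requires.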
[jira] [Comment Edited] (SPARK-30288) Failed to write valid Parquet files when column names contains special characters like spaces
[ https://issues.apache.org/jira/browse/SPARK-30288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998809#comment-16998809 ] Rakesh Raushan edited comment on SPARK-30288 at 12/18/19 4:49 AM: -- [~dongjoon] [~hyukjin.kwon] I have checked locally after making required changes. Column names with space , "=" are working fine for now. Also pandas support this. So should we also allow this? scala> Seq(100).toDF("a b").write.parquet("/tmp/dir") scala> spark.read.parquet("/tmp/dir").show() +---+ |a b| +---+ |100| +---+ scala> Seq(100).toDF("a=b").write.parquet("/tmp/dir2") scala> spark.read.parquet("/tmp/dir2").show() +---+ |a=b| +---+ |100| +---+ was (Author: rakson): [~dongjoon] [~hyukjin.kwon] I have checked locally after making required changes. Column names with space , "=" are working fine for now. Also pandas support this. So should we also allow this? scala> Seq(100).toDF("a b").write.parquet("/tmp/dir") scala> spark.read.parquet("/tmp/dir").show() +---+ |a b| +---+ |100| +---+ scala> Seq(1).toDF("a=b").write.parquet("/tmp/dir2") scala> spark.read.parquet("/tmp/foo").show() +---+ |a=b| +---+ |100| +---+ > Failed to write valid Parquet files when column names contains special > characters like spaces > - > > Key: SPARK-30288 > URL: https://issues.apache.org/jira/browse/SPARK-30288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jingyuan Wang >Priority: Major > > When I tried to write Parquet files using PySpark with columns containing > some special characters in their names, it threw the following exception: > {code} > org.apache.spark.sql.AnalysisException: Attribute name "col 1" contains > invalid character(s) among " ,;{}()\n\t=". 
[jira] [Comment Edited] (SPARK-30288) Failed to write valid Parquet files when column names contains special characters like spaces
[ https://issues.apache.org/jira/browse/SPARK-30288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998809#comment-16998809 ] Rakesh Raushan edited comment on SPARK-30288 at 12/18/19 4:48 AM: -- [~dongjoon] [~hyukjin.kwon] I have checked locally after making required changes. Column names with space , "=" are working fine for now. Also pandas support this. So should we also allow this? scala> Seq(100).toDF("a b").write.parquet("/tmp/dir") scala> spark.read.parquet("/tmp/dir").show() +---+ |a b| +---+ |100| +---+ scala> Seq(1).toDF("a=b").write.parquet("/tmp/dir2") scala> spark.read.parquet("/tmp/foo").show() +---+ |a=b| +---+ |100| +---+ was (Author: rakson): [~dongjoon] I have checked locally after making required changes. Column names with space , "=" are working fine for now. Also pandas support this. So should we also allow this? scala> Seq(100).toDF("a b").write.parquet("/tmp/dir") scala> spark.read.parquet("/tmp/dir").show() +---+ |a b| +---+ |100| +---+ scala> Seq(1).toDF("a=b").write.parquet("/tmp/dir2") scala> spark.read.parquet("/tmp/foo").show() +---+ |a=b| +---+ |100| +---+ > Failed to write valid Parquet files when column names contains special > characters like spaces > - > > Key: SPARK-30288 > URL: https://issues.apache.org/jira/browse/SPARK-30288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jingyuan Wang >Priority: Major > > When I tried to write Parquet files using PySpark with columns containing > some special characters in their names, it threw the following exception: > {code} > org.apache.spark.sql.AnalysisException: Attribute name "col 1" contains > invalid character(s) among " ,;{}()\n\t=". 
[jira] [Commented] (SPARK-30288) Failed to write valid Parquet files when column names contains special characters like spaces
[ https://issues.apache.org/jira/browse/SPARK-30288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998809#comment-16998809 ] Rakesh Raushan commented on SPARK-30288: [~dongjoon] I have checked locally after making required changes. Column names with space , "=" are working fine for now. Also pandas support this. So should we also allow this? scala> Seq(100).toDF("a b").write.parquet("/tmp/dir") scala> spark.read.parquet("/tmp/dir").show() +---+ |a b| +---+ |100| +---+ scala> Seq(1).toDF("a=b").write.parquet("/tmp/dir2") scala> spark.read.parquet("/tmp/foo").show() +---+ |a=b| +---+ |100| +---+ > Failed to write valid Parquet files when column names contains special > characters like spaces > - > > Key: SPARK-30288 > URL: https://issues.apache.org/jira/browse/SPARK-30288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Jingyuan Wang >Priority: Major > > When I tried to write Parquet files using PySpark with columns containing > some special characters in their names, it threw the following exception: > {code} > org.apache.spark.sql.AnalysisException: Attribute name "col 1" contains > invalid character(s) among " ,;{}()\n\t=". 
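The field-name validation that raises the AnalysisException quoted in the SPARK-30288 thread above can be sketched as follows (a simplified Python rendering of the check in ParquetSchemaConverter.checkFieldName, not the actual Scala code):

```python
# The character set quoted in the exception message: " ,;{}()\n\t="
INVALID_PARQUET_CHARS = set(" ,;{}()\n\t=")

def check_field_name(name: str) -> None:
    # Reject any column name containing a character Parquet disallows.
    # Names like "a:b" or "??" pass this check, which is the
    # Parquet/ORC inconsistency raised separately in SPARK-30249.
    if INVALID_PARQUET_CHARS.intersection(name):
        raise ValueError(
            f'Attribute name "{name}" contains invalid character(s) '
            'among " ,;{}()\\n\\t=". Please use alias to rename it.')

check_field_name("a:b")  # accepted by the Parquet-side check
```

A column name such as "col 1" (with a space) trips the check, which is exactly the failure the reporter hit when writing from PySpark.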
[jira] [Commented] (SPARK-30150) Manage resources (ADD/LIST) does not support quoted path
[ https://issues.apache.org/jira/browse/SPARK-30150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997314#comment-16997314 ] Rakesh Raushan commented on SPARK-30150: Thanks!! > Manage resources (ADD/LIST) does not support quoted path > > > Key: SPARK-30150 > URL: https://issues.apache.org/jira/browse/SPARK-30150 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Assignee: Rakesh Raushan >Priority: Minor > Fix For: 3.0.0 > > > Manage resources (ADD/LIST) does not support quoted path. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30249) Invalid Column Names in parquet tables should not be allowed
[ https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-30249: --- Description: Column names such as `a:b` , `??`, `,,`, `^^` , `++`etc are allowed when we are creating parquet tables. While when we are creating tables with `orc` all such column names are marked as invalid and analysis exception is thrown. These column names should also be not allowed for parquet tables as well. Also this induces inconsistency between column names for Parquet and ORC was: Column names such as `a:b` , `??`, `,,`, `^^` , `++`etc are allowed when we are creating parquet tables. While when we are creating tables with `orc` all such column names are marked as invalid and analysis exception is thrown. These column names should also be not allowed for parquet tables as well. > Invalid Column Names in parquet tables should not be allowed > > > Key: SPARK-30249 > URL: https://issues.apache.org/jira/browse/SPARK-30249 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Minor > > Column names such as `a:b` , `??`, `,,`, `^^` , `++`etc are allowed when we > are creating parquet tables. > While when we are creating tables with `orc` all such column names are marked > as invalid and analysis exception is thrown. > These column names should also be not allowed for parquet tables as well. > Also this induces inconsistency between column names for Parquet and ORC -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30249) Wrong Column Names in parquet tables should not be allowed
[ https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995411#comment-16995411 ] Rakesh Raushan commented on SPARK-30249: cc [~dongjoon]. > Wrong Column Names in parquet tables should not be allowed > -- > > Key: SPARK-30249 > URL: https://issues.apache.org/jira/browse/SPARK-30249 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Minor > > Column names such as `a:b` , `??`, `,,`, `^^` , `++`etc are allowed when we > are creating parquet tables. > While when we are creating tables with `orc` all such column names are marked > as invalid and analysis exception is thrown. > These column names should also be not allowed for parquet tables as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30249) Invalid Column Names in parquet tables should not be allowed
[ https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-30249: --- Summary: Invalid Column Names in parquet tables should not be allowed (was: Wrong Column Names in parquet tables should not be allowed) > Invalid Column Names in parquet tables should not be allowed > > > Key: SPARK-30249 > URL: https://issues.apache.org/jira/browse/SPARK-30249 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Minor > > Column names such as `a:b` , `??`, `,,`, `^^` , `++`etc are allowed when we > are creating parquet tables. > While when we are creating tables with `orc` all such column names are marked > as invalid and analysis exception is thrown. > These column names should also be not allowed for parquet tables as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30249) Wrong Column Names in parquet tables should not be allowed
Rakesh Raushan created SPARK-30249: -- Summary: Wrong Column Names in parquet tables should not be allowed Key: SPARK-30249 URL: https://issues.apache.org/jira/browse/SPARK-30249 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Rakesh Raushan Column names such as `a:b`, `??`, `,,`, `^^`, `++` etc. are allowed when we create parquet tables, while when we create tables with `orc` all such column names are marked as invalid and an AnalysisException is thrown. These column names should not be allowed for parquet tables either. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30150) Manage resources (ADD/LIST) does not support quoted path
[ https://issues.apache.org/jira/browse/SPARK-30150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994544#comment-16994544 ] Rakesh Raushan commented on SPARK-30150: Can you assign this to me. [~cloud_fan] > Manage resources (ADD/LIST) does not support quoted path > > > Key: SPARK-30150 > URL: https://issues.apache.org/jira/browse/SPARK-30150 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Assignee: jobit mathew >Priority: Minor > Fix For: 3.0.0 > > > Manage resources (ADD/LIST) does not support quoted path. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30234) ADD FILE can not add folder from Spark-sql
[ https://issues.apache.org/jira/browse/SPARK-30234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994457#comment-16994457 ] Rakesh Raushan commented on SPARK-30234: I will raise a PR for this soon. > ADD FILE can not add folder from Spark-sql > -- > > Key: SPARK-30234 > URL: https://issues.apache.org/jira/browse/SPARK-30234 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rakesh Raushan >Priority: Minor > > We cannot add directories using spark-sql CLI. > In SPARK-4687 support was added for directories as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30234) ADD FILE can not add folder from Spark-sql
Rakesh Raushan created SPARK-30234: -- Summary: ADD FILE can not add folder from Spark-sql Key: SPARK-30234 URL: https://issues.apache.org/jira/browse/SPARK-30234 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Rakesh Raushan We cannot add directories using spark-sql CLI. In SPARK-4687 support was added for directories as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30139) get_json_object does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-30139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994447#comment-16994447 ] Rakesh Raushan commented on SPARK-30139: Was busy with some other work. I will start working on this now. > get_json_object does not work correctly > --- > > Key: SPARK-30139 > URL: https://issues.apache.org/jira/browse/SPARK-30139 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Clemens Valiente >Priority: Major > > According to the documentation: > [https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/functions.html#get_json_object-org.apache.spark.sql.Column-java.lang.String-] > get_json_object "Extracts json object from a json string based on json path > specified, and returns json string of the extracted json object. It will > return null if the input json string is invalid." > > The following SQL snippet returns null even though it should return 'a': > {code} > select get_json_object('[{"id":123,"value":"a"},{"id":456,"value":"b"}]', '$[?($.id==123)].value') > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
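For comparison, the result the SPARK-30139 reporter expected can be reproduced with a plain-Python filter over the parsed JSON (an illustrative equivalent of the JSONPath predicate, not a fix for get_json_object itself; `value_for_id` is a hypothetical helper):

```python
import json

def value_for_id(json_str: str, target_id: int):
    # Equivalent of the JSONPath filter $[?($.id==123)].value:
    # scan the array for the object whose id matches and return its value.
    for obj in json.loads(json_str):
        if obj.get("id") == target_id:
            return obj.get("value")
    return None

doc = '[{"id":123,"value":"a"},{"id":456,"value":"b"}]'
print(value_for_id(doc, 123))  # the 'a' that get_json_object returned null for
```

This also shows the input is valid JSON, so the null cannot be explained by the "invalid input json string" clause of the documented contract.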
[jira] [Issue Comment Deleted] (SPARK-30176) Eliminate warnings: part 6
[ https://issues.apache.org/jira/browse/SPARK-30176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-30176:
---
Comment: was deleted (was: i will work on this.)

> Eliminate warnings: part 6
> --
>
> Key: SPARK-30176
> URL: https://issues.apache.org/jira/browse/SPARK-30176
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: jobit mathew
> Priority: Minor
>
> sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
> {code:java}
> Warning:Warning:line (32)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (91)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (100)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (109)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (118)java: org.apache.spark.sql.expressions.javalang.typed in org.apache.spark.sql.expressions.javalang has been deprecated
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala
> {code:java}
> Warning:Warning:line (242)object typed in package scalalang is deprecated (since 3.0.0): please use untyped builtin aggregate functions.
> df.as[Data].select(typed.sumLong((d: Data) => d.l)).queryExecution.toRdd.foreach(_ => ())
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
> {code:java}
> Warning:Warning:line (714)method from_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(from_utc_timestamp(col("a"), "PST")),
> Warning:Warning:line (719)method from_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(from_utc_timestamp(col("b"), "PST")),
> Warning:Warning:line (725)method from_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(from_utc_timestamp(col("a"), "PST")).collect()
> Warning:Warning:line (737)method from_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(from_utc_timestamp(col("a"), col("c"))),
> Warning:Warning:line (742)method from_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(from_utc_timestamp(col("b"), col("c"))),
> Warning:Warning:line (756)method to_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(to_utc_timestamp(col("a"), "PST")),
> Warning:Warning:line (761)method to_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(to_utc_timestamp(col("b"), "PST")),
> Warning:Warning:line (767)method to_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(to_utc_timestamp(col("a"), "PST")).collect()
> Warning:Warning:line (779)method to_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(to_utc_timestamp(col("a"), col("c"))),
> Warning:Warning:line (784)method to_utc_timestamp in object functions is deprecated (since 3.0.0): This function is deprecated and will be removed in future versions.
> df.select(to_utc_timestamp(col("b"), col("c"))),
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
> {code:java}
> Warning:Warning:line (241)method merge in object Row is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions.
> testData.rdd.flatMap(row => Seq.fill(16)(Row.merge(row, row))).collect().toSeq)
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
> {code:java}
> Warning:Warning:line (787)method merge in object Row is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions.
> row => Seq.fill(16)(Row.merge(row, row))).collect().toSeq)
> {code}
>
> s
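For context on the deprecated functions in the warnings above: from_utc_timestamp interprets its input as an instant in UTC and renders it as wall-clock time in the given zone, and to_utc_timestamp is the inverse. A plain-Python sketch of that semantics (not Spark code), using the IANA name America/Los_Angeles in place of the "PST" alias:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def from_utc_timestamp(ts: datetime, tz: str) -> datetime:
    # Treat the naive timestamp as UTC, render it as wall-clock time in tz
    return ts.replace(tzinfo=timezone.utc).astimezone(ZoneInfo(tz)).replace(tzinfo=None)

def to_utc_timestamp(ts: datetime, tz: str) -> datetime:
    # Treat the naive timestamp as wall-clock time in tz, render it in UTC
    return ts.replace(tzinfo=ZoneInfo(tz)).astimezone(timezone.utc).replace(tzinfo=None)

utc_noon = datetime(2020, 1, 1, 12, 0, 0)
# Pacific standard time is UTC-8 in January
print(from_utc_timestamp(utc_noon, "America/Los_Angeles"))  # 2020-01-01 04:00:00
```

This only illustrates the deprecated functions' behavior; the warnings recommend migrating away from them rather than reimplementing them.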
[jira] [Commented] (SPARK-30176) Eliminate warnings: part 6
[ https://issues.apache.org/jira/browse/SPARK-30176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991151#comment-16991151 ] Rakesh Raushan commented on SPARK-30176:

I will work on this.

> Eliminate warnings: part 6
> --
>
> Key: SPARK-30176
> URL: https://issues.apache.org/jira/browse/SPARK-30176
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: jobit mathew
> Priority: Minor
>
> sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
> {code:java}
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala
> {code:java}
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
> {code:java}
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
> {code:java}
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
> {code:java}
> {code}
>
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala
> {code:java}
> {code}
>
> sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
> {code:java}
> {code}
[jira] [Commented] (SPARK-30139) get_json_object does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-30139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988693#comment-16988693 ] Rakesh Raushan commented on SPARK-30139:

I will look into this issue.

> get_json_object does not work correctly
> ---
>
> Key: SPARK-30139
> URL: https://issues.apache.org/jira/browse/SPARK-30139
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: Clemens Valiente
> Priority: Major
>
> According to the documentation:
> [https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/functions.html#get_json_object-org.apache.spark.sql.Column-java.lang.String-]
> get_json_object "Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid."
>
> The following SQL snippet returns null even though it should return 'a':
> {code}
> select get_json_object('[{"id":123,"value":"a"},{"id":456,"value":"b"}]',
>   '$[?($.id==123)].value')
> {code}