[jira] [Commented] (SPARK-44817) SPIP: Incremental Stats Collection

2024-03-16 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827739#comment-17827739
 ] 

Rakesh Raushan commented on SPARK-44817:


[~cloud_fan] [~dongjoon] What do you think about the proposal? Does this 
sound useful?

> SPIP: Incremental Stats Collection
> --
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
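
For illustration, a minimal sketch of the idea (hypothetical names, not the SPIP's actual design): instead of rescanning the table with `ANALYZE TABLE COMPUTE STATISTICS`, the delta produced by a DML write could be merged into the existing table-level stats.

{code:java}
// Hypothetical sketch: merge the delta of an INSERT into the existing table-level
// stats instead of recomputing them from scratch.
case class TableStats(rowCount: Long, sizeInBytes: Long)

def mergeAfterInsert(current: TableStats, insertedRows: Long, insertedBytes: Long): TableStats =
  TableStats(current.rowCount + insertedRows, current.sizeInBytes + insertedBytes)

// Example: a 1,000,000-row / 512 MB table receives an insert of 10,000 rows / 5 MB.
val updated = mergeAfterInsert(TableStats(1000000L, 512L * 1024 * 1024), 10000L, 5L * 1024 * 1024)
// updated == TableStats(1010000, 542113792)
{code}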



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44817) SPIP: Incremental Stats Collection

2023-09-25 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-44817:
---
Summary: SPIP: Incremental Stats Collection  (was: Incremental Stats 
Collection)

> SPIP: Incremental Stats Collection
> --
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-44817) Incremental Stats Collection

2023-09-25 Thread Rakesh Raushan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-44817 ]


Rakesh Raushan deleted comment on SPARK-44817:


was (Author: rakson):
[~gurwls223] [~cloud_fan] [~dongjoon] 

I have added an SPIP document.

Does this feature seem useful to you?

> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44817) Incremental Stats Collection

2023-09-17 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766213#comment-17766213
 ] 

Rakesh Raushan commented on SPARK-44817:


[~gurwls223] [~cloud_fan] [~dongjoon] 

I have added an SPIP document.

Does this feature seem useful to you?

> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44817) Incremental Stats Collection

2023-08-26 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759229#comment-17759229
 ] 

Rakesh Raushan edited comment on SPARK-44817 at 8/26/23 9:02 AM:
-

[~gurwls223]  [~cloud_fan] 

Added SPIP Document.

Link to the document: 
[https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]


was (Author: rakson):
Added SPIP Document.

Link to the document: 
https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing

> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
>  added



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44817) Incremental Stats Collection

2023-08-26 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-44817:
---
Description: 
Spark's Cost-Based Optimizer depends on table and column statistics.

After every DML query execution, table and column stats are invalidated unless 
automatic stats update is turned on. To keep stats up to date we need to run the 
`ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
feasible to run this command after every DML query.

Instead, we can incrementally update the stats during each DML query run itself. 
This way the table and column stats stay fresh at all times and CBO benefits can 
be applied. Initially, we can update only table-level stats and gradually start 
updating column-level stats as well.

*Pros:*

1. Optimizes queries over tables that are updated frequently.
2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
STATISTICS` for updating stats.

[SPIP Document 
|https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]

  was:
Spark's Cost-Based Optimizer depends on table and column statistics.

After every DML query execution, table and column stats are invalidated unless 
automatic stats update is turned on. To keep stats up to date we need to run the 
`ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
feasible to run this command after every DML query.

Instead, we can incrementally update the stats during each DML query run itself. 
This way the table and column stats stay fresh at all times and CBO benefits can 
be applied. Initially, we can update only table-level stats and gradually start 
updating column-level stats as well.

*Pros:*

1. Optimizes queries over tables that are updated frequently.
2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
STATISTICS` for updating stats.

[SPIP Document 
|https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
 added



> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44817) Incremental Stats Collection

2023-08-26 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-44817:
---
Description: 
Spark's Cost-Based Optimizer depends on table and column statistics.

After every DML query execution, table and column stats are invalidated unless 
automatic stats update is turned on. To keep stats up to date we need to run the 
`ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
feasible to run this command after every DML query.

Instead, we can incrementally update the stats during each DML query run itself. 
This way the table and column stats stay fresh at all times and CBO benefits can 
be applied. Initially, we can update only table-level stats and gradually start 
updating column-level stats as well.

*Pros:*

1. Optimizes queries over tables that are updated frequently.
2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
STATISTICS` for updating stats.

[SPIP Document 
|https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
 added


  was:
Spark's Cost-Based Optimizer depends on table and column statistics.

After every DML query execution, table and column stats are invalidated unless 
automatic stats update is turned on. To keep stats up to date we need to run the 
`ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
feasible to run this command after every DML query.

Instead, we can incrementally update the stats during each DML query run itself. 
This way the table and column stats stay fresh at all times and CBO benefits can 
be applied. Initially, we can update only table-level stats and gradually start 
updating column-level stats as well.

*Pros:*

1. Optimizes queries over tables that are updated frequently.
2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
STATISTICS` for updating stats.



> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
>  added



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44817) Incremental Stats Collection

2023-08-26 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759229#comment-17759229
 ] 

Rakesh Raushan commented on SPARK-44817:


Added SPIP Document.

Link to the document: 
https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing

> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.
> [SPIP Document 
> |https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
>  added



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44817) Incremental Stats Collection

2023-08-22 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757539#comment-17757539
 ] 

Rakesh Raushan commented on SPARK-44817:


Sure. I will try to come up with an SPIP by this weekend.

> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44817) Incremental Stats Collection

2023-08-18 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-44817:
---
Affects Version/s: 3.5.0

> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-44817) Incremental Stats Collection

2023-08-16 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754694#comment-17754694
 ] 

Rakesh Raushan edited comment on SPARK-44817 at 8/16/23 1:08 PM:
-

[~cloud_fan] [~gurwls223] [~maxgekk] [~dongjoon] What are your thoughts on this?
If this looks promising, I can work on raising a PR for this.


was (Author: rakson):
[~cloud_fan] [~gurwls223] [~maxgekk] What are your thoughts on this?
If this looks promising, I can work on raising a PR for this.

> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44817) Incremental Stats Collection

2023-08-15 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754694#comment-17754694
 ] 

Rakesh Raushan commented on SPARK-44817:


[~cloud_fan] [~gurwls223] [~maxgekk] What are your thoughts on this?
If this looks promising, I can work on raising a PR for this.

> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44817) Incremental Stats Collection

2023-08-15 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-44817:
---
Description: 
Spark's Cost-Based Optimizer depends on table and column statistics.

After every DML query execution, table and column stats are invalidated unless 
automatic stats update is turned on. To keep stats up to date we need to run the 
`ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
feasible to run this command after every DML query.

Instead, we can incrementally update the stats during each DML query run itself. 
This way the table and column stats stay fresh at all times and CBO benefits can 
be applied. Initially, we can update only table-level stats and gradually start 
updating column-level stats as well.

*Pros:*

1. Optimizes queries over tables that are updated frequently.
2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
STATISTICS` for updating stats.


  was:
Spark's Cost-Based Optimizer depends on table and column statistics.

After every DML query execution, table and column stats are invalidated unless 
automatic stats update is turned on. To keep stats up to date we need to run the 
`ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
feasible to run this command after every DML query.

Instead, we can incrementally update the stats during each DML query run itself. 
This way the table and column stats stay fresh at all times and CBO benefits can 
be applied.

*Pros:*

1. Optimizes queries over tables that are updated frequently.
2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
STATISTICS` for updating stats.



> Incremental Stats Collection
> 
>
> Key: SPARK-44817
> URL: https://issues.apache.org/jira/browse/SPARK-44817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Spark's Cost-Based Optimizer depends on table and column statistics.
> After every DML query execution, table and column stats are invalidated unless 
> automatic stats update is turned on. To keep stats up to date we need to run the 
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
> feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run itself. 
> This way the table and column stats stay fresh at all times and CBO benefits can 
> be applied. Initially, we can update only table-level stats and gradually start 
> updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44817) Incremental Stats Collection

2023-08-15 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-44817:
--

 Summary: Incremental Stats Collection
 Key: SPARK-44817
 URL: https://issues.apache.org/jira/browse/SPARK-44817
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Rakesh Raushan


Spark's Cost-Based Optimizer depends on table and column statistics.

After every DML query execution, table and column stats are invalidated unless 
automatic stats update is turned on. To keep stats up to date we need to run the 
`ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive. It is not 
feasible to run this command after every DML query.

Instead, we can incrementally update the stats during each DML query run itself. 
This way the table and column stats stay fresh at all times and CBO benefits can 
be applied.

*Pros:*

1. Optimizes queries over tables that are updated frequently.
2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE 
STATISTICS` for updating stats.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37840) Dynamically update the loaded Hive UDF JAR

2022-01-11 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472537#comment-17472537
 ] 

Rakesh Raushan edited comment on SPARK-37840 at 1/11/22, 8:08 AM:
--

[~cutiechi] The problem is with `jarClassLoader`. `jarClassLoader` needs to be 
updated after the updated jar is added.


was (Author: rakson):
The problem is with `jarClassLoader`. We need to update our `jarClassLoader` 
after the updated jar is added.

> Dynamically update the loaded Hive UDF JAR
> --
>
> Key: SPARK-37840
> URL: https://issues.apache.org/jira/browse/SPARK-37840
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> In the production environment, Spark ThriftServer needs to be restarted if 
> jar files are updated after the UDF jars have been loaded.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37840) Dynamically update the loaded Hive UDF JAR

2022-01-11 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472537#comment-17472537
 ] 

Rakesh Raushan commented on SPARK-37840:


The problem is with `jarClassLoader`. We need to update our `jarClassLoader` 
after the updated jar is added.
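
A minimal sketch of the idea (plain JVM classloading, not Spark's internal `jarClassLoader`; the `ReloadableJarLoader` name is hypothetical): rebuild the classloader from the latest jar URLs so that subsequent class lookups see the updated classes.

{code:java}
import java.net.{URL, URLClassLoader}

// Hypothetical sketch of swapping in a fresh classloader after a jar is replaced.
class ReloadableJarLoader(parent: ClassLoader) {
  @volatile private var current: URLClassLoader =
    new URLClassLoader(Array.empty[URL], parent)

  // Rebuild the loader from the latest set of jar URLs; classes from the stale
  // jar are no longer served by new lookups.
  def refresh(jarUrls: Seq[URL]): Unit = {
    val old = current
    current = new URLClassLoader(jarUrls.toArray, parent)
    old.close() // release file handles to the old jar
  }

  def loadClass(name: String): Class[_] = current.loadClass(name)
}
{code}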

> Dynamically update the loaded Hive UDF JAR
> --
>
> Key: SPARK-37840
> URL: https://issues.apache.org/jira/browse/SPARK-37840
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> In the production environment, Spark ThriftServer needs to be restarted if 
> jar files are updated after the UDF jars have been loaded.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37840) Dynamically update the loaded Hive UDF JAR

2022-01-07 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470793#comment-17470793
 ] 

Rakesh Raushan commented on SPARK-37840:


We can dynamically update our UDF jars after loading them. I will try to raise 
a PR soon for this.

> Dynamically update the loaded Hive UDF JAR
> --
>
> Key: SPARK-37840
> URL: https://issues.apache.org/jira/browse/SPARK-37840
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> In the production environment, Spark ThriftServer needs to be restarted if 
> jar files are updated after the UDF jars have been loaded.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32924) Web UI sort on duration is wrong

2020-10-09 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210977#comment-17210977
 ] 

Rakesh Raushan edited comment on SPARK-32924 at 10/9/20, 1:49 PM:
--

I think it's due to string sorting. A similar issue was fixed in SPARK-31983.


was (Author: rakson):
I think it's due to string sorting. A similar issue was fixed in 
[SPARK-31983|https://issues.apache.org/jira/browse/SPARK-31983]

> Web UI sort on duration is wrong
> 
>
> Key: SPARK-32924
> URL: https://issues.apache.org/jira/browse/SPARK-32924
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.6
>Reporter: t oo
>Priority: Major
> Attachments: ui_sort.png
>
>
> See the attachment: 9 s(econds) is shown as larger than 8.1 min



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32924) Web UI sort on duration is wrong

2020-10-09 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210977#comment-17210977
 ] 

Rakesh Raushan commented on SPARK-32924:


I think it's due to string sorting. A similar issue was fixed in 
[SPARK-31983|https://issues.apache.org/jira/browse/SPARK-31983]

> Web UI sort on duration is wrong
> 
>
> Key: SPARK-32924
> URL: https://issues.apache.org/jira/browse/SPARK-32924
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.6
>Reporter: t oo
>Priority: Major
> Attachments: ui_sort.png
>
>
> See the attachment: 9 s(econds) is shown as larger than 8.1 min



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32694) Pushdown cast to data sources

2020-08-24 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183719#comment-17183719
 ] 

Rakesh Raushan commented on SPARK-32694:


One of the proposed [solutions|https://github.com/apache/spark/pull/27648] for a 
similar issue.

> Pushdown cast to data sources
> -
>
> Key: SPARK-32694
> URL: https://issues.apache.org/jira/browse/SPARK-32694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently we don't support pushing down cast to data source (see 
> [link|http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSql-Casting-of-Predicate-Literals-tp29956p30035.html]
>  for a discussion). For instance, in the following code snippet:
> {code}
> scala> case class Person(name: String, age: Short)
> scala> Seq(Person("John", 32), Person("David", 25), Person("Mike", 
> 18)).toDS().write.parquet("/tmp/person.parquet")
> scala> val personDS = spark.read.parquet("/tmp/person.parquet")
> scala> personDS.createOrReplaceTempView("person")
> scala> spark.sql("SELECT * FROM person where age < 30")
> {code}
> The predicate won't be pushed down to Parquet data source because in 
> {{DataSourceStrategy}}, {{PushableColumnBase}} only handles a few limited 
> cases such as {{Attribute}} and {{GetStructField}}. Potentially we can handle 
> {{Cast}} here as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31983) Tables of structured streaming tab show wrong result for duration column

2020-06-13 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31983:
--

 Summary: Tables of structured streaming tab show wrong result for 
duration column
 Key: SPARK-31983
 URL: https://issues.apache.org/jira/browse/SPARK-31983
 Project: Spark
  Issue Type: Bug
  Components: SQL, Web UI
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


The sorting result for the duration column in the tables of the Structured Streaming 
tab is sometimes wrong because we sort on string values. Consider "3ms" and "12ms".
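
A small illustration of the problem in plain Scala (not the actual UI code):

{code:java}
// Lexicographic ordering of the rendered duration strings does not match the
// numeric ordering of the underlying durations.
val rendered = Seq("3ms", "12ms", "9 s", "8.1 min").sorted
// => List(12ms, 3ms, 8.1 min, 9 s)   -- "12ms" sorts before "3ms"

// Sorting on the underlying millisecond values gives the expected order.
val byValue = Seq(("3ms", 3L), ("12ms", 12L), ("9 s", 9000L), ("8.1 min", 486000L))
  .sortBy(_._2)
  .map(_._1)
// => List(3ms, 12ms, 9 s, 8.1 min)
{code}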




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp

2020-05-30 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120335#comment-17120335
 ] 

Rakesh Raushan commented on SPARK-31873:


Yeah. This is a problem with 2.4.5.
{code:java}
scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1299|
+-------------------+-------------------+--------+
{code}
[~hyukjin.kwon] Does this need to be fixed in 2.4.5? If so, I can check this.

> Spark Sql Function year does not extract year from date/timestamp
> -
>
> Key: SPARK-31873
> URL: https://issues.apache.org/jira/browse/SPARK-31873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Deepak Shingavi
>Priority: Major
>
> There is a Spark SQL function,
> org.apache.spark.sql.functions.year, which fails in the case below:
>  
> {code:java}
> // Code to extract year from Timestamp
> val df = Seq(
>   ("1300-01-03 00:00:00")
> ).toDF("date_val")
>   .withColumn("date_val_ts", to_timestamp(col("date_val")))
>   .withColumn("year_val", year(to_timestamp(col("date_val"
> df.show()
> //Output of the above code
> +---+---++
> |   date_val|date_val_ts|year_val|
> +---+---++
> |1300-01-03 00:00:00|1300-01-03 00:00:00|1299|
> +---+---++
> {code}
>  
> The above code works perfectly for all the years greater than 1300
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp

2020-05-30 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120301#comment-17120301
 ] 

Rakesh Raushan edited comment on SPARK-31873 at 5/30/20, 4:28 PM:
--

{code:java}
scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1300|
+-------------------+-------------------+--------+

{code}
 

This works fine on the master branch.


was (Author: rakson):
scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1300|
+-------------------+-------------------+--------+

 

This works fine on the master branch.

> Spark Sql Function year does not extract year from date/timestamp
> -
>
> Key: SPARK-31873
> URL: https://issues.apache.org/jira/browse/SPARK-31873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Deepak Shingavi
>Priority: Major
>
> There is a Spark SQL function,
> org.apache.spark.sql.functions.year, which fails in the case below:
>  
> {code:java}
> // Code to extract year from Timestamp
> val df = Seq(
>   ("1300-01-03 00:00:00")
> ).toDF("date_val")
>   .withColumn("date_val_ts", to_timestamp(col("date_val")))
>   .withColumn("year_val", year(to_timestamp(col("date_val"
> df.show()
> //Output of the above code
> +---+---++
> |   date_val|date_val_ts|year_val|
> +---+---++
> |1300-01-03 00:00:00|1300-01-03 00:00:00|1299|
> +---+---++
> {code}
>  
> The above code works perfectly for all the years greater than 1300
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp

2020-05-30 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120301#comment-17120301
 ] 

Rakesh Raushan commented on SPARK-31873:


scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1300|
+-------------------+-------------------+--------+

 

This works fine on the master branch.

> Spark Sql Function year does not extract year from date/timestamp
> -
>
> Key: SPARK-31873
> URL: https://issues.apache.org/jira/browse/SPARK-31873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Deepak Shingavi
>Priority: Major
>
> There is a Spark SQL function,
> org.apache.spark.sql.functions.year, which fails in the case below:
>  
> {code:java}
> // Code to extract year from Timestamp
> val df = Seq(
>   ("1300-01-03 00:00:00")
> ).toDF("date_val")
>   .withColumn("date_val_ts", to_timestamp(col("date_val")))
>   .withColumn("year_val", year(to_timestamp(col("date_val"
> df.show()
> //Output of the above code
> +---+---++
> |   date_val|date_val_ts|year_val|
> +---+---++
> |1300-01-03 00:00:00|1300-01-03 00:00:00|1299|
> +---+---++
> {code}
>  
> The above code works perfectly for all the years greater than 1300
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-26 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116738#comment-17116738
 ] 

Rakesh Raushan commented on SPARK-31763:


Shall I open a PR for this?

> DataFrame.inputFiles() not Available
> 
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have been trying to list inputFiles that compose my DataSet by using 
> *PySpark* 
> spark_session.read
>  .format(sourceFileFormat)
>  .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>  *.inputFiles();*
> but I get an exception saying the inputFiles attribute is not present. However, I was 
> able to get this functionality with the Spark Java API. 
> *So is this something missing in PySpark?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31688) Refactor pagination framework for spark web UI pages

2020-05-12 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31688:
--

 Summary: Refactor pagination framework for spark web UI pages
 Key: SPARK-31688
 URL: https://issues.apache.org/jira/browse/SPARK-31688
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.1.0
Reporter: Rakesh Raushan


Currently, a large chunk of code is copied whenever we implement pagination using 
the current pagination framework. We also embed a lot of HTML, which decreases code 
readability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31104) Add documentation for all new Json Functions

2020-05-08 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102438#comment-17102438
 ] 

Rakesh Raushan commented on SPARK-31104:


[~hyukjin.kwon] We can mark this as resolved since this task has already been 
completed.

> Add documentation for all new Json Functions
> 
>
> Key: SPARK-31104
> URL: https://issues.apache.org/jira/browse/SPARK-31104
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31470) Introduce SORTED BY clause in CREATE TABLE statement

2020-05-08 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102437#comment-17102437
 ] 

Rakesh Raushan commented on SPARK-31470:


If this is required by the community and [~yumwang] has not started working on it, 
I can work on this.

[~yumwang] What do you say?

> Introduce SORTED BY clause in CREATE TABLE statement
> 
>
> Key: SPARK-31470
> URL: https://issues.apache.org/jira/browse/SPARK-31470
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We usually sort on frequently filtered columns when writing data to improve 
> query performance. But this information is not recorded in the table metadata.
>  
> {code:sql}
> CREATE TABLE t(day INT, hour INT, year INT, month INT)
> USING parquet
> PARTITIONED BY (year, month)
> SORTED BY (day, hour);
> {code}
>  
> Impala, Oracle and redshift support this clause:
> https://issues.apache.org/jira/browse/IMPALA-4166
> https://docs.oracle.com/database/121/DWHSG/attcluster.htm#GUID-DAECFBC5-FD1A-45A5-8C2C-DC9884D0857B
> https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31642) Support pagination for spark structured streaming tab

2020-05-05 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099961#comment-17099961
 ] 

Rakesh Raushan commented on SPARK-31642:


I am working on it

 

> Support pagination for  spark structured streaming tab
> --
>
> Key: SPARK-31642
> URL: https://issues.apache.org/jira/browse/SPARK-31642
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> Support pagination for spark structured streaming tab



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31638) Clean code for pagination for all pages

2020-05-04 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31638:
---
Description: Clean code for pagination for different pages of spark webUI  
(was: Clean code for pagination for different pages of spark web)

> Clean code for pagination for all pages
> ---
>
> Key: SPARK-31638
> URL: https://issues.apache.org/jira/browse/SPARK-31638
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Clean code for pagination for different pages of spark webUI



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31638) Clean code for pagination for all pages

2020-05-04 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31638:
---
Description: Clean code for pagination for different pages of spark web

> Clean code for pagination for all pages
> ---
>
> Key: SPARK-31638
> URL: https://issues.apache.org/jira/browse/SPARK-31638
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Clean code for pagination for different pages of spark web



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31638) Clean code for pagination for all pages

2020-05-04 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31638:
--

 Summary: Clean code for pagination for all pages
 Key: SPARK-31638
 URL: https://issues.apache.org/jira/browse/SPARK-31638
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.1.0
 Environment: Clean code for pagination for different pages of webUI
Reporter: Rakesh Raushan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31638) Clean code for pagination for all pages

2020-05-04 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31638:
---
Environment: (was: Clean code for pagination for different pages of 
webUI)

> Clean code for pagination for all pages
> ---
>
> Key: SPARK-31638
> URL: https://issues.apache.org/jira/browse/SPARK-31638
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31478) Executors Stop() method is not executed when they are killed

2020-04-18 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31478:
--

 Summary: Executors Stop() method is not executed when they are 
killed
 Key: SPARK-31478
 URL: https://issues.apache.org/jira/browse/SPARK-31478
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Rakesh Raushan


With dynamic allocation, when executors are killed, the stop() method of the 
executors is never called, so executors never shut down properly.

In SPARK-29152, a shutdown hook was added to stop the executors properly.

Instead of forcing a shutdown hook, we should ask executors to stop themselves 
before killing them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31397) Support json_arrayAgg

2020-04-09 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079268#comment-17079268
 ] 

Rakesh Raushan commented on SPARK-31397:


The same functionality can be achieved using to/from_json, and the performance will 
be almost equivalent, so we do not need to implement this new function.
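
For example, a sketch of the workaround in spark-shell (column names are illustrative):

{code:java}
import org.apache.spark.sql.functions.{col, collect_list, struct, to_json}

val df = Seq(("a", 1), ("b", 2)).toDF("k", "v")

// collect_list gathers the rows into an array of structs and to_json renders that
// array as a JSON string, which is what a json_arrayAgg function would return.
val jsonArray = df.agg(to_json(collect_list(struct(col("k"), col("v")))).as("json_array"))
jsonArray.show(false)
// json_array: [{"k":"a","v":1},{"k":"b","v":2}]
{code}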

> Support json_arrayAgg
> -
>
> Key: SPARK-31397
> URL: https://issues.apache.org/jira/browse/SPARK-31397
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Returns a JSON array by aggregating all the JSON arrays from a set of JSON 
> arrays, or by aggregating the values of a Column.
> Some of the Databases supporting this aggregate function are:
>  * MySQL
>  * PostgreSQL
>  * Maria_DB
>  * Sqlite
>  * IBM Db2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31397) Support json_arrayAgg

2020-04-09 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31397:
---
Comment: was deleted

(was: I am working on it.)

> Support json_arrayAgg
> -
>
> Key: SPARK-31397
> URL: https://issues.apache.org/jira/browse/SPARK-31397
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Returns a JSON array by aggregating all the JSON arrays from a set of JSON 
> arrays, or by aggregating the values of a Column.
> Some of the Databases supporting this aggregate function are:
>  * MySQL
>  * PostgreSQL
>  * Maria_DB
>  * Sqlite
>  * IBM Db2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31396) Support json_objectAgg function

2020-04-09 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079267#comment-17079267
 ] 

Rakesh Raushan commented on SPARK-31396:


We can achieve the same functionality using to/from_json. So we do not need 
this new function.

 

> Support json_objectAgg function
> ---
>
> Key: SPARK-31396
> URL: https://issues.apache.org/jira/browse/SPARK-31396
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Returns a JSON object containing the key-value pairs by aggregating the 
> key-values of a set of objects or columns. 
>  
> This aggregate function is supported by: 
>  * MySQL
>  * PostgreSQL
>  * IBM Db2
>  * Maria_DB
>  * Sqlite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31396) Support json_objectAgg function

2020-04-09 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31396:
---
Comment: was deleted

(was: I am working on it.)

> Support json_objectAgg function
> ---
>
> Key: SPARK-31396
> URL: https://issues.apache.org/jira/browse/SPARK-31396
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Returns a JSON object containing the key-value pairs by aggregating the 
> key-values of set of Objects or columns. 
>  
> This aggregate function is supported by: 
>  * MySQL
>  * PostgreSQL
>  * IBM Db2
>  * Maria_DB
>  * Sqlite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31397) Support json_arrayAgg

2020-04-09 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079226#comment-17079226
 ] 

Rakesh Raushan commented on SPARK-31397:


I am working on it.

> Support json_arrayAgg
> -
>
> Key: SPARK-31397
> URL: https://issues.apache.org/jira/browse/SPARK-31397
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Returns a JSON array by aggregating all the JSON arrays from a set of JSON 
> arrays, or by aggregating the values of a Column.
> Some of the Databases supporting this aggregate function are:
>  * MySQL
>  * PostgreSQL
>  * Maria_DB
>  * Sqlite
>  * IBM Db2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31397) Support json_arrayAgg

2020-04-09 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31397:
--

 Summary: Support json_arrayAgg
 Key: SPARK-31397
 URL: https://issues.apache.org/jira/browse/SPARK-31397
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Rakesh Raushan


Returns a JSON array by aggregating all the JSON arrays from a set of JSON 
arrays, or by aggregating the values of a Column.

Some of the Databases supporting this aggregate function are:
 * MySQL
 * PostgreSQL
 * Maria_DB
 * Sqlite
 * IBM Db2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31396) Support json_objectAgg function

2020-04-09 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079198#comment-17079198
 ] 

Rakesh Raushan commented on SPARK-31396:


I am working on it.

> Support json_objectAgg function
> ---
>
> Key: SPARK-31396
> URL: https://issues.apache.org/jira/browse/SPARK-31396
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Returns a JSON object containing the key-value pairs by aggregating the 
> key-values of set of Objects or columns. 
>  
> This aggregate function is supported by: 
>  * MySQL
>  * PostgreSQL
>  * IBM Db2
>  * Maria_DB
>  * Sqlite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31396) Support json_objectAgg function

2020-04-09 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31396:
--

 Summary: Support json_objectAgg function
 Key: SPARK-31396
 URL: https://issues.apache.org/jira/browse/SPARK-31396
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Rakesh Raushan


Returns a JSON object containing key-value pairs, built by aggregating the key-value pairs of a set of objects or columns.

 

This aggregate function is supported by: 
 * MySQL
 * PostgreSQL
 * IBM Db2
 * Maria_DB
 * Sqlite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31106) Support is_json function

2020-04-09 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31106:
---
Description: 
This function will allow users to verify whether the given string is valid JSON 
or not. It returns `true` for valid JSON and `false` for invalid JSON. `NULL` 
is returned for `NULL` input.

DBMSs supporting this function are:
 * MySQL
 * SQL Server
 * Sqlite
 * MariaDB
 * Amazon Redshift
 * IBM Db2

  was:
Currently, null is returned when we come across invalid json. We should either 
throw an exception for invalid json or false should be returned, like in other 
DBMSs. Like in `json_array_length` function we need to return NULL for null 
array. So this might confuse users.

 

DBMSs supporting this functions are :
 * MySQL
 * SQL Server
 * Sqlite
 * MariaDB
 * Amazon Redshift
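Purely as an illustration of the true/false/NULL semantics in the updated description above: is_json does not exist in Spark today, so the snippet below is only a hypothetical UDF-based approximation (spark-shell assumed; Jackson is used because it is already on Spark's classpath), not the proposed implementation.

{code:scala}
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.functions.udf

// NULL input -> NULL, parseable JSON -> true, anything else -> false.
val isJson = udf { s: String =>
  if (s == null) None
  else {
    val mapper = new ObjectMapper() // created per call only to keep the sketch simple
    try { mapper.readTree(s); Some(true) } catch { case _: Exception => Some(false) }
  }
}
spark.udf.register("is_json", isJson)

spark.sql("""SELECT is_json('{"a": 1}'), is_json('not json'), is_json(NULL)""").show()
// expected output: true, false, null
{code}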


> Support is_json function
> 
>
> Key: SPARK-31106
> URL: https://issues.apache.org/jira/browse/SPARK-31106
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will allow users to verify whether the given string is valid 
> JSON or not. It returns `true` for valid JSON and `false` for invalid JSON. 
> `NULL` is returned for `NULL` input.
> DBMSs supporting this functions are :
>  * MySQL
>  * SQL Server
>  * Sqlite
>  * MariaDB
>  * Amazon Redshift
>  * IBM Db2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31106) Support is_json function

2020-04-09 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31106:
---
Summary: Support is_json function  (was: Support IS_JSON)

> Support is_json function
> 
>
> Key: SPARK-31106
> URL: https://issues.apache.org/jira/browse/SPARK-31106
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Currently, null is returned when we come across invalid json. We should 
> either throw an exception for invalid json or false should be returned, like 
> in other DBMSs. Like in `json_array_length` function we need to return NULL 
> for null array. So this might confuse users.
>  
> DBMSs supporting this functions are :
>  * MySQL
>  * SQL Server
>  * Sqlite
>  * MariaDB
>  * Amazon Redshift



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31106) Support IS_JSON

2020-04-09 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31106:
---
Description: 
Currently, null is returned when we come across invalid JSON. We should either throw an exception for invalid JSON or return false, as other DBMSs do. In a function like `json_array_length` we need to return NULL for a null array, so also returning null for invalid JSON might confuse users.

DBMSs supporting this function are:
 * MySQL
 * SQL Server
 * Sqlite
 * MariaDB
 * Amazon Redshift

  was:
Currently, null is returned when we come across invalid json. We should either 
throw an exception for invalid json or false should be returned, like in other 
DBMSs. Like in `json_array_length` function we need to return NULL for null 
array. So this might confuse users.

 

DBMSs supporting this functions are :
 * MySQL
 * SQL Server
 * Sqlite
 * MariaDB


> Support IS_JSON
> ---
>
> Key: SPARK-31106
> URL: https://issues.apache.org/jira/browse/SPARK-31106
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Currently, null is returned when we come across invalid json. We should 
> either throw an exception for invalid json or false should be returned, like 
> in other DBMSs. Like in `json_array_length` function we need to return NULL 
> for null array. So this might confuse users.
>  
> DBMSs supporting this functions are :
>  * MySQL
>  * SQL Server
>  * Sqlite
>  * MariaDB
>  * Amazon Redshift



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31369) Add Documentation for JSON functions

2020-04-06 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076930#comment-17076930
 ] 

Rakesh Raushan commented on SPARK-31369:


I am working on it.

> Add Documentation for JSON functions
> 
>
> Key: SPARK-31369
> URL: https://issues.apache.org/jira/browse/SPARK-31369
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31369) Add Documentation for JSON functions

2020-04-06 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31369:
---
Parent: SPARK-28588
Issue Type: Sub-task  (was: Documentation)

> Add Documentation for JSON functions
> 
>
> Key: SPARK-31369
> URL: https://issues.apache.org/jira/browse/SPARK-31369
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31369) Add Documentation for JSON functions

2020-04-06 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31369:
--

 Summary: Add Documentation for JSON functions
 Key: SPARK-31369
 URL: https://issues.apache.org/jira/browse/SPARK-31369
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31104) Add documentation for all new Json Functions

2020-04-06 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31104:
---
Summary: Add documentation for all new Json Functions  (was: Add 
documentation for all the Json Functions)

> Add documentation for all new Json Functions
> 
>
> Key: SPARK-31104
> URL: https://issues.apache.org/jira/browse/SPARK-31104
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31104) Add documentation for all the Json Functions

2020-03-10 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056584#comment-17056584
 ] 

Rakesh Raushan commented on SPARK-31104:


I am working on it.

> Add documentation for all the Json Functions
> 
>
> Key: SPARK-31104
> URL: https://issues.apache.org/jira/browse/SPARK-31104
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31106) Support IS_JSON

2020-03-10 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055757#comment-17055757
 ] 

Rakesh Raushan commented on SPARK-31106:


I am working on it.

> Support IS_JSON
> ---
>
> Key: SPARK-31106
> URL: https://issues.apache.org/jira/browse/SPARK-31106
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> Currently, null is returned when we come across invalid json. We should 
> either throw an exception for invalid json or false should be returned, like 
> in other DBMSs. Like in `json_array_length` function we need to return NULL 
> for null array. So this might confuse users.
>  
> DBMSs supporting this functions are :
>  * MySQL
>  * SQL Server
>  * Sqlite
>  * MariaDB



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31106) Support IS_JSON

2020-03-10 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31106:
--

 Summary: Support IS_JSON
 Key: SPARK-31106
 URL: https://issues.apache.org/jira/browse/SPARK-31106
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Rakesh Raushan


Currently, null is returned when we come across invalid JSON. We should either throw an exception for invalid JSON or return false, as other DBMSs do. In a function like `json_array_length` we need to return NULL for a null array, so also returning null for invalid JSON might confuse users.

DBMSs supporting this function are:
 * MySQL
 * SQL Server
 * Sqlite
 * MariaDB



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31104) Add documentation for all the Json Functions

2020-03-10 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31104:
--

 Summary: Add documentation for all the Json Functions
 Key: SPARK-31104
 URL: https://issues.apache.org/jira/browse/SPARK-31104
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.1.0
Reporter: Rakesh Raushan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31103) Extend Support for useful JSON Functions

2020-03-10 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31103:
--

 Summary: Extend Support for useful JSON Functions
 Key: SPARK-31103
 URL: https://issues.apache.org/jira/browse/SPARK-31103
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.1.0
Reporter: Rakesh Raushan


Currently, Spark only supports a few JSON functions. There are many other common utility functions that are supported by other popular DBMSs. Supporting these functions will make things easier for prospective users. Also, some functions like `json_array_length` and `json_object_keys` are more intuitive, and they would make life much simpler for new users.

I have listed some JSON functions that I am working on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31009) Support json_object_keys function

2020-03-04 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31009:
---
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will return all the keys from outer json object.
>  
> PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> Mysql -> 
> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]
> MariaDB -> [https://mariadb.com/kb/en/json-functions/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31008) Support json_array_length function

2020-03-04 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31008:
---
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support json_array_length function
> --
>
> Key: SPARK-31008
> URL: https://issues.apache.org/jira/browse/SPARK-31008
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> At the moment we don't support json_array_length function in spark.
> This function is supported by
> a.) PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> b.) Presto -> [https://prestodb.io/docs/current/functions/json.html]
> c.) redshift -> 
> [https://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html]
>  
> This allows naive users to directly get array length with a well defined json 
> function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31009) Support json_object_keys function

2020-03-04 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31009:
---
Description: 
This function will return all the keys of the outermost JSON object.

 

PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]

Mysql -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]

MariaDB -> [https://mariadb.com/kb/en/json-functions/]

  was:
This function will return all the keys from outer json object.

 

PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]

Mysql -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]


> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will return all the keys from outer json object.
>  
> PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> Mysql -> 
> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]
> MariaDB -> [https://mariadb.com/kb/en/json-functions/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31009) Support json_object_keys function

2020-03-02 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31009:
---
Description: 
This function will return all the keys from outer json object.

 

PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]

Mysql -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]

  was:
This function will return all the keys from outer json object.

 

PostgreSQL support this function -> 
[https://www.postgresql.org/docs/9.3/functions-json.html]


> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will return all the keys from outer json object.
>  
> PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> Mysql -> 
> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31009) Support json_object_keys function

2020-03-02 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049914#comment-17049914
 ] 

Rakesh Raushan commented on SPARK-31009:


[~hyukjin.kwon] I updated the description. PostgreSQL supports this function.

We can use `from_json` to convert the JSON to a `MapType` and then extract the keys, but that won't be optimal.

Apart from this one, there are some JSON functions that are supported by PostgreSQL, Presto, Redshift and Teradata.

Maybe we can discuss supporting some of them. I have already raised a PR for `json_array_length`, which is supported by all of the above-mentioned DBMSs.
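A small sketch of that `from_json`-into-MapType workaround (spark-shell assumed; the JSON column and values are made up), which is what a dedicated json_object_keys would avoid:

{code:scala}
import org.apache.spark.sql.functions.{from_json, map_keys}
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}
import spark.implicits._

val df = Seq("""{"a": 1, "b": 2}""").toDF("js")

// Parse the whole object into a map, then take its keys.
df.select(map_keys(from_json($"js", MapType(StringType, IntegerType))).as("keys")).show(false)
// expected output: [a, b]
{code}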

> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will return all the keys from outer json object.
>  
> PostgreSQL support this function -> 
> [https://www.postgresql.org/docs/9.3/functions-json.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31009) Support json_object_keys function

2020-03-02 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31009:
---
Description: 
This function will return all the keys from outer json object.

 

PostgreSQL support this function -> 
[https://www.postgresql.org/docs/9.3/functions-json.html]

  was:This function will return all the keys from outer json object.


> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will return all the keys from outer json object.
>  
> PostgreSQL support this function -> 
> [https://www.postgresql.org/docs/9.3/functions-json.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28427) Support more Postgres JSON functions

2020-03-02 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049367#comment-17049367
 ] 

Rakesh Raushan commented on SPARK-28427:


I think we should add some of Postgres JSON functions to Spark.

> Support more Postgres JSON functions
> 
>
> Key: SPARK-28427
> URL: https://issues.apache.org/jira/browse/SPARK-28427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> Postgres features a number of JSON functions that are missing in Spark: 
> https://www.postgresql.org/docs/9.3/functions-json.html
> Redshift's JSON functions 
> (https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html) have 
> partial overlap with the Postgres list.
> Some of these functions can be expressed in terms of compositions of existing 
> Spark functions. For example, I think that {{json_array_length}} can be 
> expressed with {{cardinality}} and {{from_json}}, but there's a caveat 
> related to legacy Hive compatibility (see the demo notebook at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5796212617691211/45530874214710/4901752417050771/latest.html
>  for more details).
> I'm filing this ticket so that we can triage the list of Postgres JSON 
> features and decide which ones make sense to support in Spark. After we've 
> done that, we can create individual tickets for specific functions and 
> features.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31009) Support json_object_keys function

2020-03-02 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049116#comment-17049116
 ] 

Rakesh Raushan commented on SPARK-31009:


I am working on this.

> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will return all the keys from outer json object.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31009) Support json_object_keys function

2020-03-02 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31009:
--

 Summary: Support json_object_keys function
 Key: SPARK-31009
 URL: https://issues.apache.org/jira/browse/SPARK-31009
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


This function will return all the keys of the outermost JSON object.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31008) Support json_array_length function

2020-03-02 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049097#comment-17049097
 ] 

Rakesh Raushan commented on SPARK-31008:


I will raise a PR soon

> Support json_array_length function
> --
>
> Key: SPARK-31008
> URL: https://issues.apache.org/jira/browse/SPARK-31008
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> At the moment we don't support json_array_length function in spark.
> This function is supported by
> a.) PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> b.) Presto -> [https://prestodb.io/docs/current/functions/json.html]
> c.) redshift -> 
> [https://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html]
>  
> This allows naive users to directly get array length with a well defined json 
> function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31008) Support json_array_length function

2020-03-02 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-31008:
--

 Summary: Support json_array_length function
 Key: SPARK-31008
 URL: https://issues.apache.org/jira/browse/SPARK-31008
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


At the moment we don't support the json_array_length function in Spark.

This function is supported by

a.) PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html]

b.) Presto -> [https://prestodb.io/docs/current/functions/json.html]

c.) redshift -> [https://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html]

This allows new users to directly get the array length with a well-defined JSON function.
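In the meantime, a hedged sketch of the existing workaround this function would wrap (spark-shell assumed, made-up JSON column):

{code:scala}
import org.apache.spark.sql.functions.{from_json, size}
import org.apache.spark.sql.types.{ArrayType, StringType}
import spark.implicits._

val df = Seq("""["x", "y", "z"]""").toDF("js")

// Parse the JSON array, then take its length with size (a.k.a. cardinality in SQL).
df.select(size(from_json($"js", ArrayType(StringType))).as("len")).show()
// expected output: 3
{code}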



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30917) The behaviour of UnaryMinus should not depend on SQLConf.get

2020-02-21 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041774#comment-17041774
 ] 

Rakesh Raushan commented on SPARK-30917:


I am working on this.

> The behaviour of UnaryMinus should not depend on SQLConf.get
> 
>
> Key: SPARK-30917
> URL: https://issues.apache.org/jira/browse/SPARK-30917
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30917) The behaviour of UnaryMinus should not depend on SQLConf.get

2020-02-21 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30917:
--

 Summary: The behaviour of UnaryMinus should not depend on 
SQLConf.get
 Key: SPARK-30917
 URL: https://issues.apache.org/jira/browse/SPARK-30917
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get

2020-02-21 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-30896:
---
Comment: was deleted

(was: I am working on this.)

> The behavior of JsonToStructs should not depend on SQLConf.get
> --
>
> Key: SPARK-30896
> URL: https://issues.apache.org/jira/browse/SPARK-30896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30896) The behavior of JsonToStructs should not depend on SQLConf.get

2020-02-21 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041682#comment-17041682
 ] 

Rakesh Raushan commented on SPARK-30896:


I am working on this.

> The behavior of JsonToStructs should not depend on SQLConf.get
> --
>
> Key: SPARK-30896
> URL: https://issues.apache.org/jira/browse/SPARK-30896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30852) Use Long instead of Int as argument type in Dataset limit method

2020-02-18 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039031#comment-17039031
 ] 

Rakesh Raushan commented on SPARK-30852:


Ah, in that case we cannot allow long values. Tail also returns only an array.

Can we mark this issue as Won't Fix then?

> Use Long instead of Int as argument type in Dataset limit method
> 
>
> Key: SPARK-30852
> URL: https://issues.apache.org/jira/browse/SPARK-30852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Damianos Christophides
>Priority: Minor
>
> The Dataset limit method takes an input of type Int, which is a 32bit 
> integer. The numerical upper limit of this type is 2,147,483,647. I found in 
> my work to need to apply a limit to a Dataset higher than that which gives an 
> error:
> "py4j.Py4JException: Method limit([class java.lang.Long]) does not exist"
>  
> Could the input type of the limit method be changed to a Long (64bit)?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30852) Use Long instead of Int as argument type in Dataset limit method

2020-02-17 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038304#comment-17038304
 ] 

Rakesh Raushan edited comment on SPARK-30852 at 2/17/20 12:33 PM:
--

[~cloud_fan] [~dongjoon] A long value can be used as the limit expression in Presto and PostgreSQL. I think Spark should also allow long values for the limit expression.


was (Author: rakson):
[~cloud_fan] [~dongjoon] long value can be used as limit expression in presto. 
I think spark should also allow long values for limit expression.

> Use Long instead of Int as argument type in Dataset limit method
> 
>
> Key: SPARK-30852
> URL: https://issues.apache.org/jira/browse/SPARK-30852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Damianos Christophides
>Priority: Minor
>
> The Dataset limit method takes an input of type Int, which is a 32bit 
> integer. The numerical upper limit of this type is 2,147,483,647. I found in 
> my work to need to apply a limit to a Dataset higher than that which gives an 
> error:
> "py4j.Py4JException: Method limit([class java.lang.Long]) does not exist"
>  
> Could the input type of the limit method be changed to a Long (64bit)?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30852) Use Long instead of Int as argument type in Dataset limit method

2020-02-17 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038304#comment-17038304
 ] 

Rakesh Raushan commented on SPARK-30852:


[~cloud_fan] [~dongjoon] long value can be used as limit expression in presto. 
I think spark should also allow long values for limit expression.

> Use Long instead of Int as argument type in Dataset limit method
> 
>
> Key: SPARK-30852
> URL: https://issues.apache.org/jira/browse/SPARK-30852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Damianos Christophides
>Priority: Minor
>
> The Dataset limit method takes an input of type Int, which is a 32bit 
> integer. The numerical upper limit of this type is 2,147,483,647. I found in 
> my work to need to apply a limit to a Dataset higher than that which gives an 
> error:
> "py4j.Py4JException: Method limit([class java.lang.Long]) does not exist"
>  
> Could the input type of the limit method be changed to a Long (64bit)?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30852) Use Long instead of Int as argument type in Dataset limit method

2020-02-17 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038278#comment-17038278
 ] 

Rakesh Raushan commented on SPARK-30852:


I will check the issue.

> Use Long instead of Int as argument type in Dataset limit method
> 
>
> Key: SPARK-30852
> URL: https://issues.apache.org/jira/browse/SPARK-30852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Damianos Christophides
>Priority: Minor
>
> The Dataset limit method takes an input of type Int, which is a 32bit 
> integer. The numerical upper limit of this type is 2,147,483,647. I found in 
> my work to need to apply a limit to a Dataset higher than that which gives an 
> error:
> "py4j.Py4JException: Method limit([class java.lang.Long]) does not exist"
>  
> Could the input type of the limit method be changed to a Long (64bit)?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27619) MapType should be prohibited in hash expressions

2020-02-14 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036831#comment-17036831
 ] 

Rakesh Raushan commented on SPARK-27619:


I am working on this.

> MapType should be prohibited in hash expressions
> 
>
> Key: SPARK-27619
> URL: https://issues.apache.org/jira/browse/SPARK-27619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>
> Spark currently allows MapType expressions to be used as input to hash 
> expressions, but I think that this should be prohibited because Spark SQL 
> does not support map equality.
> Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
> map elements:
> {code:java}
> val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
> val b = spark.createDataset(Map(2->2, 1->1) :: Nil)
> // Demonstration of how Scala Map equality is unaffected by insertion order:
> assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
> assert(Map(1->1, 2->2) == Map(2->2, 1->1))
> assert(a.first() == b.first())
> // In contrast, this will print two different hashcodes:
> println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
> This behavior might be surprising to Scala developers.
> I think there's precedence for banning the use of MapType here because we 
> already prohibit MapType in aggregation / joins / equality comparisons 
> (SPARK-9415) and set operations (SPARK-19893).
> If we decide that we want this to be an error then it might also be a good 
> idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the 
> old and buggy behavior (in case applications were relying on it in cases 
> where it just so happens to be safe-by-accident (e.g. maps which only have 
> one entry)).
> Alternatively, we could support hashing here if we implemented support for 
> comparable map types (SPARK-18134).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE

2020-02-11 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034479#comment-17034479
 ] 

Rakesh Raushan commented on SPARK-27545:


Please assign this to me. Thanks

> Update the Documentation for CACHE TABLE and UNCACHE TABLE
> --
>
> Key: SPARK-27545
> URL: https://issues.apache.org/jira/browse/SPARK-27545
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
> Fix For: 3.0.0
>
>
> spark-sql> cache table v1 as select * from a;
> spark-sql> uncache table v1;
> spark-sql> cache table v1 as select * from a;
> 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 
> 0: get_table : db=apachespark tbl=a
> 2019-04-23 14:50:09,038 INFO 
> org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
> ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
> Error in query: Temporary view 'v1' already exists;
> we should document it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30790) The datatype of map() should be map

2020-02-11 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034461#comment-17034461
 ] 

Rakesh Raushan commented on SPARK-30790:


Should I expose a legacy configuration for MapType as well?

[~hyukjin.kwon]

> The datatype of map() should be map
> --
>
> Key: SPARK-30790
> URL: https://issues.apache.org/jira/browse/SPARK-30790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Currently ,
> spark.sql("select map()") gives {}.
> To be consistent with the changes made in SPARK-29462, it should return 
> map.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30790) The datatype of map() should be map

2020-02-11 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30790:
--

 Summary: The datatype of map() should be map
 Key: SPARK-30790
 URL: https://issues.apache.org/jira/browse/SPARK-30790
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


Currently, spark.sql("select map()") gives {}.

To be consistent with the changes made in SPARK-29462, it should return 
map.
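A quick way to see the behaviour in question from a spark-shell session (the exact schema printed depends on the Spark version and on how this issue is resolved):

{code:scala}
// Inspect the type Spark infers for an empty map literal,
// next to the empty array literal whose type SPARK-29462 already changed.
spark.sql("select map()").printSchema()
spark.sql("select array()").printSchema()
spark.sql("select map()").show()   // currently prints {}
{code}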



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-01-31 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027330#comment-17027330
 ] 

Rakesh Raushan commented on SPARK-30688:


I will check this issue

 

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-+
> |unix_timestamp(20201, ww)|
> +-+
> |                         null|
> +-+
>  
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> -+
> |unix_timestamp(20202, ww)|
> +-+
> |                   1578182400|
> +-+
>  
> {code}
>  
>  
> This seems to happen for leap year only, I dig deeper into it and it seems 
> that  Spark is using the java.text.SimpleDateFormat and try to parse the 
> expression here
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
>  but fail and SimpleDateFormat unable to parse the date throw Unparseable 
> Exception but Spark handle it silently and returns NULL.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30415) Improve Readability of SQLConf Doc

2020-01-04 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007975#comment-17007975
 ] 

Rakesh Raushan commented on SPARK-30415:


I didn't know that earlier. From now on I will use [MINOR].

> Improve Readability of SQLConf Doc
> --
>
> Key: SPARK-30415
> URL: https://issues.apache.org/jira/browse/SPARK-30415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Trivial
>
> Improve Readability of SQLConf Doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30415) Improve Readability of SQLConf Doc

2020-01-03 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30415:
--

 Summary: Improve Readability of SQLConf Doc
 Key: SPARK-30415
 URL: https://issues.apache.org/jira/browse/SPARK-30415
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


Improve Readability of SQLConf Doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30363) Add Documentation for Refresh Resources

2019-12-26 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003887#comment-17003887
 ] 

Rakesh Raushan commented on SPARK-30363:


I am working on it.

 

> Add Documentation for Refresh Resources
> ---
>
> Key: SPARK-30363
> URL: https://issues.apache.org/jira/browse/SPARK-30363
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Refresh Resources is not documented in the docs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30363) Add Documentation for Refresh Resources

2019-12-26 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30363:
--

 Summary: Add Documentation for Refresh Resources
 Key: SPARK-30363
 URL: https://issues.apache.org/jira/browse/SPARK-30363
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


Refresh Resources is not documented in the docs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30342) Update LIST JAR/FILE command

2019-12-23 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30342:
--

 Summary: Update LIST JAR/FILE command
 Key: SPARK-30342
 URL: https://issues.apache.org/jira/browse/SPARK-30342
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


LIST FILE/JAR command is not documented properly. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30288) Failed to write valid Parquet files when column names contains special characters like spaces

2019-12-18 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999349#comment-16999349
 ] 

Rakesh Raushan commented on SPARK-30288:


I am working on it. I will raise the PR soon.

> Failed to write valid Parquet files when column names contains special 
> characters like spaces
> -
>
> Key: SPARK-30288
> URL: https://issues.apache.org/jira/browse/SPARK-30288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jingyuan Wang
>Priority: Major
>
> When I tried to write Parquet files using PySpark with columns containing 
> some special characters in their names, it threw the following exception:
> {code}
> org.apache.spark.sql.AnalysisException: Attribute name "col 1" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:583)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:570)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   a

[jira] [Created] (SPARK-30292) Throw Exception when invalid string is cast to decimal in ANSI mode

2019-12-17 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30292:
--

 Summary: Throw Exception when invalid string is cast to decimal in 
ANSI mode
 Key: SPARK-30292
 URL: https://issues.apache.org/jira/browse/SPARK-30292
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


When spark.sql.ansi.enabled is set, if we run select cast('str' as decimal), spark-sql outputs NULL.

The ANSI SQL standard requires throwing an exception when invalid strings are cast to numbers.
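A minimal spark-shell reproduction of the behaviour described above (the flag name is taken from the description; the output shown is the current behaviour, not the proposed one):

{code:scala}
// Enable ANSI mode, then cast an invalid string to decimal.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("select cast('str' as decimal)").show()
// currently: null -- the proposal is to throw an exception here instead
{code}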

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30288) Failed to write valid Parquet files when column names contains special characters like spaces

2019-12-17 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998809#comment-16998809
 ] 

Rakesh Raushan edited comment on SPARK-30288 at 12/18/19 4:49 AM:
--

[~dongjoon]  [~hyukjin.kwon] I have checked locally after making the required changes. Column names with a space or "=" are working fine for now. Pandas also supports this, so should we allow it as well?

scala> Seq(100).toDF("a b").write.parquet("/tmp/dir")

scala> spark.read.parquet("/tmp/dir").show()
+---+
|a b|
+---+
|100|
+---+

scala> Seq(100).toDF("a=b").write.parquet("/tmp/dir2")

scala> spark.read.parquet("/tmp/dir2").show()
+---+
|a=b|
+---+
|100|
+---+


was (Author: rakson):
[~dongjoon]  [~hyukjin.kwon] I have checked locally after making required changes. Column names with space , "=" are working fine for now. Also pandas support this. So should we also allow this?

scala> Seq(100).toDF("a b").write.parquet("/tmp/dir")

scala> spark.read.parquet("/tmp/dir").show()
+---+
|a b|
+---+
|100|
+---+

scala> Seq(1).toDF("a=b").write.parquet("/tmp/dir2")

scala> spark.read.parquet("/tmp/foo").show()
+---+
|a=b|
+---+
|100|
+---+

> Failed to write valid Parquet files when column names contains special 
> characters like spaces
> -
>
> Key: SPARK-30288
> URL: https://issues.apache.org/jira/browse/SPARK-30288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jingyuan Wang
>Priority: Major
>
> When I tried to write Parquet files using PySpark with columns containing 
> some special characters in their names, it threw the following exception:
> {code}
> org.apache.spark.sql.AnalysisException: Attribute name "col 1" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:583)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:570)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1S

[jira] [Comment Edited] (SPARK-30288) Failed to write valid Parquet files when column names contains special characters like spaces

2019-12-17 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998809#comment-16998809
 ] 

Rakesh Raushan edited comment on SPARK-30288 at 12/18/19 4:48 AM:
--

[~dongjoon]  [~hyukjin.kwon] I have checked locally after making the required 
changes. Column names with spaces and "=" are working fine for now. pandas 
supports this as well. So should we allow this too?

scala> Seq(100).toDF("a b").write.parquet("/tmp/dir")

scala> spark.read.parquet("/tmp/dir").show()
+---+
|a b|
+---+
|100|
+---+

scala> Seq(1).toDF("a=b").write.parquet("/tmp/dir2")

scala> spark.read.parquet("/tmp/foo").show()
+---+
|a=b|
+---+
|100|
+---+


was (Author: rakson):
[~dongjoon] I have checked locally after making the required changes. Column 
names with spaces and "=" are working fine for now. pandas supports this as 
well. So should we allow this too?
scala> Seq(100).toDF("a b").write.parquet("/tmp/dir")

scala> spark.read.parquet("/tmp/dir").show()
+---+
|a b|
+---+
|100|
+---+
scala> Seq(1).toDF("a=b").write.parquet("/tmp/dir2")

scala> spark.read.parquet("/tmp/foo").show()
+---+
|a=b|
+---+
|100|
+---+

> Failed to write valid Parquet files when column names contains special 
> characters like spaces
> -
>
> Key: SPARK-30288
> URL: https://issues.apache.org/jira/browse/SPARK-30288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jingyuan Wang
>Priority: Major
>
> When I tried to write Parquet files using PySpark with columns containing 
> some special characters in their names, it threw the following exception:
> {code}
> org.apache.spark.sql.AnalysisException: Attribute name "col 1" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:583)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:570)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:

[jira] [Commented] (SPARK-30288) Failed to write valid Parquet files when column names contains special characters like spaces

2019-12-17 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998809#comment-16998809
 ] 

Rakesh Raushan commented on SPARK-30288:


[~dongjoon] I have checked locally after making the required changes. Column 
names with spaces and "=" are working fine for now. pandas supports this as 
well. So should we allow this too?
scala> Seq(100).toDF("a b").write.parquet("/tmp/dir")

scala> spark.read.parquet("/tmp/dir").show()
+---+
|a b|
+---+
|100|
+---+
scala> Seq(1).toDF("a=b").write.parquet("/tmp/dir2")

scala> spark.read.parquet("/tmp/foo").show()
+---+
|a=b|
+---+
|100|
+---+

> Failed to write valid Parquet files when column names contains special 
> characters like spaces
> -
>
> Key: SPARK-30288
> URL: https://issues.apache.org/jira/browse/SPARK-30288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jingyuan Wang
>Priority: Major
>
> When I tried to write Parquet files using PySpark with columns containing 
> some special characters in their names, it threw the following exception:
> {code}
> org.apache.spark.sql.AnalysisException: Attribute name "col 1" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:583)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:570)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Del

[jira] [Commented] (SPARK-30150) Manage resources (ADD/LIST) does not support quoted path

2019-12-16 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997314#comment-16997314
 ] 

Rakesh Raushan commented on SPARK-30150:


Thanks!!

> Manage resources (ADD/LIST) does not support quoted path
> 
>
> Key: SPARK-30150
> URL: https://issues.apache.org/jira/browse/SPARK-30150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.0.0
>
>
> Manage resources (ADD/LIST) does not support quoted path.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30249) Invalid Column Names in parquet tables should not be allowed

2019-12-12 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-30249:
---
Description: 
Column names such as `a:b`, `??`, `,,`, `^^`, `++` etc. are allowed when 
creating parquet tables.

When creating tables with `orc`, however, all such column names are marked as 
invalid and an AnalysisException is thrown.

These column names should not be allowed for parquet tables either.

This also introduces an inconsistency in column names between Parquet and ORC.

  was:
Column names such as `a:b`, `??`, `,,`, `^^`, `++` etc. are allowed when 
creating parquet tables.

When creating tables with `orc`, however, all such column names are marked as 
invalid and an AnalysisException is thrown.

These column names should not be allowed for parquet tables either.


> Invalid Column Names in parquet tables should not be allowed
> 
>
> Key: SPARK-30249
> URL: https://issues.apache.org/jira/browse/SPARK-30249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Column names such as `a:b`, `??`, `,,`, `^^`, `++` etc. are allowed when 
> creating parquet tables.
> When creating tables with `orc`, however, all such column names are marked as 
> invalid and an AnalysisException is thrown.
> These column names should not be allowed for parquet tables either.
> This also introduces an inconsistency in column names between Parquet and ORC.
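
For illustration, a minimal spark-shell sketch of the asymmetry described in 
this ticket (table names are hypothetical, and both outcomes are as reported 
above rather than verified here):

{code:scala}
// Parquet: a column name such as "a:b" is reported to be accepted today
Seq(1).toDF("a:b").write.format("parquet").saveAsTable("t_parquet")

// ORC: the same column name is reported to fail with an AnalysisException
Seq(1).toDF("a:b").write.format("orc").saveAsTable("t_orc")
{code}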



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30249) Wrong Column Names in parquet tables should not be allowed

2019-12-12 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995411#comment-16995411
 ] 

Rakesh Raushan commented on SPARK-30249:


cc [~dongjoon].

> Wrong Column Names in parquet tables should not be allowed
> --
>
> Key: SPARK-30249
> URL: https://issues.apache.org/jira/browse/SPARK-30249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Column names such as `a:b`, `??`, `,,`, `^^`, `++` etc. are allowed when 
> creating parquet tables.
> When creating tables with `orc`, however, all such column names are marked as 
> invalid and an AnalysisException is thrown.
> These column names should not be allowed for parquet tables either.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30249) Invalid Column Names in parquet tables should not be allowed

2019-12-12 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-30249:
---
Summary: Invalid Column Names in parquet tables should not be allowed  
(was: Wrong Column Names in parquet tables should not be allowed)

> Invalid Column Names in parquet tables should not be allowed
> 
>
> Key: SPARK-30249
> URL: https://issues.apache.org/jira/browse/SPARK-30249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Column names such as `a:b`, `??`, `,,`, `^^`, `++` etc. are allowed when 
> creating parquet tables.
> When creating tables with `orc`, however, all such column names are marked as 
> invalid and an AnalysisException is thrown.
> These column names should not be allowed for parquet tables either.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30249) Wrong Column Names in parquet tables should not be allowed

2019-12-12 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30249:
--

 Summary: Wrong Column Names in parquet tables should not be allowed
 Key: SPARK-30249
 URL: https://issues.apache.org/jira/browse/SPARK-30249
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


Column names such as `a:b`, `??`, `,,`, `^^`, `++` etc. are allowed when 
creating parquet tables.

When creating tables with `orc`, however, all such column names are marked as 
invalid and an AnalysisException is thrown.

These column names should not be allowed for parquet tables either.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30150) Manage resources (ADD/LIST) does not support quoted path

2019-12-12 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994544#comment-16994544
 ] 

Rakesh Raushan commented on SPARK-30150:


Can you assign this to me, [~cloud_fan]?

> Manage resources (ADD/LIST) does not support quoted path
> 
>
> Key: SPARK-30150
> URL: https://issues.apache.org/jira/browse/SPARK-30150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: jobit mathew
>Priority: Minor
> Fix For: 3.0.0
>
>
> Manage resources (ADD/LIST) does not support quoted path.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30234) ADD FILE can not add folder from Spark-sql

2019-12-12 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994457#comment-16994457
 ] 

Rakesh Raushan commented on SPARK-30234:


I will raise a PR for this soon.

> ADD FILE can not add folder from Spark-sql
> --
>
> Key: SPARK-30234
> URL: https://issues.apache.org/jira/browse/SPARK-30234
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Rakesh Raushan
>Priority: Minor
>
> We cannot add directories using the spark-sql CLI.
> In SPARK-4687, support was added for directories as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30234) ADD FILE can not add folder from Spark-sql

2019-12-12 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30234:
--

 Summary: ADD FILE can not add folder from Spark-sql
 Key: SPARK-30234
 URL: https://issues.apache.org/jira/browse/SPARK-30234
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Rakesh Raushan


We cannot add directories using the spark-sql CLI.
In SPARK-4687, support was added for directories as well.
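
For context, a minimal sketch of the command shape this ticket is about (paths 
are hypothetical; the directory case is the one reported to fail from the 
spark-sql CLI):

{code:scala}
// Adding a single file works:
spark.sql("ADD FILE /tmp/data/one.txt")

// This ticket asks for the directory form to work from the spark-sql CLI as
// well, mirroring the directory support added in SPARK-4687:
spark.sql("ADD FILE /tmp/data")
{code}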



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30139) get_json_object does not work correctly

2019-12-12 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994447#comment-16994447
 ] 

Rakesh Raushan commented on SPARK-30139:


Was busy with some other work. I will start working on this now.

> get_json_object does not work correctly
> ---
>
> Key: SPARK-30139
> URL: https://issues.apache.org/jira/browse/SPARK-30139
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Clemens Valiente
>Priority: Major
>
> according to documentation:
> [https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/functions.html#get_json_object-org.apache.spark.sql.Column-java.lang.String-]
> get_json_object "Extracts json object from a json string based on json path 
> specified, and returns json string of the extracted json object. It will 
> return null if the input json string is invalid."
>  
> the following SQL snippet returns null even though it should return 'a'
> {code}
> select get_json_object([{"id":123,"value":"a"},{"id":456,"value":"b"}], 
> $[?($.id==123)].value){code}
>  
>  
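
For comparison, a minimal sketch (not from the ticket) of a plain field lookup 
that get_json_object does handle, assuming a running SparkSession named 
`spark`; the filter-style path quoted above is the case reported to return null:

{code:scala}
// Returns the string "a"
spark.sql("""SELECT get_json_object('{"id":123,"value":"a"}', '$.value')""").show()
{code}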



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-30176) Eliminate warnings: part 6

2019-12-09 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-30176:
---
Comment: was deleted

(was: i will work on this.)

> Eliminate warnings: part 6
> --
>
> Key: SPARK-30176
> URL: https://issues.apache.org/jira/browse/SPARK-30176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
>   
> sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
> {code:java}
>  Warning:Warning:line (32)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (91)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (100)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (109)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (118)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala
> {code:java}
> Warning:Warning:line (242)object typed in package scalalang is deprecated 
> (since 3.0.0): please use untyped builtin aggregate functions.
>   df.as[Data].select(typed.sumLong((d: Data) => 
> d.l)).queryExecution.toRdd.foreach(_ => ())
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
> {code:java}
> Warning:Warning:line (714)method from_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(from_utc_timestamp(col("a"), "PST")),
> Warning:Warning:line (719)method from_utc_timestamp in object functions 
> is deprecated (since 3.0.0): This function is deprecated and will be removed 
> in future versions.
> df.select(from_utc_timestamp(col("b"), "PST")),
> Warning:Warning:line (725)method from_utc_timestamp in object functions 
> is deprecated (since 3.0.0): This function is deprecated and will be removed 
> in future versions.
>   df.select(from_utc_timestamp(col("a"), "PST")).collect()
> Warning:Warning:line (737)method from_utc_timestamp in object functions 
> is deprecated (since 3.0.0): This function is deprecated and will be removed 
> in future versions.
> df.select(from_utc_timestamp(col("a"), col("c"))),
> Warning:Warning:line (742)method from_utc_timestamp in object functions 
> is deprecated (since 3.0.0): This function is deprecated and will be removed 
> in future versions.
> df.select(from_utc_timestamp(col("b"), col("c"))),
> Warning:Warning:line (756)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(to_utc_timestamp(col("a"), "PST")),
> Warning:Warning:line (761)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(to_utc_timestamp(col("b"), "PST")),
> Warning:Warning:line (767)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
>   df.select(to_utc_timestamp(col("a"), "PST")).collect()
> Warning:Warning:line (779)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(to_utc_timestamp(col("a"), col("c"))),
> Warning:Warning:line (784)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(to_utc_timestamp(col("b"), col("c"))),
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
> {code:java}
> Warning:Warning:line (241)method merge in object Row is deprecated (since 
> 3.0.0): This method is deprecated and will be removed in future versions.
>   testData.rdd.flatMap(row => Seq.fill(16)(Row.merge(row, 
> row))).collect().toSeq)
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
> {code:java}
>  Warning:Warning:line (787)method merge in object Row is deprecated (since 
> 3.0.0): This method is deprecated and will be removed in future versions.
> row => Seq.fill(16)(Row.merge(row, row))).collect().toSeq)
> {code}
>   
> s
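
As one illustration of the kind of change these deprecation warnings call for 
(a sketch under stated assumptions, not the actual patch; `Data` stands in for 
the case class used in the benchmark):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

case class Data(l: Long, s: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(Data(1L, "a"), Data(2L, "b")).toDF()

// Deprecated typed aggregate flagged by the warning above:
//   df.as[Data].select(typed.sumLong((d: Data) => d.l))

// Untyped builtin aggregate suggested by the deprecation message:
df.as[Data].select(sum($"l").as[Long]).show()
{code}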

[jira] [Commented] (SPARK-30176) Eliminate warnings: part 6

2019-12-08 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991151#comment-16991151
 ] 

Rakesh Raushan commented on SPARK-30176:


I will work on this.

> Eliminate warnings: part 6
> --
>
> Key: SPARK-30176
> URL: https://issues.apache.org/jira/browse/SPARK-30176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
>   
> sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
> {code:java}
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala
> {code:java}
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
> {code:java}
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
> {code:java}
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
> {code:java}
> {code}
>   
> sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala
> {code:java}
> {code}
>   
> sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
> {code:java}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30139) get_json_object does not work correctly

2019-12-05 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988693#comment-16988693
 ] 

Rakesh Raushan commented on SPARK-30139:


I will look into this issue.

> get_json_object does not work correctly
> ---
>
> Key: SPARK-30139
> URL: https://issues.apache.org/jira/browse/SPARK-30139
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Clemens Valiente
>Priority: Major
>
> according to documentation:
> [https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/functions.html#get_json_object-org.apache.spark.sql.Column-java.lang.String-]
> get_json_object "Extracts json object from a json string based on json path 
> specified, and returns json string of the extracted json object. It will 
> return null if the input json string is invalid."
>  
> the following SQL snippet returns null even though it should return 'a'
> {code}
> select get_json_object([{"id":123,"value":"a"},{"id":456,"value":"b"}], 
> $[?($.id==123)].value){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


