date:20200821

[jira] [Created] (SPARK-32678) Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code

2020-08-21 Thread Leanken.Lin (Jira)

Leanken.Lin created SPARK-32678:
---

 Summary: Rename EmptyHashedRelationWithAllNullKeys and simplify 
NAAJ generated code
 Key: SPARK-32678
 URL: https://issues.apache.org/jira/browse/SPARK-32678
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


EmptyHashedRelationWithAllNullKeys is a bit of confusing for its naming, and 
this minor change also simplify the generated code for BHJ NAAj.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32678) Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181677#comment-17181677
 ] 

Apache Spark commented on SPARK-32678:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29503

> Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code
> --
>
> Key: SPARK-32678
> URL: https://issues.apache.org/jira/browse/SPARK-32678
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> EmptyHashedRelationWithAllNullKeys is a bit of confusing for its naming, and 
> this minor change also simplify the generated code for BHJ NAAj.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32678) Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32678:


Assignee: Apache Spark

> Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code
> --
>
> Key: SPARK-32678
> URL: https://issues.apache.org/jira/browse/SPARK-32678
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Apache Spark
>Priority: Minor
>
> EmptyHashedRelationWithAllNullKeys is a bit of confusing for its naming, and 
> this minor change also simplify the generated code for BHJ NAAj.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32678) Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32678:


Assignee: (was: Apache Spark)

> Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code
> --
>
> Key: SPARK-32678
> URL: https://issues.apache.org/jira/browse/SPARK-32678
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> EmptyHashedRelationWithAllNullKeys is a bit of confusing for its naming, and 
> this minor change also simplify the generated code for BHJ NAAj.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32667) Scrip transformation no-serde mode when column less then output length , Use null fill

2020-08-21 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32667:
---

Assignee: angerszhu

> Scrip transformation no-serde mode when column less then output length ,  Use 
> null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Scrip transform no-serde mode should pad null value to filling column
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1 2   NULLNULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-32667) Scrip transformation no-serde mode when column less then output length , Use null fill

2020-08-21 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32667.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29500
[https://github.com/apache/spark/pull/29500]

> Scrip transformation no-serde mode when column less then output length ,  Use 
> null fill
> ---
>
> Key: SPARK-32667
> URL: https://issues.apache.org/jira/browse/SPARK-32667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> Scrip transform no-serde mode should pad null value to filling column
> {code:java}
> hive> SELECT TRANSFORM(a, b)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > USING 'cat' as (a string, b string, c string, d string)
> >   ROW FORMAT DELIMITED
> >   FIELDS TERMINATED BY '|'
> >   LINES TERMINATED BY '\n'
> >   NULL DEFINED AS 'NULL'
> > FROM (
> > select 1 as a, 2 as b
> > ) tmp ;
> OK
> 1 2   NULLNULL
> Time taken: 24.626 seconds, Fetched: 1 row(s)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-32679) update "no-serde" in the codebase in other TRANSFORM PRs.

2020-08-21 Thread angerszhu (Jira)

angerszhu created SPARK-32679:
-

 Summary:  update "no-serde" in the codebase in other TRANSFORM PRs.
 Key: SPARK-32679
 URL: https://issues.apache.org/jira/browse/SPARK-32679
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: angerszhu


https://github.com/apache/spark/pull/29500#discussion_r474476579



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-21 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32674:


Assignee: Chao Sun

> Add suggestion for parallel directory listing in tuning doc
> ---
>
> Key: SPARK-32674
> URL: https://issues.apache.org/jira/browse/SPARK-32674
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>
> Sometimes directory listing could become a bottleneck when user jobs have 
> large number of input directories. This is especially true when against 
> object store like S3. 
> There are a few parameters to tune this. This proposes to add some info in 
> the tuning guide so that the knowledge can be better shared. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-32674) Add suggestion for parallel directory listing in tuning doc

2020-08-21 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32674.
--
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29498
[https://github.com/apache/spark/pull/29498]

> Add suggestion for parallel directory listing in tuning doc
> ---
>
> Key: SPARK-32674
> URL: https://issues.apache.org/jira/browse/SPARK-32674
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> Sometimes directory listing could become a bottleneck when user jobs have 
> large number of input directories. This is especially true when against 
> object store like S3. 
> There are a few parameters to tune this. This proposes to add some info in 
> the tuning guide so that the knowledge can be better shared. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-32646) ORC predicate pushdown should work with case-insensitive analysis

2020-08-21 Thread Wenchen Fan (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32646.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29457
[https://github.com/apache/spark/pull/29457]

> ORC predicate pushdown should work with case-insensitive analysis
> -
>
> Key: SPARK-32646
> URL: https://issues.apache.org/jira/browse/SPARK-32646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently ORC predicate pushdown doesn't work with case-insensitive analysis, 
> see SPARK-32622 for the test case.
> We should make ORC predicate pushdown work with case-insensitive analysis too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-32680) CTAS with V2 catalog wrongly accessed unresolved query

2020-08-21 Thread Linhong Liu (Jira)

Linhong Liu created SPARK-32680:
---

 Summary: CTAS with V2 catalog wrongly accessed unresolved query
 Key: SPARK-32680
 URL: https://issues.apache.org/jira/browse/SPARK-32680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Linhong Liu


Case:

{{CREATE TABLE t USING delta AS SELECT * from nonexist }}

 

Expected:

throw AnalysisException with "Table or view not found"

 

Actual:

{{throw UnresolvedException with 
"org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
toAttribute on unresolved object, tree: *"}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32648) Remove unused DELETE_ACTION in FileStreamSinkLog

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181735#comment-17181735
 ] 

Apache Spark commented on SPARK-32648:
--

User 'michal-wieleba' has created a pull request for this issue:
https://github.com/apache/spark/pull/29505

> Remove unused DELETE_ACTION in FileStreamSinkLog
> 
>
> Key: SPARK-32648
> URL: https://issues.apache.org/jira/browse/SPARK-32648
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> DELETE_ACTION in FileStreamSinkLog has been never used from the introduction. 
> While there may be the possible usage on the action, given it hasn't been 
> used for years, it makes more sense to remove the action and restore back 
> with introducing actual usage. This is more realistic option as there has 
> been not much efforts on the area, so it might be unlikely addressed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32648) Remove unused DELETE_ACTION in FileStreamSinkLog

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32648:


Assignee: Apache Spark

> Remove unused DELETE_ACTION in FileStreamSinkLog
> 
>
> Key: SPARK-32648
> URL: https://issues.apache.org/jira/browse/SPARK-32648
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> DELETE_ACTION in FileStreamSinkLog has been never used from the introduction. 
> While there may be the possible usage on the action, given it hasn't been 
> used for years, it makes more sense to remove the action and restore back 
> with introducing actual usage. This is more realistic option as there has 
> been not much efforts on the area, so it might be unlikely addressed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32648) Remove unused DELETE_ACTION in FileStreamSinkLog

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32648:


Assignee: (was: Apache Spark)

> Remove unused DELETE_ACTION in FileStreamSinkLog
> 
>
> Key: SPARK-32648
> URL: https://issues.apache.org/jira/browse/SPARK-32648
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> DELETE_ACTION in FileStreamSinkLog has been never used from the introduction. 
> While there may be the possible usage on the action, given it hasn't been 
> used for years, it makes more sense to remove the action and restore back 
> with introducing actual usage. This is more realistic option as there has 
> been not much efforts on the area, so it might be unlikely addressed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-32681) PySpark type hints support

2020-08-21 Thread Hyukjin Kwon (Jira)

Hyukjin Kwon created SPARK-32681:


 Summary: PySpark type hints support
 Key: SPARK-32681
 URL: https://issues.apache.org/jira/browse/SPARK-32681
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


We're discussing about porting https://github.com/zero323/pyspark-stubs in the 
dev mailing list.

This is being discussed in the dev mailing list: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32681) PySpark type hints support

2020-08-21 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32681:
-
Parent: (was: SPARK-32082)
Issue Type: Improvement  (was: Sub-task)

> PySpark type hints support
> --
>
> Key: SPARK-32681
> URL: https://issues.apache.org/jira/browse/SPARK-32681
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Critical
>
> We're discussing about porting https://github.com/zero323/pyspark-stubs in 
> the dev mailing list.
> This is being discussed in the dev mailing list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-PySpark-Revisiting-PySpark-type-annotations-td26232.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29471) "TaskResultLost (result lost from block manager)" error message is misleading in case result fetch is caused by client-side network connectivity issues

2020-08-21 Thread Chang chen (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181736#comment-17181736
 ] 

Chang chen commented on SPARK-29471:


I met the same issue, and eventually find the reason is caused by unknown host.

> "TaskResultLost (result lost from block manager)" error message is misleading 
> in case result fetch is caused by client-side network connectivity issues
> ---
>
> Key: SPARK-29471
> URL: https://issues.apache.org/jira/browse/SPARK-29471
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> I recently encountered a problem where jobs non-deterministically failed with
> {code:java}
> TaskResultLost (result lost from block manager) {code}
> exceptions.
> It turned out that this was due to some sort of networking issue where the 
> Spark driver was unable to initiate outgoing connections to executors' block 
> managers in order to fetch indirect task results.
> In this situation, the error message was slightly misleading: the "result 
> lost from block manager" makes it sound like we received an error / 
> block-not-found response from the remote host, whereas in my case the problem 
> was actually a network connectivity issue where we weren't even able to 
> connect in the first place.
> If it's easy to do so, it might be nice to refine the error-handling / 
> logging code so that we distinguish between the receipt of an error response 
> vs. a lower-level networking / connectivity issue. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-32682) Use workflow_dispatch to enable manual test triggers

2020-08-21 Thread Takeshi Yamamuro (Jira)

Takeshi Yamamuro created SPARK-32682:


 Summary: Use workflow_dispatch to enable manual test triggers
 Key: SPARK-32682
 URL: https://issues.apache.org/jira/browse/SPARK-32682
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Takeshi Yamamuro


This ticket targets at adding a workflow_dispatch entry in the GitHub Action 
script (build_and_test.yml). This update can enable developers to run the Spark 
tests for a specific branch on their own local repository, so I think it might 
help to check if al the tests can pass before opening a new PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32682) Use workflow_dispatch to enable manual test triggers

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32682:


Assignee: Apache Spark

> Use workflow_dispatch to enable manual test triggers
> 
>
> Key: SPARK-32682
> URL: https://issues.apache.org/jira/browse/SPARK-32682
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Major
>
> This ticket targets at adding a workflow_dispatch entry in the GitHub Action 
> script (build_and_test.yml). This update can enable developers to run the 
> Spark tests for a specific branch on their own local repository, so I think 
> it might help to check if al the tests can pass before opening a new PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32682) Use workflow_dispatch to enable manual test triggers

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181767#comment-17181767
 ] 

Apache Spark commented on SPARK-32682:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29504

> Use workflow_dispatch to enable manual test triggers
> 
>
> Key: SPARK-32682
> URL: https://issues.apache.org/jira/browse/SPARK-32682
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket targets at adding a workflow_dispatch entry in the GitHub Action 
> script (build_and_test.yml). This update can enable developers to run the 
> Spark tests for a specific branch on their own local repository, so I think 
> it might help to check if al the tests can pass before opening a new PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32682) Use workflow_dispatch to enable manual test triggers

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32682:


Assignee: (was: Apache Spark)

> Use workflow_dispatch to enable manual test triggers
> 
>
> Key: SPARK-32682
> URL: https://issues.apache.org/jira/browse/SPARK-32682
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket targets at adding a workflow_dispatch entry in the GitHub Action 
> script (build_and_test.yml). This update can enable developers to run the 
> Spark tests for a specific branch on their own local repository, so I think 
> it might help to check if al the tests can pass before opening a new PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32682) Use workflow_dispatch to enable manual test triggers

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181765#comment-17181765
 ] 

Apache Spark commented on SPARK-32682:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29504

> Use workflow_dispatch to enable manual test triggers
> 
>
> Key: SPARK-32682
> URL: https://issues.apache.org/jira/browse/SPARK-32682
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket targets at adding a workflow_dispatch entry in the GitHub Action 
> script (build_and_test.yml). This update can enable developers to run the 
> Spark tests for a specific branch on their own local repository, so I think 
> it might help to check if al the tests can pass before opening a new PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-08-21 Thread Vinod KC (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-32635:
-
Description: 
When pyspark.sql.functions.lit() function is used with dataframe cache, it 
returns wrong result

eg:lit() function with cache() function.
 ---
{code:java}
from pyspark.sql import Row
from pyspark.sql import functions as F

df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
'b'}]).withColumn("col2", F.lit(str(2)))
df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
8}]).withColumn("col2", F.lit(str(1)))
df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
9}]).withColumn("col2", F.lit(str(2)))
df_23 = df_2.union(df_3)
df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
9}]).withColumn("col2", F.lit(str(2)))

sel_col3 = df_23.select('col3', 'col2')
df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
how='left').filter(F.col('col3') == 9)
finaldf.show()
finaldf.select('col2').show() #Wrong result
{code}
 

Output
 ---
{code:java}
>>> finaldf.show()
++++
|col2|col3|col1|
++++
| 2| 9| b|
++++
>>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
++
|col2|
++
| 1|
++
++{code}
 lit() function without cache() function.
{code:java}
from pyspark.sql import Row
from pyspark.sql import functions as F

df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
'b'}]).withColumn("col2", F.lit(str(2)))
df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
8}]).withColumn("col2", F.lit(str(1)))
df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
9}]).withColumn("col2", F.lit(str(2)))
df_23 = df_2.union(df_3)
df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
9}]).withColumn("col2", F.lit(str(2)))

sel_col3 = df_23.select('col3', 'col2')
df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
how='left').filter(F.col('col3') == 9)
finaldf.show() 
finaldf.select('col2').show() #Correct result
{code}
 

Output
{code:java}
--
>>> finaldf.show()
++++
|col2|col3|col1|
++++
| 2| 9| b|
++++
>>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
++
|col2|
++
| 2|
++
{code}
 

  was:
When pyspark.sql.functions.lit() function is used with dataframe cache, it 
returns wrong result

eg:lit() function with cache() function.
 ---
{code:java}
from pyspark.sql import Row
from pyspark.sql import functions as F

df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
'b'}]).withColumn("col2", F.lit(str(2)))
df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
8}]).withColumn("col2", F.lit(str(1)))
df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
9}]).withColumn("col2", F.lit(str(2)))
df_23 = df_2.union(df_3)
df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
9}]).withColumn("col2", F.lit(str(2)))

sel_col3 = df_23.select('col3', 'col2')
df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
how='left').filter(F.col('col3') == 9)
finaldf.show(
finaldf.select('col2').show() #Wrong result
{code}
 

Output
 ---
{code:java}
>>> finaldf.show()
++++
|col2|col3|col1|
++++
| 2| 9| b|
++++
>>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
++
|col2|
++
| 1|
++
++{code}
 lit() function without cache() function.
{code:java}
from pyspark.sql import Row
from pyspark.sql import functions as F

df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
'b'}]).withColumn("col2", F.lit(str(2)))
df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
8}]).withColumn("col2", F.lit(str(1)))
df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
9}]).withColumn("col2", F.lit(str(2)))
df_23 = df_2.union(df_3)
df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
9}]).withColumn("col2", F.lit(str(2)))

sel_col3 = df_23.select('col3', 'col2')
df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
how='left').filter(F.col('col3') == 9)
finaldf.show() 
finaldf.select('col2').show() #Correct result
{code}
 

Output
{code:java}
--
>>> finaldf.show()
++++
|col2|col3|col1|
++++
| 2| 9| b|
++++
>>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached

[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result

2020-08-21 Thread Vinod KC (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181800#comment-17181800
 ] 

Vinod KC commented on SPARK-32635:
--

[~sascha.baumanns], updated code section. Thanks

> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> ---
>
> Key: SPARK-32635
> URL: https://issues.apache.org/jira/browse/SPARK-32635
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Vinod KC
>Priority: Major
>
> When pyspark.sql.functions.lit() function is used with dataframe cache, it 
> returns wrong result
> eg:lit() function with cache() function.
>  ---
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() 
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show()
> finaldf.select('col2').show() #Wrong result
> {code}
>  
> Output
>  ---
> {code:java}
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1
> ++
> |col2|
> ++
> | 1|
> ++
> ++{code}
>  lit() function without cache() function.
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql import functions as F
> df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': 
> 'b'}]).withColumn("col2", F.lit(str(2)))
> df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': 
> 8}]).withColumn("col2", F.lit(str(1)))
> df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> df_23 = df_2.union(df_3)
> df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': 
> 9}]).withColumn("col2", F.lit(str(2)))
> sel_col3 = df_23.select('col3', 'col2')
> df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner")
> df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner")
> finaldf = df_23_a.join(df_4, on=['col2', 'col3'], 
> how='left').filter(F.col('col3') == 9)
> finaldf.show() 
> finaldf.select('col2').show() #Correct result
> {code}
>  
> Output
> {code:java}
> --
> >>> finaldf.show()
> ++++
> |col2|col3|col1|
> ++++
> | 2| 9| b|
> ++++
> >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached 
> ++
> |col2|
> ++
> | 2|
> ++
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29654) Add configuration to allow disabling registration of static sources to the metrics system

2020-08-21 Thread Pavol Knapek (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-29654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181827#comment-17181827
 ] 

Pavol Knapek commented on SPARK-29654:
--

Is there any chance to have this backported to a minor version of Spark 2.x.x 
(2.4.7) ?

> Add configuration to allow disabling registration of static sources to the 
> metrics system
> -
>
> Key: SPARK-29654
> URL: https://issues.apache.org/jira/browse/SPARK-29654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.0.0
>
>
> The Spark metrics system produces many different metrics and not all of them 
> are used at the same time. This proposes to introduce a configuration 
> parameter to allow disabling the registration of metrics in the "static 
> sources" category, in other to reduce the load and clutter on the sink, in 
> the cases when the metrics in question are not needed. The metrics registerd 
> as "static sources" are under the namespaces CodeGenerator and 
> HiveExternalCatalog and can produce a significant amount of data, as they are 
> registered for the driver and executors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-32682) Use workflow_dispatch to enable manual test triggers

2020-08-21 Thread Takeshi Yamamuro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32682.
--
Fix Version/s: 3.1.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by 
[https://github.com/apache/spark/pull/29504|https://github.com/apache/spark/pull/29504/files]

> Use workflow_dispatch to enable manual test triggers
> 
>
> Key: SPARK-32682
> URL: https://issues.apache.org/jira/browse/SPARK-32682
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> This ticket targets at adding a workflow_dispatch entry in the GitHub Action 
> script (build_and_test.yml). This update can enable developers to run the 
> Spark tests for a specific branch on their own local repository, so I think 
> it might help to check if al the tests can pass before opening a new PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-32683) Datetime Pattern F works wrongly.

2020-08-21 Thread Daeho Ro (Jira)

Daeho Ro created SPARK-32683:


 Summary: Datetime Pattern F works wrongly.
 Key: SPARK-32683
 URL: https://issues.apache.org/jira/browse/SPARK-32683
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
 Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
spark 3.0.0 and python 3.8.5.


REPOSITORY : jupyter/all-spark-notebook
TAG:  f1811928b3dd 
Reporter: Daeho Ro


>From the docs, the pattern F should give a week of the month.
|*Symbol*|*Meaning*|*Presentation*|*Example*|
|F|week-of-month|number(1)|3|

I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
{code:java}
from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
'-MM-dd')) \
  .withColumn('month', month('date')) \
  .withColumn('week', date_format('date', 'F')) \
  .show(10, False)

+---+-++
|date   |month|week|
+---+-++
|2020-08-01 00:00:00|8|1   |
|2020-08-02 00:00:00|8|2   |
|2020-08-03 00:00:00|8|3   |
|2020-08-04 00:00:00|8|4   |
|2020-08-05 00:00:00|8|5   |
|2020-08-06 00:00:00|8|6   |
|2020-08-07 00:00:00|8|7   |
|2020-08-08 00:00:00|8|1   |
|2020-08-09 00:00:00|8|2   |
|2020-08-10 00:00:00|8|3   |
+---+-++ {code}
The `week` column is not the week of the month. It is a day of the week as a 
number.

!image-2020-08-21-21-31-32-297.png!

>From my calendar, the first day of August should have 1 for the week-of-month 
>and from 2nd to 8th should have 2 and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Summary: Datetime Pattern F not working as expected  (was: Datetime Pattern 
F not working expected)

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
> spark 3.0.0 and python 3.8.5.
> REPOSITORY : jupyter/all-spark-notebook
> TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
>
> From the docs, the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
> '-MM-dd')) \
>   .withColumn('month', month('date')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---+-++
> |date   |month|week|
> +---+-++
> |2020-08-01 00:00:00|8|1   |
> |2020-08-02 00:00:00|8|2   |
> |2020-08-03 00:00:00|8|3   |
> |2020-08-04 00:00:00|8|4   |
> |2020-08-05 00:00:00|8|5   |
> |2020-08-06 00:00:00|8|6   |
> |2020-08-07 00:00:00|8|7   |
> |2020-08-08 00:00:00|8|1   |
> |2020-08-09 00:00:00|8|2   |
> |2020-08-10 00:00:00|8|3   |
> +---+-++ {code}
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
> !image-2020-08-21-21-31-32-297.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Summary: Datetime Pattern F not working expected  (was: Datetime Pattern F 
works wrongly.)

> Datetime Pattern F not working expected
> ---
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
> spark 3.0.0 and python 3.8.5.
> REPOSITORY : jupyter/all-spark-notebook
> TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
>
> From the docs, the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
> '-MM-dd')) \
>   .withColumn('month', month('date')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---+-++
> |date   |month|week|
> +---+-++
> |2020-08-01 00:00:00|8|1   |
> |2020-08-02 00:00:00|8|2   |
> |2020-08-03 00:00:00|8|3   |
> |2020-08-04 00:00:00|8|4   |
> |2020-08-05 00:00:00|8|5   |
> |2020-08-06 00:00:00|8|6   |
> |2020-08-07 00:00:00|8|7   |
> |2020-08-08 00:00:00|8|1   |
> |2020-08-09 00:00:00|8|2   |
> |2020-08-10 00:00:00|8|3   |
> +---+-++ {code}
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
> !image-2020-08-21-21-31-32-297.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Description: 
>From the docs, the pattern F should give a week of the month.
|*Symbol*|*Meaning*|*Presentation*|*Example*|
|F|week-of-month|number(1)|3|

I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
{code:java}
from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
'-MM-dd')) \
  .withColumn('month', month('date')) \
  .withColumn('week', date_format('date', 'F')) \
  .show(10, False)

+---+-++
|date   |month|week|
+---+-++
|2020-08-01 00:00:00|8|1   |
|2020-08-02 00:00:00|8|2   |
|2020-08-03 00:00:00|8|3   |
|2020-08-04 00:00:00|8|4   |
|2020-08-05 00:00:00|8|5   |
|2020-08-06 00:00:00|8|6   |
|2020-08-07 00:00:00|8|7   |
|2020-08-08 00:00:00|8|1   |
|2020-08-09 00:00:00|8|2   |
|2020-08-10 00:00:00|8|3   |
+---+-++ {code}
The `week` column is not the week of the month. It is a day of the week as a 
number.

 

>From my calendar, the first day of August should have 1 for the week-of-month 
>and from 2nd to 8th should have 2 and so on.

  was:
>From the docs, the pattern F should give a week of the month.
|*Symbol*|*Meaning*|*Presentation*|*Example*|
|F|week-of-month|number(1)|3|

I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
{code:java}
from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
'-MM-dd')) \
  .withColumn('month', month('date')) \
  .withColumn('week', date_format('date', 'F')) \
  .show(10, False)

+---+-++
|date   |month|week|
+---+-++
|2020-08-01 00:00:00|8|1   |
|2020-08-02 00:00:00|8|2   |
|2020-08-03 00:00:00|8|3   |
|2020-08-04 00:00:00|8|4   |
|2020-08-05 00:00:00|8|5   |
|2020-08-06 00:00:00|8|6   |
|2020-08-07 00:00:00|8|7   |
|2020-08-08 00:00:00|8|1   |
|2020-08-09 00:00:00|8|2   |
|2020-08-10 00:00:00|8|3   |
+---+-++ {code}
The `week` column is not the week of the month. It is a day of the week as a 
number.

!image-2020-08-21-21-31-32-297.png!

>From my calendar, the first day of August should have 1 for the week-of-month 
>and from 2nd to 8th should have 2 and so on.


> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
> spark 3.0.0 and python 3.8.5.
> REPOSITORY : jupyter/all-spark-notebook
> TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> From the docs, the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
> '-MM-dd')) \
>   .withColumn('month', month('date')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---+-++
> |date   |month|week|
> +---+-++
> |2020-08-01 00:00:00|8|1   |
> |2020-08-02 00:00:00|8|2   |
> |2020-08-03 00:00:00|8|3   |
> |2020-08-04 00:00:00|8|4   |
> |2020-08-05 00:00:00|8|5   |
> |2020-08-06 00:00:00|8|6   |
> |2020-08-07 00:00:00|8|7   |
> |2020-08-08 00:00:00|8|1   |
> |2020-08-09 00:00:00|8|2   |
> |2020-08-10 00:00:00|8|3   |
> +---+-++ {code}
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>  
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Attachment: 주석 2020-08-21 213522.png

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
> spark 3.0.0 and python 3.8.5.
> REPOSITORY : jupyter/all-spark-notebook
> TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> From the docs, the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
> '-MM-dd')) \
>   .withColumn('month', month('date')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---+-++
> |date   |month|week|
> +---+-++
> |2020-08-01 00:00:00|8|1   |
> |2020-08-02 00:00:00|8|2   |
> |2020-08-03 00:00:00|8|3   |
> |2020-08-04 00:00:00|8|4   |
> |2020-08-05 00:00:00|8|5   |
> |2020-08-06 00:00:00|8|6   |
> |2020-08-07 00:00:00|8|7   |
> |2020-08-08 00:00:00|8|1   |
> |2020-08-09 00:00:00|8|2   |
> |2020-08-10 00:00:00|8|3   |
> +---+-++ {code}
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>  
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Attachment: (was: 주석 2020-08-21 213522.png)

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
> spark 3.0.0 and python 3.8.5.
> REPOSITORY : jupyter/all-spark-notebook
> TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> From the docs, the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
> '-MM-dd')) \
>   .withColumn('month', month('date')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---+-++
> |date   |month|week|
> +---+-++
> |2020-08-01 00:00:00|8|1   |
> |2020-08-02 00:00:00|8|2   |
> |2020-08-03 00:00:00|8|3   |
> |2020-08-04 00:00:00|8|4   |
> |2020-08-05 00:00:00|8|5   |
> |2020-08-06 00:00:00|8|6   |
> |2020-08-07 00:00:00|8|7   |
> |2020-08-08 00:00:00|8|1   |
> |2020-08-09 00:00:00|8|2   |
> |2020-08-10 00:00:00|8|3   |
> +---+-++ {code}
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>  
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Attachment: comment.png

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
> spark 3.0.0 and python 3.8.5.
> REPOSITORY : jupyter/all-spark-notebook
> TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> From the docs, the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
> '-MM-dd')) \
>   .withColumn('month', month('date')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---+-++
> |date   |month|week|
> +---+-++
> |2020-08-01 00:00:00|8|1   |
> |2020-08-02 00:00:00|8|2   |
> |2020-08-03 00:00:00|8|3   |
> |2020-08-04 00:00:00|8|4   |
> |2020-08-05 00:00:00|8|5   |
> |2020-08-06 00:00:00|8|6   |
> |2020-08-07 00:00:00|8|7   |
> |2020-08-08 00:00:00|8|1   |
> |2020-08-09 00:00:00|8|2   |
> |2020-08-10 00:00:00|8|3   |
> +---+-++ {code}
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>  
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Description: 
>From the docs, the pattern F should give a week of the month.
|*Symbol*|*Meaning*|*Presentation*|*Example*|
|F|week-of-month|number(1)|3|

I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
{code:java}
from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
'-MM-dd')) \
  .withColumn('month', month('date')) \
  .withColumn('week', date_format('date', 'F')) \
  .show(10, False)

+---+-++
|date   |month|week|
+---+-++
|2020-08-01 00:00:00|8|1   |
|2020-08-02 00:00:00|8|2   |
|2020-08-03 00:00:00|8|3   |
|2020-08-04 00:00:00|8|4   |
|2020-08-05 00:00:00|8|5   |
|2020-08-06 00:00:00|8|6   |
|2020-08-07 00:00:00|8|7   |
|2020-08-08 00:00:00|8|1   |
|2020-08-09 00:00:00|8|2   |
|2020-08-10 00:00:00|8|3   |
+---+-++ {code}
The `week` column is not the week of the month. It is a day of the week as a 
number.

 \{comment.png|width=100%}

>From my calendar, the first day of August should have 1 for the week-of-month 
>and from 2nd to 8th should have 2 and so on.

  was:
>From the docs, the pattern F should give a week of the month.
|*Symbol*|*Meaning*|*Presentation*|*Example*|
|F|week-of-month|number(1)|3|

I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
{code:java}
from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
'-MM-dd')) \
  .withColumn('month', month('date')) \
  .withColumn('week', date_format('date', 'F')) \
  .show(10, False)

+---+-++
|date   |month|week|
+---+-++
|2020-08-01 00:00:00|8|1   |
|2020-08-02 00:00:00|8|2   |
|2020-08-03 00:00:00|8|3   |
|2020-08-04 00:00:00|8|4   |
|2020-08-05 00:00:00|8|5   |
|2020-08-06 00:00:00|8|6   |
|2020-08-07 00:00:00|8|7   |
|2020-08-08 00:00:00|8|1   |
|2020-08-09 00:00:00|8|2   |
|2020-08-10 00:00:00|8|3   |
+---+-++ {code}
The `week` column is not the week of the month. It is a day of the week as a 
number.

 

>From my calendar, the first day of August should have 1 for the week-of-month 
>and from 2nd to 8th should have 2 and so on.


> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
> spark 3.0.0 and python 3.8.5.
> REPOSITORY : jupyter/all-spark-notebook
> TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> From the docs, the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
> '-MM-dd')) \
>   .withColumn('month', month('date')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---+-++
> |date   |month|week|
> +---+-++
> |2020-08-01 00:00:00|8|1   |
> |2020-08-02 00:00:00|8|2   |
> |2020-08-03 00:00:00|8|3   |
> |2020-08-04 00:00:00|8|4   |
> |2020-08-05 00:00:00|8|5   |
> |2020-08-06 00:00:00|8|6   |
> |2020-08-07 00:00:00|8|7   |
> |2020-08-08 00:00:00|8|1   |
> |2020-08-09 00:00:00|8|2   |
> |2020-08-10 00:00:00|8|3   |
> +---+-++ {code}
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>  \{comment.png|width=100%}
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Description: 
>From the docs, the pattern F should give a week of the month.
|*Symbol*|*Meaning*|*Presentation*|*Example*|
|F|week-of-month|number(1)|3|

I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
{code:java}
from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
'-MM-dd')) \
  .withColumn('month', month('date')) \
  .withColumn('week', date_format('date', 'F')) \
  .show(10, False)

+---+-++
|date   |month|week|
+---+-++
|2020-08-01 00:00:00|8|1   |
|2020-08-02 00:00:00|8|2   |
|2020-08-03 00:00:00|8|3   |
|2020-08-04 00:00:00|8|4   |
|2020-08-05 00:00:00|8|5   |
|2020-08-06 00:00:00|8|6   |
|2020-08-07 00:00:00|8|7   |
|2020-08-08 00:00:00|8|1   |
|2020-08-09 00:00:00|8|2   |
|2020-08-10 00:00:00|8|3   |
+---+-++ {code}
The `week` column is not the week of the month. It is a day of the week as a 
number.

  !comment.png!

>From my calendar, the first day of August should have 1 for the week-of-month 
>and from 2nd to 8th should have 2 and so on.

  was:
>From the docs, the pattern F should give a week of the month.
|*Symbol*|*Meaning*|*Presentation*|*Example*|
|F|week-of-month|number(1)|3|

I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
{code:java}
from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
'-MM-dd')) \
  .withColumn('month', month('date')) \
  .withColumn('week', date_format('date', 'F')) \
  .show(10, False)

+---+-++
|date   |month|week|
+---+-++
|2020-08-01 00:00:00|8|1   |
|2020-08-02 00:00:00|8|2   |
|2020-08-03 00:00:00|8|3   |
|2020-08-04 00:00:00|8|4   |
|2020-08-05 00:00:00|8|5   |
|2020-08-06 00:00:00|8|6   |
|2020-08-07 00:00:00|8|7   |
|2020-08-08 00:00:00|8|1   |
|2020-08-09 00:00:00|8|2   |
|2020-08-10 00:00:00|8|3   |
+---+-++ {code}
The `week` column is not the week of the month. It is a day of the week as a 
number.

 \{comment.png|width=100%}

>From my calendar, the first day of August should have 1 for the week-of-month 
>and from 2nd to 8th should have 2 and so on.


> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
> spark 3.0.0 and python 3.8.5.
> REPOSITORY : jupyter/all-spark-notebook
> TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> From the docs, the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
> '-MM-dd')) \
>   .withColumn('month', month('date')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---+-++
> |date   |month|week|
> +---+-++
> |2020-08-01 00:00:00|8|1   |
> |2020-08-02 00:00:00|8|2   |
> |2020-08-03 00:00:00|8|3   |
> |2020-08-04 00:00:00|8|4   |
> |2020-08-05 00:00:00|8|5   |
> |2020-08-06 00:00:00|8|6   |
> |2020-08-07 00:00:00|8|7   |
> |2020-08-08 00:00:00|8|1   |
> |2020-08-09 00:00:00|8|2   |
> |2020-08-10 00:00:00|8|3   |
> +---+-++ {code}
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Description: 
h3. Background

>From the 
>[documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
> the pattern F should give a week of the month.
|*Symbol*|*Meaning*|*Presentation*|*Example*|
|F|week-of-month|number(1)|3|
h3. Test Data

Here is my test data, that is a csv file.
{code:java}
date
2020-08-01
2020-08-02
2020-08-03
2020-08-04
2020-08-05
2020-08-06
2020-08-07
2020-08-08
2020-08-09
2020-08-10 {code}
h3. Steps to the bug

I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
{code:java}
// Spark

df.withColumn("date", to_timestamp('date, "-MM-dd"))
  .withColumn("week", date_format('date, "F")).show

+---++
|   date|week|
+---++
|2020-08-01 00:00:00|   1|
|2020-08-02 00:00:00|   2|
|2020-08-03 00:00:00|   3|
|2020-08-04 00:00:00|   4|
|2020-08-05 00:00:00|   5|
|2020-08-06 00:00:00|   6|
|2020-08-07 00:00:00|   7|
|2020-08-08 00:00:00|   1|
|2020-08-09 00:00:00|   2|
|2020-08-10 00:00:00|   3|
+---++


# pyspark

df.withColumn('date', to_timestamp('date', '-MM-dd')) \
  .withColumn('week', date_format('date', 'F')) \
  .show(10, False)

+---++
|date   |week|
+---++
|2020-08-01 00:00:00|1   |
|2020-08-02 00:00:00|2   |
|2020-08-03 00:00:00|3   |
|2020-08-04 00:00:00|4   |
|2020-08-05 00:00:00|5   |
|2020-08-06 00:00:00|6   |
|2020-08-07 00:00:00|7   |
|2020-08-08 00:00:00|1   |
|2020-08-09 00:00:00|2   |
|2020-08-10 00:00:00|3   |
+---++{code}
h3. Expected result

The `week` column is not the week of the month. It is a day of the week as a 
number.

  !comment.png!

>From my calendar, the first day of August should have 1 for the week-of-month 
>and from 2nd to 8th should have 2 and so on.
{code:java}
+---++
|date   |week|
+---++
|2020-08-01 00:00:00|1   |
|2020-08-02 00:00:00|2   |
|2020-08-03 00:00:00|2   |
|2020-08-04 00:00:00|2   |
|2020-08-05 00:00:00|2   |
|2020-08-06 00:00:00|2   |
|2020-08-07 00:00:00|2   |
|2020-08-08 00:00:00|2   |
|2020-08-09 00:00:00|3   |
|2020-08-10 00:00:00|3   |
+---++{code}

  was:
>From the docs, the pattern F should give a week of the month.
|*Symbol*|*Meaning*|*Presentation*|*Example*|
|F|week-of-month|number(1)|3|

I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
{code:java}
from pyspark.sql.functions import *df.withColumn('date', to_timestamp('date', 
'-MM-dd')) \
  .withColumn('month', month('date')) \
  .withColumn('week', date_format('date', 'F')) \
  .show(10, False)

+---+-++
|date   |month|week|
+---+-++
|2020-08-01 00:00:00|8|1   |
|2020-08-02 00:00:00|8|2   |
|2020-08-03 00:00:00|8|3   |
|2020-08-04 00:00:00|8|4   |
|2020-08-05 00:00:00|8|5   |
|2020-08-06 00:00:00|8|6   |
|2020-08-07 00:00:00|8|7   |
|2020-08-08 00:00:00|8|1   |
|2020-08-09 00:00:00|8|2   |
|2020-08-10 00:00:00|8|3   |
+---+-++ {code}
The `week` column is not the week of the month. It is a day of the week as a 
number.

  !comment.png!

>From my calendar, the first day of August should have 1 for the week-of-month 
>and from 2nd to 8th should have 2 and so on.


> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro with Jupyter Lab Docker Image for the 
> spark 3.0.0 and python 3.8.5.
> REPOSITORY : jupyter/all-spark-notebook
> TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, that is a csv file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to the bug
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +---++
> |   date|week|
> +---++
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|

[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Thomas Graves (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181867#comment-17181867
 ] 

Thomas Graves commented on SPARK-32672:
---

[~cltlfcjin]  Please do not be changing priority just because people are not 
committers. You should first evaluate what they are reporting.  If you don't 
think its a blocker then we should state why the reason.

I looked at this after it was filed and added correctness tag and it was 
already marked as Blocker so I didn't need to change it.  As you can see from 
[https://spark.apache.org/contributing.html,|https://spark.apache.org/contributing.html]
 correctness issues should be marked as a blocker at least until it's 
investigated and discussed. 

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Robert Joseph Evans (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181868#comment-17181868
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

So I am able to reduce the corruption down to just a single 10,000 row chunk, 
and still get it to happen. I'll post a new parquet file soon that will 
hopefully make debugging a little simpler.

{code}
scala> val bad_order = 
spark.read.parquet("/home/roberte/src/rapids-plugin-4-spark/integration_tests/bad_order.snappy.parquet").selectExpr("b",
 "monotonically_increasing_id() as id").where(col("id")>=7 and col("id") < 
8)
bad_order: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [b: 
boolean, id: bigint]

scala> bad_order.groupBy("b").count.show
+-+-+
|b|count|
+-+-+
| null|  619|
| true| 4701|
|false| 4680|
+-+-+


scala> bad_order.cache()
res2: bad_order.type = [b: boolean, id: bigint]

scala> bad_order.groupBy("b").count.show
+-+-+
|b|count|
+-+-+
| null|  618|
| true| 4701|
|false| 4681|
+-+-+

{code}

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Robert Joseph Evans (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-32672:

Attachment: small_bad.snappy.parquet

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Environment: 
Windows 10 Pro
 * with Jupyter Lab Docker Image
 ** for the spark 3.0
 ** python 3.8.5
 ** openjdk 11.0.8

REPOSITORY : jupyter/all-spark-notebook
 TAG:  f1811928b3dd 

  was:
Windows 10 Pro with Jupyter Lab Docker Image for the spark 3.0.0 and python 
3.8.5.


REPOSITORY : jupyter/all-spark-notebook
TAG:  f1811928b3dd 


> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab Docker Image
>  ** for the spark 3.0
>  ** python 3.8.5
>  ** openjdk 11.0.8
> REPOSITORY : jupyter/all-spark-notebook
>  TAG:  f1811928b3dd 
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, that is a csv file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to the bug
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +---++
> |   date|week|
> +---++
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +---++
> # pyspark
> df.withColumn('date', to_timestamp('date', '-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---++
> |date   |week|
> +---++
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +---++{code}
> h3. Expected result
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.
> {code:java}
> +---++
> |date   |week|
> +---++
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +---++{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Environment: 
Windows 10 Pro
 * with Jupyter Lab - Docker Image 
 ** jupyter/all-spark-notebook:f1811928b3dd 
 *** spark 3.0.0
 *** python 3.8.5
 *** openjdk 11.0.8

  was:
Windows 10 Pro
 * with Jupyter Lab - Docker Image 
 ** jupyter/all-spark-notebook:f1811928b3dd 
 ** spark 3.0.0
 ** python 3.8.5
 ** openjdk 11.0.8


> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  *** spark 3.0.0
>  *** python 3.8.5
>  *** openjdk 11.0.8
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, that is a csv file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to the bug
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +---++
> |   date|week|
> +---++
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +---++
> # pyspark
> df.withColumn('date', to_timestamp('date', '-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---++
> |date   |week|
> +---++
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +---++{code}
> h3. Expected result
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.
> {code:java}
> +---++
> |date   |week|
> +---++
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +---++{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-21 Thread Daeho Ro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daeho Ro updated SPARK-32683:
-
Environment: 
Windows 10 Pro
 * with Jupyter Lab - Docker Image 
 ** jupyter/all-spark-notebook:f1811928b3dd 
 ** spark 3.0.0
 ** python 3.8.5
 ** openjdk 11.0.8

  was:
Windows 10 Pro
 * with Jupyter Lab Docker Image
 ** for the spark 3.0
 ** python 3.8.5
 ** openjdk 11.0.8

REPOSITORY : jupyter/all-spark-notebook
 TAG:  f1811928b3dd 


> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  ** spark 3.0.0
>  ** python 3.8.5
>  ** openjdk 11.0.8
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, that is a csv file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to the bug
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +---++
> |   date|week|
> +---++
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +---++
> # pyspark
> df.withColumn('date', to_timestamp('date', '-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +---++
> |date   |week|
> +---++
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +---++{code}
> h3. Expected result
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.
> {code:java}
> +---++
> |date   |week|
> +---++
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +---++{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Robert Joseph Evans (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181873#comment-17181873
 ] 

Robert Joseph Evans commented on SPARK-32672:
-

OK reading through the code I understand what is happening now.  The 
compression format ignores nulls, which are stored separately.  As such the bit 
set stored is only for non-null boolean values/bits. The number of entries 
stored in the compression format is  the number of non-null boolean values that 
are stored.

So the stopping condition on a batch decompress.

{code}
while (visitedLocal < countLocal) {
{code}

skips all of null values at the end.  But because the length of the column is 
known ahead of time it falls back to the default value which is false.

I'll try to get a patch up shortly to fix this.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181909#comment-17181909
 ] 

Apache Spark commented on SPARK-32672:
--

User 'revans2' has created a pull request for this issue:
https://github.com/apache/spark/pull/29506

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32672:


Assignee: Apache Spark

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181910#comment-17181910
 ] 

Apache Spark commented on SPARK-32672:
--

User 'revans2' has created a pull request for this issue:
https://github.com/apache/spark/pull/29506

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32672:


Assignee: (was: Apache Spark)

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32680) CTAS with V2 catalog wrongly accessed unresolved query

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32680:


Assignee: Apache Spark

> CTAS with V2 catalog wrongly accessed unresolved query
> --
>
> Key: SPARK-32680
> URL: https://issues.apache.org/jira/browse/SPARK-32680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Assignee: Apache Spark
>Priority: Major
>
> Case:
> {{CREATE TABLE t USING delta AS SELECT * from nonexist }}
>  
> Expected:
> throw AnalysisException with "Table or view not found"
>  
> Actual:
> {{throw UnresolvedException with 
> "org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree: *"}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32680) CTAS with V2 catalog wrongly accessed unresolved query

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181912#comment-17181912
 ] 

Apache Spark commented on SPARK-32680:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/29507

> CTAS with V2 catalog wrongly accessed unresolved query
> --
>
> Key: SPARK-32680
> URL: https://issues.apache.org/jira/browse/SPARK-32680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Priority: Major
>
> Case:
> {{CREATE TABLE t USING delta AS SELECT * from nonexist }}
>  
> Expected:
> throw AnalysisException with "Table or view not found"
>  
> Actual:
> {{throw UnresolvedException with 
> "org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree: *"}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32680) CTAS with V2 catalog wrongly accessed unresolved query

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32680:


Assignee: (was: Apache Spark)

> CTAS with V2 catalog wrongly accessed unresolved query
> --
>
> Key: SPARK-32680
> URL: https://issues.apache.org/jira/browse/SPARK-32680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Priority: Major
>
> Case:
> {{CREATE TABLE t USING delta AS SELECT * from nonexist }}
>  
> Expected:
> throw AnalysisException with "Table or view not found"
>  
> Actual:
> {{throw UnresolvedException with 
> "org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree: *"}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32287) Flaky Test: ExecutorAllocationManagerSuite.add executors default profile

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181923#comment-17181923
 ] 

Apache Spark commented on SPARK-32287:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/29508

> Flaky Test: ExecutorAllocationManagerSuite.add executors default profile
> 
>
> Key: SPARK-32287
> URL: https://issues.apache.org/jira/browse/SPARK-32287
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
>  This test becomes flaky in Github Actions, see: 
> https://github.com/apache/spark/pull/29072/checks?check_run_id=861689509
> {code:java}
> [info] - add executors default profile *** FAILED *** (33 milliseconds)
> [info]   4 did not equal 2 (ExecutorAllocationManagerSuite.scala:132)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
> [info]   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
> [info]   at 
> org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$7(ExecutorAllocationManagerSuite.scala:132)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
> [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
> [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
> [info]   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> [info]   ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-32684) Scrip transform hive serde/default-serde mode null value keep same with hive

2020-08-21 Thread angerszhu (Jira)

angerszhu created SPARK-32684:
-

 Summary: Scrip transform hive serde/default-serde mode null value 
keep same with hive
 Key: SPARK-32684
 URL: https://issues.apache.org/jira/browse/SPARK-32684
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32684) Scrip transform hive serde/default-serde mode null value keep same with hive

2020-08-21 Thread angerszhu (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-32684:
--
Parent: SPARK-31936
Issue Type: Sub-task  (was: Bug)

> Scrip transform hive serde/default-serde mode null value keep same with hive
> 
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32684) Scrip transform hive serde/default-serde mode null value keep same with hive

2020-08-21 Thread angerszhu (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-32684:
--
Description: 
Hive serde default NULL value is '\N'
{code:java}
String nullString = tbl.getProperty(
serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
nullSequence = new Text(nullString);
{code}

> Scrip transform hive serde/default-serde mode null value keep same with hive
> 
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Hive serde default NULL value is '\N'
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32684) Scrip transform hive serde/default-serde mode null value keep same with hive as '\\N'

2020-08-21 Thread angerszhu (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-32684:
--
Summary: Scrip transform hive serde/default-serde mode null value keep same 
with hive as '\\N'  (was: Scrip transform hive serde/default-serde mode null 
value keep same with hive)

> Scrip transform hive serde/default-serde mode null value keep same with hive 
> as '\\N'
> -
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Hive serde default NULL value is '\N'
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-32685) Script transform hive serde default field.delimit is '\t'

2020-08-21 Thread angerszhu (Jira)

angerszhu created SPARK-32685:
-

 Summary: Script transform hive serde default field.delimit is '\t'
 Key: SPARK-32685
 URL: https://issues.apache.org/jira/browse/SPARK-32685
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: angerszhu


 
{code:java}
select split(value, "\t") from (
SELECT TRANSFORM(a, b, c, null)
USING 'cat' 
FROM (select 1 as a, 2 as b, 3  as c) t
) temp;

result is :
_c0
["2","3","\\N"]{code}
 
{code:java}
select split(value, "\t") from (
SELECT TRANSFORM(a, b, c, null)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
USING 'cat' 
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES (
   'serialization.last.column.takes.rest' = 'true'
  )
FROM (select 1 as a, 2 as b, 3  as c) t
) temp;


result is :
_c0
["2","3","\\N"]{code}
 

 

 
{code:java}
select split(value, "\t") from (
SELECT TRANSFORM(a, b, c, null)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
USING 'cat' 
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
FROM (select 1 as a, 2 as b, 3  as c) t
) temp;

result is :
_c0 
["2"]
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181968#comment-17181968
 ] 

Apache Spark commented on SPARK-31608:
--

User 'baohe-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29509

> Add a hybrid KVStore to make UI loading faster
> --
>
> Key: SPARK-31608
> URL: https://issues.apache.org/jira/browse/SPARK-31608
> Project: Spark
>  Issue Type: Story
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> This is a follow-up for the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server faster when loading 
> event files. When rebuilding the application state from event logs, 
> HybridKVStore will first write data to an in-memory store and having a 
> background thread that keeps pushing the change to levelDB.
> I ran some tests on 3.0.1 on mac os:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
> leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
> leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
> leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
> For example when loading a 1g file, HybridKVStore takes 23s to parse (that 
> means, users only need to wait for 23s to see the UI), the background thread 
> will still run 17s to copy data to leveldb. And after that, the in memory 
> store can be closed, the entire store now moves to leveldb. So in general, it 
> has 3x - 4x UI loading speed improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-31608) Add a hybrid KVStore to make UI loading faster

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-31608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181969#comment-17181969
 ] 

Apache Spark commented on SPARK-31608:
--

User 'baohe-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29509

> Add a hybrid KVStore to make UI loading faster
> --
>
> Key: SPARK-31608
> URL: https://issues.apache.org/jira/browse/SPARK-31608
> Project: Spark
>  Issue Type: Story
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> This is a follow-up for the work done by Hieu Huynh in 2019.
> Add a new class HybridKVStore to make the history server faster when loading 
> event files. When rebuilding the application state from event logs, 
> HybridKVStore will first write data to an in-memory store and having a 
> background thread that keeps pushing the change to levelDB.
> I ran some tests on 3.0.1 on mac os:
> ||kvstore type / log size||100m||200m||500m||1g||2g||
> |HybridKVStore|5s to parse, 7s(include the parsing time) to switch to 
> leveldb|6s to parse, 10s to switch to leveldb|15s to parse, 23s to switch to 
> leveldb|23s to parse, 40s to switch to leveldb|37s to parse, 73s to switch to 
> leveldb|
> |LevelDB|12s to parse|19s to parse|43s to parse|69s to parse|124s to parse|
> For example when loading a 1g file, HybridKVStore takes 23s to parse (that 
> means, users only need to wait for 23s to see the UI), the background thread 
> will still run 17s to copy data to leveldb. And after that, the in memory 
> store can be closed, the entire store now moves to leveldb. So in general, it 
> has 3x - 4x UI loading speed improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries

2020-08-21 Thread Nicholas Chammas (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182040#comment-17182040
 ] 

Nicholas Chammas commented on SPARK-32686:
--

Not sure if I have the "Affects Version" field set appropriately.

> Un-deprecate inferring DataFrame schema from list of dictionaries
> -
>
> Key: SPARK-32686
> URL: https://issues.apache.org/jira/browse/SPARK-32686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Inferring the schema of a DataFrame from a list of dictionaries feels natural 
> for PySpark users, and also mirrors [basic functionality in 
> Pandas|https://stackoverflow.com/a/20638258/877069].
> This is currently possible in PySpark but comes with a deprecation warning. 
> We should un-deprecate this behavior if there are no deeper reasons to 
> discourage users from this API, beyond wanting to push them to use {{Row}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries

2020-08-21 Thread Nicholas Chammas (Jira)

Nicholas Chammas created SPARK-32686:


 Summary: Un-deprecate inferring DataFrame schema from list of 
dictionaries
 Key: SPARK-32686
 URL: https://issues.apache.org/jira/browse/SPARK-32686
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


Inferring the schema of a DataFrame from a list of dictionaries feels natural 
for PySpark users, and also mirrors [basic functionality in 
Pandas|https://stackoverflow.com/a/20638258/877069].

This is currently possible in PySpark but comes with a deprecation warning. We 
should un-deprecate this behavior if there are no deeper reasons to discourage 
users from this API, beyond wanting to push them to use {{Row}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32686:


Assignee: (was: Apache Spark)

> Un-deprecate inferring DataFrame schema from list of dictionaries
> -
>
> Key: SPARK-32686
> URL: https://issues.apache.org/jira/browse/SPARK-32686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Inferring the schema of a DataFrame from a list of dictionaries feels natural 
> for PySpark users, and also mirrors [basic functionality in 
> Pandas|https://stackoverflow.com/a/20638258/877069].
> This is currently possible in PySpark but comes with a deprecation warning. 
> We should un-deprecate this behavior if there are no deeper reasons to 
> discourage users from this API, beyond wanting to push them to use {{Row}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32686:


Assignee: Apache Spark

> Un-deprecate inferring DataFrame schema from list of dictionaries
> -
>
> Key: SPARK-32686
> URL: https://issues.apache.org/jira/browse/SPARK-32686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Minor
>
> Inferring the schema of a DataFrame from a list of dictionaries feels natural 
> for PySpark users, and also mirrors [basic functionality in 
> Pandas|https://stackoverflow.com/a/20638258/877069].
> This is currently possible in PySpark but comes with a deprecation warning. 
> We should un-deprecate this behavior if there are no deeper reasons to 
> discourage users from this API, beyond wanting to push them to use {{Row}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182045#comment-17182045
 ] 

Apache Spark commented on SPARK-32686:
--

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/29510

> Un-deprecate inferring DataFrame schema from list of dictionaries
> -
>
> Key: SPARK-32686
> URL: https://issues.apache.org/jira/browse/SPARK-32686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Inferring the schema of a DataFrame from a list of dictionaries feels natural 
> for PySpark users, and also mirrors [basic functionality in 
> Pandas|https://stackoverflow.com/a/20638258/877069].
> This is currently possible in PySpark but comes with a deprecation warning. 
> We should un-deprecate this behavior if there are no deeper reasons to 
> discourage users from this API, beyond wanting to push them to use {{Row}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32672:
--
Affects Version/s: 2.3.4

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32672:
--
Affects Version/s: 2.2.3

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Dongjoon Hyun (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182156#comment-17182156
 ] 

Dongjoon Hyun commented on SPARK-32672:
---

Thank you, [~revans2]. 

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Dongjoon Hyun (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32672:
--
Affects Version/s: (was: 2.2.3)

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-32669) expression unit tests should explore all cases that can lead to null result

2020-08-21 Thread Takeshi Yamamuro (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32669.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Resolved by 
[https://github.com/apache/spark/pull/29493|https://github.com/apache/spark/pull/29493#]

> expression unit tests should explore all cases that can lead to null result
> ---
>
> Key: SPARK-32669
> URL: https://issues.apache.org/jira/browse/SPARK-32669
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>
> Add document to {{ExpressionEvalHelper}}, and ask people to explore all the 
> cases that can lead to null results (including null in struct fields, array 
> elements and map values).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32642) Add support for ESS in Spark sidecar

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32642:


Assignee: Holden Karau  (was: Apache Spark)

> Add support for ESS in Spark sidecar
> 
>
> Key: SPARK-32642
> URL: https://issues.apache.org/jira/browse/SPARK-32642
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> For Spark on Kubernetes there are multiple ways to achieve dynamic scaling. 
> Unlike in YARN mode ESS on Spark on K8s as a sidecar would not help with 
> dynamic scaling. Instead this helps with the situations where the executor 
> JVM may be overloaded or crash. If we do this as a sidecar we can share the 
> same mounts between the executor and the external shuffle storage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32642) Add support for ESS in Spark sidecar

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182158#comment-17182158
 ] 

Apache Spark commented on SPARK-32642:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29511

> Add support for ESS in Spark sidecar
> 
>
> Key: SPARK-32642
> URL: https://issues.apache.org/jira/browse/SPARK-32642
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> For Spark on Kubernetes there are multiple ways to achieve dynamic scaling. 
> Unlike in YARN mode ESS on Spark on K8s as a sidecar would not help with 
> dynamic scaling. Instead this helps with the situations where the executor 
> JVM may be overloaded or crash. If we do this as a sidecar we can share the 
> same mounts between the executor and the external shuffle storage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32642) Add support for ESS in Spark sidecar

2020-08-21 Thread Apache Spark (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32642:


Assignee: Apache Spark  (was: Holden Karau)

> Add support for ESS in Spark sidecar
> 
>
> Key: SPARK-32642
> URL: https://issues.apache.org/jira/browse/SPARK-32642
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> For Spark on Kubernetes there are multiple ways to achieve dynamic scaling. 
> Unlike in YARN mode ESS on Spark on K8s as a sidecar would not help with 
> dynamic scaling. Instead this helps with the situations where the executor 
> JVM may be overloaded or crash. If we do this as a sidecar we can share the 
> same mounts between the executor and the external shuffle storage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32642) Add support for ESS in Spark sidecar

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182160#comment-17182160
 ] 

Apache Spark commented on SPARK-32642:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29511

> Add support for ESS in Spark sidecar
> 
>
> Key: SPARK-32642
> URL: https://issues.apache.org/jira/browse/SPARK-32642
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> For Spark on Kubernetes there are multiple ways to achieve dynamic scaling. 
> Unlike in YARN mode ESS on Spark on K8s as a sidecar would not help with 
> dynamic scaling. Instead this helps with the situations where the executor 
> JVM may be overloaded or crash. If we do this as a sidecar we can share the 
> same mounts between the executor and the external shuffle storage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32662) CountVectorizerModel: Remove requirement for minimum vocabulary size

2020-08-21 Thread Huaxin Gao (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao reassigned SPARK-32662:
--

Assignee: Jatin Puri

> CountVectorizerModel: Remove requirement for minimum vocabulary size
> 
>
> Key: SPARK-32662
> URL: https://issues.apache.org/jira/browse/SPARK-32662
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Jatin Puri
>Assignee: Jatin Puri
>Priority: Minor
>
> Currently `CountVectorizer.scala` has the following requirement:
> {code:java}
> require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as 
> necessary."){code}
> But this is not a necessary constraint. It should be able to function even 
> for empty vocabulary case.
> This also gives the ability to run the model over empty datasets. HashingTF 
> works fine in such scenarios. CountVectorizer doesn't.
>  
> spark-user discussion reference: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Ability-to-have-CountVectorizerModel-vocab-as-empty-td38396.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-32662) CountVectorizerModel: Remove requirement for minimum vocabulary size

2020-08-21 Thread Huaxin Gao (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-32662.

Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29482
[https://github.com/apache/spark/pull/29482]

> CountVectorizerModel: Remove requirement for minimum vocabulary size
> 
>
> Key: SPARK-32662
> URL: https://issues.apache.org/jira/browse/SPARK-32662
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Jatin Puri
>Assignee: Jatin Puri
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently `CountVectorizer.scala` has the following requirement:
> {code:java}
> require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as 
> necessary."){code}
> But this is not a necessary constraint. It should be able to function even 
> for empty vocabulary case.
> This also gives the ability to run the model over empty datasets. HashingTF 
> works fine in such scenarios. CountVectorizer doesn't.
>  
> spark-user discussion reference: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Ability-to-have-CountVectorizerModel-vocab-as-empty-td38396.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32646) ORC predicate pushdown should work with case-insensitive analysis

2020-08-21 Thread L. C. Hsieh (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-32646:

Affects Version/s: 3.0.0

> ORC predicate pushdown should work with case-insensitive analysis
> -
>
> Key: SPARK-32646
> URL: https://issues.apache.org/jira/browse/SPARK-32646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently ORC predicate pushdown doesn't work with case-insensitive analysis, 
> see SPARK-32622 for the test case.
> We should make ORC predicate pushdown work with case-insensitive analysis too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32646) ORC predicate pushdown should work with case-insensitive analysis

2020-08-21 Thread Apache Spark (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182218#comment-17182218
 ] 

Apache Spark commented on SPARK-32646:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29513

> ORC predicate pushdown should work with case-insensitive analysis
> -
>
> Key: SPARK-32646
> URL: https://issues.apache.org/jira/browse/SPARK-32646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently ORC predicate pushdown doesn't work with case-insensitive analysis, 
> see SPARK-32622 for the test case.
> We should make ORC predicate pushdown work with case-insensitive analysis too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32672.
--
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29506
[https://github.com/apache/spark/pull/29506]

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Hyukjin Kwon (Jira)



 [ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32672:


Assignee: Robert Joseph Evans

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182225#comment-17182225
 ] 

Hyukjin Kwon commented on SPARK-32672:
--

[~revans2] is a PMC, and it is a correctness issue. This indeed is a blocker. I 
think the initial action was a mistake.

I think [~cltlfcjin] referred:

{quote}
Set to Major or below; higher priorities are generally reserved for committers 
to set. 
{quote}

I fully agree that ideally we should first evaluate what they are reporting 
with stating the reason.

The problem is that we don't have a lot of manpower here in triaging/managing 
JIRAs. It's just that there are not so many people who do.
Given this situation, I would like to encourage to aggressively triage - there 
are many JIRAs that set priority incorrectly.

For example, many JIRAs just ask questions and/or investigations with setting 
the priority as a blocker. Such blockers matter for release managers.
If we want more fine-grained and ideal evaluation of JIRAs, I would encourage 
our PMC members to take a look more often.


> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182225#comment-17182225
 ] 

Hyukjin Kwon edited comment on SPARK-32672 at 8/22/20, 2:27 AM:


[~revans2] is a PMC, and it is a correctness issue. This indeed is a blocker. I 
think the initial action was a mistake.

I think [~cltlfcjin] referred:

{quote}
Set to Major or below; higher priorities are generally reserved for committers 
to set. 
{quote}

I fully agree that ideally we should first evaluate what they are reporting 
with stating the reason.

The problem is that we don't have a lot of manpower here in triaging/managing 
JIRAs. It's just that there are not so many people who do.
Given this situation, I would like to encourage to aggressively triage - there 
are many JIRAs that set priority incorrectly.

For example, many JIRAs just ask questions and/or investigations with setting 
the priority as a blocker, presumably expecting quicker responses and actions. 
Such blockers matter for release managers.

If we want more fine-grained and ideal evaluation of JIRAs, I would encourage 
our PMC members to take a look more often.

We also ended up with introducing auto-resolving JIRAs that sets affect 
versions as EOL Spark versions, due to the lack of the manpower in JIRA 
management (which I don't think this is ideal but I think it was inevitable) 
for the same reason.


was (Author: hyukjin.kwon):
[~revans2] is a PMC, and it is a correctness issue. This indeed is a blocker. I 
think the initial action was a mistake.

I think [~cltlfcjin] referred:

{quote}
Set to Major or below; higher priorities are generally reserved for committers 
to set. 
{quote}

I fully agree that ideally we should first evaluate what they are reporting 
with stating the reason.

The problem is that we don't have a lot of manpower here in triaging/managing 
JIRAs. It's just that there are not so many people who do.
Given this situation, I would like to encourage to aggressively triage - there 
are many JIRAs that set priority incorrectly.

For example, many JIRAs just ask questions and/or investigations with setting 
the priority as a blocker. Such blockers matter for release managers.
If we want more fine-grained and ideal evaluation of JIRAs, I would encourage 
our PMC members to take a look more often.


> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2020-08-21 Thread Cheng Su (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182239#comment-17182239
 ] 

Cheng Su commented on SPARK-26164:
--

[~cloud_fan], [~hyukjin.kwon] - wondering how do you think of this JIRA? I can 
raise another PR for latest trunk. Thanks.

The JIRA is still a valid problem for latest master trunk. Internally in our 
fork, we are seeing much better CPU/IO performance when writing big output 
table with dynamic partitions or buckets. The sort before writing output can 
have obvious overhead if the sort needs spill.

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Cheng Su
>Priority: Minor
>
> Problem:
> Current spark always requires a local sort before writing to output table on 
> partition/bucket columns [1]. The disadvantage is the sort might waste 
> reserved CPU time on executor due to spill. Hive does not require the local 
> sort before writing output table [2], and we saw performance regression when 
> migrating hive workload to spark.
>  
> Proposal:
> We can avoid the local sort by keeping the mapping between file path and 
> output writer. In case of writing row to a new file path, we create a new 
> output writer. Otherwise, re-use the same output writer if the writer already 
> exists (mainly change should be in FileFormatDataWriter.scala). This is very 
> similar to what hive does in [2].
> Given the new behavior (i.e. avoid sort by keeping multiple output writer) 
> consumes more memory on executor (multiple output writer needs to be opened 
> in same time), than the current behavior (i.e. only one output writer 
> opened). We can add the config to switch between the current and new behavior.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

sub

2020-08-21 Thread Bowen Li

[jira] [Comment Edited] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Hyukjin Kwon (Jira)



[ 
https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182225#comment-17182225
 ] 

Hyukjin Kwon edited comment on SPARK-32672 at 8/22/20, 4:57 AM:


[~revans2] is a PMC, and it is a correctness issue. This indeed is a blocker. I 
think the initial action was a mistake.

I think [~cltlfcjin] referred:

{quote}
Set to Major or below; higher priorities are generally reserved for committers 
to set. 
{quote}

I fully agree that ideally we should first evaluate what they are reporting 
with stating the reason.

The problem is that we don't have a lot of manpower here in triaging/managing 
JIRAs. It's just that there are not so many people who do.
Given this situation, I would like to encourage to aggressively triage - there 
are many JIRAs that set priority incorrectly.

For example, many JIRAs just ask questions and/or investigations with setting 
the priority as a blocker, presumably expecting quicker responses and actions. 
Such blockers matter for release managers.

If we want more fine-grained and ideal evaluation of JIRAs, I would encourage 
our PMC members/committers/contributors to take a look more often.

We also ended up with introducing auto-resolving JIRAs that sets affect 
versions as EOL Spark versions, due to the lack of the manpower in JIRA 
management (which I don't think this is ideal but I think it was inevitable) 
for the same reason.


was (Author: hyukjin.kwon):
[~revans2] is a PMC, and it is a correctness issue. This indeed is a blocker. I 
think the initial action was a mistake.

I think [~cltlfcjin] referred:

{quote}
Set to Major or below; higher priorities are generally reserved for committers 
to set. 
{quote}

I fully agree that ideally we should first evaluate what they are reporting 
with stating the reason.

The problem is that we don't have a lot of manpower here in triaging/managing 
JIRAs. It's just that there are not so many people who do.
Given this situation, I would like to encourage to aggressively triage - there 
are many JIRAs that set priority incorrectly.

For example, many JIRAs just ask questions and/or investigations with setting 
the priority as a blocker, presumably expecting quicker responses and actions. 
Such blockers matter for release managers.

If we want more fine-grained and ideal evaluation of JIRAs, I would encourage 
our PMC members to take a look more often.

We also ended up with introducing auto-resolving JIRAs that sets affect 
versions as EOL Spark versions, due to the lack of the manpower in JIRA 
management (which I don't think this is ideal but I think it was inevitable) 
for the same reason.

> Data corruption in some cached compressed boolean columns
> -
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
> Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache that the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache the issue goes 
> away.  I was able to make this happen in 3.0.0.  I am going to try and 
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see after the data is cached a single null values 
> switches over to be false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]  
>   
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-+-+
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
> scala> bad_order.groupBy("b").count.show
> +-+-+
> |b|count|
> +-+-+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-+-+
> scala> 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

83 matches

Mail list logo