[jira] [Updated] (SPARK-33229) UnsupportedOperationException when group by with cube

2020-10-25 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33229:
-
Issue Type: Improvement  (was: Bug)

> UnsupportedOperationException when group by with cube
> -
>
> Key: SPARK-33229
> URL: https://issues.apache.org/jira/browse/SPARK-33229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table test_cube using parquet as select id as a, id as b, id as c from 
> range(10);
> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> {code}
> {noformat}
> spark-sql> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> 20/10/23 06:31:51 ERROR SparkSQLDriver: Failed in [select a, b, c, count(*) 
> from test_cube group by 1, cube(2, 3)]
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:268)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:284)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
> {noformat}
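A possible workaround sketch, not taken from the ticket: {{GROUP BY a, CUBE(b, c)}} is equivalent to the grouping sets (a, b, c), (a, b), (a, c), (a), and those can be spelled out explicitly with {{GROUPING SETS}}, which the affected versions accept. The PySpark wrapper below is illustrative; the table and columns come from the reproduction above.

{code:python}
# Workaround sketch (assumption, not from the ticket): expand
# "GROUP BY 1, CUBE(2, 3)" into explicit GROUPING SETS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    SELECT a, b, c, count(*)
    FROM test_cube
    GROUP BY a, b, c
    GROUPING SETS ((a, b, c), (a, b), (a, c), (a))
""").show()
{code}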



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33229) UnsupportedOperationException when group by with cube

2020-10-25 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33229:
-
Affects Version/s: (was: 3.0.1)
   (was: 3.0.0)

> UnsupportedOperationException when group by with cube
> -
>
> Key: SPARK-33229
> URL: https://issues.apache.org/jira/browse/SPARK-33229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table test_cube using parquet as select id as a, id as b, id as c from 
> range(10);
> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> {code}
> {noformat}
> spark-sql> select a, b, c, count(*) from test_cube group by 1, cube(2, 3);
> 20/10/23 06:31:51 ERROR SparkSQLDriver: Failed in [select a, b, c, count(*) 
> from test_cube group by 1, cube(2, 3)]
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.GroupingSet.dataType$(grouping.scala:35)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cube.dataType(grouping.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidGroupingExprs$1(CheckAnalysis.scala:268)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$12$adapted(CheckAnalysis.scala:284)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:284)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:177)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:92)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:89)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:153)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:133)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:58)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220514#comment-17220514
 ] 

Apache Spark commented on SPARK-33240:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/30147

> Fail fast when fails to instantiate configured v2 session catalog
> -
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Currently, Spark falls back to the "default catalog" when it fails to instantiate 
> the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and it 
> pollutes the log file because it is logged every time the catalog is resolved. 
> More importantly, this should be considered incorrect behavior: end users set 
> the custom catalog intentionally, and Spark sometimes silently ignores it, 
> defeating that intention.
> We should simply fail in this case so that end users notice the failure earlier 
> and can fix the issue.
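For context, a minimal sketch of the configuration involved, assuming the usual {{spark.sql.catalog.spark_catalog}} setting for plugging in a v2 session catalog; the class name below is a placeholder.

{code:python}
# Minimal sketch: register a custom v2 session catalog. If the placeholder class
# cannot be instantiated, current behavior is to log an error and silently fall
# back to the default catalog; this ticket proposes to fail fast instead.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.spark_catalog", "com.example.MyV2SessionCatalog")
    .getOrCreate()
)
spark.sql("SHOW TABLES").show()
{code}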



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33240:


Assignee: (was: Apache Spark)

> Fail fast when fails to instantiate configured v2 session catalog
> -
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Currently, Spark falls back to the "default catalog" when it fails to instantiate 
> the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and it 
> pollutes the log file because it is logged every time the catalog is resolved. 
> More importantly, this should be considered incorrect behavior: end users set 
> the custom catalog intentionally, and Spark sometimes silently ignores it, 
> defeating that intention.
> We should simply fail in this case so that end users notice the failure earlier 
> and can fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220513#comment-17220513
 ] 

Apache Spark commented on SPARK-33240:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/30147

> Fail fast when fails to instantiate configured v2 session catalog
> -
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Currently, Spark falls back to the "default catalog" when it fails to instantiate 
> the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and it 
> pollutes the log file because it is logged every time the catalog is resolved. 
> More importantly, this should be considered incorrect behavior: end users set 
> the custom catalog intentionally, and Spark sometimes silently ignores it, 
> defeating that intention.
> We should simply fail in this case so that end users notice the failure earlier 
> and can fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33240:


Assignee: Apache Spark

> Fail fast when fails to instantiate configured v2 session catalog
> -
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark falls back to the "default catalog" when it fails to instantiate 
> the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and it 
> pollutes the log file because it is logged every time the catalog is resolved. 
> More importantly, this should be considered incorrect behavior: end users set 
> the custom catalog intentionally, and Spark sometimes silently ignores it, 
> defeating that intention.
> We should simply fail in this case so that end users notice the failure earlier 
> and can fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33243) Add numpydoc into documentation dependency

2020-10-25 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33243:


 Summary: Add numpydoc into documentation dependency
 Key: SPARK-33243
 URL: https://issues.apache.org/jira/browse/SPARK-33243
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


To switch the docstring format, we should add the numpydoc package to the Sphinx build.
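A minimal sketch of what that implies for the Sphinx configuration; the file path and option value below are assumptions, not something prescribed by this ticket.

{code:python}
# Sketch of enabling numpydoc in the Sphinx conf.py (assumed location:
# python/docs/source/conf.py). numpydoc parses NumPy-style docstrings.
extensions = [
    "sphinx.ext.autodoc",
    "numpydoc",
]
# Optional: keep class pages compact by not auto-listing all members.
numpydoc_show_class_members = False
{code}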



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33241) Dynamic pruning for data column

2020-10-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220508#comment-17220508
 ] 

Apache Spark commented on SPARK-33241:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30146

> Dynamic pruning for data column
> ---
>
> Key: SPARK-33241
> URL: https://issues.apache.org/jira/browse/SPARK-33241
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> SPARK-11150 added dynamic partition pruning. That feature has some limitations:
> 1. It only supports pruning on partition columns.
> 2. It is rarely used in production because joins on partition columns are 
> relatively uncommon.
> This ticket aims to add support for dynamic data (column) pruning.
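To illustrate the query shape this targets (all names are made up), consider a selective join on a plain data column, where a runtime filter built from the dimension side could skip most of the fact-table data:

{code:python}
# Illustrative only: a join on a non-partition ("data") column with a selective
# filter on the small side, the case dynamic data pruning would help with.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    SELECT f.*
    FROM fact_sales f
    JOIN dim_store s
      ON f.store_id = s.store_id   -- store_id is a data column, not a partition column
    WHERE s.country = 'US'         -- selective filter on the dimension side
""").explain()
{code}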



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32797) Install mypy on the Jenkins CI workers

2020-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32797:


Assignee: Shane Knapp

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.1.0
>Reporter: Fokko Driesprong
>Assignee: Shane Knapp
>Priority: Major
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install this using pip: [https://pypi.org/project/mypy/] Should be 
> similar to flake8 and sphinx. The latest version is ok! Thanks!
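For context, a small illustration (not from the Spark code base) of the kind of check mypy performs once installed:

{code:python}
# example.py -- made-up snippet; running `mypy example.py` flags the bad call below.
def take(rdd_count: int, num: int) -> int:
    return min(rdd_count, num)

take(10, "3")  # mypy: Argument 2 to "take" has incompatible type "str"; expected "int"
{code}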



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33241) Dynamic pruning for data column

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33241:


Assignee: Apache Spark

> Dynamic pruning for data column
> ---
>
> Key: SPARK-33241
> URL: https://issues.apache.org/jira/browse/SPARK-33241
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-11150 added dynamic partition pruning. That feature has some limitations:
> 1. It only supports pruning on partition columns.
> 2. It is rarely used in production because joins on partition columns are 
> relatively uncommon.
> This ticket aims to add support for dynamic data (column) pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33241) Dynamic pruning for data column

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33241:


Assignee: (was: Apache Spark)

> Dynamic pruning for data column
> ---
>
> Key: SPARK-33241
> URL: https://issues.apache.org/jira/browse/SPARK-33241
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> SPARK-11150 added dynamic partition pruning. That feature has some limitations:
> 1. It only supports pruning on partition columns.
> 2. It is rarely used in production because joins on partition columns are 
> relatively uncommon.
> This ticket aims to add support for dynamic data (column) pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33241) Dynamic pruning for data column

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33241:


Assignee: Apache Spark

> Dynamic pruning for data column
> ---
>
> Key: SPARK-33241
> URL: https://issues.apache.org/jira/browse/SPARK-33241
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-11150 added dynamic partition pruning. That feature has some limitations:
> 1. It only supports pruning on partition columns.
> 2. It is rarely used in production because joins on partition columns are 
> relatively uncommon.
> This ticket aims to add support for dynamic data (column) pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33242) Install numpydoc in Jenkins machines

2020-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33242:


Assignee: Shane Knapp

> Install numpydoc in Jenkins machines
> 
>
> Key: SPARK-33242
> URL: https://issues.apache.org/jira/browse/SPARK-33242
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Shane Knapp
>Priority: Major
>
> To switch from reST style to numpydoc style, we should install numpydoc as 
> well. It is used by Sphinx. See the parent JIRA as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33242) Install numpydoc in Jenkins machines

2020-10-25 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33242:


 Summary: Install numpydoc in Jenkins machines
 Key: SPARK-33242
 URL: https://issues.apache.org/jira/browse/SPARK-33242
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


To switch from reST style to numpydoc style, we should install numpydoc as well. 
It is used by Sphinx. See the parent JIRA as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32195) Standardize warning types and messages

2020-10-25 Thread Kevin Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220501#comment-17220501
 ] 

Kevin Su commented on SPARK-32195:
--

okay, thanks.

> Standardize warning types and messages
> --
>
> Key: SPARK-32195
> URL: https://issues.apache.org/jira/browse/SPARK-32195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, PySpark uses somewhat inconsistent warning types and messages, such 
> as a plain UserWarning. We should standardize them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32195) Standardize warning types and messages

2020-10-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220496#comment-17220496
 ] 

Hyukjin Kwon commented on SPARK-32195:
--

It doesn't have to be assigned. Once a PR is open, it will be assigned 
automatically.

> Standardize warning types and messages
> --
>
> Key: SPARK-32195
> URL: https://issues.apache.org/jira/browse/SPARK-32195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, PySpark uses somewhat inconsistent warning types and messages, such 
> as a plain UserWarning. We should standardize them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32195) Standardize warning types and messages

2020-10-25 Thread Kevin Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220492#comment-17220492
 ] 

Kevin Su commented on SPARK-32195:
--

Hi, [~hyukjin.kwon]. I'm Kevin; could you assign this issue to me? I'd like to 
solve it.

> Standardize warning types and messages
> --
>
> Key: SPARK-32195
> URL: https://issues.apache.org/jira/browse/SPARK-32195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, PySpark uses somewhat inconsistent warning types and messages, such 
> as a plain UserWarning. We should standardize them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33241) Dynamic pruning for data column

2020-10-25 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33241:
---

 Summary: Dynamic pruning for data column
 Key: SPARK-33241
 URL: https://issues.apache.org/jira/browse/SPARK-33241
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


SPARK-11150 added dynamic partition pruning. That feature has some limitations:
1. It only supports pruning on partition columns.
2. It is rarely used in production because joins on partition columns are 
relatively uncommon.

This ticket aims to add support for dynamic data (column) pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-25 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220487#comment-17220487
 ] 

Jungtaek Lim commented on SPARK-33240:
--

Will work on it.

> Fail fast when fails to instantiate configured v2 session catalog
> -
>
> Key: SPARK-33240
> URL: https://issues.apache.org/jira/browse/SPARK-33240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Currently, Spark falls back to the "default catalog" when it fails to instantiate 
> the configured v2 session catalog.
> The error log message says nothing about why the instantiation failed, and it 
> pollutes the log file because it is logged every time the catalog is resolved. 
> More importantly, this should be considered incorrect behavior: end users set 
> the custom catalog intentionally, and Spark sometimes silently ignores it, 
> defeating that intention.
> We should simply fail in this case so that end users notice the failure earlier 
> and can fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33240) Fail fast when fails to instantiate configured v2 session catalog

2020-10-25 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-33240:


 Summary: Fail fast when fails to instantiate configured v2 session 
catalog
 Key: SPARK-33240
 URL: https://issues.apache.org/jira/browse/SPARK-33240
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0, 3.1.0
Reporter: Jungtaek Lim


Currently, Spark falls back to the "default catalog" when it fails to instantiate 
the configured v2 session catalog.

The error log message says nothing about why the instantiation failed, and it 
pollutes the log file because it is logged every time the catalog is resolved. 
More importantly, this should be considered incorrect behavior: end users set 
the custom catalog intentionally, and Spark sometimes silently ignores it, 
defeating that intention.

We should simply fail in this case so that end users notice the failure earlier 
and can fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32194) Standardize exceptions in PySpark

2020-10-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220477#comment-17220477
 ] 

Hyukjin Kwon commented on SPARK-32194:
--

There should be some kind of simple design for that. It doesn't exist yet.
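One purely hypothetical direction such a design could take (as noted above, none exists yet) is a small base-exception hierarchy; all class and function names below are made up:

{code:python}
# Hypothetical sketch only -- not an existing PySpark API.
class PySparkException(Exception):
    """Base class for user-facing PySpark errors."""

class PySparkValueError(PySparkException, ValueError):
    """Invalid argument value passed to a PySpark API."""

class PySparkTypeError(PySparkException, TypeError):
    """Wrong argument type passed to a PySpark API."""


def require_positive(name: str, value: int) -> int:
    # Example call site: raise a specific, standardized error instead of a bare Exception.
    if value <= 0:
        raise PySparkValueError(f"Argument `{name}` must be positive, got {value}.")
    return value
{code}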

> Standardize exceptions in PySpark
> -
>
> Key: SPARK-32194
> URL: https://issues.apache.org/jira/browse/SPARK-32194
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, PySpark throws {{Exception}} or just {{RuntimeException}} in many 
> cases. We should standardize them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32195) Standardize warning types and messages

2020-10-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220478#comment-17220478
 ] 

Hyukjin Kwon commented on SPARK-32195:
--

We'll have to follow 
https://docs.python.org/3/library/warnings.html#warning-categories
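For illustration, a small example of picking one of those standard categories instead of the default UserWarning; the function is made up:

{code:python}
# Illustration of the standard warning categories linked above.
import warnings

def old_api(x):
    warnings.warn(
        "old_api is deprecated; use new_api instead.",
        FutureWarning,   # shown to end users by default, unlike DeprecationWarning
        stacklevel=2,
    )
    return x

old_api(42)
{code}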

> Standardize warning types and messages
> --
>
> Key: SPARK-32195
> URL: https://issues.apache.org/jira/browse/SPARK-32195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, PySpark uses somewhat inconsistent warning types and messages, such 
> as a plain UserWarning. We should standardize them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32862) Left semi stream-stream join

2020-10-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-32862:


Assignee: Cheng Su

> Left semi stream-stream join
> 
>
> Key: SPARK-32862
> URL: https://issues.apache.org/jira/browse/SPARK-32862
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Major
>
> Current stream-stream join supports inner, left outer, and right outer joins 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]).
> Internally, we see many users relying on left semi stream-stream joins (outside 
> Spark Structured Streaming), e.g. "I want the ad impressions (left side) that 
> have a click (right side), but I don't care how many clicks each ad received" 
> (left semi semantics).
>  
> A left semi stream-stream join will work as follows:
> (1) For a left-side input row, check whether there is a match in the right-side 
> state store.
>   (1.1) If there is a match, output the left-side row.
>   (1.2) If there is no match, put the row in the left-side state store, with 
> its "matched" field set to false.
> (2) For a right-side input row, check whether there is a match in the left-side 
> state store. If there is, update the matching left-side row's state to set its 
> "matched" field to true. Then put the right-side row in the right-side state 
> store.
> (3) When a left-side row needs to be evicted from the state store, output it if 
> its "matched" field is true.
> (4) When a right-side row needs to be evicted from the state store, do nothing.
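A usage sketch, assuming the left semi stream-stream join type added by this ticket is available (targeted at Spark 3.1); the sources, columns, and time bound are made up:

{code:python}
# Usage sketch only; assumes stream-stream left semi join support (this ticket).
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

impressions = (
    spark.readStream.format("rate").load()
    .selectExpr("value AS ad_id", "timestamp AS imp_time")
    .withWatermark("imp_time", "10 minutes")
    .alias("impressions")
)
clicks = (
    spark.readStream.format("rate").load()
    .selectExpr("value AS ad_id", "timestamp AS click_time")
    .withWatermark("click_time", "10 minutes")
    .alias("clicks")
)

# Emit each impression that has at least one click within the bound; multiple
# clicks per ad do not multiply the output (left semi semantics).
matched = impressions.join(
    clicks,
    expr("""
        impressions.ad_id = clicks.ad_id AND
        clicks.click_time BETWEEN impressions.imp_time
                              AND impressions.imp_time + INTERVAL 1 HOUR
    """),
    "left_semi",
)
# A sink would follow, e.g. matched.writeStream.format("console").start().
{code}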



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32862) Left semi stream-stream join

2020-10-25 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-32862.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30076
[https://github.com/apache/spark/pull/30076]

> Left semi stream-stream join
> 
>
> Key: SPARK-32862
> URL: https://issues.apache.org/jira/browse/SPARK-32862
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Major
> Fix For: 3.1.0
>
>
> Current stream-stream join supports inner, left outer, and right outer joins 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]).
> Internally, we see many users relying on left semi stream-stream joins (outside 
> Spark Structured Streaming), e.g. "I want the ad impressions (left side) that 
> have a click (right side), but I don't care how many clicks each ad received" 
> (left semi semantics).
>  
> A left semi stream-stream join will work as follows:
> (1) For a left-side input row, check whether there is a match in the right-side 
> state store.
>   (1.1) If there is a match, output the left-side row.
>   (1.2) If there is no match, put the row in the left-side state store, with 
> its "matched" field set to false.
> (2) For a right-side input row, check whether there is a match in the left-side 
> state store. If there is, update the matching left-side row's state to set its 
> "matched" field to true. Then put the right-side row in the right-side state 
> store.
> (3) When a left-side row needs to be evicted from the state store, output it if 
> its "matched" field is true.
> (4) When a right-side row needs to be evicted from the state store, do nothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32194) Standardize exceptions in PySpark

2020-10-25 Thread RISHAV DUTTA (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220469#comment-17220469
 ] 

RISHAV DUTTA commented on SPARK-32194:
--

I would like to work on this. Please provide a guide on how to standardise 
exceptions.

> Standardize exceptions in PySpark
> -
>
> Key: SPARK-32194
> URL: https://issues.apache.org/jira/browse/SPARK-32194
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, PySpark throws {{Exception}} or just {{RuntimeException}} in many 
> cases. We should standardize them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32195) Standardize warning types and messages

2020-10-25 Thread RISHAV DUTTA (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220468#comment-17220468
 ] 

RISHAV DUTTA commented on SPARK-32195:
--

Hi, is there a guide on how to standardise the exceptions?

> Standardize warning types and messages
> --
>
> Key: SPARK-32195
> URL: https://issues.apache.org/jira/browse/SPARK-32195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, PySpark uses somewhat inconsistent warning types and messages, such 
> as a plain UserWarning. We should standardize them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32390) TRANSFORM with hive serde support CalendarIntervalType and UserDefinedType

2020-10-25 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-32390:
--
Description: 
Support these types by converting them to strings.

UDT: 
[https://github.com/apache/spark/pull/29085/commits/e367c0544298f6639e8898029eab5e29ea1f91ea]

https://github.com/apache/spark/pull/29085#issuecomment-660580865

  was:
convert as string to support.

UTD 

https://github.com/apache/spark/pull/29085/commits/e367c0544298f6639e8898029eab5e29ea1f91ea


> TRANSFORM with hive serde support CalendarIntervalType and UserDefinedType
> --
>
> Key: SPARK-32390
> URL: https://issues.apache.org/jira/browse/SPARK-32390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Support these types by converting them to strings.
> UDT: 
> [https://github.com/apache/spark/pull/29085/commits/e367c0544298f6639e8898029eab5e29ea1f91ea]
> https://github.com/apache/spark/pull/29085#issuecomment-660580865



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32390) TRANSFORM with hive serde support CalendarIntervalType and UserDefinedType

2020-10-25 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu reopened SPARK-32390:
---

This still needs to be implemented.

> TRANSFORM with hive serde support CalendarIntervalType and UserDefinedType
> --
>
> Key: SPARK-32390
> URL: https://issues.apache.org/jira/browse/SPARK-32390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Support these types by converting them to strings.
> UDT: 
> https://github.com/apache/spark/pull/29085/commits/e367c0544298f6639e8898029eab5e29ea1f91ea



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33239) Use pre-built image at GitHub Action SparkR job

2020-10-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220458#comment-17220458
 ] 

Apache Spark commented on SPARK-33239:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30066

> Use pre-built image at GitHub Action SparkR job
> ---
>
> Key: SPARK-33239
> URL: https://issues.apache.org/jira/browse/SPARK-33239
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33239) Use pre-built image at GitHub Action SparkR job

2020-10-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220459#comment-17220459
 ] 

Apache Spark commented on SPARK-33239:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30066

> Use pre-built image at GitHub Action SparkR job
> ---
>
> Key: SPARK-33239
> URL: https://issues.apache.org/jira/browse/SPARK-33239
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33239) Use pre-built image at GitHub Action SparkR job

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33239:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> Use pre-built image at GitHub Action SparkR job
> ---
>
> Key: SPARK-33239
> URL: https://issues.apache.org/jira/browse/SPARK-33239
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33239) Use pre-built image at GitHub Action SparkR job

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33239:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> Use pre-built image at GitHub Action SparkR job
> ---
>
> Key: SPARK-33239
> URL: https://issues.apache.org/jira/browse/SPARK-33239
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32390) TRANSFORM with hive serde support CalendarIntervalType and UserDefinedType

2020-10-25 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-32390.
---
Resolution: Fixed

> TRANSFORM with hive serde support CalendarIntervalType and UserDefinedType
> --
>
> Key: SPARK-32390
> URL: https://issues.apache.org/jira/browse/SPARK-32390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Support these types by converting them to strings.
> UDT: 
> https://github.com/apache/spark/pull/29085/commits/e367c0544298f6639e8898029eab5e29ea1f91ea



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33239) Use pre-built image at GitHub Action SparkR job

2020-10-25 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33239:
-

 Summary: Use pre-built image at GitHub Action SparkR job
 Key: SPARK-33239
 URL: https://issues.apache.org/jira/browse/SPARK-33239
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun
Assignee: Dongjoon Hyun
 Fix For: 3.1.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32390) TRANSFORM with hive serde support CalendarIntervalType and UserDefinedType

2020-10-25 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-32390:
--
Description: 
Support these types by converting them to strings.

UDT: 
https://github.com/apache/spark/pull/29085/commits/e367c0544298f6639e8898029eab5e29ea1f91ea

  was:convert as string to support


> TRANSFORM with hive serde support CalendarIntervalType and UserDefinedType
> --
>
> Key: SPARK-32390
> URL: https://issues.apache.org/jira/browse/SPARK-32390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Support these types by converting them to strings.
> UDT: 
> https://github.com/apache/spark/pull/29085/commits/e367c0544298f6639e8898029eab5e29ea1f91ea



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31937) Support processing array and map type using spark noserde mode

2020-10-25 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-31937:
--
Description: 
Currently, using a script (e.g. Python) to process array-type or map-type columns 
is not supported; it fails with the messages below:
 {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to 
[Ljava.lang.Object}}
 {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to 
java.util.Map}}

To support it:

https://github.com/apache/spark/pull/29085/commits/43d0f24f2c769dc270cf7e5fa2c5c13c32d0a631?file-filters%5B%5D=.scala

  was:
Currently, It is not supported to use script(e.g. python) to process array type 
or map type, it will complain with below message:
{{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to 
[Ljava.lang.Object}}
{{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to 
java.util.Map}}



To support it


> Support processing array and map type using spark noserde mode
> --
>
> Key: SPARK-31937
> URL: https://issues.apache.org/jira/browse/SPARK-31937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Currently, using a script (e.g. Python) to process array-type or map-type 
> columns is not supported; it fails with the messages below:
>  {{org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast 
> to [Ljava.lang.Object}}
>  {{org.apache.spark.sql.catalyst.expressions.UnsafeMapData cannot be cast to 
> java.util.Map}}
> To support it:
>  
> https://github.com/apache/spark/pull/29085/commits/43d0f24f2c769dc270cf7e5fa2c5c13c32d0a631?file-filters%5B%5D=.scala
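A reproduction sketch of the limitation, assuming script transformation is available in the build (e.g. with Hive support enabled); the view and column names are made up:

{code:python}
# Reproduction sketch only: piping array/map columns through a script, which this
# ticket reports as failing with the UnsafeArrayData / UnsafeMapData cast errors.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("CREATE OR REPLACE TEMPORARY VIEW t AS SELECT array(1, 2, 3) AS arr, map('k', 1) AS m")
spark.sql("SELECT TRANSFORM(arr, m) USING 'cat' AS (line STRING) FROM t").show(truncate=False)
{code}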



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33238) Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

2020-10-25 Thread angerszhu (Jira)
angerszhu created SPARK-33238:
-

 Summary: Add a configuration to control the legacy behavior of 
whether need to pad null value when value size less then schema size
 Key: SPARK-33238
 URL: https://issues.apache.org/jira/browse/SPARK-33238
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: angerszhu


See this review comment: https://github.com/apache/spark/pull/29421#discussion_r511684691



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32388) TRANSFORM when schema less should keep same with hive

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32388:


Assignee: Apache Spark  (was: angerszhu)

> TRANSFORM when schema less should keep same with hive
> -
>
> Key: SPARK-32388
> URL: https://issues.apache.org/jira/browse/SPARK-32388
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Hive transform without schema
>  
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int);
> hive> INSERT INTO t VALUES (1, 1, 1);
> hive> INSERT INTO t VALUES (2, 2, 2);
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
> hive> DESCRIBE v;
> key   string  
> value string  
> hive> SELECT * FROM v;
> 1 1   1
> 2 2   2
> hive> SELECT key FROM v;
> 1
> 2
> hive> SELECT value FROM v;
> 1 1
> 2 2{code}
> Spark
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int); 
> hive> INSERT INTO t VALUES (1, 1, 1); 
> hive> INSERT INTO t VALUES (2, 2, 2); 
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; 
> hive> SELECT * FROM v; 
> 1   11
> 2   22 {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32388) TRANSFORM when schema less should keep same with hive

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32388:


Assignee: angerszhu  (was: Apache Spark)

> TRANSFORM when schema less should keep same with hive
> -
>
> Key: SPARK-32388
> URL: https://issues.apache.org/jira/browse/SPARK-32388
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Hive transform without schema
>  
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int);
> hive> INSERT INTO t VALUES (1, 1, 1);
> hive> INSERT INTO t VALUES (2, 2, 2);
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
> hive> DESCRIBE v;
> key   string  
> value string  
> hive> SELECT * FROM v;
> 1 1   1
> 2 2   2
> hive> SELECT key FROM v;
> 1
> 2
> hive> SELECT value FROM v;
> 1 1
> 2 2{code}
> Spark
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int); 
> hive> INSERT INTO t VALUES (1, 1, 1); 
> hive> INSERT INTO t VALUES (2, 2, 2); 
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; 
> hive> SELECT * FROM v; 
> 1   11
> 2   22 {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32388) TRANSFORM when schema less should keep same with hive

2020-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-32388:
--

> TRANSFORM when schema less should keep same with hive
> -
>
> Key: SPARK-32388
> URL: https://issues.apache.org/jira/browse/SPARK-32388
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> Hive transform without schema
>  
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int);
> hive> INSERT INTO t VALUES (1, 1, 1);
> hive> INSERT INTO t VALUES (2, 2, 2);
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
> hive> DESCRIBE v;
> key   string  
> value string  
> hive> SELECT * FROM v;
> 1 1   1
> 2 2   2
> hive> SELECT key FROM v;
> 1
> 2
> hive> SELECT value FROM v;
> 1 1
> 2 2{code}
> Spark
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int); 
> hive> INSERT INTO t VALUES (1, 1, 1); 
> hive> INSERT INTO t VALUES (2, 2, 2); 
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; 
> hive> SELECT * FROM v; 
> 1   11
> 2   22 {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32388) TRANSFORM when schema less should keep same with hive

2020-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32388:
-
Fix Version/s: (was: 3.1.0)

> TRANSFORM when schema less should keep same with hive
> -
>
> Key: SPARK-32388
> URL: https://issues.apache.org/jira/browse/SPARK-32388
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Hive transform without schema
>  
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int);
> hive> INSERT INTO t VALUES (1, 1, 1);
> hive> INSERT INTO t VALUES (2, 2, 2);
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
> hive> DESCRIBE v;
> key   string  
> value string  
> hive> SELECT * FROM v;
> 1 1   1
> 2 2   2
> hive> SELECT key FROM v;
> 1
> 2
> hive> SELECT value FROM v;
> 1 1
> 2 2{code}
> Spark
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int); 
> hive> INSERT INTO t VALUES (1, 1, 1); 
> hive> INSERT INTO t VALUES (2, 2, 2); 
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; 
> hive> SELECT * FROM v; 
> 1   11
> 2   22 {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32388) TRANSFORM when schema less should keep same with hive

2020-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32388:


Assignee: angerszhu

> TRANSFORM when schema less should keep same with hive
> -
>
> Key: SPARK-32388
> URL: https://issues.apache.org/jira/browse/SPARK-32388
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Hive transform without schema
>  
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int);
> hive> INSERT INTO t VALUES (1, 1, 1);
> hive> INSERT INTO t VALUES (2, 2, 2);
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
> hive> DESCRIBE v;
> key   string  
> value string  
> hive> SELECT * FROM v;
> 1 1   1
> 2 2   2
> hive> SELECT key FROM v;
> 1
> 2
> hive> SELECT value FROM v;
> 1 1
> 2 2{code}
> Spark
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int); 
> hive> INSERT INTO t VALUES (1, 1, 1); 
> hive> INSERT INTO t VALUES (2, 2, 2); 
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; 
> hive> SELECT * FROM v; 
> 1   11
> 2   22 {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32388) TRANSFORM when schema less should keep same with hive

2020-10-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32388.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29421
[https://github.com/apache/spark/pull/29421]

> TRANSFORM when schema less should keep same with hive
> -
>
> Key: SPARK-32388
> URL: https://issues.apache.org/jira/browse/SPARK-32388
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> Hive transform without schema
>  
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int);
> hive> INSERT INTO t VALUES (1, 1, 1);
> hive> INSERT INTO t VALUES (2, 2, 2);
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
> hive> DESCRIBE v;
> key   string  
> value string  
> hive> SELECT * FROM v;
> 1 1   1
> 2 2   2
> hive> SELECT key FROM v;
> 1
> 2
> hive> SELECT value FROM v;
> 1 1
> 2 2{code}
> Spark
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int); 
> hive> INSERT INTO t VALUES (1, 1, 1); 
> hive> INSERT INTO t VALUES (2, 2, 2); 
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; 
> hive> SELECT * FROM v; 
> 1   11
> 2   22 {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job

2020-10-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220424#comment-17220424
 ] 

Dongjoon Hyun commented on SPARK-33237:
---

Hi, [~shaneknapp]. Could you remove `-Phadoop-2.7` from the following?

{code}
./dev/make-distribution.sh --name ${DATE}-${REVISION} --r --pip --tgz \
  -DzincPort=${ZINC_PORT} \
  -Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
{code}
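
Presumably the updated invocation just drops the profile and relies on the default one; a sketch, assuming the rest of the Jenkins job stays unchanged:

{code}
./dev/make-distribution.sh --name ${DATE}-${REVISION} --r --pip --tgz \
  -DzincPort=${ZINC_PORT} \
  -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
{code}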

> Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT 
> Jenkins job
> --
>
> Key: SPARK-33237
> URL: https://issues.apache.org/jira/browse/SPARK-33237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since Apache Spark 3.1.0, the default Hadoop version is 3.2.0.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job

2020-10-25 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33237:
-

 Summary: Use default Hadoop profile by removing explicit 
`-Phadoop-2.7` from K8s IT Jenkins job
 Key: SPARK-33237
 URL: https://issues.apache.org/jira/browse/SPARK-33237
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


Since Apache Spark 3.1.0, the default Hadoop version is 3.2.0.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33236) Enable Push-based shuffle service to store state in NM level DB for work preserving restart

2020-10-25 Thread Chandni Singh (Jira)
Chandni Singh created SPARK-33236:
-

 Summary: Enable Push-based shuffle service to store state in NM 
level DB for work preserving restart
 Key: SPARK-33236
 URL: https://issues.apache.org/jira/browse/SPARK-33236
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.1.0
Reporter: Chandni Singh






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33235) Push-based Shuffle Phase 2 Tasks

2020-10-25 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated SPARK-33235:
--
Description: 
This is the parent JIRA for the phase 2 (follow-up) tasks for supporting 
push-based shuffle. Refer to SPARK-30602.


  was:
In a large Spark compute infrastructure deployment, Spark shuffle is becoming a 
potential scaling bottleneck and a source of inefficiency in the cluster. When 
running Spark on YARN at large scale, people usually enable the Spark external 
shuffle service and store the intermediate shuffle files on HDD. Because the 
number of blocks generated for a particular shuffle grows quadratically with the 
size of the shuffled data (the number of mappers and reducers grows linearly with 
the data size, but the number of blocks is # mappers * # reducers), one general 
trend we have observed is that the more data a Spark application processes, the 
smaller the block size becomes. In a few production clusters we have seen, the 
average shuffle block size is only tens of KBs. Because random reads on HDD are 
inefficient for small amounts of data, the overall efficiency of the Spark 
external shuffle services serving the shuffle blocks degrades as an increasing 
number of Spark applications process an increasing amount of data. In addition, 
because the Spark external shuffle service is shared in a multi-tenant cluster, 
the inefficiency of one Spark application can propagate to other applications as 
well.

In this ticket, we propose a solution to improve Spark shuffle efficiency in the 
environments described above with push-based shuffle. With push-based shuffle, 
shuffle is performed at the end of the mappers, and blocks get pre-merged and 
moved towards the reducers. In our prototype implementation, we have seen 
significant efficiency improvements when performing large shuffles. We take a 
Spark-native approach to achieve this, i.e., extending Spark's existing shuffle 
Netty protocol and the behaviors of Spark mappers, reducers, and drivers. This 
way, we can bring the benefits of more efficient shuffle to Spark without 
incurring the dependency on, or overhead of, either a specialized storage layer 
or external infrastructure pieces.

 

Link to dev mailing list discussion: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html


> Push-based Shuffle Phase 2 Tasks
> 
>
> Key: SPARK-33235
> URL: https://issues.apache.org/jira/browse/SPARK-33235
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>  Labels: release-notes
>
> This is the parent JIRA for the phase 2 (follow-up) tasks for supporting 
> push-based shuffle. Refer to SPARK-30602.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33235) Push-based Shuffle Phase 2 Tasks

2020-10-25 Thread Chandni Singh (Jira)
Chandni Singh created SPARK-33235:
-

 Summary: Push-based Shuffle Phase 2 Tasks
 Key: SPARK-33235
 URL: https://issues.apache.org/jira/browse/SPARK-33235
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 3.1.0
Reporter: Chandni Singh


In a large Spark compute infrastructure deployment, Spark shuffle is becoming a 
potential scaling bottleneck and a source of inefficiency in the cluster. When 
running Spark on YARN at large scale, people usually enable the Spark external 
shuffle service and store the intermediate shuffle files on HDD. Because the 
number of blocks generated for a particular shuffle grows quadratically with the 
size of the shuffled data (the number of mappers and reducers grows linearly with 
the data size, but the number of blocks is # mappers * # reducers), one general 
trend we have observed is that the more data a Spark application processes, the 
smaller the block size becomes. In a few production clusters we have seen, the 
average shuffle block size is only tens of KBs. Because random reads on HDD are 
inefficient for small amounts of data, the overall efficiency of the Spark 
external shuffle services serving the shuffle blocks degrades as an increasing 
number of Spark applications process an increasing amount of data. In addition, 
because the Spark external shuffle service is shared in a multi-tenant cluster, 
the inefficiency of one Spark application can propagate to other applications as 
well.

In this ticket, we propose a solution to improve Spark shuffle efficiency in the 
environments described above with push-based shuffle. With push-based shuffle, 
shuffle is performed at the end of the mappers, and blocks get pre-merged and 
moved towards the reducers. In our prototype implementation, we have seen 
significant efficiency improvements when performing large shuffles. We take a 
Spark-native approach to achieve this, i.e., extending Spark's existing shuffle 
Netty protocol and the behaviors of Spark mappers, reducers, and drivers. This 
way, we can bring the benefits of more efficient shuffle to Spark without 
incurring the dependency on, or overhead of, either a specialized storage layer 
or external infrastructure pieces.

 

Link to dev mailing list discussion: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html
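
As a back-of-the-envelope illustration of the quadratic block growth described above, here is a small Scala sketch with purely hypothetical numbers (not measurements from any real cluster):

{code:scala}
// Hypothetical sizing: mappers and reducers grow roughly linearly with data size,
// but the number of shuffle blocks is mappers * reducers, so block size shrinks.
val shuffledBytes = 1L << 40                    // 1 TiB of shuffled data
val mappers       = 10000
val reducers      = 10000
val blocks        = mappers.toLong * reducers   // 100,000,000 blocks
val avgBlockBytes = shuffledBytes / blocks      // ~11 KB per block ("tens of KBs")
println(s"$blocks blocks, ~$avgBlockBytes bytes each")
{code}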



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33234) Generates SHA-512 using shasum

2020-10-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33234.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30123
[https://github.com/apache/spark/pull/30123]

> Generates SHA-512 using shasum
> --
>
> Key: SPARK-33234
> URL: https://issues.apache.org/jira/browse/SPARK-33234
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33234) Generates SHA-512 using shasum

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33234:


Assignee: (was: Apache Spark)

> Generates SHA-512 using shasum
> --
>
> Key: SPARK-33234
> URL: https://issues.apache.org/jira/browse/SPARK-33234
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33234) Generates SHA-512 using shasum

2020-10-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33234:


Assignee: Apache Spark

> Generates SHA-512 using shasum
> --
>
> Key: SPARK-33234
> URL: https://issues.apache.org/jira/browse/SPARK-33234
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33234) Generates SHA-512 using shasum

2020-10-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220415#comment-17220415
 ] 

Apache Spark commented on SPARK-33234:
--

User 'emilianbold' has created a pull request for this issue:
https://github.com/apache/spark/pull/30123

> Generates SHA-512 using shasum
> --
>
> Key: SPARK-33234
> URL: https://issues.apache.org/jira/browse/SPARK-33234
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33234) Generates SHA-512 using shasum

2020-10-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220414#comment-17220414
 ] 

Apache Spark commented on SPARK-33234:
--

User 'emilianbold' has created a pull request for this issue:
https://github.com/apache/spark/pull/30123

> Generates SHA-512 using shasum
> --
>
> Key: SPARK-33234
> URL: https://issues.apache.org/jira/browse/SPARK-33234
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33234) Generates SHA-512 using shasum

2020-10-25 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33234:
-

 Summary: Generates SHA-512 using shasum
 Key: SPARK-33234
 URL: https://issues.apache.org/jira/browse/SPARK-33234
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun
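
The ticket has no description, but the title presumably refers to generating the release artifacts' `.sha512` files with the `shasum` tool. A hedged sketch of the kind of command involved (the artifact name is a placeholder):

{code}
# Generate a SHA-512 checksum file for a release artifact (placeholder file name).
shasum -a 512 spark-3.1.0-bin-hadoop3.2.tgz > spark-3.1.0-bin-hadoop3.2.tgz.sha512

# Verify it later.
shasum -a 512 -c spark-3.1.0-bin-hadoop3.2.tgz.sha512
{code}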






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan

2020-10-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33228.
---
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30140
[https://github.com/apache/spark/pull/30140]

> Don't uncache data when replacing an existing view having the same plan
> ---
>
> Key: SPARK-33228
> URL: https://issues.apache.org/jira/browse/SPARK-33228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache 
> when replacing an existing view. But this change drops the cache even when 
> replacing a view that has the same logical plan. A sequence of queries to 
> reproduce this is as follows:
> {code}
> scala> val df = spark.range(1).selectExpr("id a", "id b")
> scala> df.cache()
> scala> df.explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> // If one re-runs the same query `df.createOrReplaceTempView("t")`, the 
> cache is swept away
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
> +- *(1) Range (0, 1, step=1, splits=4)
> {code}
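
For a quick check that does not require reading query plans, one can also inspect the catalog's cache state; a minimal spark-shell sketch continuing the repro above (the expected result after the fix is an assumption based on this ticket's intent):

{code:scala}
scala> df.cache()
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.isCached("t")       // true: the view's plan hits the cached data
scala> df.createOrReplaceTempView("t")   // replace the view with the SAME plan
scala> spark.catalog.isCached("t")       // expected true after the fix; false before it
{code}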



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan

2020-10-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33228:
-

Assignee: Takeshi Yamamuro

> Don't uncache data when replacing an existing view having the same plan
> ---
>
> Key: SPARK-33228
> URL: https://issues.apache.org/jira/browse/SPARK-33228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache 
> when replacing an existing view. But this change drops the cache even when 
> replacing a view that has the same logical plan. A sequence of queries to 
> reproduce this is as follows:
> {code}
> scala> val df = spark.range(1).selectExpr("id a", "id b")
> scala> df.cache()
> scala> df.explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> // If one re-runs the same query `df.createOrReplaceTempView("t")`, the 
> cache is swept away
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
> +- *(1) Range (0, 1, step=1, splits=4)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31342) Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582

2020-10-25 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins resolved SPARK-31342.
---
Resolution: Duplicate

> Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582
> 
>
> Key: SPARK-31342
> URL: https://issues.apache.org/jira/browse/SPARK-31342
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Some users may not know they are creating and/or reading DATE or TIMESTAMP 
> data from before October 15, 1582 (because of data encoding libraries, etc.). 
> Therefore, it may not be clear whether they need to toggle the two 
> rebaseDateTime config settings.
> By default, Spark should fail if it reads or writes data from before October 
> 15, 1582.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31342) Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582

2020-10-25 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220274#comment-17220274
 ] 

Maxim Gekk commented on SPARK-31342:


[~cloud_fan] I think we can close this because it has already been implemented.

> Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582
> 
>
> Key: SPARK-31342
> URL: https://issues.apache.org/jira/browse/SPARK-31342
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Some users may not know they are creating and/or reading DATE or TIMESTAMP 
> data from before October 15, 1582 (because of data encoding libraries, etc.). 
> Therefore, it may not be clear whether they need to toggle the two 
> rebaseDateTime config settings.
> By default, Spark should fail if it reads or writes data from before October 
> 15, 1582.
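
For reference, the behavior requested here shipped with the Parquet datetime rebase mode settings, whose default mode raises an exception on ancient dates/timestamps. A minimal spark-sql sketch, assuming the Spark 3.0 legacy config names (worth double-checking against the SQL migration guide):

{code:sql}
-- EXCEPTION (the default) fails on pre-1582-10-15 dates/timestamps;
-- LEGACY rebases to the hybrid Julian/Gregorian calendar; CORRECTED reads/writes as-is.
SET spark.sql.legacy.parquet.datetimeRebaseModeInWrite=EXCEPTION;
SET spark.sql.legacy.parquet.datetimeRebaseModeInRead=EXCEPTION;
{code}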



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33195) stages/stage UI page fails to load when spark reverse proxy is enabled

2020-10-25 Thread Liran (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liran updated SPARK-33195:
--
Description: 
I think this is the same issue reported in SPARK-32467, reproduced with reverse 
proxy redirects; I'm getting the exact same error in the Spark UI.

Page URL:
{code:java}
http://:8080/proxy/app-20201020143315-0005/stages/stage/?id=7&attempt=0{code}
The URL above fails to load; looking at the network tab, this request fails:
{code:java}
http://:8080/proxy/app-20201020143315-0005/api/v1/applications/app-20201020143315-0005/stages/7/0/taskTable?draw=1&order%5B0%5D%5Bcolumn%5D=0&order%5B0%5D%5Bdir%5D=asc&start=0&length=20&search%5Bvalue%5D=&search%5Bregex%5D=false&numTasks=1&columnIndexToSort=0&columnNameToSort=Index&_=1603206039549
{code}
Server error stack trace:
{code:java}
/api/v1/applications/app-20201020113310-0004/stages/7/0/taskTable
javax.servlet.ServletException: java.lang.NullPointerException
 at 
org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410) at 
org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346) at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
 at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
 at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
 at org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) 
at 
org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
 at 
org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) at 
org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
 at 
org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) 
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
 at 
org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
 at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
 at 
org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) 
at 
org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
 at 
org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
 at 
org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
 at 
org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
 at 
org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
 at 
org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
 at org.sparkproject.jetty.server.Server.handle(Server.java:505) at 
org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:370) at 
org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
 at 
org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
 at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:103) at 
org.sparkproject.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) at 
org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
 at 
org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
 at 
org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
 at 
org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
 at 
org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
 at 
org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
 at 
org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
 at java.lang.Thread.run(Thread.java:748)Caused by: 
java.lang.NullPointerException at 
org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:175)
 at 
org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:140)
 at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:107) at 
org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:135)
 at 
org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:133)
 at 
org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:28) 
at 
org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:174)
 at 
org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:129)
 at 
org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:140)
 at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:107) at 
org.apache.spark.status.api