[jira] [Commented] (SPARK-44131) Add call_function for Scala API

2023-08-27 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759408#comment-17759408
 ] 

Xiao Li commented on SPARK-44131:
-

[https://github.com/apache/spark/pull/41950] reverted the deprecation. 

> Add call_function for Scala API
> ---
>
> Key: SPARK-44131
> URL: https://issues.apache.org/jira/browse/SPARK-44131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>
> The Scala API for SQL has a method call_udf that is used to call user-defined 
> functions.
> In fact, call_udf can also call built-in functions.
> This behavior is confusing for users.
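
For illustration, a minimal PySpark sketch of the distinction this ticket addresses, assuming Spark 3.5+, where call_function is also exposed in pyspark.sql.functions alongside the Scala API:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import call_udf, call_function, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# call_udf resolves built-in functions too, which is the confusing behavior described above.
df.select(call_udf("upper", col("name"))).show()

# call_function makes the intent explicit: call any function, built-in or registered.
df.select(call_function("upper", col("name"))).show()
{code}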






[jira] [Updated] (SPARK-44131) Add call_function for Scala API

2023-08-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-44131:

Summary: Add call_function for Scala API  (was: Add call_function and 
deprecate call_udf for Scala API)

> Add call_function for Scala API
> ---
>
> Key: SPARK-44131
> URL: https://issues.apache.org/jira/browse/SPARK-44131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>
> The Scala API for SQL has a method call_udf that is used to call user-defined 
> functions.
> In fact, call_udf can also call built-in functions.
> This behavior is confusing for users.






[jira] [Updated] (SPARK-44264) DeepSpeed Distributor

2023-08-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-44264:

Summary: DeepSpeed Distributor  (was: DeepSpeed Distrobutor)

> DeepSpeed Distributor
> -
>
> Key: SPARK-44264
> URL: https://issues.apache.org/jira/browse/SPARK-44264
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.4.1
>Reporter: Lu Wang
>Priority: Critical
> Fix For: 3.5.0
>
> Attachments: Trying to Run Deepspeed Funcs.html
>
>
> The goal is to make it easier for PySpark users to run distributed training and 
> inference with DeepSpeed on Spark clusters. This was a project determined 
> by the Databricks ML Training Team.






[jira] [Updated] (SPARK-43907) Add SQL functions into Scala, Python and R API

2023-08-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-43907:

Fix Version/s: 3.5.0

> Add SQL functions into Scala, Python and R API
> --
>
> Key: SPARK-43907
> URL: https://issues.apache.org/jira/browse/SPARK-43907
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>
> See the discussion in the dev mailing list 
> (https://lists.apache.org/thread/0tdcfyzxzcv8w46qbgwys2rormhdgyqg).
> This is an umbrella JIRA to implement all SQL functions in Scala, Python and R.






[jira] [Updated] (SPARK-41231) Built-in SQL Function Improvement

2023-08-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-41231:

Fix Version/s: 3.5.0

> Built-in SQL Function Improvement
> -
>
> Key: SPARK-41231
> URL: https://issues.apache.org/jira/browse/SPARK-41231
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Commented] (SPARK-44782) Adjust Pull Request Template to incorporate the ASF Generative Tooling Guidance recommendations

2023-08-15 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754429#comment-17754429
 ] 

Xiao Li commented on SPARK-44782:
-

+1 We should update the PR template. 

> Adjust Pull Request Template to incorporate the ASF Generative Tooling 
> Guidance recommendations
> ---
>
> Key: SPARK-44782
> URL: https://issues.apache.org/jira/browse/SPARK-44782
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.2, 3.4.1
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> The recently released [ASF Generative Tooling 
> Guidance|https://www.apache.org/legal/generative-tooling.html] recommends 
> keeping track of the generative AI tools used to author patches:
> ??When providing contributions authored using generative AI tooling, a 
> recommended practice is for contributors to indicate the tooling used to 
> create the contribution. This should be included as a token in the source 
> control commit message, for example including the phrase “Generated-by: ”. 
> This allows for future release tooling to be considered that pulls this 
> content into a machine parsable Tooling-Provenance file.??
> We should adjust the PR template accordingly.
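
For illustration, a commit message carrying such a token might look like the sketch below; the ticket number, title, and tool name are hypothetical placeholders, and the exact wording of the template change is still to be decided:

{code}
[SPARK-XXXXX][DOCS] Clarify the watermark section of the streaming guide

### What changes were proposed in this pull request?
...

Generated-by: ExampleAI Code Assistant
{code}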






[jira] [Updated] (SPARK-40576) Support pandas 1.5.x.

2023-03-26 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-40576:

Fix Version/s: 3.4.0

> Support pandas 1.5.x.
> -
>
> Key: SPARK-40576
> URL: https://issues.apache.org/jira/browse/SPARK-40576
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Since pandas 1.5.0 was released on Sep 19, 2022, we should 
> support the same behavior as the latest pandas.






[jira] [Assigned] (SPARK-40576) Support pandas 1.5.x.

2023-03-26 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-40576:
---

Assignee: Haejoon Lee

> Support pandas 1.5.x.
> -
>
> Key: SPARK-40576
> URL: https://issues.apache.org/jira/browse/SPARK-40576
> Project: Spark
>  Issue Type: Umbrella
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Since pandas 1.5.0 was released on Sep 19, 2022, we should 
> support the same behavior as the latest pandas.






[jira] [Resolved] (SPARK-41594) Support table-valued generator functions in the FROM clause

2023-03-26 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-41594.
-
Fix Version/s: 3.4.0
 Assignee: Allison Wang
   Resolution: Fixed

> Support table-valued generator functions in the FROM clause
> ---
>
> Key: SPARK-41594
> URL: https://issues.apache.org/jira/browse/SPARK-41594
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Umbrella Jira for supporting table-valued generator functions in the FROM 
> clause of a query. 






[jira] [Resolved] (SPARK-42122) Add built-in table-valued function stack

2023-03-26 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-42122.
-
Fix Version/s: 3.4.0
 Assignee: Allison Wang
   Resolution: Fixed

> Add built-in table-valued function stack
> 
>
> Key: SPARK-42122
> URL: https://issues.apache.org/jira/browse/SPARK-42122
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Add `stack` to the built-in table function registry.
> Add new SQL tests in `table-valued-functions.sql` and `join-lateral.sql`.
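
A minimal sketch of what this enables, assuming Spark 3.4+ where `stack` is callable from the FROM clause as a table-valued function:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# stack(2, ...) folds the six literal values into 2 rows of 3 columns.
spark.sql("SELECT * FROM stack(2, 1, 2, 3, 4, 5, 6)").show()
{code}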






[jira] [Resolved] (SPARK-42120) Add built-in table-valued function json_tuple

2023-03-26 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-42120.
-
Fix Version/s: 3.4.0
 Assignee: Allison Wang
   Resolution: Fixed

> Add built-in table-valued function json_tuple
> -
>
> Key: SPARK-42120
> URL: https://issues.apache.org/jira/browse/SPARK-42120
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Add `json_tuple` to the built-in table function registry.
> Add new SQL tests in `table-valued-functions.sql` and `join-lateral.sql`.
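
A minimal sketch, assuming Spark 3.4+ where `json_tuple` is callable from the FROM clause as a table-valued function:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# json_tuple extracts the requested fields of the JSON string as one row of columns.
spark.sql("""SELECT * FROM json_tuple('{"a": 1, "b": "x"}', 'a', 'b')""").show()
{code}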






[jira] [Updated] (SPARK-42702) Support parameterized CTE

2023-03-20 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-42702:

Fix Version/s: 3.4.0
   (was: 3.4.1)

> Support parameterized CTE
> -
>
> Key: SPARK-42702
> URL: https://issues.apache.org/jira/browse/SPARK-42702
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>
> Support named parameters in named common table expressions (CTE). At the 
> moment, such queries fail:
> {code:java}
> CREATE TABLE tbl(namespace STRING) USING parquet
> INSERT INTO tbl SELECT 'abc'
> WITH transitions AS (
>   SELECT * FROM tbl WHERE namespace = :namespace
> ) SELECT * FROM transitions {code}
> w/ the following error:
> {code:java}
> [UNBOUND_SQL_PARAMETER] Found the unbound parameter: `namespace`. Please, fix 
> `args` and provide a mapping of the parameter to a SQL literal.; line 3 pos 
> 38;
> 'WithCTE
> :- 'CTERelationDef 0, false
> :  +- 'SubqueryAlias transitions
> :     +- 'Project [*]
> :        +- 'Filter (namespace#3 = parameter(namespace))
> :           +- SubqueryAlias spark_catalog.default.tbl
> :              +- Relation spark_catalog.default.tbl[namespace#3] parquet
> +- 'Project [*]
>    +- 'SubqueryAlias transitions
>       +- 'CTERelationRef 0, falseorg.apache.spark.sql.AnalysisException: 
> [UNBOUND_SQL_PARAMETER] Found the unbound parameter: `namespace`. Please, fix 
> `args` and provide a mapping of the parameter to a SQL literal.; line 3 pos 
> 38;
> 'WithCTE
> :- 'CTERelationDef 0, false
> :  +- 'SubqueryAlias transitions
> :     +- 'Project [*]
> :        +- 'Filter (namespace#3 = parameter(namespace))
> :           +- SubqueryAlias spark_catalog.default.tbl
> :              +- Relation spark_catalog.default.tbl[namespace#3] parquet
> +- 'Project [*]
>    +- 'SubqueryAlias transitions
>       +- 'CTERelationRef 0, false    at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:339)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:244)
>  {code}
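
For reference, a minimal sketch of how the parameter would be bound once this is supported, assuming the named-parameter `args` argument of SparkSession.sql and the Spark 3.5 behavior where plain Python values are converted to SQL literals:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS tbl(namespace STRING) USING parquet")
spark.sql("INSERT INTO tbl SELECT 'abc'")

# The named parameter :namespace is supplied through the args mapping.
spark.sql(
    """
    WITH transitions AS (
      SELECT * FROM tbl WHERE namespace = :namespace
    )
    SELECT * FROM transitions
    """,
    args={"namespace": "abc"},
).show()
{code}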






[jira] [Reopened] (SPARK-16484) Incremental Cardinality estimation operations with Hyperloglog

2023-01-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-16484:
-

> Incremental Cardinality estimation operations with Hyperloglog
> --
>
> Key: SPARK-16484
> URL: https://issues.apache.org/jira/browse/SPARK-16484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yongjia Wang
>Priority: Major
>  Labels: bulk-closed
>
> Efficient cardinality estimation is very important, and Spark SQL has had 
> approxCountDistinct based on HyperLogLog for quite some time. However, there 
> isn't a way to do incremental estimation. For example, if we want to get 
> updated distinct counts of the last 90 days, we need to do the aggregation 
> for the entire window over and over again. The more efficient way involves 
> serializing the counter for smaller time windows (such as hourly) so the 
> counts can be efficiently updated in an incremental fashion for any time 
> window.
> With the support of custom UDAF, Binary DataType and the HyperloglogPlusPlus 
> implementation in the current Spark version, it's easy enough to extend the 
> functionality to include incremental counting, and even other general set 
> operations such as intersection and set difference. Spark API is already as 
> elegant as it can be, but it still takes quite some effort to do a custom 
> implementation of the aforementioned operations which are supposed to be in 
> high demand. I have been searching but failed to find a usable existing 
> solution or any ongoing effort for this. The closest I found is the following, 
> but it does not work with Spark 1.6 due to API changes. 
> https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
> I wonder whether it is worth integrating such operations into Spark SQL. The only 
> problem I see is that it depends on the serialization of a specific HLL implementation 
> and introduces compatibility issues. But as long as the user is aware of this 
> issue, it should be fine.
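
As a rough illustration of the incremental pattern described above, here is a minimal sketch using the DataSketches-backed HLL functions that later became available (assumption: Spark 3.5+ with hll_sketch_agg, hll_union_agg and hll_sketch_estimate); the original proposal predates these functions and talks about a custom UDAF instead:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import hll_sketch_agg, hll_union_agg, hll_sketch_estimate

spark = SparkSession.builder.getOrCreate()

# Pretend each row is an event tagged with the hour it arrived in.
events = spark.createDataFrame(
    [("h1", "user1"), ("h1", "user2"), ("h2", "user2"), ("h2", "user3")],
    ["hour", "user_id"],
)

# Step 1: keep one serialized sketch per hour (the incremental state that can be persisted).
hourly = events.groupBy("hour").agg(hll_sketch_agg("user_id").alias("sketch"))

# Step 2: answer "distinct users over any window" by merging the stored sketches.
hourly.agg(hll_sketch_estimate(hll_union_agg("sketch")).alias("distinct_users")).show()
{code}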






[jira] [Updated] (SPARK-32082) Project Zen: Improving Python usability

2022-11-07 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32082:

Description: 
The importance of Python and PySpark has grown radically in the last few years. 
The number of PySpark downloads reached [more than 1.3 million _every 
week_|https://pypistats.org/packages/pyspark] when we count them _only_ in 
PyPI. Nevertheless, PySpark is still less Pythonic. For example, it exposes many 
JVM error messages, and the API documentation is poorly written.

This epic ticket aims to improve the usability of PySpark and make it more 
Pythonic. To be more explicit, this JIRA targets the four bullet points below. Each 
includes examples:
 * Being Pythonic
 ** Pandas UDF enhancements and type hints
 ** Avoid dynamic function definitions, for example in {{functions.py}}, which 
IDEs are unable to detect.

 * Better and easier usability in PySpark
 ** User-facing error message and warnings
 ** Documentation
 ** User guide
 ** Better examples and API documentation, e.g. 
[Koalas|https://koalas.readthedocs.io/en/latest/] and 
[pandas|https://pandas.pydata.org/docs/]

 * Better interoperability with other Python libraries
 ** Visualization and plotting
 ** Potentially better interface by leveraging Arrow
 ** Compatibility with other libraries such as NumPy universal functions or 
pandas possibly by leveraging Koalas

 * PyPI Installation
 ** PySpark with Hadoop 3 support on PyPi
 ** Better error handling

 
|Key|Summary|Status|Assignee|
|SPARK-31382|Show a better error message for different python and pip 
installation mistake|{color:#006644}RESOLVED{color}|[Hyukjin 
Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-31849|Improve Python exception messages to be more 
Pythonic|{color:#006644}RESOLVED{color}|[Hyukjin 
Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-31851|Redesign PySpark 
documentation|{color:#006644}RESOLVED{color}|[Hyukjin 
Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-32017|Make Pyspark Hadoop 3.2+ Variant available in 
PyPI|{color:#006644}RESOLVED{color}|[Hyukjin 
Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-32084|Replace dictionary-based function definitions to proper functions 
in functions.py|{color:#006644}RESOLVED{color}|[Maciej 
Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32085|Migrate to NumPy documentation 
style|{color:#006644}RESOLVED{color}|[Maciej 
Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32161|Hide JVM traceback for 
SparkUpgradeException|{color:#006644}RESOLVED{color}|[Pralabh 
Kumar|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=pralabhkumar]|
|SPARK-32185|User Guide - Monitoring|{color:#006644}RESOLVED{color}|[Abhijeet 
Prasad|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=a7prasad]|
|SPARK-32195|Standardize warning types and 
messages|{color:#006644}RESOLVED{color}|[Maciej 
Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32204|Binder Integration|{color:#006644}RESOLVED{color}|[Hyukjin 
Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-32681|PySpark type hints support|{color:#006644}RESOLVED{color}|[Maciej 
Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32686|Un-deprecate inferring DataFrame schema from list of 
dictionaries|{color:#006644}RESOLVED{color}|[Nicholas 
Chammas|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nchammas]|
|SPARK-33247|Improve examples and scenarios in 
docstrings|{color:#006644}RESOLVED{color}|_Unassigned_|
|SPARK-33407|Simplify the exception message from Python 
UDFs|{color:#006644}RESOLVED{color}|[Hyukjin 
Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-33530|Support --archives option 
natively|{color:#006644}RESOLVED{color}|[Hyukjin 
Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-34629|Python type hints 
improvement|{color:#006644}RESOLVED{color}|[Maciej 
Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-34849|SPIP: Support pandas API layer on 
PySpark|{color:#006644}RESOLVED{color}|[Haejoon 
Lee|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=itholic]|
|SPARK-34885|Port/integrate Koalas documentation into 
PySpark|{color:#006644}RESOLVED{color}|[Hyukjin 
Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-35337|pandas API on Spark: Separate basic operations into data type 
based structures|{color:#006644}RESOLVED{color}|[Xinrong 
Meng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=XinrongM]|
|SPARK-35419|Enable spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled 
by 

[jira] [Updated] (SPARK-40309) Introduce sql_conf context manager for pyspark.sql

2022-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-40309:

Labels: release-notes  (was: )

> Introduce sql_conf context manager for pyspark.sql
> --
>
> Key: SPARK-40309
> URL: https://issues.apache.org/jira/browse/SPARK-40309
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>  Labels: release-notes
>
> [https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490]
> In the Pandas API on Spark, a context manager is introduced that sets the Spark SQL
> configuration and then restores it back when it exits.
> That simplifies the control of the Spark SQL configuration, 
> from
> {code:java}
> original_value = spark.conf.get("key")
> spark.conf.set("key", "value")
> ...
> spark.conf.set("key", original_value){code}
> to
> {code:java}
> with sql_conf({"key": "value"}):
> ...
> {code}
>  
>  






[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-08-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-40149:

Priority: Blocker  (was: Major)

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result includes the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either included on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  






[jira] [Updated] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-08-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-40149:

Target Version/s: 3.4.0

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result includes the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either included on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  






[jira] [Commented] (SPARK-36837) Upgrade Kafka to 3.1.0

2022-05-16 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537829#comment-17537829
 ] 

Xiao Li commented on SPARK-36837:
-

Since we will bump to Kafka 3.2.0 in Spark 3.4, we can close it as Done. 

> Upgrade Kafka to 3.1.0
> --
>
> Key: SPARK-36837
> URL: https://issues.apache.org/jira/browse/SPARK-36837
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.4.0
>
>
> Kafka 3.1.0 has the official Java 17 support. We had better align with it.






[jira] [Resolved] (SPARK-36837) Upgrade Kafka to 3.1.0

2022-05-16 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-36837.
-
Resolution: Done

> Upgrade Kafka to 3.1.0
> --
>
> Key: SPARK-36837
> URL: https://issues.apache.org/jira/browse/SPARK-36837
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.4.0
>
>
> Kafka 3.1.0 has the official Java 17 support. We had better align with it.






[jira] [Updated] (SPARK-36837) Upgrade Kafka to 3.1.0

2022-05-16 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36837:

Labels:   (was: releasenotes)

> Upgrade Kafka to 3.1.0
> --
>
> Key: SPARK-36837
> URL: https://issues.apache.org/jira/browse/SPARK-36837
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>
> Kafka 3.1.0 has the official Java 17 support. We had better align with it.






[jira] [Resolved] (SPARK-36837) Upgrade Kafka to 3.1.0

2022-05-16 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-36837.
-
Resolution: Not A Problem

> Upgrade Kafka to 3.1.0
> --
>
> Key: SPARK-36837
> URL: https://issues.apache.org/jira/browse/SPARK-36837
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.4.0
>
>
> Kafka 3.1.0 has the official Java 17 support. We had better align with it.






[jira] [Reopened] (SPARK-36837) Upgrade Kafka to 3.1.0

2022-05-16 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-36837:
-

> Upgrade Kafka to 3.1.0
> --
>
> Key: SPARK-36837
> URL: https://issues.apache.org/jira/browse/SPARK-36837
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.4.0
>
>
> Kafka 3.1.0 has the official Java 17 support. We had better align with it.






[jira] [Reopened] (SPARK-36837) Upgrade Kafka to 3.1.0

2022-05-16 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-36837:
-

> Upgrade Kafka to 3.1.0
> --
>
> Key: SPARK-36837
> URL: https://issues.apache.org/jira/browse/SPARK-36837
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0, 3.4.0
>
>
> Kafka 3.1.0 has the official Java 17 support. We had better align with it.






[jira] [Updated] (SPARK-36837) Upgrade Kafka to 3.1.0

2022-05-16 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36837:

Fix Version/s: (was: 3.3.0)

> Upgrade Kafka to 3.1.0
> --
>
> Key: SPARK-36837
> URL: https://issues.apache.org/jira/browse/SPARK-36837
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.4.0
>
>
> Kafka 3.1.0 has the official Java 17 support. We had better align with it.






[jira] [Resolved] (SPARK-38384) Improve error messages of ParseException from ANTLR

2022-04-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-38384.
-
Fix Version/s: 3.3.0
 Assignee: Xinyi Yu
   Resolution: Fixed

> Improve error messages of ParseException from ANTLR
> ---
>
> Key: SPARK-38384
> URL: https://issues.apache.org/jira/browse/SPARK-38384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
> Fix For: 3.3.0
>
>
> This task is intended to improve the error messages of ParseException 
> directly coming from ANTLR.
> h2. Bad Error Messages
> Many error messages defined in ANTLR are not user-friendly. For example,
> {code:java}
> spark.sql("sel 1")
>  
> ParseException: 
> mismatched input 'sel' expecting {'(', 'APPLY', 'CONVERT', 'COPY', 
> 'OPTIMIZE', 'RESTORE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 
> 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 
> 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 
> 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 
> 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'SYNC', 'TABLE', 
> 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, 
> pos 0)
>  
> == SQL ==
> sel 1
> ^^^ {code}
> Following the [Spark Error Message 
> Guidelines|https://spark.apache.org/error-message-guidelines.html], the words 
> in this message are vague and hard to follow. It states ‘What’, but is 
> unclear on the ‘Why’ and ‘How’.
> Or,
> {code:java}
> spark.sql("") // empty query
> ParseException: 
> mismatched input '<EOF>' expecting {'(', 'CONVERT', 'COPY', 'OPTIMIZE', 
> 'RESTORE', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 
> 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 
> 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 
> 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 
> 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 
> 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
> == SQL ==
> ^^^ {code}
> Instead of simply telling users it’s an empty line, it outputs a long 
> message, even giving the jargon '<EOF>'.
> h2. Where do these error messages come from?
> There has been much work on improving ParseException in general (see 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala]
>  for example). But lots of the above error messages are defined in ANTLR and 
> stay unmodified in Spark.
> When such an error is encountered in ANTLR, ANTLR notifies the exception 
> listener with a message like ‘mismatched input {} expecting {}’. The Spark 
> exception listener _appends_ the line and position to the message, as well as 
> the problematic SQL and several ‘^^^’ marking the error position. Then it 
> throws a ParseException with the appended error message. Spark doesn’t modify 
> the error message given by ANTLR. 
> This task focuses on those error messages from ANTLR.
> h2. Goals
>  # Improve the error messages of ParseException that are from ANTLR; Modify 
> all affected test cases accordingly.
>  # Make sure the new error message framework is applied in this change.
> h2. Proposed Error Messages Change
> It should be in each sub-task and includes concrete before & after cases. See 
> the description of each sub-task for more details.
>  






[jira] [Updated] (SPARK-38781) Error Message Improvements in Spark 3.3

2022-04-04 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-38781:

Summary: Error Message Improvements in Spark 3.3  (was: Improve errors of 
Spark 3.3)

> Error Message Improvements in Spark 3.3
> ---
>
> Key: SPARK-38781
> URL: https://issues.apache.org/jira/browse/SPARK-38781
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Gather together all tickets related to improvements of error messages in Spark 
> SQL, such as the migration to error classes.






[jira] [Assigned] (SPARK-32268) Bloom Filter Join

2022-03-22 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-32268:
---

Assignee: Yingyi Bu  (was: Yuming Wang)

> Bloom Filter Join
> -
>
> Key: SPARK-32268
> URL: https://issues.apache.org/jira/browse/SPARK-32268
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yingyi Bu
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: q16-bloom-filter.jpg, q16-default.jpg
>
>
> We can improve the performance of some joins by pre-filtering one side of a 
> join using a Bloom filter and an IN predicate generated from the values on the 
> other side of the join.
>  For 
> example:[tpcds/q16.sql|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q16.sql].
>  [Before this 
> optimization|https://issues.apache.org/jira/secure/attachment/13007418/q16-default.jpg].
>  [After this 
> optimization|https://issues.apache.org/jira/secure/attachment/13007416/q16-bloom-filter.jpg].
> *Query Performance Benchmarks: TPC-DS Performance Evaluation*
>  Our setup for running TPC-DS benchmark was as follows: TPC-DS 5T and 
> Partitioned Parquet table
>  
> |Query|Default(Seconds)|Enable Bloom Filter Join(Seconds)|
> |tpcds q16|84|46|
> |tpcds q36|29|21|
> |tpcds q57|39|28|
> |tpcds q94|42|34|
> |tpcds q95|306|288|
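
A minimal sketch of exercising the feature, assuming the spark.sql.optimizer.runtime.bloomFilter.enabled configuration introduced by this work (disabled by default in 3.3):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "true")

large = spark.range(0, 10_000_000).withColumnRenamed("id", "k")
small = spark.range(0, 1_000).withColumnRenamed("id", "k")

# The optimizer can now inject a Bloom filter built from the small side to
# pre-filter the large side before the join, which is the effect benchmarked above.
large.join(small, "k").count()
{code}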






[jira] [Resolved] (SPARK-27790) Support ANSI SQL INTERVAL types

2022-03-22 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27790.
-
Fix Version/s: 3.3.0
 Assignee: Max Gekk  (was: Apache Spark)
   Resolution: Fixed

> Support ANSI SQL INTERVAL types
> ---
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Spark has an INTERVAL data type, but it is “broken”:
> # It cannot be persisted
> # It is not comparable because it crosses the month-day line. That is, there 
> is no telling whether “1 Month 1 Day” is equal to “1 Month 1 Day”, since not 
> all months have the same number of days.
> I propose here to introduce the two flavours of INTERVAL as described in the 
> ANSI SQL Standard and deprecate Spark's interval type.
> * ANSI describes two non overlapping “classes”: 
> ** YEAR-MONTH, 
> ** DAY-SECOND ranges
> * Members within each class can be compared and sorted.
> * Supports datetime arithmetic
> * Can be persisted.
> The old and new flavors of INTERVAL can coexist until Spark INTERVAL is 
> eventually retired. Also any semantic “breakage” can be controlled via legacy 
> config settings. 
> *Milestone 1* -- Spark interval equivalency (the new interval types meet 
> or exceed all functionality of the existing SQL interval):
> * Add two new DataType implementations for interval year-month and 
> day-second. Includes the JSON format and DDL string.
> * Infra support: check the caller sides of DateType/TimestampType
> * Support the two new interval types in Dataset/UDF.
> * Interval literals (with a legacy config to still allow mixed year-month 
> day-seconds fields and return legacy interval values)
> * Interval arithmetic(interval * num, interval / num, interval +/- interval)
> * Datetime functions/operators: Datetime - Datetime (to days or day second), 
> Datetime +/- interval
> * Cast to and from the two new interval types, cast string to interval, cast 
> interval to string (pretty printing), with the SQL syntax to specify the types
> * Support sorting intervals.
> *Milestone 2* -- Persistence:
> * Ability to create tables of type interval
> * Ability to write to common file formats such as Parquet and JSON.
> * INSERT, SELECT, UPDATE, MERGE
> * Discovery
> *Milestone 3* --  Client support
> * JDBC support
> * Hive Thrift server
> *Milestone 4* -- PySpark and Spark R integration
> * Python UDF can take and return intervals
> * DataFrame support
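
A minimal sketch of the two ANSI interval classes in action, assuming the year-month and day-second interval types that this work introduced:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# YEAR-MONTH and DAY-SECOND intervals are distinct, comparable, sortable types.
spark.sql("""
    SELECT
      INTERVAL '1-2' YEAR TO MONTH                     AS ym,
      INTERVAL '1 10:20:30.123' DAY TO SECOND          AS ds,
      DATE '2022-01-01' + INTERVAL '1-0' YEAR TO MONTH AS next_year
""").printSchema()
{code}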






[jira] [Updated] (SPARK-27790) Support ANSI SQL INTERVAL types

2022-03-22 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27790:

Labels: release-notes  (was: )

> Support ANSI SQL INTERVAL types
> ---
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: release-notes
> Fix For: 3.3.0
>
>
> Spark has an INTERVAL data type, but it is “broken”:
> # It cannot be persisted
> # It is not comparable because it crosses the month-day line. That is, there 
> is no telling whether “1 Month 1 Day” is equal to “1 Month 1 Day”, since not 
> all months have the same number of days.
> I propose here to introduce the two flavours of INTERVAL as described in the 
> ANSI SQL Standard and deprecate Spark's interval type.
> * ANSI describes two non overlapping “classes”: 
> ** YEAR-MONTH, 
> ** DAY-SECOND ranges
> * Members within each class can be compared and sorted.
> * Supports datetime arithmetic
> * Can be persisted.
> The old and new flavors of INTERVAL can coexist until Spark INTERVAL is 
> eventually retired. Also any semantic “breakage” can be controlled via legacy 
> config settings. 
> *Milestone 1* -- Spark interval equivalency (the new interval types meet 
> or exceed all functionality of the existing SQL interval):
> * Add two new DataType implementations for interval year-month and 
> day-second. Includes the JSON format and DDL string.
> * Infra support: check the caller sides of DateType/TimestampType
> * Support the two new interval types in Dataset/UDF.
> * Interval literals (with a legacy config to still allow mixed year-month 
> day-seconds fields and return legacy interval values)
> * Interval arithmetic(interval * num, interval / num, interval +/- interval)
> * Datetime functions/operators: Datetime - Datetime (to days or day second), 
> Datetime +/- interval
> * Cast to and from the two new interval types, cast string to interval, cast 
> interval to string (pretty printing), with the SQL syntax to specify the types
> * Support sorting intervals.
> *Milestone 2* -- Persistence:
> * Ability to create tables of type interval
> * Ability to write to common file formats such as Parquet and JSON.
> * INSERT, SELECT, UPDATE, MERGE
> * Discovery
> *Milestone 3* --  Client support
> * JDBC support
> * Hive Thrift server
> *Milestone 4* -- PySpark and Spark R integration
> * Python UDF can take and return intervals
> * DataFrame support






[jira] [Updated] (SPARK-35781) Support Spark on Apple Silicon on macOS natively on Java 17

2022-03-20 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-35781:

Labels: release-notes  (was: )

> Support Spark on Apple Silicon on macOS natively on Java 17
> ---
>
> Key: SPARK-35781
> URL: https://issues.apache.org/jira/browse/SPARK-35781
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: DB Tsai
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: release-notes
>
> This is an umbrella JIRA tracking the progress of supporting Apple Silicon on 
> macOS natively.






[jira] [Updated] (SPARK-38527) Set the minimum Volcano version

2022-03-15 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-38527:

Labels: release-notes  (was: )

> Set the minimum Volcano version
> ---
>
> Key: SPARK-38527
> URL: https://issues.apache.org/jira/browse/SPARK-38527
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: release-notes
> Fix For: 3.3.0
>
>







[jira] [Updated] (SPARK-36371) Support raw string literal

2022-03-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36371:

Labels: release-notes  (was: )

> Support raw string literal
> --
>
> Key: SPARK-36371
> URL: https://issues.apache.org/jira/browse/SPARK-36371
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>  Labels: release-notes
> Fix For: 3.3.0
>
>
> In the current master, it is sometimes too confusing to represent JSON and 
> regexes in a string literal if they contain backslashes.
> For example, in JSON, \ needs to be escaped as follows.
> {code}
> {"a": "\\"}
> {code}
> But, if the JSON above is represented in a string literal, two further \ are 
> needed for each one because string literals also require \ to be escaped.
> {code}
> SELECT from_json('{"a": "\\\\"}', 'a string')
> {"a":"\"}
> {code}
> To make such cases simpler, it would be great if Spark supported raw string literals.






[jira] [Updated] (SPARK-34806) Helper class for batch Dataset.observe()

2022-03-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-34806:

Labels: release-notes  (was: )

> Helper class for batch Dataset.observe()
> 
>
> Key: SPARK-34806
> URL: https://issues.apache.org/jira/browse/SPARK-34806
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.3.0
>
>
> The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It 
> allows collecting aggregate metrics over the data of a Dataset while it is 
> being processed during an action.
> These metrics are collected in a separate thread after registering a 
> {{QueryExecutionListener}} for batch datasets and a {{StreamingQueryListener}} 
> for stream datasets, respectively. While in a streaming context it makes 
> perfect sense to process incremental metrics in an event-based fashion, for 
> simple batch dataset processing, a single result should be retrievable 
> without the need to register listeners or handle threading.
> Introducing an {{Observation}} helper class can hide that complexity for 
> simple use-cases in batch processing.
> Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
> {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method 
> to create a new {{Observation}} instance and register it with the session.
> Alternatively, an {{Observation}} instance could be instantiated on its own 
> which on calling {{Observation.on(Dataset)}} registers with 
> {{Dataset.sparkSession}}. This "registration" registers a listener with the 
> session that retrieves the metrics.
> The {{Observation}} class provides methods to retrieve the metrics. This 
> retrieval has to wait for the listener to be called in a separate thread. So 
> all methods will wait for this, optionally with a timeout:
>  - {{Observation.get}} waits without timeout and returns the metric.
>  - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
> metric as an {{Option}}, or {{None}} when the timeout occurs.
>  - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
> indicates timeout by returning {{false}}.
> Obviously, an action has to be called on the observed dataset before any of 
> these methods are called, otherwise a timeout will occur.
> With {{Observation.reset}}, another action can be observed. Finally, 
> {{Observation.close}} unregisters the listener from the session.
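
A minimal sketch of the batch usage this helper enables, assuming the PySpark Observation API that eventually shipped (pyspark.sql.Observation, Spark 3.3+); the method names proposed above may differ from the final API:

{code:python}
from pyspark.sql import SparkSession, Observation
from pyspark.sql.functions import count, lit, max

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)

obs = Observation("stats")
observed = df.observe(obs, count(lit(1)).alias("rows"), max("id").alias("max_id"))

observed.collect()  # the action that actually produces the metrics
print(obs.get)      # e.g. {'rows': 100, 'max_id': 99}
{code}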






[jira] [Updated] (SPARK-37273) Hidden File Metadata Support for Spark SQL

2022-03-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-37273:

Labels: release-notes  (was: )

> Hidden File Metadata Support for Spark SQL
> --
>
> Key: SPARK-37273
> URL: https://issues.apache.org/jira/browse/SPARK-37273
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
>  Labels: release-notes
> Fix For: 3.3.0
>
>
> Provide a new interface in Spark SQL that allows users to query the metadata 
> of the input files for all file formats, and expose it as *built-in hidden 
> columns*, meaning *users can only see them when they explicitly reference 
> them* (e.g. file path, file name).
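
A minimal sketch of querying such hidden columns, assuming the _metadata struct column that the implementation exposes for file-based sources (Spark 3.3+); the input path is a hypothetical placeholder:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/tmp/example_table")  # hypothetical path

# _metadata is hidden: it is not part of df.columns, but it can be selected explicitly.
df.select("*", "_metadata.file_path", "_metadata.file_name").show(truncate=False)
{code}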






[jira] [Commented] (SPARK-37367) Reenable exception test in DDLParserSuite.create view -- basic

2021-11-23 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448342#comment-17448342
 ] 

Xiao Li commented on SPARK-37367:
-

[~beliefer] Can you take a look at this? 

> Reenable exception test in DDLParserSuite.create view -- basic
> --
>
> Key: SPARK-37367
> URL: https://issues.apache.org/jira/browse/SPARK-37367
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-37308 disabled a test due to unknown flakiness. We should re-enable this 
> test after investigation.






[jira] [Updated] (SPARK-37367) Reenable exception test in DDLParserSuite.create view -- basic

2021-11-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-37367:

Target Version/s: 3.3.0

> Reenable exception test in DDLParserSuite.create view -- basic
> --
>
> Key: SPARK-37367
> URL: https://issues.apache.org/jira/browse/SPARK-37367
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> SPARK-37308 disabled a test due to unknown flakiness. We should re-enable this 
> test after investigation.






[jira] [Updated] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-09-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36861:

Target Version/s: 3.3.0
Priority: Blocker  (was: Major)

> Partition columns are overly eagerly parsed as dates
> 
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> In Spark 3.1 the 'hour' column is parsed as a string type, but in the 3.2 RC it 
> is parsed as a date type and the hour part is lost.
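
A minimal sketch of the report and of one possible mitigation, assuming the existing spark.sql.sources.partitionColumnTypeInference.enabled flag (which turns off partition type inference entirely rather than fixing the regression); the input path is a hypothetical placeholder:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With inference disabled, the 'hour' partition column stays a string such as '2021-01-01T00'.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
df = spark.read.parquet("/data/events")  # directory containing hour=... subdirectories
df.printSchema()
{code}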






[jira] [Commented] (SPARK-36681) Fail to load Snappy codec

2021-09-23 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419557#comment-17419557
 ] 

Xiao Li commented on SPARK-36681:
-

Can you link the related Jira in Apache Hadoop?

> Fail to load Snappy codec
> -
>
> Key: SPARK-36681
> URL: https://issues.apache.org/jira/browse/SPARK-36681
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> snappy-java, as a native library, should not be relocated in the Hadoop shaded 
> client libraries. Currently we use the Hadoop shaded client libraries in Spark. 
> If we try to use SnappyCodec to write a sequence file, we will encounter the 
> following error:
> {code}
> [info]   Cause: java.lang.UnsatisfiedLinkError: 
> org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawCompress(Ljava/nio/ByteBuffer;IILjava/nio/ByteBuffer;I)I
> [info]   at 
> org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawCompress(Native 
> Method)   
>   
> [info]   at 
> org.apache.hadoop.shaded.org.xerial.snappy.Snappy.compress(Snappy.java:151)   
>   
>
> [info]   at 
> org.apache.hadoop.io.compress.snappy.SnappyCompressor.compressDirectBuf(SnappyCompressor.java:282)
> [info]   at 
> org.apache.hadoop.io.compress.snappy.SnappyCompressor.compress(SnappyCompressor.java:210)
> [info]   at 
> org.apache.hadoop.io.compress.BlockCompressorStream.compress(BlockCompressorStream.java:149)
> [info]   at 
> org.apache.hadoop.io.compress.BlockCompressorStream.finish(BlockCompressorStream.java:142)
> [info]   at 
> org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1589)
>  
> [info]   at 
> org.apache.hadoop.io.SequenceFile$BlockCompressWriter.sync(SequenceFile.java:1605)
> [info]   at 
> org.apache.hadoop.io.SequenceFile$BlockCompressWriter.close(SequenceFile.java:1629)
>  
> {code}
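
A minimal PySpark sketch of a write path that should hit the same code, assuming
any sequence-file write with SnappyCodec through the shaded Hadoop client
reproduces it (the output path is hypothetical):

{code:python}
# Hedged repro sketch: write a sequence file compressed with SnappyCodec.
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])
rdd.saveAsSequenceFile(
    "/tmp/seq_snappy",  # hypothetical output path
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec",
)
{code}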



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-08 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36696:

Target Version/s: 3.2.0

> spark.read.parquet loads empty dataset
> --
>
> Key: SPARK-36696
> URL: https://issues.apache.org/jira/browse/SPARK-36696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Blocker
> Attachments: example.parquet
>
>
> Here's a parquet file Spark 3.2/master can't read properly.
> The file was stored by pandas and must contain 3650 rows, but Spark 
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by Parquet 1.12.0.
> When I reverted two commits related to Parquet 1.12.0 from branch-3.2:
>  - 
> [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
>  - 
> [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add some workaround, or revert the commits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-08 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36696:

Priority: Blocker  (was: Major)

> spark.read.parquet loads empty dataset
> --
>
> Key: SPARK-36696
> URL: https://issues.apache.org/jira/browse/SPARK-36696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Blocker
> Attachments: example.parquet
>
>
> Here's a parquet file Spark 3.2/master can't read properly.
> The file was stored by pandas and must contain 3650 rows, but Spark 
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by Parquet 1.12.0.
> When I reverted two commits related to Parquet 1.12.0 from branch-3.2:
>  - 
> [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
>  - 
> [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add some workaround, or revert the commits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-08-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-34276:

Description: 
Before the release, we need to double check the unreleased/unresolved JIRAs/PRs 
of Parquet 1.11/1.12 and then decide whether we should upgrade/revert Parquet. 
At the same time, we should encourage the whole community to do the 
compatibility and performance tests for their production workloads, including 
both read and write code paths.

More details: 
[https://github.com/apache/spark/pull/26804#issuecomment-768790620]

  was:
Before the release, we need to double check the unreleased/unresolved JIRAs/PRs 
of Parquet 1.11 and then decide whether we should upgrade/revert Parquet. At 
the same time, we should encourage the whole community to do the compatibility 
and performance tests for their production workloads, including both read and 
write code paths.

More details: https://github.com/apache/spark/pull/26804#issuecomment-768790620


> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> --
>
> Key: SPARK-34276
> URL: https://issues.apache.org/jira/browse/SPARK-34276
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should 
> upgrade/revert Parquet. At the same time, we should encourage the whole 
> community to do the compatibility and performance tests for their production 
> workloads, including both read and write code paths.
> More details: 
> [https://github.com/apache/spark/pull/26804#issuecomment-768790620]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-08-27 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405878#comment-17405878
 ] 

Xiao Li commented on SPARK-34276:
-

https://issues.apache.org/jira/browse/PARQUET-2078 Do we have this problem?

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> --
>
> Key: SPARK-34276
> URL: https://issues.apache.org/jira/browse/SPARK-34276
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert 
> Parquet. At the same time, we should encourage the whole community to do the 
> compatibility and performance tests for their production workloads, including 
> both read and write code paths.
> More details: 
> https://github.com/apache/spark/pull/26804#issuecomment-768790620



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-08-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-34276:

Summary: Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12 
 (was: Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 )

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> --
>
> Key: SPARK-34276
> URL: https://issues.apache.org/jira/browse/SPARK-34276
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert 
> Parquet. At the same time, we should encourage the whole community to do the 
> compatibility and performance tests for their production workloads, including 
> both read and write code paths.
> More details: 
> https://github.com/apache/spark/pull/26804#issuecomment-768790620



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36457) Review and fix issues in API docs

2021-08-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36457:

Priority: Blocker  (was: Critical)

> Review and fix issues in API docs
> -
>
> Key: SPARK-36457
> URL: https://issues.apache.org/jira/browse/SPARK-36457
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Blocker
>
> Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the 
> following issues:
> * Add missing `Since` annotation for new APIs
> * Remove the leaking class/object in API doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36457) Review and fix issues in API docs

2021-08-09 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36457:

Target Version/s: 3.2.0

> Review and fix issues in API docs
> -
>
> Key: SPARK-36457
> URL: https://issues.apache.org/jira/browse/SPARK-36457
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Blocker
>
> Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the 
> following issues:
> * Add missing `Since` annotation for new APIs
> * Remove the leaking class/object in API doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36457) Review and fix issues in API docs

2021-08-09 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36457:

Priority: Blocker  (was: Major)

> Review and fix issues in API docs
> -
>
> Key: SPARK-36457
> URL: https://issues.apache.org/jira/browse/SPARK-36457
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Blocker
>
> Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the 
> following issues:
> * Add missing `Since` annotation for new APIs
> * Remove the leaking class/object in API doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35050) Deprecate Apache Mesos as resource manager

2021-08-09 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-35050:

Labels: release-notes  (was: )

> Deprecate Apache Mesos as resource manager
> --
>
> Key: SPARK-35050
> URL: https://issues.apache.org/jira/browse/SPARK-35050
> Project: Spark
>  Issue Type: Task
>  Components: Mesos, Spark Core
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.2.0
>
>
> As highlighted in 
> https://lists.apache.org/thread.html/rab2a820507f7c846e54a847398ab20f47698ec5bce0c8e182bfe51ba%40%3Cdev.mesos.apache.org%3E
>  , Apache Mesos is moving to the attic and ceasing development.
> We can/should maintain support for some time, but can probably go ahead and 
> deprecate it now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29330) Allow users to chose the name of Spark Shuffle service

2021-08-09 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-29330.
-
Fix Version/s: 3.2.0
   Resolution: Duplicate

> Allow users to chose the name of Spark Shuffle service
> --
>
> Key: SPARK-29330
> URL: https://issues.apache.org/jira/browse/SPARK-29330
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Alexander Bessonov
>Priority: Minor
> Fix For: 3.2.0
>
>
> As of now, Spark uses the hardcoded value {{spark_shuffle}} as the name of the 
> Shuffle Service.
> HDP distribution of Spark, on the other hand, uses 
> [{{spark2_shuffle}}|https://github.com/hortonworks/spark2-release/blob/HDP-3.1.0.0-78-tag/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L117].
>  This is done to be able to run both Spark 1.6 and Spark 2.x on the same 
> Hadoop cluster.
> Running vanilla Spark on an HDP cluster with only the Spark 2.x shuffle service (HDP 
> flavor) running becomes impossible due to the shuffle service name mismatch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34828) YARN Shuffle Service: Support configurability of aux service name and service-specific config overrides

2021-08-09 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-34828:

Labels: release-notes  (was: )

> YARN Shuffle Service: Support configurability of aux service name and 
> service-specific config overrides
> ---
>
> Key: SPARK-34828
> URL: https://issues.apache.org/jira/browse/SPARK-34828
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.2.0
>
>
> In some cases it may be desirable to run multiple instances of the Spark 
> Shuffle Service which are using different versions of Spark. This can be 
> helpful, for example, when running a YARN cluster with a mixed workload of 
> applications running multiple Spark versions, since a given version of the 
> shuffle service is not always compatible with other versions of Spark. (See 
> SPARK-27780 for more detail on this)
> YARN versions since 2.9.0 support the ability to run shuffle services within 
> an isolated classloader (see YARN-4577), meaning multiple Spark versions can 
> coexist within a single NodeManager.
> To support this from the Spark side, we need to make two enhancements:
> * Make the name of the shuffle service configurable. Currently it is 
> hard-coded to be {{spark_shuffle}} on both the client and server side. The 
> server-side name is not actually used anywhere, as it is the value within the 
> {{yarn.nodemanager.aux-services}}, which is considered by the NodeManager to 
> be the definitive name. However, if you change this in the configs, the 
> hard-coded name within the client will no longer match. So, this needs to be 
> configurable.
> * Add a way to separately configure the two shuffle service instances. Since 
> the configurations such as the port number are taken from the NodeManager 
> config, they will both try to use the same port, which obviously won't work. 
> So, we need to provide a way to selectively configure the two shuffle service 
> instances. I will go into details on my proposal for how to achieve this 
> within the PR.
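
A minimal sketch of the client side of the first enhancement above, assuming it
is exposed as a spark.shuffle.service.name setting; the exact key and the
service name below are assumptions:

{code:python}
# Hedged sketch: point executors at a non-default aux service name that the
# NodeManager registers for this Spark version. "spark_shuffle_3" is hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.service.name", "spark_shuffle_3")
    .getOrCreate()
)
{code}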



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32953) Lower memory usage in toPandas with Arrow self_destruct

2021-08-08 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32953:

Description: 
As described on the mailing list:
 
[http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html]
 
[https://lists.apache.org/thread.html/r581d7c82ada1c2ac3f0584615785cc60cf5ac231e1f29737d3a6569f%40%3Cdev.spark.apache.org%3E]

toPandas() can as much as double memory usage as both Arrow and Pandas retain a 
copy of a dataframe in memory during the conversion. Arrow >= 0.16 offers a 
self_destruct mode that avoids this with some caveats.

  was:
As described on the mailing list:
[http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html]

toPandas() can as much as double memory usage as both Arrow and Pandas retain a 
copy of a dataframe in memory during the conversion. Arrow >= 0.16 offers a 
self_destruct mode that avoids this with some caveats.


> Lower memory usage in toPandas with Arrow self_destruct
> ---
>
> Key: SPARK-32953
> URL: https://issues.apache.org/jira/browse/SPARK-32953
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: David Li
>Assignee: David Li
>Priority: Major
> Fix For: 3.2.0
>
>
> As described on the mailing list:
>  
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html]
>  
> [https://lists.apache.org/thread.html/r581d7c82ada1c2ac3f0584615785cc60cf5ac231e1f29737d3a6569f%40%3Cdev.spark.apache.org%3E]
> toPandas() can as much as double memory usage as both Arrow and Pandas retain 
> a copy of a dataframe in memory during the conversion. Arrow >= 0.16 offers a 
> self_destruct mode that avoids this with some caveats.
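
A minimal PySpark sketch of opting in, assuming the option is exposed as a
spark.sql.execution.arrow.pyspark.selfDestruct.enabled flag (the flag name is an
assumption here, not taken from the ticket):

{code:python}
# Hedged sketch: enable Arrow conversion plus the self-destruct path so Arrow
# buffers can be released column by column while the pandas DataFrame is built.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")

pdf = spark.range(10_000_000).toPandas()
{code}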



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

2021-07-13 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36034:

Priority: Blocker  (was: Major)

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---
>
> Key: SPARK-36034
> URL: https://issues.apache.org/jira/browse/SPARK-36034
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Willi Raschkowski
>Priority: Blocker
>  Labels: correctness
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or by 
> Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> >>> "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = 
> >>> '0001-01-01'").show()
> +--+
> |  date|
> +--+
> |0001-01-01|
> +--+
> {code}
> This is how we get incorrect results in _legacy_ mode, in this case the 
> filter is dropping rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").show()
> ++
> |date|
> ++
> ++
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>+- FileScan parquet [date#154] Batched: true, DataFilters: 
> [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(date), 
> EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}
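
If the row loss comes from the pushed Parquet predicate shown in the plan above,
one untested mitigation sketch is to disable Parquet filter pushdown for the
affected reads; both the diagnosis and the mitigation are assumptions, not taken
from the ticket.

{code:python}
# Hedged mitigation sketch: keep the (date = -719162) predicate out of the
# Parquet reader so the comparison happens after the legacy values are rebased.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
{code}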



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

2021-07-13 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-36034:

Target Version/s: 3.2.0

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---
>
> Key: SPARK-36034
> URL: https://issues.apache.org/jira/browse/SPARK-36034
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Willi Raschkowski
>Priority: Blocker
>  Labels: correctness
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or by 
> Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> >>> "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = 
> >>> '0001-01-01'").show()
> +--+
> |  date|
> +--+
> |0001-01-01|
> +--+
> {code}
> This is how we get incorrect results in _legacy_ mode, in this case the 
> filter is dropping rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").show()
> ++
> |date|
> ++
> ++
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>+- FileScan parquet [date#154] Batched: true, DataFilters: 
> [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(date), 
> EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34745) Unify overflow exception error message of integral types

2021-07-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-34745.
-
Resolution: Later

> Unify overflow exception error message of integral types
> 
>
> Key: SPARK-34745
> URL: https://issues.apache.org/jira/browse/SPARK-34745
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, the overflow exception error messages of the integral types are 
> different.
> For Byte/Short types, the message is "... caused overflow".
> For Int/Long, the message is "int/long overflow", since Spark calls the 
> "*Exact" methods (e.g. addExact, negateExact) from java.lang.Math.
> We should unify the error messages by changing the Byte/Short message to 
> "tinyint/smallint overflow".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34402) Group exception about data format schema

2021-07-12 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378962#comment-17378962
 ] 

Xiao Li commented on SPARK-34402:
-

[~angerszhuuu] any update?

> Group exception about data format schema
> 
>
> Key: SPARK-34402
> URL: https://issues.apache.org/jira/browse/SPARK-34402
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> https://github.com/apache/spark/pull/30869#discussion_r57203



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33223) Expose state information on SS UI

2021-04-25 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331657#comment-17331657
 ] 

Xiao Li commented on SPARK-33223:
-

[~gsomogyi] Could you add the new metrics to the doc? 
[https://spark.apache.org/docs/latest/web-ui.html#structured-streaming-tab]  

> Expose state information on SS UI
> -
>
> Key: SPARK-33223
> URL: https://issues.apache.org/jira/browse/SPARK-33223
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35097) Add column name to SparkUpgradeException about ancient datetime

2021-04-16 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323943#comment-17323943
 ] 

Xiao Li commented on SPARK-35097:
-

Can you help fix this, [~angerszhuuu] ?

> Add column name to SparkUpgradeException about ancient datetime
> ---
>
> Key: SPARK-35097
> URL: https://issues.apache.org/jira/browse/SPARK-35097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> The error message:
> {code:java}
> org.apache.spark.SparkUpgradeException: You may get a different result due to 
> the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
> before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
> may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
> hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
> calendar. See more details in SPARK-31404. You can set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
> datetime values w.r.t. the calendar difference during reading. Or set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
> datetime values as it is.
> {code}
> doesn't give any clue about which column causes the issue. We need to improve the 
> message and add the column name to it.
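
For context, a minimal sketch of how the rebase modes named in that message are
applied today; the input path is hypothetical:

{code:python}
# Hedged sketch: pick one of the two modes the error message suggests before reading.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
df = spark.read.parquet("/data/legacy_dates")  # hypothetical path with pre-1582 dates
{code}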



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28379) Correlated scalar subqueries must be aggregated

2021-03-11 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299908#comment-17299908
 ] 

Xiao Li commented on SPARK-28379:
-

CC [~allisonwang-db] 

> Correlated scalar subqueries must be aggregated
> ---
>
> Key: SPARK-28379
> URL: https://issues.apache.org/jira/browse/SPARK-28379
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> create or replace temporary view INT8_TBL as select * from
>   (values
> (123, 456),
> (123, 4567890123456789),
> (4567890123456789, 123),
> (4567890123456789, 4567890123456789),
> (4567890123456789, -4567890123456789))
>   as v(q1, q2);
> select * from
>   int8_tbl t1 left join
>   (select q1 as x, 42 as y from int8_tbl t2) ss
>   on t1.q2 = ss.x
> where
>   1 = (select 1 from int8_tbl t3 where ss.y is not null limit 1)
> order by 1,2;
> {code}
> PostgreSQL:
> {noformat}
> postgres=# select * from
> postgres-#   int8_tbl t1 left join
> postgres-#   (select q1 as x, 42 as y from int8_tbl t2) ss
> postgres-#   on t1.q2 = ss.x
> postgres-# where
> postgres-#   1 = (select 1 from int8_tbl t3 where ss.y is not null limit 1)
> postgres-# order by 1,2;
>               q1 |               q2 |                x |  y
> -----------------+------------------+------------------+----
>              123 | 4567890123456789 | 4567890123456789 | 42
>              123 | 4567890123456789 | 4567890123456789 | 42
>              123 | 4567890123456789 | 4567890123456789 | 42
> 4567890123456789 |              123 |              123 | 42
> 4567890123456789 |              123 |              123 | 42
> 4567890123456789 | 4567890123456789 | 4567890123456789 | 42
> 4567890123456789 | 4567890123456789 | 4567890123456789 | 42
> 4567890123456789 | 4567890123456789 | 4567890123456789 | 42
> (8 rows)
> {noformat}
> Spark SQL:
> {noformat}
> spark-sql> select * from
>  >   int8_tbl t1 left join
>  >   (select q1 as x, 42 as y from int8_tbl t2) ss
>  >   on t1.q2 = ss.x
>  > where
>  >   1 = (select 1 from int8_tbl t3 where ss.y is not null limit 1)
>  > order by 1,2;
> Error in query: Correlated scalar subqueries must be aggregated: GlobalLimit 1
> +- LocalLimit 1
>+- Project [1 AS 1#169]
>   +- Filter isnotnull(outer(y#167))
>  +- SubqueryAlias `t3`
> +- SubqueryAlias `int8_tbl`
>+- Project [q1#164L, q2#165L]
>   +- Project [col1#162L AS q1#164L, col2#163L AS q2#165L]
>  +- SubqueryAlias `v`
> +- LocalRelation [col1#162L, col2#163L]
> ;;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34525) Update Spark Create Table DDL Docs

2021-02-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-34525:

Labels: starter  (was: )

> Update Spark Create Table DDL Docs
> --
>
> Key: SPARK-34525
> URL: https://issues.apache.org/jira/browse/SPARK-34525
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Documentation
>Affects Versions: 3.0.3
>Reporter: Miklos Christine
>Priority: Major
>  Labels: starter
>
> Within the `CREATE TABLE` docs, the `OPTIONS` and `TBLPROPERTIES` clauses specify 
> `key=value` parameters with `=` as the delimiter between the key and the value. 
> The `=` is optional, so the pairs can also be space delimited. We should document 
> that both forms are supported when defining these parameters.
>  
> One location within the current docs page that should be updated: 
> [https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html]
>  
> Code reference showing equal as an optional parameter:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L401
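
A minimal sketch of the two forms the docs should mention; the table names are
hypothetical:

{code:python}
# Hedged sketch: both delimiters parse, per the grammar rule linked above.
spark.sql("CREATE TABLE t_eq    (id INT) USING parquet OPTIONS ('compression' = 'snappy')")
spark.sql("CREATE TABLE t_space (id INT) USING parquet OPTIONS ('compression' 'snappy')")
{code}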



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34459) map_from_arrays() throws UnsupportedOperationException with array of ColumnarMap

2021-02-18 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-34459.
-
Resolution: Invalid

> map_from_arrays() throws UnsupportedOperationException with array of 
> ColumnarMap
> 
>
> Key: SPARK-34459
> URL: https://issues.apache.org/jira/browse/SPARK-34459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Minor
>
> An example to reproduce this error: 
> {code:scala}
> sql("select map(1, 2) as m_a, map(3, 4) as m_b").write.saveAsTable("t")
> sql("select map_from_arrays(array(1, 2), array(m_a, m_b)) from t")
> {code}
> Exception trace:
> {code:java}
> java.lang.UnsupportedOperationException
>   at org.apache.spark.sql.vectorized.ColumnarMap.copy(ColumnarMap.java:51)
>   at org.apache.spark.sql.vectorized.ColumnarMap.copy(ColumnarMap.java:25)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$.copyValue(InternalRow.scala:121)
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.copy(GenericArrayData.scala:54)
> {code}
> This is because ColumnarMap's copy method is not implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34459) map_from_arrays() throws UnsupportedOperationException with array of ColumnarMap

2021-02-17 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17286299#comment-17286299
 ] 

Xiao Li commented on SPARK-34459:
-

cc [~beliefer]

> map_from_arrays() throws UnsupportedOperationException with array of 
> ColumnarMap
> 
>
> Key: SPARK-34459
> URL: https://issues.apache.org/jira/browse/SPARK-34459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Minor
>
> An example to reproduce this error: 
> {code:scala}
> sql("select map(1, 2) as m_a, map(3, 4) as m_b").write.saveAsTable("t")
> sql("select map_from_arrays(array(1, 2), array(m_a, m_b)) from t")
> {code}
> Exception trace:
> {code:java}
> java.lang.UnsupportedOperationException
>   at org.apache.spark.sql.vectorized.ColumnarMap.copy(ColumnarMap.java:51)
>   at org.apache.spark.sql.vectorized.ColumnarMap.copy(ColumnarMap.java:25)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$.copyValue(InternalRow.scala:121)
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.copy(GenericArrayData.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:754)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
>   at 
> org.apache.spark.sql.execution.collect.Collector.$anonfun$processPartition$1(Collector.scala:179)
>   at 
> org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2541)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
>   at org.apache.spark.scheduler.Task.run(Task.scala:119)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$10(Executor.scala:733)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1643)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:736)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> This is because ColumnarMap's copy method is not implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32883) Stream-stream join improvement

2021-02-15 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285057#comment-17285057
 ] 

Xiao Li commented on SPARK-32883:
-

[https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-docs/_site/structured-streaming-programming-guide.html#support-matrix-for-joins-in-streaming-queries]
  It sounds like our document is still not updated. cc [~chengsu] [~kabhwan] 
[~XuanYuan] [~hyukjin.kwon] [~cloud_fan]

> Stream-stream join improvement
> --
>
> Key: SPARK-32883
> URL: https://issues.apache.org/jira/browse/SPARK-32883
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Major
>
> Creating this umbrella Jira to track overall progress for stream-stream join 
> improvement. See each individual sub-task for details.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34421) Custom functions can't be used in temporary views with CTEs

2021-02-11 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-34421:

Target Version/s: 3.1.0

> Custom functions can't be used in temporary views with CTEs
> ---
>
> Key: SPARK-34421
> URL: https://issues.apache.org/jira/browse/SPARK-34421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: Databricks Runtime 8.0
>Reporter: Lauri Koobas
>Priority: Blocker
>
> Works in DBR 7.4, which is Spark 3.0.1. Breaks in DBR 8.0 (beta), which is 
> Spark 3.1.
>  
> Start with:
> {{spark.udf.registerJavaFunction("custom_func", "com.stuff.path.custom_func", 
> LongType())}}
>  
> Works:
>  * {{select custom_func()}}
>  * {{create temporary view blaah as select custom_func()}}
>  * {{with step_1 as ( select custom_func() ) select * from step_1}}
> Broken:
> {{create temporary view blaah as with step_1 as ( select custom_func() ) 
> select * from step_1}}
>  
> followed by:
> {{select * from blaah}}
>  
> Error:
> {{Error in SQL statement: AnalysisException: No handler for UDF/UDAF/UDTF 
> 'com.stuff.path.custom_func';}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34421) Custom functions can't be used in temporary views with CTEs

2021-02-11 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-34421:

Priority: Blocker  (was: Major)

> Custom functions can't be used in temporary views with CTEs
> ---
>
> Key: SPARK-34421
> URL: https://issues.apache.org/jira/browse/SPARK-34421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: Databricks Runtime 8.0
>Reporter: Lauri Koobas
>Priority: Blocker
>
> Works in DBR 7.4, which is Spark 3.0.1. Breaks in DBR 8.0 (beta), which is 
> Spark 3.1.
>  
> Start with:
> {{spark.udf.registerJavaFunction("custom_func", "com.stuff.path.custom_func", 
> LongType())}}
>  
> Works:
>  * {{select custom_func()}}
>  * {{create temporary view blaah as select custom_func()}}
>  * {{with step_1 as ( select custom_func() ) select * from step_1}}
> Broken:
> {{create temporary view blaah as with step_1 as ( select custom_func() ) 
> select * from step_1}}
>  
> followed by:
> {{select * from blaah}}
>  
> Error:
> {{Error in SQL statement: AnalysisException: No handler for UDF/UDAF/UDTF 
> 'com.stuff.path.custom_func';}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31949) Add spark.default.parallelism in SQLConf for isolated across session

2021-02-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-31949.
-
Resolution: Won't Fix

> Add spark.default.parallelism in SQLConf for isolated across session
> 
>
> Key: SPARK-31949
> URL: https://issues.apache.org/jira/browse/SPARK-31949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34382) ANSI SQL: LATERAL derived table(T491)

2021-02-05 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280051#comment-17280051
 ] 

Xiao Li commented on SPARK-34382:
-

Reopened. This is a nice SQL feature we can support, and it is also supported 
by other database systems.

> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-34382
> URL: https://issues.apache.org/jira/browse/SPARK-34382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34382) ANSI SQL: LATERAL derived table(T491)

2021-02-05 Thread Xiao Li (Jira)
Xiao Li created SPARK-34382:
---

 Summary: ANSI SQL: LATERAL derived table(T491)
 Key: SPARK-34382
 URL: https://issues.apache.org/jira/browse/SPARK-34382
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
This allows them to reference columns provided by preceding {{FROM}} items. 
(Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
cross-reference any other {{FROM}} item.)

Table functions appearing in {{FROM}} can also be preceded by the key word 
{{LATERAL}}, but for functions the key word is optional; the function's 
arguments can contain references to columns provided by preceding {{FROM}} 
items in any case.

A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
{{JOIN}} tree. In the latter case it can also refer to any items that are on 
the left-hand side of a {{JOIN}} that it is on the right-hand side of.

When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation proceeds 
as follows: for each row of the {{FROM}} item providing the cross-referenced 
column(s), or set of rows of multiple {{FROM}} items providing the columns, the 
{{LATERAL}} item is evaluated using that row or row set's values of the 
columns. The resulting row(s) are joined as usual with the rows they were 
computed from. This is repeated for each row or set of rows from the column 
source table(s).

A trivial example of {{LATERAL}} is
{code:sql}
SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
{code}

*Feature ID*: T491

[https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
[https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27877) ANSI SQL: LATERAL derived table(T491)

2021-02-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27877:

Issue Type: Technical task  (was: Sub-task)

> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-27877
> URL: https://issues.apache.org/jira/browse/SPARK-27877
> Project: Spark
>  Issue Type: Technical task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27877) ANSI SQL: LATERAL derived table(T491)

2021-02-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27877:

Issue Type: Sub-task  (was: Technical task)

> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-27877
> URL: https://issues.apache.org/jira/browse/SPARK-27877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34078) Provide async variants for Dataset APIs

2021-01-11 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17263044#comment-17263044
 ] 

Xiao Li commented on SPARK-34078:
-

A sample PR? Or a lightweight design doc?

> Provide async variants for Dataset APIs
> ---
>
> Key: SPARK-34078
> URL: https://issues.apache.org/jira/browse/SPARK-34078
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Yesheng Ma
>Priority: Major
>
> Spark RDDs have async variants such as `collectAsync`, which come in handy when 
> we want to cancel a job. However, the Dataset API lacks such variants, which 
> makes it very painful to cancel a Dataset/SQL job.
>  
> The proposed change was to add async variants so that we can directly cancel 
> a Dataset/SQL query via a future programmatically.
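
In the meantime, a hedged workaround sketch (not the proposed API): run the
action on a separate thread under a job group so it can be cancelled from the
driver. The names below are illustrative, and PySpark's pinned-thread mode may
be needed for the job group to propagate reliably.

{code:python}
# Hedged sketch: cancellable collect() via a job group and a worker thread.
from concurrent.futures import ThreadPoolExecutor

def collect_async(spark, df, group_id):
    def run():
        # Job groups are thread-local, so set one on the thread that triggers the job.
        spark.sparkContext.setJobGroup(group_id, "async collect", True)
        return df.collect()
    return ThreadPoolExecutor(max_workers=1).submit(run)

future = collect_async(spark, spark.range(10**9).selectExpr("sum(id)"), "demo-group")
spark.sparkContext.cancelJobGroup("demo-group")  # cancels the running query; the future then fails
{code}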



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33977) Add doc for "'like any' and 'like all' operators"

2021-01-03 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257939#comment-17257939
 ] 

Xiao Li commented on SPARK-33977:
-

cc [~yumwang]

> Add doc for "'like any' and 'like all' operators"
> -
>
> Key: SPARK-33977
> URL: https://issues.apache.org/jira/browse/SPARK-33977
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> Need to update the doc for the new LIKE predicates in the following file:
> [https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-qry-select-like.md]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33977) Add doc for "'like any' and 'like all' operators"

2021-01-03 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257938#comment-17257938
 ] 

Xiao Li commented on SPARK-33977:
-

https://issues.apache.org/jira/browse/SPARK-30724 was added in 3.1.

  

> Add doc for "'like any' and 'like all' operators"
> -
>
> Key: SPARK-33977
> URL: https://issues.apache.org/jira/browse/SPARK-33977
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> Need to update the doc for the new LIKE predicates in the following file:
> [https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-qry-select-like.md]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33977) Add doc for "'like any' and 'like all' operators"

2021-01-03 Thread Xiao Li (Jira)
Xiao Li created SPARK-33977:
---

 Summary: Add doc for "'like any' and 'like all' operators"
 Key: SPARK-33977
 URL: https://issues.apache.org/jira/browse/SPARK-33977
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.1.0
Reporter: Xiao Li


Need to update the doc for the new LIKE predicates in the following file:

[https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-qry-select-like.md]
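
For the doc update, a minimal sketch of the two predicates; the table and column
names are hypothetical:

{code:python}
# Hedged sketch of LIKE ANY / LIKE ALL, the predicates the page needs to cover.
spark.sql("SELECT * FROM people WHERE name LIKE ANY ('A%', '%son')").show()
spark.sql("SELECT * FROM people WHERE name LIKE ALL ('%a%', '%e%')").show()
{code}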
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33725) Upgrade snappy-java to 1.1.8.2

2020-12-13 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-33725:

Fix Version/s: 3.0.2

> Upgrade snappy-java to 1.1.8.2
> --
>
> Key: SPARK-33725
> URL: https://issues.apache.org/jira/browse/SPARK-33725
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.7, 3.0.1, 3.1.0, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Minor version upgrade that includes:
>  * Fixed an initialization issue when using a recent Mac OS X version #265
>  * Support Apple Silicon (M1, Mac-aarch64)
>  * Fixed the pure-java Snappy fallback logic when no native library for your 
> platform is found.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33694) Auditing the API changes and behavior changes in Spark 3.1

2020-12-07 Thread Xiao Li (Jira)
Xiao Li created SPARK-33694:
---

 Summary: Auditing the API changes and behavior changes in Spark 3.1
 Key: SPARK-33694
 URL: https://issues.apache.org/jira/browse/SPARK-33694
 Project: Spark
  Issue Type: Epic
  Components: PySpark, Spark Core, SparkR, SQL, Structured Streaming
Affects Versions: 3.1.0
Reporter: Xiao Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33694) Auditing the API changes and behavior changes in Spark 3.1

2020-12-07 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-33694:

Priority: Blocker  (was: Major)

> Auditing the API changes and behavior changes in Spark 3.1
> --
>
> Key: SPARK-33694
> URL: https://issues.apache.org/jira/browse/SPARK-33694
> Project: Spark
>  Issue Type: Epic
>  Components: PySpark, Spark Core, SparkR, SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32991) RESET can clear StaticSQLConfs

2020-12-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32991:

Target Version/s: 3.1.0

> RESET can clear StaticSQLConfs
> --
>
> Key: SPARK-32991
> URL: https://issues.apache.org/jira/browse/SPARK-32991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Herman van Hövell
>Assignee: Kent Yao
>Priority: Blocker
> Fix For: 3.1.0
>
>
> The RESET command can clear a session's static SQL configurations, when that 
> static SQL configuration was set on a SparkSession that uses a pre-existing 
> SparkContext. Here is a repro:
> {code:java}
> // Blow away any pre-existing session thread locals
> org.apache.spark.sql.SparkSession.clearDefaultSession()
> org.apache.spark.sql.SparkSession.clearActiveSession()
> // Create new session and explicitly set a spark context
> val newSession = org.apache.spark.sql.SparkSession.builder
>  .sparkContext(sc)
>  .config("spark.sql.globalTempDatabase", "bob")
>  .getOrCreate()
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob")
> newSession.sql("reset")
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob") // Boom!
> {code}
> The problem is that RESET assumes it can use the SparkContext's 
> configurations as its default.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32991) RESET can clear StaticSQLConfs

2020-12-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32991:

Priority: Blocker  (was: Major)

> RESET can clear StaticSQLConfs
> --
>
> Key: SPARK-32991
> URL: https://issues.apache.org/jira/browse/SPARK-32991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Herman van Hövell
>Assignee: Kent Yao
>Priority: Blocker
> Fix For: 3.1.0
>
>
> The RESET command can clear a session's static SQL configurations, when that 
> static SQL configuration was set on a SparkSession that uses a pre-existing 
> SparkContext. Here is a repro:
> {code:java}
> // Blow away any pre-existing session thread locals
> org.apache.spark.sql.SparkSession.clearDefaultSession()
> org.apache.spark.sql.SparkSession.clearActiveSession()
> // Create new session and explicitly set a spark context
> val newSession = org.apache.spark.sql.SparkSession.builder
>  .sparkContext(sc)
>  .config("spark.sql.globalTempDatabase", "bob")
>  .getOrCreate()
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob")
> newSession.sql("reset")
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob") // Boom!
> {code}
> The problem is that RESET assumes it can use the SparkContext's 
> configurations as its default.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32991) RESET can clear StaticSQLConfs

2020-12-05 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244675#comment-17244675
 ] 

Xiao Li commented on SPARK-32991:
-

[~Qin Yao] I reopened the Jira and set it to Blocker, because the follow-up PR 
is needed before the 3.1 release.

> RESET can clear StaticSQLConfs
> --
>
> Key: SPARK-32991
> URL: https://issues.apache.org/jira/browse/SPARK-32991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Herman van Hövell
>Assignee: Kent Yao
>Priority: Blocker
> Fix For: 3.1.0
>
>
> The RESET command can clear a session's static SQL configurations, when that 
> static SQL configuration was set on a SparkSession that uses a pre-existing 
> SparkContext. Here is a repro:
> {code:java}
> // Blow away any pre-existing session thread locals
> org.apache.spark.sql.SparkSession.clearDefaultSession()
> org.apache.spark.sql.SparkSession.clearActiveSession()
> // Create new session and explicitly set a spark context
> val newSession = org.apache.spark.sql.SparkSession.builder
>  .sparkContext(sc)
>  .config("spark.sql.globalTempDatabase", "bob")
>  .getOrCreate()
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob")
> newSession.sql("reset")
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob") // Boom!
> {code}
> The problem is that RESET assumes it can use the SparkContext's 
> configurations as its default.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32991) RESET can clear StaticSQLConfs

2020-12-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-32991:
-

> RESET can clear StaticSQLConfs
> --
>
> Key: SPARK-32991
> URL: https://issues.apache.org/jira/browse/SPARK-32991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Herman van Hövell
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0
>
>
> The RESET command can clear a session's static SQL configurations, when that 
> static SQL configuration was set on a SparkSession that uses a pre-existing 
> SparkContext. Here is a repro:
> {code:java}
> // Blow away any pre-existing session thread locals
> org.apache.spark.sql.SparkSession.clearDefaultSession()
> org.apache.spark.sql.SparkSession.clearActiveSession()
> // Create new session and explicitly set a spark context
> val newSession = org.apache.spark.sql.SparkSession.builder
>  .sparkContext(sc)
>  .config("spark.sql.globalTempDatabase", "bob")
>  .getOrCreate()
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob")
> newSession.sql("reset")
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob") // Boom!
> {code}
> The problem is that RESET assumes it can use the SparkContext's 
> configurations as its default.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33674) Show Slowpoke notifications in SBT tests

2020-12-05 Thread Xiao Li (Jira)
Xiao Li created SPARK-33674:
---

 Summary: Show Slowpoke notifications in SBT tests
 Key: SPARK-33674
 URL: https://issues.apache.org/jira/browse/SPARK-33674
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.0.1
Reporter: Xiao Li
Assignee: Xiao Li


When the tests/code has a bug and enters an infinite loop, it is hard to tell 
from the log which test cases hit issues, especially when we are running the 
tests in parallel. It would be nice to show the Slowpoke notifications.
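
One possible way to turn this on, assuming ScalaTest's slowpoke detector (the -W Runner flag) is passed through the sbt test options; the 120/60 values are illustrative, not the final choice:

{code:scala}
// Sketch for an sbt build definition (e.g. build.sbt); values are illustrative.
// ScalaTest's -W <delay> <period> prints a "slowpoke" notification for any test
// still running after <delay> seconds, repeating every <period> seconds.
Test / testOptions += Tests.Argument(TestFrameworks.ScalaTest, "-W", "120", "60")
{code}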



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33551) Do not use custom shuffle reader for repartition

2020-11-25 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-33551.
-
Fix Version/s: 3.1.0
 Assignee: Wei Xue
   Resolution: Fixed

> Do not use custom shuffle reader for repartition
> 
>
> Key: SPARK-33551
> URL: https://issues.apache.org/jira/browse/SPARK-33551
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.1.0
>
>
> We should have a more thorough fix for all sorts of custom shuffle readers 
> when the original query has a repartition shuffle, based on the discussions 
> on the initial PR: [https://github.com/apache/spark/pull/29797].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-11-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-32670:
---

Assignee: Xinyi Yu  (was: Xiao Li)

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xinyi Yu
>Priority: Minor
> Fix For: 3.1.0
>
>
> For standardization of error messages and their maintenance, we can try to 
> group the exception messages into a single file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-11-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32670:

Parent: SPARK-33539
Issue Type: Sub-task  (was: Improvement)

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 3.1.0
>
>
> For standardization of error messages and their maintenance, we can try to 
> group the exception messages into a single file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32862) Left semi stream-stream join

2020-11-17 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32862:

Labels: release-notes  (was: )

> Left semi stream-stream join
> 
>
> Key: SPARK-32862
> URL: https://issues.apache.org/jira/browse/SPARK-32862
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Major
>  Labels: release-notes
> Fix For: 3.1.0
>
>
> Current stream-stream join supports inner, left outer and right outer joins 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166]
>  ). Internally we see a lot of users relying on left semi stream-stream joins 
> (outside of Spark Structured Streaming), e.g. "I want the ad impressions (join 
> left side) that have a click (join right side), but I don't care how many 
> clicks each ad got" (left semi semantics); a usage sketch follows the steps 
> below.
>  
> Left semi stream-stream join will work as follows:
> (1). For a left side input row, check if there is a match in the right side 
> state store.
>   (1.1). If there is a match, output the left side row.
>   (1.2). If there is no match, put the row in the left side state store (with 
> its "matched" field set to false in the state store).
> (2). For a right side input row, check if there is a match in the left side 
> state store. If there is a match, update the left side row state by setting 
> its "matched" field to true. Put the right side row in the right side state 
> store.
> (3). For a left side row that needs to be evicted from the state store, output 
> the row if its "matched" field is true.
> (4). For a right side row that needs to be evicted from the state store, do 
> nothing.
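
A minimal, hypothetical sketch of how such a left semi stream-stream join could be written in Structured Streaming once supported; the sources, column names, watermarks and join window below are made up for illustration:

{code:scala}
import org.apache.spark.sql.functions._

// Illustrative streaming sources; real jobs would read from Kafka, files, etc.
val impressions = spark.readStream.format("rate").load()
  .select(col("value").as("impressionAdId"), col("timestamp").as("impressionTime"))
  .withWatermark("impressionTime", "2 hours")

val clicks = spark.readStream.format("rate").load()
  .select(col("value").as("clickAdId"), col("timestamp").as("clickTime"))
  .withWatermark("clickTime", "3 hours")

// Emit each impression at most once when a matching click arrives; the number
// of clicks per ad does not matter (left semi semantics).
val matchedImpressions = impressions.join(
  clicks,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """),
  joinType = "left_semi")

// matchedImpressions.writeStream.format("console").start()
{code}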



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-10-22 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30602:

Labels: release-notes  (was: )

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grows linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amount of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the environments mentioned above with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32461) Shuffled hash join improvement

2020-10-22 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32461:

Priority: Major  (was: Minor)

> Shuffled hash join improvement
> --
>
> Key: SPARK-32461
> URL: https://issues.apache.org/jira/browse/SPARK-32461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Major
>
> Shuffled hash join avoids the sort required by sort merge join. This advantage 
> shows up clearly when joining large tables, in terms of saving CPU and IO (in 
> case an external sort happens). In the latest master trunk, shuffled hash join 
> is disabled by default with the config "spark.sql.join.preferSortMergeJoin"=true, 
> in favor of reducing the risk of OOM. However, shuffled hash join could be 
> improved to a better state (validated in our internal fork). Creating this 
> Jira to track overall progress.
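
A minimal sketch of opting in to shuffled hash join today, assuming a SparkSession named spark; the table and column names are illustrative, and the planner still applies its own size checks:

{code:scala}
// Prefer shuffled hash join over sort merge join for equi-joins.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
// Rule out broadcast joins so the hash vs. sort-merge choice is visible.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Illustrative tables; expect ShuffledHashJoin in the plan when the planner's
// size conditions are met, otherwise it may still fall back to SortMergeJoin.
val joined = spark.table("orders").join(spark.table("customers"), "customer_id")
joined.explain()
{code}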



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32461) Shuffled hash join improvement

2020-10-22 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32461:

Labels: release-notes  (was: )

> Shuffled hash join improvement
> --
>
> Key: SPARK-32461
> URL: https://issues.apache.org/jira/browse/SPARK-32461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Major
>  Labels: release-notes
>
> Shuffled hash join avoids the sort required by sort merge join. This advantage 
> shows up clearly when joining large tables, in terms of saving CPU and IO (in 
> case an external sort happens). In the latest master trunk, shuffled hash join 
> is disabled by default with the config "spark.sql.join.preferSortMergeJoin"=true, 
> in favor of reducing the risk of OOM. However, shuffled hash join could be 
> improved to a better state (validated in our internal fork). Creating this 
> Jira to track overall progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33181) SQL Reference: Run SQL on files directly

2020-10-19 Thread Xiao Li (Jira)
Xiao Li created SPARK-33181:
---

 Summary: SQL Reference: Run SQL on files directly
 Key: SPARK-33181
 URL: https://issues.apache.org/jira/browse/SPARK-33181
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.1
Reporter: Xiao Li


Currently, SQL reference 
([https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html] ) does 
not show the feature "Run SQL on files directly", which is documented in 
[https://spark.apache.org/docs/3.0.1/sql-data-sources-load-save-functions.html#run-sql-on-files-directly]
 We should add it to the SQL reference.
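
For reference, a minimal sketch of the feature to be documented, assuming a SparkSession named spark; the file path is the one used in the existing data sources guide and is illustrative:

{code:scala}
// Query a file directly by prefixing the path with the data source format.
val users = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
users.show()

// The same pattern works for other built-in sources, e.g.
// json.`path`, csv.`path`, orc.`path`.
{code}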



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33181) SQL Reference: Run SQL on files directly

2020-10-19 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-33181:

Labels: starter  (was: )

> SQL Reference: Run SQL on files directly
> 
>
> Key: SPARK-33181
> URL: https://issues.apache.org/jira/browse/SPARK-33181
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Currently, SQL reference 
> ([https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html] ) does 
> not show the feature "Run SQL on files directly", which is documented in 
> [https://spark.apache.org/docs/3.0.1/sql-data-sources-load-save-functions.html#run-sql-on-files-directly]
>  We should add it to the SQL reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33166) Provide Search Function in Spark docs site

2020-10-15 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-33166:

Target Version/s: 3.1.0

> Provide Search Function in Spark docs site
> --
>
> Key: SPARK-33166
> URL: https://issues.apache.org/jira/browse/SPARK-33166
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> In the last few releases, our Spark documentation  
> https://spark.apache.org/docs/latest/ has become richer. It would be nice to 
> provide a search function to help our users find content faster. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33166) Provide Search Function in Spark docs site

2020-10-15 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215133#comment-17215133
 ] 

Xiao Li commented on SPARK-33166:
-

cc [~huaxingao] [~dongjoon]

> Provide Search Function in Spark docs site
> --
>
> Key: SPARK-33166
> URL: https://issues.apache.org/jira/browse/SPARK-33166
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> In the last few releases, our Spark documentation  
> https://spark.apache.org/docs/latest/ has become richer. It would be nice to 
> provide a search function to help our users find content faster. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33166) Provide Search Function in Spark docs site

2020-10-15 Thread Xiao Li (Jira)
Xiao Li created SPARK-33166:
---

 Summary: Provide Search Function in Spark docs site
 Key: SPARK-33166
 URL: https://issues.apache.org/jira/browse/SPARK-33166
 Project: Spark
  Issue Type: New Feature
  Components: Documentation
Affects Versions: 3.1.0
Reporter: Xiao Li


In the last few releases, our Spark documentation  
https://spark.apache.org/docs/latest/ has become richer. It would be nice to provide 
a search function to help our users find content faster. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33134) Incorrect nested complex JSON fields raise an exception

2020-10-14 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-33134:

Issue Type: Bug  (was: Improvement)

> Incorrect nested complex JSON fields raise an exception
> ---
>
> Key: SPARK-33134
> URL: https://issues.apache.org/jira/browse/SPARK-33134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> The code below:
> {code:scala}
> val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 
> 123456}]""").toDF("events")
> val event = new StructType()
>   .add("playerId", LongType)
>   .add("cards", ArrayType(
> new StructType()
>   .add("id", LongType)
>   .add("rank", StringType)))
> val pokerhand_events = pokerhand_raw
>   .select(explode(from_json($"events", ArrayType(event))).as("event"))
> pokerhand_events.show
> {code}
> throw the exception in the PERMISSIVE mode (default):
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> org.apache.spark.sql.catalyst.util.ArrayData
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:560)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:597)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
> {code}
> The same works in Spark 2.4:
> {code:scala}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.6
>   /_/
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
> ...
> scala> pokerhand_events.show()
> +-+
> |event|
> +-+
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33038) AQE plan string should only display one plan when the initial and the current plan are the same

2020-10-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-33038.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

> AQE plan string should only display one plan when the initial and the current 
> plan are the same
> ---
>
> Key: SPARK-33038
> URL: https://issues.apache.org/jira/browse/SPARK-33038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, the AQE plan string displays both the initial plan and the current 
> or the final plan. This can be redundant when the initial plan and the 
> current physical plan are exactly the same. For instance, the `EXPLAIN` 
> command will not actually execute the query, and thus the plan string will 
> never change, but currently, the plan string still shows both the current and 
> the initial plan:
>  
> {code:java}
> AdaptiveSparkPlan (8)
> +- == Current Plan ==
>Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1)
> +- == Initial Plan ==
>Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1)
> {code}
> When the initial and the current plan are the same, there should be only one 
> plan string displayed. For example
> {code:java}
> AdaptiveSparkPlan (8)
> +- Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33038) AQE plan string should only display one plan when the initial and the current plan are the same

2020-10-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-33038:
---

Assignee: Allison Wang

> AQE plan string should only display one plan when the initial and the current 
> plan are the same
> ---
>
> Key: SPARK-33038
> URL: https://issues.apache.org/jira/browse/SPARK-33038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Minor
>
> Currently, the AQE plan string displays both the initial plan and the current 
> or the final plan. This can be redundant when the initial plan and the 
> current physical plan are exactly the same. For instance, the `EXPLAIN` 
> command will not actually execute the query, and thus the plan string will 
> never change, but currently, the plan string still shows both the current and 
> the initial plan:
>  
> {code:java}
> AdaptiveSparkPlan (8)
> +- == Current Plan ==
>Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1)
> +- == Initial Plan ==
>Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1)
> {code}
> When the initial and the current plan are the same, there should be only one 
> plan string displayed. For example
> {code:java}
> AdaptiveSparkPlan (8)
> +- Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31257) Unify create table syntax to fix ambiguous two different CREATE TABLE syntaxes

2020-10-03 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31257:

Target Version/s: 3.1.0

> Unify create table syntax to fix ambiguous two different CREATE TABLE syntaxes
> --
>
> Key: SPARK-31257
> URL: https://issues.apache.org/jira/browse/SPARK-31257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> There's a discussion on the dev@ mailing list pointing out the ambiguous 
> syntaxes for CREATE TABLE DDL. This issue tracks the effort to resolve the 
> root issue by unifying the create table syntax.
> https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E
> We should ensure the new "single" create table syntax is deterministic 
> for both devs and end users.
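
For context, a hypothetical sketch of the two syntaxes under discussion, assuming a SparkSession named spark with Hive support enabled; the table names are illustrative:

{code:scala}
// Hive-style syntax: without USING, this path historically creates a Hive table.
spark.sql("CREATE TABLE t_hive (id INT, name STRING) STORED AS PARQUET")

// Data source syntax: USING always creates a Spark data source table.
spark.sql("CREATE TABLE t_ds (id INT, name STRING) USING parquet")

// The ambiguity the thread discusses: a bare statement such as
//   CREATE TABLE t_bare (id INT, name STRING)
// has been resolved by different rules depending on version and legacy
// configuration, which is what the unification aims to make deterministic.
{code}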



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32991) RESET can clear StaticSQLConfs

2020-09-24 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201806#comment-17201806
 ] 

Xiao Li commented on SPARK-32991:
-

ping [~Qin Yao] [~yumwang]

> RESET can clear StaticSQLConfs
> --
>
> Key: SPARK-32991
> URL: https://issues.apache.org/jira/browse/SPARK-32991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Herman van Hövell
>Priority: Major
>
> The RESET command can clear a session's static SQL configurations, when that 
> static SQL configuration was set on a SparkSession that uses a pre-existing 
> SparkContext. Here is a repro:
> {code:java}
> // Blow away any pre-existing session thread locals
> org.apache.spark.sql.SparkSession.clearDefaultSession()
> org.apache.spark.sql.SparkSession.clearActiveSession()
> // Create new session and explicitly set a spark context
> val newSession = org.apache.spark.sql.SparkSession.builder
>  .sparkContext(sc)
>  .config("spark.sql.globalTempDatabase", "bob")
>  .getOrCreate()
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob")
> newSession.sql("reset")
> assert(newSession.conf.get("spark.sql.globalTempDatabase") == "bob") // Boom!
> {code}
> The problem is that RESET assumes it can use the SparkContext's 
> configurations as its default.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32976) Support column list in INSERT statement

2020-09-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32976:

Description: 
INSERT currently does not support named column lists.  

{{INSERT INTO  (col1, col2,…) VALUES( 'val1', 'val2', … )}}

Note, we assume the column list contains all the column names. The order could 
be different from the column order defined in the table definition.

  was:
INSERT currently does not support named column lists.  

{{INSERT INTO  (col1, col2,…) VALUES( 'val1', 'val2', … )}}


> Support column list in INSERT statement
> ---
>
> Key: SPARK-32976
> URL: https://issues.apache.org/jira/browse/SPARK-32976
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> INSERT currently does not support named column lists.  
> {{INSERT INTO  (col1, col2,…) VALUES( 'val1', 'val2', … )}}
> Note, we assume the column list contains all the column names. The order 
> could be different from the column order defined in the table definition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32976) Support column list in INSERT statement

2020-09-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32976:

Description: 
INSERT currently does not support named column lists.  

{{INSERT INTO  (col1, col2,…) VALUES( 'val1', 'val2', … )}}

Note, we assume the column list contains all the column names. Issue an 
exception if the list is not complete. The column order could be different from 
the column order defined in the table definition.

  was:
INSERT currently does not support named column lists.  

{{INSERT INTO  (col1, col2,…) VALUES( 'val1', 'val2', … )}}

Note, we assume the column list contains all the column names. The order could 
be different from the column order defined in the table definition.


> Support column list in INSERT statement
> ---
>
> Key: SPARK-32976
> URL: https://issues.apache.org/jira/browse/SPARK-32976
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> INSERT currently does not support named column lists.  
> {{INSERT INTO  (col1, col2,…) VALUES( 'val1', 'val2', … )}}
> Note, we assume the column list contains all the column names. Issue an 
> exception if the list is not complete. The column order could be different 
> from the column order defined in the table definition.
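
A hypothetical sketch of how the proposed syntax would be used once supported; the table, columns, and values below are illustrative only:

{code:scala}
// Illustrative table (names and types are made up).
spark.sql("CREATE TABLE students (id INT, name STRING, age INT) USING parquet")

// Proposed: the column list names every column, possibly in a different order
// than the table definition.
spark.sql("INSERT INTO students (name, id, age) VALUES ('alice', 1, 23)")

// Per the description, an incomplete list such as (id, name) should raise an
// exception rather than silently filling the missing columns.
{code}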



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32976) Support column list in INSERT statement

2020-09-23 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200877#comment-17200877
 ] 

Xiao Li commented on SPARK-32976:
-

[~Qin Yao] Are you interested in this?

> Support column list in INSERT statement
> ---
>
> Key: SPARK-32976
> URL: https://issues.apache.org/jira/browse/SPARK-32976
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> INSERT currently does not support named column lists.  
> {{INSERT INTO  (col1, col2,…) VALUES( 'val1', 'val2', … )}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   3   4   5   6   7   8   9   10   >