[jira] [Updated] (SPARK-30704) Use jekyll-redirect-from 0.15.0 instead of the latest

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30704:
--
Description: 
We use Ruby 2.3 in our release docker image.

The latest version, `jekyll-redirect-from` 0.16.0, fails to install on Ruby 2.3 
because it requires Ruby >= 2.4.0.
- https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0

{code}
root@dc0bc546e377:/# gem install jekyll-redirect-from
ERROR:  Error installing jekyll-redirect-from:
jekyll-redirect-from requires Ruby version >= 2.4.0.
{code}

  was:
The latest version causes a failure on Ruby 2.3.
- https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0


> Use jekyll-redirect-from 0.15.0 instead of the latest
> -
>
> Key: SPARK-30704
> URL: https://issues.apache.org/jira/browse/SPARK-30704
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> We use Ruby 2.3 in our release docker image.
> The latest version, `jekyll-redirect-from` 0.16.0, fails to install on Ruby 
> 2.3 because it requires Ruby >= 2.4.0.
> - https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0
> {code}
> root@dc0bc546e377:/# gem install jekyll-redirect-from
> ERROR:  Error installing jekyll-redirect-from:
>   jekyll-redirect-from requires Ruby version >= 2.4.0.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30704) Use jekyll-redirect-from 0.15.0 instead of the latest

2020-02-01 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30704:
-

 Summary: Use jekyll-redirect-from 0.15.0 instead of the latest
 Key: SPARK-30704
 URL: https://issues.apache.org/jira/browse/SPARK-30704
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 2.4.5, 3.0.0
Reporter: Dongjoon Hyun


The latest version causes a failure on Ruby 2.3.
- https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0






[jira] [Resolved] (SPARK-27686) Update migration guide

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27686.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27161
[https://github.com/apache/spark/pull/27161]

> Update migration guide 
> ---
>
> Key: SPARK-27686
> URL: https://issues.apache.org/jira/browse/SPARK-27686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: hive-1.2.1-lib.tgz
>
>
> The built-in Hive 2.3 fixes the following issues:
>  * HIVE-6727: Table level stats for external tables are set incorrectly.
>  * HIVE-15653: Some ALTER TABLE commands drop table stats.
>  * SPARK-12014: Spark SQL query containing semicolon is broken in Beeline.
>  * SPARK-25193: insert overwrite doesn't throw exception when drop old data 
> fails.
>  * SPARK-25919: Date value corrupts when tables are "ParquetHiveSerDe" 
> formatted and target table is Partitioned.
>  * SPARK-26332: Spark sql write orc table on viewFS throws exception.
>  * SPARK-26437: Decimal data becomes bigint to query, unable to query.
> We need to update the migration guide.






[jira] [Assigned] (SPARK-27686) Update migration guide

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27686:
-

Assignee: Yuming Wang

> Update migration guide 
> ---
>
> Key: SPARK-27686
> URL: https://issues.apache.org/jira/browse/SPARK-27686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Attachments: hive-1.2.1-lib.tgz
>
>
> The built-in Hive 2.3 fixes the following issues:
>  * HIVE-6727: Table level stats for external tables are set incorrectly.
>  * HIVE-15653: Some ALTER TABLE commands drop table stats.
>  * SPARK-12014: Spark SQL query containing semicolon is broken in Beeline.
>  * SPARK-25193: insert overwrite doesn't throw exception when drop old data 
> fails.
>  * SPARK-25919: Date value corrupts when tables are "ParquetHiveSerDe" 
> formatted and target table is Partitioned.
>  * SPARK-26332: Spark sql write orc table on viewFS throws exception.
>  * SPARK-26437: Decimal data becomes bigint to query, unable to query.
> We need to update the migration guide.






[jira] [Commented] (SPARK-30703) Add a documentation page for ANSI mode

2020-02-01 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028253#comment-17028253
 ] 

Takeshi Yamamuro commented on SPARK-30703:
--

Yea, sure, I will do it this week ;)

> Add a documentation page for ANSI mode
> --
>
> Key: SPARK-30703
> URL: https://issues.apache.org/jira/browse/SPARK-30703
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> ANSI mode is introduced in Spark 3.0. We need to clearly document the 
> behavior difference when spark.sql.ansi.enabled is on and off. 






[jira] [Commented] (SPARK-30703) Add a documentation page for ANSI mode

2020-02-01 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028251#comment-17028251
 ] 

Xiao Li commented on SPARK-30703:
-

[~maropu] Could you help this?

> Add a documentation page for ANSI mode
> --
>
> Key: SPARK-30703
> URL: https://issues.apache.org/jira/browse/SPARK-30703
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> ANSI mode is introduced in Spark 3.0. We need to clearly document the 
> behavior difference when spark.sql.ansi.enabled is on and off. 






[jira] [Created] (SPARK-30703) Add a documentation page for ANSI mode

2020-02-01 Thread Xiao Li (Jira)
Xiao Li created SPARK-30703:
---

 Summary: Add a documentation page for ANSI mode
 Key: SPARK-30703
 URL: https://issues.apache.org/jira/browse/SPARK-30703
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li


ANSI mode is introduced in Spark 3.0. We need to clearly document the behavior 
difference when spark.sql.ansi.enabled is on and off. 
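A minimal illustration of the kind of difference the new page should spell out; the cast behavior in the comments below is an assumption about Spark 3.0's ANSI mode and should be verified against the implementation:

{code:java}
// Sketch for spark-shell; only the config key comes from this issue.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()   // non-ANSI mode: the invalid cast yields NULL

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('abc' AS INT)").show()   // ANSI mode: the same cast fails at runtime
{code}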






[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19842:
--
Target Version/s:   (was: 3.0.0)

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]
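> As a minimal sketch of one rewrite class this enables, consider join elimination. 
> The tables and the informational foreign key below are hypothetical; Spark has no 
> DDL for declaring such constraints today:
> {code:java}
> // Assume an informational (not enforced) FK fact.store_id REFERENCES dim_store(store_id),
> // where store_id is the primary key of dim_store and is NOT NULL in fact.
> val q = spark.sql(
>   "SELECT f.sales FROM fact f JOIN dim_store d ON f.store_id = d.store_id")
> // No dim_store column is referenced and the constraint guarantees exactly one
> // match per fact row, so Catalyst could rewrite the plan to scan only `fact`,
> // eliminating the join.
> {code}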






[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2020-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028234#comment-17028234
 ] 

Dongjoon Hyun commented on SPARK-19842:
---

I removed `Target Version: 3.0.0` because we created `branch-3.0` and entered 
the `Feature Freeze` phase.

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]






[jira] [Updated] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20964:
--
Target Version/s:   (was: 3.0.0)

> Make some keywords reserved along with the ANSI/SQL standard
> 
>
> Key: SPARK-20964
> URL: https://issues.apache.org/jira/browse/SPARK-20964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current Spark has many non-reserved words that are essentially reserved 
> in the ANSI/SQL standard 
> (http://developer.mimer.se/validator/sql-reserved-words.tml). 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709
> This is because there are many datasources (for instance twitter4j) that 
> unfortunately use reserved keywords for column names (See [~hvanhovell]'s 
> comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). 
> We might fix this issue in future major releases.
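> As a small, purely illustrative example of the user-visible change (which keywords 
> actually become reserved is up to the eventual implementation):
> {code:java}
> // While a keyword such as ORDER is non-reserved, it can be used as a bare identifier:
> spark.sql("SELECT 1 AS order")
> // Once that keyword is reserved per the ANSI/SQL standard, the statement above
> // would no longer parse and the alias would need back-quoting:
> spark.sql("SELECT 1 AS `order`")
> {code}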






[jira] [Commented] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard

2020-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028233#comment-17028233
 ] 

Dongjoon Hyun commented on SPARK-20964:
---

Hi, [~maropu]. I removed the target version first. Please resolve this if this 
is done in 3.0.0.

> Make some keywords reserved along with the ANSI/SQL standard
> 
>
> Key: SPARK-20964
> URL: https://issues.apache.org/jira/browse/SPARK-20964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current Spark has many non-reserved words that are essentially reserved 
> in the ANSI/SQL standard 
> (http://developer.mimer.se/validator/sql-reserved-words.tml). 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709
> This is because there are many datasources (for instance twitter4j) that 
> unfortunately use reserved keywords for column names (See [~hvanhovell]'s 
> comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). 
> We might fix this issue in future major releases.






[jira] [Updated] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22231:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building 
> recommendation algorithms, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the context describing the setting where a model is to be used (e.g. 
> profiles, country, time, device, etc.). Here is a blog post describing the 
> details: [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and 
> then reconstruct the new dataframe as a workaround. This is extremely 
> inefficient because all the optimizations like predicate pushdown in SQL 
> cannot be performed, we cannot leverage the columnar format, and the 
> serialization and deserialization cost becomes really huge even when we just 
> want to add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we've missed any use cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace an existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces an existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---++--+
> // |foo|bar |items |
> // +---++--+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 |20.0|[[20,21.0], [21,22.0]]|
> // 

[jira] [Updated] (SPARK-24625) put all the backward compatible behavior change configs under spark.sql.legacy.*

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24625:
--
Target Version/s:   (was: 3.0.0)

> put all the backward compatible behavior change configs under 
> spark.sql.legacy.*
> 
>
> Key: SPARK-24625
> URL: https://issues.apache.org/jira/browse/SPARK-24625
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Recently we made several behavior changes to Spark SQL, to make it more ANSI 
> SQL compliant or to fix some unreasonable behaviors. For backward 
> compatibility, we added configs that allow users to fall back to the old 
> behavior, and we plan to remove them in Spark 3.0. It's better to put these 
> configs under spark.sql.legacy.*
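> For example (the config key below is made up, purely to illustrate the proposed 
> naming convention):
> {code:java}
> // A fallback switch for one reverted behavior, grouped under the legacy namespace:
> spark.conf.set("spark.sql.legacy.someRevertedBehavior.enabled", "true")
> {code}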






[jira] [Commented] (SPARK-24625) put all the backward compatible behavior change configs under spark.sql.legacy.*

2020-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028232#comment-17028232
 ] 

Dongjoon Hyun commented on SPARK-24625:
---

For this, I'll remove the target version first. Please resolve this if we have 
finished it, [~cloud_fan].

> put all the backward compatible behavior change configs under 
> spark.sql.legacy.*
> 
>
> Key: SPARK-24625
> URL: https://issues.apache.org/jira/browse/SPARK-24625
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Recently we made several behavior changes to Spark SQL, to make it more ANSI 
> SQL compliant or to fix some unreasonable behaviors. For backward 
> compatibility, we added configs that allow users to fall back to the old 
> behavior, and we plan to remove them in Spark 3.0. It's better to put these 
> configs under spark.sql.legacy.*






[jira] [Updated] (SPARK-24941) Add RDDBarrier.coalesce() function

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24941:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, e.g. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the HDFS input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks in 
> a barrier stage, maybe something like RDDBarrier.coalesce(numPartitions: Int).
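> A sketch of how the proposed API might be used; the method does not exist yet, 
> and the signature is taken from the sentence above:
> {code:java}
> // Hypothetical: cap the barrier stage at 4 tasks regardless of the number of HDFS splits.
> sc.textFile("hdfs://path/to/input")
>   .barrier()
>   .coalesce(4)                      // proposed RDDBarrier.coalesce(numPartitions: Int)
>   .mapPartitions { iter => iter }   // all 4 barrier tasks are launched together
>   .collect()
> {code}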






[jira] [Commented] (SPARK-25531) new write APIs for data source v2

2020-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028231#comment-17028231
 ] 

Dongjoon Hyun commented on SPARK-25531:
---

I moved this to 3.1.0 for now. If we can resolve this issue, you can change it 
back to 3.0.0.

> new write APIs for data source v2
> -
>
> Key: SPARK-25531
> URL: https://issues.apache.org/jira/browse/SPARK-25531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current data source write API depends heavily on {{SaveMode}}, which 
> doesn't have clear semantics, especially when writing to tables.
> We should design a new set of write APIs without {{SaveMode}}.
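> For context, today's API leaves the meaning of each mode to the individual 
> source (a short example, assuming an arbitrary DataFrame {{df}}):
> {code:java}
> import org.apache.spark.sql.SaveMode
> // Whether Overwrite means "drop and recreate", "truncate", or something else is
> // decided by each data source; that ambiguity is what this issue wants to remove.
> df.write.format("parquet").mode(SaveMode.Overwrite).save("/tmp/out")
> {code}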






[jira] [Updated] (SPARK-25531) new write APIs for data source v2

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25531:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> new write APIs for data source v2
> -
>
> Key: SPARK-25531
> URL: https://issues.apache.org/jira/browse/SPARK-25531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current data source write API depends heavily on {{SaveMode}}, which 
> doesn't have clear semantics, especially when writing to tables.
> We should design a new set of write APIs without {{SaveMode}}.






[jira] [Updated] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24942:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Improve cluster resource management with jobs containing barrier stage
> --
>
> Key: SPARK-24942
> URL: https://issues.apache.org/jira/browse/SPARK-24942
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r205652317
> We shall improve cluster resource management to address the following issues:
> - With dynamic resource allocation enabled, it may happen that we acquire 
> some executors (but not enough to launch all the tasks in a barrier stage) 
> and later release them due to executor idle time expire, and then acquire 
> again.
> - There can be a deadlock between two concurrent applications. Each application 
> may acquire some resources, but not enough to launch all the tasks in a 
> barrier stage. And after hitting the idle timeout and releasing them, they 
> may acquire resources again, but just continually trade resources between 
> each other.






[jira] [Updated] (SPARK-25383) Image data source supports sample pushdown

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25383:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Image data source supports sample pushdown
> --
>
> Key: SPARK-25383
> URL: https://issues.apache.org/jira/browse/SPARK-25383
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> After SPARK-25349, we should update the image data source to support sampling.






[jira] [Commented] (SPARK-25531) new write APIs for data source v2

2020-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028230#comment-17028230
 ] 

Dongjoon Hyun commented on SPARK-25531:
---

Hi, [~cloud_fan].
Could you resolve this issue or adjust the target version to `3.1.0`?

> new write APIs for data source v2
> -
>
> Key: SPARK-25531
> URL: https://issues.apache.org/jira/browse/SPARK-25531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current data source write API depends heavily on {{SaveMode}}, which 
> doesn't have clear semantics, especially when writing to tables.
> We should design a new set of write APIs without {{SaveMode}}.






[jira] [Updated] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26425:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Add more constraint checks in file streaming source to avoid checkpoint 
> corruption
> --
>
> Key: SPARK-26425
> URL: https://issues.apache.org/jira/browse/SPARK-26425
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Two issues observed in production. 
> - HDFSMetadataLog.getLatest() tries to read older versions when it is not 
> able to read the latest listed version file. Not sure why this was done but 
> this should not be done. If the latest listed file is not readable, then 
> something is horribly wrong and we should fail rather than report an older 
> version as that can completely corrupt the checkpoint directory. 
> - FileStreamSource should check whether adding a new batch to the 
> FileStreamSourceLog succeeded or not (similar to how StreamExecution checks 
> for the OffsetSeqLog)






[jira] [Updated] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25752:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Add trait to easily whitelist logical operators that produce named output 
> from CleanupAliases
> -
>
> Key: SPARK-25752
> URL: https://issues.apache.org/jira/browse/SPARK-25752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> The rule `CleanupAliases` cleans up aliases from logical operators that do 
> not match a whitelist. This whitelist is hardcoded inside the rule which is 
> cumbersome. This PR is to clean that up by making a trait `HasNamedOutput` 
> that will be ignored by `CleanupAliases` and other ops that require aliases 
> to be preserved in the operator should extend it.
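> A minimal sketch of the idea; the trait name comes from this issue, its exact 
> shape is an assumption:
> {code:java}
> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
>
> // Marker trait for operators that produce named output and must keep their aliases.
> trait HasNamedOutput { self: LogicalPlan => }
>
> // CleanupAliases would then skip any plan node extending HasNamedOutput instead of
> // pattern-matching a hardcoded whitelist of operator classes.
> {code}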






[jira] [Resolved] (SPARK-27471) Reorganize public v2 catalog API

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27471.
---
  Assignee: Ryan Blue
Resolution: Done

> Reorganize public v2 catalog API
> 
>
> Key: SPARK-27471
> URL: https://issues.apache.org/jira/browse/SPARK-27471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Blocker
>
> In the review for SPARK-27181, Reynold suggested some package moves. We've 
> decided (at the v2 community sync) not to delay by having this discussion now 
> because we want to get the new catalog API in so we can work on more logical 
> plans in parallel. But we do need to make sure we have a sane package scheme 
> for the next release.






[jira] [Commented] (SPARK-27471) Reorganize public v2 catalog API

2020-02-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028229#comment-17028229
 ] 

Dongjoon Hyun commented on SPARK-27471:
---

I believe we've done what we need for 3.0.0. For the remaining reorganization, 
we can file another JIRA.

> Reorganize public v2 catalog API
> 
>
> Key: SPARK-27471
> URL: https://issues.apache.org/jira/browse/SPARK-27471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Blocker
>
> In the review for SPARK-27181, Reynold suggested some package moves. We've 
> decided (at the v2 community sync) not to delay by having this discussion now 
> because we want to get the new catalog API in so we can work on more logical 
> plans in parallel. But we do need to make sure we have a sane package scheme 
> for the next release.






[jira] [Updated] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27780:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But, may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.






[jira] [Updated] (SPARK-27936) Support local dependency uploading from --py-files

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27936:
--
Target Version/s: 3.1.0

> Support local dependency uploading from --py-files
> --
>
> Key: SPARK-27936
> URL: https://issues.apache.org/jira/browse/SPARK-27936
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
>
> Support python dependency uploads, as in SPARK-23153






[jira] [Updated] (SPARK-28629) Capture the missing rules in HiveSessionStateBuilder

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28629:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Capture the missing rules in HiveSessionStateBuilder
> 
>
> Key: SPARK-28629
> URL: https://issues.apache.org/jira/browse/SPARK-28629
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> A common mistake for new contributors is to forget to add the corresponding 
> rules to extendedResolutionRules, postHocResolutionRules, or 
> extendedCheckRules in HiveSessionStateBuilder. We need a way to avoid missing 
> these rules, or to catch the omission when it happens.






[jira] [Updated] (SPARK-28717) Update SQL ALTER TABLE RENAME to use TableCatalog API

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28717:
--
Target Version/s:   (was: 3.0.0)

> Update SQL ALTER TABLE RENAME  to use TableCatalog API
> --
>
> Key: SPARK-28717
> URL: https://issues.apache.org/jira/browse/SPARK-28717
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Edgar Rodriguez
>Priority: Major
>
> Follow-up from SPARK-28265
> SQL implementation of ALTER TABLE RENAME needs to be updated to use the 
> TableCatalog API operation {{renameTable}} - having something like: 
> {code:java}
> ALTER TABLE [catalog_name] [namespace_name] table_name
> TO [new_namespace_name] new_table_name{code}
>  
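> Under the hood, the resolved command should roughly boil down to a catalog call 
> like this (a sketch; the identifier construction is illustrative):
> {code:java}
> import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
> // `catalog` is the TableCatalog resolved from [catalog_name]
> catalog.renameTable(
>   Identifier.of(Array("old_namespace"), "old_table"),
>   Identifier.of(Array("new_namespace"), "new_table"))
> {code}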






[jira] [Updated] (SPARK-27936) Support local dependency uploading from --py-files

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27936:
--
Target Version/s:   (was: 3.0.0)

> Support local dependency uploading from --py-files
> --
>
> Key: SPARK-27936
> URL: https://issues.apache.org/jira/browse/SPARK-27936
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
>
> Support python dependency uploads, as in SPARK-23153






[jira] [Resolved] (SPARK-30097) Adding support for core writers

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30097.
---
Resolution: Won't Do

> Adding support for core writers 
> 
>
> Key: SPARK-30097
> URL: https://issues.apache.org/jira/browse/SPARK-30097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code:java}
> {code}
>  
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> When using *writeStream* we always have to use *format("xxx")* in order to 
> target the selected sink, while with *readStream* you can directly use 
> *.parquet*.
> Basically, this is to add support for the core writers to *writeStream*.
> Example:
>   
> {code:java}
> writeStream
> .outputMode("append")
> .partitionBy("id")
> .options(options)
> .parquet(path)
>  {code}
>  






[jira] [Updated] (SPARK-30097) Adding support for core writers

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30097:
--
Target Version/s:   (was: 3.0.0)

> Adding support for core writers 
> 
>
> Key: SPARK-30097
> URL: https://issues.apache.org/jira/browse/SPARK-30097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code:java}
> {code}
>  
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> When using *writeStream* we always have to use *format("xxx")* in order to 
> target the selected sink, while with *readStream* you can directly use 
> *.parquet*.
> Basically, this is to add support for the core writers to *writeStream*.
> Example:
>   
> {code:java}
> writeStream
> .outputMode("append")
> .partitionBy("id")
> .options(options)
> .parquet(path)
>  {code}
>  






[jira] [Updated] (SPARK-30186) support Dynamic Partition Pruning in Adaptive Execution

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30186:
--
Fix Version/s: (was: 3.0.0)

> support Dynamic Partition Pruning in Adaptive Execution
> ---
>
> Key: SPARK-30186
> URL: https://issues.apache.org/jira/browse/SPARK-30186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiaoju Wu
>Priority: Major
>
> Currently Adaptive Execution cannot work if Dynamic Partition Pruning is 
> applied.
> private def supportAdaptive(plan: SparkPlan): Boolean = {
>  // TODO migrate dynamic-partition-pruning onto adaptive execution.
>  sanityCheck(plan) &&
>  !plan.logicalLink.exists(_.isStreaming) &&
>  
> *!plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined)*
>  &&
>  plan.children.forall(supportAdaptive)
> }
> This means we cannot get the performance benefits of both AE and DPP at the 
> same time.
> This ticket targets making DPP and AE work together.






[jira] [Updated] (SPARK-30186) support Dynamic Partition Pruning in Adaptive Execution

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30186:
--
Target Version/s:   (was: 3.0.0)

> support Dynamic Partition Pruning in Adaptive Execution
> ---
>
> Key: SPARK-30186
> URL: https://issues.apache.org/jira/browse/SPARK-30186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiaoju Wu
>Priority: Major
>
> Currently Adaptive Execution cannot work if Dynamic Partition Pruning is 
> applied.
> private def supportAdaptive(plan: SparkPlan): Boolean = {
>  // TODO migrate dynamic-partition-pruning onto adaptive execution.
>  sanityCheck(plan) &&
>  !plan.logicalLink.exists(_.isStreaming) &&
>  
> *!plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined)*
>  &&
>  plan.children.forall(supportAdaptive)
> }
> This means we cannot get the performance benefits of both AE and DPP at the 
> same time.
> This ticket targets making DPP and AE work together.






[jira] [Closed] (SPARK-30097) Adding support for core writers

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-30097.
-

> Adding support for core writers 
> 
>
> Key: SPARK-30097
> URL: https://issues.apache.org/jira/browse/SPARK-30097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
> Environment: {code:java}
> {code}
>  
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> When using *writeStream* we always have to use *format("xxx")* in order to 
> target the selected sink while in r*eadStream* you can use directly 
> *.parquet* 
> Basically this is to add the support to the core writers for *writeStream*
> Example:
>   
> {code:java}
> writeStream
> .outputMode("append")
> .partitionBy("id")
> .options(options)
> .parquet(path)
>  {code}
>  






[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30334:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Add metadata around semi-structured columns to Spark
> 
>
> Key: SPARK-30334
> URL: https://issues.apache.org/jira/browse/SPARK-30334
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> Semi-structured data is used widely in the data industry for reporting events 
> in a wide variety of formats. Click events in product analytics can be stored 
> as json. Some application logs can be in the form of delimited key=value 
> text. Some data may be in xml.
> The goal of this project is to be able to signal Spark that such a column 
> exists. This will then enable Spark to "auto-parse" these columns on the fly. 
> The proposal is to store this information as part of the column metadata, in 
> the fields:
>  - format: The format of the semi-structured column, e.g. json, xml, avro
>  - options: Options for parsing these columns
> Then imagine having the following data:
> {code:java}
> +------------+-------+--------------------+
> | ts         | event | raw                |
> +------------+-------+--------------------+
> | 2019-10-12 | click | {"field":"value"}  |
> +------------+-------+--------------------+ {code}
> SELECT raw.field FROM data
> will return "value"
> or the following data
> {code:java}
> +------------+-------+----------------------+
> | ts         | event | raw                  |
> +------------+-------+----------------------+
> | 2019-10-12 | click | field1=v1|field2=v2  |
> +------------+-------+----------------------+ {code}
> SELECT raw.field1 FROM data
> will return v1.
>  
> As a first step, we will introduce the function "as_json", which accomplishes 
> this for JSON columns.






[jira] [Updated] (SPARK-30324) Simplify API for JSON access in DataFrames/SQL

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30324:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Simplify API for JSON access in DataFrames/SQL
> --
>
> Key: SPARK-30324
> URL: https://issues.apache.org/jira/browse/SPARK-30324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> get_json_object() is a UDF to parse JSON fields. It is verbose and hard to 
> use, e.g. I wasn't expecting the path to a field to have to start with "$.". 
> We can simplify all of this when a column is of StringType, and a nested 
> field is requested. This API sugar will in the query planner be rewritten as 
> get_json_object.
> This nested access can then be extended in the future to other 
> semi-structured formats.
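> Roughly, using an arbitrary DataFrame {{df}} (the second form is the proposed 
> sugar, not something Spark supports today):
> {code:java}
> import org.apache.spark.sql.functions.{col, get_json_object}
> // Today: verbose, and the path has to start with "$."
> df.select(get_json_object(col("raw"), "$.field"))
> // Proposed sugar: nested-field access on a StringType column, rewritten by the
> // planner into get_json_object.
> df.select(col("raw.field"))
> {code}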






[jira] [Updated] (SPARK-30567) setDelegateCatalog should be called if catalog has implemented CatalogExtension

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30567:
--
Target Version/s:   (was: 3.1.0)

> setDelegateCatalog should be called if catalog has implemented 
> CatalogExtension
> ---
>
> Key: SPARK-30567
> URL: https://issues.apache.org/jira/browse/SPARK-30567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yu jiantao
>Priority: Major
>
> CatalogManager.catalog calls Catalogs.load to load a catalog if it is not 
> 'spark_catalog'. If the catalog has implemented CatalogExtension, 
> setDelegateCatalog is not called when the catalog is loaded, unlike what we 
> do for v2SessionCatalog, and that causes confusion for customized session 
> catalogs such as Iceberg's SparkSessionCatalog.
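> What the issue asks for, sketched inside CatalogManager; the method names come 
> from the v2 catalog API, while the wiring details are an assumption:
> {code:java}
> val plugin = Catalogs.load(name, conf)
> plugin match {
>   // mirror what is already done for the built-in spark_catalog:
>   case ext: CatalogExtension => ext.setDelegateCatalog(defaultSessionCatalog)
>   case _ => // plain catalogs need no delegate
> }
> {code}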






[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30667:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> Support simple all gather in barrier task context
> -
>
> Key: SPARK-30667
> URL: https://issues.apache.org/jira/browse/SPARK-30667
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
> can see all IP addresses from BarrierTaskContext. It would be simpler to 
> integrate with distributed frameworks like TensorFlow DistributionStrategy if 
> we provided an all-gather that lets tasks share additional information with 
> each other, e.g., an available port.
> Note that with all-gather, tasks share their IP addresses as well.
> {code}
> port = ... # get an available port
> ports = context.all_gather(port) # get all available ports, ordered by task ID
> ...  # set up distributed training service
> {code}






[jira] [Updated] (SPARK-30567) setDelegateCatalog should be called if catalog has implemented CatalogExtension

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30567:
--
Fix Version/s: (was: 3.0.0)

> setDelegateCatalog should be called if catalog has implemented 
> CatalogExtension
> ---
>
> Key: SPARK-30567
> URL: https://issues.apache.org/jira/browse/SPARK-30567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yu jiantao
>Priority: Major
>
> CatalogManager.catalog calls Catalogs.load to load a catalog if it is not 
> 'spark_catalog'. If the catalog has implemented CatalogExtension, 
> setDelegateCatalog is not called when the catalog is loaded, unlike what we 
> do for v2SessionCatalog, and that causes confusion for customized session 
> catalogs such as Iceberg's SparkSessionCatalog.






[jira] [Updated] (SPARK-30567) setDelegateCatalog should be called if catalog has implemented CatalogExtension

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30567:
--
Target Version/s: 3.1.0  (was: 3.0.0)

> setDelegateCatalog should be called if catalog has implemented 
> CatalogExtension
> ---
>
> Key: SPARK-30567
> URL: https://issues.apache.org/jira/browse/SPARK-30567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yu jiantao
>Priority: Major
>
> CatalogManager.catalog calls Catalogs.load to load a catalog if it is not 
> 'spark_catalog'. If the catalog has implemented CatalogExtension, 
> setDelegateCatalog is not called when the catalog is loaded, unlike what we 
> do for v2SessionCatalog, and that causes confusion for customized session 
> catalogs such as Iceberg's SparkSessionCatalog.






[jira] [Updated] (SPARK-24941) Add RDDBarrier.coalesce() function

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24941:
--
Target Version/s: 3.1.0

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, e.g. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the HDFS input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks in 
> a barrier stage, maybe something like RDDBarrier.coalesce(numPartitions: Int).






[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30334:
--
Target Version/s: 3.1.0

> Add metadata around semi-structured columns to Spark
> 
>
> Key: SPARK-30334
> URL: https://issues.apache.org/jira/browse/SPARK-30334
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> Semi-structured data is used widely in the data industry for reporting events 
> in a wide variety of formats. Click events in product analytics can be stored 
> as json. Some application logs can be in the form of delimited key=value 
> text. Some data may be in xml.
> The goal of this project is to be able to signal Spark that such a column 
> exists. This will then enable Spark to "auto-parse" these columns on the fly. 
> The proposal is to store this information as part of the column metadata, in 
> the fields:
>  - format: The format of the semi-structured column, e.g. json, xml, avro
>  - options: Options for parsing these columns
> Then imagine having the following data:
> {code:java}
> +------------+-------+--------------------+
> | ts         | event | raw                |
> +------------+-------+--------------------+
> | 2019-10-12 | click | {"field":"value"}  |
> +------------+-------+--------------------+
> {code}
> SELECT raw.field FROM data
> will return "value"
> or the following data
> {code:java}
> +------------+-------+----------------------+
> | ts         | event | raw                  |
> +------------+-------+----------------------+
> | 2019-10-12 | click | field1=v1|field2=v2  |
> +------------+-------+----------------------+
> {code}
> SELECT raw.field1 FROM data
> will return v1.
>  
> As a first step, we will introduce the function "as_json", which accomplishes 
> this for JSON columns.
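> A rough sketch of how the proposed hints could be attached through column 
> metadata today (the field names "format" and "options" follow the proposal 
> above; df is assumed to be a DataFrame with the string column "raw" from the 
> first example):
> {code}
> import org.apache.spark.sql.types.MetadataBuilder
> 
> val rawMeta = new MetadataBuilder()
>   .putString("format", "json")      // mark the column as JSON-encoded
>   .build()
> val tagged = df.withColumn("raw", df("raw").as("raw", rawMeta))
> // A future as_json(raw) or raw.field access would read this metadata to
> // decide how to parse the column on the fly.
> {code}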



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20964:
--
Target Version/s: 3.1.0

> Make some keywords reserved along with the ANSI/SQL standard
> 
>
> Key: SPARK-20964
> URL: https://issues.apache.org/jira/browse/SPARK-20964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current Spark has many non-reserved words that are essentially reserved 
> in the ANSI/SQL standard 
> (http://developer.mimer.se/validator/sql-reserved-words.tml). 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709
> This is because there are many datasources (for instance twitter4j) that 
> unfortunately use reserved keywords for column names (See [~hvanhovell]'s 
> comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). 
> We might fix this issue in future major releases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30186) support Dynamic Partition Pruning in Adaptive Execution

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30186:
--
Target Version/s: 3.1.0

> support Dynamic Partition Pruning in Adaptive Execution
> ---
>
> Key: SPARK-30186
> URL: https://issues.apache.org/jira/browse/SPARK-30186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiaoju Wu
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently Adaptive Execution cannot work if Dynamic Partition Pruning is 
> applied.
> {code}
> private def supportAdaptive(plan: SparkPlan): Boolean = {
>   // TODO migrate dynamic-partition-pruning onto adaptive execution.
>   sanityCheck(plan) &&
>     !plan.logicalLink.exists(_.isStreaming) &&
>     !plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined) &&
>     plan.children.forall(supportAdaptive)
> }
> {code}
> This means we cannot get the performance benefits of AE and DPP at the same time.
> This ticket targets making DPP and AE work together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25531) new write APIs for data source v2

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25531:
--
Target Version/s: 3.1.0

> new write APIs for data source v2
> -
>
> Key: SPARK-25531
> URL: https://issues.apache.org/jira/browse/SPARK-25531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current data source write API depends heavily on {{SaveMode}}, which 
> doesn't have clear semantics, especially when writing to tables.
> We should design a new set of write APIs without {{SaveMode}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24942:
--
Target Version/s: 3.1.0

> Improve cluster resource management with jobs containing barrier stage
> --
>
> Key: SPARK-24942
> URL: https://issues.apache.org/jira/browse/SPARK-24942
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r205652317
> We shall improve cluster resource management to address the following issues:
> - With dynamic resource allocation enabled, it may happen that we acquire 
> some executors (but not enough to launch all the tasks in a barrier stage) 
> and later release them due to executor idle time expire, and then acquire 
> again.
> - There can be deadlock with two concurrent applications. Each application 
> may acquire some resources, but not enough to launch all the tasks in a 
> barrier stage. And after hitting the idle timeout and releasing them, they 
> may acquire resources again, but just continually trade resources between 
> each other.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30667:
--
Target Version/s: 3.1.0

> Support simple all gather in barrier task context
> -
>
> Key: SPARK-30667
> URL: https://issues.apache.org/jira/browse/SPARK-30667
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
> can see all IP addresses from BarrierTaskContext. It would be simpler to 
> integrate with distributed frameworks like TensorFlow DistributionStrategy if 
> we provided an all-gather that lets tasks share additional information with 
> the others, e.g., an available port.
> Note that with all-gather, tasks share their IP addresses as well.
> {code}
> port = ... # get an available port
> ports = context.all_gather(port) # get all available ports, ordered by task ID
> ...  # set up distributed training service
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27780:
--
Target Version/s: 3.1.0

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But, may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26425:
--
Target Version/s: 3.1.0

> Add more constraint checks in file streaming source to avoid checkpoint 
> corruption
> --
>
> Key: SPARK-26425
> URL: https://issues.apache.org/jira/browse/SPARK-26425
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Two issues observed in production. 
> - HDFSMetadataLog.getLatest() tries to read older versions when it is not 
> able to read the latest listed version file. Not sure why this was done but 
> this should not be done. If the latest listed file is not readable, then 
> something is horribly wrong and we should fail rather than report an older 
> version as that can completely corrupt the checkpoint directory. 
> - FileStreamSource should check whether adding a new batch to the 
> FileStreamSourceLog succeeded or not (similar to how StreamExecution checks 
> for the OffsetSeqLog)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30567) setDelegateCatalog should be called if catalog has implemented CatalogExtension

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30567:
--
Target Version/s: 3.1.0

> setDelegateCatalog should be called if catalog has implemented 
> CatalogExtension
> ---
>
> Key: SPARK-30567
> URL: https://issues.apache.org/jira/browse/SPARK-30567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yu jiantao
>Priority: Major
> Fix For: 3.0.0
>
>
> CatalogManager.catalog calls Catalogs.load to load a catalog whose name is not 
> 'spark_catalog'. If that catalog implements CatalogExtension, 
> setDelegateCatalog is never called on it when it is loaded, unlike what is 
> done for v2SessionCatalog. This is confusing for customized session catalogs 
> such as Iceberg's SparkSessionCatalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30324) Simplify API for JSON access in DataFrames/SQL

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30324:
--
Target Version/s: 3.1.0

> Simplify API for JSON access in DataFrames/SQL
> --
>
> Key: SPARK-30324
> URL: https://issues.apache.org/jira/browse/SPARK-30324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> get_json_object() is a UDF to parse JSON fields. It is verbose and hard to 
> use, e.g. I wasn't expecting the path to a field to have to start with "$.". 
> We can simplify all of this when a column is of StringType and a nested 
> field is requested. In the query planner, this API sugar will be rewritten as 
> a get_json_object call.
> This nested access can then be extended in the future to other 
> semi-structured formats.
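> For comparison, the existing form in Scala (get_json_object is an existing 
> Spark SQL function; the column and field names follow the example in 
> SPARK-30334):
> {code}
> import org.apache.spark.sql.functions.{col, get_json_object}
> 
> // Today: explicit function call with a "$."-prefixed path.
> df.select(get_json_object(col("raw"), "$.field"))
> // Proposed sugar: SELECT raw.field FROM data, rewritten by the planner
> // into the call above when "raw" is a StringType column.
> {code}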



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30097) Adding support for core writers

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30097:
--
Target Version/s: 3.1.0

> Adding support for core writers 
> 
>
> Key: SPARK-30097
> URL: https://issues.apache.org/jira/browse/SPARK-30097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> When using *writeStream* we always have to use *format("xxx")* in order to 
> target the selected sink, while in *readStream* you can use *.parquet* 
> directly.
> Basically, this is to add support for the core writers to *writeStream*.
> Example:
>   
> {code:java}
> writeStream
> .outputMode("append")
> .partitionBy("id")
> .options(options)
> .parquet(path)
>  {code}
>  
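> For reference, a sketch of what the same query looks like today without the 
> shorthand (df, options, and path are assumed to be defined, and a checkpoint 
> location is usually also required):
> {code}
> df.writeStream
>   .outputMode("append")
>   .partitionBy("id")
>   .options(options)
>   .format("parquet")     // the part the proposal would fold into .parquet(path)
>   .start(path)
> {code}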



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24625) put all the backward compatible behavior change configs under spark.sql.legacy.*

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24625:
--
Target Version/s: 3.1.0

> put all the backward compatible behavior change configs under 
> spark.sql.legacy.*
> 
>
> Key: SPARK-24625
> URL: https://issues.apache.org/jira/browse/SPARK-24625
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Recently we made several behavior changes to Spark SQL, to make it more ANSI 
> SQL compliant or to fix some unreasonable behaviors. For backward compatibility, 
> we added configs that allow users to fall back to the old behavior, and we plan 
> to remove them in Spark 3.0. It's better to put these configs under spark.sql.legacy.*
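> For example (the config name below is made up purely to illustrate the 
> proposed namespace; it is not a real flag):
> {code}
> spark.conf.set("spark.sql.legacy.someOldBehavior.enabled", "true")
> {code}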



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19842:
--
Target Version/s: 3.1.0

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27471) Reorganize public v2 catalog API

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27471:
--
Target Version/s: 3.1.0

> Reorganize public v2 catalog API
> 
>
> Key: SPARK-27471
> URL: https://issues.apache.org/jira/browse/SPARK-27471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Blocker
>
> In the review for SPARK-27181, Reynold suggested some package moves. We've 
> decided (at the v2 community sync) not to delay by having this discussion now 
> because we want to get the new catalog API in so we can work on more logical 
> plans in parallel. But we do need to make sure we have a sane package scheme 
> for the next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28629) Capture the missing rules in HiveSessionStateBuilder

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28629:
--
Target Version/s: 3.1.0

> Capture the missing rules in HiveSessionStateBuilder
> 
>
> Key: SPARK-28629
> URL: https://issues.apache.org/jira/browse/SPARK-28629
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> A common mistake for new contributors is forgetting to add the corresponding 
> rules to extendedResolutionRules, postHocResolutionRules, or 
> extendedCheckRules in HiveSessionStateBuilder. We need a way to avoid missing 
> these rules, or to catch such omissions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28717) Update SQL ALTER TABLE RENAME to use TableCatalog API

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28717:
--
Target Version/s: 3.1.0

> Update SQL ALTER TABLE RENAME  to use TableCatalog API
> --
>
> Key: SPARK-28717
> URL: https://issues.apache.org/jira/browse/SPARK-28717
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Edgar Rodriguez
>Priority: Major
>
> Follow-up from SPARK-28265
> SQL implementation of ALTER TABLE RENAME needs to be updated to use the 
> TableCatalog API operation {{renameTable}} - having something like: 
> {code:java}
> ALTER TABLE [catalog_name] [namespace_name] table_name
> TO [new_namespace_name] new_table_name{code}
>  
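> A sketch of the catalog call the SQL path would delegate to (renameTable and 
> Identifier are part of the DSv2 TableCatalog API; the namespaces and table 
> names are example values):
> {code}
> import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
> 
> def rename(catalog: TableCatalog): Unit = {
>   catalog.renameTable(
>     Identifier.of(Array("old_ns"), "old_table"),
>     Identifier.of(Array("new_ns"), "new_table"))
> }
> {code}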



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25383) Image data source supports sample pushdown

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25383:
--
Target Version/s: 3.1.0

> Image data source supports sample pushdown
> --
>
> Key: SPARK-25383
> URL: https://issues.apache.org/jira/browse/SPARK-25383
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> After SPARK-25349, we should update image data source to support sampling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25752:
--
Target Version/s: 3.1.0

> Add trait to easily whitelist logical operators that produce named output 
> from CleanupAliases
> -
>
> Key: SPARK-25752
> URL: https://issues.apache.org/jira/browse/SPARK-25752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> The rule `CleanupAliases` cleans up aliases from logical operators that do 
> not match a whitelist. This whitelist is hardcoded inside the rule which is 
> cumbersome. This PR is to clean that up by making a trait `HasNamedOutput` 
> that will be ignored by `CleanupAliases` and other ops that require aliases 
> to be preserved in the operator should extend it.
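> A minimal sketch of the idea, assuming the trait name from the description:
> {code}
> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
> 
> // Marker trait; operators whose aliases must survive CleanupAliases mix it in.
> trait HasNamedOutput { self: LogicalPlan => }
> 
> // CleanupAliases would then skip any node matching `case p: HasNamedOutput`
> // instead of consulting a hardcoded whitelist.
> {code}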



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27936) Support local dependency uploading from --py-files

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27936:
--
Target Version/s: 3.1.0

> Support local dependency uploading from --py-files
> --
>
> Key: SPARK-27936
> URL: https://issues.apache.org/jira/browse/SPARK-27936
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
>
> Support python dependency uploads, as in SPARK-23153



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-02-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22231:
--
Target Version/s: 3.1.0

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithms, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema can be looked like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- title_id: integer (nullable = true)
>  |    |    |-- scores: double (nullable = true)
> ...
> {code}
> We often need to work on the nested list of structs by applying 
> functions to them. Sometimes we drop or add columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because optimizations like predicate pushdown in SQL cannot be 
> performed, we cannot leverage the columnar format, and the serialization 
> and deserialization cost becomes huge even when we just want to add a new 
> column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we miss any use-case.  
> The first API we added is *mapItems* on dataframe which take a function from 
> *Column* to *Column*, and then apply the function on nested dataframe. Here 
> is an example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace an existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: struct (containsNull = true)
> // |    |    |-- a: integer (nullable = true)
> // |    |    |-- b: double (nullable = true)
> result.show(false)
> // +---+----+----------------------+
> // |foo|bar |items                 |
> // +---+----+----------------------+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 |20.0|[[20,21.0], [21,22.0]]|
> // +---+----+----------------------+

[jira] [Commented] (SPARK-30657) Streaming limit after streaming dropDuplicates can throw error

2020-02-01 Thread Shixiong Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028223#comment-17028223
 ] 

Shixiong Zhu commented on SPARK-30657:
--

[~tdas] Makes sense. Agreed that the risk is high but the benefit is pretty low. 
We can backport it later whenever needed.

> Streaming limit after streaming dropDuplicates can throw error
> --
>
> Key: SPARK-30657
> URL: https://issues.apache.org/jira/browse/SPARK-30657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 3.0.0
>
>
> {{LocalLimitExec}} does not consume the iterator of the child plan. So if 
> there is a limit after a stateful operator like streaming dedup in append 
> mode (e.g. {{streamingdf.dropDuplicates().limit(5}})), the state changes of 
> streaming duplicate may not be committed (most stateful ops commit state 
> changes only after the generated iterator is fully consumed). This leads to 
> the next batch failing with {{java.lang.IllegalStateException: Error reading 
> delta file .../N.delta does not exist}} as the state store delta file was 
> never generated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028026#comment-17028026
 ] 

Guram Savinov edited comment on SPARK-30701 at 2/1/20 9:38 AM:
---

So the problem is: the backslash character isn't included in allowedChars; see 
the attached HadoopGroupTest.java.
This is a Hadoop issue, not a Spark one.


was (Author: gsavinov):
So the problem is: backslash character isn't included to allowedChars, see 
attached HadoopGroupTest.java

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4 / Hadoop 2.6.5
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
> Attachments: HadoopGroupTest.java
>
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:bash}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive
> Seems like the problem is here: 
> hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsShellPermissions.java:210



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Environment: 
Windows 10

Winutils 2.7.1: 
[https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]

Oracle JavaSE 8

SparkSQL 2.4.4 / Hadoop 2.6.5

Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive

Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive

  was:
Windows 10

Winutils 2.7.1: 
[https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]

Oracle JavaSE 8

SparkSQL 2.4.4

Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive

Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive


> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4 / Hadoop 2.6.5
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
> Attachments: HadoopGroupTest.java
>
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:bash}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive
> Seems like the problem is here: 
> hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsShellPermissions.java:210



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30702) Support subexpression elimination in whole stage codegen

2020-02-01 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-30702:
---

 Summary: Support subexpression elimination in whole stage codegen
 Key: SPARK-30702
 URL: https://issues.apache.org/jira/browse/SPARK-30702
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Please see 
https://github.com/apache/spark/blob/a3a42b30d04009282e770c289b043ca5941e32e5/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L2011-L2067
 for more details.
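For illustration, a query shape that could benefit (the duplicated sqrt(id + 1) 
is an arbitrary example of a repeated subexpression):

{code}
// With subexpression elimination, the shared sqrt(id + 1) would be computed
// once per row instead of once per output column.
spark.range(10).selectExpr(
  "sqrt(id + 1) AS a",
  "sqrt(id + 1) AS b"
).show()
{code}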



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028026#comment-17028026
 ] 

Guram Savinov commented on SPARK-30701:
---

So the problem is: the backslash character isn't included in allowedChars; see 
the attached HadoopGroupTest.java.
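A rough Scala reconstruction of the check (the character class below is an 
assumption based on the description of FsShellPermissions; see 
HadoopGroupTest.java for the actual test):

{code}
// Group names are validated against a whitelist of characters; a backslash,
// as in a Windows domain group like TEST\Domain users, is not in the set.
val allowedChars = "[-_./@a-zA-Z0-9 ]"                  // assumed Windows variant
val groupPattern = ("^" + allowedChars + "+$").r
println(groupPattern.findFirstIn("""TEST\Domain users""").isDefined)   // false
{code}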

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
> Attachments: HadoopGroupTest.java
>
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:bash}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive
> Seems like the problem is here: 
> hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsShellPermissions.java:210



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Attachment: HadoopGroupTest.java

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
> Attachments: HadoopGroupTest.java
>
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:bash}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive
> Seems like the problem is here: 
> hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsShellPermissions.java:210



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Description: 
Running SparkSQL local embedded unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:bash}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

Related info on SO: 
https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive
Seems like the problem is here: 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsShellPermissions.java:210

  was:
Running SparkSQL local embedded unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:bash}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

Related info on SO: 
https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive


> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:bash}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive
> Seems like the problem is here: 
> hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsShellPermissions.java:210



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Description: 
Running SparkSQL local embedded unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:bash}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

Related info on SO: 
https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive

  was:
Running SparkSQL local embedded unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

Related info on SO: 
https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive


> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:bash}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Description: 
Running SparkSQL local embedded unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

Related info on SO: 
https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive

  was:
Running SparkSQL local embedded unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}


> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:java}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}
> Related info on SO: 
> https://stackoverflow.com/questions/48605907/error-in-pyspark-when-insert-data-in-hive



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Description: 
Running SparkSQL local embedded unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

  was:
Running SparkSQL embedded unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}


> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> Running SparkSQL local embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:java}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Environment: 
Windows 10

Winutils 2.7.1: 
[https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]

Oracle JavaSE 8

SparkSQL 2.4.4

Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive

Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive

  was:
Windows 10

Winutils 2.7.1: 
[https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]

Oracle JavaSE 8

SparkSQL 2.4.4

Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive


> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> Running SparkSQL unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:java}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Description: 
Running SparkSQL embedded unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

  was:
Running SparkSQL unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}


> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
> Set: winutils chmod -R 777 \Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> Running SparkSQL embedded unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:java}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Description: 
Running SparkSQL unit tests on Win10, using winutils.

Got warnings about 'hadoop chgrp'.

See environment info.
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

  was:
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}


> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> Running SparkSQL unit tests on Win10, using winutils.
> Got warnings about 'hadoop chgrp'.
> See environment info.
> {code:java}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Description: 
{code:java}
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'TEST\Domain users' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

  was:
{code}

-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
.-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
.-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}

Environment: 
Windows 10

Winutils 2.7.1: 
[https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]

Oracle JavaSE 8

SparkSQL 2.4.4

Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Winutils 2.7.1: 
> [https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1]
> Oracle JavaSE 8
> SparkSQL 2.4.4
> Using: -Dhive.exec.scratchdir=C:\Users\OSUser\hadoop\tmp\hive
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> {code:java}
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'TEST\Domain users' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Affects Version/s: (was: 2.3.0)
   2.4.4

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> {code}
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> .-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> .-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Labels: WIndows hive unit-test  (was: bulk-closed)

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: WIndows, hive, unit-test
>
> {code}
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> .-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> .-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guram Savinov updated SPARK-30701:
--
Component/s: (was: SparkR)

> SQL test running on Windows: hadoop chgrp warnings
> --
>
> Key: SPARK-30701
> URL: https://issues.apache.org/jira/browse/SPARK-30701
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Guram Savinov
>Assignee: Felix Cheung
>Priority: Major
>  Labels: bulk-closed
>
> {code}
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> .-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> .-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30701) SQL test running on Windows: hadoop chgrp warnings

2020-02-01 Thread Guram Savinov (Jira)
Guram Savinov created SPARK-30701:
-

 Summary: SQL test running on Windows: hadoop chgrp warnings
 Key: SPARK-30701
 URL: https://issues.apache.org/jira/browse/SPARK-30701
 Project: Spark
  Issue Type: Bug
  Components: SparkR, SQL
Affects Versions: 2.3.0
Reporter: Guram Savinov
Assignee: Felix Cheung


{code}

-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
.-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
.-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
-chgrp: 'APPVYR-WIN\None' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org