[jira] [Commented] (SPARK-33787) Add `purge` to `dropPartition` in `SupportsPartitionManagement`

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249544#comment-17249544
 ] 

Apache Spark commented on SPARK-33787:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30776

> Add `purge` to `dropPartition` in `SupportsPartitionManagement`
> ---
>
> Key: SPARK-33787
> URL: https://issues.apache.org/jira/browse/SPARK-33787
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add the `purge` parameter to the `dropPartition` in 
> `SupportsPartitionManagement` and to the `dropPartitions` in 
> `SupportsAtomicPartitionManagement`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33787) Add `purge` to `dropPartition` in `SupportsPartitionManagement`

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33787:


Assignee: (was: Apache Spark)

> Add `purge` to `dropPartition` in `SupportsPartitionManagement`
> ---
>
> Key: SPARK-33787
> URL: https://issues.apache.org/jira/browse/SPARK-33787
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add the `purge` parameter to the `dropPartition` in 
> `SupportsPartitionManagement` and to the `dropPartitions` in 
> `SupportsAtomicPartitionManagement`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28863) Add an AlreadyPlanned logical node that skips query planning

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249545#comment-17249545
 ] 

Apache Spark commented on SPARK-28863:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/30777

> Add an AlreadyPlanned logical node that skips query planning
> 
>
> Key: SPARK-28863
> URL: https://issues.apache.org/jira/browse/SPARK-28863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> With the DataSourceV2 write operations, we have a way to fall back to the V1 
> writer APIs using InsertableRelation.
> The gross part is that we're in physical land, but the InsertableRelation 
> takes a logical plan, so we have to pass the logical plans to these physical 
> nodes, and then potentially go through re-planning.
> A useful primitive could be specifying that a plan is ready for execution 
> through a logical node AlreadyPlanned. This would wrap a physical plan, and 
> then we can go straight to execution.
>  
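A rough, hypothetical sketch of such a node (illustrative only, not the actual Spark implementation), assuming the usual LeafNode/SparkPlan types:

{code:scala}
// Hypothetical sketch only: a logical leaf node that carries an already-built
// physical plan, so the planner can return it directly instead of re-planning.
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.execution.SparkPlan

case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
  // Expose the physical plan's output so the node fits into analysis.
  override def output: Seq[Attribute] = physicalPlan.output
}
{code}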



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33787) Add `purge` to `dropPartition` in `SupportsPartitionManagement`

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33787:


Assignee: Apache Spark

> Add `purge` to `dropPartition` in `SupportsPartitionManagement`
> ---
>
> Key: SPARK-33787
> URL: https://issues.apache.org/jira/browse/SPARK-33787
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add the `purge` parameter to the `dropPartition` in 
> `SupportsPartitionManagement` and to the `dropPartitions` in 
> `SupportsAtomicPartitionManagement`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33787) Add `purge` to `dropPartition` in `SupportsPartitionManagement`

2020-12-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33787:
--

 Summary: Add `purge` to `dropPartition` in 
`SupportsPartitionManagement`
 Key: SPARK-33787
 URL: https://issues.apache.org/jira/browse/SPARK-33787
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Add the `purge` parameter to the `dropPartition` in 
`SupportsPartitionManagement` and to the `dropPartitions` in 
`SupportsAtomicPartitionManagement`.
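A minimal sketch of what the extended interfaces could look like (illustrative only; the actual Spark connector interfaces and signatures may differ):

{code:scala}
// Hypothetical sketch, not the actual Spark connector API.
import org.apache.spark.sql.catalyst.InternalRow

trait SupportsPartitionManagement {
  // When `purge` is true, the dropped partition's data is deleted permanently
  // instead of being moved to the trash (behavior is catalog-dependent).
  def dropPartition(ident: InternalRow, purge: Boolean): Boolean
}

trait SupportsAtomicPartitionManagement extends SupportsPartitionManagement {
  // Atomically drop several partitions with the same `purge` semantics.
  def dropPartitions(idents: Array[InternalRow], purge: Boolean): Boolean
}
{code}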



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33778) Allow typesafe join for LeftSemi and LeftAnti

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33778:


Assignee: Apache Spark

> Allow typesafe join for LeftSemi and LeftAnti
> -
>
> Key: SPARK-33778
> URL: https://issues.apache.org/jira/browse/SPARK-33778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Venkata krishnan Sowrirajan
>Assignee: Apache Spark
>Priority: Major
>
> With the [SPARK-21333|https://issues.apache.org/jira/browse/SPARK-21333] change, 
> LeftSemi and LeftAnti no longer have a typesafe join API. It makes sense not to 
> support LeftSemi and LeftAnti as part of joinWith, since joinWith returns tuples 
> containing values from both datasets, which is not possible for these join types. 
> Nevertheless, it would be nice to have a separate join API, or an extension of 
> the existing API, that supports LeftSemi and LeftAnti and returns a Dataset.
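One possible shape for such an API, sketched only as an illustration (the {{semiJoinWith}} name and signature are hypothetical, not an existing Spark method):

{code:scala}
// Hypothetical sketch: a typed left-semi join helper built on the untyped join.
// LeftSemi only returns rows from the left side, so the result stays Dataset[T].
import org.apache.spark.sql.{Column, Dataset, Encoder}

def semiJoinWith[T: Encoder, U](left: Dataset[T], right: Dataset[U],
                                condition: Column): Dataset[T] =
  left.join(right, condition, "left_semi").as[T]

// Usage sketch (assuming people: Dataset[Person] and ids: Dataset[Id] exist):
//   val matched: Dataset[Person] = semiJoinWith(people, ids, people("id") === ids("id"))
{code}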



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33778) Allow typesafe join for LeftSemi and LeftAnti

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33778:


Assignee: (was: Apache Spark)

> Allow typesafe join for LeftSemi and LeftAnti
> -
>
> Key: SPARK-33778
> URL: https://issues.apache.org/jira/browse/SPARK-33778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> With the [SPARK-21333|https://issues.apache.org/jira/browse/SPARK-21333] change, 
> LeftSemi and LeftAnti no longer have a typesafe join API. It makes sense not to 
> support LeftSemi and LeftAnti as part of joinWith, since joinWith returns tuples 
> containing values from both datasets, which is not possible for these join types. 
> Nevertheless, it would be nice to have a separate join API, or an extension of 
> the existing API, that supports LeftSemi and LeftAnti and returns a Dataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33778) Allow typesafe join for LeftSemi and LeftAnti

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249509#comment-17249509
 ] 

Apache Spark commented on SPARK-33778:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30775

> Allow typesafe join for LeftSemi and LeftAnti
> -
>
> Key: SPARK-33778
> URL: https://issues.apache.org/jira/browse/SPARK-33778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> With the [SPARK-21333|https://issues.apache.org/jira/browse/SPARK-21333] change, 
> LeftSemi and LeftAnti no longer have a typesafe join API. It makes sense not to 
> support LeftSemi and LeftAnti as part of joinWith, since joinWith returns tuples 
> containing values from both datasets, which is not possible for these join types. 
> Nevertheless, it would be nice to have a separate join API, or an extension of 
> the existing API, that supports LeftSemi and LeftAnti and returns a Dataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33653) DSv2: REFRESH TABLE should recache the table itself

2020-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33653:
--
Fix Version/s: (was: 3.2.0)
   3.1.0

> DSv2: REFRESH TABLE should recache the table itself
> ---
>
> Key: SPARK-33653
> URL: https://issues.apache.org/jira/browse/SPARK-33653
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> As "CACHE TABLE" is supported in DSv2 now, we should also recache the table 
> itself in "REFRESH TABLE" command, to match the behavior in DSv1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33767) Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests

2020-12-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33767.
-
Fix Version/s: (was: 3.1.0)
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 30747
[https://github.com/apache/spark/pull/30747]

> Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests
> ---
>
> Key: SPARK-33767
> URL: https://issues.apache.org/jira/browse/SPARK-33767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Extract the ALTER TABLE .. DROP PARTITION tests to a common place so they can 
> be run for both V1 and V2 datasources. Some tests can be placed in V1- and 
> V2-specific test suites.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33785) Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework

2020-12-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33785.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30773
[https://github.com/apache/spark/pull/30773]

> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework
> --
>
> Key: SPARK-33785
> URL: https://issues.apache.org/jira/browse/SPARK-33785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.2.0
>
>
> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33785) Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework

2020-12-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33785:
---

Assignee: Terry Kim

> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework
> --
>
> Key: SPARK-33785
> URL: https://issues.apache.org/jira/browse/SPARK-33785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>
> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

2020-12-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249485#comment-17249485
 ] 

Hyukjin Kwon commented on SPARK-33782:
--

BTW, [~tgraves], I will likely take a look at this if no one else does, but it 
will happen a bit later, in the middle of Spark 3.2 development.

> Place spark.files, spark.jars and spark.files under the current working 
> directory on the driver in K8S cluster mode
> ---
>
> Key: SPARK-33782
> URL: https://issues.apache.org/jira/browse/SPARK-33782
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In YARN cluster mode, the passed files can be accessed in the current working 
> directory. This does not appear to be the case in Kubernetes cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python 
> dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33786) Cache's storage level is not respected when a table name is altered.

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33786:


Assignee: Apache Spark

> Cache's storage level is not respected when a table name is altered.
> 
>
> Key: SPARK-33786
> URL: https://issues.apache.org/jira/browse/SPARK-33786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> To repro:
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
> sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'")
> sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')")
> val oldStorageLevel = getStorageLevel("old")
> sql("ALTER TABLE old RENAME TO new")
> val newStorageLevel = getStorageLevel("new")
> assert(oldStorageLevel === newStorageLevel)
> {code}
> The assert fails:
> Expected :StorageLevel(disk, memory, deserialized, 1 replicas)
> Actual   :StorageLevel(memory, deserialized, 1 replicas)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33786) Cache's storage level is not respected when a table name is altered.

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33786:


Assignee: (was: Apache Spark)

> Cache's storage level is not respected when a table name is altered.
> 
>
> Key: SPARK-33786
> URL: https://issues.apache.org/jira/browse/SPARK-33786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> To repro:
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
> sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'")
> sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')")
> val oldStorageLevel = getStorageLevel("old")
> sql("ALTER TABLE old RENAME TO new")
> val newStorageLevel = getStorageLevel("new")
> assert(oldStorageLevel === newStorageLevel)
> {code}
> The assert fails:
> Expected :StorageLevel(disk, memory, deserialized, 1 replicas)
> Actual   :StorageLevel(memory, deserialized, 1 replicas)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33786) Cache's storage level is not respected when a table name is altered.

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249473#comment-17249473
 ] 

Apache Spark commented on SPARK-33786:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30774

> Cache's storage level is not respected when a table name is altered.
> 
>
> Key: SPARK-33786
> URL: https://issues.apache.org/jira/browse/SPARK-33786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> To repro:
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
> sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'")
> sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')")
> val oldStorageLevel = getStorageLevel("old")
> sql("ALTER TABLE old RENAME TO new")
> val newStorageLevel = getStorageLevel("new")
> assert(oldStorageLevel === newStorageLevel)
> {code}
> The assert fails:
> Expected :StorageLevel(disk, memory, deserialized, 1 replicas)
> Actual   :StorageLevel(memory, deserialized, 1 replicas)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33776) spark sql datasourceV2 support kafka table

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33776:
-
Fix Version/s: (was: 3.0.2)

> spark sql datasourceV2 support   kafka table 
> -
>
> Key: SPARK-33776
> URL: https://issues.apache.org/jira/browse/SPARK-33776
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: YIWEIPING
>Priority: Major
>
> DDL SQL like below:
>  
> {{CREATE Stream kafkatable}}
> {{USING org.apache.spark.sql.kafka OPTIONS (}}
> {{  topic "topic",}}
> {{  ...}}
> {{)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33776) spark sql datasourceV2 support kafka table

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33776:
-
Target Version/s:   (was: 3.0.2)

> spark sql datasourceV2 support   kafka table 
> -
>
> Key: SPARK-33776
> URL: https://issues.apache.org/jira/browse/SPARK-33776
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: YIWEIPING
>Priority: Major
> Fix For: 3.0.2
>
>
> DDL SQL like below:
>  
> {{CREATE Stream kafkatable}}
> {{USING org.apache.spark.sql.kafka OPTIONS (}}
> {{  topic "topic",}}
> {{  ...}}
> {{)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33786) Cache's storage level is not respected when a table name is altered.

2020-12-14 Thread Terry Kim (Jira)
Terry Kim created SPARK-33786:
-

 Summary: Cache's storage level is not respected when a table name 
is altered.
 Key: SPARK-33786
 URL: https://issues.apache.org/jira/browse/SPARK-33786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Terry Kim


To repro:
{code:java}
Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'")
sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')")
val oldStorageLevel = getStorageLevel("old")

sql("ALTER TABLE old RENAME TO new")
val newStorageLevel = getStorageLevel("new")
assert(oldStorageLevel === newStorageLevel)
{code}
The assert fails:
Expected :StorageLevel(disk, memory, deserialized, 1 replicas)
Actual   :StorageLevel(memory, deserialized, 1 replicas)
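The repro assumes a {{getStorageLevel}} helper that is not shown. One possible sketch, assuming the code runs inside a Spark SQL test suite where the internal CacheManager is accessible (illustrative only; it relies on non-public APIs):

{code:scala}
// Hypothetical helper assumed by the repro above (not part of the issue).
import org.apache.spark.storage.StorageLevel

def getStorageLevel(tableName: String): StorageLevel = {
  val cached = spark.sharedState.cacheManager.lookupCachedData(spark.table(tableName))
  // Read the storage level recorded for the cached in-memory relation.
  cached.get.cachedRepresentation.cacheBuilder.storageLevel
}
{code}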




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33784) Rename dataSourceRewriteRules and customDataSourceRewriteRules in BaseSessionStateBuilder

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33784:


Assignee: Anton Okolnychyi

> Rename dataSourceRewriteRules and customDataSourceRewriteRules in 
> BaseSessionStateBuilder
> -
>
> Key: SPARK-33784
> URL: https://issues.apache.org/jira/browse/SPARK-33784
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Anton Okolnychyi
>Priority: Blocker
>
> This is under discussion at 
> https://github.com/apache/spark/pull/30558#discussion_r533885837.
> We happened to add a rule extension point that is not specific to data source 
> rewrites (SPARK-33612), but we named it as if it were, and people agree it 
> should get a better name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33778) Allow typesafe join for LeftSemi and LeftAnti

2020-12-14 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249455#comment-17249455
 ] 

angerszhu commented on SPARK-33778:
---

Can I take this? I would like to work on it.

> Allow typesafe join for LeftSemi and LeftAnti
> -
>
> Key: SPARK-33778
> URL: https://issues.apache.org/jira/browse/SPARK-33778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> With [SPARK-21333|https://issues.apache.org/jira/browse/SPARK-21333] change, 
> LeftSemi and LeftAnti no longer has a typesafe join API. It makes sense to 
> not support LeftSemi and LeftAnti as part of joinWith as it returns tuples 
> which includes values from both the datasets which is not possible in the 
> above joins. Neverthless, it would be nice to have a separate join API or in 
> the existing API to support LeftSemi and LeftAnti which returns Dataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33780) YARN doesn't know about resource yarn.io/gpu

2020-12-14 Thread Bruno Faustino Amorim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Faustino Amorim resolved SPARK-33780.
---
Resolution: Not A Bug

To use GPUs with Spark on EMR, you need to apply the required configurations 
when creating the cluster. Documentation link: 
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html

> YARN doesn't know about resource yarn.io/gpu
> 
>
> Key: SPARK-33780
> URL: https://issues.apache.org/jira/browse/SPARK-33780
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.0.1
> Environment: Amazon EMR: emr-6.2.0
>  Spark Version: Spark 3.0.1
> Instance Type: g3.4xlarge
>  AMI Name: emr-6_2_0-image-builder-ami-hvm-x86_64 2020-11-01T00-56-10.917Z
> Spark Configs:
> {code:java}
> sc_conf = SparkConf() \
>  .set('spark.driver.resource.gpu.discoveryScript', 
> '/opt/spark/getGpusResources.sh') \
>  .set('spark.driver.resource.gpu.amount', '1') \
>  .set('spark.rapids.sql.enabled', 'ALL'){code}
>  
>Reporter: Bruno Faustino Amorim
>Priority: Trivial
>
> Error when executing Spark on GPU. The stack trace is below:
> {code:java}
> 20/12/14 18:39:41 WARN ResourceRequestHelper: YARN doesn't know about 
> resource yarn.io/gpu, your resource discovery has to handle properly 
> discovering and isolating the resource! Error: The resource manager 
> encountered a problem that should not occur under normal circumstances. 
> Please report this error to the Hadoop community by opening a JIRA ticket at 
> http://issues.apache.org/jira and including the following 
> information:20/12/14 18:39:41 WARN ResourceRequestHelper: YARN doesn't know 
> about resource yarn.io/gpu, your resource discovery has to handle properly 
> discovering and isolating the resource! Error: The resource manager 
> encountered a problem that should not occur under normal circumstances. 
> Please report this error to the Hadoop community by opening a JIRA ticket at 
> http://issues.apache.org/jira and including the following information:* 
> Resource type requested: yarn.io/gpu* Resource object:  vCores:1>* The stack trace for this exception: java.lang.Exception at 
> org.apache.hadoop.yarn.exceptions.ResourceNotFoundException.(ResourceNotFoundException.java:47)
>  at 
> org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:268)
>  at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.setResourceInformation(ResourcePBImpl.java:198)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ResourceRequestHelper$.$anonfun$setResourceRequests$4(ResourceRequestHelper.scala:183)
>  at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) at 
> org.apache.spark.deploy.yarn.ResourceRequestHelper$.setResourceRequests(ResourceRequestHelper.scala:170)
>  at 
> org.apache.spark.deploy.yarn.Client.createApplicationSubmissionContext(Client.scala:277)
>  at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:196) 
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
>  at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:201)
>  at org.apache.spark.SparkContext.(SparkContext.scala:555) at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) 
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:238) at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>  at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> After encountering this error, the resource manager is in an inconsistent 
> state. It is safe for the resource manager to be restarted as the error 
> encountered should be transitive. If high availability is enabled, failing 
> over to a standby resource manager is also safe.20/12/14 18:39:46 WARN 
> YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors 
> before the AM has registered!{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (SPARK-33733) PullOutNondeterministic should check and collect deterministic field

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249450#comment-17249450
 ] 

Apache Spark commented on SPARK-33733:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/30772

> PullOutNondeterministic should check and collect deterministic field
> 
>
> Key: SPARK-33733
> URL: https://issues.apache.org/jira/browse/SPARK-33733
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
> Fix For: 3.1.0
>
>
> The `deterministic` field is wider than `Nondeterministic`; we should keep the 
> same range between the pull-out rule and the check analysis.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33785) Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33785:


Assignee: (was: Apache Spark)

> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework
> --
>
> Key: SPARK-33785
> URL: https://issues.apache.org/jira/browse/SPARK-33785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33785) Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33785:


Assignee: Apache Spark

> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework
> --
>
> Key: SPARK-33785
> URL: https://issues.apache.org/jira/browse/SPARK-33785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33785) Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249448#comment-17249448
 ] 

Apache Spark commented on SPARK-33785:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30773

> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework
> --
>
> Key: SPARK-33785
> URL: https://issues.apache.org/jira/browse/SPARK-33785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33733) PullOutNondeterministic should check and collect deterministic field

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249447#comment-17249447
 ] 

Apache Spark commented on SPARK-33733:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/30772

> PullOutNondeterministic should check and collect deterministic field
> 
>
> Key: SPARK-33733
> URL: https://issues.apache.org/jira/browse/SPARK-33733
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
> Fix For: 3.1.0
>
>
> The `deterministic` field is wider than `Nondeterministic`; we should keep the 
> same range between the pull-out rule and the check analysis.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33733) PullOutNondeterministic should check and collect deterministic field

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249446#comment-17249446
 ] 

Apache Spark commented on SPARK-33733:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/30771

> PullOutNondeterministic should check and collect deterministic field
> 
>
> Key: SPARK-33733
> URL: https://issues.apache.org/jira/browse/SPARK-33733
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
> Fix For: 3.1.0
>
>
> The `deterministic` field is wider than `Nondeterministic`; we should keep the 
> same range between the pull-out rule and the check analysis.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31960:
-
Labels: release-notes  (was: )

> Only populate Hadoop classpath for no-hadoop build
> --
>
> Key: SPARK-31960
> URL: https://issues.apache.org/jira/browse/SPARK-31960
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>  Labels: release-notes
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33785) Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework

2020-12-14 Thread Terry Kim (Jira)
Terry Kim created SPARK-33785:
-

 Summary: Migrate ALTER TABLE ... RECOVER PARTITIONS to new 
resolution framework
 Key: SPARK-33785
 URL: https://issues.apache.org/jira/browse/SPARK-33785
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Terry Kim


Migrate ALTER TABLE ... RECOVER PARTITIONS to new resolution framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33784) Rename dataSourceRewriteRules and customDataSourceRewriteRules in BaseSessionStateBuilder

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33784:
-
Priority: Blocker  (was: Major)

> Rename dataSourceRewriteRules and customDataSourceRewriteRules in 
> BaseSessionStateBuilder
> -
>
> Key: SPARK-33784
> URL: https://issues.apache.org/jira/browse/SPARK-33784
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> This is under discussion at 
> https://github.com/apache/spark/pull/30558#discussion_r533885837.
> We happened to add a rule extension point that is not specific to data source 
> rewrites (SPARK-33612), but we named it as if it were, and people agree it 
> should get a better name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33784) Rename dataSourceRewriteRules and customDataSourceRewriteRules in BaseSessionStateBuilder

2020-12-14 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33784:


 Summary: Rename dataSourceRewriteRules and 
customDataSourceRewriteRules in BaseSessionStateBuilder
 Key: SPARK-33784
 URL: https://issues.apache.org/jira/browse/SPARK-33784
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


This is under discussion at 
https://github.com/apache/spark/pull/30558#discussion_r533885837.

We happened to add a rule extension point that is not specific to data source 
rewrites (SPARK-33612), but we named it as if it were, and people agree it should 
get a better name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33783) Unload State Store Provider after configured keep alive time

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33783:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Unload State Store Provider after configured keep alive time
> 
>
> Key: SPARK-33783
> URL: https://issues.apache.org/jira/browse/SPARK-33783
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark unloads an inactive state store provider in a maintenance 
> task which runs periodically. Because the maintenance task is asynchronous, a 
> state store provider might be unloaded right after it becomes inactive. This 
> is inefficient, as the state store provider could be reused in the next 
> batches, and we should be able to have more control over the unloading 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33783) Unload State Store Provider after configured keep alive time

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249425#comment-17249425
 ] 

Apache Spark commented on SPARK-33783:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30770

> Unload State Store Provider after configured keep alive time
> 
>
> Key: SPARK-33783
> URL: https://issues.apache.org/jira/browse/SPARK-33783
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark unloads an inactive state store provider in a maintenance 
> task which runs periodically. Because the maintenance task is asynchronous, a 
> state store provider might be unloaded right after it becomes inactive. This 
> is inefficient, as the state store provider could be reused in the next 
> batches, and we should be able to have more control over the unloading 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33783) Unload State Store Provider after configured keep alive time

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33783:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Unload State Store Provider after configured keep alive time
> 
>
> Key: SPARK-33783
> URL: https://issues.apache.org/jira/browse/SPARK-33783
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Currently Spark unloads an inactive state store provider in a maintenance 
> task which runs periodically. Because the maintenance task is asynchronous, a 
> state store provider might be unloaded right after it becomes inactive. This 
> is inefficient, as the state store provider could be reused in the next 
> batches, and we should be able to have more control over the unloading 
> behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33783) Unload State Store Provider after configured keep alive time

2020-12-14 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33783:
---

 Summary: Unload State Store Provider after configured keep alive 
time
 Key: SPARK-33783
 URL: https://issues.apache.org/jira/browse/SPARK-33783
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Currently Spark unloads an inactive state store provider in a maintenance task 
which runs periodically. Because the maintenance task is asynchronous, a state 
store provider might be unloaded right after it becomes inactive. This is 
inefficient, as the state store provider could be reused in the next batches, 
and we should be able to have more control over the unloading behavior.
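A rough sketch of the keep-alive idea described above (illustrative only; names and structure are hypothetical, not the actual implementation):

{code:scala}
// Hypothetical sketch: track when each provider was last active and let the
// periodic maintenance task unload only providers idle longer than keepAliveMs.
import scala.collection.mutable

class ProviderKeepAliveTracker(keepAliveMs: Long) {
  private val lastActive = mutable.Map.empty[String, Long]

  // Record activity for a provider (e.g. at the end of a batch that used it).
  def touch(providerId: String): Unit =
    lastActive(providerId) = System.currentTimeMillis()

  // Candidates for unloading, as seen by the maintenance task.
  def providersToUnload(now: Long = System.currentTimeMillis()): Seq[String] =
    lastActive.collect { case (id, ts) if now - ts > keepAliveMs => id }.toSeq
}
{code}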



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33653) DSv2: REFRESH TABLE should recache the table itself

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249416#comment-17249416
 ] 

Apache Spark commented on SPARK-33653:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30769

> DSv2: REFRESH TABLE should recache the table itself
> ---
>
> Key: SPARK-33653
> URL: https://issues.apache.org/jira/browse/SPARK-33653
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> As "CACHE TABLE" is supported in DSv2 now, we should also recache the table 
> itself in "REFRESH TABLE" command, to match the behavior in DSv1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33748) Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33748:


Assignee: Hyukjin Kwon

> Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables
> --
>
> Key: SPARK-33748
> URL: https://issues.apache.org/jira/browse/SPARK-33748
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> See [https://github.com/apache/spark/pull/21092#discussion_r540240095.]
> We should respect PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON like we do in all 
> other places.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33748) Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33748.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30735
[https://github.com/apache/spark/pull/30735]

> Support PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables
> --
>
> Key: SPARK-33748
> URL: https://issues.apache.org/jira/browse/SPARK-33748
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> See [https://github.com/apache/spark/pull/21092#discussion_r540240095.]
> We should respect PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON like we do in all 
> other places.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33653) DSv2: REFRESH TABLE should recache the table itself

2020-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33653:
-

Assignee: Chao Sun

> DSv2: REFRESH TABLE should recache the table itself
> ---
>
> Key: SPARK-33653
> URL: https://issues.apache.org/jira/browse/SPARK-33653
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> As "CACHE TABLE" is supported in DSv2 now, we should also recache the table 
> itself in "REFRESH TABLE" command, to match the behavior in DSv1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33653) DSv2: REFRESH TABLE should recache the table itself

2020-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33653.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30742
[https://github.com/apache/spark/pull/30742]

> DSv2: REFRESH TABLE should recache the table itself
> ---
>
> Key: SPARK-33653
> URL: https://issues.apache.org/jira/browse/SPARK-33653
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> As "CACHE TABLE" is supported in DSv2 now, we should also recache the table 
> itself in "REFRESH TABLE" command, to match the behavior in DSv1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33777.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30764
[https://github.com/apache/spark/pull/30764]

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The V1 SHOW PARTITIONS command sorts its results. Both V1 implementations, the 
> in-memory catalog and the Hive catalog (according to the Hive docs: 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions]),
>  perform sorting. V2 should have the same behavior.
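A minimal check of the expected behavior (illustrative only; assumes a partitioned table {{t}} exists):

{code:scala}
// Illustrative expectation only: SHOW PARTITIONS should return its single
// string column already in sorted order, matching the V1 behavior.
val parts = spark.sql("SHOW PARTITIONS t").collect().map(_.getString(0))
assert(parts.sameElements(parts.sorted))
{code}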



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33777:
-

Assignee: Maxim Gekk

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The V1 SHOW PARTITIONS command sorts its results. Both V1 implementations, the 
> in-memory catalog and the Hive catalog (according to the Hive docs: 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions]),
>  perform sorting. V2 should have the same behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33771) Fix Invalid value for HourOfAmPm when testing on JDK 14

2020-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33771:
-

Assignee: Yuming Wang

> Fix Invalid value for HourOfAmPm when testing on JDK 14
> ---
>
> Key: SPARK-33771
> URL: https://issues.apache.org/jira/browse/SPARK-33771
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> - parsing hour with various patterns *** FAILED *** 
> java.time.format.DateTimeParseException: Text '2009-12-12 12 am' could not be 
> parsed: Invalid value for HourOfAmPm (valid values 0 - 11): 12



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33771) Fix Invalid value for HourOfAmPm when testing on JDK 14

2020-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33771.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30754
[https://github.com/apache/spark/pull/30754]

> Fix Invalid value for HourOfAmPm when testing on JDK 14
> ---
>
> Key: SPARK-33771
> URL: https://issues.apache.org/jira/browse/SPARK-33771
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> - parsing hour with various patterns *** FAILED *** 
> java.time.format.DateTimeParseException: Text '2009-12-12 12 am' could not be 
> parsed: Invalid value for HourOfAmPm (valid values 0 - 11): 12



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33261) Allow people to extend the pod feature steps

2020-12-14 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-33261.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

> Allow people to extend the pod feature steps
> 
>
> Key: SPARK-33261
> URL: https://issues.apache.org/jira/browse/SPARK-33261
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.2.0
>
>
> While we allow people to specify pod templates, some deployments could 
> benefit from being able to add a feature step.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249274#comment-17249274
 ] 

Apache Spark commented on SPARK-32922:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/30768

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data 
> fetches. The reducers should also be able to fall back to fetching the 
> original unmerged blocks if they encounter failures when fetching the merged 
> shuffle partitions.
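The fallback idea, reduced to a conceptual sketch (illustrative only; the real ShuffleBlockFetcherIterator logic is considerably more involved):

{code:scala}
// Conceptual sketch: try the merged shuffle partition first, and fall back to
// the original, unmerged blocks if the merged fetch fails. IOException is a
// stand-in for whatever fetch failure is actually surfaced.
import java.io.IOException

def fetchWithFallback[A](fetchMerged: () => A, fetchOriginalBlocks: () => A): A =
  try fetchMerged()
  catch { case _: IOException => fetchOriginalBlocks() }
{code}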



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32922:


Assignee: (was: Apache Spark)

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data 
> fetches. The reducers should also be able to fall back to fetching the 
> original unmerged blocks if they encounter failures when fetching the merged 
> shuffle partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32922:


Assignee: Apache Spark

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Apache Spark
>Priority: Major
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data 
> fetch. The reducers should also be able to fall back to fetching the original 
> unmerged blocks if they encounter failures when fetching the merged shuffle 
> partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249272#comment-17249272
 ] 

Apache Spark commented on SPARK-32922:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/30768

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data 
> fetch. The reducers should also be able to fall back to fetching the original 
> unmerged blocks if they encounter failures when fetching the merged shuffle 
> partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33782:
-
Summary: Place spark.files, spark.jars and spark.files under the current 
working directory on the driver in K8S cluster mode  (was: Place spark.files, 
spark.jars and spark.files under the current working directory on the driver in 
K8S)

> Place spark.files, spark.jars and spark.files under the current working 
> directory on the driver in K8S cluster mode
> ---
>
> Key: SPARK-33782
> URL: https://issues.apache.org/jira/browse/SPARK-33782
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In YARN cluster mode, the passed files can be accessed in the current 
> working directory. This does not appear to be the case in Kubernetes 
> cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python 
> dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S

2020-12-14 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33782:


 Summary: Place spark.files, spark.jars and spark.files under the 
current working directory on the driver in K8S
 Key: SPARK-33782
 URL: https://issues.apache.org/jira/browse/SPARK-33782
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


In YARN cluster mode, the passed files can be accessed in the current 
working directory. This does not appear to be the case in Kubernetes cluster mode.

By doing this, users can, for example, leverage PEX to manage Python 
dependencies in Apache Spark:

{code}
pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
{code}

See also https://github.com/apache/spark/pull/30735/files#r540935585.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33779) DataSource V2: API to request distribution and ordering on write

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249254#comment-17249254
 ] 

Apache Spark commented on SPARK-33779:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/30767

> DataSource V2: API to request distribution and ordering on write
> 
>
> Key: SPARK-33779
> URL: https://issues.apache.org/jira/browse/SPARK-33779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
> Fix For: 3.2.0
>
>
> We need to have proper APIs for requesting a specific distribution and 
> ordering on writes for data sources that implement the V2 interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33779) DataSource V2: API to request distribution and ordering on write

2020-12-14 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-33779.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Merged PR #30706. Thanks [~aokolnychyi]!

> DataSource V2: API to request distribution and ordering on write
> 
>
> Key: SPARK-33779
> URL: https://issues.apache.org/jira/browse/SPARK-33779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
> Fix For: 3.2.0
>
>
> We need to have proper APIs for requesting a specific distribution and 
> ordering on writes for data sources that implement the V2 interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33780) YARN doesn't know about resource yarn.io/gpu

2020-12-14 Thread Bruno Faustino Amorim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Faustino Amorim updated SPARK-33780:
--
Description: 
Error when executing Spark on GPU. The stack trace is below:
{code:java}
20/12/14 18:39:41 WARN ResourceRequestHelper: YARN doesn't know about resource 
yarn.io/gpu, your resource discovery has to handle properly discovering and 
isolating the resource! Error: The resource manager encountered a problem that 
should not occur under normal circumstances. Please report this error to the 
Hadoop community by opening a JIRA ticket at http://issues.apache.org/jira and 
including the following information:20/12/14 18:39:41 WARN 
ResourceRequestHelper: YARN doesn't know about resource yarn.io/gpu, your 
resource discovery has to handle properly discovering and isolating the 
resource! Error: The resource manager encountered a problem that should not 
occur under normal circumstances. Please report this error to the Hadoop 
community by opening a JIRA ticket at http://issues.apache.org/jira and 
including the following information:* Resource type requested: yarn.io/gpu* 
Resource object: * The stack trace for this exception: 
java.lang.Exception at 
org.apache.hadoop.yarn.exceptions.ResourceNotFoundException.(ResourceNotFoundException.java:47)
 at 
org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:268)
 at 
org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.setResourceInformation(ResourcePBImpl.java:198)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.deploy.yarn.ResourceRequestHelper$.$anonfun$setResourceRequests$4(ResourceRequestHelper.scala:183)
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) at 
org.apache.spark.deploy.yarn.ResourceRequestHelper$.setResourceRequests(ResourceRequestHelper.scala:170)
 at 
org.apache.spark.deploy.yarn.Client.createApplicationSubmissionContext(Client.scala:277)
 at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:196) at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
 at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:201) 
at org.apache.spark.SparkContext.(SparkContext.scala:555) at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
py4j.Gateway.invoke(Gateway.java:238) at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) 
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)
After encountering this error, the resource manager is in an inconsistent 
state. It is safe for the resource manager to be restarted as the error 
encountered should be transitive. If high availability is enabled, failing over 
to a standby resource manager is also safe.20/12/14 18:39:46 WARN 
YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors 
before the AM has registered!{code}
 
 

  was:
Error when executing Spark on GPU. The stack trace is below:


{code:java}
20/12/14 18:39:41 WARN ResourceRequestHelper: YARN doesn't know about resource 
yarn.io/gpu, your resource discovery has to handle properly discovering and 
isolating the resource! Error: The resource manager encountered a problem that 
should not occur under normal circumstances. Please report this error to the 
Hadoop community by opening a JIRA ticket at http://issues.apache.org/jira and 
including the following information:20/12/14 18:39:41 WARN 
ResourceRequestHelper: YARN doesn't know about resource yarn.io/gpu, your 
resource discovery has to handle properly discovering and isolating the 
resource! Error: The resource manager encountered a problem that should not 
occur under normal circumstances. Please report this error to the Hadoop 
community by opening a JIRA ticket at http://issues.apache.org/jira and 
including the following information:* Resource type requested: yarn.io/gpu* 
Resource object: * The stack trace for this exception: 
java.lang.Exception at 
org.apache.hadoop.yarn.exceptions.ResourceNotFoundException.(ResourceNotFoundEx

[jira] [Commented] (SPARK-33273) Fix Flaky Test: ThriftServerQueryTestSuite. subquery_scalar_subquery_scalar_subquery_select_sql

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249226#comment-17249226
 ] 

Apache Spark commented on SPARK-33273:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/30766

> Fix Flaky Test: ThriftServerQueryTestSuite. 
> subquery_scalar_subquery_scalar_subquery_select_sql
> ---
>
> Key: SPARK-33273
> URL: https://issues.apache.org/jira/browse/SPARK-33273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
> Attachments: failures.png
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130369/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/
> {code}
> [info] - subquery/scalar-subquery/scalar-subquery-select.sql *** FAILED *** 
> (3 seconds, 877 milliseconds)
> [info]   Expected "[1]0   2017-05-04 01:01:0...", but got "[]0
> 2017-05-04 01:01:0..." Result did not match for query #3
> [info]   SELECT (SELECT min(t3d) FROM t3) min_t3d,
> [info]  (SELECT max(t2h) FROM t2) max_t2h
> [info]   FROM   t1
> [info]   WHERE  t1a = 'val1c' (ThriftServerQueryTestSuite.scala:197)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33273) Fix Flaky Test: ThriftServerQueryTestSuite. subquery_scalar_subquery_scalar_subquery_select_sql

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249225#comment-17249225
 ] 

Apache Spark commented on SPARK-33273:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/30766

> Fix Flaky Test: ThriftServerQueryTestSuite. 
> subquery_scalar_subquery_scalar_subquery_select_sql
> ---
>
> Key: SPARK-33273
> URL: https://issues.apache.org/jira/browse/SPARK-33273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
> Attachments: failures.png
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130369/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/
> {code}
> [info] - subquery/scalar-subquery/scalar-subquery-select.sql *** FAILED *** 
> (3 seconds, 877 milliseconds)
> [info]   Expected "[1]0   2017-05-04 01:01:0...", but got "[]0
> 2017-05-04 01:01:0..." Result did not match for query #3
> [info]   SELECT (SELECT min(t3d) FROM t3) min_t3d,
> [info]  (SELECT max(t2h) FROM t2) max_t2h
> [info]   FROM   t1
> [info]   WHERE  t1a = 'val1c' (ThriftServerQueryTestSuite.scala:197)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33781) Improve caching of MergeStatus on the executor side to save memory

2020-12-14 Thread Min Shen (Jira)
Min Shen created SPARK-33781:


 Summary: Improve caching of MergeStatus on the executor side to 
save memory
 Key: SPARK-33781
 URL: https://issues.apache.org/jira/browse/SPARK-33781
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Min Shen


In MapOutputTrackerWorker, the MapStatus or MergeStatus array retrieved from the 
driver for a given shuffle is cached in memory so that all tasks doing shuffle 
fetch for that shuffle can reuse the cached metadata.

However, unlike the MapStatus array, where each task needs to access every single 
instance in the array, a task typically needs only one or a few MergeStatus 
objects, depending on which shuffle partitions it is processing.

For large shuffles with tens or hundreds of thousands of shuffle partitions, 
caching the entire deserialized and decompressed MergeStatus array on the executor 
side is a huge waste of memory when perhaps only 0.1% of the entries will ever be 
used by the tasks running on that executor.

We could improve this by caching the serialized and compressed bytes of the 
MergeStatus array instead, and deserializing only the MergeStatus objects that are 
actually needed on the executor side. In addition to saving memory, this also 
helps reduce GC pressure on the executor side.
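
A simplified, self-contained sketch of the proposal, assuming illustrative names rather than Spark's actual classes: cache the compressed bytes once and deserialize individual entries on demand.

{code:scala}
import scala.collection.mutable

// Illustrative sketch only (not MapOutputTrackerWorker code): keep the serialized
// bytes for the whole array and materialize entries lazily, per partition.
class MergeStatusCache[T](serializedBytes: Array[Byte],
                          deserializeOne: (Array[Byte], Int) => T) {
  private val deserialized = mutable.Map.empty[Int, T]

  // Only the entries that tasks actually request are kept in deserialized form.
  def get(partitionId: Int): T = synchronized {
    deserialized.getOrElseUpdate(partitionId, deserializeOne(serializedBytes, partitionId))
  }
}
{code}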



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33780) YARN doesn't know about resource yarn.io/gpu

2020-12-14 Thread Bruno Faustino Amorim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Faustino Amorim updated SPARK-33780:
--
Environment: 
Amazon EMR: emr-6.2.0
 Spark Version: Spark 3.0.1

Instance Type: g3.4xlarge
 AMI Name: emr-6_2_0-image-builder-ami-hvm-x86_64 2020-11-01T00-56-10.917Z

Spark Configs:
{code:java}
sc_conf = SparkConf() \
 .set('spark.driver.resource.gpu.discoveryScript', 
'/opt/spark/getGpusResources.sh') \
 .set('spark.driver.resource.gpu.amount', '1') \
 .set('spark.rapids.sql.enabled', 'ALL'){code}
 

  was:
Amazon EMR: emr-6.2.0
Spark Version: Spark 3.0.1

Instance Type: g3.4xlarge
AMI Name: emr-6_2_0-image-builder-ami-hvm-x86_64 2020-11-01T00-56-10.917Z

Spark Configs:
{code:java}
sc_conf = SparkConf() \
 .set('spark.driver.resource.gpu.discoveryScript', 
'/opt/spark/getGpusResources.sh') \
 .set('spark.driver.resource.gpu.amount', '1') \
 .set('spark.rapids.sql.enabled', 'ALL') \{code}


> YARN doesn't know about resource yarn.io/gpu
> 
>
> Key: SPARK-33780
> URL: https://issues.apache.org/jira/browse/SPARK-33780
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 3.0.1
> Environment: Amazon EMR: emr-6.2.0
>  Spark Version: Spark 3.0.1
> Instance Type: g3.4xlarge
>  AMI Name: emr-6_2_0-image-builder-ami-hvm-x86_64 2020-11-01T00-56-10.917Z
> Spark Configs:
> {code:java}
> sc_conf = SparkConf() \
>  .set('spark.driver.resource.gpu.discoveryScript', 
> '/opt/spark/getGpusResources.sh') \
>  .set('spark.driver.resource.gpu.amount', '1') \
>  .set('spark.rapids.sql.enabled', 'ALL'){code}
>  
>Reporter: Bruno Faustino Amorim
>Priority: Trivial
>
> Error when executing Spark on GPU. The stack trace is below:
> {code:java}
> 20/12/14 18:39:41 WARN ResourceRequestHelper: YARN doesn't know about 
> resource yarn.io/gpu, your resource discovery has to handle properly 
> discovering and isolating the resource! Error: The resource manager 
> encountered a problem that should not occur under normal circumstances. 
> Please report this error to the Hadoop community by opening a JIRA ticket at 
> http://issues.apache.org/jira and including the following 
> information:20/12/14 18:39:41 WARN ResourceRequestHelper: YARN doesn't know 
> about resource yarn.io/gpu, your resource discovery has to handle properly 
> discovering and isolating the resource! Error: The resource manager 
> encountered a problem that should not occur under normal circumstances. 
> Please report this error to the Hadoop community by opening a JIRA ticket at 
> http://issues.apache.org/jira and including the following information:* 
> Resource type requested: yarn.io/gpu* Resource object:  vCores:1>* The stack trace for this exception: java.lang.Exception at 
> org.apache.hadoop.yarn.exceptions.ResourceNotFoundException.(ResourceNotFoundException.java:47)
>  at 
> org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:268)
>  at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.setResourceInformation(ResourcePBImpl.java:198)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ResourceRequestHelper$.$anonfun$setResourceRequests$4(ResourceRequestHelper.scala:183)
>  at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) at 
> org.apache.spark.deploy.yarn.ResourceRequestHelper$.setResourceRequests(ResourceRequestHelper.scala:170)
>  at 
> org.apache.spark.deploy.yarn.Client.createApplicationSubmissionContext(Client.scala:277)
>  at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:196) 
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
>  at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:201)
>  at org.apache.spark.SparkContext.(SparkContext.scala:555) at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) 
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:238) at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>  at py4j.

[jira] [Created] (SPARK-33780) YARN doesn't know about resource yarn.io/gpu

2020-12-14 Thread Bruno Faustino Amorim (Jira)
Bruno Faustino Amorim created SPARK-33780:
-

 Summary: YARN doesn't know about resource yarn.io/gpu
 Key: SPARK-33780
 URL: https://issues.apache.org/jira/browse/SPARK-33780
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 3.0.1
 Environment: Amazon EMR: emr-6.2.0
Spark Version: Spark 3.0.1

Instance Type: g3.4xlarge
AMI Name: emr-6_2_0-image-builder-ami-hvm-x86_64 2020-11-01T00-56-10.917Z

Spark Configs:
{code:java}
sc_conf = SparkConf() \
 .set('spark.driver.resource.gpu.discoveryScript', 
'/opt/spark/getGpusResources.sh') \
 .set('spark.driver.resource.gpu.amount', '1') \
 .set('spark.rapids.sql.enabled', 'ALL') \{code}
Reporter: Bruno Faustino Amorim


Error when executing Spark on GPU. The stack trace is below:


{code:java}
20/12/14 18:39:41 WARN ResourceRequestHelper: YARN doesn't know about resource 
yarn.io/gpu, your resource discovery has to handle properly discovering and 
isolating the resource! Error: The resource manager encountered a problem that 
should not occur under normal circumstances. Please report this error to the 
Hadoop community by opening a JIRA ticket at http://issues.apache.org/jira and 
including the following information:20/12/14 18:39:41 WARN 
ResourceRequestHelper: YARN doesn't know about resource yarn.io/gpu, your 
resource discovery has to handle properly discovering and isolating the 
resource! Error: The resource manager encountered a problem that should not 
occur under normal circumstances. Please report this error to the Hadoop 
community by opening a JIRA ticket at http://issues.apache.org/jira and 
including the following information:* Resource type requested: yarn.io/gpu* 
Resource object: * The stack trace for this exception: 
java.lang.Exception at 
org.apache.hadoop.yarn.exceptions.ResourceNotFoundException.(ResourceNotFoundException.java:47)
 at 
org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:268)
 at 
org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.setResourceInformation(ResourcePBImpl.java:198)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.deploy.yarn.ResourceRequestHelper$.$anonfun$setResourceRequests$4(ResourceRequestHelper.scala:183)
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) at 
org.apache.spark.deploy.yarn.ResourceRequestHelper$.setResourceRequests(ResourceRequestHelper.scala:170)
 at 
org.apache.spark.deploy.yarn.Client.createApplicationSubmissionContext(Client.scala:277)
 at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:196) at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
 at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:201) 
at org.apache.spark.SparkContext.(SparkContext.scala:555) at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
py4j.Gateway.invoke(Gateway.java:238) at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) 
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)
After encountering this error, the resource manager is in an inconsistent 
state. It is safe for the resource manager to be restarted as the error 
encountered should be transitive. If high availability is enabled, failing over 
to a standby resource manager is also safe.20/12/14 18:39:46 WARN 
YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors 
before the AM has registered!{code}

This exception happened when starting Spark on GPU.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-33734) Spark Core ::Spark core versions upto 3.0.1 using interdependency on Jackson-core-asl version 1.9.13, which is having security issues reported.

2020-12-14 Thread Aparna (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aparna reopened SPARK-33734:


> Spark Core ::Spark core versions upto 3.0.1 using interdependency on 
> Jackson-core-asl version 1.9.13, which is having security issues reported. 
> 
>
> Key: SPARK-33734
> URL: https://issues.apache.org/jira/browse/SPARK-33734
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Aparna
>Priority: Major
>
> spark-core, up to the latest 3.0.1, depends on 
> [org.apache.avro|https://mvnrepository.com/artifact/org.apache.avro] version 
> 1.8.2, which pulls in 
> [jackson-core-asl|https://mvnrepository.com/artifact/org.codehaus.jackson/jackson-core-asl]
>  version 1.9.13, a version with reported security issues.
> Please fix this and publish a new release.
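
As a possible downstream workaround until a fixed release is available (a sketch only, assuming your jobs do not rely on the legacy Jackson codec through Avro), the transitive artifact can be excluded in an sbt build:

{code:scala}
// build.sbt sketch: exclude the flagged transitive dependency from spark-core.
// Note: if any code path still needs the legacy org.codehaus.jackson classes
// (e.g. some Avro usage), this exclusion will break it, so verify in a test run.
libraryDependencies += ("org.apache.spark" %% "spark-core" % "3.0.1")
  .exclude("org.codehaus.jackson", "jackson-core-asl")
{code}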



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33734) Spark Core ::Spark core versions upto 3.0.1 using interdependency on Jackson-core-asl version 1.9.13, which is having security issues reported.

2020-12-14 Thread Aparna (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249201#comment-17249201
 ] 

Aparna commented on SPARK-33734:


Hello [~hyukjin.kwon] 
It has been captured from BlackDuck scanning.

*Please find the details at the link below:*

[https://www.openhub.net/p/jackson/security]

CVE-2019-10172

CVE-2017-7525

CVE-2017-15095


Let me know if that would work.

 

> Spark Core ::Spark core versions upto 3.0.1 using interdependency on 
> Jackson-core-asl version 1.9.13, which is having security issues reported. 
> 
>
> Key: SPARK-33734
> URL: https://issues.apache.org/jira/browse/SPARK-33734
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Aparna
>Priority: Major
>
> spark-core, up to the latest 3.0.1, depends on 
> [org.apache.avro|https://mvnrepository.com/artifact/org.apache.avro] version 
> 1.8.2, which pulls in 
> [jackson-core-asl|https://mvnrepository.com/artifact/org.codehaus.jackson/jackson-core-asl]
>  version 1.9.13, a version with reported security issues.
> Please fix this and publish a new release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33779) DataSource V2: API to request distribution and ordering on write

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33779:


Assignee: (was: Apache Spark)

> DataSource V2: API to request distribution and ordering on write
> 
>
> Key: SPARK-33779
> URL: https://issues.apache.org/jira/browse/SPARK-33779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We need to have proper APIs for requesting a specific distribution and 
> ordering on writes for data sources that implement the V2 interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33779) DataSource V2: API to request distribution and ordering on write

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249182#comment-17249182
 ] 

Apache Spark commented on SPARK-33779:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/30706

> DataSource V2: API to request distribution and ordering on write
> 
>
> Key: SPARK-33779
> URL: https://issues.apache.org/jira/browse/SPARK-33779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We need to have proper APIs for requesting a specific distribution and 
> ordering on writes for data sources that implement the V2 interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33779) DataSource V2: API to request distribution and ordering on write

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33779:


Assignee: Apache Spark

> DataSource V2: API to request distribution and ordering on write
> 
>
> Key: SPARK-33779
> URL: https://issues.apache.org/jira/browse/SPARK-33779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Assignee: Apache Spark
>Priority: Major
>
> We need to have proper APIs for requesting a specific distribution and 
> ordering on writes for data sources that implement the V2 interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33779) DataSource V2: API to request distribution and ordering on write

2020-12-14 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249176#comment-17249176
 ] 

Anton Okolnychyi commented on SPARK-33779:
--

This is a part of the work in SPARK-23889.

> DataSource V2: API to request distribution and ordering on write
> 
>
> Key: SPARK-33779
> URL: https://issues.apache.org/jira/browse/SPARK-33779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> We need to have proper APIs for requesting a specific distribution and 
> ordering on writes for data sources that implement the V2 interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33273) Fix Flaky Test: ThriftServerQueryTestSuite. subquery_scalar_subquery_scalar_subquery_select_sql

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33273:


Assignee: Apache Spark

> Fix Flaky Test: ThriftServerQueryTestSuite. 
> subquery_scalar_subquery_scalar_subquery_select_sql
> ---
>
> Key: SPARK-33273
> URL: https://issues.apache.org/jira/browse/SPARK-33273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
> Attachments: failures.png
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130369/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/
> {code}
> [info] - subquery/scalar-subquery/scalar-subquery-select.sql *** FAILED *** 
> (3 seconds, 877 milliseconds)
> [info]   Expected "[1]0   2017-05-04 01:01:0...", but got "[]0
> 2017-05-04 01:01:0..." Result did not match for query #3
> [info]   SELECT (SELECT min(t3d) FROM t3) min_t3d,
> [info]  (SELECT max(t2h) FROM t2) max_t2h
> [info]   FROM   t1
> [info]   WHERE  t1a = 'val1c' (ThriftServerQueryTestSuite.scala:197)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33273) Fix Flaky Test: ThriftServerQueryTestSuite. subquery_scalar_subquery_scalar_subquery_select_sql

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33273:


Assignee: (was: Apache Spark)

> Fix Flaky Test: ThriftServerQueryTestSuite. 
> subquery_scalar_subquery_scalar_subquery_select_sql
> ---
>
> Key: SPARK-33273
> URL: https://issues.apache.org/jira/browse/SPARK-33273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
> Attachments: failures.png
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130369/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/
> {code}
> [info] - subquery/scalar-subquery/scalar-subquery-select.sql *** FAILED *** 
> (3 seconds, 877 milliseconds)
> [info]   Expected "[1]0   2017-05-04 01:01:0...", but got "[]0
> 2017-05-04 01:01:0..." Result did not match for query #3
> [info]   SELECT (SELECT min(t3d) FROM t3) min_t3d,
> [info]  (SELECT max(t2h) FROM t2) max_t2h
> [info]   FROM   t1
> [info]   WHERE  t1a = 'val1c' (ThriftServerQueryTestSuite.scala:197)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33273) Fix Flaky Test: ThriftServerQueryTestSuite. subquery_scalar_subquery_scalar_subquery_select_sql

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249175#comment-17249175
 ] 

Apache Spark commented on SPARK-33273:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/30765

> Fix Flaky Test: ThriftServerQueryTestSuite. 
> subquery_scalar_subquery_scalar_subquery_select_sql
> ---
>
> Key: SPARK-33273
> URL: https://issues.apache.org/jira/browse/SPARK-33273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
> Attachments: failures.png
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130369/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/
> {code}
> [info] - subquery/scalar-subquery/scalar-subquery-select.sql *** FAILED *** 
> (3 seconds, 877 milliseconds)
> [info]   Expected "[1]0   2017-05-04 01:01:0...", but got "[]0
> 2017-05-04 01:01:0..." Result did not match for query #3
> [info]   SELECT (SELECT min(t3d) FROM t3) min_t3d,
> [info]  (SELECT max(t2h) FROM t2) max_t2h
> [info]   FROM   t1
> [info]   WHERE  t1a = 'val1c' (ThriftServerQueryTestSuite.scala:197)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33779) DataSource V2: API to request distribution and ordering on write

2020-12-14 Thread Anton Okolnychyi (Jira)
Anton Okolnychyi created SPARK-33779:


 Summary: DataSource V2: API to request distribution and ordering 
on write
 Key: SPARK-33779
 URL: https://issues.apache.org/jira/browse/SPARK-33779
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Anton Okolnychyi


We need to have proper APIs for requesting a specific distribution and ordering 
on writes for data sources that implement the V2 interface.
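
One hypothetical shape such an API could take (the names below are illustrative only, not the interface that was ultimately added): a write declares the distribution and ordering it requires, and Spark inserts the shuffle/sort needed to satisfy them.

{code:scala}
// Illustrative model only, not Spark connector code.
trait Distribution
case class ClusteredDistribution(clusteringColumns: Seq[String]) extends Distribution
case class SortColumn(name: String, ascending: Boolean = true)

// A data source write mixes this in to request how Spark should lay out its input.
trait RequestsDistributionAndOrdering {
  def requiredDistribution: Distribution
  def requiredOrdering: Seq[SortColumn]
}
{code}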



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33778) Allow typesafe join for LeftSemi and LeftAnti

2020-12-14 Thread Venkata krishnan Sowrirajan (Jira)
Venkata krishnan Sowrirajan created SPARK-33778:
---

 Summary: Allow typesafe join for LeftSemi and LeftAnti
 Key: SPARK-33778
 URL: https://issues.apache.org/jira/browse/SPARK-33778
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Venkata krishnan Sowrirajan


With the [SPARK-21333|https://issues.apache.org/jira/browse/SPARK-21333] change, 
LeftSemi and LeftAnti no longer have a typesafe join API. It makes sense not to 
support them in joinWith, since joinWith returns tuples containing values from 
both datasets, which is not possible for these join types. Nevertheless, it would 
be nice to have a separate join API, or support in the existing API, for LeftSemi 
and LeftAnti that returns a Dataset.
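
A sketch of what is possible today with the untyped API (assuming a hypothetical Person case class): for leftsemi/leftanti joins the output keeps only the left-side schema, so the result can be turned back into a typed Dataset with as[Person].

{code:scala}
import org.apache.spark.sql.SparkSession

case class Person(id: Long, name: String)

val spark = SparkSession.builder().master("local[*]").appName("leftanti-demo").getOrCreate()
import spark.implicits._

val people = Seq(Person(1, "a"), Person(2, "b")).toDS()
val banned = Seq(Person(2, "b")).toDS()

// leftanti: people with no matching id in banned; the output has only the left
// schema, so .as[Person] recovers a typed Dataset[Person].
val allowed = people.join(banned, Seq("id"), "leftanti").as[Person]
allowed.show()

spark.stop()
{code}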



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249148#comment-17249148
 ] 

Apache Spark commented on SPARK-33777:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30764

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> V1 SHOW PARTITIONS command sorts its results. Both V1 implementations 
> in-memory and Hive catalog (according to Hive docs 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions)]
>  perform sorting. V2 should have the same behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249147#comment-17249147
 ] 

Apache Spark commented on SPARK-33777:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30764

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> V1 SHOW PARTITIONS command sorts its results. Both V1 implementations 
> in-memory and Hive catalog (according to Hive docs 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions)]
>  perform sorting. V2 should have the same behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33777:


Assignee: (was: Apache Spark)

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> V1 SHOW PARTITIONS command sorts its results. Both V1 implementations 
> in-memory and Hive catalog (according to Hive docs 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions)]
>  perform sorting. V2 should have the same behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33777:


Assignee: Apache Spark

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> V1 SHOW PARTITIONS command sorts its results. Both V1 implementations 
> in-memory and Hive catalog (according to Hive docs 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions)]
>  perform sorting. V2 should have the same behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249120#comment-17249120
 ] 

Maxim Gekk commented on SPARK-33777:


I am working on this.

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> V1 SHOW PARTITIONS command sorts its results. Both V1 implementations 
> in-memory and Hive catalog (according to Hive docs 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions)]
>  perform sorting. V2 should have the same behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33772) Build and Run Spark on JDK17

2020-12-14 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249088#comment-17249088
 ] 

Erik Krogen commented on SPARK-33772:
-

It is very weird to see JDK versions bumping up by 6 whole major versions after 
years of watching it very slowly tick by one by one :) Thanks [~dongjoon]!

> Build and Run Spark on JDK17
> 
>
> Key: SPARK-33772
> URL: https://issues.apache.org/jira/browse/SPARK-33772
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is 
> 17.
> ||Version||Release Date||
> |Java 17 (LTS)|September 2021|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33777) Sort output of SHOW PARTITIONS V2

2020-12-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33777:
--

 Summary: Sort output of SHOW PARTITIONS V2
 Key: SPARK-33777
 URL: https://issues.apache.org/jira/browse/SPARK-33777
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


The V1 SHOW PARTITIONS command sorts its results. Both V1 implementations, the 
in-memory catalog and the Hive catalog (according to the Hive docs, 
[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions]), 
perform sorting. V2 should have the same behavior.
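
A minimal illustration of the expected behavior (not the actual V2 command code): partition specs are rendered as strings and sorted before being returned.

{code:scala}
// Illustrative values; the point is only that the output is sorted,
// matching what V1 SHOW PARTITIONS does.
val partitionSpecs = Seq("year=2021/month=2", "year=2020/month=12", "year=2020/month=3")
val sortedOutput = partitionSpecs.sorted  // lexicographic sort
sortedOutput.foreach(println)
{code}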



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33777) Sort output of V2 SHOW PARTITIONS

2020-12-14 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-33777:
---
Summary: Sort output of V2 SHOW PARTITIONS  (was: Sort output of SHOW 
PARTITIONS V2)

> Sort output of V2 SHOW PARTITIONS
> -
>
> Key: SPARK-33777
> URL: https://issues.apache.org/jira/browse/SPARK-33777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> V1 SHOW PARTITIONS command sorts its results. Both V1 implementations 
> in-memory and Hive catalog (according to Hive docs 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowPartitions)]
>  perform sorting. V2 should have the same behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33733) PullOutNondeterministic should check and collect deterministic field

2020-12-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33733.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30703
[https://github.com/apache/spark/pull/30703]

> PullOutNondeterministic should check and collect deterministic field
> 
>
> Key: SPARK-33733
> URL: https://issues.apache.org/jira/browse/SPARK-33733
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
> Fix For: 3.1.0
>
>
> The deterministic field is wider than `NonDeterministic`; we should keep the 
> same range between pull-out and check analysis.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33733) PullOutNondeterministic should check and collect deterministic field

2020-12-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33733:
---

Assignee: ulysses you

> PullOutNondeterministic should check and collect deterministic field
> 
>
> Key: SPARK-33733
> URL: https://issues.apache.org/jira/browse/SPARK-33733
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
>
> The deterministic field is wider than `NonDeterministic`; we should keep the 
> same range between pull-out and check analysis.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33428) conv UDF returns incorrect value

2020-12-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33428:
---

Assignee: angerszhu

> conv UDF returns incorrect value
> 
>
> Key: SPARK-33428
> URL: https://issues.apache.org/jira/browse/SPARK-33428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> How to reproduce this issue:
> {noformat}
> spark-sql> select java_method('scala.math.BigInt', 'apply', 
> 'c8dcdfb41711fc9a1f17928001d7fd61', 16);
> 266992441711411603393340504520074460513
> spark-sql> select conv('c8dcdfb41711fc9a1f17928001d7fd61', 16, 10);
> 18446744073709551615
> {noformat}
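
For context, the value returned by conv above is 2^64 - 1, the unsigned 64-bit maximum, which suggests conv saturates on overflow instead of widening the way scala.math.BigInt does. A quick check of the two numbers (explanatory only, not the fix):

{code:scala}
val hex = "c8dcdfb41711fc9a1f17928001d7fd61"
println(BigInt(hex, 16))        // 266992441711411603393340504520074460513, the exact value
println(BigInt(2).pow(64) - 1)  // 18446744073709551615, the value conv returns
{code}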



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33428) conv UDF returns incorrect value

2020-12-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33428.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30350
[https://github.com/apache/spark/pull/30350]

> conv UDF returns incorrect value
> 
>
> Key: SPARK-33428
> URL: https://issues.apache.org/jira/browse/SPARK-33428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> How to reproduce this issue:
> {noformat}
> spark-sql> select java_method('scala.math.BigInt', 'apply', 
> 'c8dcdfb41711fc9a1f17928001d7fd61', 16);
> 266992441711411603393340504520074460513
> spark-sql> select conv('c8dcdfb41711fc9a1f17928001d7fd61', 16, 10);
> 18446744073709551615
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31801) Register shuffle map output metadata with a shuffle output tracker

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248995#comment-17248995
 ] 

Apache Spark commented on SPARK-31801:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/30763

> Register shuffle map output metadata with a shuffle output tracker
> --
>
> Key: SPARK-31801
> URL: https://issues.apache.org/jira/browse/SPARK-31801
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>
> Part of the design as discussed in [this 
> document|https://docs.google.com/document/d/1Aj6IyMsbS2sdIfHxLvIbHUNjHIWHTabfknIPoxOrTjk/edit#].
> Establish a {{ShuffleOutputTracker}} API that resides on the driver, and 
> handle accepting map output metadata returned by the map output writers and 
> send them to the output tracker module accordingly.
> Requires https://issues.apache.org/jira/browse/SPARK-31798.
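
A hypothetical driver-side trait mirroring the description above (illustrative names, not the design document's exact API):

{code:scala}
// Sketch only: the driver-side tracker accepts per-map metadata reported by the
// map output writers and releases it when the shuffle is no longer needed.
trait ShuffleOutputTracker[MapMetadata] {
  def registerMapOutput(shuffleId: Int, mapIndex: Int, metadata: MapMetadata): Unit
  def unregisterShuffle(shuffleId: Int): Unit
}
{code}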



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33758) Prune unnecessary output partitioning when the attribute is not part of output.

2020-12-14 Thread Prakhar Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakhar Jain updated SPARK-33758:
-
Description: 
Consider the query:
{noformat}
val planned = sql(
  """
| SELECT t1.id as t1id
| FROM t1, t2
| WHERE t1.id = t2.id
  """.stripMargin).queryExecution.executedPlan

println(planned.outputPartitioning)
{noformat}
The output of this will be:

{noformat}
res10: org.apache.spark.sql.catalyst.plans.physical.Partitioning = 
(hashpartitioning(t1id#6L, 200) or hashpartitioning(t2id#7L, 200)) 
{noformat}

 

This query will have a top-level Project node that just projects t1.id, so the 
outputPartitioning of this Project node should be:

hashpartitioning(t1id#6L, 200)

 

cc - [~maropu] [~cloud_fan]

  was:
Consider the query:

 

select t1.id from t1 JOIN t2 on t1.id = t2.id

 

This query will have top level Project node which will just project t1.id. But 
the outputPartitioning of this project node will be:

PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id))

 

We should drop HashPartitioning(t2.id) from outputPartitioning of Project node.

 

cc - [~maropu] [~cloud_fan]
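
A simplified, self-contained model of the proposed pruning for the new description above (illustrative types, not Catalyst code): keep only the partitionings whose attributes all appear in the node's output.

{code:scala}
// Toy model of the rule: a partitioning survives only if every attribute it
// references is part of the node's output.
case class HashPartitioning(attributes: Set[String])

def prunePartitioning(
    partitionings: Seq[HashPartitioning],
    output: Set[String]): Seq[HashPartitioning] =
  partitionings.filter(_.attributes.subsetOf(output))

// For the query above, the Project only outputs t1id, so the t2 side is dropped:
val pruned = prunePartitioning(
  Seq(HashPartitioning(Set("t1id")), HashPartitioning(Set("t2id"))),
  output = Set("t1id"))
// pruned == Seq(HashPartitioning(Set("t1id")))
{code}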


> Prune unnecessary output partitioning when the attribute is not part of 
> output.
> ---
>
> Key: SPARK-33758
> URL: https://issues.apache.org/jira/browse/SPARK-33758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Priority: Major
>
> Consider the query:
> {noformat}
> val planned = sql(
>   """
> | SELECT t1.id as t1id
> | FROM t1, t2
> | WHERE t1.id = t2.id
>   """.stripMargin).queryExecution.executedPlan
> println(planned.outputPartitioning)
> {noformat}
> The output of this will be:
> {noformat}
> res10: org.apache.spark.sql.catalyst.plans.physical.Partitioning = 
> (hashpartitioning(t1id#6L, 200) or hashpartitioning(t2id#7L, 200)) 
> {noformat}
>  
> This query will have a top-level Project node which will just project t1.id, 
> so the outputPartitioning of this Project node should be:
> hashpartitioning(t1id#6L, 200)
>  
> cc - [~maropu] [~cloud_fan]
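A rough sketch of the kind of pruning this asks for, written against the
Catalyst partitioning classes; the helper name and the fallback behaviour are
assumptions made for illustration, not the actual change:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Expression}
import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, PartitioningCollection}

// Keep only the partitionings whose attributes are still part of the node's
// output, so a Project that drops t2.id no longer advertises
// hashpartitioning(t2.id, ...) to its parent operators.
def prunePartitioning(partitioning: Partitioning, output: AttributeSet): Partitioning =
  partitioning match {
    case PartitioningCollection(partitionings) =>
      val usable = partitionings.filter {
        case e: Expression => e.references.subsetOf(output)
        case _             => true
      }
      // Fall back to the original collection if nothing survives; a real
      // implementation would likely degrade to an unknown partitioning here.
      if (usable.nonEmpty) PartitioningCollection(usable) else partitioning
    case other => other
  }
{code}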



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33758) Prune unnecessary output partitioning when the attribute is not part of output.

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33758:


Assignee: Apache Spark

> Prune unnecessary output partitioning when the attribute is not part of 
> output.
> ---
>
> Key: SPARK-33758
> URL: https://issues.apache.org/jira/browse/SPARK-33758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Assignee: Apache Spark
>Priority: Major
>
> Consider the query:
>  
> select t1.id from t1 JOIN t2 on t1.id = t2.id
>  
> This query will have a top-level Project node which will just project t1.id. 
> But the outputPartitioning of this Project node will be:
> PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id))
>  
> We should drop HashPartitioning(t2.id) from the outputPartitioning of the 
> Project node.
>  
> cc - [~maropu] [~cloud_fan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33758) Prune unnecessary output partitioning when the attribute is not part of output.

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33758:


Assignee: (was: Apache Spark)

> Prune unnecessary output partitioning when the attribute is not part of 
> output.
> ---
>
> Key: SPARK-33758
> URL: https://issues.apache.org/jira/browse/SPARK-33758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Priority: Major
>
> Consider the query:
>  
> select t1.id from t1 JOIN t2 on t1.id = t2.id
>  
> This query will have a top-level Project node which will just project t1.id. 
> But the outputPartitioning of this Project node will be:
> PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id))
>  
> We should drop HashPartitioning(t2.id) from the outputPartitioning of the 
> Project node.
>  
> cc - [~maropu] [~cloud_fan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33758) Prune unnecessary output partitioning when the attribute is not part of output.

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248954#comment-17248954
 ] 

Apache Spark commented on SPARK-33758:
--

User 'prakharjain09' has created a pull request for this issue:
https://github.com/apache/spark/pull/30762

> Prune unnecessary output partitioning when the attribute is not part of 
> output.
> ---
>
> Key: SPARK-33758
> URL: https://issues.apache.org/jira/browse/SPARK-33758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Priority: Major
>
> Consider the query:
>  
> select t1.id from t1 JOIN t2 on t1.id = t2.id
>  
> This query will have a top-level Project node which will just project t1.id. 
> But the outputPartitioning of this Project node will be:
> PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id))
>  
> We should drop HashPartitioning(t2.id) from the outputPartitioning of the 
> Project node.
>  
> cc - [~maropu] [~cloud_fan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33769) improve the next-day function of the sql component to deal with Column type

2020-12-14 Thread Chongguang LIU (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248947#comment-17248947
 ] 

Chongguang LIU commented on SPARK-33769:


Hello [~hyukjin.kwon],

 

I think the pull request is ready for review: 
https://github.com/apache/spark/pull/30761

> improve the next-day function of the sql component to deal with Column type
> ---
>
> Key: SPARK-33769
> URL: https://issues.apache.org/jira/browse/SPARK-33769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chongguang LIU
>Priority: Major
>
> Hello all,
>  
> I used the function next_day in the spark SQL component and loved it: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3077]
>  
> Actually the signature of this function is: def next_day(date: Column, 
> dayOfWeek: String): Column.
> It accepts the dayOfWeek parameter as a String. However, in my case the 
> dayOfWeek is in a Column, so it can have a different value for each row of 
> the dataframe. 
> So I had to use the NextDay function like this: NextDay(dateCol.expr, 
> dayOfWeekCol.expr).
>  
> My proposition is to add another signature for this function: def 
> next_day(date: Column, dayOfWeek: Column): Column
>  
> In fact it is already the case for some other functions in this Scala 
> object, for example:
> def date_sub(start: Column, days: Int): Column = date_sub(start, lit(days))
> def date_sub(start: Column, days: Column): Column = withExpr { 
> DateSub(start.expr, days.expr) }
>  
> or 
>  
> def add_months(startDate: Column, numMonths: Int): Column = 
> add_months(startDate, lit(numMonths))
> def add_months(startDate: Column, numMonths: Column): Column = withExpr {
>  AddMonths(startDate.expr, numMonths.expr)
>  }
>  
> I hope I have explained my idea clearly. Let me know what your opinions are. 
> If you are ok, I can submit a pull request with the necessary change.
>  
> Kind regards,
> Chongguang
>  
>  
>  
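Following the add_months / date_sub pattern quoted above, the proposed overload
would presumably look something like the sketch below, written as it would
appear inside org.apache.spark.sql.functions where withExpr and NextDay are in
scope; this is an illustration of the proposal, not the merged change:

{code:scala}
// Sketch of the additional signature, mirroring the existing overloads:
// accept the day of week as a Column so it can vary per row.
def next_day(date: Column, dayOfWeek: Column): Column = withExpr {
  NextDay(date.expr, dayOfWeek.expr)
}
{code}

With such an overload, a call like next_day(col("date"), col("dow")) would work
directly instead of building NextDay(dateCol.expr, dayOfWeekCol.expr) by hand.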



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33758) Prune unnecessary output partitioning when the attribute is not part of output.

2020-12-14 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248924#comment-17248924
 ] 

angerszhu commented on SPARK-33758:
---

I am interested in this. Could you give a clearer description or a reproducible case?

As far as I can see, the outputPartitioning is the default value.

> Prune unnecessary output partitioning when the attribute is not part of 
> output.
> ---
>
> Key: SPARK-33758
> URL: https://issues.apache.org/jira/browse/SPARK-33758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Priority: Major
>
> Consider the query:
>  
> select t1.id from t1 JOIN t2 on t1.id = t2.id
>  
> This query will have a top-level Project node which will just project t1.id. 
> But the outputPartitioning of this Project node will be:
> PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id))
>  
> We should drop HashPartitioning(t2.id) from the outputPartitioning of the 
> Project node.
>  
> cc - [~maropu] [~cloud_fan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33769) improve the next-day function of the sql component to deal with Column type

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248914#comment-17248914
 ] 

Apache Spark commented on SPARK-33769:
--

User 'chongguang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30761

> improve the next-day function of the sql component to deal with Column type
> ---
>
> Key: SPARK-33769
> URL: https://issues.apache.org/jira/browse/SPARK-33769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chongguang LIU
>Priority: Major
>
> Hello all,
>  
> I used the function next_day in the spark SQL component and loved it: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3077]
>  
> Actually the signature of this function is: def next_day(date: Column, 
> dayOfWeek: String): Column.
> It accepts the dayOfWeek parameter as a String. However, in my case the 
> dayOfWeek is in a Column, so it can have a different value for each row of 
> the dataframe. 
> So I had to use the NextDay function like this: NextDay(dateCol.expr, 
> dayOfWeekCol.expr).
>  
> My proposition is to add another signature for this function: def 
> next_day(date: Column, dayOfWeek: Column): Column
>  
> In fact it is already the case for some other functions in this Scala 
> object, for example:
> def date_sub(start: Column, days: Int): Column = date_sub(start, lit(days))
> def date_sub(start: Column, days: Column): Column = withExpr { 
> DateSub(start.expr, days.expr) }
>  
> or 
>  
> def add_months(startDate: Column, numMonths: Int): Column = 
> add_months(startDate, lit(numMonths))
> def add_months(startDate: Column, numMonths: Column): Column = withExpr {
>  AddMonths(startDate.expr, numMonths.expr)
>  }
>  
> I hope I have explained my idea clearly. Let me know what your opinions are. 
> If you are ok, I can submit a pull request with the necessary change.
>  
> Kind regards,
> Chongguang
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33769) improve the next-day function of the sql component to deal with Column type

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33769:


Assignee: (was: Apache Spark)

> improve the next-day function of the sql component to deal with Column type
> ---
>
> Key: SPARK-33769
> URL: https://issues.apache.org/jira/browse/SPARK-33769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chongguang LIU
>Priority: Major
>
> Hello all,
>  
> I used the function next_day in the spark SQL component and loved it: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3077]
>  
> Actually the signature of this function is: def next_day(date: Column, 
> dayOfWeek: String): Column.
> It accepts the dayOfWeek parameter as a String. However, in my case the 
> dayOfWeek is in a Column, so it can have a different value for each row of 
> the dataframe. 
> So I had to use the NextDay function like this: NextDay(dateCol.expr, 
> dayOfWeekCol.expr).
>  
> My proposition is to add another signature for this function: def 
> next_day(date: Column, dayOfWeek: Column): Column
>  
> In fact it is already the case for some other functions in this Scala 
> object, for example:
> def date_sub(start: Column, days: Int): Column = date_sub(start, lit(days))
> def date_sub(start: Column, days: Column): Column = withExpr { 
> DateSub(start.expr, days.expr) }
>  
> or 
>  
> def add_months(startDate: Column, numMonths: Int): Column = 
> add_months(startDate, lit(numMonths))
> def add_months(startDate: Column, numMonths: Column): Column = withExpr {
>  AddMonths(startDate.expr, numMonths.expr)
>  }
>  
> I hope I have explained my idea clearly. Let me know what your opinions are. 
> If you are ok, I can submit a pull request with the necessary change.
>  
> Kind regards,
> Chongguang
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33769) improve the next-day function of the sql component to deal with Column type

2020-12-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33769:


Assignee: Apache Spark

> improve the next-day function of the sql component to deal with Column type
> ---
>
> Key: SPARK-33769
> URL: https://issues.apache.org/jira/browse/SPARK-33769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chongguang LIU
>Assignee: Apache Spark
>Priority: Major
>
> Hello all,
>  
> I used the function next_day in the spark SQL component and loved it: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3077]
>  
> Actually the signature of this function is: def next_day(date: Column, 
> dayOfWeek: String): Column.
> It accepts the dayOfWeek parameter as a String. However, in my case the 
> dayOfWeek is in a Column, so it can have a different value for each row of 
> the dataframe. 
> So I had to use the NextDay function like this: NextDay(dateCol.expr, 
> dayOfWeekCol.expr).
>  
> My proposition is to add another signature for this function: def 
> next_day(date: Column, dayOfWeek: Column): Column
>  
> In fact it is already the case for some other functions in this Scala 
> object, for example:
> def date_sub(start: Column, days: Int): Column = date_sub(start, lit(days))
> def date_sub(start: Column, days: Column): Column = withExpr { 
> DateSub(start.expr, days.expr) }
>  
> or 
>  
> def add_months(startDate: Column, numMonths: Int): Column = 
> add_months(startDate, lit(numMonths))
> def add_months(startDate: Column, numMonths: Column): Column = withExpr {
>  AddMonths(startDate.expr, numMonths.expr)
>  }
>  
> I hope I have explained my idea clearly. Let me know what your opinions are. 
> If you are ok, I can submit a pull request with the necessary change.
>  
> Kind regards,
> Chongguang
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33769) improve the next-day function of the sql component to deal with Column type

2020-12-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248912#comment-17248912
 ] 

Apache Spark commented on SPARK-33769:
--

User 'chongguang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30761

> improve the next-day function of the sql component to deal with Column type
> ---
>
> Key: SPARK-33769
> URL: https://issues.apache.org/jira/browse/SPARK-33769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chongguang LIU
>Priority: Major
>
> Hello all,
>  
> I used the function next_day in the spark SQL component and loved it: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3077]
>  
> Actually the signature of this function is: def next_day(date: Column, 
> dayOfWeek: String): Column.
> It accepts the dayOfWeek parameter as a String. However, in my case the 
> dayOfWeek is in a Column, so it can have a different value for each row of 
> the dataframe. 
> So I had to use the NextDay function like this: NextDay(dateCol.expr, 
> dayOfWeekCol.expr).
>  
> My proposition is to add another signature for this function: def 
> next_day(date: Column, dayOfWeek: Column): Column
>  
> In fact it is already the case for some other functions in this Scala 
> object, for example:
> def date_sub(start: Column, days: Int): Column = date_sub(start, lit(days))
> def date_sub(start: Column, days: Column): Column = withExpr { 
> DateSub(start.expr, days.expr) }
>  
> or 
>  
> def add_months(startDate: Column, numMonths: Int): Column = 
> add_months(startDate, lit(numMonths))
> def add_months(startDate: Column, numMonths: Column): Column = withExpr {
>  AddMonths(startDate.expr, numMonths.expr)
>  }
>  
> I hope I have explained my idea clearly. Let me know what your opinions are. 
> If you are ok, I can submit a pull request with the necessary change.
>  
> Kind regards,
> Chongguang
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-14 Thread Simon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248899#comment-17248899
 ] 

Simon edited comment on SPARK-33571 at 12/14/20, 10:51 AM:
---

[~maxgekk] OK, all clear. Thanks again for the clarifications!


was (Author: simonvanderveldt):
[~maxgekk]OK, all clear. Thanks again for the clarifications!

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
> Fix For: 3.1.0
>
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1. with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1. with for example `df.show()` as 
> they did in Spark 2.4.5
> When writing parquet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
>  the blog post linked there and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts to help with testing/showing the behavior; they use 
> pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here 
> [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the 
> outputs in a comment below as well.
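For anyone trying to reproduce this, a minimal comparison of the two read modes
looks roughly like the sketch below; it assumes the Spark 3.0.x session config
keys spark.sql.legacy.parquet.datetimeRebaseModeInRead / ...InWrite and uses a
placeholder path for a file written by Spark 2.4.5:

{code:scala}
// Read the same Spark 2.4.5 Parquet file with old dates/timestamps twice,
// once per rebase mode, and compare the displayed values.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
val legacyDf = spark.read.parquet("/tmp/old-dates.parquet")
legacyDf.show(truncate = false)

spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
val correctedDf = spark.read.parquet("/tmp/old-dates.parquet")
correctedDf.show(truncate = false)

// If rebasing is applied, rows before 1582-10-15 (dates) or
// 1900-01-01T00:00:00Z (timestamps) should differ between the two outputs;
// the report above says they come out identical.
{code}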



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-14 Thread Simon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248899#comment-17248899
 ] 

Simon commented on SPARK-33571:
---

[~maxgekk]OK, all clear. Thanks again for the clarifications!

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
> Fix For: 3.1.0
>
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY` the dates and timestamps should 
> show the same values in Spark 3.0.1. with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED` the dates and timestamps 
> should show different values in Spark 3.0.1. with for example `df.show()` as 
> they did in Spark 2.4.5
> When writing parquet files with Spark > 3.0.0 which contain dates or 
> timestamps before the above mentioned moment in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html)]
>  the blog post linked there and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 
> that contains fields of type `TimestampType` which contain timestamps before 
> the above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 
> that contains fields of type `TimestampType` or `DateType` which contain 
> dates or timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts to help with testing/showing the behavior; they use 
> pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here 
> [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the 
> outputs in a comment below as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33770) Test failures: ALTER TABLE .. DROP PARTITION tries to delete files out of partition path

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33770.
--
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30757
[https://github.com/apache/spark/pull/30757]

> Test failures: ALTER TABLE .. DROP PARTITION tries to delete files out of 
> partition path
> 
>
> Key: SPARK-33770
> URL: https://issues.apache.org/jira/browse/SPARK-33770
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> For example: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132719/testReport/org.apache.spark.sql.hive.execution.command/AlterTableAddPartitionSuite/ALTER_TABLEADD_PARTITION_Hive_V1__SPARK_33521__universal_type_conversions_of_partition_values/
> {code:java}
> org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER 
> TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of 
> partition values
> sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: File 
> file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-38fe2706-33e5-469a-ba3a-682391e02179
>  does not exist;
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.dropPartitions(ExternalCatalogWithListener.scala:211)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.dropPartitions(SessionCatalog.scala:1036)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:582)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33770) Test failures: ALTER TABLE .. DROP PARTITION tries to delete files out of partition path

2020-12-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33770:


Assignee: Maxim Gekk

> Test failures: ALTER TABLE .. DROP PARTITION tries to delete files out of 
> partition path
> 
>
> Key: SPARK-33770
> URL: https://issues.apache.org/jira/browse/SPARK-33770
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> For example: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132719/testReport/org.apache.spark.sql.hive.execution.command/AlterTableAddPartitionSuite/ALTER_TABLEADD_PARTITION_Hive_V1__SPARK_33521__universal_type_conversions_of_partition_values/
> {code:java}
> org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER 
> TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of 
> partition values
> sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: File 
> file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-38fe2706-33e5-469a-ba3a-682391e02179
>  does not exist;
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.dropPartitions(ExternalCatalogWithListener.scala:211)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.dropPartitions(SessionCatalog.scala:1036)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:582)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33716) Decommissioning Race Condition during Pod Snapshot

2020-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33716:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Bug)

> Decommissioning Race Condition during Pod Snapshot
> --
>
> Key: SPARK-33716
> URL: https://issues.apache.org/jira/browse/SPARK-33716
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>
> Some versions of Kubernetes may create a deletion timestamp field before 
> changing the pod status to terminating, so a decommissioning node may have a 
> deletion timestamp while its status still shows it as running. Depending on 
> when the K8s snapshot comes back, this can cause a race condition, with Spark 
> believing the pod has been deleted before it actually has been.
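A sketch of the defensive check this implies, using the fabric8 pod model that
the Kubernetes scheduler backend is built on (illustrative only, not the actual
patch):

{code:scala}
import io.fabric8.kubernetes.api.model.Pod

// A pod that carries a deletion timestamp is already on its way out, even if
// the snapshot still reports its phase as "Running", so it should be treated
// as terminating rather than as a live executor.
def isBeingDeleted(pod: Pod): Boolean =
  Option(pod.getMetadata).flatMap(m => Option(m.getDeletionTimestamp)).isDefined
{code}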



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33716) Decommissioning Race Condition during Pod Snapshot

2020-12-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33716.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30693
[https://github.com/apache/spark/pull/30693]

> Decommissioning Race Condition during Pod Snapshot
> --
>
> Key: SPARK-33716
> URL: https://issues.apache.org/jira/browse/SPARK-33716
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>
> Some versions of Kubernetes may create a deletion timestamp field before 
> changing the pod status to terminating, so a decommissioning node may have a 
> deletion timestamp while its status still shows it as running. Depending on 
> when the K8s snapshot comes back, this can cause a race condition, with Spark 
> believing the pod has been deleted before it actually has been.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


