[jira] [Created] (SPARK-32948) Optimize Json expression chain

2020-09-20 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-32948:
---

 Summary: Optimize Json expression chain
 Key: SPARK-32948
 URL: https://issues.apache.org/jira/browse/SPARK-32948
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Like GetStructField and CreateNamedStruct, chains of JSON expressions, e.g., StructsToJson and JsonToStructs, could be optimized.
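As a rough illustration of the idea (the class names below are toy stand-ins, not Spark's actual Catalyst expressions, and the rewrite ignores the schema/formatting checks a real optimizer rule must perform), a serialize/deserialize round trip in an expression chain can be collapsed by a simple bottom-up rewrite:

```scala
// Toy expression tree: FromJson/ToJson stand in for JsonToStructs/StructsToJson.
sealed trait Expr
case class Col(name: String) extends Expr
case class FromJson(child: Expr) extends Expr // parse a JSON string into a struct
case class ToJson(child: Expr) extends Expr   // serialize a struct back to JSON

// A ToJson(FromJson(e)) round trip cancels out, analogous to how
// GetStructField(CreateNamedStruct(...)) is simplified.
def simplify(e: Expr): Expr = e match {
  case ToJson(FromJson(inner)) => simplify(inner)
  case ToJson(child)           => ToJson(simplify(child))
  case FromJson(child)         => FromJson(simplify(child))
  case c: Col                  => c
}
```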



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32096) Improve sorting performance for Spark SQL rank window function​

2020-09-20 Thread Zikun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zikun updated SPARK-32096:
--
Description: 
Spark SQL's rank window function needs to sort the data in each window partition, 
and it relies on the execution operator 
[*_SortExec_*|https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala] 
to do the sort. During sorting, the window partition key is also put at the front 
of the sort order, which brings unnecessary comparisons on the partition key. 
Instead, we can group the rows by partition key first and, inside each group, 
sort the rows without comparing the partition key.

 

The Jira https://issues.apache.org/jira/browse/SPARK-32947 is a follow-up 
to this improvement.
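A minimal sketch of the grouping idea in plain Scala collections (illustrative only; the real change would live inside SortExec's sorter, and the names here are invented):

```scala
// Rows are (partitionKey, orderValue) pairs. Instead of one global sort on
// (partitionKey, orderValue), which re-compares the partition key on every
// comparison, group by the partition key first and sort each group on the
// order value alone.
def sortWithinPartitions(rows: Seq[(Int, Int)]): Map[Int, Seq[Int]] =
  rows.groupBy(_._1).map { case (k, group) => k -> group.map(_._2).sorted }
```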

  was:
Spark SQL's rank window function needs to sort the data in each window partition, 
and it relies on the execution operator 
[*_SortExec_*|https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala] 
to do the sort. During sorting, the window partition key is also put at the front 
of the sort order, which brings unnecessary comparisons on the partition key. 
Instead, we can group the rows by partition key first and, inside each group, 
sort the rows without comparing the partition key.

 

In Spark SQL, there are two types of sort execution: *_SortExec_* and 
*_TakeOrderedAndProjectExec_*. *_SortExec_* is a general sorting execution and 
does not support top-N sort; *_TakeOrderedAndProjectExec_* is the execution for 
top-N sort in Spark. Spark SQL's rank window function needs to sort the data 
locally, and it relies on the execution plan *_SortExec_* to sort the data in 
each physical data partition. When a filter on the window rank (e.g. rank <= 
100) is specified in a user's query, the filter can actually be pushed down to 
SortExec, letting SortExec perform a top-N sort. Right now SortExec does not 
support top-N sort, so we need to extend its capability to support it. 


> Improve sorting performance for Spark SQL rank window function​
> ---
>
> Key: SPARK-32096
> URL: https://issues.apache.org/jira/browse/SPARK-32096
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Any environment that supports Spark.
>Reporter: Zikun
>Priority: Major
> Attachments: windowSortPerf (1).docx
>
>
> Spark SQL's rank window function needs to sort the data in each window 
> partition, and it relies on the execution operator 
> [*_SortExec_*|https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala] 
> to do the sort. During sorting, the window partition key is also put at the 
> front of the sort order, which brings unnecessary comparisons on the 
> partition key. Instead, we can group the rows by partition key first, and 
> inside each 

[jira] [Created] (SPARK-32947) Support top-N sort for Spark SQL window function

2020-09-20 Thread Zikun (Jira)
Zikun created SPARK-32947:
-

 Summary: Support top-N sort for Spark SQL window function
 Key: SPARK-32947
 URL: https://issues.apache.org/jira/browse/SPARK-32947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1
Reporter: Zikun


In Spark SQL, there are two types of sort execution: *_SortExec_* and 
*_TakeOrderedAndProjectExec_*. *_SortExec_* is a general sorting execution and 
does not support top-N sort; *_TakeOrderedAndProjectExec_* is the execution for 
top-N sort in Spark. Spark SQL's rank window function needs to sort the data 
locally, and it relies on the execution plan *_SortExec_* to sort the data in 
each physical data partition. When a filter on the window rank (e.g. rank <= 
100) is specified in a user's query, the filter can actually be pushed down to 
SortExec, letting SortExec perform a top-N sort. Right now SortExec does not 
support top-N sort, so we need to extend its capability to support it. 

 
This Jira depends on another existing Jira:
https://issues.apache.org/jira/browse/SPARK-32096
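The top-N behavior being proposed can be sketched with a bounded heap in plain Scala (illustrative only; the function and structure here are invented, not SortExec's):

```scala
import scala.collection.mutable

// Keep at most n rows in a max-heap; once the heap is full, each new row
// evicts the current largest, so rows beyond the limit (e.g. rank <= 100)
// are discarded during the sort instead of being fully ordered.
def topN(rows: Iterator[Int], n: Int): List[Int] = {
  val heap = mutable.PriorityQueue.empty[Int] // max-heap by default
  rows.foreach { r =>
    heap.enqueue(r)
    if (heap.size > n) heap.dequeue() // drop the current largest
  }
  heap.dequeueAll.toList.reverse // smallest n values, in ascending order
}
```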
 






[jira] [Resolved] (SPARK-32189) Development - Setting up PyCharm

2020-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32189.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29781
[https://github.com/apache/spark/pull/29781]

> Development - Setting up PyCharm
> 
>
> Key: SPARK-32189
> URL: https://issues.apache.org/jira/browse/SPARK-32189
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.1.0
>
>
> PyCharm is probably one of the most widely used IDEs for Python development. 
> We should document a standard way so users can easily set it up.
> Very rough example: 
> https://hyukjin-spark.readthedocs.io/en/latest/development/developer-tools.html#setup-pycharm-with-pyspark
>  but it should be more detailed.






[jira] [Commented] (SPARK-32937) DecomissionSuite in k8s integration tests is failing.

2020-09-20 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199125#comment-17199125
 ] 

wuyi commented on SPARK-32937:
--

I'm looking at it. Thanks for reporting!

> DecomissionSuite in k8s integration tests is failing.
> -
>
> Key: SPARK-32937
> URL: https://issues.apache.org/jira/browse/SPARK-32937
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Logs from the failing test, copied from jenkins. As of now, it is always 
> failing. 
> {code}
> - Test basic decommissioning *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 182 times 
> over 3.00377927275 minutes. Last failure message: "++ id -u
>   + myuid=185
>   ++ id -g
>   + mygid=0
>   + set +e
>   ++ getent passwd 185
>   + uidentry=
>   + set -e
>   + '[' -z '' ']'
>   + '[' -w /etc/passwd ']'
>   + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
>   + SPARK_CLASSPATH=':/opt/spark/jars/*'
>   + env
>   + grep SPARK_JAVA_OPT_
>   + sort -t_ -k4 -n
>   + sed 's/[^=]*=\(.*\)/\1/g'
>   + readarray -t SPARK_EXECUTOR_JAVA_OPTS
>   + '[' -n '' ']'
>   + '[' 3 == 2 ']'
>   + '[' 3 == 3 ']'
>   ++ python3 -V
>   + pyv3='Python 3.7.3'
>   + export PYTHON_VERSION=3.7.3
>   + PYTHON_VERSION=3.7.3
>   + export PYSPARK_PYTHON=python3
>   + PYSPARK_PYTHON=python3
>   + export PYSPARK_DRIVER_PYTHON=python3
>   + PYSPARK_DRIVER_PYTHON=python3
>   + '[' -n '' ']'
>   + '[' -z ']'
>   + '[' -z x ']'
>   + SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
>   + case "$1" in
>   + shift 1
>   + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
>   + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner 
> local:///opt/spark/tests/decommissioning.py
>   20/09/17 11:06:56 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
>   Starting decom test
>   Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
>   20/09/17 11:06:56 INFO SparkContext: Running Spark version 3.1.0-SNAPSHOT
>   20/09/17 11:06:57 INFO ResourceUtils: 
> ==
>   20/09/17 11:06:57 INFO ResourceUtils: No custom resources configured for 
> spark.driver.
>   20/09/17 11:06:57 INFO ResourceUtils: 
> ==
>   20/09/17 11:06:57 INFO SparkContext: Submitted application: PyMemoryTest
>   20/09/17 11:06:57 INFO ResourceProfile: Default ResourceProfile created, 
> executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , 
> memory -> name: memory, amount: 1024, script: , vendor: ), task resources: 
> Map(cpus -> name: cpus, amount: 1.0)
>   20/09/17 11:06:57 INFO ResourceProfile: Limiting resource is cpus at 1 
> tasks per executor
>   20/09/17 11:06:57 INFO ResourceProfileManager: Added ResourceProfile id: 0
>   20/09/17 11:06:57 INFO SecurityManager: Changing view acls to: 185,jenkins
>   20/09/17 11:06:57 INFO SecurityManager: Changing modify acls to: 185,jenkins
>   20/09/17 11:06:57 INFO SecurityManager: Changing view acls groups to: 
>   20/09/17 11:06:57 INFO SecurityManager: Changing modify acls groups to: 
>   20/09/17 11:06:57 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users  with view permissions: Set(185, jenkins); 
> groups with view permissions: Set(); users  with modify permissions: Set(185, 
> jenkins); groups with modify permissions: Set()
>   20/09/17 11:06:57 INFO Utils: Successfully started service 'sparkDriver' on 
> port 7078.
>   20/09/17 11:06:57 INFO SparkEnv: Registering MapOutputTracker
>   20/09/17 11:06:57 INFO SparkEnv: Registering BlockManagerMaster
>   20/09/17 11:06:57 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
>   20/09/17 11:06:57 INFO BlockManagerMasterEndpoint: 
> BlockManagerMasterEndpoint up
>   20/09/17 11:06:57 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
>   20/09/17 11:06:57 INFO DiskBlockManager: Created local directory at 
> /var/data/spark-7985c075-3b02-42ec-9111-cefba535adf0/blockmgr-3bd403d0-6689-46be-997e-5bc699ecefd3
>   20/09/17 11:06:57 INFO MemoryStore: MemoryStore started with capacity 593.9 
> MiB
>   20/09/17 11:06:57 INFO SparkEnv: Registering OutputCommitCoordinator
>   20/09/17 11:06:58 INFO Utils: Successfully started service 'SparkUI' on 
> port 4040.
>   20/09/17 11:06:58 INFO 

[jira] [Updated] (SPARK-32821) cannot group by with window in sql statement for structured streaming with watermark

2020-09-20 Thread Johnny Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Bai updated SPARK-32821:
---
Attachment: the watermark grammer.md

> cannot group by with window in sql statement for structured streaming with 
> watermark
> 
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Johnny Bai
>Priority: Major
> Attachments: the watermark grammer.md
>
>
> Currently only the DSL style is supported, as below: 
> {code}
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
> {code}
>  
> but group by with window is not supported in SQL style, as below:
> {code}
> select ts_field, count(*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field
> {code}
>  






[jira] [Commented] (SPARK-32821) cannot group by with window in sql statement for structured streaming with watermark

2020-09-20 Thread Johnny Bai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199110#comment-17199110
 ] 

Johnny Bai commented on SPARK-32821:


[~hyukjin.kwon] Sorry, I've been busy recently. I've attached a self-contained 
reproducer; please download it and open it with a markdown editor. [^the 
watermark grammer.md]

> cannot group by with window in sql statement for structured streaming with 
> watermark
> 
>
> Key: SPARK-32821
> URL: https://issues.apache.org/jira/browse/SPARK-32821
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Johnny Bai
>Priority: Major
> Attachments: the watermark grammer.md
>
>
> Currently only the DSL style is supported, as below: 
> {code}
> import spark.implicits._
> val words = ... // streaming DataFrame of schema { timestamp: Timestamp, 
> word: String }
> // Group the data by window and word and compute the count of each group
> val windowedCounts = words.groupBy(window($"timestamp", "10 minutes", "5 
> minutes"),$"word").count()
> {code}
>  
> but group by with window is not supported in SQL style, as below:
> {code}
> select ts_field, count(*) as cnt over window(ts_field, '1 minute', '1 
> minute') with watermark 1 minute from tableX group by ts_field
> {code}
>  






[jira] [Assigned] (SPARK-32799) Make unionByName optionally fill missing columns with nulls in SparkR

2020-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32799:


Assignee: Maciej Szymkiewicz

> Make unionByName optionally fill missing columns with nulls in SparkR
> -
>
> Key: SPARK-32799
> URL: https://issues.apache.org/jira/browse/SPARK-32799
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
>  Labels: starter
> Fix For: 3.1.0
>
>
> It would be nice to expose the {{unionByName}} parameter in the R API as well. 
> Currently it is only exposed on the Scala/Java side (see SPARK-29358).






[jira] [Resolved] (SPARK-32799) Make unionByName optionally fill missing columns with nulls in SparkR

2020-09-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32799.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29813
[https://github.com/apache/spark/pull/29813]

> Make unionByName optionally fill missing columns with nulls in SparkR
> -
>
> Key: SPARK-32799
> URL: https://issues.apache.org/jira/browse/SPARK-32799
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: starter
> Fix For: 3.1.0
>
>
> It would be nice to expose the {{unionByName}} parameter in the R API as well. 
> Currently it is only exposed on the Scala/Java side (see SPARK-29358).






[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite

2020-09-20 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32932:
---
Priority: Major  (was: Minor)

> AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
> --
>
> Key: SPARK-32932
> URL: https://issues.apache.org/jira/browse/SPARK-32932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Major
>
> With AQE, local shuffle reader breaks users' repartitioning for dynamic 
> partition overwrite as in the following case.
> {code:java}
> test("repartition with local reader") {
>   withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
> PartitionOverwriteMode.DYNAMIC.toString,
> SQLConf.SHUFFLE_PARTITIONS.key -> "5",
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
> withTable("t") {
>   val data = for (
> i <- 1 to 10;
> j <- 1 to 3
>   ) yield (i, j)
>   data.toDF("a", "b")
> .repartition($"b")
> .write
> .partitionBy("b")
> .mode("overwrite")
> .saveAsTable("t")
>   assert(spark.read.table("t").inputFiles.length == 3)
> }
>   }
> }{code}
> Coalescing shuffle partitions could also break it.






[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2020-09-20 Thread Alex Hoffer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199064#comment-17199064
 ] 

Alex Hoffer commented on SPARK-19335:
-

+1, this would be really helpful. We have aggregates we'd like to upsert into a 
Postgres table.
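To make the request concrete, here is a hypothetical helper (all names invented; this is not a Spark API) sketching the kind of Postgres-style statement an upsert-capable JDBC writer would need to emit per batch:

```scala
// Build an INSERT ... ON CONFLICT DO UPDATE statement with JDBC placeholders.
// `keyCol` is the conflict target; every other column is updated on conflict.
def upsertSql(table: String, keyCol: String, cols: Seq[String]): String = {
  val updates = cols.filterNot(_ == keyCol)
    .map(c => s"$c = EXCLUDED.$c").mkString(", ")
  val placeholders = cols.map(_ => "?").mkString(", ")
  s"INSERT INTO $table (${cols.mkString(", ")}) VALUES ($placeholders) " +
    s"ON CONFLICT ($keyCol) DO UPDATE SET $updates"
}
```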

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>
> Doing a database update, as opposed to an insert, is useful, particularly when 
> working with streaming applications, which may require revisions to previously 
> stored data. 
> Spark DataFrames/Datasets do not currently support an update feature via the 
> JDBC writer, allowing only Overwrite or Append.






[jira] [Comment Edited] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-20 Thread Aman Rastogi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199041#comment-17199041
 ] 

Aman Rastogi edited comment on SPARK-32778 at 9/20/20, 4:22 PM:


This does not look like a bug; it is by design: 
[saveAsTable|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html#saveAsTable(tableName:String):Unit]

Should we consider this an improvement?

 

[~hyukjin.kwon] [~maropu]

 


was (Author: amanr):
This does not look like a bug, it is by design [link 
saveAsTable|#saveAsTable(tableName:String):Unit].]

Should we consider this as improvement?

 

[~hyukjin.kwon] [~maropu]

 

> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", 
> "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present in the path "/already/existing/path". This 
> happened because the table was not already in the Hive metastore but the given 
> path had data. If the table is not present in the Hive metastore, the SaveMode is 
> internally changed to SaveMode.Overwrite irrespective of what the user has 
> provided, which leads to data deletion. This change was introduced as part of 
> https://issues.apache.org/jira/browse/SPARK-19583. 
> Now, suppose the user is not using an external Hive metastore (the metastore is 
> associated with a cluster) and the cluster goes down, or for some reason the 
> user has to migrate to a new cluster. Once the user tries to save data using the 
> above code on the new cluster, it will first delete the data. It could be 
> production data, and the user is completely unaware of it, having provided 
> SaveMode.Append or ErrorIfExists. This would be accidental data deletion.
>  
> Repro Steps:
>  # Save data through a Hive table as in the code above.
>  # Create another cluster and save data in a new table on the new cluster, 
> giving the same path.
>  
> Proposed Fix:
> Instead of modifying the SaveMode to Overwrite, we should modify it to 
> ErrorIfExists in the class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = 
> false)
>  
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, 
> tableExists = false){code}
> This should not break CTAS. Even in the case of CTAS, the user may not want to 
> delete data that already exists, as that could be accidental.
>  






[jira] [Comment Edited] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-20 Thread Aman Rastogi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199041#comment-17199041
 ] 

Aman Rastogi edited comment on SPARK-32778 at 9/20/20, 4:22 PM:


This does not look like a bug; it is by design: 
[saveAsTable|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html#saveAsTable(tableName:String):Unit]

Should we consider this an improvement?

 

[~hyukjin.kwon] [~maropu]

 


was (Author: amanr):
This does not look like a bug, it is by design [link 
title|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html#saveAsTable(tableName:String):Unit]

Should we consider this as improvement?

 

[~hyukjin.kwon] [~maropu]

 

> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", 
> "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present in the path "/already/existing/path". This 
> happened because the table was not already in the Hive metastore but the given 
> path had data. If the table is not present in the Hive metastore, the SaveMode is 
> internally changed to SaveMode.Overwrite irrespective of what the user has 
> provided, which leads to data deletion. This change was introduced as part of 
> https://issues.apache.org/jira/browse/SPARK-19583. 
> Now, suppose the user is not using an external Hive metastore (the metastore is 
> associated with a cluster) and the cluster goes down, or for some reason the 
> user has to migrate to a new cluster. Once the user tries to save data using the 
> above code on the new cluster, it will first delete the data. It could be 
> production data, and the user is completely unaware of it, having provided 
> SaveMode.Append or ErrorIfExists. This would be accidental data deletion.
>  
> Repro Steps:
>  # Save data through a Hive table as in the code above.
>  # Create another cluster and save data in a new table on the new cluster, 
> giving the same path.
>  
> Proposed Fix:
> Instead of modifying the SaveMode to Overwrite, we should modify it to 
> ErrorIfExists in the class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = 
> false)
>  
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, 
> tableExists = false){code}
> This should not break CTAS. Even in the case of CTAS, the user may not want to 
> delete data that already exists, as that could be accidental.
>  






[jira] [Comment Edited] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-20 Thread Aman Rastogi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199041#comment-17199041
 ] 

Aman Rastogi edited comment on SPARK-32778 at 9/20/20, 4:21 PM:


This does not look like a bug; it is by design: 
[saveAsTable|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html#saveAsTable(tableName:String):Unit]

Should we consider this an improvement?

 

[~hyukjin.kwon] [~maropu]

 


was (Author: amanr):
This does not look like a bug, it is by design [link 
saveAsTable|[http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html#saveAsTable(tableName:String):Unit].]

Should we consider this as improvement?

 

> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", 
> "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present in the path "/already/existing/path". This 
> happened because the table was not already in the Hive metastore but the given 
> path had data. If the table is not present in the Hive metastore, the SaveMode is 
> internally changed to SaveMode.Overwrite irrespective of what the user has 
> provided, which leads to data deletion. This change was introduced as part of 
> https://issues.apache.org/jira/browse/SPARK-19583. 
> Now, suppose the user is not using an external Hive metastore (the metastore is 
> associated with a cluster) and the cluster goes down, or for some reason the 
> user has to migrate to a new cluster. Once the user tries to save data using the 
> above code on the new cluster, it will first delete the data. It could be 
> production data, and the user is completely unaware of it, having provided 
> SaveMode.Append or ErrorIfExists. This would be accidental data deletion.
>  
> Repro Steps:
>  # Save data through a Hive table as in the code above.
>  # Create another cluster and save data in a new table on the new cluster, 
> giving the same path.
>  
> Proposed Fix:
> Instead of modifying the SaveMode to Overwrite, we should modify it to 
> ErrorIfExists in the class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = 
> false)
>  
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, 
> tableExists = false){code}
> This should not break CTAS. Even in the case of CTAS, the user may not want to 
> delete data that already exists, as that could be accidental.
>  






[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable

2020-09-20 Thread Aman Rastogi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199041#comment-17199041
 ] 

Aman Rastogi commented on SPARK-32778:
--

This does not look like a bug; it is by design: 
[saveAsTable|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html#saveAsTable(tableName:String):Unit]

Should we consider this an improvement?

 

> Accidental Data Deletion on calling saveAsTable
> ---
>
> Key: SPARK-32778
> URL: https://issues.apache.org/jira/browse/SPARK-32778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Aman Rastogi
>Priority: Major
>
> {code:java}
> df.write.option("path", 
> "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table)
> {code}
> The code above deleted the data present in the path "/already/existing/path". This 
> happened because the table was not already in the Hive metastore but the given 
> path had data. If the table is not present in the Hive metastore, the SaveMode is 
> internally changed to SaveMode.Overwrite irrespective of what the user has 
> provided, which leads to data deletion. This change was introduced as part of 
> https://issues.apache.org/jira/browse/SPARK-19583. 
> Now, suppose the user is not using an external Hive metastore (the metastore is 
> associated with a cluster) and the cluster goes down, or for some reason the 
> user has to migrate to a new cluster. Once the user tries to save data using the 
> above code on the new cluster, it will first delete the data. It could be 
> production data, and the user is completely unaware of it, having provided 
> SaveMode.Append or ErrorIfExists. This would be accidental data deletion.
>  
> Repro Steps:
>  # Save data through a Hive table as in the code above.
>  # Create another cluster and save data in a new table on the new cluster, 
> giving the same path.
>  
> Proposed Fix:
> Instead of modifying the SaveMode to Overwrite, we should modify it to 
> ErrorIfExists in the class CreateDataSourceTableAsSelectCommand.
> Change (line 154)
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = 
> false)
>  
> {code}
> to
>  
> {code:java}
> val result = saveDataIntoTable(
>  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, 
> tableExists = false){code}
> This should not break CTAS. Even in the case of CTAS, the user may not want to 
> delete data that already exists, as that could be accidental.
>  


