[jira] [Created] (SPARK-37660) Spark-3.2.0 Fetch Hbase Data not working

2021-12-15 Thread Bhavya Raj Sharma (Jira)
Bhavya Raj Sharma created SPARK-37660:
-

 Summary: Spark-3.2.0 Fetch Hbase Data not working
 Key: SPARK-37660
 URL: https://issues.apache.org/jira/browse/SPARK-37660
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
 Environment: Hadoop version : hadoop-2.9.2

HBase version : hbase-2.2.5

Spark version : spark-3.2.0-bin-without-hadoop

java version : jdk1.8.0_151

scala version : scala-sdk-2.12.10

os version : Red Hat Enterprise Linux Server release 6.6 (Santiago)
Reporter: Bhavya Raj Sharma


Below is the sample code snippet that is used to fetch data from HBase. This 
used to work fine with spark-3.1.1.

However, after upgrading to spark-3.2.0 it is not working. The issue is that it does 
not throw any exception; it just does not fill the RDD.

 
{code:scala}
import java.util.Base64

import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{NewHadoopRDD, RDD}

// SparkLoggerParams and SparkHBaseContext are application-specific helpers
// from our code base, kept as in the original snippet.

def getInfo(sc: SparkContext, startDate: String, cachingValue: Int,
    sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): RDD[String] = {
  val scan = new Scan
  // HBase expects the column family and qualifier as byte arrays.
  scan.addFamily(Bytes.toBytes("family"))
  scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
  val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", scan,
    cachingValue, sparkLoggerParams)
  // Keep only the row key of every Result.
  val output: RDD[String] = rdd.map { row =>
    Bytes.toString(row._2.getRow)
  }
  output
}

def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: String,
    tableName: String, scan: Scan, cachingValue: Int,
    sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, Result] = {
  scan.setCaching(cachingValue)
  // TableInputFormat expects the Scan serialized as a Base64 string.
  val scanString = Base64.getEncoder.encodeToString(
    org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray)
  val hbaseContext = new SparkHBaseContext(zkIP, zkPort)
  val hbaseConfig = hbaseContext.getConfiguration()
  hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName)
  hbaseConfig.set(TableInputFormat.SCAN, scanString)
  sc.newAPIHadoopRDD(
    hbaseConfig,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result]
  ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]]
}
{code}
 

If we fetch using the Scan directly through the HBase client API, without going through newAPIHadoopRDD, it works.
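
For reference, a minimal sketch of the direct-scan path that works, using the plain HBase client API (the connection and ZooKeeper settings here are assumptions; the table, family, and qualifier names match the snippet above):
{code:scala}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes

import scala.collection.JavaConverters._

def scanRowKeysDirectly(zkIP: String, zkPort: String, tableName: String): Seq[String] = {
  val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.quorum", zkIP)
  conf.set("hbase.zookeeper.property.clientPort", zkPort)
  val connection = ConnectionFactory.createConnection(conf)
  try {
    val table = connection.getTable(TableName.valueOf(tableName))
    val scan = new Scan()
    scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
    val scanner = table.getScanner(scan)
    try {
      // Row keys come back as expected when going through the plain client API.
      scanner.iterator().asScala.map(result => Bytes.toString(result.getRow)).toList
    } finally {
      scanner.close()
    }
  } finally {
    connection.close()
  }
}
{code}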






[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Issue Type: Improvement  (was: Bug)

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format was imported from Databricks' 
> [spark-csv|https://github.com/databricks/spark-csv] in issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766].
> {*}For the nullValue option{*}, according to the features described in the spark-csv 
> readme file, it is designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can see that null columns in the dataframe are saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files are parsed back as 
> null columns*{color} in the dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the nullValue option in the underlying univocity component's 
> {*}_CommonSettings_{*} is designed as:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing (empty) content is 
> converted to the nullValue string. But in Spark, we finally convert both nothing 
> content and nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> Now let's turn to emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and an emptyValueInWrite option for writing. I found that Spark keeps 
> the same behavior for emptyValue as univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the emptyValue is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can see that empty columns in the dataframe are saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files cannot be 
> parsed back as empty columns{color}* in the dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we convert both nothing content and nullValue strings to 
> null in the dataframe, but for emptyValue, we only try to convert "\"\"" (quoted 
> empty strings) to emptyValue strings rather than converting both "\"\"" (quoted 
> empty strings) and emptyValue strings to "" (empty) in the dataframe.
> I think it's better to change emptyValueInRead's effect so that any field 
> matching this string is set to "" when reading csv files, keeping the read 
> behavior of emptyValue symmetric with that of nullValue.
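> As a hedged illustration (this is the expected behavior after the proposed change, not what Spark does today), the emptyValue round trip would then mirror the nullValue one:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment")
>   .write.option("emptyValue", "EMPTY").csv(path)   // saved row: Tesla,EMPTY
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> // Expected after the change: the "EMPTY" field is parsed back as "" (empty string),
> // just as "NULL" is parsed back as null with the nullValue option.
> {code}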

[jira] [Updated] (SPARK-37060) Report driver status does not handle response from backup masters

2021-12-15 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-37060:
-
Fix Version/s: 3.1.3
   3.2.1

> Report driver status does not handle response from backup masters
> -
>
> Key: SPARK-37060
> URL: https://issues.apache.org/jira/browse/SPARK-37060
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
>Reporter: Mohamadreza Rostami
>Assignee: Mohamadreza Rostami
>Priority: Critical
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> After an improvement in SPARK-31486, the contributor uses the 
> 'asyncSendToMasterAndForwardReply' method instead of 
> 'activeMasterEndpoint.askSync' to get the status of the driver. Since the 
> driver's status is only available on the active master and the 
> 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we 
> have to handle the responses from the backup masters in the client, which the 
> SPARK-31486 change did not consider. So drivers running in 
> cluster mode on a multi-master cluster are affected by this bug. I 
> created a patch for this bug and will send the pull request soon.
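> A minimal, self-contained sketch of the idea, using hypothetical types rather than Spark's actual deploy messages: the client gathers replies from every master but only acts on the reply that actually knows the driver, ignoring the backup masters instead of failing on their responses.
> {code:scala}
> import scala.concurrent.{Await, Future}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration._
> 
> // Hypothetical reply type: backup (standby) masters answer with found = false.
> case class DriverStatusReply(master: String, found: Boolean, state: Option[String])
> 
> def reportDriverStatus(replies: Seq[Future[DriverStatusReply]]): Unit = {
>   val all = Await.result(Future.sequence(replies), 10.seconds)
>   all.find(_.found) match {
>     case Some(reply) =>
>       // Only the active master tracks the driver, so report its answer.
>       println(s"Driver state from ${reply.master}: ${reply.state.getOrElse("UNKNOWN")}")
>     case None =>
>       println("No master reported the driver; it may not exist.")
>   }
> }
> 
> // Usage: one active master and two backups.
> reportDriverStatus(Seq(
>   Future.successful(DriverStatusReply("master-1", found = true, Some("RUNNING"))),
>   Future.successful(DriverStatusReply("master-2", found = false, None)),
>   Future.successful(DriverStatusReply("master-3", found = false, None))
> ))
> {code}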






[jira] [Commented] (SPARK-37659) Fix FsHistoryProvider race condition between list and delet log info

2021-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460448#comment-17460448
 ] 

Apache Spark commented on SPARK-37659:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34919

> Fix FsHistoryProvider race condition between list and delet log info
> 
>
> Key: SPARK-37659
> URL: https://issues.apache.org/jira/browse/SPARK-37659
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> After SPARK-29043, FsHistoryProvider lists the log info without waiting for 
> all `mergeApplicationListing` tasks to finish.
> However, the `LevelDBIterator` used to list log info is not thread-safe if 
> other threads delete the related log info at the same time.
> The error message is:
> {code:java}
> 21/12/15 14:12:02 ERROR FsHistoryProvider: Exception in checking for event 
> log updates
> java.util.NoSuchElementException: 
> 1^@__main__^@+hdfs://xxx/application_xxx.inprogress
> at org.apache.spark.util.kvstore.LevelDB.get(LevelDB.java:132)
> at 
> org.apache.spark.util.kvstore.LevelDBIterator.next(LevelDBIterator.java:137)
> at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
> at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
> at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
> at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
> at scala.collection.TraversableLike.to(TraversableLike.scala:678)
> at scala.collection.TraversableLike.to$(TraversableLike.scala:675)
> at scala.collection.AbstractTraversable.to(Traversable.scala:108)
> at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299)
> at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299)
> at scala.collection.AbstractTraversable.toList(Traversable.scala:108)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:588)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:299)
> {code}






[jira] [Assigned] (SPARK-37659) Fix FsHistoryProvider race condition between list and delet log info

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37659:


Assignee: (was: Apache Spark)

> Fix FsHistoryProvider race condition between list and delet log info
> 
>
> Key: SPARK-37659
> URL: https://issues.apache.org/jira/browse/SPARK-37659
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> After SPARK-29043, FsHistoryProvider lists the log info without waiting for 
> all `mergeApplicationListing` tasks to finish.
> However, the `LevelDBIterator` used to list log info is not thread-safe if 
> other threads delete the related log info at the same time.
> The error message is:
> {code:java}
> 21/12/15 14:12:02 ERROR FsHistoryProvider: Exception in checking for event 
> log updates
> java.util.NoSuchElementException: 
> 1^@__main__^@+hdfs://xxx/application_xxx.inprogress
> at org.apache.spark.util.kvstore.LevelDB.get(LevelDB.java:132)
> at 
> org.apache.spark.util.kvstore.LevelDBIterator.next(LevelDBIterator.java:137)
> at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
> at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
> at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
> at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
> at scala.collection.TraversableLike.to(TraversableLike.scala:678)
> at scala.collection.TraversableLike.to$(TraversableLike.scala:675)
> at scala.collection.AbstractTraversable.to(Traversable.scala:108)
> at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299)
> at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299)
> at scala.collection.AbstractTraversable.toList(Traversable.scala:108)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:588)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:299)
> {code}






[jira] [Assigned] (SPARK-37659) Fix FsHistoryProvider race condition between list and delet log info

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37659:


Assignee: Apache Spark

> Fix FsHistoryProvider race condition between list and delet log info
> 
>
> Key: SPARK-37659
> URL: https://issues.apache.org/jira/browse/SPARK-37659
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-29043, FsHistoryProvider lists the log info without waiting for 
> all `mergeApplicationListing` tasks to finish.
> However, the `LevelDBIterator` used to list log info is not thread-safe if 
> other threads delete the related log info at the same time.
> The error message is:
> {code:java}
> 21/12/15 14:12:02 ERROR FsHistoryProvider: Exception in checking for event 
> log updates
> java.util.NoSuchElementException: 
> 1^@__main__^@+hdfs://xxx/application_xxx.inprogress
> at org.apache.spark.util.kvstore.LevelDB.get(LevelDB.java:132)
> at 
> org.apache.spark.util.kvstore.LevelDBIterator.next(LevelDBIterator.java:137)
> at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
> at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
> at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
> at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
> at scala.collection.TraversableLike.to(TraversableLike.scala:678)
> at scala.collection.TraversableLike.to$(TraversableLike.scala:675)
> at scala.collection.AbstractTraversable.to(Traversable.scala:108)
> at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299)
> at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299)
> at scala.collection.AbstractTraversable.toList(Traversable.scala:108)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:588)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:299)
> {code}






[jira] [Updated] (SPARK-37656) Upgrade SBT to 1.5.7

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37656:
-
Fix Version/s: 3.2.1

> Upgrade SBT to 1.5.7
> 
>
> Key: SPARK-37656
> URL: https://issues.apache.org/jira/browse/SPARK-37656
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> SBT 1.5.7 was released a few hours ago, which includes a fix for 
> CVE-2021-45046.
> https://github.com/sbt/sbt/releases/tag/v1.5.7






[jira] [Resolved] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7

2021-12-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37658.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34917
[https://github.com/apache/spark/pull/34917]

> Skip PIP packaging test if Python version is lower than 3.7
> ---
>
> Key: SPARK-37658
> URL: https://issues.apache.org/jira/browse/SPARK-37658
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> {code}
> Writing pyspark-3.3.0.dev0/setup.cfg
> Creating tar archive
> removing 'pyspark-3.3.0.dev0' (and everything under it)
> Installing dist into virtual env
> Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python
> pyspark requires Python '>=3.7' but the running Python is 3.6.8
> Cleaning up temporary directory - /tmp/tmp.CCragmNU1X
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; 
> received return code 1
> Attempting to post to GitHub...
> {code}
> After we dropped Python 3.6 in SPARK-37632, the PIP packaging test fails 
> intermittently: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console.
>  Apparently, different Python versions are installed. We should skip the test 
> in these cases.






[jira] [Assigned] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7

2021-12-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37658:
-

Assignee: Hyukjin Kwon

> Skip PIP packaging test if Python version is lower than 3.7
> ---
>
> Key: SPARK-37658
> URL: https://issues.apache.org/jira/browse/SPARK-37658
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> {code}
> Writing pyspark-3.3.0.dev0/setup.cfg
> Creating tar archive
> removing 'pyspark-3.3.0.dev0' (and everything under it)
> Installing dist into virtual env
> Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python
> pyspark requires Python '>=3.7' but the running Python is 3.6.8
> Cleaning up temporary directory - /tmp/tmp.CCragmNU1X
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; 
> received return code 1
> Attempting to post to GitHub...
> {code}
> After we dropped Python 3.6 in SPARK-37632, the PIP packaging test fails 
> intermittently: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console.
>  Apparently, different Python versions are installed. We should skip the test 
> in these cases.






[jira] [Updated] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32294:
-
Fix Version/s: 3.2.0
   (was: 3.3.0)

> GroupedData Pandas UDF 2Gb limit
> 
>
> Key: SPARK-32294
> URL: https://issues.apache.org/jira/browse/SPARK-32294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
> Fix For: 3.2.0
>
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for 
> GroupedData: the whole group is passed to the Pandas UDF at once, which can hit 
> various 2Gb limitations on the Arrow side (and, in current versions of Arrow, also 
> the 2Gb limitation on the Netty allocator side) - 
> https://issues.apache.org/jira/browse/ARROW-4890 
> It would be great to consider feeding GroupedData into a pandas UDF in batches 
> to solve this issue. 
> cc [~hyukjin.kwon] 
>  






[jira] [Created] (SPARK-37659) Fix FsHistoryProvider race condition between list and delet log info

2021-12-15 Thread XiDuo You (Jira)
XiDuo You created SPARK-37659:
-

 Summary: Fix FsHistoryProvider race condition between list and 
delet log info
 Key: SPARK-37659
 URL: https://issues.apache.org/jira/browse/SPARK-37659
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.1.2, 3.2.1, 3.3.0
Reporter: XiDuo You


After SPARK-29043, FsHistoryProvider lists the log info without waiting for 
all `mergeApplicationListing` tasks to finish.

However, the `LevelDBIterator` used to list log info is not thread-safe if other 
threads delete the related log info at the same time.

The error message is:
{code:java}
21/12/15 14:12:02 ERROR FsHistoryProvider: Exception in checking for event log 
updates
java.util.NoSuchElementException: 
1^@__main__^@+hdfs://xxx/application_xxx.inprogress
at org.apache.spark.util.kvstore.LevelDB.get(LevelDB.java:132)
at 
org.apache.spark.util.kvstore.LevelDBIterator.next(LevelDBIterator.java:137)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at 
scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
at 
scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
at scala.collection.TraversableLike.to(TraversableLike.scala:678)
at scala.collection.TraversableLike.to$(TraversableLike.scala:675)
at scala.collection.AbstractTraversable.to(Traversable.scala:108)
at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299)
at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299)
at scala.collection.AbstractTraversable.toList(Traversable.scala:108)
at 
org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:588)
at 
org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:299)
{code}
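
A minimal, self-contained sketch of the failure pattern and a defensive fix, using a hypothetical in-memory store rather than the real FsHistoryProvider/LevelDB classes: one thread deletes log entries while another is still resolving the listed keys, so lookups of concurrently deleted entries must be tolerated instead of aborting the whole pass.
{code:scala}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Hypothetical stand-in for the KV store: get() fails for deleted entries, like
// the NoSuchElementException thrown by LevelDB.get in the trace above.
class LogStore {
  private val entries = new ConcurrentHashMap[String, String]().asScala
  def put(path: String, info: String): Unit = entries.put(path, info)
  def delete(path: String): Unit = entries.remove(path)
  def listedKeys: Seq[String] = entries.keys.toSeq
  def get(path: String): String =
    entries.getOrElse(path, throw new NoSuchElementException(path))
}

val store = new LogStore
(1 to 1000).foreach(i => store.put(s"hdfs://xxx/application_$i.inprogress", s"info-$i"))

// A cleaner thread deletes log info while the listing is still being resolved.
val cleaner = new Thread(() =>
  (1 to 1000).foreach(i => store.delete(s"hdfs://xxx/application_$i.inprogress")))
cleaner.start()

// Defensive listing: skip entries deleted concurrently instead of letting
// NoSuchElementException abort the whole checkForLogs pass.
val resolved = store.listedKeys.flatMap { path =>
  try Some(store.get(path))
  catch { case _: NoSuchElementException => None }
}
cleaner.join()
println(s"Resolved ${resolved.size} entries without failing")
{code}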







[jira] [Resolved] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type

2021-12-15 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-34521.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34509
[https://github.com/apache/spark/pull/34509]

> spark.createDataFrame does not support Pandas StringDtype extension type
> 
>
> Key: SPARK-34521
> URL: https://issues.apache.org/jira/browse/SPARK-34521
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Pavel Ganelin
>Priority: Major
> Fix For: 3.3.0
>
>
> The following test case demonstrates the problem:
> {code:java}
> import pandas as pd
> from pyspark.sql import SparkSession, types
> spark = SparkSession.builder.appName(__file__)\
> .config("spark.sql.execution.arrow.pyspark.enabled","true") \
> .getOrCreate()
> good = pd.DataFrame([["abc"]], columns=["col"])
> schema = types.StructType([types.StructField("col", types.StringType(), 
> True)])
> df = spark.createDataFrame(good, schema=schema)
> df.show()
> bad = good.copy()
> bad["col"]=bad["col"].astype("string")
> schema = types.StructType([types.StructField("col", types.StringType(), 
> True)])
> df = spark.createDataFrame(bad, schema=schema)
> df.show(){code}
> The error:
> {code:java}
> C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: 
> UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Cannot specify a mask or a size when passing an object that is converted 
> with the __arrow_array__ protocol.
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> {code}






[jira] [Commented] (SPARK-37483) Support push down top N to JDBC data source V2

2021-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460427#comment-17460427
 ] 

Apache Spark commented on SPARK-37483:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34918

> Support push down top N to JDBC data source V2
> --
>
> Key: SPARK-37483
> URL: https://issues.apache.org/jira/browse/SPARK-37483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
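> A minimal example of the kind of query that could benefit (the JDBC URL, table, and column names below are hypothetical): with top N pushdown, the sort plus limit can be translated into an ORDER BY ... LIMIT query executed by the database instead of in Spark.
> {code:scala}
> import org.apache.spark.sql.functions.col
> 
> val df = spark.read
>   .format("jdbc")
>   .option("url", "jdbc:postgresql://db-host:5432/testdb")  // hypothetical
>   .option("dbtable", "employees")                          // hypothetical
>   .option("user", "user")
>   .option("password", "password")
>   .load()
> 
> // Without pushdown Spark fetches the rows and sorts/limits them itself; with
> // top N pushdown this can become
> // "SELECT ... FROM employees ORDER BY salary DESC LIMIT 10" on the database side.
> df.orderBy(col("salary").desc).limit(10).show()
> {code}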







[jira] [Updated] (SPARK-37657) Fix the bug in ps.(Series|DataFrame).describe()

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37657:
-
Description: 
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888]

 

The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work 
properly when DataFrame has no numeric column.

 

 
{code:java}
>>> df = ps.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
raise ValueError("Cannot describe a DataFrame without columns")
ValueError: Cannot describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.

  was:
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work 
properly when DataFrame has no numeric column.

 

 
{code:java}
>>> df = ps.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
raise ValueError("Cannot describe a DataFrame without columns")
ValueError: Cannot describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.


> Fix the bug in ps.(Series|DataFrame).describe()
> ---
>
> Key: SPARK-37657
> URL: https://issues.apache.org/jira/browse/SPARK-37657
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Initialized in Koalas issue: 
> [https://github.com/databricks/koalas/issues/1888]
>  
> The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work 
> properly when DataFrame has no numeric column.
>  
>  
> {code:java}
> >>> df = ps.DataFrame({'a': ["a", "b", "c"]})
> >>> df.describe()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
> raise ValueError("Cannot describe a DataFrame without columns")
> ValueError: Cannot describe a DataFrame without columns 
> {code}
>  
> As it works fine in pandas, we should fix it.






[jira] [Updated] (SPARK-37657) Fix the bug in ps.(Series|DataFrame).describe()

2021-12-15 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37657:

Summary: Fix the bug in ps.(Series|DataFrame).describe()  (was: Fix the bug 
in ps.DataFrame.describe())

> Fix the bug in ps.(Series|DataFrame).describe()
> ---
>
> Key: SPARK-37657
> URL: https://issues.apache.org/jira/browse/SPARK-37657
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Initialized in Koalas issue: 
> [https://github.com/databricks/koalas/issues/1888.]
>  
> The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
> DataFrame has no numeric column.
>  
>  
> {code:java}
> >>> df = ps.DataFrame({'a': ["a", "b", "c"]})
> >>> df.describe()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
> raise ValueError("Cannot describe a DataFrame without columns")
> ValueError: Cannot describe a DataFrame without columns 
> {code}
>  
> As it works fine in pandas, we should fix it.






[jira] [Updated] (SPARK-37657) Fix the bug in ps.(Series|DataFrame).describe()

2021-12-15 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37657:

Description: 
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work 
properly when DataFrame has no numeric column.

 

 
{code:java}
>>> df = ps.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
raise ValueError("Cannot describe a DataFrame without columns")
ValueError: Cannot describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.

  was:
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
DataFrame has no numeric column.

 

 
{code:java}
>>> df = ps.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
raise ValueError("Cannot describe a DataFrame without columns")
ValueError: Cannot describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.


> Fix the bug in ps.(Series|DataFrame).describe()
> ---
>
> Key: SPARK-37657
> URL: https://issues.apache.org/jira/browse/SPARK-37657
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Initialized in Koalas issue: 
> [https://github.com/databricks/koalas/issues/1888.]
>  
> The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work 
> properly when DataFrame has no numeric column.
>  
>  
> {code:java}
> >>> df = ps.DataFrame({'a': ["a", "b", "c"]})
> >>> df.describe()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
> raise ValueError("Cannot describe a DataFrame without columns")
> ValueError: Cannot describe a DataFrame without columns 
> {code}
>  
> As it works fine in pandas, we should fix it.






[jira] [Resolved] (SPARK-37656) Upgrade SBT to 1.5.7

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37656.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34915
[https://github.com/apache/spark/pull/34915]

> Upgrade SBT to 1.5.7
> 
>
> Key: SPARK-37656
> URL: https://issues.apache.org/jira/browse/SPARK-37656
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> SBT 1.5.7 was released a few hours ago, which includes a fix for 
> CVE-2021-45046.
> https://github.com/sbt/sbt/releases/tag/v1.5.7






[jira] [Assigned] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37651:


Assignee: Xinrong Meng

> Use existing active Spark session in all places of pandas API on Spark
> --
>
> Key: SPARK-37651
> URL: https://issues.apache.org/jira/browse/SPARK-37651
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Use existing active Spark session instead of SparkSession.getOrCreate in all 
> places in pandas API on Spark
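> The actual change is on the pandas-API-on-Spark (Python) side; as a rough sketch of the pattern in Scala terms, prefer an already-active session and only fall back to the builder when none exists:
> {code:scala}
> import org.apache.spark.sql.SparkSession
> 
> // Reuse the session already active on this thread (or the default session)
> // instead of unconditionally calling SparkSession.builder.getOrCreate().
> val spark: SparkSession =
>   SparkSession.getActiveSession
>     .orElse(SparkSession.getDefaultSession)
>     .getOrElse(SparkSession.builder().getOrCreate())
> {code}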






[jira] [Resolved] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37651.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34906
[https://github.com/apache/spark/pull/34906]

> Use existing active Spark session in all places of pandas API on Spark
> --
>
> Key: SPARK-37651
> URL: https://issues.apache.org/jira/browse/SPARK-37651
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Use existing active Spark session instead of SparkSession.getOrCreate in all 
> places in pandas API on Spark






[jira] [Assigned] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37658:


Assignee: Apache Spark

> Skip PIP packaging test if Python version is lower than 3.7
> ---
>
> Key: SPARK-37658
> URL: https://issues.apache.org/jira/browse/SPARK-37658
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> Writing pyspark-3.3.0.dev0/setup.cfg
> Creating tar archive
> removing 'pyspark-3.3.0.dev0' (and everything under it)
> Installing dist into virtual env
> Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python
> pyspark requires Python '>=3.7' but the running Python is 3.6.8
> Cleaning up temporary directory - /tmp/tmp.CCragmNU1X
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; 
> received return code 1
> Attempting to post to GitHub...
> {code}
> After we dropped Python 3.6 in SPARK-37632, the PIP packaging test fails 
> intermittently: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console.
>  Apparently, different Python versions are installed. We should skip the test 
> in these cases.






[jira] [Assigned] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37658:


Assignee: (was: Apache Spark)

> Skip PIP packaging test if Python version is lower than 3.7
> ---
>
> Key: SPARK-37658
> URL: https://issues.apache.org/jira/browse/SPARK-37658
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> Writing pyspark-3.3.0.dev0/setup.cfg
> Creating tar archive
> removing 'pyspark-3.3.0.dev0' (and everything under it)
> Installing dist into virtual env
> Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python
> pyspark requires Python '>=3.7' but the running Python is 3.6.8
> Cleaning up temporary directory - /tmp/tmp.CCragmNU1X
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; 
> received return code 1
> Attempting to post to GitHub...
> {code}
> After we dropped Python 3.6 in SPARK-37632, the PIP packaging test fails 
> intermittently: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console.
>  Apparently, different Python versions are installed. We should skip the test 
> in these cases.






[jira] [Commented] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7

2021-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460397#comment-17460397
 ] 

Apache Spark commented on SPARK-37658:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34917

> Skip PIP packaging test if Python version is lower than 3.7
> ---
>
> Key: SPARK-37658
> URL: https://issues.apache.org/jira/browse/SPARK-37658
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> Writing pyspark-3.3.0.dev0/setup.cfg
> Creating tar archive
> removing 'pyspark-3.3.0.dev0' (and everything under it)
> Installing dist into virtual env
> Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python
> pyspark requires Python '>=3.7' but the running Python is 3.6.8
> Cleaning up temporary directory - /tmp/tmp.CCragmNU1X
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; 
> received return code 1
> Attempting to post to GitHub...
> {code}
> After we dropped Python 3.6 in SPARK-37632, the PIP packaging test fails 
> intermittently: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console.
>  Apparently, different Python versions are installed. We should skip the test 
> in these cases.






[jira] [Created] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7

2021-12-15 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-37658:


 Summary: Skip PIP packaging test if Python version is lower than 
3.7
 Key: SPARK-37658
 URL: https://issues.apache.org/jira/browse/SPARK-37658
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, PySpark
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


{code}
Writing pyspark-3.3.0.dev0/setup.cfg
Creating tar archive
removing 'pyspark-3.3.0.dev0' (and everything under it)
Installing dist into virtual env
Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python
pyspark requires Python '>=3.7' but the running Python is 3.6.8
Cleaning up temporary directory - /tmp/tmp.CCragmNU1X
[error] running 
/home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; received 
return code 1
Attempting to post to GitHub...
{code}

After we dropped Python 3.6 in SPARK-37632, the PIP packaging test fails 
intermittently: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console.
 Apparently, different Python versions are installed. We should skip the test 
in these cases.






[jira] [Resolved] (SPARK-37632) Drop code targetting Python < 3.7

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37632.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34886
[https://github.com/apache/spark/pull/34886]

> Drop code targetting Python < 3.7
> -
>
> Key: SPARK-37632
> URL: https://issues.apache.org/jira/browse/SPARK-37632
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> We should drop parts of code that target Python 3.6 (or less if these are 
> still present).






[jira] [Commented] (SPARK-35355) improve execution performance in insert...select...limit case

2021-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460375#comment-17460375
 ] 

Apache Spark commented on SPARK-35355:
--

User 'yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/34916

> improve execution performance in insert...select...limit  case
> --
>
> Key: SPARK-35355
> URL: https://issues.apache.org/jira/browse/SPARK-35355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the case of `insert into...select...limit`, `CollectLimitExec` has better 
> execution performance than `GlobalLimit`.
> Before:
> {code:java}
> == Physical Plan ==
>  Execute InsertIntoHadoopFsRelationCommand ...
>  +- *(2) GlobalLimit 5
>  +- Exchange SinglePartition, true, id=#39
>  +- *(1) LocalLimit 5
>  +- *(1) ColumnarToRow
>  +- FileScan ...
> {code}
> After:
> {code:java}
> == Physical Plan ==
>  Execute InsertIntoHadoopFsRelationCommand ...
>  +- CollectLimit 5
>  +- *(1) ColumnarToRow
>  +- FileScan 
> {code}






[jira] [Assigned] (SPARK-37632) Drop code targetting Python < 3.7

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37632:


Assignee: Maciej Szymkiewicz

> Drop code targetting Python < 3.7
> -
>
> Key: SPARK-37632
> URL: https://issues.apache.org/jira/browse/SPARK-37632
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> We should drop parts of code that target Python 3.6 (or less if these are 
> still present).






[jira] [Updated] (SPARK-37657) Fix the bug in ps.DataFrame.describe()

2021-12-15 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37657:

Description: 
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
DataFrame has no numeric column.

 

 
{code:java}
>>> df = ps.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last):
 File "", line 1, in 
 File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
  raise ValueError("Cannot describe a DataFrame without columns")
ValueError: Cannot describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.

  was:
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
DataFrame has no numeric column.

 

 
{code:java}
>>> df = ps.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last): File "", line 1, in  File 
"/.../python/pyspark/pandas/frame.py", line 7582, in describe raise 
ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot 
describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.


> Fix the bug in ps.DataFrame.describe()
> --
>
> Key: SPARK-37657
> URL: https://issues.apache.org/jira/browse/SPARK-37657
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Initialized in Koalas issue: 
> [https://github.com/databricks/koalas/issues/1888.]
>  
> The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
> DataFrame has no numeric column.
>  
>  
> {code:java}
> >>> df = ps.DataFrame({'a': ["a", "b", "c"]})
> >>> df.describe()
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
>   raise ValueError("Cannot describe a DataFrame without columns")
> ValueError: Cannot describe a DataFrame without columns 
> {code}
>  
> As it works fine in pandas, we should fix it.






[jira] [Updated] (SPARK-37657) Fix the bug in ps.DataFrame.describe()

2021-12-15 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37657:

Description: 
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
DataFrame has no numeric column.

 

 
{code:java}
>>> df = ps.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last): File "", line 1, in  File 
"/.../python/pyspark/pandas/frame.py", line 7582, in describe raise 
ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot 
describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.

  was:
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
DataFrame has no numeric column.

 

 
{code:java}
>>> df = ks.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most 
>>> recent call last): File "", line 1, in  File 
>>> "/.../koalas/databricks/koalas/frame.py", line 7582, in describe raise 
>>> ValueError("Cannot describe a DataFrame without columns") ValueError: 
>>> Cannot describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.


> Fix the bug in ps.DataFrame.describe()
> --
>
> Key: SPARK-37657
> URL: https://issues.apache.org/jira/browse/SPARK-37657
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Initialized in Koalas issue: 
> [https://github.com/databricks/koalas/issues/1888.]
>  
> The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
> DataFrame has no numeric column.
>  
>  
> {code:java}
> >>> df = ps.DataFrame({'a': ["a", "b", "c"]})
> >>> df.describe()
> Traceback (most recent call last): File "", line 1, in  File 
> "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise 
> ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot 
> describe a DataFrame without columns 
> {code}
>  
> As it works fine in pandas, we should fix it.






[jira] [Updated] (SPARK-37657) Fix the bug in ps.DataFrame.describe()

2021-12-15 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37657:

Description: 
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
DataFrame has no numeric column.

 

 
{code:java}
>>> df = ps.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
raise ValueError("Cannot describe a DataFrame without columns")
ValueError: Cannot describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.

  was:
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
DataFrame has no numeric column.

 

 
{code:java}
>>> df = ps.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last):
 File "", line 1, in 
 File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
  raise ValueError("Cannot describe a DataFrame without columns")
ValueError: Cannot describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.


> Fix the bug in ps.DataFrame.describe()
> --
>
> Key: SPARK-37657
> URL: https://issues.apache.org/jira/browse/SPARK-37657
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Initialized in Koalas issue: 
> [https://github.com/databricks/koalas/issues/1888.]
>  
> The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
> DataFrame has no numeric column.
>  
>  
> {code:java}
> >>> df = ps.DataFrame({'a': ["a", "b", "c"]})
> >>> df.describe()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../python/pyspark/pandas/frame.py", line 7582, in describe
> raise ValueError("Cannot describe a DataFrame without columns")
> ValueError: Cannot describe a DataFrame without columns 
> {code}
>  
> As it works fine in pandas, we should fix it.






[jira] [Commented] (SPARK-37657) Fix the bug in ps.DataFrame.describe()

2021-12-15 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460366#comment-17460366
 ] 

Haejoon Lee commented on SPARK-37657:
-

I'm working on this

> Fix the bug in ps.DataFrame.describe()
> --
>
> Key: SPARK-37657
> URL: https://issues.apache.org/jira/browse/SPARK-37657
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Initialized in Koalas issue: 
> [https://github.com/databricks/koalas/issues/1888.]
>  
> The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
> DataFrame has no numeric column.
>  
>  
> {code:java}
> >>> df = ks.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback 
> >>> (most recent call last): File "", line 1, in  File 
> >>> "/.../koalas/databricks/koalas/frame.py", line 7582, in describe raise 
> >>> ValueError("Cannot describe a DataFrame without columns") ValueError: 
> >>> Cannot describe a DataFrame without columns 
> {code}
>  
> As it works fine in pandas, we should fix it.






[jira] [Updated] (SPARK-37657) Fix the bug in ps.DataFrame.describe()

2021-12-15 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37657:

Description: 
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
the DataFrame has no numeric column.

 

 
{code:java}
>>> df = ks.DataFrame({'a': ["a", "b", "c"]})
>>> df.describe()
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../koalas/databricks/koalas/frame.py", line 7582, in describe
    raise ValueError("Cannot describe a DataFrame without columns")
ValueError: Cannot describe a DataFrame without columns 
{code}
 

As it works fine in pandas, we should fix it.

  was:
Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
the DataFrame has no numeric column.

 

As it works fine in pandas, we should fix it.


> Fix the bug in ps.DataFrame.describe()
> --
>
> Key: SPARK-37657
> URL: https://issues.apache.org/jira/browse/SPARK-37657
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Initialized in Koalas issue: 
> [https://github.com/databricks/koalas/issues/1888.]
>  
> The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
> the DataFrame has no numeric column.
>  
>  
> {code:java}
> >>> df = ks.DataFrame({'a': ["a", "b", "c"]})
> >>> df.describe()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../koalas/databricks/koalas/frame.py", line 7582, in describe
>     raise ValueError("Cannot describe a DataFrame without columns")
> ValueError: Cannot describe a DataFrame without columns 
> {code}
>  
> As it works fine in pandas, we should fix it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37657) Fix the bug in ps.DataFrame.describe()

2021-12-15 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-37657:
---

 Summary: Fix the bug in ps.DataFrame.describe()
 Key: SPARK-37657
 URL: https://issues.apache.org/jira/browse/SPARK-37657
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Haejoon Lee


Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.]

 

The `DataFrame.describe()` in pandas API on Spark doesn't work properly when 
the DataFrame has no numeric column.

 

As it works fine in pandas, we should fix it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37656) Upgrade SBT to 1.5.7

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37656:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Upgrade SBT to 1.5.7
> 
>
> Key: SPARK-37656
> URL: https://issues.apache.org/jira/browse/SPARK-37656
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> SBT 1.5.7 was released a few hours ago, which includes a fix for 
> CVE-2021-45046.
> https://github.com/sbt/sbt/releases/tag/v1.5.7



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37656) Upgrade SBT to 1.5.7

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37656:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Upgrade SBT to 1.5.7
> 
>
> Key: SPARK-37656
> URL: https://issues.apache.org/jira/browse/SPARK-37656
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SBT 1.5.7 was released a few hours ago, which includes a fix for 
> CVE-2021-45046.
> https://github.com/sbt/sbt/releases/tag/v1.5.7



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37656) Upgrade SBT to 1.5.7

2021-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460357#comment-17460357
 ] 

Apache Spark commented on SPARK-37656:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34915

> Upgrade SBT to 1.5.7
> 
>
> Key: SPARK-37656
> URL: https://issues.apache.org/jira/browse/SPARK-37656
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SBT 1.5.7 was released a few hours ago, which includes a fix for 
> CVE-2021-45046.
> https://github.com/sbt/sbt/releases/tag/v1.5.7



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37656) Upgrade SBT to 1.5.7

2021-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460358#comment-17460358
 ] 

Apache Spark commented on SPARK-37656:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34915

> Upgrade SBT to 1.5.7
> 
>
> Key: SPARK-37656
> URL: https://issues.apache.org/jira/browse/SPARK-37656
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SBT 1.5.7 was released a few hours ago, which includes a fix for 
> CVE-2021-45046.
> https://github.com/sbt/sbt/releases/tag/v1.5.7



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37656) Upgrade SBT to 1.5.7

2021-12-15 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-37656:
--

 Summary: Upgrade SBT to 1.5.7
 Key: SPARK-37656
 URL: https://issues.apache.org/jira/browse/SPARK-37656
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.2.1, 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


SBT 1.5.7 was released a few hours ago, which includes a fix for CVE-2021-45046.
https://github.com/sbt/sbt/releases/tag/v1.5.7



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34198) Add RocksDB StateStore implementation

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34198:
-
Shepherd: L. C. Hsieh

> Add RocksDB StateStore implementation
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Yuanjian Li
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require a 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore to 
> Spark SS.
>  
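
For readers of this thread: on Spark 3.2+, where this work landed, the RocksDB-backed provider is switched on through the streaming state-store provider config. The snippet below is only a minimal PySpark sketch of that setting; the rate-source aggregation and checkpoint path are made up for illustration.
{code:python}
# Minimal sketch: opt a streaming query into the RocksDB state store provider.
# The rate-source aggregation and checkpoint path below are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

counts = spark.readStream.format("rate").load().groupBy("value").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("checkpointLocation", "/tmp/rocksdb-state-demo")  # assumed path
         .start())
{code}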



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34198) Add RocksDB StateStore implementation

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34198:


Assignee: Yuanjian Li  (was: Apache Spark)

> Add RocksDB StateStore implementation
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Yuanjian Li
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require a 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore to 
> Spark SS.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37483) Support push down top N to JDBC data source V2

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37483:


Assignee: (was: Apache Spark)

> Support push down top N to JDBC data source V2
> --
>
> Key: SPARK-37483
> URL: https://issues.apache.org/jira/browse/SPARK-37483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37483) Support push down top N to JDBC data source V2

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37483:


Assignee: Apache Spark

> Support push down top N to JDBC data source V2
> --
>
> Key: SPARK-37483
> URL: https://issues.apache.org/jira/browse/SPARK-37483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37483) Support push down top N to JDBC data source V2

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37483:
-
Fix Version/s: (was: 3.3.0)

> Support push down top N to JDBC data source V2
> --
>
> Key: SPARK-37483
> URL: https://issues.apache.org/jira/browse/SPARK-37483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-37483) Support push down top N to JDBC data source V2

2021-12-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-37483:
--
  Assignee: (was: jiaan.geng)

Reverted at 
https://github.com/apache/spark/commit/16841286ecb31446300c81e3f98516e92846d9fa

> Support push down top N to JDBC data source V2
> --
>
> Key: SPARK-37483
> URL: https://issues.apache.org/jira/browse/SPARK-37483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37627) Add sorted column in BucketTransform

2021-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460331#comment-17460331
 ] 

Apache Spark commented on SPARK-37627:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34914

> Add sorted column in BucketTransform
> 
>
> Key: SPARK-37627
> URL: https://issues.apache.org/jira/browse/SPARK-37627
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.3.0
>
>
> In V1, we can create a table with sorted buckets like the following:
> {code:java}
>   sql("CREATE TABLE tbl(a INT, b INT) USING parquet " +
> "CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS")
> {code}
> However, creating a table with sorted buckets in V2 fails with an exception:
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot convert bucketing with sort 
> columns to a transform.
> {code}
> We should be able to create a table with sorted buckets in V2.
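
For comparison, the same sorted-bucket layout is reachable today through the V1 DataFrameWriter path; the PySpark sketch below (table name and data are made up) shows the write that this ticket wants to be expressible as a V2 transform as well.
{code:python}
# Sketch of the V1 sorted-bucket write path (illustrative table name and data).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(10).selectExpr("CAST(id AS INT) AS a", "CAST(id % 3 AS INT) AS b")
(df.write
   .bucketBy(5, "a")
   .sortBy("b")
   .mode("overwrite")
   .saveAsTable("tbl_bucketed"))
{code}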



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37627) Add sorted column in BucketTransform

2021-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460329#comment-17460329
 ] 

Apache Spark commented on SPARK-37627:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34914

> Add sorted column in BucketTransform
> 
>
> Key: SPARK-37627
> URL: https://issues.apache.org/jira/browse/SPARK-37627
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.3.0
>
>
> In V1, we can create a table with sorted buckets like the following:
> {code:java}
>   sql("CREATE TABLE tbl(a INT, b INT) USING parquet " +
> "CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS")
> {code}
> However, creating a table with sorted buckets in V2 fails with an exception:
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot convert bucketing with sort 
> columns to a transform.
> {code}
> We should be able to create a table with sorted buckets in V2.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit

2021-12-15 Thread Samved Chandrakant Divekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460321#comment-17460321
 ] 

Samved Chandrakant Divekar commented on SPARK-37630:


A new CVE that specifically affects log4j 1.2.x has popped up. It's similar to 
the one being discussed above.

https://nvd.nist.gov/vuln/detail/CVE-2021-4104

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Ismail H
>Priority: Major
>  Labels: security
>
> log4j is being used in version [1.2.17|#L122]]
>  
> This version has been deprecated and [has since had a known issue that 
> hasn't been addressed in 1.X 
> versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>  
> *Solution:*
>  * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last 
> known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37655) Add RocksDB Implementation for KVStore

2021-12-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37655:
--
Parent: SPARK-33772
Issue Type: Sub-task  (was: Improvement)

> Add RocksDB Implementation for KVStore
> --
>
> Key: SPARK-37655
> URL: https://issues.apache.org/jira/browse/SPARK-37655
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37655) Add RocksDB Implementation for KVStore

2021-12-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37655:
-

Assignee: Dongjoon Hyun  (was: Apache Spark)

> Add RocksDB Implementation for KVStore
> --
>
> Key: SPARK-37655
> URL: https://issues.apache.org/jira/browse/SPARK-37655
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37653) Upgrade RoaringBitmap to 0.9.23

2021-12-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37653:
--
Parent: SPARK-33772
Issue Type: Sub-task  (was: Improvement)

> Upgrade RoaringBitmap to 0.9.23
> ---
>
> Key: SPARK-37653
> URL: https://issues.apache.org/jira/browse/SPARK-37653
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37653) Upgrade RoaringBitmap to 0.9.23

2021-12-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37653:
-

Assignee: Yang Jie

> Upgrade RoaringBitmap to 0.9.23
> ---
>
> Key: SPARK-37653
> URL: https://issues.apache.org/jira/browse/SPARK-37653
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37653) Upgrade RoaringBitmap to 0.9.23

2021-12-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37653.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34909
[https://github.com/apache/spark/pull/34909]

> Upgrade RoaringBitmap to 0.9.23
> ---
>
> Key: SPARK-37653
> URL: https://issues.apache.org/jira/browse/SPARK-37653
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37655) Add RocksDB Implementation for KVStore

2021-12-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460284#comment-17460284
 ] 

Apache Spark commented on SPARK-37655:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34913

> Add RocksDB Implementation for KVStore
> --
>
> Key: SPARK-37655
> URL: https://issues.apache.org/jira/browse/SPARK-37655
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37655) Add RocksDB Implementation for KVStore

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37655:


Assignee: Apache Spark

> Add RocksDB Implementation for KVStore
> --
>
> Key: SPARK-37655
> URL: https://issues.apache.org/jira/browse/SPARK-37655
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37655) Add RocksDB Implementation for KVStore

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37655:


Assignee: (was: Apache Spark)

> Add RocksDB Implementation for KVStore
> --
>
> Key: SPARK-37655
> URL: https://issues.apache.org/jira/browse/SPARK-37655
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37655) Add RocksDB Implementation for KVStore

2021-12-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37655:


Assignee: Apache Spark

> Add RocksDB Implementation for KVStore
> --
>
> Key: SPARK-37655
> URL: https://issues.apache.org/jira/browse/SPARK-37655
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37655) Add RocksDB Implementation for KVStore

2021-12-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37655:
-

 Summary: Add RocksDB Implementation for KVStore
 Key: SPARK-37655
 URL: https://issues.apache.org/jira/browse/SPARK-37655
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37154) Inline type hints for python/pyspark/rdd.py

2021-12-15 Thread Byron Hsu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460243#comment-17460243
 ] 

Byron Hsu commented on SPARK-37154:
---

I am looking into this one

> Inline type hints for python/pyspark/rdd.py
> ---
>
> Key: SPARK-37154
> URL: https://issues.apache.org/jira/browse/SPARK-37154
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled

2021-12-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37633:
-
Affects Version/s: (was: 3.0.3)

> Unwrap cast should skip if downcast failed with ansi enabled
> 
>
> Key: SPARK-37633
> URL: https://issues.apache.org/jira/browse/SPARK-37633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>
> Currently, unwrap cast throws ArithmeticException if the down-cast fails with 
> ANSI enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we 
> should always skip on failure regardless of the ANSI config.
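
To make the scenario concrete, the PySpark sketch below builds a comparison of the shape this rule rewrites, with a literal that does not fit the narrower column type. It is only an illustration of the kind of down-cast the description refers to, not necessarily the exact reproduction from the linked PR.
{code:python}
# Illustration only: a cast-on-column comparison whose literal (100000) does not
# fit into SHORT, so unwrapping the cast would require an overflowing down-cast.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

(spark.range(10)
      .selectExpr("CAST(id AS SHORT) AS s")
      .where("CAST(s AS INT) > 100000")   # 100000 is outside SHORT's range
      .show())
{code}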



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled

2021-12-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-37633:


Assignee: Manu Zhang

> Unwrap cast should skip if downcast failed with ansi enabled
> 
>
> Key: SPARK-37633
> URL: https://issues.apache.org/jira/browse/SPARK-37633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
>
> Currently, unwrap cast throws ArithmeticException if the down-cast fails with 
> ANSI enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we 
> should always skip on failure regardless of the ANSI config.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled

2021-12-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-37633.
--
Fix Version/s: 3.3.0
   3.2.1
   Resolution: Fixed

Issue resolved by pull request 34888
[https://github.com/apache/spark/pull/34888]

> Unwrap cast should skip if downcast failed with ansi enabled
> 
>
> Key: SPARK-37633
> URL: https://issues.apache.org/jira/browse/SPARK-37633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.3.0, 3.2.1
>
>
> Currently, unwrap cast throws ArithmeticException if the down-cast fails with 
> ANSI enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we 
> should always skip on failure regardless of the ANSI config.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!empty_test.png|width=701,height=286!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, the round trip is not reversible. 

FYI. [~maxgekk] 
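
Until the read-side behavior is changed, a possible interim workaround is to map the sentinel back to "" after reading. The PySpark sketch below assumes the same path and a two-string-column schema as in the example above.
{code:python}
# Interim workaround sketch (path and schema string assumed from the example above):
# map the emptyValue sentinel back to "" after reading.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.master("local[*]").getOrCreate()

parsed = (spark.read.option("emptyValue", "EMPTY")
          .schema("make STRING, comment STRING")
          .csv("/Users/guowei19/work/test_empty")
          .withColumn("comment",
                      when(col("comment") == "EMPTY", "").otherwise(col("comment"))))
parsed.show()
{code}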


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!empty_test.png!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, the round trip is not reversible. 

FYI. [~maxgekk] 

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the nullValue option in the dependent component univocity's 
> {*}_CommonSettings_{*} is designed as follows:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be converted to nullValue strings. But in Spark, we ultimately convert nothing 
> content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and an emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with 

[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!empty_test.png!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, the round trip is not reversible. 

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, the round trip is not reversible. 

FYI. [~maxgekk] 

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the nullValue option in the dependent component univocity's 
> {*}_CommonSettings_{*} is designed as follows:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be converted to nullValue strings. But in Spark, we ultimately convert nothing 
> content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and an emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for 

[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Attachment: (was: image-2021-12-16-01-57-55-864.png)

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the nullValue option in the dependent component univocity's 
> {*}_CommonSettings_{*} is designed as follows:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be converted to nullValue strings. But in Spark, we ultimately convert nothing 
> content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and an emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue strings rather than to convert both "\"\""(quoted 
> empty strings) and emptyValue strings to ""(empty) in dataframe.
> I think 

[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:04 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, the round trip is not reversible. 

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

 

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, the round trip is not reversible. 

FYI. [~maxgekk] 

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the nullValue option in the dependent component univocity's 
> {*}_CommonSettings_{*} is designed as follows:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be converted to nullValue strings. But in Spark, we ultimately convert nothing 
> content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and a 

[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:03 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

 

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, the round trip is not reversible. 

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

 

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, the round trip is not reversible. 

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the nullValue option in the dependent component univocity's 
> {*}_CommonSettings_{*} is designed as follows:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be converted to nullValue strings. But in Spark, we ultimately convert nothing 
> content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and an emptyValueInWrite option for 

[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:03 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

 

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, the round trip is not reversible. 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, missing content will 
> be converted to nullValue strings. But in Spark, we finally convert missing 
> content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> Now, let's talk about emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and an emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the 

[jira] [Updated] (SPARK-37654) Regression - NullPointerException in Row.getSeq when field null

2021-12-15 Thread Brandon Dahler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Dahler updated SPARK-37654:
---
Description: 
h2. Description

A NullPointerException occurs in _org.apache.spark.sql.Row.getSeq(int)_ if the 
row contains a _null_ value at the requested index.
{code:java}
java.lang.NullPointerException
at org.apache.spark.sql.Row.getSeq(Row.scala:319)
at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
at org.apache.spark.sql.Row.getList(Row.scala:327)
at org.apache.spark.sql.Row.getList$(Row.scala:326)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166)
...
{code}
 

Prior to 3.1.1, the code would not throw an exception and instead would return 
a null _Seq_ instance.
h2. Reproduction
 # Start a new spark-shell instance
 # Execute the following script:
{code:scala}
import org.apache.spark.sql.Row

Row(Seq("value")).getSeq(0)
Row(Seq()).getSeq(0)
Row(null).getSeq(0) {code}

h3. Expected Output

res2 outputs a _null_ value.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)

scala> Row(Seq()).getSeq(0)
res1: Seq[Nothing] = List()

scala> Row(null).getSeq(0)
res2: Seq[Nothing] = null
{code}
h3. Actual Output

res2 throws a NullPointerException.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)

scala> Row(Seq()).getSeq(0)
res1: Seq[Nothing] = List()

scala> Row(null).getSeq(0)
java.lang.NullPointerException
  at org.apache.spark.sql.Row.getSeq(Row.scala:319)
  at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
  ... 47 elided
{code}

h3. Environments Tested
Tested against the following releases using the provided reproduction steps:
 # spark-3.0.3-bin-hadoop2.7 - Succeeded
{code:java}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.3
  /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}
 # spark-3.1.2-bin-hadoop3.2 - Failed
{code:java}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
  /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}
 # spark-3.2.0-bin-hadoop3.2 - Failed
{code:java}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
  /_/Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}


h2. Regression Source
The regression appears to have been introduced in 
[25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb|https://github.com/apache/spark/commit/25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb#diff-722324a11a0e4635a59a9debc962da2c1678d86702a9a106fd0d51188f83853bR317],
 which addressed [SPARK-32526|https://issues.apache.org/jira/browse/SPARK-32526]

h2. Work Around
This regression can be worked around by using _Row.isNullAt(int)_ and handling 
the null scenario in user code, prior to calling _Row.getSeq(int)_ or 
_Row.getList(int)_.
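
A minimal sketch of that workaround (the helper name is illustrative):
{code:scala}
import org.apache.spark.sql.Row

// Check for null first, restoring the pre-3.1.1 behaviour of returning null.
def safeGetSeq[T](row: Row, i: Int): Seq[T] =
  if (row.isNullAt(i)) null else row.getSeq[T](i)
{code}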

  was:
h2. Description

A NullPointerException occurs in _org.apache.spark.sql.Row.getSeq(int)_ if the 
row contains a _null_ value at the requested index.
{code:java}
java.lang.NullPointerException
at org.apache.spark.sql.Row.getSeq(Row.scala:319)
at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
at org.apache.spark.sql.Row.getList(Row.scala:327)
at org.apache.spark.sql.Row.getList$(Row.scala:326)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166)
...
{code}
 

Prior to 3.1.1, the code would not throw an exception and instead would return 
a null _Seq_ instance.
h2. Reproduction
 # Start a new spark-shell instance
 # Execute the following script:
{code:scala}
import org.apache.spark.sql.Row

Row(Seq("value")).getSeq(0)
Row(Seq()).getSeq(0)
Row(null).getSeq(0) {code}

h3. Expected Output

res2 outputs a _null_ value.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)

scala> Row(Seq()).getSeq(0)
res1: Seq[Nothing] = List()

scala> Row(null).getSeq(0)
res2: Seq[Nothing] = null
{code}
h3. Actual Output

res2 throws a NullPointerException.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)


[jira] [Updated] (SPARK-37654) Regression - NullPointerException in Row.getSeq when field null

2021-12-15 Thread Brandon Dahler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Dahler updated SPARK-37654:
---
Description: 
h2. Description

A NullPointerException occurs in _org.apache.spark.sql.Row.getSeq(int)_ if the 
row contains a _null_ value at the requested index.
{code:java}
java.lang.NullPointerException
at org.apache.spark.sql.Row.getSeq(Row.scala:319)
at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
at org.apache.spark.sql.Row.getList(Row.scala:327)
at org.apache.spark.sql.Row.getList$(Row.scala:326)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166)
...
{code}
 

Prior to 3.1.1, the code would not throw an exception and instead would return 
a null _Seq_ instance.
h2. Reproduction
 # Start a new spark-shell instance
 # Execute the following script:
{code:scala}
import org.apache.spark.sql.Row

Row(Seq("value")).getSeq(0)
Row(Seq()).getSeq(0)
Row(null).getSeq(0) {code}

h3. Expected Output

res2 outputs a _null_ value.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)

scala> Row(Seq()).getSeq(0)
res1: Seq[Nothing] = List()

scala> Row(null).getSeq(0)
res2: Seq[Nothing] = null
{code}
h3. Actual Output

res2 throws a NullPointerException.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)

scala> Row(Seq()).getSeq(0)
res1: Seq[Nothing] = List()

scala> Row(null).getSeq(0)
java.lang.NullPointerException
  at org.apache.spark.sql.Row.getSeq(Row.scala:319)
  at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
  ... 47 elided
{code}

h3. Environments Tested
Tested against the following releases using the provided reproduction steps:
 # spark-3.0.3-bin-hadoop2.7 - Succeeded
{code:java}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.3
  /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}
 # spark-3.1.2-bin-hadoop3.2 - Failed
{code:java}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
  /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}
 # spark-3.2.0-bin-hadoop3.2 - Failed
{code:java}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
  /_/Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}
 
h2. Regression Source
The regression appears to have been introduced in 
[25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb|https://github.com/apache/spark/commit/25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb#diff-722324a11a0e4635a59a9debc962da2c1678d86702a9a106fd0d51188f83853bR317],
 which addressed [SPARK-32526|https://issues.apache.org/jira/browse/SPARK-32526]

h2. Work Around
This regression can be worked around by using _Row.isNullAt(int)_ and handling 
the null scenario in user code, prior to calling _Row.getSeq(int)_ or 
_Row.getList(int)_.

  was:
h2. Description

A NullPointerException occurs in _org.apache.spark.sql.Row.getSeq(int)_ if the 
row contains a _null_ value at the requested index.
{code:java}
java.lang.NullPointerException
at org.apache.spark.sql.Row.getSeq(Row.scala:319)
at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
at org.apache.spark.sql.Row.getList(Row.scala:327)
at org.apache.spark.sql.Row.getList$(Row.scala:326)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166)
...
{code}
 

Prior to 3.1.1, the code would not throw an exception and instead would return 
a null _Seq_ instance.
h2. Reproduction
 # Start a new spark-shell instance
 # Execute the following script:
{code:scala}
import org.apache.spark.sql.Row

Row(Seq("value")).getSeq(0)
Row(Seq()).getSeq(0)
Row(null).getSeq(0) {code}

h3. Expected Output

res2 outputs a _null_ value.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)

scala> Row(Seq()).getSeq(0)
res1: Seq[Nothing] = List()

scala> Row(null).getSeq(0)
res2: Seq[Nothing] = null
{code}
h3. Actual Output

res2 throws a NullPointerException.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)


[jira] [Updated] (SPARK-37654) Regression - NullPointerException in Row.getSeq when field null

2021-12-15 Thread Brandon Dahler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Dahler updated SPARK-37654:
---
Environment: (was: Tested against the following releases using the 
provided reproduction steps:
 # spark-3.0.3-bin-hadoop2.7 - Succeeded

{code:java}
Welcome to
                    __
     / __/__  ___ _/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.3
      /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}

 # spark-3.1.2-bin-hadoop3.2 - Failed

{code:java}
Welcome to
                    __
     / __/__  ___ _/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}

 # spark-3.2.0-bin-hadoop3.2 - Failed

{code:java}
Welcome to
                    __
     / __/__  ___ _/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code})

> Regression - NullPointerException in Row.getSeq when field null
> ---
>
> Key: SPARK-37654
> URL: https://issues.apache.org/jira/browse/SPARK-37654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0
>Reporter: Brandon Dahler
>Priority: Major
>
> h2. Description
> A NullPointerException occurs in _org.apache.spark.sql.Row.getSeq(int)_ if 
> the row contains a _null_ value at the requested index.
> {code:java}
> java.lang.NullPointerException
>   at org.apache.spark.sql.Row.getSeq(Row.scala:319)
>   at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
>   at org.apache.spark.sql.Row.getList(Row.scala:327)
>   at org.apache.spark.sql.Row.getList$(Row.scala:326)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166)
> ...
> {code}
>  
> Prior to 3.1.1, the code would not throw an exception and instead would 
> return a null _Seq_ instance.
> h2. Reproduction
>  # Start a new spark-shell instance
>  # Execute the following script:
> {code:scala}
> import org.apache.spark.sql.Row
> Row(Seq("value")).getSeq(0)
> Row(Seq()).getSeq(0)
> Row(null).getSeq(0) {code}
> h3. Expected Output
> res2 outputs a _null_ value.
> {code:java}
> scala> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Row
> scala>
> scala> Row(Seq("value")).getSeq(0)
> res0: Seq[Nothing] = List(value)
> scala> Row(Seq()).getSeq(0)
> res1: Seq[Nothing] = List()
> scala> Row(null).getSeq(0)
> res2: Seq[Nothing] = null
> {code}
> h3. Actual Output
> res2 throws a NullPointerException.
> {code:java}
> scala> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Row
> scala>
> scala> Row(Seq("value")).getSeq(0)
> res0: Seq[Nothing] = List(value)
> scala> Row(Seq()).getSeq(0)
> res1: Seq[Nothing] = List()
> scala> Row(null).getSeq(0)
> java.lang.NullPointerException
>   at org.apache.spark.sql.Row.getSeq(Row.scala:319)
>   at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
>   ... 47 elided
> {code}
>  
> h2. Regression Source
> The regression appears to have been introduced in 
> [25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb|https://github.com/apache/spark/commit/25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb#diff-722324a11a0e4635a59a9debc962da2c1678d86702a9a106fd0d51188f83853bR317],
>  which addressed 
> [SPARK-32526|https://issues.apache.org/jira/browse/SPARK-32526]
> h2. Work Around
> This regression can be worked around by using _Row.isNullAt(int)_ and 
> handling the null scenario in user code, prior to calling _Row.getSeq(int)_ 
> or _Row.getList(int)_.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37654) Regression - NullPointerException in Row.getSeq when field null

2021-12-15 Thread Brandon Dahler (Jira)
Brandon Dahler created SPARK-37654:
--

 Summary: Regression - NullPointerException in Row.getSeq when 
field null
 Key: SPARK-37654
 URL: https://issues.apache.org/jira/browse/SPARK-37654
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.1.2, 3.1.1
 Environment: Tested against the following releases using the provided 
reproduction steps:
 # spark-3.0.3-bin-hadoop2.7 - Succeeded

{code:java}
Welcome to
                    __
     / __/__  ___ _/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.3
      /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}

 # spark-3.1.2-bin-hadoop3.2 - Failed

{code:java}
Welcome to
                    __
     / __/__  ___ _/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}

 # spark-3.2.0-bin-hadoop3.2 - Failed

{code:java}
Welcome to
                    __
     / __/__  ___ _/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) 
{code}
Reporter: Brandon Dahler


h2. Description

A NullPointerException occurs in _org.apache.spark.sql.Row.getSeq(int)_ if the 
row contains a _null_ value at the requested index.
{code:java}
java.lang.NullPointerException
at org.apache.spark.sql.Row.getSeq(Row.scala:319)
at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
at org.apache.spark.sql.Row.getList(Row.scala:327)
at org.apache.spark.sql.Row.getList$(Row.scala:326)
at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166)
...
{code}
 

Prior to 3.1.1, the code would not throw an exception and instead would return 
a null _Seq_ instance.
h2. Reproduction
 # Start a new spark-shell instance
 # Execute the following script:
{code:scala}
import org.apache.spark.sql.Row

Row(Seq("value")).getSeq(0)
Row(Seq()).getSeq(0)
Row(null).getSeq(0) {code}

h3. Expected Output

res2 outputs a _null_ value.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)

scala> Row(Seq()).getSeq(0)
res1: Seq[Nothing] = List()

scala> Row(null).getSeq(0)
res2: Seq[Nothing] = null
{code}
h3. Actual Output

res2 throws a NullPointerException.
{code:java}
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala>

scala> Row(Seq("value")).getSeq(0)
res0: Seq[Nothing] = List(value)

scala> Row(Seq()).getSeq(0)
res1: Seq[Nothing] = List()

scala> Row(null).getSeq(0)
java.lang.NullPointerException
  at org.apache.spark.sql.Row.getSeq(Row.scala:319)
  at org.apache.spark.sql.Row.getSeq$(Row.scala:319)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166)
  ... 47 elided
{code}
 
h2. Regression Source
The regression appears to have been introduced in 
[25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb|https://github.com/apache/spark/commit/25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb#diff-722324a11a0e4635a59a9debc962da2c1678d86702a9a106fd0d51188f83853bR317],
 which addressed [SPARK-32526|https://issues.apache.org/jira/browse/SPARK-32526]

h2. Work Around
This regression can be worked around by using _Row.isNullAt(int)_ and handling 
the null scenario in user code, prior to calling _Row.getSeq(int)_ or 
_Row.getList(int)_.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo commented on SPARK-37604:
-

For the code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than the string "EMPTY".

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, missing content will 
> be converted to nullValue strings. But in Spark, we finally convert missing 
> content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> Now, let's talk about emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and an emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the emptyValue is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in 

[jira] [Updated] (SPARK-37572) Flexible ways of launching executors

2021-12-15 Thread Dagang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dagang Wei updated SPARK-37572:
---
Affects Version/s: 3.3.0
   (was: 3.2.0)

> Flexible ways of launching executors
> 
>
> Key: SPARK-37572
> URL: https://issues.apache.org/jira/browse/SPARK-37572
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.3.0
>Reporter: Dagang Wei
>Priority: Major
>
> Currently Spark launches executor processes by constructing and running 
> commands [1], for example:
> {code:java}
> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
> /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
> -Xmx1024M -Dspark.driver.port=35729 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@example.host:35729 --executor-id 0 --hostname 
> 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url 
> spark://Worker@100.116.124.193:45287 {code}
> But there are use cases which require more flexible ways of launching 
> executors. In particular, our use case is that we run Spark in standalone 
> mode, Spark master and workers are running in VMs. We want to allow Spark app 
> developers to provide custom container images to customize the job runtime 
> environment (typically Java and Python dependencies), so executors (which run 
> the job code) need to run in Docker containers.
> After reading the source code, we found that the concept of Spark Command 
> Runner might be a good solution. Basically, we want to introduce an optional 
> Spark command runner in Spark, so that instead of running the command to 
> launch executor directly, it passes the command to the runner, the runner 
> then runs the command with its own strategy which could be running in Docker, 
> or by default running the command directly.
> The runner is specified through an env variable `SPARK_COMMAND_RUNNER`, which 
> by default could be a simple script like:
> {code:java}
> #!/bin/bash
> exec "$@" {code}
> or in the case of Docker container:
> {code:java}
> #!/bin/bash
> docker run ... -- "$@" {code}
>  
> I already have a patch for this feature and have tested in our environment.
>  
> [1]: 
> [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]
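
A minimal sketch of the runner idea in Scala (illustrative only; the helper name and the exact wiring into CommandUtils are assumptions, not the attached patch):
{code:scala}
// Sketch: prepend an optional runner, taken from SPARK_COMMAND_RUNNER, to the executor
// launch command. With no runner configured, the command is executed as before.
def wrapWithRunner(command: Seq[String], env: Map[String, String]): Seq[String] =
  env.get("SPARK_COMMAND_RUNNER").filter(_.nonEmpty) match {
    case Some(runner) => runner +: command
    case None         => command
  }
{code}
A Docker-based runner script then receives the full executor command as "$@" and chooses how to execute it, e.g. inside a custom container image.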



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Attachment: empty_test.png

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue strings rather than to convert both "\"\""(quoted 
> empty strings) and emptyValue strings to ""(empty) in dataframe.
> I 

[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Attachment: image-2021-12-16-01-57-55-864.png

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue strings rather than to convert both "\"\""(quoted 
> empty strings) and emptyValue strings to ""(empty) 

[jira] [Updated] (SPARK-37572) Flexible ways of launching executors

2021-12-15 Thread Dagang Wei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dagang Wei updated SPARK-37572:
---
Description: 
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@example.host:35729 --executor-id 0 --hostname 
100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url 
spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, our use case is that we run Spark in standalone mode, 
Spark master and workers are running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After reading the source code, we found that the concept of Spark Command 
Runner might be a good solution. Basically, we want to introduce an optional 
Spark command runner in Spark, so that instead of running the command to launch 
executor directly, it passes the command to the runner, the runner then runs 
the command with its own strategy which could be running in Docker, or by 
default running the command directly.

The runner is specified through an env variable `SPARK_COMMAND_RUNNER`, which 
by default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
 

I already have a patch for this feature and have tested in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]

  was:
Currently Spark launches executor processes by constructing and running 
commands [1], for example:
{code:java}
/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
/opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
-Xmx1024M -Dspark.driver.port=35729 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729 --executor-id 0 
--hostname 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 
--worker-url spark://Worker@100.116.124.193:45287 {code}
But there are use cases which require more flexible ways of launching 
executors. In particular, our use case is that we run Spark in standalone mode, 
Spark master and workers are running in VMs. We want to allow Spark app 
developers to provide custom container images to customize the job runtime 
environment (typically Java and Python dependencies), so executors (which run 
the job code) need to run in Docker containers.

After reading the source code, we found that the concept of Spark Command 
Runner might be a good solution. Basically, we want to introduce an optional 
Spark command runner in Spark, so that instead of running the command to launch 
executor directly, it passes the command to the runner, the runner then runs 
the command with its own strategy which could be running in Docker, or by 
default running the command directly.

The runner is specified through an env variable `SPARK_COMMAND_RUNNER`, which 
by default could be a simple script like:
{code:java}
#!/bin/bash
exec "$@" {code}
or in the case of Docker container:
{code:java}
#!/bin/bash
docker run ... -- "$@" {code}
 

I already have a patch for this feature and have tested in our environment.

 

[1]: 
[https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52]


> Flexible ways of launching executors
> 
>
> Key: SPARK-37572
> URL: https://issues.apache.org/jira/browse/SPARK-37572
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.2.0
>Reporter: Dagang Wei
>Priority: Major
>
> Currently Spark launches executor processes by constructing and running 
> commands [1], for example:
> {code:java}
> /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp 
> /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* 
> -Xmx1024M -Dspark.driver.port=35729 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://coarsegrainedschedu...@example.host:35729 --executor-id 0 --hostname 
> 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url 
> spark://Worker@100.116.124.193:45287 {code}
> But there are use cases which require more flexible ways of launching 
> executors. In 

[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Summary: Change emptyValueInRead's effect to that any fields matching this 
string will be set as "" when reading csv files  (was: The option 
emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields 
matching this string will be set as empty values "" when reading)

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we 

[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit

2021-12-15 Thread Ankur Jain (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460117#comment-17460117
 ] 

Ankur Jain commented on SPARK-37630:


Log4j 2.15.0 also has a vulnerability; it would be good to upgrade it to 2.16.0.

> Security issue from Log4j 1.X exploit
> -
>
> Key: SPARK-37630
> URL: https://issues.apache.org/jira/browse/SPARK-37630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.0
>Reporter: Ismail H
>Priority: Major
>  Labels: security
>
> log4j is being used in version [1.2.17|#L122]]
>  
> This version has been deprecated and since [then has had a known issue that 
> hasn't been addressed in 1.X 
> versions|https://www.cvedetails.com/cve/CVE-2019-17571/].
>  
> *Solution:*
>  * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last 
> known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460111#comment-17460111
 ] 

Max Gekk commented on SPARK-37604:
--

> but "EMPTY" strings in csv files can not be parsed as empty columns

The use case where an empty string is interpreted as a non-empty value like 
"EMPTY" is not clear to me. [~Wayne Guo] Could you describe the case where it is 
needed?

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or 

[jira] [Comment Edited] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460091#comment-17460091
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 4:48 PM:


[~hyukjin.kwon], [~maxgekk] Shall we have a simple discussion about it in your 
free time? I'd like to hear your thoughts on this.


was (Author: wayne guo):
[~hyukjin.kwon][~maxgekk] Shall we have a simple discussion about it in your 
free time, I'd like to hear your thoughts on this.

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious 

[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460091#comment-17460091
 ] 

Wei Guo commented on SPARK-37604:
-

[~hyukjin.kwon][~maxgekk] Shall we have a quick discussion about this in your 
free time? I'd like to hear your thoughts on it.

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 

[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460084#comment-17460084
 ] 

Wei Guo commented on SPARK-37604:
-

Maybe this issue is not a notable bug or improvement, but for common usage, 
users would prefer to be able to convert these emptyValue strings in csv files 
back into "" (empty strings) after writing empty strings out as the emptyValue 
string, rather than keeping the current behavior.
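
As a minimal sketch of the round trip described above (the path value and 
spark.implicits._ being in scope are assumptions, reusing the "EMPTY" token 
from the description):
{code:scala}
// Write "" out as the configured emptyValue token.
Seq(("Tesla", "")).toDF("make", "comment")
  .write.option("emptyValue", "EMPTY").csv(path)

// Read it back with the same option: the field comes back as the literal
// string "EMPTY" instead of being restored to "".
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}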

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For 

[jira] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37604 ]


Wei Guo deleted comment on SPARK-37604:
-

was (Author: wayne guo):
The current behavior of emptyValueInRead is more like the existing function 
for null values on Dataset: 
{code:scala}
dataframe.na.fill(fillMap){code}

So, we can also provide a function in Dataset similar to it, such as:
{code:scala}
dataframe.empty.fill(fillMap){code}
rather than changing empty strings to emptyValueInRead in the DataFrame when 
reading csv files.
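
A possible workaround sketch with today's API, in the spirit of the na 
functions above (the "EMPTY" token and path come from the earlier example; 
without inferSchema all csv columns are read as strings):
{code:scala}
// Read the csv that was written with emptyValue "EMPTY", then map the token
// back to "" in every column via DataFrameNaFunctions.replace.
val raw = spark.read.option("emptyValue", "EMPTY").csv(path)
val restored = raw.na.replace(raw.columns.toSeq, Map("EMPTY" -> ""))
{code}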

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:scala}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the nullValue option in the underlying univocity dependency's 
{*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, only input with no 
characters read is converted to the nullValue string. But in Spark, we finally 
convert both such empty input and nullValue strings to null in the 
*_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
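
To make the read-side handling above concrete, a small sketch with an assumed 
two-line file (no header, so the columns default to _c0/_c1): both an empty 
field and the nullValue token come back as null.
{code:scala}
// Assumed contents of the file at `path`:
//   Tesla,NULL
//   Ford,
spark.read.option("nullValue", "NULL").csv(path).show()
// +-----+----+
// |  _c0| _c1|
// +-----+----+
// |Tesla|null|
// | Ford|null|
// +-----+----+
{code}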
 

Now, let's talk about emptyValue.

{*}For the emptyValue option{*}, we add an emptyValueInRead option for reading 
and an emptyValueInWrite option for writing. I found that Spark keeps the same 
behavior for emptyValue as univocity, that is:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is an obvious difference between nullValue and emptyValue in read 
handling. For nullValue, we convert both empty input and nullValue strings to 
null in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty 
strings) to emptyValue strings, rather than converting both "\"\"" (quoted 
empty strings) and emptyValue strings to "" (empty) in the dataframe.

I think it's better to keep a similar behavior for emptyValue as for nullValue 
when reading (that is, recover emptyValue in csv files back to ""). So I 
suggest that emptyValueInRead (in CSVOptions) should be designed so that any 
fields matching this string will be set as empty values "" when reading.
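
Under this suggestion, the earlier emptyValue example would round-trip as 
sketched below (an illustration of the proposal only, not current behavior; 
path and the "EMPTY" token come from the example above):
{code:scala}
// Proposed read-side handling: any field matching the emptyValue token
// ("EMPTY" here) would be restored to "" when reading.
spark.read.option("emptyValue", "EMPTY").csv(path).show()
// The comment column would then contain "" instead of the literal "EMPTY".
{code}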

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the nullValue option in the underlying univocity dependency's 
{*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, only input with no 
characters read is converted to the nullValue string. But in Spark, we finally 
convert both such empty input and nullValue strings to null in the 
*_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

Now, let's talk about emptyValue.

{*}For the emptyValue option{*}, we add an emptyValueInRead option for reading 
and an emptyValueInWrite option for writing. I found that Spark keeps the same 
behavior for emptyValue as univocity, that is:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is an obvious difference between nullValue and emptyValue in read 
handling. For nullValue, we convert both empty input and nullValue strings to 
null in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty 
strings) to emptyValue, rather than converting both "\"\"" (quoted empty 
strings) and emptyValue strings to "" (empty) in the dataframe.

I think it's better to keep a similar behavior for emptyValue as for nullValue 
when reading (that is, recover emptyValue back to ""), so I suggest that 
emptyValueInRead (in CSVOptions) should be designed so that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that 

[jira] [Resolved] (SPARK-37483) Support push down top N to JDBC data source V2

2021-12-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37483.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34738
[https://github.com/apache/spark/pull/34738]

> Support push down top N to JDBC data source V2
> --
>
> Key: SPARK-37483
> URL: https://issues.apache.org/jira/browse/SPARK-37483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the nullValue option in the underlying univocity dependency's 
{*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, only input with no 
characters read is converted to the nullValue string. But in Spark, we finally 
convert both such empty input and nullValue strings to null in the 
*_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

Now, let's talk about emptyValue.

{*}For the emptyValue option{*}, we add an emptyValueInRead option for reading 
and an emptyValueInWrite option for writing. I found that Spark keeps the same 
behavior for emptyValue as univocity:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is an obvious difference between nullValue and emptyValue in read 
handling. For nullValue, we convert both empty input and nullValue strings to 
null in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty 
strings) to emptyValue, rather than converting both "\"\"" (quoted empty 
strings) and emptyValue strings to "" (empty) in the dataframe.

I think it's better to keep a similar behavior for emptyValue as for nullValue 
when reading (that is, recover emptyValue back to ""), so I suggest that 
emptyValueInRead (in CSVOptions) should be designed so that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:

[jira] [Assigned] (SPARK-37483) Support push down top N to JDBC data source V2

2021-12-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37483:
---

Assignee: jiaan.geng

> Support push down top N to JDBC data source V2
> --
>
> Key: SPARK-37483
> URL: https://issues.apache.org/jira/browse/SPARK-37483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the nullValue option in the underlying univocity dependency's 
{*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, only input with no 
characters read is converted to the nullValue string. But in Spark, we finally 
convert both such empty input and nullValue strings to null in the 
*_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

{*}For the emptyValue option{*}, we add an emptyValueInRead option for reading 
and an emptyValueInWrite option for writing. I found that Spark keeps the same 
behavior for emptyValue as univocity:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is an obvious difference between nullValue and emptyValue in read 
handling. For nullValue, we convert both empty input and nullValue strings to 
null in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty 
strings) to emptyValue, rather than converting both "\"\"" (quoted empty 
strings) and emptyValue strings to "" (empty) in the dataframe.

I think it's better to keep a similar behavior for emptyValue as for nullValue 
when reading (that is, recover emptyValue back to ""), so I suggest that 
emptyValueInRead (in CSVOptions) should be designed so that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the nullValue option in the underlying univocity dependency's 
{*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, only input with no 
characters read is converted to the nullValue string. But in Spark, we finally 
convert both such empty input and nullValue strings to null in the 
*_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*}, we add an emptyValueInRead option for reading 
and an emptyValueInWrite option for writing. I found that Spark keeps the same 
behavior for emptyValue as univocity:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is an obvious difference between nullValue and emptyValue in read 
handling. For nullValue, we convert both empty input and nullValue strings to 
null in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty 
strings) to emptyValue, rather than converting both "\"\"" (quoted empty 
strings) and emptyValue strings to "" (empty) in the dataframe.

I think it's better to keep a similar behavior for emptyValue as for nullValue 
when reading (that is, recover emptyValue back to ""), so I suggest that 
emptyValueInRead (in CSVOptions) should be designed so that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the nullValue option in the underlying univocity dependency's 
{*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, only input with no characters 
read is converted to the nullValue string. But in Spark, we finally convert 
both such empty input and nullValue strings to null in the 
*_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*}, we add an emptyValueInRead option for reading 
and an emptyValueInWrite option for writing. I found that Spark keeps the same 
behavior for emptyValue as univocity:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is an obvious difference between nullValue and emptyValue in read 
handling. For nullValue, we convert both empty input and nullValue strings to 
null in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty 
strings) to emptyValue, rather than converting both "\"\"" (quoted empty 
strings) and emptyValue strings to "" (empty) in the dataframe.

I think it's better to keep a similar behavior for emptyValue as for nullValue 
when reading (that is, recover emptyValue back to ""), so I suggest that 
emptyValueInRead (in CSVOptions) should be designed so that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the nullValue option in the underlying univocity dependency's 
{*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, only input with no characters 
read is converted to the nullValue string. But in Spark, we finally convert 
both such empty input and nullValue strings to null in the 
*_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*}, we add an emptyValueInRead option for reading 
and an emptyValueInWrite option for writing. I found that Spark keeps the same 
behavior for emptyValue as univocity:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is an obvious difference between nullValue and emptyValue in read 
handling. For nullValue, we convert both empty input and nullValue strings to 
null in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty 
strings) to emptyValue, rather than converting both "\"\"" (quoted empty 
strings) and emptyValue strings to "" (empty) in the dataframe.

I think it's better to keep a similar behavior for emptyValue as for nullValue 
when reading (that is, recover emptyValue back to ""), so I suggest that 
emptyValueInRead (in CSVOptions) should be designed so that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the nullValue option in the underlying univocity dependency's 
{*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, only input with no characters 
read is converted to the nullValue string. But in Spark, we finally convert 
both such empty input and nullValue strings to null in the 
*_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*}, we add an emptyValueInRead option for reading 
and an emptyValueInWrite option for writing. I found that Spark keeps the same 
behavior for emptyValue as univocity:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert ""(empty string in dataframe) to emptyValue(in csv)
When reading, convert "\"\""(quoted empty string in csv) to emptyValue(in dataframe)
{noformat}
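For instance, a minimal round-trip sketch of this behavior (again assuming a spark-shell session and an empty temporary path):
{code:scala}
// Round-trip sketch for emptyValue; /tmp/empty_value_roundtrip is an assumed empty directory.
val path = "/tmp/empty_value_roundtrip"
Seq(("Tesla", "")).toDF("make", "comment")
  .write.option("emptyValue", "EMPTY").csv(path)
// The written file contains the line: Tesla,EMPTY
val back = spark.read.option("emptyValue", "EMPTY").csv(path)
back.show()  // the second column is read back as the literal string "EMPTY", not as ""
{code}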
 

There is an obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert both empty content and nullValue strings to null 
in the dataframe; but for emptyValue, we only convert "\"\""(quoted empty strings) 
to the emptyValue string, rather than converting both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in the dataframe.

I think it is better to keep a read behavior for emptyValue that is similar to 
nullValue (that is, try to restore emptyValue strings to ""), so I suggest that 
emptyValueInRead (in CSVOptions) be designed so that any field matching this 
string is set to the empty value "" when reading.
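As an illustration only, here is a minimal sketch of the suggested read-side handling, mirroring how nullValue strings are restored to null (the helper name restoreEmpty is hypothetical and not existing Spark code):
{code:scala}
// Hypothetical sketch of the suggested behavior: a field that matches the configured
// emptyValueInRead string is restored to "" (a real empty string) in the dataframe.
def restoreEmpty(datum: String, emptyValueInRead: String): String =
  if (datum == emptyValueInRead) "" else datum

restoreEmpty("EMPTY", "EMPTY")  // returns "", just as nullValue strings are restored to null
restoreEmpty("Tesla", "EMPTY")  // returns "Tesla"; other fields are unchanged
{code}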

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features description in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The CSV data source was imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766].

{*}For the nullValue option{*}, according to the features described in the 
spark-csv readme file, it is designed as follows:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can see that null columns in the dataframe are saved as NULL strings in csv 
files, and NULL strings in csv files are parsed back as null columns in the 
dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
However, the nullValue option in the underlying univocity component's 
{*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, empty (nothing) content is 
converted to the nullValue string, but in Spark we finally convert empty content 
or nullValue strings to null in the *_UnivocityParser_* *_nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
    datum: String,
    name: String,
    nullable: Boolean,
    options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
    if (!nullable) {
      throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
    }
    null
  } else {
    converter.apply(datum)
  }
}
{code}
 



 

{*}For the emptyValue option{*}, we added an emptyValueInRead option for reading 
and an emptyValueInWrite option for writing. I found that Spark keeps the same 
behavior for emptyValue as univocity does:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the emptyValue is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY
{noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can see that empty string columns in the dataframe are saved as "EMPTY" strings 
in csv files, but "EMPTY" strings in csv files cannot be parsed back as empty 
string columns in the dataframe.
 

Since Spark 2.4, for empty strings, there are an emptyValueInRead option for 
reading and an emptyValueInWrite option for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable: we cannot convert "" or 
{color:#172b4d}emptyValueInWrite{color} values back to ""(real empty strings); we 
actually get {color:#172b4d}emptyValueInRead's setting value{color} instead. It is 
supposed to be as follows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
 
{color:#de350b}*We cannot recover the original DataFrame.*{color}
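A minimal sketch of this point (assuming a spark-shell session and an empty temporary path):
{code:scala}
// The DataFrame read back is not the one that was written;
// /tmp/empty_value_not_recovered is an assumed empty directory.
val path = "/tmp/empty_value_not_recovered"
val original = Seq(("Tesla", "")).toDF("make", "comment")
original.write.option("emptyValue", "EMPTY").csv(path)
val restored = spark.read.option("emptyValue", "EMPTY").csv(path)
original.collect()  // Array([Tesla,])       -- the comment column is ""
restored.collect()  // Array([Tesla,EMPTY])  -- the comment column is the marker string, not ""
{code}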

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

For the nullValue option, according to the features description in spark-csv 
readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
