[jira] [Created] (SPARK-37660) Spark-3.2.0 Fetch HBase Data not working
Bhavya Raj Sharma created SPARK-37660: - Summary: Spark-3.2.0 Fetch HBase Data not working Key: SPARK-37660 URL: https://issues.apache.org/jira/browse/SPARK-37660 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Environment: Hadoop version : hadoop-2.9.2 HBase version : hbase-2.2.5 Spark version : spark-3.2.0-bin-without-hadoop java version : jdk1.8.0_151 scala version : scala-sdk-2.12.10 os version : Red Hat Enterprise Linux Server release 6.6 (Santiago) Reporter: Bhavya Raj Sharma Below is the sample code snippet that is used to fetch data from HBase. This used to work fine with spark-3.1.1. However, after upgrading to spark-3.2.0 it stopped working. The issue is that it does not throw any exception; it just doesn't fill the RDD. {code:scala} def getInfo(sc: SparkContext, startDate:String, cachingValue: Int, sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): RDD[(String)] = { val scan = new Scan scan.addFamily(Bytes.toBytes("family")) scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time")) val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", scan, cachingValue, sparkLoggerParams) val output: RDD[(String)] = rdd.map { row => (Bytes.toString(row._2.getRow)) } output } def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: String, tableName: String, scan: Scan, cachingValue: Int, sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, Result] = { scan.setCaching(cachingValue) val scanString = Base64.getEncoder.encodeToString(org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray) val hbaseContext = new SparkHBaseContext(zkIP, zkPort) val hbaseConfig = hbaseContext.getConfiguration() hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName) hbaseConfig.set(TableInputFormat.SCAN, scanString) sc.newAPIHadoopRDD( hbaseConfig, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result] ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]] } {code} If we fetch using the scan directly, without using newAPIHadoopRDD, it works. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Issue Type: Improvement (was: Bug) > Change emptyValueInRead's effect to that any fields matching this string will > be set as "" when reading csv files > - > > Key: SPARK-37604 > URL: https://issues.apache.org/jira/browse/SPARK-37604 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Priority: Major > Attachments: empty_test.png > > > The csv data format is imported from databricks > [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with > PR [10766|https://github.com/apache/spark/pull/10766] . > {*}For the nullValue option{*}, according to features described in spark-csv > readme file, it's designed as: > {noformat} > When reading files: > nullValue: specifies a string that indicates a null value, any fields > matching this string will be set as nulls in the DataFrame > When writing files: > nullValue: specifies a string that indicates a null value, nulls in the > DataFrame will be written as this string. > {noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", > "NULL").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,NULL > {noformat} > When reading: > {code:scala} > spark.read.option("nullValue", "NULL").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|null| > We can find that null columns in dataframe can be saved as "NULL" strings in > csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as > null columns*{color} in dataframe. 
That is: > {noformat} > When writing, convert null(in dataframe) to nullValue(in csv) > When reading, convert nullValue or nothing(in csv) to null(in dataframe) > {noformat} > But actually, the nullValue option in the dependent component univocity's > {*}_CommonSettings_{*} is designed as: > {noformat} > when reading, if the parser does not read any character from the input, the > nullValue is used instead of an empty string. > when writing, if the writer has a null object to write to the output, the > nullValue is used instead of an empty string.{noformat} > {*}There is a difference when reading{*}. In univocity, empty input will > be converted to nullValue strings. But in Spark, we finally convert empty > input or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* > method: > {code:java} > private def nullSafeDatum( > datum: String, > name: String, > nullable: Boolean, > options: CSVOptions)(converter: ValueConverter): Any = { > if (datum == options.nullValue || datum == null) { > if (!nullable) { > throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) > } > null > } else { > converter.apply(datum) > } > } {code} > > From now on, we start to talk about emptyValue. > {*}For the emptyValue option{*}, we added an emptyValueInRead option for > reading and an emptyValueInWrite option for writing. I found that Spark keeps > the same behavior for emptyValue as univocity, that is: > {noformat} > When reading, if the parser does not read any character from the input, and > the input is within quotes, the emptyValue is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the > emptyValue is used instead of an empty string.{noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", > "EMPTY").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,EMPTY {noformat} > When reading: > {code:scala} > spark.read.option("emptyValue", "EMPTY").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|EMPTY| > We can find that empty columns in dataframe can be saved as "EMPTY" strings > in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be > parsed as empty columns{color}* in dataframe. That is: > {noformat} > When writing, convert "" empty(in dataframe) to emptyValue(in csv) > When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) > {noformat} > > There is an obvious difference between nullValue and emptyValue in read > handling. For nullValue, we will convert nothing or nullValue strings to null > in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty > strings) to emptyValue strings rather than to convert both "\"\""(quoted > empty strings) and emptyValue strings to ""(empty) in dataframe. > I think it's better
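The read-handling asymmetry described above can be sketched in plain Python. This is a hypothetical model of the proposed symmetric rule, not Spark's actual parser; the token encodings (None for "nothing read", '""' for a quoted empty field) and the option defaults are illustrative:

```python
def parse_field(token, null_value="NULL", empty_value="EMPTY"):
    """Hypothetical sketch of the proposed symmetric read behavior.

    token is the raw CSV field: None models "nothing was read"
    (an unquoted empty field), and '""' models a quoted empty string.
    """
    # nullValue handling is already symmetric in Spark: nothing read
    # or the nullValue string both become null.
    if token is None or token == null_value:
        return None
    # Proposed: quoted empty strings AND the emptyValue string both
    # become "" (today only the quoted-empty case maps this way).
    if token == '""' or token == empty_value:
        return ""
    return token

assert parse_field(None) is None
assert parse_field("NULL") is None
assert parse_field('""') == ""
assert parse_field("EMPTY") == ""   # the behavior this issue proposes
assert parse_field("Tesla") == "Tesla"
```

With this rule, reading back a file written with option("emptyValue", "EMPTY") would round-trip "" columns the same way nullValue round-trips nulls.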
[jira] [Updated] (SPARK-37060) Report driver status does not handle response from backup masters
[ https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-37060: - Fix Version/s: 3.1.3 3.2.1 > Report driver status does not handle response from backup masters > - > > Key: SPARK-37060 > URL: https://issues.apache.org/jira/browse/SPARK-37060 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Mohamadreza Rostami >Assignee: Mohamadreza Rostami >Priority: Critical > Fix For: 3.1.3, 3.2.1, 3.3.0 > > > After an improvement in SPARK-31486, the contributor uses the > 'asyncSendToMasterAndForwardReply' method instead of > 'activeMasterEndpoint.askSync' to get the driver's status. Since the > driver's status is only available on the active master and the > 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we > have to handle the responses from the backup masters in the client, which the > developer did not consider in the SPARK-31486 change. So drivers running in > cluster mode on a cluster with multiple masters are affected by this bug. I > created a patch for this bug and will send a pull request soon.
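Since the request is forwarded to every master but only the active one knows the driver, the client has to recognize and skip standby replies. A minimal stdlib sketch of that idea (the reply fields and function name are hypothetical, not Spark's actual RPC types):

```python
# Hypothetical sketch: a status request fanned out to all masters yields
# one reply per master; only the active master's reply carries the driver
# status, so replies from standby masters must be skipped, not treated
# as errors.
def pick_driver_status(replies):
    for reply in replies:
        if reply.get("found"):   # only the active master reports found=True
            return reply
    return None                  # every master that answered was a standby

replies = [
    {"master": "standby-1", "found": False},
    {"master": "active", "found": True, "state": "RUNNING"},
]
status = pick_driver_status(replies)
assert status is not None and status["state"] == "RUNNING"
assert pick_driver_status([{"master": "standby-1", "found": False}]) is None
```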
[jira] [Commented] (SPARK-37659) Fix FsHistoryProvider race condition between list and delete log info
[ https://issues.apache.org/jira/browse/SPARK-37659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460448#comment-17460448 ] Apache Spark commented on SPARK-37659: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34919 > Fix FsHistoryProvider race condition between list and delete log info > > > Key: SPARK-37659 > URL: https://issues.apache.org/jira/browse/SPARK-37659 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.2, 3.2.1, 3.3.0 >Reporter: XiDuo You >Priority: Major > > After SPARK-29043, FsHistoryProvider will list the log info without waiting > for all `mergeApplicationListing` tasks to finish. > However, the `LevelDBIterator` used to list log info is not thread safe if some > other thread deletes the related log info at the same time. > Here is the error message: > {code:java} > 21/12/15 14:12:02 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.util.NoSuchElementException: > 1^@__main__^@+hdfs://xxx/application_xxx.inprogress > at org.apache.spark.util.kvstore.LevelDB.get(LevelDB.java:132) > at > org.apache.spark.util.kvstore.LevelDBIterator.next(LevelDBIterator.java:137) > at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) > at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) > at > scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184) > at > scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47) > at 
scala.collection.TraversableLike.to(TraversableLike.scala:678) > at scala.collection.TraversableLike.to$(TraversableLike.scala:675) > at scala.collection.AbstractTraversable.to(Traversable.scala:108) > at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299) > at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299) > at scala.collection.AbstractTraversable.toList(Traversable.scala:108) > at > org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:588) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:299) > {code}
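The race can be reproduced in miniature with a plain dict standing in for the LevelDB store. This is an analogy of the failure mode and the "tolerate missing entries" remedy, not the actual FsHistoryProvider fix:

```python
# Miniature analogy of the race: one thread snapshots the keys ("list")
# while another deletes an entry before the values are read back. An
# unguarded lookup would raise (the NoSuchElementException in the report);
# a guarded lookup tolerates entries that vanish mid-iteration.
log_infos = {"app_1": "info-1", "app_2": "info-2"}

keys = list(log_infos)   # listing thread takes a snapshot of the keys
del log_infos["app_2"]   # cleanup thread deletes a log info concurrently

listed = []
for key in keys:
    info = log_infos.get(key)  # guarded: log_infos[key] would raise KeyError
    if info is not None:
        listed.append(info)

assert listed == ["info-1"]
```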
[jira] [Assigned] (SPARK-37659) Fix FsHistoryProvider race condition between list and delete log info
[ https://issues.apache.org/jira/browse/SPARK-37659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37659: Assignee: (was: Apache Spark) > Fix FsHistoryProvider race condition between list and delete log info > > > Key: SPARK-37659 > URL: https://issues.apache.org/jira/browse/SPARK-37659 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.2, 3.2.1, 3.3.0 >Reporter: XiDuo You >Priority: Major
[jira] [Assigned] (SPARK-37659) Fix FsHistoryProvider race condition between list and delete log info
[ https://issues.apache.org/jira/browse/SPARK-37659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37659: Assignee: Apache Spark > Fix FsHistoryProvider race condition between list and delete log info > > > Key: SPARK-37659 > URL: https://issues.apache.org/jira/browse/SPARK-37659 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.2, 3.2.1, 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major
[jira] [Updated] (SPARK-37656) Upgrade SBT to 1.5.7
[ https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37656: - Fix Version/s: 3.2.1 > Upgrade SBT to 1.5.7 > > > Key: SPARK-37656 > URL: https://issues.apache.org/jira/browse/SPARK-37656 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.1, 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > SBT 1.5.7 was released a few hours ago, which includes a fix for > CVE-2021-45046. > https://github.com/sbt/sbt/releases/tag/v1.5.7
[jira] [Resolved] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7
[ https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37658. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34917 [https://github.com/apache/spark/pull/34917] > Skip PIP packaging test if Python version is lower than 3.7 > --- > > Key: SPARK-37658 > URL: https://issues.apache.org/jira/browse/SPARK-37658 > Project: Spark > Issue Type: Bug > Components: Project Infra, PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > {code} > Writing pyspark-3.3.0.dev0/setup.cfg > Creating tar archive > removing 'pyspark-3.3.0.dev0' (and everything under it) > Installing dist into virtual env > Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python > pyspark requires Python '>=3.7' but the running Python is 3.6.8 > Cleaning up temporary directory - /tmp/tmp.CCragmNU1X > [error] running > /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; > received return code 1 > Attempting to post to GitHub... > {code} > After we drop Python 3.6 at SPARK-37632, PIP packaging test fails > intermittently > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console. > Apparently, different Python versions are installed. We should skip the test > in these cases.
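The version gate described above is a one-line comparison. A minimal sketch (the 3.7 threshold comes from the issue; the function name is illustrative, not the name used in Spark's test script):

```python
import sys

# Hypothetical sketch of the skip condition: the PIP packaging test should
# be skipped when the running interpreter is older than Python 3.7, since
# pyspark declares python_requires '>=3.7'.
def should_skip_pip_test(version_info=sys.version_info):
    return tuple(version_info[:2]) < (3, 7)

assert should_skip_pip_test((3, 6, 8)) is True   # the failing Jenkins case
assert should_skip_pip_test((3, 7, 0)) is False
assert should_skip_pip_test((3, 8, 0)) is False
```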
[jira] [Assigned] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7
[ https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37658: - Assignee: Hyukjin Kwon > Skip PIP packaging test if Python version is lower than 3.7 > --- > > Key: SPARK-37658 > URL: https://issues.apache.org/jira/browse/SPARK-37658 > Project: Spark > Issue Type: Bug > Components: Project Infra, PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major
[jira] [Updated] (SPARK-32294) GroupedData Pandas UDF 2Gb limit
[ https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32294: - Fix Version/s: 3.2.0 (was: 3.3.0) > GroupedData Pandas UDF 2Gb limit > > > Key: SPARK-32294 > URL: https://issues.apache.org/jira/browse/SPARK-32294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0, 3.1.0 >Reporter: Ruslan Dautkhanov >Priority: Major > Fix For: 3.2.0 > > > `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for > GroupedData, the whole group is passed to Pandas UDF at once, which can cause > various 2Gb limitations on Arrow side (and in current versions of Arrow, also > 2Gb limitation on Netty allocator side) - > https://issues.apache.org/jira/browse/ARROW-4890 > Would be great to consider feeding GroupedData into a pandas UDF in batches > to solve this issue. > cc [~hyukjin.kwon] >
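The suggested mitigation above is ordinary chunking. A stdlib sketch of the idea (the function name and batch size are illustrative; real grouped-UDF batching would slice Arrow record batches, not Python lists):

```python
# Hypothetical sketch: instead of handing a whole group to the pandas UDF
# at once, feed it in bounded batches so no single Arrow batch approaches
# the 2 GB limit, mirroring what maxRecordsPerBatch does elsewhere.
def iter_batches(rows, max_records_per_batch):
    for start in range(0, len(rows), max_records_per_batch):
        yield rows[start:start + max_records_per_batch]

rows = list(range(10))
batches = list(iter_batches(rows, 4))
assert batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# no rows are lost or duplicated by the batching
assert sum(len(b) for b in batches) == len(rows)
```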
[jira] [Created] (SPARK-37659) Fix FsHistoryProvider race condition between list and delete log info
XiDuo You created SPARK-37659: - Summary: Fix FsHistoryProvider race condition between list and delete log info Key: SPARK-37659 URL: https://issues.apache.org/jira/browse/SPARK-37659 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.1.2, 3.2.1, 3.3.0 Reporter: XiDuo You After SPARK-29043, FsHistoryProvider will list the log info without waiting for all `mergeApplicationListing` tasks to finish. However, the `LevelDBIterator` used to list log info is not thread safe if some other thread deletes the related log info at the same time. Here is the error message: {code:java} 21/12/15 14:12:02 ERROR FsHistoryProvider: Exception in checking for event log updates java.util.NoSuchElementException: 1^@__main__^@+hdfs://xxx/application_xxx.inprogress at org.apache.spark.util.kvstore.LevelDB.get(LevelDB.java:132) at org.apache.spark.util.kvstore.LevelDBIterator.next(LevelDBIterator.java:137) at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184) at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47) at scala.collection.TraversableLike.to(TraversableLike.scala:678) at scala.collection.TraversableLike.to$(TraversableLike.scala:675) at scala.collection.AbstractTraversable.to(Traversable.scala:108) at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299) at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299) at 
scala.collection.AbstractTraversable.toList(Traversable.scala:108) at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:588) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:299) {code}
[jira] [Resolved] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type
[ https://issues.apache.org/jira/browse/SPARK-34521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-34521. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34509 [https://github.com/apache/spark/pull/34509] > spark.createDataFrame does not support Pandas StringDtype extension type > > > Key: SPARK-34521 > URL: https://issues.apache.org/jira/browse/SPARK-34521 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1 >Reporter: Pavel Ganelin >Priority: Major > Fix For: 3.3.0 > > > The following test case demonstrates the problem: > {code:java} > import pandas as pd > from pyspark.sql import SparkSession, types > spark = SparkSession.builder.appName(__file__)\ > .config("spark.sql.execution.arrow.pyspark.enabled","true") \ > .getOrCreate() > good = pd.DataFrame([["abc"]], columns=["col"]) > schema = types.StructType([types.StructField("col", types.StringType(), > True)]) > df = spark.createDataFrame(good, schema=schema) > df.show() > bad = good.copy() > bad["col"]=bad["col"].astype("string") > schema = types.StructType([types.StructField("col", types.StringType(), > True)]) > df = spark.createDataFrame(bad, schema=schema) > df.show(){code} > The error: > {code:java} > C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: > UserWarning: createDataFrame attempted Arrow optimization because > 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed > by the reason below: > Cannot specify a mask or a size when passing an object that is converted > with the __arrow_array__ protocol. > Attempting non-optimization as > 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. > warnings.warn(msg) > {code}
[jira] [Commented] (SPARK-37483) Support push down top N to JDBC data source V2
[ https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460427#comment-17460427 ] Apache Spark commented on SPARK-37483: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34918 > Support push down top N to JDBC data source V2 > -- > > Key: SPARK-37483 > URL: https://issues.apache.org/jira/browse/SPARK-37483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major >
[jira] [Updated] (SPARK-37657) Fix the bug in ps.(Series|DataFrame).describe()
[ https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37657: - Description: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888] The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most recent call last): File "", line 1, in File "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. was: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most recent call last): File "", line 1, in File "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. > Fix the bug in ps.(Series|DataFrame).describe() > --- > > Key: SPARK-37657 > URL: https://issues.apache.org/jira/browse/SPARK-37657 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Initialized in Koalas issue: > [https://github.com/databricks/koalas/issues/1888] > > The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work > properly when DataFrame has no numeric column. 
> > > {code:java} > >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) > >>> df.describe() > Traceback (most recent call last): > File "", line 1, in > File "/.../python/pyspark/pandas/frame.py", line 7582, in describe > raise ValueError("Cannot describe a DataFrame without columns") > ValueError: Cannot describe a DataFrame without columns > {code} > > As it works fine in pandas, we should fix it.
[jira] [Updated] (SPARK-37657) Fix the bug in ps.(Series|DataFrame).describe()
[ https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37657: Summary: Fix the bug in ps.(Series|DataFrame).describe() (was: Fix the bug in ps.DataFrame.describe()) > Fix the bug in ps.(Series|DataFrame).describe() > --- > > Key: SPARK-37657 > URL: https://issues.apache.org/jira/browse/SPARK-37657 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Initialized in Koalas issue: > [https://github.com/databricks/koalas/issues/1888.] > > The `DataFrame.describe()` in pandas API on Spark doesn't work properly when > DataFrame has no numeric column. > > > {code:java} > >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) > >>> df.describe() > Traceback (most recent call last): > File "", line 1, in > File "/.../python/pyspark/pandas/frame.py", line 7582, in describe > raise ValueError("Cannot describe a DataFrame without columns") > ValueError: Cannot describe a DataFrame without columns > {code} > > As it works fine in pandas, we should fix it.
[jira] [Updated] (SPARK-37657) Fix the bug in ps.(Series|DataFrame).describe()
[ https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37657: Description: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most recent call last): File "", line 1, in File "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. was: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most recent call last): File "", line 1, in File "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. > Fix the bug in ps.(Series|DataFrame).describe() > --- > > Key: SPARK-37657 > URL: https://issues.apache.org/jira/browse/SPARK-37657 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Initialized in Koalas issue: > [https://github.com/databricks/koalas/issues/1888.] > > The `(Series|DataFrame).describe()` in pandas API on Spark doesn't work > properly when DataFrame has no numeric column. 
> > > {code:java} > >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) > >>> df.describe() > Traceback (most recent call last): > File "", line 1, in > File "/.../python/pyspark/pandas/frame.py", line 7582, in describe > raise ValueError("Cannot describe a DataFrame without columns") > ValueError: Cannot describe a DataFrame without columns > {code} > > As it works fine in pandas, we should fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
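The pandas behavior that the fix should match can be checked directly. This is a minimal sketch (plain pandas, not pandas-on-Spark), reusing the single-column reproducer from the report above:

```python
import pandas as pd

# pandas does not raise on an all-object DataFrame; instead of numeric
# statistics it falls back to object-column statistics.
df = pd.DataFrame({'a': ["a", "b", "c"]})
desc = df.describe()

# The summary rows for object columns are count/unique/top/freq.
print(desc.index.tolist())  # ['count', 'unique', 'top', 'freq']
```

So the expected behavior for ps.(Series|DataFrame).describe() on a frame with no numeric columns is this object-column summary rather than the ValueError shown above.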
[jira] [Resolved] (SPARK-37656) Upgrade SBT to 1.5.7
[ https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37656. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34915 [https://github.com/apache/spark/pull/34915] > Upgrade SBT to 1.5.7 > > > Key: SPARK-37656 > URL: https://issues.apache.org/jira/browse/SPARK-37656 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.1, 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > > SBT 1.5.7 was released a few hours ago, which includes a fix for > CVE-2021-45046. > https://github.com/sbt/sbt/releases/tag/v1.5.7 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37651: Assignee: Xinrong Meng > Use existing active Spark session in all places of pandas API on Spark > -- > > Key: SPARK-37651 > URL: https://issues.apache.org/jira/browse/SPARK-37651 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Use existing active Spark session instead of SparkSession.getOrCreate in all > places in pandas API on Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37651) Use existing active Spark session in all places of pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37651. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34906 [https://github.com/apache/spark/pull/34906] > Use existing active Spark session in all places of pandas API on Spark > -- > > Key: SPARK-37651 > URL: https://issues.apache.org/jira/browse/SPARK-37651 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Use existing active Spark session instead of SparkSession.getOrCreate in all > places in pandas API on Spark -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7
[ https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37658: Assignee: Apache Spark > Skip PIP packaging test if Python version is lower than 3.7 > --- > > Key: SPARK-37658 > URL: https://issues.apache.org/jira/browse/SPARK-37658 > Project: Spark > Issue Type: Bug > Components: Project Infra, PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > {code} > Writing pyspark-3.3.0.dev0/setup.cfg > Creating tar archive > removing 'pyspark-3.3.0.dev0' (and everything under it) > Installing dist into virtual env > Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python > pyspark requires Python '>=3.7' but the running Python is 3.6.8 > Cleaning up temporary directory - /tmp/tmp.CCragmNU1X > [error] running > /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; > received return code 1 > Attempting to post to GitHub... > {code} > After we drop Python 3.6 at SPARK-37632, PIP packaging test fails > intermittently > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console. > Apparently, different Python versions are installed. We should skip the test > in these cases. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7
[ https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37658: Assignee: (was: Apache Spark) > Skip PIP packaging test if Python version is lower than 3.7 > --- > > Key: SPARK-37658 > URL: https://issues.apache.org/jira/browse/SPARK-37658 > Project: Spark > Issue Type: Bug > Components: Project Infra, PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > Writing pyspark-3.3.0.dev0/setup.cfg > Creating tar archive > removing 'pyspark-3.3.0.dev0' (and everything under it) > Installing dist into virtual env > Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python > pyspark requires Python '>=3.7' but the running Python is 3.6.8 > Cleaning up temporary directory - /tmp/tmp.CCragmNU1X > [error] running > /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; > received return code 1 > Attempting to post to GitHub... > {code} > After we drop Python 3.6 at SPARK-37632, PIP packaging test fails > intermittently > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console. > Apparently, different Python versions are installed. We should skip the test > in these cases. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7
[ https://issues.apache.org/jira/browse/SPARK-37658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460397#comment-17460397 ] Apache Spark commented on SPARK-37658: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34917 > Skip PIP packaging test if Python version is lower than 3.7 > --- > > Key: SPARK-37658 > URL: https://issues.apache.org/jira/browse/SPARK-37658 > Project: Spark > Issue Type: Bug > Components: Project Infra, PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > Writing pyspark-3.3.0.dev0/setup.cfg > Creating tar archive > removing 'pyspark-3.3.0.dev0' (and everything under it) > Installing dist into virtual env > Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python > pyspark requires Python '>=3.7' but the running Python is 3.6.8 > Cleaning up temporary directory - /tmp/tmp.CCragmNU1X > [error] running > /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; > received return code 1 > Attempting to post to GitHub... > {code} > After we drop Python 3.6 at SPARK-37632, PIP packaging test fails > intermittently > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console. > Apparently, different Python versions are installed. We should skip the test > in these cases. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37658) Skip PIP packaging test if Python version is lower than 3.7
Hyukjin Kwon created SPARK-37658: Summary: Skip PIP packaging test if Python version is lower than 3.7 Key: SPARK-37658 URL: https://issues.apache.org/jira/browse/SPARK-37658 Project: Spark Issue Type: Bug Components: Project Infra, PySpark Affects Versions: 3.3.0 Reporter: Hyukjin Kwon {code} Writing pyspark-3.3.0.dev0/setup.cfg Creating tar archive removing 'pyspark-3.3.0.dev0' (and everything under it) Installing dist into virtual env Obtaining file:///home/jenkins/workspace/SparkPullRequestBuilder%402/python pyspark requires Python '>=3.7' but the running Python is 3.6.8 Cleaning up temporary directory - /tmp/tmp.CCragmNU1X [error] running /home/jenkins/workspace/SparkPullRequestBuilder@2/dev/run-pip-tests ; received return code 1 Attempting to post to GitHub... {code} After we drop Python 3.6 at SPARK-37632, PIP packaging test fails intermittently https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146255/console. Apparently, different Python versions are installed. We should skip the test in these cases. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
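The kind of guard proposed here can be sketched as follows. This is a hypothetical helper (the actual change lives in the dev/run-pip-tests shell script); the `(3, 7)` floor comes from the `pyspark requires Python '>=3.7'` line in the log above:

```python
import sys

# pyspark declares a Python floor of 3.7, so the PIP packaging test can
# only succeed on interpreters at or above that version.
MIN_PYTHON = (3, 7)

def should_run_pip_test(version_info=sys.version_info):
    """Return True when the interpreter satisfies pyspark's version floor."""
    # Compare only (major, minor); the micro version is irrelevant here.
    return tuple(version_info[:2]) >= MIN_PYTHON
```

On the Jenkins worker from the log above (Python 3.6.8) this returns False, so the packaging test would be skipped instead of failing.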
[jira] [Resolved] (SPARK-37632) Drop code targetting Python < 3.7
[ https://issues.apache.org/jira/browse/SPARK-37632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37632. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34886 [https://github.com/apache/spark/pull/34886] > Drop code targetting Python < 3.7 > - > > Key: SPARK-37632 > URL: https://issues.apache.org/jira/browse/SPARK-37632 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > We should drop parts of code that target Python 3.6 (or less if these are > still present). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35355) improve execution performance in insert...select...limit case
[ https://issues.apache.org/jira/browse/SPARK-35355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460375#comment-17460375 ] Apache Spark commented on SPARK-35355: -- User 'yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/34916 > improve execution performance in insert...select...limit case > -- > > Key: SPARK-35355 > URL: https://issues.apache.org/jira/browse/SPARK-35355 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: yikf >Priority: Minor > Fix For: 3.0.0 > > > In the case of `insert into...select...limit` , `CollectLimitExec` has better > execution performance than `GlobalLimit` . > Before: > {code:java} > == Physical Plan == > Execute InsertIntoHadoopFsRelationCommand ... > +- *(2) GlobalLimit 5 > +- Exchange SinglePartition, true, id=#39 > +- *(1) LocalLimit 5 > +- *(1) ColumnarToRow > +- FileScan ... > {code} > After: > {code:java} > == Physical Plan == > Execute InsertIntoHadoopFsRelationCommand ... > +- CollectLimit 5 > +- *(1) ColumnarToRow > +- FileScan > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
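The advantage of CollectLimit over GlobalLimit plus a single-partition exchange can be illustrated with a toy model of partitioned data. This is a sketch of the idea only, not Spark's actual executor code; the partition contents are made up:

```python
# Toy partitions standing in for the output of the FileScan in the plans above.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def collect_limit(parts, n):
    """CollectLimit-style evaluation: pull rows partition by partition and
    stop as soon as n rows are gathered, instead of first shuffling every
    partition's rows into a single partition as the Exchange in the
    GlobalLimit plan does."""
    out = []
    for part in parts:
        for row in part:
            out.append(row)
            if len(out) == n:
                return out  # early exit: later partitions are never touched
    return out

print(collect_limit(partitions, 5))  # [1, 2, 3, 4, 5]
```

In this model the third partition is never read for a limit of 5, which mirrors why the CollectLimit plan avoids the cost of the `Exchange SinglePartition` step.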
[jira] [Assigned] (SPARK-37632) Drop code targetting Python < 3.7
[ https://issues.apache.org/jira/browse/SPARK-37632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37632: Assignee: Maciej Szymkiewicz > Drop code targetting Python < 3.7 > - > > Key: SPARK-37632 > URL: https://issues.apache.org/jira/browse/SPARK-37632 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > We should drop parts of code that target Python 3.6 (or less if these are > still present). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37657) Fix the bug in ps.DataFrame.describe()
[ https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37657: Description: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most recent call last): File "", line 1, in File "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. was: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most recent call last): File "", line 1, in File "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. > Fix the bug in ps.DataFrame.describe() > -- > > Key: SPARK-37657 > URL: https://issues.apache.org/jira/browse/SPARK-37657 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Initialized in Koalas issue: > [https://github.com/databricks/koalas/issues/1888.] > > The `DataFrame.describe()` in pandas API on Spark doesn't work properly when > DataFrame has no numeric column. 
> > > {code:java} > >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) > >>> df.describe() > Traceback (most recent call last): > File "", line 1, in > File "/.../python/pyspark/pandas/frame.py", line 7582, in describe > raise ValueError("Cannot describe a DataFrame without columns") > ValueError: Cannot describe a DataFrame without columns > {code} > > As it works fine in pandas, we should fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37657) Fix the bug in ps.DataFrame.describe()
[ https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37657: Description: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most recent call last): File "", line 1, in File "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. was: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ks.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most >>> recent call last): File "", line 1, in File >>> "/.../koalas/databricks/koalas/frame.py", line 7582, in describe raise >>> ValueError("Cannot describe a DataFrame without columns") ValueError: >>> Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. > Fix the bug in ps.DataFrame.describe() > -- > > Key: SPARK-37657 > URL: https://issues.apache.org/jira/browse/SPARK-37657 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Initialized in Koalas issue: > [https://github.com/databricks/koalas/issues/1888.] > > The `DataFrame.describe()` in pandas API on Spark doesn't work properly when > DataFrame has no numeric column. 
> > > {code:java} > >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) > >>> df.describe() > Traceback (most recent call last): File "", line 1, in File > "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise > ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot > describe a DataFrame without columns > {code} > > As it works fine in pandas, we should fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37657) Fix the bug in ps.DataFrame.describe()
[ https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37657: Description: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most recent call last): File "", line 1, in File "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. was: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most recent call last): File "", line 1, in File "/.../python/pyspark/pandas/frame.py", line 7582, in describe raise ValueError("Cannot describe a DataFrame without columns") ValueError: Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. > Fix the bug in ps.DataFrame.describe() > -- > > Key: SPARK-37657 > URL: https://issues.apache.org/jira/browse/SPARK-37657 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Initialized in Koalas issue: > [https://github.com/databricks/koalas/issues/1888.] > > The `DataFrame.describe()` in pandas API on Spark doesn't work properly when > DataFrame has no numeric column. 
> > > {code:java} > >>> df = ps.DataFrame({'a': ["a", "b", "c"]}) > >>> df.describe() > Traceback (most recent call last): > File "", line 1, in > File "/.../python/pyspark/pandas/frame.py", line 7582, in describe > raise ValueError("Cannot describe a DataFrame without columns") > ValueError: Cannot describe a DataFrame without columns > {code} > > As it works fine in pandas, we should fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37657) Fix the bug in ps.DataFrame.describe()
[ https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460366#comment-17460366 ] Haejoon Lee commented on SPARK-37657: - I'm working on this > Fix the bug in ps.DataFrame.describe() > -- > > Key: SPARK-37657 > URL: https://issues.apache.org/jira/browse/SPARK-37657 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Initialized in Koalas issue: > [https://github.com/databricks/koalas/issues/1888.] > > The `DataFrame.describe()` in pandas API on Spark doesn't work properly when > DataFrame has no numeric column. > > > {code:java} > >>> df = ks.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback > >>> (most recent call last): File "", line 1, in File > >>> "/.../koalas/databricks/koalas/frame.py", line 7582, in describe raise > >>> ValueError("Cannot describe a DataFrame without columns") ValueError: > >>> Cannot describe a DataFrame without columns > {code} > > As it works fine in pandas, we should fix it. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37657) Fix the bug in ps.DataFrame.describe()
[ https://issues.apache.org/jira/browse/SPARK-37657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37657: Description: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. {code:java} >>> df = ks.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback (most >>> recent call last): File "", line 1, in File >>> "/.../koalas/databricks/koalas/frame.py", line 7582, in describe raise >>> ValueError("Cannot describe a DataFrame without columns") ValueError: >>> Cannot describe a DataFrame without columns {code} As it works fine in pandas, we should fix it. was: Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when DataFrame has no numeric column. As it's work fine in pandas, we should fix it. > Fix the bug in ps.DataFrame.describe() > -- > > Key: SPARK-37657 > URL: https://issues.apache.org/jira/browse/SPARK-37657 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Initialized in Koalas issue: > [https://github.com/databricks/koalas/issues/1888.] > > The `DataFrame.describe()` in pandas API on Spark doesn't work properly when > DataFrame has no numeric column. > > > {code:java} > >>> df = ks.DataFrame({'a': ["a", "b", "c"]}) >>> df.describe() Traceback > >>> (most recent call last): File "", line 1, in File > >>> "/.../koalas/databricks/koalas/frame.py", line 7582, in describe raise > >>> ValueError("Cannot describe a DataFrame without columns") ValueError: > >>> Cannot describe a DataFrame without columns > {code} > > As it works fine in pandas, we should fix it. 
[jira] [Created] (SPARK-37657) Fix the bug in ps.DataFrame.describe()
Haejoon Lee created SPARK-37657: --- Summary: Fix the bug in ps.DataFrame.describe() Key: SPARK-37657 URL: https://issues.apache.org/jira/browse/SPARK-37657 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Haejoon Lee Initialized in Koalas issue: [https://github.com/databricks/koalas/issues/1888.] The `DataFrame.describe()` in pandas API on Spark doesn't work properly when the DataFrame has no numeric column. As it works fine in pandas, we should fix it.
[jira] [Assigned] (SPARK-37656) Upgrade SBT to 1.5.7
[ https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37656: Assignee: Apache Spark (was: Kousuke Saruta) > Upgrade SBT to 1.5.7 > > > Key: SPARK-37656 > URL: https://issues.apache.org/jira/browse/SPARK-37656 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.1, 3.3.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > SBT 1.5.7 was released a few hours ago, which includes a fix for > CVE-2021-45046. > https://github.com/sbt/sbt/releases/tag/v1.5.7 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37656) Upgrade SBT to 1.5.7
[ https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37656: Assignee: Kousuke Saruta (was: Apache Spark) > Upgrade SBT to 1.5.7 > > > Key: SPARK-37656 > URL: https://issues.apache.org/jira/browse/SPARK-37656 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.1, 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > SBT 1.5.7 was released a few hours ago, which includes a fix for > CVE-2021-45046. > https://github.com/sbt/sbt/releases/tag/v1.5.7 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37656) Upgrade SBT to 1.5.7
[ https://issues.apache.org/jira/browse/SPARK-37656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460357#comment-17460357 ] Apache Spark commented on SPARK-37656: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34915 > Upgrade SBT to 1.5.7 > > > Key: SPARK-37656 > URL: https://issues.apache.org/jira/browse/SPARK-37656 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.1, 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > SBT 1.5.7 was released a few hours ago, which includes a fix for > CVE-2021-45046. > https://github.com/sbt/sbt/releases/tag/v1.5.7 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37656) Upgrade SBT to 1.5.7
Kousuke Saruta created SPARK-37656: -- Summary: Upgrade SBT to 1.5.7 Key: SPARK-37656 URL: https://issues.apache.org/jira/browse/SPARK-37656 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.2.1, 3.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta SBT 1.5.7 was released a few hours ago, which includes a fix for CVE-2021-45046. https://github.com/sbt/sbt/releases/tag/v1.5.7 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34198) Add RocksDB StateStore implementation
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34198: - Shepherd: L. C. Hsieh > Add RocksDB StateStore implementation > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Yuanjian Li >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
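The provider this ticket tracks shipped in Spark 3.2.0 and is selected through configuration. Shown here as a spark-defaults.conf fragment for context (a sketch of the switch, not part of this ticket's text):

```
spark.sql.streaming.stateStore.providerClass  org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
```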
[jira] [Assigned] (SPARK-34198) Add RocksDB StateStore implementation
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34198: Assignee: Yuanjian Li (was: Apache Spark) > Add RocksDB StateStore implementation > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Yuanjian Li >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37483) Support push down top N to JDBC data source V2
[ https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37483: Assignee: (was: Apache Spark) > Support push down top N to JDBC data source V2 > -- > > Key: SPARK-37483 > URL: https://issues.apache.org/jira/browse/SPARK-37483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major >
[jira] [Assigned] (SPARK-37483) Support push down top N to JDBC data source V2
[ https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37483: Assignee: Apache Spark > Support push down top N to JDBC data source V2 > -- > > Key: SPARK-37483 > URL: https://issues.apache.org/jira/browse/SPARK-37483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major >
[jira] [Updated] (SPARK-37483) Support push down top N to JDBC data source V2
[ https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37483: - Fix Version/s: (was: 3.3.0)
[jira] [Reopened] (SPARK-37483) Support push down top N to JDBC data source V2
[ https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-37483: -- Assignee: (was: jiaan.geng) Reverted at https://github.com/apache/spark/commit/16841286ecb31446300c81e3f98516e92846d9fa
[jira] [Commented] (SPARK-37627) Add sorted column in BucketTransform
[ https://issues.apache.org/jira/browse/SPARK-37627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460331#comment-17460331 ] Apache Spark commented on SPARK-37627: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34914 > Add sorted column in BucketTransform > > > Key: SPARK-37627 > URL: https://issues.apache.org/jira/browse/SPARK-37627 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.3.0 > > > In V1, we can create a table with sorted buckets as follows: > {code:java} > sql("CREATE TABLE tbl(a INT, b INT) USING parquet " + > "CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS") > {code} > However, creating a table with sorted buckets in V2 fails with an exception: > {code:java} > org.apache.spark.sql.AnalysisException: Cannot convert bucketing with sort > columns to a transform. > {code} > We should be able to create tables with sorted buckets in V2.
[jira] [Commented] (SPARK-37627) Add sorted column in BucketTransform
[ https://issues.apache.org/jira/browse/SPARK-37627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460329#comment-17460329 ] Apache Spark commented on SPARK-37627: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/34914
[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit
[ https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460321#comment-17460321 ] Samved Chandrakant Divekar commented on SPARK-37630: A new CVE that specifically affects log4j 1.2.x has popped up. It's similar to the one being discussed above. https://nvd.nist.gov/vuln/detail/CVE-2021-4104 > Security issue from Log4j 1.X exploit > - > > Key: SPARK-37630 > URL: https://issues.apache.org/jira/browse/SPARK-37630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.2.0 >Reporter: Ismail H >Priority: Major > Labels: security > > log4j is being used in version 1.2.17 > > This version has been deprecated and since then has a known issue that > hasn't been addressed in 1.X > versions|https://www.cvedetails.com/cve/CVE-2019-17571/]. > > *Solution:* > * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last > known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]
[jira] [Updated] (SPARK-37655) Add RocksDB Implementation for KVStore
[ https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37655: -- Parent: SPARK-33772 Issue Type: Sub-task (was: Improvement) > Add RocksDB Implementation for KVStore > -- > > Key: SPARK-37655 > URL: https://issues.apache.org/jira/browse/SPARK-37655 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-37655) Add RocksDB Implementation for KVStore
[ https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37655: - Assignee: Dongjoon Hyun (was: Apache Spark)
[jira] [Updated] (SPARK-37653) Upgrade RoaringBitmap to 0.9.23
[ https://issues.apache.org/jira/browse/SPARK-37653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37653: -- Parent: SPARK-33772 Issue Type: Sub-task (was: Improvement) > Upgrade RoaringBitmap to 0.9.23 > --- > > Key: SPARK-37653 > URL: https://issues.apache.org/jira/browse/SPARK-37653 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > >
[jira] [Assigned] (SPARK-37653) Upgrade RoaringBitmap to 0.9.23
[ https://issues.apache.org/jira/browse/SPARK-37653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37653: - Assignee: Yang Jie
[jira] [Resolved] (SPARK-37653) Upgrade RoaringBitmap to 0.9.23
[ https://issues.apache.org/jira/browse/SPARK-37653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37653. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34909 [https://github.com/apache/spark/pull/34909]
[jira] [Commented] (SPARK-37655) Add RocksDB Implementation for KVStore
[ https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460284#comment-17460284 ] Apache Spark commented on SPARK-37655: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/34913
[jira] [Assigned] (SPARK-37655) Add RocksDB Implementation for KVStore
[ https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37655: Assignee: Apache Spark
[jira] [Assigned] (SPARK-37655) Add RocksDB Implementation for KVStore
[ https://issues.apache.org/jira/browse/SPARK-37655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37655: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-37655) Add RocksDB Implementation for KVStore
Dongjoon Hyun created SPARK-37655: - Summary: Add RocksDB Implementation for KVStore Key: SPARK-37655 URL: https://issues.apache.org/jira/browse/SPARK-37655 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Dongjoon Hyun
[jira] [Commented] (SPARK-37154) Inline type hints for python/pyspark/rdd.py
[ https://issues.apache.org/jira/browse/SPARK-37154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460243#comment-17460243 ] Byron Hsu commented on SPARK-37154: --- I am looking into this one > Inline type hints for python/pyspark/rdd.py > --- > > Key: SPARK-37154 > URL: https://issues.apache.org/jira/browse/SPARK-37154 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major >
[jira] [Updated] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37633: - Affects Version/s: (was: 3.0.3) > Unwrap cast should skip if downcast failed with ansi enabled > > > Key: SPARK-37633 > URL: https://issues.apache.org/jira/browse/SPARK-37633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.2.1, 3.3.0 > > > Currently, unwrap cast throws an ArithmeticException if the downcast fails with > ANSI enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, it > should always skip on failure regardless of the ANSI config.
[jira] [Assigned] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-37633: Assignee: Manu Zhang
[jira] [Resolved] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
[ https://issues.apache.org/jira/browse/SPARK-37633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-37633. -- Fix Version/s: 3.3.0 3.2.1 Resolution: Fixed Issue resolved by pull request 34888 [https://github.com/apache/spark/pull/34888]
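The fix above amounts to "try the downcast, and if it would overflow, leave the predicate alone." The sketch below illustrates that skip-on-failure idea in Python; the helper names are hypothetical and this is not Spark's actual UnwrapCastInBinaryComparison rule, which operates on Catalyst expressions:

```python
# Hypothetical sketch of the idea behind SPARK-37633: when unwrapping
# 'cast(intCol AS BIGINT) = <long literal>', the optimizer tries to downcast
# the literal to int; if that would overflow, the rule must skip the rewrite
# (keep the original predicate) instead of throwing, regardless of ANSI mode.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def try_downcast_to_int(value):
    """Return the int-typed literal, or None when the downcast would overflow."""
    return value if INT_MIN <= value <= INT_MAX else None

def unwrap_cast_predicate(literal):
    """Rewrite 'cast(col as long) = literal' to 'col = downcast(literal)' when safe."""
    down = try_downcast_to_int(literal)
    if down is None:
        return ("cast_eq", literal)  # downcast failed: skip, keep original predicate
    return ("eq", down)              # rewritten predicate on the narrower type

print(unwrap_cast_predicate(42))     # ("eq", 42)
print(unwrap_cast_predicate(2**31))  # ("cast_eq", 2147483648) -> rule skipped
```

Before the fix, the overflow case raised an ArithmeticException under ANSI instead of taking the skip branch.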
[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ] Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM: For this code: {code:scala} val data = Seq(("Tesla", "")).toDF("make", "comment") data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty") {code} The csv file's content is: {noformat} Tesla,EMPTY {noformat} (cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv) When I read it back into a dataframe: {code:scala} spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show() {code} I want the column *comment* to be "" rather than an "EMPTY" string. !empty_test.png|width=701,height=286! For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, it's irreversible. FYI. [~maxgekk]
> Change emptyValueInRead's effect to that any fields matching this string will > be set as "" when reading csv files > - > > Key: SPARK-37604 > URL: https://issues.apache.org/jira/browse/SPARK-37604 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Priority: Major > Attachments: empty_test.png > > > The csv data format was imported from databricks > [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with > PR [10766|https://github.com/apache/spark/pull/10766]. > {*}For the nullValue option{*}, according to the features described in the spark-csv > readme file, it's designed as: > {noformat} > When reading files: > nullValue: specifies a string that indicates a null value, any fields > matching this string will be set as nulls in the DataFrame > When writing files: > nullValue: specifies a string that indicates a null value, nulls in the > DataFrame will be written as this string. > {noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", > "NULL").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,NULL > {noformat} > When reading: > {code:scala} > spark.read.option("nullValue", "NULL").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|null| > We can find that null columns in the dataframe can be saved as "NULL" strings in > csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as > null columns*{color} in the dataframe. That is: > {noformat} > When writing, convert null (in dataframe) to nullValue (in csv) > When reading, convert nullValue or nothing (in csv) to null (in dataframe) > {noformat} > But actually, the nullValue option in the underlying univocity component's > {*}_CommonSettings_{*} is designed so that: > {noformat} > when reading, if the parser does not read any character from the input, the > nullValue is used instead of an empty string. > when writing, if the writer has a null object to write to the output, the > nullValue is used instead of an empty string.{noformat} > {*}There is a difference when reading{*}. In univocity, empty content will > be converted to nullValue strings. But in Spark, we finally convert empty > content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* > method: > {code:java} > private def nullSafeDatum( > datum: String, > name: String, > nullable: Boolean, > options: CSVOptions)(converter: ValueConverter): Any = { > if (datum == options.nullValue || datum == null) { > if (!nullable) { > throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) > } > null > } else { > converter.apply(datum) > } > } {code} > > From now on, we start to talk about emptyValue. > {*}For the emptyValue option{*}, we added an emptyValueInRead option for > reading and an emptyValueInWrite option for writing. I found that Spark keeps > the same behaviors for emptyValue with
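The asymmetry described in this report can be illustrated with a short Python sketch. The `parse_field_*` helpers below are hypothetical, modeled on the quoted `nullSafeDatum` logic rather than Spark's actual UnivocityParser: on read, a field matching `nullValue` becomes null, but a field matching `emptyValue` is left as the literal `emptyValue` string, so the write-then-read round trip is not reversible.

```python
# Hypothetical sketch of the nullValue/emptyValue asymmetry (not Spark's parser).
def parse_field_current(datum, null_value="NULL", empty_value="EMPTY"):
    """Current behavior: nullValue is unwound on read, emptyValue is not."""
    if datum is None or datum == null_value:
        return None          # mirrors nullSafeDatum: nullValue round-trips
    return datum             # "EMPTY" stays "EMPTY": emptyValue does not round-trip

def parse_field_proposed(datum, null_value="NULL", empty_value="EMPTY"):
    """Proposed behavior: fields matching emptyValue are set back to ""."""
    if datum is None or datum == null_value:
        return None
    if datum == empty_value:
        return ""            # emptyValue maps back to the empty string
    return datum

row = ["Tesla", "EMPTY"]     # what the writer produced for ("Tesla", "")
print([parse_field_current(f) for f in row])   # ['Tesla', 'EMPTY']
print([parse_field_proposed(f) for f in row])  # ['Tesla', '']
```

The proposed variant makes `emptyValueInRead` symmetric with `nullValue`, which is what the issue title asks for.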
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Attachment: (was: image-2021-12-16-01-57-55-864.png) > Change emptyValueInRead's effect to that any fields matching this string will > be set as "" when reading csv files > - > > Key: SPARK-37604 > URL: https://issues.apache.org/jira/browse/SPARK-37604 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Priority: Major > Attachments: empty_test.png > > > The csv data format is imported from databricks > [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with > PR [10766|https://github.com/apache/spark/pull/10766] . > {*}For the nullValue option{*}, according to features described in spark-csv > readme file, it's designed as: > {noformat} > When reading files: > nullValue: specifies a string that indicates a null value, any fields > matching this string will be set as nulls in the DataFrame > When writing files: > nullValue: specifies a string that indicates a null value, nulls in the > DataFrame will be written as this string. > {noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", > "NULL").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,NULL > {noformat} > When reading: > {code:scala} > spark.read.option("nullValue", "NULL").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|null| > We can find that null columns in dataframe can be saved as "NULL" strings in > csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as > null columns*{color} in dataframe. 
That is: > {noformat} > When writing, convert null (in dataframe) to nullValue (in csv) > When reading, convert nullValue or nothing (in csv) to null (in dataframe) > {noformat} > But actually, the nullValue option in the underlying univocity component's > {*}_CommonSettings_{*} is designed so that: > {noformat} > when reading, if the parser does not read any character from the input, the > nullValue is used instead of an empty string. > when writing, if the writer has a null object to write to the output, the > nullValue is used instead of an empty string.{noformat} > {*}There is a difference when reading{*}. In univocity, empty content will > be converted to nullValue strings. But in Spark, we finally convert empty > content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* > method: > {code:java} > private def nullSafeDatum( > datum: String, > name: String, > nullable: Boolean, > options: CSVOptions)(converter: ValueConverter): Any = { > if (datum == options.nullValue || datum == null) { > if (!nullable) { > throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) > } > null > } else { > converter.apply(datum) > } > } {code} > > From now on, we start to talk about emptyValue. > {*}For the emptyValue option{*}, we added an emptyValueInRead option for > reading and an emptyValueInWrite option for writing. I found that Spark keeps > the same behaviors for emptyValue with univocity, that is: > {noformat} > When reading, if the parser does not read any character from the input, and > the input is within quotes, the emptyValue is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the > emptyValue is used instead of an empty string.{noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", > "EMPTY").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,EMPTY {noformat} > When reading: > {code:scala} > spark.read.option("emptyValue", "EMPTY").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|EMPTY| > We can find that empty columns in the dataframe can be saved as "EMPTY" strings > in csv files, *{color:#de350b}but "EMPTY" strings in csv files cannot be > parsed as empty columns{color}* in the dataframe. That is: > {noformat} > When writing, convert "" empty (in dataframe) to emptyValue (in csv) > When reading, convert "\"\"" quoted empty strings to emptyValue (in dataframe) > {noformat} > > There is an obvious difference between nullValue and emptyValue in read > handling. For nullValue, we convert nothing or nullValue strings to null > in the dataframe, but for emptyValue, we only try to convert "\"\"" (quoted > empty strings) to emptyValue strings rather than converting both "\"\"" (quoted > empty strings) and emptyValue strings to "" (empty) in the dataframe. > I think
[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ] Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:04 PM: For codes: {code:scala} val data = Seq(("Tesla", "")).toDF("make", "comment") data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty") {code} The csv file's content is as: {noformat} Tesla,EMPTY {noformat} (cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv) When I read it back to dataframe: {code:scala} spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show() {code} I want the column *comment* is "" rather a "EMPTY" string. !image-2021-12-16-01-57-55-864.png|width=424,height=173! For null values, we can write and read back with the same nullValue option, but for empty strings, even with same emptyValue option, it's irreversible. FYI. [~maxgekk] was (Author: wayne guo): For codes: {code:scala} val data = Seq(("Tesla", "")).toDF("make", "comment") data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty") {code} The csv file's content is as: {noformat} Tesla,EMPTY {noformat} (cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv) When I read it back to dataframe: {code:scala} spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show() {code} I want the column *comment* is "" rather a "EMPTY" string. !image-2021-12-16-01-57-55-864.png|width=424,height=173! For null values, we can write and read back with the same nullValue option, but for empty strings, even with same emptyValue option, it's irreversible. FYI. 
[~maxgekk] > Change emptyValueInRead's effect to that any fields matching this string will > be set as "" when reading csv files > - > > Key: SPARK-37604 > URL: https://issues.apache.org/jira/browse/SPARK-37604 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Priority: Major > Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png > > > The csv data format is imported from databricks > [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with > PR [10766|https://github.com/apache/spark/pull/10766] . > {*}For the nullValue option{*}, according to features described in spark-csv > readme file, it's designed as: > {noformat} > When reading files: > nullValue: specifies a string that indicates a null value, any fields > matching this string will be set as nulls in the DataFrame > When writing files: > nullValue: specifies a string that indicates a null value, nulls in the > DataFrame will be written as this string. > {noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", > "NULL").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,NULL > {noformat} > When reading: > {code:scala} > spark.read.option("nullValue", "NULL").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|null| > We can find that null columns in dataframe can be saved as "NULL" strings in > csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as > null columns*{color} in dataframe. 
That is: > {noformat} > When writing, convert null(in dataframe) to nullValue(in csv) > When reading, convert nullValue or nothing(in csv) to null(in dataframe) > {noformat} > But actually, the nullValue option in the underlying univocity component's > {*}_CommonSettings_{*} is designed as: > {noformat} > when reading, if the parser does not read any character from the input, the > nullValue is used instead of an empty string. > when writing, if the writer has a null object to write to the output, the > nullValue is used instead of an empty string.{noformat} > {*}There is a difference when reading{*}. In univocity, missing content will > be converted to nullValue strings. But in Spark, we finally convert missing > content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* > method: > {code:java} > private def nullSafeDatum( > datum: String, > name: String, > nullable: Boolean, > options: CSVOptions)(converter: ValueConverter): Any = { > if (datum == options.nullValue || datum == null) { > if (!nullable) { > throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) > } > null > } else { > converter.apply(datum) > } > } {code} > > Now we turn to emptyValue. > {*}For the emptyValue option{*}, we add an emptyValueInRead option for > reading and a
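The round-trip asymmetry described in this issue can be sketched in a short spark-shell session. This is an illustration only: it assumes an active spark-shell (so {{spark}} and its implicits are available), and the output path is arbitrary.

```scala
// Sketch of the nullValue vs. emptyValue round trip, assuming a spark-shell
// session with spark.implicits._ in scope; the path is illustrative.
import spark.implicits._

val path = "/tmp/roundtrip_test"

// null round-trips: null -> "NULL" on write, "NULL" -> null on read.
Seq(("Tesla", null: String)).toDF("make", "comment")
  .write.mode("overwrite").option("nullValue", "NULL").csv(path)
spark.read.option("nullValue", "NULL").csv(path).show()

// "" does not round-trip: "" -> "EMPTY" on write, but on read the field
// stays the literal "EMPTY" instead of becoming "" again.
Seq(("Tesla", "")).toDF("make", "comment")
  .write.mode("overwrite").option("emptyValue", "EMPTY").csv(path)
spark.read.option("emptyValue", "EMPTY").csv(path).show()
```

Per the report, the first read shows null in the comment column, while the second keeps the literal "EMPTY" string, which is the irreversibility being discussed.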
[jira] [Updated] (SPARK-37654) Regression - NullPointerException in Row.getSeq when field null
[ https://issues.apache.org/jira/browse/SPARK-37654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Dahler updated SPARK-37654: --- Description: h2. Description A NullPointerException occurs in _org.apache.spark.sql.Row.getSeq(int)_ if the row contains a _null_ value at the requested index. {code:java} java.lang.NullPointerException at org.apache.spark.sql.Row.getSeq(Row.scala:319) at org.apache.spark.sql.Row.getSeq$(Row.scala:319) at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166) at org.apache.spark.sql.Row.getList(Row.scala:327) at org.apache.spark.sql.Row.getList$(Row.scala:326) at org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166) ... {code} Prior to 3.1.1, the code would not throw an exception and instead would return a null _Seq_ instance. h2. Reproduction # Start a new spark-shell instance # Execute the following script: {code:scala} import org.apache.spark.sql.Row Row(Seq("value")).getSeq(0) Row(Seq()).getSeq(0) Row(null).getSeq(0) {code} h3. Expected Output res2 outputs a _null_ value. {code:java} scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> scala> Row(Seq("value")).getSeq(0) res0: Seq[Nothing] = List(value) scala> Row(Seq()).getSeq(0) res1: Seq[Nothing] = List() scala> Row(null).getSeq(0) res2: Seq[Nothing] = null {code} h3. Actual Output res2 throws a NullPointerException. {code:java} scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> scala> Row(Seq("value")).getSeq(0) res0: Seq[Nothing] = List(value) scala> Row(Seq()).getSeq(0) res1: Seq[Nothing] = List() scala> Row(null).getSeq(0) java.lang.NullPointerException at org.apache.spark.sql.Row.getSeq(Row.scala:319) at org.apache.spark.sql.Row.getSeq$(Row.scala:319) at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166) ... 47 elided {code} h3. 
Environments Tested Tested against the following releases using the provided reproduction steps: # spark-3.0.3-bin-hadoop2.7 - Succeeded {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.3 /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) {code} # spark-3.1.2-bin-hadoop3.2 - Failed {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.2 /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) {code} # spark-3.2.0-bin-hadoop3.2 - Failed {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) {code} h2. Regression Source The regression appears to have been introduced in [25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb|https://github.com/apache/spark/commit/25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb#diff-722324a11a0e4635a59a9debc962da2c1678d86702a9a106fd0d51188f83853bR317], which addressed [SPARK-32526|https://issues.apache.org/jira/browse/SPARK-32526] h2. Work Around This regression can be worked around by using _Row.isNullAt(int)_ and handling the null scenario in user code, prior to calling _Row.getSeq(int)_ or _Row.getList(int)_.
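The work-around (guard with _Row.isNullAt(int)_ before calling _Row.getSeq(int)_ or _Row.getList(int)_) can be sketched as a small helper. {{getSeqOrNull}} is a hypothetical name, not a Spark API, and the snippet assumes spark-sql on the classpath.

```scala
import org.apache.spark.sql.Row

// Hypothetical helper applying the work-around: restore the pre-3.1.1
// behavior by checking isNullAt before delegating to getSeq.
def getSeqOrNull[T](row: Row, i: Int): Seq[T] =
  if (row.isNullAt(i)) null else row.getSeq[T](i)

getSeqOrNull[String](Row(Seq("value")), 0)  // List(value)
getSeqOrNull[String](Row(null), 0)          // null, no NullPointerException
```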
[jira] [Updated] (SPARK-37654) Regression - NullPointerException in Row.getSeq when field null
[ https://issues.apache.org/jira/browse/SPARK-37654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Dahler updated SPARK-37654: --- Environment: (was: Tested against the following releases using the provided reproduction steps: # spark-3.0.3-bin-hadoop2.7 - Succeeded {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.3 /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) {code} # spark-3.1.2-bin-hadoop3.2 - Failed {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.2 /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) {code} # spark-3.2.0-bin-hadoop3.2 - Failed {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) {code}) > Regression - NullPointerException in Row.getSeq when field null > --- > > Key: SPARK-37654 > URL: https://issues.apache.org/jira/browse/SPARK-37654 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.2.0 >Reporter: Brandon Dahler >Priority: Major > > h2. Description > A NullPointerException occurs in _org.apache.spark.sql.Row.getSeq(int)_ if > the row contains a _null_ value at the requested index. > {code:java} > java.lang.NullPointerException > at org.apache.spark.sql.Row.getSeq(Row.scala:319) > at org.apache.spark.sql.Row.getSeq$(Row.scala:319) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166) > at org.apache.spark.sql.Row.getList(Row.scala:327) > at org.apache.spark.sql.Row.getList$(Row.scala:326) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166) > ... > {code} > > Prior to 3.1.1, the code would not throw an exception and instead would > return a null _Seq_ instance. > h2. 
Reproduction > # Start a new spark-shell instance > # Execute the following script: > {code:scala} > import org.apache.spark.sql.Row > Row(Seq("value")).getSeq(0) > Row(Seq()).getSeq(0) > Row(null).getSeq(0) {code} > h3. Expected Output > res2 outputs a _null_ value. > {code:java} > scala> import org.apache.spark.sql.Row > import org.apache.spark.sql.Row > scala> > scala> Row(Seq("value")).getSeq(0) > res0: Seq[Nothing] = List(value) > scala> Row(Seq()).getSeq(0) > res1: Seq[Nothing] = List() > scala> Row(null).getSeq(0) > res2: Seq[Nothing] = null > {code} > h3. Actual Output > res2 throws a NullPointerException. > {code:java} > scala> import org.apache.spark.sql.Row > import org.apache.spark.sql.Row > scala> > scala> Row(Seq("value")).getSeq(0) > res0: Seq[Nothing] = List(value) > scala> Row(Seq()).getSeq(0) > res1: Seq[Nothing] = List() > scala> Row(null).getSeq(0) > java.lang.NullPointerException > at org.apache.spark.sql.Row.getSeq(Row.scala:319) > at org.apache.spark.sql.Row.getSeq$(Row.scala:319) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166) > ... 47 elided > {code} > > h2. Regression Source > The regression appears to have been introduced in > [25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb|https://github.com/apache/spark/commit/25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb#diff-722324a11a0e4635a59a9debc962da2c1678d86702a9a106fd0d51188f83853bR317], > which addressed > [SPARK-32526|https://issues.apache.org/jira/browse/SPARK-32526] > h2. Work Around > This regression can be worked around by using _Row.isNullAt(int)_ and > handling the null scenario in user code, prior to calling _Row.getSeq(int)_ > or _Row.getList(int)_. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37654) Regression - NullPointerException in Row.getSeq when field null
Brandon Dahler created SPARK-37654: -- Summary: Regression - NullPointerException in Row.getSeq when field null Key: SPARK-37654 URL: https://issues.apache.org/jira/browse/SPARK-37654 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.2, 3.1.1 Environment: Tested against the following releases using the provided reproduction steps: # spark-3.0.3-bin-hadoop2.7 - Succeeded {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.3 /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) {code} # spark-3.1.2-bin-hadoop3.2 - Failed {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.1.2 /_/Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) {code} # spark-3.2.0-bin-hadoop3.2 - Failed {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.2.0 /_/Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) {code} Reporter: Brandon Dahler h2. Description A NullPointerException occurs in _org.apache.spark.sql.Row.getSeq(int)_ if the row contains a _null_ value at the requested index. {code:java} java.lang.NullPointerException at org.apache.spark.sql.Row.getSeq(Row.scala:319) at org.apache.spark.sql.Row.getSeq$(Row.scala:319) at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166) at org.apache.spark.sql.Row.getList(Row.scala:327) at org.apache.spark.sql.Row.getList$(Row.scala:326) at org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:166) ... {code} Prior to 3.1.1, the code would not throw an exception and instead would return a null _Seq_ instance. h2. Reproduction # Start a new spark-shell instance # Execute the following script: {code:scala} import org.apache.spark.sql.Row Row(Seq("value")).getSeq(0) Row(Seq()).getSeq(0) Row(null).getSeq(0) {code} h3. Expected Output res2 outputs a _null_ value. 
{code:java} scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> scala> Row(Seq("value")).getSeq(0) res0: Seq[Nothing] = List(value) scala> Row(Seq()).getSeq(0) res1: Seq[Nothing] = List() scala> Row(null).getSeq(0) res2: Seq[Nothing] = null {code} h3. Actual Output res2 throws a NullPointerException. {code:java} scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> scala> Row(Seq("value")).getSeq(0) res0: Seq[Nothing] = List(value) scala> Row(Seq()).getSeq(0) res1: Seq[Nothing] = List() scala> Row(null).getSeq(0) java.lang.NullPointerException at org.apache.spark.sql.Row.getSeq(Row.scala:319) at org.apache.spark.sql.Row.getSeq$(Row.scala:319) at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166) ... 47 elided {code} h2. Regression Source The regression appears to have been introduced in [25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb|https://github.com/apache/spark/commit/25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb#diff-722324a11a0e4635a59a9debc962da2c1678d86702a9a106fd0d51188f83853bR317], which addressed [SPARK-32526|https://issues.apache.org/jira/browse/SPARK-32526] h2. Work Around This regression can be worked around by using _Row.isNullAt(int)_ and handling the null scenario in user code, prior to calling _Row.getSeq(int)_ or _Row.getList(int)_. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ] Wei Guo commented on SPARK-37604: - For this code: {code:scala} val data = Seq(("Tesla", "")).toDF("make", "comment") data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty") {code} The csv file's content is: {noformat} Tesla,EMPTY {noformat} (cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv) When I read it back into a dataframe: {code:scala} spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show() {code} I want the column *comment* to be "" rather than an "EMPTY" string. !image-2021-12-16-01-57-55-864.png|width=424,height=173! > Change emptyValueInRead's effect to that any fields matching this string will > be set as "" when reading csv files > - > > Key: SPARK-37604 > URL: https://issues.apache.org/jira/browse/SPARK-37604 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Priority: Major > Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png > > > The csv data format is imported from databricks > [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with > PR [10766|https://github.com/apache/spark/pull/10766] . > {*}For the nullValue option{*}, according to features described in the spark-csv > readme file, it's designed as: > {noformat} > When reading files: > nullValue: specifies a string that indicates a null value, any fields > matching this string will be set as nulls in the DataFrame > When writing files: > nullValue: specifies a string that indicates a null value, nulls in the > DataFrame will be written as this string. 
> {noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", > "NULL").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,NULL > {noformat} > When reading: > {code:scala} > spark.read.option("nullValue", "NULL").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|null| > We can find that null columns in dataframe can be saved as "NULL" strings in > csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as > null columns*{color} in dataframe. That is: > {noformat} > When writing, convert null(in dataframe) to nullValue(in csv) > When reading, convert nullValue or nothing(in csv) to null(in dataframe) > {noformat} > But actually, the option nullValue in depended component univocity's > {*}_CommonSettings_{*}, is designed as that: > {noformat} > when reading, if the parser does not read any character from the input, the > nullValue is used instead of an empty string. > when writing, if the writer has a null object to write to the output, the > nullValue is used instead of an empty string.{noformat} > {*}There is a difference when reading{*}. In univocity, nothing content will > be convert to nullValue strings. But In Spark, we finally convert nothing > content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* > method: > {code:java} > private def nullSafeDatum( > datum: String, > name: String, > nullable: Boolean, > options: CSVOptions)(converter: ValueConverter): Any = { > if (datum == options.nullValue || datum == null) { > if (!nullable) { > throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) > } > null > } else { > converter.apply(datum) > } > } {code} > > From now, we start to talk about emptyValue. > {*}For the emptyValue option{*}, we add a emptyValueInRead option for > reading and a emptyValueInWrite option for writing. 
I found that Spark keeps > the same behavior for emptyValue as univocity, that is: > {noformat} > When reading, if the parser does not read any character from the input, and > the input is within quotes, the emptyValue is used instead of an empty string. > When writing, if the writer has an empty String to write to the output, the > emptyValue is used instead of an empty string.{noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", > "EMPTY").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,EMPTY {noformat} > When reading: > {code:scala} > spark.read.option("emptyValue", "EMPTY").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|EMPTY| > We can find that empty columns in dataframe can be saved as "EMPTY" strings > in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be > parsed as empty columns{color}* in dataframe.
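The read-side asymmetry described in this issue can be modeled in a small Python sketch (a hypothetical simplification for illustration, not Spark's actual UnivocityParser code; all names here are made up):

```python
# Toy model of CSV read-side conversion for nullValue / emptyValue.
QUOTED_EMPTY = '""'  # how an empty quoted field appears in the raw csv text

def read_null(datum, null_value):
    # nullValue is symmetric: missing input or the nullValue string
    # both come back as null (None) in the dataframe.
    if datum is None or datum == null_value:
        return None
    return datum

def read_empty_current(datum, empty_value):
    # Current behavior: a quoted empty field becomes the emptyValue
    # string, but an emptyValue string in the file stays as-is.
    return empty_value if datum == QUOTED_EMPTY else datum

def read_empty_proposed(datum, empty_value):
    # SPARK-37604's proposal: any field matching emptyValue is set
    # back to "" on read, mirroring nullValue's symmetry.
    return "" if datum in (QUOTED_EMPTY, empty_value) else datum
```

Under the current rule, `read_empty_current("EMPTY", "EMPTY")` stays `"EMPTY"`, which is exactly the round-trip loss the reporter demonstrates; the proposed rule returns `""` instead.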
[jira] [Updated] (SPARK-37572) Flexible ways of launching executors
[ https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dagang Wei updated SPARK-37572: --- Affects Version/s: 3.3.0 (was: 3.2.0) > Flexible ways of launching executors > > > Key: SPARK-37572 > URL: https://issues.apache.org/jira/browse/SPARK-37572 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.3.0 >Reporter: Dagang Wei >Priority: Major > > Currently Spark launches executor processes by constructing and running > commands [1], for example: > {code:java} > /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/jre/bin/java -cp > /opt/spark-3.2.0-bin-hadoop3.2/conf/:/opt/spark-3.2.0-bin-hadoop3.2/jars/* > -Xmx1024M -Dspark.driver.port=35729 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@example.host:35729 --executor-id 0 --hostname > 100.116.124.193 --cores 6 --app-id app-20211207131146-0002 --worker-url > spark://Worker@100.116.124.193:45287 {code} > But there are use cases which require more flexible ways of launching > executors. In particular, our use case is that we run Spark in standalone > mode, with the Spark master and workers running in VMs. We want to allow Spark app > developers to provide custom container images to customize the job runtime > environment (typically Java and Python dependencies), so executors (which run > the job code) need to run in Docker containers. > After reading the source code, we found that the concept of a Spark Command > Runner might be a good solution. Basically, we want to introduce an optional > Spark command runner in Spark, so that instead of running the command to > launch the executor directly, Spark passes the command to the runner; the runner > then runs the command with its own strategy, which could be running it in Docker > or, by default, running the command directly.
> The runner is specified through an env variable `SPARK_COMMAND_RUNNER`, which > by default could be a simple script like: > {code:java} > #!/bin/bash > exec "$@" {code} > or, in the case of a Docker container: > {code:java} > #!/bin/bash > docker run ... -- "$@" {code} > > I already have a patch for this feature and have tested it in our environment. > > [1]: > [https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala#L52] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
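The runner indirection proposed here can be sketched in Python (a hypothetical stand-in for the described hook; only the `SPARK_COMMAND_RUNNER` env-variable name comes from the proposal, the function and its behavior are illustrative):

```python
import os
import shlex
import subprocess

def launch_executor(command, env=None):
    # Look up the optional runner; if set, prepend it to the executor
    # command, otherwise run the command directly -- the same effect as
    # the default 'exec "$@"' wrapper script in the proposal.
    env = os.environ if env is None else env
    runner = env.get("SPARK_COMMAND_RUNNER")
    full_cmd = (shlex.split(runner) + list(command)) if runner else list(command)
    return subprocess.run(full_cmd, capture_output=True, text=True)

# No runner configured: the executor command runs as-is.
print(launch_executor(["echo", "executor", "started"], env={}).stdout)
```

With `SPARK_COMMAND_RUNNER` pointing at, say, a `docker run ...` wrapper, the identical executor command line would instead be handed to Docker, which is the flexibility the issue asks for.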
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Attachment: empty_test.png > We can find that empty columns in dataframe can be saved as "EMPTY" strings > in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be > parsed as empty columns{color}* in dataframe. That is: > {noformat} > When writing, convert "" empty(in dataframe) to emptyValue(in csv) > When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) > {noformat} > > There is an obvious difference between nullValue and emptyValue in read > handling. For nullValue, we will convert nothing or nullValue strings to null > in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty > strings) to emptyValue strings rather than to convert both "\"\""(quoted > empty strings) and emptyValue strings to ""(empty) in dataframe.
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Attachment: image-2021-12-16-01-57-55-864.png
[jira] [Updated] (SPARK-37572) Flexible ways of launching executors
[ https://issues.apache.org/jira/browse/SPARK-37572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dagang Wei updated SPARK-37572: --- Description: edited to replace the internal driver-url host in the example command (now spark://coarsegrainedschedu...@example.host:35729; was spark://coarsegrainedschedu...@dagang.svl.corp.google.com:35729); the description otherwise matches the one quoted above.
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Summary: Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files (was: The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading)
[jira] [Commented] (SPARK-37630) Security issue from Log4j 1.X exploit
[ https://issues.apache.org/jira/browse/SPARK-37630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460117#comment-17460117 ] Ankur Jain commented on SPARK-37630: Log4j 2.15.0 also has a vulnerability; it would be good if we can upgrade to 2.16.0. > Security issue from Log4j 1.X exploit > - > > Key: SPARK-37630 > URL: https://issues.apache.org/jira/browse/SPARK-37630 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.8, 3.2.0 >Reporter: Ismail H >Priority: Major > Labels: security > > log4j is being used in version [1.2.17|#L122] > > This version has been deprecated and since then has [a known issue that > hasn't been addressed in 1.X > versions|https://www.cvedetails.com/cve/CVE-2019-17571/]. > > *Solution:* > * Upgrade log4j to version 2.15.0, which corrects all known issues. [Last > known issues |https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228]
[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460111#comment-17460111 ] Max Gekk commented on SPARK-37604: -- > but "EMPTY" strings in csv files can not be parsed as empty columns The use case when an empty string is interpreted as non-empty like "EMPTY" is not clear to me. [~Wayne Guo] Could you describe the case when it is needed?
[jira] [Comment Edited] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460091#comment-17460091 ] Wei Guo edited comment on SPARK-37604 at 12/15/21, 4:48 PM: [~hyukjin.kwon], [~maxgekk] Shall we have a brief discussion about this when you have time? I'd like to hear your thoughts. was (Author: wayne guo): [~hyukjin.kwon][~maxgekk] Shall we have a simple discussion about it in your free time, I'd like to hear your thoughts on this.
> The option emptyValueInRead(in CSVOptions) is suggested to be designed as
> that any fields matching this string will be set as empty values "" when
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.2.0
> Reporter: Wei Guo
> Priority: Major
>
> The csv data format was imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766].
> {*}For the nullValue option{*}, according to the spark-csv readme, it is designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value; any fields matching this string will be set as nulls in the DataFrame.
> When writing files:
> nullValue: specifies a string that indicates a null value; nulls in the DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code}
> The saved csv file contains:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is:
> ||make||comment||
> |Tesla|null|
> We can see that null columns in the dataframe are saved as "NULL" strings in csv files, and {color:#00875a}*"NULL" strings in csv files are parsed back as null columns*{color} in the dataframe. That is:
> {noformat}
> When writing, convert null (in dataframe) to nullValue (in csv)
> When reading, convert nullValue or nothing (in csv) to null (in dataframe)
> {noformat}
> However, the nullValue option in the underlying univocity component's {*}_CommonSettings_{*} is designed as:
> {noformat}
> when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}: in univocity, empty input is converted to the nullValue string, but in Spark we ultimately convert both empty input and nullValue strings to null, in the *_UnivocityParser_* method *_nullSafeDatum_*:
> {code:java}
> private def nullSafeDatum(
>     datum: String,
>     name: String,
>     nullable: Boolean,
>     options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
>     if (!nullable) {
>       throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
>     }
>     null
>   } else {
>     converter.apply(datum)
>   }
> } {code}
>
> Now let us turn to emptyValue.
> {*}For the emptyValue option{*}, Spark adds an emptyValueInRead option for reading and an emptyValueInWrite option for writing. Spark keeps the same behavior for emptyValue as univocity, that is:
> {noformat}
> When reading, if the parser reads an empty quoted string from the input, the emptyValue is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
> The saved csv file contains:
> {noformat}
> Tesla,EMPTY
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is:
> ||make||comment||
> |Tesla|EMPTY|
> We can see that empty columns in the dataframe are saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files cannot be parsed back as empty columns{color}* in the dataframe. That is:
> {noformat}
> When writing, convert "" empty (in dataframe) to emptyValue (in csv)
> When reading, convert "\"\"" quoted empty strings (in csv) to emptyValue (in dataframe)
> {noformat}
>
> There is an obvious difference between nullValue and emptyValue in read handling: for nullValue, we convert nothing or nullValue strings to null in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty strings) to emptyValue strings, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings back to "" (empty) in the dataframe. It would be better to keep the same kind of behavior for emptyValue as for nullValue when reading (recover emptyValue strings in csv files to ""). So I suggest that emptyValueInRead (in CSVOptions) be designed so that any fields matching this string are set as empty values "" when reading.
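The asymmetry the description points out can be reproduced with a short round-trip sketch. This is a sketch only, assuming a local SparkSession and a scratch path; the commented results restate the behavior reported in the issue, not independently verified output:

```scala
// Sketch of the nullValue / emptyValue round-trip asymmetry described above.
// Assumes a local Spark build is on the classpath; the path is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-empty-value").getOrCreate()
import spark.implicits._

val path = "/tmp/csv-empty-value-demo"

// nullValue round-trips: null -> "NULL" on write, "NULL" -> null on read.
Seq(("Tesla", null: String)).toDF("make", "comment")
  .write.mode("overwrite").option("nullValue", "NULL").csv(path)
spark.read.option("nullValue", "NULL").csv(path).show()
// per the issue: the comment column is parsed back as null

// emptyValue does not round-trip: "" -> "EMPTY" on write, but "EMPTY" is
// read back as the literal string "EMPTY", not as "".
Seq(("Tesla", "")).toDF("make", "comment")
  .write.mode("overwrite").option("emptyValue", "EMPTY").csv(path)
spark.read.option("emptyValue", "EMPTY").csv(path).show()
// per the issue: the comment column contains the string "EMPTY"
```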
[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460091#comment-17460091 ] Wei Guo commented on SPARK-37604: - [~hyukjin.kwon][~maxgekk] Shall we have a simple discussion about it in your free time, I'd like to hear your thoughts on this.
[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460084#comment-17460084 ] Wei Guo commented on SPARK-37604: - Maybe this issue is not a notable bug or improvement, but for common usage, users would prefer that these emptyValue strings in csv files can be converted back into "" (empty strings) after empty strings have been written out as emptyValue strings, rather than the current behavior.
[jira] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604 ] Wei Guo deleted comment on SPARK-37604: - was (Author: wayne guo): The current behavior of emptyValueInRead is more like the function for null values in Dataset: {code:scala} dataframe.na.fill(fillMap){code} So, we can also provide a function in Dataset similar to it, such as: {code:scala} dataframe.empty.fill(fillMap){code} rather than to change empty strings to emptyValueInRead in DataFrame when reading csv files.
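The na.fill analogy above suggests a workaround that already exists today: a caller can undo the read-side substitution with DataFrameNaFunctions.replace. A minimal sketch, assuming the "EMPTY" marker and the "comment" column from the issue's examples:

```scala
// Sketch: map the emptyValue marker back to "" after reading, using the
// existing na.replace API. The "EMPTY" marker and "comment" column follow
// the examples in the issue description.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for a dataframe parsed from a csv written with emptyValue=EMPTY.
val parsed = Seq(("Tesla", "EMPTY")).toDF("make", "comment")

// Replace the marker with an empty string in the affected column.
val restored = parsed.na.replace("comment", Map("EMPTY" -> ""))
restored.show()
```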
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: (see above)
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: (see above)
[jira] [Resolved] (SPARK-37483) Support push down top N to JDBC data source V2
[ https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37483. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34738 [https://github.com/apache/spark/pull/34738] > Support push down top N to JDBC data source V2 > -- > > Key: SPARK-37483 > URL: https://issues.apache.org/jira/browse/SPARK-37483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
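For context on what SPARK-37483 covers: with top-N pushdown, a sort followed by a limit over a JDBC DS v2 table can be executed by the database as ORDER BY ... LIMIT instead of after a full scan in Spark. A sketch of the query shape, where the catalog setup, URL, and table name are placeholder assumptions and whether pushdown actually applies depends on the DS v2 JDBC configuration:

```scala
// Sketch of a top-N query over a JDBC source that SPARK-37483 targets.
// Catalog name, URL, driver, and table are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  // Register the source through a DS v2 JDBC catalog so the v2 pushdown path is used.
  .config("spark.sql.catalog.h2", "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
  .config("spark.sql.catalog.h2.url", "jdbc:h2:mem:demo")
  .config("spark.sql.catalog.h2.driver", "org.h2.Driver")
  .getOrCreate()

// Top 10 earners: the orderBy + limit below is the shape that is eligible
// to be pushed down to the database as ORDER BY salary DESC LIMIT 10.
spark.table("h2.test.employees")
  .orderBy(col("salary").desc)
  .limit(10)
  .show()
```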
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: (see above)
[jira] [Assigned] (SPARK-37483) Support push down top N to JDBC data source V2
[ https://issues.apache.org/jira/browse/SPARK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37483: --- Assignee: jiaan.geng > Support push down top N to JDBC data source V2 > -- > > Key: SPARK-37483 > URL: https://issues.apache.org/jira/browse/SPARK-37483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major >
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-37604:
----------------------------
Description:

The csv data format was imported from Databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766].

{*}For the nullValue option{*}, according to the features described in the spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame
When writing files:
nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null: String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path)
{code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in the dataframe can be saved as "NULL" strings in csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null columns*{color} in the dataframe. That is:
{noformat}
When writing, convert null (in dataframe) to nullValue (in csv)
When reading, convert nullValue or nothing (in csv) to null (in dataframe)
{noformat}
But actually, the nullValue option in the underlying univocity library's {*}_CommonSettings_{*} is designed as:
{noformat}
when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.
{noformat}
{*}There is a difference when reading{*}: in univocity, empty input is converted to the nullValue string, not to null.
But in Spark, we finally convert both empty input and nullValue strings to null, in the *_UnivocityParser_* method *_nullSafeDatum_*:
{code:java}
private def nullSafeDatum(
    datum: String,
    name: String,
    nullable: Boolean,
    options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
    if (!nullable) {
      throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
    }
    null
  } else {
    converter.apply(datum)
  }
}
{code}
{*}For the emptyValue option{*}, we added an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity:
{noformat}
When reading, if the parser does not read any character from the input, and the input is within quotes, the emptyValue is used instead of an empty string.
When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
{code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY
{noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in the dataframe can be saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed as empty columns{color}* in the dataframe. That is:
{noformat}
When writing, convert "" empty (in dataframe) to emptyValue (in csv)
When reading, convert "\"\"" quoted empty strings (in csv) to emptyValue (in dataframe)
{noformat}
There is an obvious difference between nullValue and emptyValue in read handling.
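The read-side asymmetry can be sketched outside Spark with a small Python stand-in (the helper names below are hypothetical, for illustration only; Spark's actual logic lives in _UnivocityParser_):

```python
# Hypothetical sketch of the two read-side conversions (not Spark code).

def null_safe_datum(datum, null_value="NULL", nullable=True):
    # Mirrors nullSafeDatum: both missing input (None) and fields equal
    # to nullValue are recovered to null (None here).
    if datum == null_value or datum is None:
        if not nullable:
            raise ValueError("null value found for non-nullable field")
        return None
    return datum

def current_empty_datum(datum, empty_value="EMPTY"):
    # Current emptyValue read handling: only a quoted empty field ("")
    # is replaced by emptyValue; the emptyValue string itself passes through.
    return empty_value if datum == "" else datum

print(null_safe_datum("NULL"))       # None  -> nullValue string recovered to null
print(null_safe_datum(None))         # None  -> missing input also becomes null
print(current_empty_datum(""))       # EMPTY -> quoted empty becomes emptyValue
print(current_empty_datum("EMPTY"))  # EMPTY -> NOT recovered to "", the asymmetry
```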
For nullValue, we convert both empty input and nullValue strings to null in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty strings) to emptyValue rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the dataframe. I think it is better to keep the same read behavior for emptyValue as for nullValue (i.e. recover emptyValue back to ""), so I suggest that emptyValueInRead (in CSVOptions) be designed so that any fields matching this string are set as empty values "" when reading.
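The suggested symmetric read behavior could look like the following hypothetical Python sketch (illustration only, not Spark's implementation; the function name is made up):

```python
def proposed_empty_datum(datum, empty_value="EMPTY"):
    # Proposed emptyValueInRead semantics: any field matching emptyValue
    # is set to the empty string "", mirroring how fields matching
    # nullValue are set to null when reading.
    if datum == "" or datum == empty_value:
        return ""
    return datum

print(proposed_empty_datum("EMPTY"))  # -> "" : emptyValue recovered to empty
print(proposed_empty_datum("Tesla"))  # -> "Tesla" : other values pass through
```

With this change, a round trip of write-then-read with the same emptyValue option would recover the original empty columns, just as it already does for null columns with nullValue.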
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |Tesla|null| We can find that null columns in dataframe can be saved as "NULL" strings in csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null columns*{color} in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in depended component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} {*}There is a difference when reading{*}. In univocity, nothing content would be convert to nullValue strings. 
But In Spark, we finally convert nothing content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we add a emptyValueInRead option for reading and a emptyValueInWrite option for writing. I found that both Spark keeps the same behaviors for emptyValue with univocity. {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", {code} {color:#910091}""{color} {code:scala} )).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} When reading: {code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |Tesla|EMPTY| We can find that empty columns in dataframe can be saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed as empty columns{color}* in dataframe. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) and emptyValue strings to ""(empty) in dataframe. I think it's better that we keep the similar behavior(try to recover emptyValue to "") for emptyValue as nullValue when reading, so I suggest that the emptyValueInRead(in CSVOptions) should be designed as that any fields matching this string will be set as empty values "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |Tesla|null| We can find that null columns in dataframe can be saved as "NULL" strings in csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null columns*{color} in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in depended component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, nothing content would be convert to nullValue strings. 
But In Spark, we finally convert nothing content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we add a emptyValueInRead option for reading and a emptyValueInWrite option for writing. I found that both Spark keeps the same behaviors for emptyValue with univocity. {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", {code} {color:#910091}""{color} {code:scala} )).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} When reading: {code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |Tesla|EMPTY| We can find that empty columns in dataframe can be saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed as empty columns{color}* in dataframe. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) and emptyValue strings to ""(empty) in dataframe. I think it's better that we keep the similar behavior(try to recover emptyValue to "") for emptyValue as nullValue when reading, so I suggest that the emptyValueInRead(in CSVOptions) should be designed as that any fields matching this string will be set as empty values "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |Tesla|null| We can find that null columns in dataframe can be saved as "NULL" strings in csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null columns{color}* in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in depended component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, nothing content would be convert to nullValue strings. 
But In Spark, we finally convert nothing content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we add a emptyValueInRead option for reading and a emptyValueInWrite option for writing. I found that both Spark keeps the same behaviors for emptyValue with univocity. {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", {code} {color:#910091}""{color} {code:scala} )).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} When reading: {code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |Tesla|EMPTY| We can find that empty columns in dataframe can be saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed as empty columns{color}* in dataframe. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) and emptyValue strings to ""(empty) in dataframe. I think it's better that we keep the similar behavior(try to recover emptyValue to "") for emptyValue as nullValue when reading, so I suggest that the emptyValueInRead(in CSVOptions) should be designed as that any fields matching this string will be set as empty values "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as "NULL" strings in csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null columns{color}* in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in depended component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, nothing content would be convert to nullValue strings. 
But In Spark, we finally convert nothing content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we add a emptyValueInRead option for reading and a emptyValueInWrite option for writing. I found that both Spark keeps the same behaviors for emptyValue with univocity. {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", {code} {color:#910091}""{color} {code:scala} )).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} When reading: {code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|EMPTY| We can find that empty columns in dataframe can be saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed as empty columns{color}* in dataframe. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) and emptyValue strings to ""(empty) in dataframe. I think it's better that we keep the similar behavior(try to recover emptyValue to "") for emptyValue as nullValue when reading, so I suggest that the emptyValueInRead(in CSVOptions) should be designed as that any fields matching this string will be set as empty values "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features description in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to the features description in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as "NULL" strings in csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null columns{color}* in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in depended component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, nothing content would be convert to nullValue strings. 
But In Spark, we finally convert nothing content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we add a emptyValueInRead option for reading and a emptyValueInWrite option for writing. I found that both Spark keeps the same behaviors for emptyValue with univocity. {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", {code} {color:#910091}""{color} {code:scala} )).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} When reading: {code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|EMPTY| We can find that empty columns in dataframe can be saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed as empty columns{color}* in dataframe. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) and emptyValue strings to ""(empty) in dataframe. I think it's better that we keep the similar behavior(try to recover emptyValue to "") for emptyValue as nullValue when reading, so I suggest that the emptyValueInRead(in CSVOptions) should be designed as that any fields matching this string will be set as empty values "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to the features description in spark-csv readme file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features description in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as "NULL" strings in csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null columns{color}* in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in depended component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, nothing content would be convert to nullValue strings. 
But in Spark, we finally convert empty content or nullValue strings to null in the *_UnivocityParser_* method *_nullSafeDatum_*:
{code:scala}
private def nullSafeDatum(
    datum: String,
    name: String,
    nullable: Boolean,
    options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
    if (!nullable) {
      throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
    }
    null
  } else {
    converter.apply(datum)
  }
}
{code}
{*}For the emptyValue option{*}, Spark adds an emptyValueInRead option for reading and an emptyValueInWrite option for writing. Spark keeps the same behavior for emptyValue as univocity:
{noformat}
When reading, if the parser reads no character from the input and the input is within quotes, the emptyValue is used instead of an empty string.
When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
{code}
The saved CSV file contains:
{noformat}
Tesla,EMPTY
{noformat}
When reading:
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is:
||make||comment||
|Tesla|EMPTY|

So empty-string columns in the dataframe are saved as "EMPTY" strings in CSV files, *{color:#de350b}but "EMPTY" strings in CSV files cannot be parsed back as empty columns{color}* in the dataframe. That is:
{noformat}
When writing, convert "" empty (in dataframe) to emptyValue (in csv)
When reading, convert "\"\"" quoted empty strings (in csv) to emptyValue (in dataframe)
{noformat}
There is an obvious difference between nullValue and emptyValue in read handling.
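The asymmetry described above can be sketched outside of Spark. The following is a minimal Java sketch (not Spark's actual code; the constants and the `parseField` helper are illustrative) showing that the read path maps a field matching nullValue to null, but leaves a field matching emptyValueInRead as the literal marker string, so the round trip does not restore "":

```java
import java.util.Arrays;
import java.util.List;

public class CsvReadAsymmetry {
    static final String NULL_VALUE = "NULL";    // analogous to the nullValue option
    static final String EMPTY_VALUE = "EMPTY";  // analogous to the emptyValueInRead option

    // Mirrors the shape of nullSafeDatum: a field equal to nullValue
    // (or a missing field) becomes null...
    static String parseField(String datum) {
        if (datum == null || datum.equals(NULL_VALUE)) {
            return null;
        }
        // ...but there is no corresponding branch for EMPTY_VALUE, so a field
        // written by emptyValueInWrite comes back as the literal string "EMPTY".
        return datum;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("Tesla", "NULL", "EMPTY");
        for (String f : fields) {
            System.out.println(f + " -> " + parseField(f));
        }
    }
}
```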
For nullValue, Spark converts both nothing and nullValue strings to null in the dataframe. For emptyValue, however, Spark only converts "\"\"" (quoted empty strings) to emptyValue, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the dataframe. It would be better to keep the same behavior for emptyValue as for nullValue when reading (i.e. recover emptyValue back to ""), so I suggest that emptyValueInRead (in CSVOptions) be designed so that any field matching this string is set to the empty value "" when reading.
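The suggested symmetric read rule can be sketched as follows. This is a hypothetical Java illustration of the proposal, not Spark's implementation; the `parseField` helper and its parameters are illustrative:

```java
public class ProposedEmptyValueRead {
    // Proposed rule: treat a field matching emptyValueInRead the same way a
    // field matching nullValue is treated, mapping it back to the real "".
    static String parseField(String datum, String nullValue, String emptyValue) {
        if (datum == null || datum.equals(nullValue)) {
            return null;      // nullValue (or nothing) -> null, as today
        }
        if (datum.equals(emptyValue)) {
            return "";        // emptyValueInRead -> "" (the proposed change)
        }
        return datum;
    }

    public static void main(String[] args) {
        // With the proposed rule, the marker written by emptyValueInWrite
        // round-trips back to a real empty string.
        System.out.println("[" + parseField("EMPTY", "NULL", "EMPTY") + "]"); // prints []
        System.out.println(parseField("NULL", "NULL", "EMPTY"));             // prints null
    }
}
```

This would make a write-then-read cycle with matching emptyValue settings recover the original DataFrame, just as it already does for nullValue.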
Since Spark 2.4, for empty strings there are an emptyValueInRead option for reading and an emptyValueInWrite option for writing that can be set in CSVOptions:
{code:scala}
// When writing, convert: "" (dataframe) => emptyValueInWrite (csv)
// When reading, convert: "\"\"" (csv) => emptyValueInRead (dataframe)
{code}
The read handling is not suitable: we cannot recover "" or emptyValueInWrite values as "" (real empty strings); we get emptyValueInRead's setting value instead. It is supposed to be:
{code:scala}
// When reading, convert: "\"\"" or emptyValueInRead (csv) => "" (dataframe)
{code}
{color:#de350b}*We cannot recover the original DataFrame.*{color}