[
https://issues.apache.org/jira/browse/SPARK-53052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011540#comment-18011540
]
Vindhya G edited comment on SPARK-53052 at 8/2/25 12:33 PM:
------------------------------------------------------------
[Row(id='1', val=None), Row(id='2', val='emptyStr'), Row(id='3', val='str1')]
expected {{Row(id='2', val='')}} (instead of {{val='emptyStr'}}).
I think this is behaving as expected, because the reader uses whatever emptyValue
you define. Here you defined 'emptyStr' as your emptyValue.
As for the actual emptyValue of '', I do see a difference in behaviour between
Scala and Python, but only when nullValue is provided. It seems to come from
how null (Scala) and None (Python) are read.
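To make the null vs None point concrete, here is a minimal pure-Python model of what seems to be happening. It is not Spark's code. It assumes that (a) the Python reader drops keyword arguments that are None before they reach the JVM, while a Scala {{.option("nullValue", null)}} keeps the explicit null, (b) both options default to "" on read, and (c) a field is returned as null when its resolved value equals nullValue. All function names below are hypothetical.

```python
from typing import Optional

# Assumed reader defaults (an assumption of this sketch, not taken from Spark's source).
DEFAULT_NULL_VALUE = ""
DEFAULT_EMPTY_VALUE = ""


def python_style_options(**kwargs: Optional[str]) -> dict:
    """Hypothetical model of a Python wrapper: options that are None are not forwarded."""
    return {k: v for k, v in kwargs.items() if v is not None}


def scala_style_options(**kwargs: Optional[str]) -> dict:
    """Hypothetical model of a Scala option map: an explicit null is kept."""
    return dict(kwargs)


def read_field(raw: str, options: dict) -> Optional[str]:
    """Resolve one raw CSV field under the assumed rules."""
    null_value = options.get("nullValue", DEFAULT_NULL_VALUE)
    empty_value = options.get("emptyValue", DEFAULT_EMPTY_VALUE)
    if raw == "":
        datum = null_value    # unquoted empty field
    elif raw == '""':
        datum = empty_value   # quoted empty string
    else:
        datum = raw
    # A datum equal to nullValue (or already null) is returned as null.
    if datum is None or datum == null_value:
        return None
    return datum


raw_fields = ["", '""', "str1"]  # the val column of the second CSV in the report

scala_opts = scala_style_options(emptyValue="", nullValue=None)
python_opts = python_style_options(emptyValue="", nullValue=None)

scala_result = [read_field(f, scala_opts) for f in raw_fields]
python_result = [read_field(f, python_opts) for f in raw_fields]
print(scala_result)
print(python_result)
```

Under these assumptions the Scala-style call keeps the quoted empty string, while the Python-style call behaves as if nullValue had never been passed, so both empty fields come back as None. The same model returns 'emptyStr' unchanged for the first example, since emptyValue only decides what a quoted empty field is read as.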
was (Author: JIRAUSER299405):
[Row(id='1', val=None), Row(id='2', val='emptyStr'), Row(id='3', val='str1')]
expected {{Row(id='2', val='')}} (instead of {{val='emptyStr'}}).
I think this is behaving as expected as you are providing what emptyValue you
define. Here you defined 'emptyStr' as your emptyValue.
As for the actual emptyValue with '' I do see the difference in behaviour in
scala and python!
> emptyValue option does not seem to work from pyspark
> ----------------------------------------------------
>
> Key: SPARK-53052
> URL: https://issues.apache.org/jira/browse/SPARK-53052
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Spark 3.5.0
> Reporter: Kash Bhatt
> Priority: Minor
>
> According to
> [docs|https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option]
> of csv options:
> {quote}
> ||Property Name||Default||Meaning||
> |emptyValue|(for reading), "" (for writing)|Sets the string representation of an empty value.|
> {quote}
> But it doesn't seem to work:
> {code:python}
> with open("/dbfs/tmp/c.csv", "w") as f:
>     f.write('''id,val
> 1,
> 2,emptyStr
> 3,str1
> ''')
> spark.read.csv('dbfs:/tmp/c.csv', header=True,
>                emptyValue='emptyStr').collect()
> {code}
> prints:
> [Row(id='1', val=None), Row(id='2', val='emptyStr'), Row(id='3', val='str1')]
> expected {{Row(id='2', val='')}} (instead of {{val='emptyStr'}}).
> ----
> Not sure if it's related, but although the docs don't mention any relation to
> the {{nullValue}} option, it seems to affect {{emptyValue}}.
>
> With this content in csv file:
> {code:java}
> id,val
> 1,
> 2,""
> 3,str1 {code}
> The following Scala code works as expected, i.e. prints
> {{Array[org.apache.spark.sql.Row] = Array([1,null], [2,], [3,str1])}}
> {code:scala}
> spark.read.option("header", "true")
>   .option("emptyValue", "")
>   .option("nullValue", null)
>   .csv("dbfs:/tmp/c.csv").collect()
> {code}
> But the PySpark code doesn't work; it prints
> {{[Row(id='1', val=None), Row(id='2', val=None), Row(id='3', val='str1')]}}
> {code:python}
> spark.read.csv('dbfs:/tmp/c.csv', header=True, emptyValue='',
>                nullValue=None).collect()
> {code}