[
https://issues.apache.org/jira/browse/SPARK-53052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011657#comment-18011657
]
Kash Bhatt edited comment on SPARK-53052 at 8/3/25 4:09 AM:
------------------------------------------------------------
[~vindhyag], you're right. I've learned a few new things about these two
options since I posted this bug. This bug is invalid, but there are issues with
Python vs. Scala API behavior and definitely with the docs.
PS: the implementation prevents setting any option to `None` in Python.
I'll update the ticket when I get a chance.
See:
* [https://stackoverflow.com/questions/79721756/how-to-use-emptyvalue-option-in-pyspark-while-reading-a-csv-file]
* [https://stackoverflow.com/questions/79721713/how-to-read-empty-string-as-well-as-null-values-from-a-csv-file-in-pyspark]
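To illustrate the PS above, here is a minimal sketch. It assumes the Python reader simply skips keyword options whose value is `None`. The helper `collect_reader_options` is hypothetical and only models that behavior; it is not a PySpark function and does not reproduce Spark's actual code.

```python
# Simplified, hypothetical model of keyword-option handling (not Spark's code).
def collect_reader_options(**options):
    """Return only the options that would be forwarded to the reader."""
    forwarded = {}
    for key, value in options.items():
        if value is None:
            # Assumption: None means "not specified", so the default stays in effect.
            continue
        # Assumption: values are passed on as strings, booleans in lowercase.
        forwarded[key] = str(value).lower() if isinstance(value, bool) else str(value)
    return forwarded


explicit_none = collect_reader_options(header=True, emptyValue="", nullValue=None)
omitted = collect_reader_options(header=True, emptyValue="")

print(explicit_none)
print(explicit_none == omitted)
```

Under that assumption, passing `nullValue=None` is indistinguishable from not passing `nullValue` at all, which would explain why the PySpark call cannot mirror the Scala `.option("nullValue", null)` call.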
was (Author: JIRAUSER310621):
[~vindhyag], you're right. I've learned a few new things about these two
options since I posted this bug. This bug is invalid, but there are issues with
Python vs. Scala API behavior and definitely with the docs.
I'll update the ticket when I get a chance.
See:
* [https://stackoverflow.com/questions/79721756/how-to-use-emptyvalue-option-in-pyspark-while-reading-a-csv-file]
* [https://stackoverflow.com/questions/79721713/how-to-read-empty-string-as-well-as-null-values-from-a-csv-file-in-pyspark]
> emptyValue option does not seem to work from pyspark
> ----------------------------------------------------
>
> Key: SPARK-53052
> URL: https://issues.apache.org/jira/browse/SPARK-53052
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0
> Environment: Spark 3.5.0
> Reporter: Kash Bhatt
> Priority: Minor
>
> According to
> [docs|https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option]
> of csv options:
> {quote}
> ||Property Name||Default||Meaning||
> |emptyValue|(for reading), "" (for writing)|Sets the string representation of
> an empty value.|
> {quote}
> But it doesn't seem to work:
> {code:java}
> with open("/dbfs/tmp/c.csv", "w") as f:
>     f.write('''id,val
> 1,
> 2,emptyStr
> 3,str1
> ''')
> spark.read.csv('dbfs:/tmp/c.csv', header=True,
> emptyValue='emptyStr').collect() {code}
> prints:
> [Row(id='1', val=None), Row(id='2', val='emptyStr'), Row(id='3', val='str1')]
> Expected {{Row(id='2', val='')}} (instead of {{val='emptyStr'}}).
> ----
> Not sure if it's related, but although the docs don't mention any relation to
> the {{nullValue}} option, it seems to affect {{emptyValue}}.
>
> With this content in the CSV file:
> {code:java}
> id,val
> 1,
> 2,""
> 3,str1 {code}
> The following Scala code works as expected, i.e. it prints
> {{Array[org.apache.spark.sql.Row] = Array([1,null], [2,], [3,str1])}}
> {code:java}
> spark.read.option("header", "true")
> .option("emptyValue", "")
> .option("nullValue", null)
> .csv("dbfs:/tmp/c.csv").collect() {code}
> But the PySpark code doesn't work; it prints: {{[Row(id='1', val=None), Row(id='2',
> val=None), Row(id='3', val='str1')]}}
> {code:java}
> spark.read.csv('dbfs:/tmp/c.csv', header=True, emptyValue='',
> nullValue=None).collect() {code}