[ https://issues.apache.org/jira/browse/SPARK-21768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16132191#comment-16132191 ]
Marco Gaido commented on SPARK-21768:
-------------------------------------

This is a duplicate of SPARK-17916.

> spark.csv.read Empty String Parsed as NULL when nullValue is Set
> ----------------------------------------------------------------
>
>                 Key: SPARK-21768
>                 URL: https://issues.apache.org/jira/browse/SPARK-21768
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.2.0
>         Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2)
> PySpark
>            Reporter: Andrew Gross
>
> In a CSV with quoted fields, empty strings will be interpreted as NULL even
> when a nullValue is explicitly set:
> Example CSV with Quoted Fields, Delimiter | and nullValue XXNULLXX
> {{"XXNULLXX"|""|"XXNULLXX"|"foo"}}
> PySpark Script to load the file (from S3):
> {code:title=load.py|borderStyle=solid}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType, StructField, StructType
> spark = SparkSession.builder.appName("test_csv").getOrCreate()
> fields = []
> fields.append(StructField("First Null Field", StringType(), True))
> fields.append(StructField("Empty String Field", StringType(), True))
> fields.append(StructField("Second Null Field", StringType(), True))
> fields.append(StructField("Non Empty String Field", StringType(), True))
> schema = StructType(fields)
> keys = ['s3://mybucket/test/demo.csv']
> bad_data = spark.read.csv(keys, timestampFormat="yyyy-MM-dd HH:mm:ss",
>                           mode="FAILFAST", sep="|", nullValue="XXNULLXX", schema=schema)
> bad_data.show()
> {code}
> Output
> {noformat}
> +----------------+------------------+-----------------+----------------------+
> |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
> +----------------+------------------+-----------------+----------------------+
> |            null|              null|             null|                   foo|
> +----------------+------------------+-----------------+----------------------+
> {noformat}
> Expected Output:
> {noformat}
> +----------------+------------------+-----------------+----------------------+
> |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
> +----------------+------------------+-----------------+----------------------+
> |            null|                  |             null|                   foo|
> +----------------+------------------+-----------------+----------------------+
> {noformat}
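
For reference, the behaviour the reporter expects can be sketched outside Spark with Python's standard csv module. This is not Spark's CSV parser and says nothing about how Spark implements nullValue; it only illustrates the intended semantics, where the marker XXNULLXX becomes a null and a quoted empty string stays an empty string. The helper name parse_line is hypothetical and exists only for this illustration.

```python
import csv
import io

# The null marker used in the report; anything else, including "", is kept as-is.
NULL_VALUE = "XXNULLXX"


def parse_line(line, sep="|", null_value=NULL_VALUE):
    """Parse one delimited line, mapping only the null marker to None.

    Hypothetical helper for illustration; it does not call into Spark.
    """
    reader = csv.reader(io.StringIO(line), delimiter=sep, quotechar='"')
    row = next(reader)
    # Only an exact match with the marker becomes None; "" is preserved.
    return [None if field == null_value else field for field in row]


if __name__ == "__main__":
    # The sample line from the report.
    print(parse_line('"XXNULLXX"|""|"XXNULLXX"|"foo"'))
    # [None, '', None, 'foo']
```

The second field comes back as an empty string rather than None, which matches the "Expected Output" table in the report.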