Andrew Gross created SPARK-21768:
------------------------------------

Summary: spark.csv.read Empty String Parsed as NULL when nullValue is Set
Key: SPARK-21768
URL: https://issues.apache.org/jira/browse/SPARK-21768
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 2.2.0, 2.0.2
Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2) PySpark
Reporter: Andrew Gross

In a CSV with quoted fields, empty strings will be interpreted as NULL even when a nullValue is explicitly set:

Example CSV with Quoted Fields, Delimiter | and nullValue XXNULLXX

{{"XXNULLXX"|""|"XXNULLXX"|"foo"}}

PySpark Script to load the file (from S3):

{code:title=load.py|borderStyle=solid}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("test_csv").getOrCreate()

fields = []
fields.append(StructField("First Null Field", StringType(), True))
fields.append(StructField("Empty String Field", StringType(), True))
fields.append(StructField("Second Null Field", StringType(), True))
fields.append(StructField("Non Empty String Field", StringType(), True))
schema = StructType(fields)

keys = ['s3://mybucket/test/demo.csv']

bad_data = spark.read.csv(keys,
                          timestampFormat="yyyy-MM-dd HH:mm:ss",
                          mode="FAILFAST",
                          sep="|",
                          nullValue="XXNULLXX",
                          schema=schema)
bad_data.show()
{code}

Output

{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|              null|             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}

Expected Output:

{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|                  |             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}
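
For reference, the semantics expected in this report can be sketched without Spark. The snippet below is only an illustration built on Python's standard csv module. It is not Spark's parser and says nothing about how Spark implements nullValue. The helper name parse_row is hypothetical. It maps only the configured null marker to None and leaves an empty quoted field as an empty string.

```python
import csv
import io
from typing import List, Optional


def parse_row(line: str, sep: str = "|", null_value: str = "XXNULLXX") -> List[Optional[str]]:
    """Parse one delimited line; only the null marker becomes None.

    Hypothetical helper for illustration, not part of Spark or PySpark.
    """
    reader = csv.reader(io.StringIO(line), delimiter=sep, quotechar='"')
    row = next(reader)
    # An empty quoted field ("") is kept as an empty string because it
    # does not match the configured null marker.
    return [None if field == null_value else field for field in row]


# Same sample line as in the report.
row = parse_row('"XXNULLXX"|""|"XXNULLXX"|"foo"')
print(row)  # [None, '', None, 'foo']
```

The second element is an empty string rather than None, which corresponds to the blank cell in the Expected Output table above.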