Rafal Wojdyla created SPARK-35386:
-------------------------------------

             Summary: parquet read with schema should fail on non-existing columns
                 Key: SPARK-35386
                 URL: https://issues.apache.org/jira/browse/SPARK-35386
             Project: Spark
          Issue Type: Bug
          Components: Input/Output, PySpark
    Affects Versions: 3.0.1
            Reporter: Rafal Wojdyla


When a read schema is specified explicitly, as a user I would prefer/expect Spark to 
fail on columns that are missing from the underlying files.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark: SparkSession = ...

spark.read.parquet("/tmp/data.snappy.parquet")
# inferred schema, includes 3 columns: col1, col2, new_col
# DataFrame[col1: bigint, col2: bigint, new_col: bigint]

# let's specify a custom read_schema, with a **non nullable** col3 (which is
# not present in the file):
read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])

df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")

df.schema
# instead of failing, we silently get a DataFrame with a **nullable** col3:
# StructType(List(StructField(col3,DoubleType,true)))

df.count()
# 0
{code}
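
The only way I've found to get the failing behaviour today is a manual pre-check. This is just a workaround sketch (my own helper, not a Spark API), reusing the path and read_schema from the example above:

{code:python}
# Workaround sketch (assumption, not built-in behaviour): infer the file's
# schema first and fail explicitly when requested columns are absent.
from pyspark.sql.types import StructType

def read_parquet_strict(spark, path: str, read_schema: StructType):
    inferred = spark.read.parquet(path).schema
    missing = set(read_schema.fieldNames()) - set(inferred.fieldNames())
    if missing:
        raise ValueError(f"Columns missing from {path}: {sorted(missing)}")
    return spark.read.schema(read_schema).parquet(path)

# read_parquet_strict(spark, "/tmp/data.snappy.parquet", read_schema)
# -> ValueError: Columns missing from /tmp/data.snappy.parquet: ['col3']
{code}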

Is this a feature or a bug? In this case there's just a single parquet file; I 
have also tried {{option("mergeSchema", "true")}}, which doesn't help.

A similar read pattern would fail in pandas (and likely Dask); see the sketch below.
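
For comparison, a minimal pandas sketch (assuming the pyarrow engine; the exact exception type depends on the pandas/pyarrow version):

{code:python}
import pandas as pd

# pandas with the pyarrow engine raises when a requested column is not in the
# file, instead of silently returning an empty/null column:
pd.read_parquet("/tmp/data.snappy.parquet", columns=["col3"])
# raises (e.g. KeyError / ArrowInvalid, depending on the pyarrow version)
{code}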


