Rafal Wojdyla commented on SPARK-35386:
---------------------------------------

{quote}
IIRC, it's not documented. We might think about documenting this behaviour.
{quote}

I don't really have the context for this, and I also consider the current behaviour to be a bug.

{quote}
You could clip the schema from the file and user specified schema, and compare by recursively traversing.
{quote}

I certainly could, I'm just wondering what a good user experience for this would be. Assuming a post-read schema comparison:

{code:python}
my_read_schema = ...
df = spark.read.schema(my_read_schema).parquet(...)
# custom code to compare df.schema to my_read_schema; it would:
#  - ignore metadata
#  - fail on nullable -> required "promotion"
#  - but, for example, allow required -> nullable promotion
{code}

This is obviously not a great user experience, and likely won't handle all corner cases (a fuller sketch of such a check is at the end of this message). I think as a user I would prefer the following:
* if the user-specified schema has a {{nullable=False}}/{{required}} column that doesn't exist in the data, the read should fail
* if the user-specified schema has a {{nullable=True}} column that doesn't exist in the data, the read should create a new "empty" (all-null) column

This allows for schema evolution, where you can promote {{required}} to {{nullable}}, and it lets the user control the read op without extra complicated schema comparison code (a user-space approximation of this behaviour is sketched at the end of this message as well). [~hyukjin.kwon] wdyt?

> parquet read with schema should fail on non-existing columns
> -------------------------------------------------------------
>
>                 Key: SPARK-35386
>                 URL: https://issues.apache.org/jira/browse/SPARK-35386
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark
>    Affects Versions: 3.0.1
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> When a read schema is specified, as a user I would prefer/like it if Spark failed on missing columns.
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import DoubleType, StructField, StructType
>
> spark: SparkSession = ...
>
> spark.read.parquet("/tmp/data.snappy.parquet")
> # inferred schema includes 3 columns: col1, col2, new_col
> # DataFrame[col1: bigint, col2: bigint, new_col: bigint]
>
> # let's specify a custom read_schema, with **non nullable** col3 (which is not present):
> read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])
> df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
>
> df.schema
> # we get a DataFrame with **nullable** col3:
> # StructType(List(StructField(col3,DoubleType,true)))
>
> df.count()
> # 0
> {code}
> Is this a feature or a bug? In this case there's just a single parquet file; I have also tried {{option("mergeSchema", "true")}}, which doesn't help. A similar read pattern would fail in pandas (and likely dask).
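For concreteness, here's a minimal sketch of what that post-read check could look like, following the rules above (ignore metadata, fail on nullable -> required, allow required -> nullable). {{validate_read_schema}} is a hypothetical helper, not an existing Spark API:

{code:python}
from pyspark.sql.types import StructType

def validate_read_schema(expected: StructType, actual: StructType) -> None:
    """Fail if `actual` (df.schema) can't satisfy `expected` (the read schema)."""
    # field metadata is deliberately never compared
    actual_by_name = {f.name: f for f in actual.fields}
    for exp in expected.fields:
        act = actual_by_name.get(exp.name)
        if act is None:
            raise ValueError(f"column missing from data: {exp.name!r}")
        # fail on nullable -> required "promotion"; the reverse direction
        # (required -> nullable) is allowed, for schema evolution
        if not exp.nullable and act.nullable:
            raise ValueError(
                f"column {exp.name!r} is required in the read schema, "
                "but nullable in the data")
        if isinstance(exp.dataType, StructType) and isinstance(act.dataType, StructType):
            # recurse into nested structs
            validate_read_schema(exp.dataType, act.dataType)
        elif exp.dataType != act.dataType:
            raise ValueError(
                f"type mismatch for {exp.name!r}: {exp.dataType} != {act.dataType}")

# usage, right after the read:
# validate_read_schema(my_read_schema, df.schema)
{code}

With the current behaviour from the description below, this would raise on {{col3}}, since Spark silently returns it as nullable even though the read schema marks it as required.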
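And here's a rough user-space approximation of the read behaviour proposed in the bullets above, just to illustrate the semantics. {{read_parquet_strict}} is a hypothetical wrapper (a real fix would belong in the datasource itself), and it deliberately skips type reconciliation for columns that do exist in the data:

{code:python}
from pyspark.sql import DataFrame, SparkSession, functions as F
from pyspark.sql.types import StructType

def read_parquet_strict(spark: SparkSession, path: str, schema: StructType) -> DataFrame:
    df = spark.read.parquet(path)  # let Spark infer the on-disk schema
    present = set(df.columns)
    for field in schema.fields:
        if field.name in present:
            continue
        if not field.nullable:
            # required column absent from the data -> fail the read
            raise ValueError(f"required column missing from data: {field.name!r}")
        # nullable column absent from the data -> add an "empty" (all-null) column
        df = df.withColumn(field.name, F.lit(None).cast(field.dataType))
    # return columns in the order of the requested schema
    return df.select(*[f.name for f in schema.fields])
{code}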