Rafal Wojdyla created SPARK-35386:
-------------------------------------

             Summary: parquet read with schema should fail on non-existing columns
                 Key: SPARK-35386
                 URL: https://issues.apache.org/jira/browse/SPARK-35386
             Project: Spark
          Issue Type: Bug
          Components: Input/Output, PySpark
    Affects Versions: 3.0.1
            Reporter: Rafal Wojdyla
When a read schema is specified, as a user I would prefer/expect Spark to fail on missing columns.

{code:python}
from pyspark.sql.types import StructType, StructField, DoubleType

spark: SparkSession = ...

# Inferred schema includes 3 columns: col1, col2, new_col
spark.read.parquet("/tmp/data.snappy.parquet")
# DataFrame[col1: bigint, col2: bigint, new_col: bigint]

# Specify a custom read schema with a **non-nullable** col3, which is not present in the file:
read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])
df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")

df.schema
# Instead of failing, we get a DataFrame with a **nullable** col3:
# StructType(List(StructField(col3,DoubleType,true)))

df.count()
# 0
{code}

Is this a feature or a bug? In this case there is just a single parquet file; I have also tried {{option("mergeSchema", "true")}}, which does not help. A similar read pattern would fail in pandas (and likely in Dask).
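As a rough workaround sketch (not part of the original report, and the helper name {{assert_columns_exist}} is hypothetical), one can infer the on-disk schema first and fail explicitly when the requested read schema references columns the file does not contain:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType


def assert_columns_exist(spark: SparkSession, path: str, read_schema: StructType) -> None:
    """Raise if the requested read schema references columns missing from the parquet file."""
    # Infer the on-disk schema (an extra metadata read, cheap for a single file).
    file_schema = spark.read.parquet(path).schema
    missing = set(read_schema.fieldNames()) - set(file_schema.fieldNames())
    if missing:
        raise ValueError(f"Columns not present in {path}: {sorted(missing)}")


# Usage: validate before reading with the custom schema.
# assert_columns_exist(spark, "/tmp/data.snappy.parquet", read_schema)
# df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
{code}

This only papers over the behavior; the request here is that Spark itself fail (or at least warn) instead of silently returning an empty, nullable column.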