Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/16578
  
    > Can you give an example where it would fail? We didn't change 
clipParquetSchema, so even when pruning happens, why would we read a 
superset of the file's schema and cause the exception, as the comment 
suggests? We won't add new fields, only remove existing ones from the 
file's schema, right?
    
    (Oddly, GitHub won't let me reply to this comment inline.)
    
    The situation we've run into is pruning a schema for a query over a 
partitioned Hive table backed by Parquet files, where some of those files 
are missing fields specified by the table schema. This can happen, e.g., 
through schema evolution, where fields are added to the table over time 
without rewriting existing partitions. In those cases, we've found that 
parquet-mr throws an exception if we try to read such a file with the 
table-pruned schema (a superset of that file's schema). Therefore, we 
further clip the pruned schema against each file's schema before 
attempting the read.
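    To illustrate the clipping step, here is a toy sketch (not Spark's 
actual implementation, which operates on Catalyst/Parquet schema types): 
schemas are modeled as plain dicts mapping field name to type, with nested 
structs as nested dicts. Clipping keeps only the pruned fields that the 
file actually contains, recursing into structs.

```python
def clip_schema(pruned, file_schema):
    """Return the subset of `pruned` whose fields also exist in
    `file_schema`, recursing into nested struct fields."""
    clipped = {}
    for name, dtype in pruned.items():
        if name not in file_schema:
            # Field added by schema evolution; absent from this older file,
            # so requesting it would make the reader fail.
            continue
        file_type = file_schema[name]
        if isinstance(dtype, dict) and isinstance(file_type, dict):
            # Nested struct: clip its fields recursively.
            nested = clip_schema(dtype, file_type)
            if nested:
                clipped[name] = nested
        else:
            clipped[name] = dtype
    return clipped

# Table schema after query pruning; suppose `email` was added later by
# schema evolution and is missing from an older partition's file.
pruned = {"id": "long", "contact": {"name": "string", "email": "string"}}
old_file = {"id": "long", "contact": {"name": "string"}}

print(clip_schema(pruned, old_file))
# -> {'id': 'long', 'contact': {'name': 'string'}}
```

    The real code clips against each file's footer schema at read time, 
so the request sent to parquet-mr is always a subset of what the file 
stores.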

