Hi all,

I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) to find out about the improvements made in filtering/scanning Parquet files when querying tables via SparkSQL, and how these changes relate to the new filter API introduced in Parquet 1.7.0. After checking the usual sources, I still can't make sense of some of the numbers shown on the Spark UI.

As an example, I'm looking at the collect stage for a query that selects a single row from a table of 1 million numbers using a simple where clause (i.e. col1 = 500000). This is what the UI shows for the four tasks (Input Size / Records):

0  SUCCESS  ...  2.4 MB (hadoop) / 0
1  SUCCESS  ...  2.4 MB (hadoop) / 250000
2  SUCCESS  ...  2.4 MB (hadoop) / 0
3  SUCCESS  ...  2.4 MB (hadoop) / 0

Based on the min/max statistics of each of the Parquet parts, it makes sense that 3 of the 4 tasks produce no records, because the row I'm looking for can only be in a single file. But why is the input size of each task shown as 2.4 MB, adding up to an overall input size of 9.7 MB for the whole stage? Shouldn't the reader only consume the footer metadata of the skipped files and ignore their content?

Regards,
Dennis
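P.S. To make my expectation concrete, here is a minimal sketch of the pruning I assumed was happening. The part names and min/max ranges below are hypothetical (not taken from my actual files), and this is of course my mental model, not Spark's or Parquet's actual implementation:

```python
# Hypothetical min/max statistics for four Parquet part files holding
# 1 million consecutive numbers (250k rows each). These ranges are made
# up for illustration; they are not the real stats from my table.
parts = {
    "part-0": (0, 249999),
    "part-1": (250000, 499999),
    "part-2": (500000, 749999),
    "part-3": (750000, 999999),
}

def parts_to_scan(stats, value):
    """Return the part files whose [min, max] range can contain `value`,
    i.e. the only files whose row content should need to be read for an
    equality predicate like col1 = value."""
    return [name for name, (lo, hi) in stats.items() if lo <= value <= hi]

print(parts_to_scan(parts, 500000))  # -> ['part-2']
```

Under this model only one of the four files should need its content read (and its 250000 records filtered down to one row), while the other three should cost little more than a footer read, which is why the uniform 2.4 MB per task surprises me.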