Hi all,

I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) to find out about the improvements made in filtering/scanning Parquet files when querying tables via SparkSQL, and how these changes relate to the new filter API introduced in Parquet 1.7.0. After checking the usual sources, I still can't make sense of some of the numbers shown on the Spark UI.

As an example, I'm looking at the collect stage for a query that selects a single row from a table of 1 million numbers using a simple where clause (i.e. col1 = 500000). This is what the UI shows for the four tasks (Input Size / Records):

0  SUCCESS  ...  2.4 MB (hadoop) / 0
1  SUCCESS  ...  2.4 MB (hadoop) / 250000
2  SUCCESS  ...  2.4 MB (hadoop) / 0
3  SUCCESS  ...  2.4 MB (hadoop) / 0

Based on the min/max statistics of each of the Parquet parts, it makes sense that 3 of the 4 tasks produce no records, because the row I'm looking for can only be in a single file. But why is the input size of each task shown as 2.4 MB, adding up to an overall input size of 9.7 MB for the whole stage? Shouldn't the reader only consume the footer metadata of the skipped files and ignore their content?

Regards,
Dennis
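P.S. To make my expectation concrete, here is a minimal sketch of the pruning I assumed was happening. The part names and min/max ranges below are hypothetical (not taken from my actual files), and this is of course my mental model, not Spark's or Parquet's actual implementation:

```python
# Hypothetical min/max statistics for four Parquet part files holding
# 1 million consecutive numbers (250k rows each). These ranges are made
# up for illustration; they are not the real stats from my table.
parts = {
    "part-0": (0, 249999),
    "part-1": (250000, 499999),
    "part-2": (500000, 749999),
    "part-3": (750000, 999999),
}

def parts_to_scan(stats, value):
    """Return the part files whose [min, max] range can contain `value`,
    i.e. the only files whose row content should need to be read for an
    equality predicate like col1 = value."""
    return [name for name, (lo, hi) in stats.items() if lo <= value <= hi]

print(parts_to_scan(parts, 500000))  # -> ['part-2']
```

Under this model only one of the four files should need its content read (and its 250000 records filtered down to one row), while the other three should cost little more than a footer read, which is why the uniform 2.4 MB per task surprises me.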