Hi, currently most of the data in our production is stored as Avro + Snappy. I want
to test the benefits of storing the data in Parquet format instead. I changed our
ETL to generate Parquet output instead of Avro, and I want to run a simple
query in Spark SQL to verify the benefits of Parquet.
I generated the same dataset in both Avro and Parquet on HDFS, and loaded them
both into Spark SQL. When I run the same query, like "select column1 from
src_table_avro/parquet where column2 = xxx", I can see that the job runs much
faster against the Parquet data. The test files for both formats are around 930M
in size. The Avro job generated 8 tasks to read the data with a median duration
of 21s, vs. the Parquet job, which generated 7 tasks to read the data with a
median duration of 0.4s.
Since the dataset has more than 100 columns, I can see that the Parquet files
really do give fast reads. But my question is: in the Spark UI, both jobs show
900M as the input size (and 0 for the rest), so how do I know that column
pruning is really working? I assume that is why the Parquet files can be read so
fast, but is there any statistic on the Spark UI that can prove it to me?
Something like: the total input file size is 900M, but only 10M was actually
read due to column pruning? That way, if column pruning doesn't work for Parquet
because of some kind of SQL query, I can identify it up front.
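For example, I could print the physical plan with explain(), which I assume lists the columns the Parquet scan actually reads, but that still doesn't show me the bytes read the way the Spark UI input size does:

```scala
// A rough check: the physical plan should show the projected columns for the
// Parquet scan (and any pushed-down filters), but not the bytes actually read.
sqlContext.sql("select column1 from src_table_parquet where column2 = 'xxx'").explain(true)
```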
Thanks
Yong