I've seen this function referenced in a couple of places: first in this forum post
<https://forums.databricks.com/questions/951/why-should-i-use-parquet.html>
and then in this talk by Michael Armbrust
<https://www.youtube.com/watch?v=6axUqHCu__Y> around the 42-minute mark.

As I understand it, if you create a Parquet file using Spark, Spark will
then have access to the min/max values for each column.  If a query filters
on a value outside that range (a timestamp predicate, say), Spark will know
to skip that file entirely.

Michael says this feature is turned off by default in 1.3.  How can I turn
this on?
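
Is it just a matter of setting the Parquet filter pushdown property before
running the query?  I'm guessing at the exact property name here, something
like:

    // Guess: spark.sql.parquet.filterPushdown appears to default to false
    // in 1.3, so presumably it has to be switched on explicitly.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // or when launching the shell:
    //   spark-shell --conf spark.sql.parquet.filterPushdown=true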

I don't see much about this feature online.  A couple of other questions:

- Does this only work for Parquet files that were created in Spark?  For
example, if I create the Parquet file using Hive + MapReduce, or Impala,
would Spark still have access to min/max values?

- Does this feature work at the row group level, or only at the whole-file
level?


