I've seen this feature referenced in a couple of places: first in this forum post <https://forums.databricks.com/questions/951/why-should-i-use-parquet.html>, and in this talk by Michael Armbrust <https://www.youtube.com/watch?v=6axUqHCu__Y>, around the 42nd minute.
As I understand it, if you create a Parquet file using Spark, Spark will then have access to min/max values for each column. If a query asks for values outside that range (e.g. a timestamp filter), Spark will know to skip that file entirely. Michael says this feature is turned off by default in 1.3. How can I turn it on? I haven't found much about this feature online.

A couple of other questions:

- Does this only work for Parquet files that were created by Spark? For example, if I create the Parquet file using Hive + MapReduce, or Impala, would Spark still have access to the min/max values?
- Does this feature work at the row group level, or just at the file level?
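In case it helps anyone answer: here's what I was planning to try, based on skimming the 1.3 SQL docs. I'm guessing the relevant setting is a SQLContext conf, spark.sql.parquet.filterPushdown, but I haven't verified that this is the right knob, and the file path below is just a placeholder:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext("local[*]", "parquet-skipping-test")
    val sqlContext = new SQLContext(sc)

    // My guess at the property name for pushing predicates into Parquet:
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // If skipping works, a filter outside a file's min/max range should let
    // Spark avoid scanning that file at all:
    val df = sqlContext.parquetFile("/tmp/events.parquet")  // hypothetical path
    df.filter(df("timestamp") > "2015-08-01").count()

Is this roughly the right approach, or is there more to enabling it than setting that one conf?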