Dear list, I was hoping somebody could help me with a couple of quick questions:
* I'm interested in fast range queries on Parquet files, i.e. a query against a large file that selects all records whose column value falls in some interval or set. Parquet supports predicate pushdown (and some engines, e.g. Spark, do use it), but does Parquet have any support for indexed column values at all, e.g. exploiting sorted column values? Or does this fall within the scope of the (unimplemented?) index pages?

* What are common approaches to querying over a union of Parquet files? E.g. suppose I have a collection of log files in HDFS, one log file per day, and I want to sum all values for a specific field over all of them. If this data were in MySQL, I could have a table for each day of data, use a MyISAM merge table to union those tables together, and just query against the merge table. Does Parquet offer any tools for handling unions of distinct Parquet files? If not, what are common approaches to this problem in, say, Spark or Impala? I know you can create Parquet-backed tables in both of these engines, but can you create a union table? Would this union table then be stored in Parquet format somehow?

Please feel free to just link me to pages that might already cover these issues in detail. I've Googled around quite a bit, but perhaps my query choices sucked.

Thanks again for all your help!!

All the best,
-Paul
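P.S. To make the first question concrete, here is the kind of stat-based skipping I have in mind, sketched in plain Python. This is not the Parquet API; all names are made up. The idea is what I understand Parquet's per-row-group min/max column statistics to enable: a reader can prove a whole chunk cannot match a range predicate and skip it without scanning.

```python
# Illustrative sketch only: in-memory "chunks" stand in for Parquet row
# groups, and (min, max) stands in for the column statistics a Parquet
# writer records. Function names are hypothetical.

def chunk_stats(chunk):
    """Return (min, max) for a chunk, like per-row-group column stats."""
    return min(chunk), max(chunk)

def range_query(chunks, lo, hi):
    """Scan only chunks whose [min, max] range overlaps [lo, hi]."""
    hits = []
    for chunk in chunks:
        cmin, cmax = chunk_stats(chunk)
        if cmax < lo or cmin > hi:  # stats prove no match: skip the chunk
            continue
        hits.extend(v for v in chunk if lo <= v <= hi)
    return hits

# If the column is sorted, the stats are tight and most chunks get skipped.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(range_query(chunks, 4, 6))  # → [4, 5, 6]
```

What I'm unsure about is whether anything beyond these min/max stats (an actual index over sorted values) exists or is planned.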

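P.S. For the second question, here is the pattern I'm after, sketched with plain JSON-lines files standing in for Parquet (the filenames and the field name are made up). What I'd like to know is whether Spark or Impala can give me this "one logical table over many files" view over real Parquet files without copying the data.

```python
# Stand-in sketch: per-day JSON-lines files instead of Parquet files.
# The shape of the question is the MyISAM-merge-table idea: treat every
# file matching a pattern as one table and aggregate across all of them.
import glob
import json
import os
import tempfile

def write_day(path, values):
    """Write one day's worth of records (hypothetical field name)."""
    with open(path, "w") as f:
        for v in values:
            f.write(json.dumps({"bytes_sent": v}) + "\n")

def sum_over_union(pattern, field):
    """Sum `field` over the union of all files matching `pattern`."""
    total = 0
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                total += json.loads(line)[field]
    return total

tmp = tempfile.mkdtemp()
write_day(os.path.join(tmp, "log-2015-01-01.jsonl"), [10, 20])
write_day(os.path.join(tmp, "log-2015-01-02.jsonl"), [5])
print(sum_over_union(os.path.join(tmp, "log-*.jsonl"), "bytes_sent"))  # → 35
```

If one of the engines lets me declare a table over a directory (or glob) of Parquet files and run the SUM there, that's exactly what I'm looking for.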