Dear list,

I was hoping somebody could help me with a couple of quick questions:

* I'm interested in performant range queries on Parquet files, i.e. a query
against a large file that selects all records with a column value in some
interval or set.  Parquet supports predicate pushdown (and some engines,
e.g. Spark, do use it), but does Parquet have any support for indexed
column values, e.g. by exploiting sorted column values?  Or does this fall
within the scope of the (unimplemented?) Index Pages?
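For context, Parquet stores min/max statistics per row group in the file
footer, and a reader can use them to skip whole row groups for a range
predicate even without a separate index -- this works best when the column
is sorted, so each row group covers a narrow range. A toy sketch of that
pruning logic (the stats tuples here are invented, not read from a real
file):

```python
# Sketch: how per-row-group min/max statistics let a Parquet reader
# skip data for a range predicate.  Illustrative only; a real reader
# gets these stats from the file footer metadata.

def groups_to_scan(row_group_stats, lo, hi):
    """Return indices of row groups whose [min, max] overlaps [lo, hi]."""
    return [i for i, (g_min, g_max) in enumerate(row_group_stats)
            if g_max >= lo and g_min <= hi]

# Three row groups over a sorted column: (min, max) stats per group.
stats = [(0, 99), (100, 199), (200, 299)]
print(groups_to_scan(stats, 120, 180))  # -> [1]: only the middle group
```

If the column is unsorted, the per-group ranges overlap and fewer groups
can be skipped, which is why sorting on the filter column helps so much.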

* What are common approaches for querying over a union of Parquet files?
E.g. suppose I have a collection of log files in HDFS, one log file per
day.  Suppose I want to sum all values for a specific field over all log
files.  If this data were in MySQL, I could have a table for each day of
data and use a MyISAM merge table to union these tables together and just
query against the merge table.

Does Parquet offer any tools for handling unions of distinct Parquet
files?  If not, what are common approaches to this problem using, say,
Spark or Impala?  I know you can create parquet-backed tables in both of
these engines, but can you create a union table?  Would this union table
then be stored in parquet format somehow?
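(For what it's worth, the usual answer is that both Spark and Impala can
point a table at a directory of Parquet files and treat the whole directory
as one logical table -- e.g. `spark.read.parquet` accepts a directory or
glob -- so no separate "union table" format is needed.  Here is a minimal
stdlib-only sketch of that "directory as union" idea, with JSON-lines
files standing in for the daily Parquet files; the file names and the
`value` field are made up for illustration:)

```python
# Sketch: summing one field across a directory of per-day log files,
# treating the directory itself as the union -- the same model Spark
# and Impala use for Parquet-backed tables.
import glob, json, os, tempfile

# Fake two daily log files (stand-ins for log-YYYY-MM-DD.parquet).
tmp = tempfile.mkdtemp()
for day, values in [("2015-01-01", [1, 2]), ("2015-01-02", [3, 4])]:
    with open(os.path.join(tmp, f"log-{day}.jsonl"), "w") as f:
        for v in values:
            f.write(json.dumps({"value": v}) + "\n")

# Query "the union" by globbing the directory and aggregating.
total = sum(json.loads(line)["value"]
            for path in sorted(glob.glob(os.path.join(tmp, "log-*.jsonl")))
            for line in open(path))
print(total)  # -> 10
```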

Please feel free to just link me to pages that already cover these issues
in detail.  I've Googled around quite a bit, but perhaps my search terms
were off.  Thanks for all your help!

All the best,
-Paul
