Replies inline.

On Fri, Jan 9, 2015 at 3:24 PM, Paul Wais <[email protected]> wrote:
> Dear list,
>
> I was hoping somebody could help me with a couple of quick questions:
>
> * I'm interested in performant range queries on Parquet files, i.e. a query
> against a large file that selects all records with column value in some
> interval or set. Parquet supports Predicate Pushdown (and some engines,
> e.g. Spark, do use it), but does Parquet have support for indexed column
> values at all? E.g. using sorted column values? Or does this issue fall
> within the scope of (unimplemented?) Index Pages?

We've thought about how to make this kind of query work well, but a good
amount of it still needs to be implemented. The file format allows for
sorting columns. Using this in predicate pushdown should not be too hard to
integrate and would help these queries a lot. Index pages would make this
perform even better, but that is further out.

> * What are common approaches for addressing over a union of parquet files?
> E.g. suppose I have a collection of log files in HDFS, one log file per
> day. Suppose I want to sum all values for a specific field over all log
> files. If this data were in MySQL, I could have a table for each day of
> data and use a MyISAM merge table to union these tables together and just
> query against the merge table.
>
> Does Parquet offer any tools for handling unions of distinct Parquet
> files? If not, what are common approaches to this problem using, say,
> Spark or Impala? I know you can create parquet-backed tables in both of
> these engines, but can you create a union table? Would this union table
> then be stored in parquet format somehow?
>

Something like a union table is out of scope for Parquet itself. Query
engines that support Parquet would be the place to implement that
functionality. This can be done in Hive, for example.

Parquet does support schema evolution, though, and this might be good enough
in your case. The log files make up one table, and as long as the schema for
those files doesn't change too much (e.g. a column is added), you can read
all the files as if they were one table.

> Please feel free to just link me to pages that might already cover these
> issues in detail. I've Googled around quite a bit but perhaps my query
> choices sucked. Thanks again for all your help!!
>
> All the best,
> -Paul
>
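On the range-query question, here is a rough sketch of why sorting helps predicate pushdown. It is plain Python, not Parquet code: `RowGroup`, `make_row_groups`, and `range_query` are hypothetical names made up for illustration. The only assumption is that each chunk of rows carries min/max statistics for the column, so a reader can skip chunks whose range cannot match.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a chunk of rows with min/max column stats."""
    min_value: int
    max_value: int
    values: List[int]


def make_row_groups(sorted_values, group_size):
    """Split an already-sorted column into fixed-size groups with stats."""
    groups = []
    for start in range(0, len(sorted_values), group_size):
        chunk = sorted_values[start:start + group_size]
        # Because the input is sorted, the first and last values are min and max.
        groups.append(RowGroup(chunk[0], chunk[-1], chunk))
    return groups


def range_query(groups, low, high):
    """Return (values in [low, high], number of groups actually scanned)."""
    matches = []
    scanned = 0
    for group in groups:
        # Pushdown step: skip the group if its stats rule out any match.
        if group.max_value < low or group.min_value > high:
            continue
        scanned += 1
        matches.extend(v for v in group.values if low <= v <= high)
    return matches, scanned


groups = make_row_groups(list(range(1000)), group_size=100)
matches, scanned = range_query(groups, 250, 420)
print(len(groups), scanned, len(matches))
```

With a sorted column, the matching values sit in a few adjacent groups and the rest are skipped on statistics alone. With an unsorted column, most groups would have wide min/max ranges and little could be skipped.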

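On the union question, the "read all the files as one table" idea can be sketched the same way. Again this is plain Python with made-up data, not a Parquet or Hive API: `day1`, `day2`, `merged_schema`, `read_as_one_table`, and `sum_field` are all hypothetical. It assumes the only schema change between days is an added column, and that rows from older files simply get a null for it.

```python
# Hypothetical daily log "files"; the second day adds a column named "bytes".
day1 = {"schema": ["user_id", "hits"],
        "rows": [(1, 10), (2, 5)]}
day2 = {"schema": ["user_id", "hits", "bytes"],
        "rows": [(1, 7, 2048), (3, 1, 512)]}


def merged_schema(files):
    """Union of all column names, in order of first appearance."""
    columns = []
    for f in files:
        for name in f["schema"]:
            if name not in columns:
                columns.append(name)
    return columns


def read_as_one_table(files):
    """Yield every row as a dict over the merged schema (missing -> None)."""
    columns = merged_schema(files)
    for f in files:
        for row in f["rows"]:
            record = dict(zip(f["schema"], row))
            yield {name: record.get(name) for name in columns}


def sum_field(files, field):
    """Sum one field across all files, ignoring rows where it is missing."""
    return sum(r[field] for r in read_as_one_table(files)
               if r[field] is not None)


total_hits = sum_field([day1, day2], "hits")
total_bytes = sum_field([day1, day2], "bytes")
print(total_hits, total_bytes)
```

In practice the query engine does this merging for you. The sketch only shows why an added column does not prevent treating the per-day files as a single table.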