Re: Support for range queries and unions of parquet files (merge tables?)

Nong Li Sun, 11 Jan 2015 20:49:39 -0800

Inline.

On Fri, Jan 9, 2015 at 3:24 PM, Paul Wais <[email protected]> wrote:


> Dear list,
>
> I was hoping somebody could help me with a couple of quick questions:
>
> * I'm interested in performant range queries on Parquet files, i.e. a query
> against a large file that selects all records with column value in some
> interval or set.  Parquet supports Predicate Pushdown (and some engines,
> e.g. Spark, do use it), but does Parquet have support for indexed column
> values at all?  E.g. using sorted column values?  Or does this issue fall
> within the scope of (unimplemented?) Index Pages?


We've thought about how to make these kinds of queries work well, but a good
amount of it still needs to be implemented. The file format allows for
sorting columns. Using this in predicate pushdown should not be too hard to
integrate and would help these queries a lot. Index pages would make this
perform even better, but that is further out.
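
To illustrate why sorted columns help predicate pushdown, here is a minimal
sketch in plain Python. It does not use any Parquet library: `RowGroup`,
`range_query`, and `make_row_groups` are hypothetical names, and the min/max
values stand in for the per-row-group column statistics a reader could
consult. The assumption is that a reader may skip any row group whose
[min, max] range cannot overlap the queried interval.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group of one integer column."""
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def range_query(row_groups: List[RowGroup], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return (matching values, number of row groups actually scanned)."""
    matches: List[int] = []
    scanned = 0
    for rg in row_groups:
        # Skip the whole group when its [min, max] cannot overlap [lo, hi].
        if rg.max_value < lo or rg.min_value > hi:
            continue
        scanned += 1
        matches.extend(v for v in rg.values if lo <= v <= hi)
    return matches, scanned


def make_row_groups(values: List[int], size: int) -> List[RowGroup]:
    """Split a column into fixed-size row groups, preserving order."""
    return [RowGroup(values[i:i + size]) for i in range(0, len(values), size)]


column = [7, 42, 3, 18, 25, 1, 39, 12, 30, 5, 21, 44]
unsorted_groups = make_row_groups(column, 4)
sorted_groups = make_row_groups(sorted(column), 4)

# Same query, same answer, but far fewer row groups touched when sorted.
print(range_query(unsorted_groups, 18, 25))
print(range_query(sorted_groups, 18, 25))
```

With unsorted data every row group spans nearly the full value range, so
nothing can be skipped. Once the column is sorted, the ranges become disjoint
and only the groups that overlap the interval are read.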


> * What are common approaches for addressing over a union of parquet files?
> E.g. suppose I have a collection of log files in HDFS, one log file per
> day.  Suppose I want to sum all values for a specific field over all log
> files.  If this data were in MySQL, I could have a table for each day of
> data and use a MyISAM merge table to union these tables together and just
> query against the merge table.
>
> Does Parquet offer any tools for handling unions of distinct Parquet
> files?  If not, what are common approaches to this problem using, say,
> Spark or Impala?  I know you can create parquet-backed tables in both of
> these engines, but can you create a union table?  Would this union table
> then be stored in parquet format somehow?
>
Using something like a union table is out of scope for Parquet. Query
engines that support Parquet would be the place to implement that
functionality. This can be done in Hive, for example.

Parquet does support schema evolution, though, and this might be good enough
in your case. The log files make up one table, and as long as the schema for
those files doesn't change too much (e.g. a column is added), you can read
all the files as if they were one table.
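
As a rough sketch of what "reading all the files as one table" means when a
column is added over time, the plain-Python example below models each daily
file as a list of row dictionaries. It is not a Parquet or query-engine API:
`read_union`, `sum_column`, and the sample data are hypothetical. The
assumption is that a column missing from an older file is read as null
(`None` here).

```python
from typing import Dict, Iterable, Iterator, List, Optional

Row = Dict[str, Optional[int]]


def read_union(files: Iterable[List[dict]], columns: List[str]) -> Iterator[dict]:
    """Yield rows from every file, filling columns a file lacks with None."""
    for rows in files:
        for row in rows:
            yield {col: row.get(col) for col in columns}


def sum_column(files: Iterable[List[dict]], column: str) -> int:
    """Sum one column across all files, ignoring nulls."""
    return sum(
        row[column]
        for row in read_union(files, [column])
        if row[column] is not None
    )


# One "file" per day; the second day's schema adds a `status` column.
day1 = [{"user": "a", "bytes": 100}, {"user": "b", "bytes": 250}]
day2 = [{"user": "a", "bytes": 50, "status": 200}]

print(sum_column([day1, day2], "bytes"))
print(list(read_union([day1, day2], ["user", "status"])))
```

In a real deployment the engine (Hive, Impala, Spark) does this over a
directory of files that back a single table, so the daily sum becomes an
ordinary aggregate query against that table.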


>
> Please feel free to just link me to pages that might already cover these
> issues in detail.  I've Googled around quite a bit but perhaps my query
> choices sucked.  Thanks again for all your help!!
>
> All the best,
> -Paul
>
