Answers inline.

On Thu, Jan 21, 2016 at 2:10 PM, Krishna <research...@gmail.com> wrote:

> We are considering using Parquet for storing a matrix that is dense and
> very, very wide (it can have more than 600K columns). I have the
> following questions:
>
>    - Is there a limit on the # of columns in a Parquet file? We expect to
>    query [10-100] columns at a time using Spark - what are the performance
>    implications in this scenario?

There is no hard limit, but I think you'll probably run into some issues. I
expect this to be difficult: 600K columns is roughly three orders of
magnitude more than even a typical wide-table use case. There will probably
be code paths that are not optimized for schemas this big, but I expect
those to be easier to address. The default configurations will probably not
work well (the metadata-to-data ratio would be bad). You can try configuring
very large row groups and see how that goes; a sketch follows.
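For example, with parquet-mr's ParquetOutputFormat you could raise the row
group size when configuring the write job. A rough, untested sketch (the
class name is made up, and 1 GB is only a starting point to tune, not a
recommendation):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.parquet.hadoop.ParquetOutputFormat;

    public class WideMatrixWriteConfig {
        public static Job configure(Configuration conf) throws Exception {
            Job job = Job.getInstance(conf, "wide-matrix-write");
            // Larger row groups mean fewer row groups per file, so the
            // per-column-chunk footer metadata is amortized over more data.
            ParquetOutputFormat.setBlockSize(job, 1024 * 1024 * 1024); // 1 GB
            return job;
        }
    }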
>    - We want a schema-less solution, since the matrix can get wider over
>    time.
>    - Is there a way to generate such wide, structured, schema-less Parquet
>    files using map-reduce (the input files are in a custom binary format)?

No, Parquet requires a schema. The schema is flexible, though, so you could
map your matrix onto a Parquet schema (each column could be binary, for
example); there is a sketch of that below. Why are you looking to use
Parquet for this use case?

>    - HBase can support millions of columns - does anyone have prior
>    experience comparing Parquet vs. HFile performance for wide structured
>    tables?
>    - Does Impala have support for evolving schemas?

Yes. Different systems have different rules on what is allowed, but the
case of appending a column to an existing schema should be well supported;
see the last sketch below.

> Krishna
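To illustrate the schema mapping: parquet-mr's Types builder can generate a
wide message type programmatically, so you can regenerate the schema as the
matrix grows rather than maintaining it by hand. An untested sketch (the
class and column names are made up):

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class WideMatrixSchema {
        // One optional BINARY field per matrix column; rerun with a larger
        // numColumns as the matrix widens to get a wider schema.
        public static MessageType build(int numColumns) {
            Types.MessageTypeBuilder builder = Types.buildMessage();
            for (int i = 0; i < numColumns; i++) {
                builder.optional(PrimitiveTypeName.BINARY).named("c" + i);
            }
            return builder.named("matrix");
        }
    }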
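And for reading files written against different versions of such a schema,
Spark can union the per-file schemas with its mergeSchema option. A rough
sketch using the Java API (the path and column names are hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class EvolvingSchemaRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("wide-matrix-read")
                .getOrCreate();
            // Files written before a column was appended simply return
            // nulls for it; mergeSchema unions all the per-file schemas.
            Dataset<Row> matrix = spark.read()
                .option("mergeSchema", "true")
                .parquet("/data/matrix"); // hypothetical path
            matrix.select("c0", "c1", "c42").show();
            spark.stop();
        }
    }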