Aside from Nong's comments, I think PARQUET-222, where we discussed a performance issue with writing wide tables, can also be helpful.

Cheng

On 1/23/16 4:53 PM, Nong Li wrote:
I expect this to be difficult. This is roughly three orders of magnitude
more columns than even a typical wide-table use case.

Answers inline.

On Thu, Jan 21, 2016 at 2:10 PM, Krishna <research...@gmail.com> wrote:

We are considering using Parquet to store a matrix that is dense and
very, very wide (it can have more than 600K columns).
I have the following questions:
    - Is there a limit on the number of columns in a Parquet file? We
    expect to query [10-100] columns at a time using Spark - what are the
    performance implications in this scenario?

There is no hard limit, but I think you'll probably run into some issues.
There will probably be code paths that are not optimized for schemas this
big, but I expect those to be easier to address. The default configurations
will probably not work well (the metadata-to-data ratio would be bad). You
can try configuring very large row groups and see how that goes.
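
To make that concrete, something along these lines might be a starting
point from Spark (untested sketch; sc and sqlContext are assumed to be your
existing contexts, and the 1 GB row group size, path, and column names are
placeholders):

    // Row group size is controlled by the Hadoop property
    // "parquet.block.size" (in bytes); set it before writing.
    // The 1 GB value here is purely illustrative.
    sc.hadoopConfiguration.set("parquet.block.size",
      (1024 * 1024 * 1024).toString)

    // When querying, select only the 10-100 columns you need so the
    // Parquet reader touches just those column chunks, not all 600K.
    val df = sqlContext.read.parquet("/hypothetical/path/to/matrix")
    df.select("c17", "c42", "c99").show()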


    - We want a schema-less solution since the matrix can get wider over
    time.
    - Is there a way to generate such wide, structured, schema-less Parquet
    files using map-reduce (the input files are in a custom binary format)?

No, Parquet requires a schema. The schema is flexible, so you could map
your schema to a Parquet schema (each column could be binary, for example).
Why are you looking to use Parquet for this use case?
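
For illustration only, here is one way to express such a matrix as a
Parquet-compatible schema from Spark (the "c0", "c1", ... column names and
the all-binary typing are assumptions about your data, not a recommendation):

    import org.apache.spark.sql.types.{BinaryType, StructField, StructType}

    // One binary column per matrix column; nothing below is specific to
    // 600K columns, but expect schema handling itself to get expensive.
    val numCols = 600000
    val matrixSchema = StructType(
      (0 until numCols).map(i => StructField(s"c$i", BinaryType, nullable = true))
    )

You could then build rows from your custom binary input and write with
sqlContext.createDataFrame(rowRDD, matrixSchema).write.parquet(...), where
rowRDD is whatever RDD of Rows your job produces.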


    - HBase can support millions of columns - does anyone have prior
    experience comparing Parquet vs. HFile performance for wide structured
    tables?
    - Does Impala have support for an evolving schema?

Yes. Different systems have different rules on what is allowed, but the
case of appending a column to an existing schema should be well supported.
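
In Spark, for example, files whose schemas differ only by appended columns
can be reconciled at read time with Parquet schema merging (sketch; the
path is hypothetical):

    val merged = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/hypothetical/path/to/matrix")
    merged.printSchema()

Note that Spark leaves schema merging off by default because it can be
expensive, and that cost will be amplified by a schema this wide.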

Krishna

