PARQUET-222 is mostly a memory issue caused by the # of columns. On the write path, each column comes with its own write buffers, and these can accumulate to a large amount of memory. In the case investigated in PARQUET-222, it took more than 10G to write a single row consisting of 26K integer columns. I.e., this issue is related to column count rather than row count.

But that was the situation with Parquet 1.6. I haven't checked all the memory management improvements that have landed recently, and I haven't repeated the experiment with newer versions of Parquet yet.
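
For anyone who wants to experiment, here is a rough Scala/Spark sketch of the main per-column knob, assuming an existing SparkContext sc; the value is only illustrative, and I haven't verified how far it gets you at hundreds of thousands of columns:

    // Untested sketch: each column buffers roughly a page of data on the write
    // path, so shrinking the page size is one way to trade I/O efficiency for
    // lower write memory. "parquet.page.size" is a standard parquet-hadoop key.
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.setInt("parquet.page.size", 64 * 1024) // parquet-mr default is 1 MB
    // df.write.parquet("/path/to/matrix")            // df: a DataFrame built elsewhere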

Cheng

On 1/25/16 11:50 AM, Krishna wrote:
Thanks Cheng, Nong.

Data in the matrix is homogeneous (cells are booleans), so I don't expect
to face memory-related issues. Is the limitation on the # of columns
itself, or memory issues caused by the # of columns? To me it sounds more
like memory issues.

On Mon, Jan 25, 2016 at 10:16 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

Aside from Nong's comment, I think PARQUET-222, where we discussed a
performance issue of writing wide tables, can be helpful.

Cheng


On 1/23/16 4:53 PM, Nong Li wrote:

I expect this to be difficult. This is roughly 3 orders of magnitude more
columns than even a typical wide table use case.

Answers inline.

On Thu, Jan 21, 2016 at 2:10 PM, Krishna <research...@gmail.com> wrote:

We are considering using Parquet for storing a matrix that is dense and
very, very wide (can have more than 600K columns).

I have the following questions:

     - Is there a limit on the # of columns in a Parquet file? We expect to
       query 10-100 columns at a time using Spark - what are the performance
       implications in this scenario?

There is no hard limit, but I think you'll probably run into some issues.
There will probably be code paths that are not optimized for schemas this
big, but I expect those to be easier to address. The default configurations
will probably not work well (the metadata-to-data ratio would be bad). You
can try configuring very large row groups and see how that goes.
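
A rough sketch of what that configuration could look like from Spark, assuming an existing SparkContext sc (the value is only illustrative, not a tested recommendation):

    // Untested sketch: larger row groups amortize the per-column metadata over
    // more data. "parquet.block.size" is the standard parquet-hadoop key for
    // the row group size (default 128 MB in parquet-mr).
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.setInt("parquet.block.size", 1024 * 1024 * 1024) // 1 GB row groups
    // df.write.parquet("/path/to/matrix")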


     - We want a schema-less solution since the matrix can get wider over a
       period of time.
     - Is there a way to generate such wide, structured, schema-less Parquet
       files using map-reduce (the input files are in a custom binary format)?

No, Parquet requires a schema. The schema is flexible, so you could map your
schema to a Parquet schema (each column could be binary, for example). Why
are you looking to use Parquet for this use case?
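
As a rough sketch of that mapping in Spark - the column names c0..cN, the BooleanType cells, and the Array[Boolean] row representation are all placeholders, not a tested recipe:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{BooleanType, StructField, StructType}

    // Untested sketch: generate the wide schema programmatically instead of
    // writing it by hand.
    val numCols = 600000
    val schema = StructType(
      (0 until numCols).map(i => StructField(s"c$i", BooleanType, nullable = false)))

    // rows: RDD[Array[Boolean]] decoded elsewhere from the custom binary format.
    // val df = sqlContext.createDataFrame(rows.map(Row.fromSeq(_)), schema)
    // df.write.parquet("/path/to/matrix")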


     - HBase can support millions of columns - anyone with prior experience
       comparing Parquet vs. HFile performance for wide structured tables?

     - Does Impala have support for evolving schemas?

Yes. Different systems have different rules about what is allowed, but the
case of appending a column to an existing schema should be well supported.
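
As one example on the Spark side, Parquet schema merging can reconcile files written before and after a column was appended (a rough sketch; the path is a placeholder, and merging carries its own cost at this column count):

    // Untested sketch: read Parquet files whose schemas differ only by appended
    // columns into a single DataFrame schema.
    val df = sqlContext.read.option("mergeSchema", "true").parquet("/path/to/matrix")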

Krishna

