Re: Parquet for very wide table

Krishna Mon, 25 Jan 2016 11:51:12 -0800

Thanks Cheng, Nong.

Data in the matrix is homogeneous (cells are booleans), so I don't expect
to face memory-related issues. Is the limitation on the # of columns, or on
memory issues caused by the # of columns? To me it sounds more like memory
issues.


On Mon, Jan 25, 2016 at 10:16 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Aside from Nong's comment, I think PARQUET-222, where we discussed a
> performance issue of writing wide tables, can be helpful.
>
> Cheng
>
>
> On 1/23/16 4:53 PM, Nong Li wrote:
>
>> I expect this to be difficult. This is roughly 3 orders of magnitude more
>> than even a typical wide table use case.
>>
>> Answers inline.
>>
>> On Thu, Jan 21, 2016 at 2:10 PM, Krishna <research...@gmail.com> wrote:
>>
>>> We are considering using Parquet for storing a matrix that is dense and
>>> very, very wide (can have more than 600K columns).
>>>
>>> I have the following questions:
>>>
>>>     - Is there a limit on the # of columns in a Parquet file? We expect to
>>>     query [10-100] columns at a time using Spark - what are the performance
>>>     implications in this scenario?
>>>
>> There is no hard limit but I think you'll probably run into some issues.
>> There will probably be code paths that are not optimized for schemas this
>> big but I expect those to be easier to address. The default configurations
>> will probably not work well (the metadata to data ratio would be bad). You
>> can try configuring very large row groups and see how that goes.
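
A rough way to see the metadata to data ratio mentioned above is a back-of-the-envelope calculation. The sketch below is illustrative only: the 128 MiB row group target and the 100 bytes of footer metadata per column chunk are assumptions, not measured values, and it treats each boolean cell as one uncompressed bit while ignoring page headers, encoding, and compression.

```python
# Back-of-the-envelope estimate; every constant below is an assumption.
NUM_COLUMNS = 600_000                  # matrix width quoted in the thread
ROW_GROUP_BYTES = 128 * 1024 * 1024    # assumed row group size target (128 MiB)
META_BYTES_PER_CHUNK = 100             # hypothetical footer cost per column chunk


def row_group_estimate(num_columns, row_group_bytes, meta_bytes_per_chunk):
    """Return (rows, data bytes per column chunk, metadata/data ratio) for one row group."""
    bytes_per_row = num_columns / 8            # one bit per boolean cell
    rows = int(row_group_bytes // bytes_per_row)
    data_bytes_per_chunk = rows / 8            # one column's share of the row group
    ratio = meta_bytes_per_chunk / data_bytes_per_chunk
    return rows, data_bytes_per_chunk, ratio


if __name__ == "__main__":
    for size in (ROW_GROUP_BYTES, 8 * ROW_GROUP_BYTES):
        rows, data_bytes, ratio = row_group_estimate(
            NUM_COLUMNS, size, META_BYTES_PER_CHUNK
        )
        print(f"row group {size} B: {rows} rows, "
              f"{data_bytes:.1f} data B/chunk, metadata/data = {ratio:.3f}")
```

Under these assumptions a 128 MiB row group holds only 1789 rows, so each column chunk carries a couple of hundred bytes of data against a comparable amount of metadata. Growing the row group shrinks the ratio proportionally, which is the motivation for trying very large row groups.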
>>
>>
>>>     - We want a schema-less solution since the matrix can get wider over a
>>>     period of time
>>>     - Is there a way to generate such wide structured schema-less Parquet
>>>     files using map-reduce (input files are in custom binary format)?
>>>
>> No, Parquet requires a schema. The schema is flexible so you could map
>> your schema to a Parquet schema (each column could be binary, for example).
>> Why are you looking to use Parquet for this use case?
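
Since a schema is required, one option for a matrix that keeps widening is to generate the schema from the current column count. The sketch below builds a schema string in the textual message-type notation used by parquet-mr. The function name, the `matrix` message name, and the `c<i>` column naming are hypothetical, and it uses `boolean` fields because the cells are booleans (the reply above suggests `binary` as another possibility).

```python
def wide_boolean_schema(num_columns, message_name="matrix", prefix="c"):
    """Build a message-type schema string with one boolean field per matrix column.

    The message name and the column prefix are illustrative choices only.
    """
    fields = [f"  required boolean {prefix}{i};" for i in range(num_columns)]
    return "message " + message_name + " {\n" + "\n".join(fields) + "\n}"


if __name__ == "__main__":
    # A tiny 3-column schema; the real matrix would pass its current width.
    print(wide_boolean_schema(3))
```

Regenerating the schema with a larger `num_columns` keeps every existing field and appends new ones at the end, which matches the append-a-column style of schema evolution.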
>>
>>
>>>     - HBase can support millions of columns - anyone with prior experience
>>>     that compares Parquet vs HFile performance for wide structured tables?
>>>     - Does Impala have support for evolving schema?
>>>
>> Yes. Different systems have different rules on what is allowed but the case
>> of appending a column to an existing schema should be well supported.
>>
>>> Krishna
>>>
>
