The stack trace is this:

java.lang.OutOfMemoryError: Java heap space
        at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
        at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
        at parquet.column.values.rle.RunLengthBitPackingHybridEncoder.<init>(RunLengthBitPackingHybridEncoder.java:125)
        at parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.<init>(RunLengthBitPackingHybridValuesWriter.java:36)
        at parquet.column.ParquetProperties.getColumnDescriptorValuesWriter(ParquetProperties.java:61)
        at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:72)
        at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
        at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
        at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
        at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
        at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
        at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
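For what it's worth, the trace shows the OOM hits while initSlabs allocates the initial buffer for each column writer, i.e. during writer setup before a single row is written, so memory use scales with the number of Parquet columns times the buffer sizes rather than with row count. One thing I may try is shrinking those buffers through parquet-mr's Hadoop settings; a rough sketch (parquet.page.size / parquet.block.size are standard parquet-mr keys, but the values below are just guesses, untested):

    // Smaller pages and row groups mean a smaller up-front allocation per
    // column writer (defaults are 1 MB pages, 128 MB row groups).
    sc.hadoopConfiguration.setInt("parquet.page.size", 64 * 1024)
    sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)
    df.write.parquet("hdfs:///tmp/out")  // df is the DataFrame that fails to save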
It looks like this: https://issues.apache.org/jira/browse/PARQUET-222

Here's the schema I have. I don't think this schema is that different... maybe the use of Map is causing this? Is it trying to register all of the keys of a map as a column?

root
 |-- intId: integer (nullable = false)
 |-- uniqueId: string (nullable = true)
 |-- date1: string (nullable = true)
 |-- date2: string (nullable = true)
 |-- date3: string (nullable = true)
 |-- type: integer (nullable = false)
 |-- cat: string (nullable = true)
 |-- subCat: string (nullable = true)
 |-- unit: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- attr: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- price: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- imp1: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- imp2: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- imp3: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)

On Thu, Sep 3, 2015 at 11:27 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Could you please provide the full stack trace of the OOM exception?
> Another common cause of Parquet OOM is super wide tables, say hundreds
> or thousands of columns. In that case, the number of rows is mostly
> irrelevant.
>
> Cheng
>
> On 9/4/15 1:24 AM, Kohki Nishio wrote:
>
> Let's say I have data like this:
>
> ID     | Some1     | Some2     | Some3   | ...
> A00001 | kdsfajfsa | dsafsdafa | fdsfafa |
> A00002 | dfsfafasd | 23jfdsjkj | 980dfs  |
> A00003 | 99989df   | jksdljas  | 48dsaas |
> ..
> Z00..  | fdsafdsfa | fdsdafdas | 89sdaff |
>
> My understanding is that if I give the column 'ID' as the partition
> column, it's going to generate a file per entry since it's unique, no?
> Using JSON, I create 1000 files, split as specified by the parallelize
> parameter. But JSON is large and a bit slow, so I'd like to try Parquet
> to see what happens.
>
> On Wed, Sep 2, 2015 at 11:15 PM, Adrien Mogenet
> <adrien.moge...@contentsquare.com> wrote:
>
>> Any code / Parquet schema to provide? I'm not sure I understand which
>> step fails right there...
>>
>> On 3 September 2015 at 04:12, Raghavendra Pandey
>> <raghavendra.pan...@gmail.com> wrote:
>>
>>> Did you specify a partitioning column while saving the data?
>>> On Sep 3, 2015 5:41 AM, "Kohki Nishio" <tarop...@gmail.com> wrote:
>>>
>>>> Hello experts,
>>>>
>>>> I have a huge JSON file (> 40 GB) and am trying to use Parquet as the
>>>> file format. Each entry has a unique identifier, but other than that it
>>>> doesn't have a 'well balanced value' column to partition on. Right now
>>>> it just throws OOM and I couldn't figure out what to do about it.
>>>>
>>>> It would be ideal if I could provide a partitioner based on the unique
>>>> identifier value, computing its hash value or something. One option
>>>> would be to produce a hash value and add it as a separate column (see
>>>> the sketch at the end of this thread), but that doesn't sound right to
>>>> me. Are there any other ways I can try?
>>>>
>>>> Regards,
>>>> --
>>>> Kohki Nishio
>>>
>>
>> --
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.moge...@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
>
> --
> Kohki Nishio

--
Kohki Nishio
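PS: the "produce a hash value and add it as a separate column" idea from my first message, as a rough sketch. Untested; df stands for the loaded DataFrame, and the bucket count of 1000 and the column name "bucket" are arbitrary choices of mine:

    import org.apache.spark.sql.functions.udf

    // Map the unique ID onto a fixed number of buckets so that partitionBy
    // creates a bounded set of partitions instead of one per distinct ID.
    val numBuckets = 1000
    val bucketOf = udf((id: String) =>
      ((id.hashCode % numBuckets) + numBuckets) % numBuckets)  // non-negative bucket

    df.withColumn("bucket", bucketOf(df("uniqueId")))
      .write
      .partitionBy("bucket")
      .parquet("hdfs:///tmp/out")

Note that each write task still keeps one open Parquet writer per bucket value it sees, so with many buckets per task the per-column buffers from the stack trace above can still add up.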