The stack trace is this:

java.lang.OutOfMemoryError: Java heap space
        at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
        at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
        at parquet.column.values.rle.RunLengthBitPackingHybridEncoder.<init>(RunLengthBitPackingHybridEncoder.java:125)
        at parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.<init>(RunLengthBitPackingHybridValuesWriter.java:36)
        at parquet.column.ParquetProperties.getColumnDescriptorValuesWriter(ParquetProperties.java:61)
        at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:72)
        at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
        at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
        at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
        at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
        at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
        at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
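For what it's worth, the trace shows the OOM hits while initSlabs allocates the initial buffer for each column writer, i.e. during writer setup before a single row is written, so memory use scales with the number of Parquet columns times the buffer sizes rather than with row count. One thing I may try is shrinking those buffers through parquet-mr's Hadoop settings; a rough sketch (parquet.page.size / parquet.block.size are standard parquet-mr keys, but the values below are just guesses, untested):

    // Smaller pages and row groups mean a smaller up-front allocation per
    // column writer (defaults are 1 MB pages, 128 MB row groups).
    sc.hadoopConfiguration.setInt("parquet.page.size", 64 * 1024)
    sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)
    df.write.parquet("hdfs:///tmp/out")  // df is the DataFrame that fails to save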
It looks like this: https://issues.apache.org/jira/browse/PARQUET-222

Here's the schema I have. I don't think this schema is that different... maybe the use of Map is causing this? Is it trying to register all of the keys of a map as a column?

root
 |-- intId: integer (nullable = false)
 |-- uniqueId: string (nullable = true)
 |-- date1: string (nullable = true)
 |-- date2: string (nullable = true)
 |-- date3: string (nullable = true)
 |-- type: integer (nullable = false)
 |-- cat: string (nullable = true)
 |-- subCat: string (nullable = true)
 |-- unit: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- attr: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- price: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- imp1: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- imp2: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- imp3: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)

On Thu, Sep 3, 2015 at 11:27 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Could you please provide the full stack trace of the OOM exception?
> Another common cause of Parquet OOM is super wide tables, say hundreds
> or thousands of columns. In that case, the number of rows is mostly
> irrelevant.
>
> Cheng
>
> On 9/4/15 1:24 AM, Kohki Nishio wrote:
>
> Let's say I have data like this:
>
> ID     | Some1     | Some2     | Some3   | ...
> A00001 | kdsfajfsa | dsafsdafa | fdsfafa |
> A00002 | dfsfafasd | 23jfdsjkj | 980dfs  |
> A00003 | 99989df   | jksdljas  | 48dsaas |
> ..
> Z00..  | fdsafdsfa | fdsdafdas | 89sdaff |
>
> My understanding is that if I give the column 'ID' as the partition
> column, it's going to generate a file per entry since it's unique, no?
> Using JSON, I create 1000 files, split as specified by the parallelize
> parameter. But JSON is large and a bit slow, so I'd like to try Parquet
> to see what happens.
>
> On Wed, Sep 2, 2015 at 11:15 PM, Adrien Mogenet
> <adrien.moge...@contentsquare.com> wrote:
>
>> Any code / Parquet schema to provide? I'm not sure I understand which
>> step fails right there...
>>
>> On 3 September 2015 at 04:12, Raghavendra Pandey
>> <raghavendra.pan...@gmail.com> wrote:
>>
>>> Did you specify a partitioning column while saving the data?
>>> On Sep 3, 2015 5:41 AM, "Kohki Nishio" <tarop...@gmail.com> wrote:
>>>
>>>> Hello experts,
>>>>
>>>> I have a huge JSON file (> 40 GB) and am trying to use Parquet as the
>>>> file format. Each entry has a unique identifier, but other than that it
>>>> doesn't have a 'well balanced value' column to partition on. Right now
>>>> it just throws OOM and I couldn't figure out what to do about it.
>>>>
>>>> It would be ideal if I could provide a partitioner based on the unique
>>>> identifier value, computing its hash value or something. One option
>>>> would be to produce a hash value and add it as a separate column (see
>>>> the sketch at the end of this thread), but that doesn't sound right to
>>>> me. Are there any other ways I can try?
>>>>
>>>> Regards,
>>>> --
>>>> Kohki Nishio
>>>
>>
>> --
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.moge...@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
>
> --
> Kohki Nishio

--
Kohki Nishio
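PS: the "produce a hash value and add it as a separate column" idea from my first message, as a rough sketch. Untested; df stands for the loaded DataFrame, and the bucket count of 1000 and the column name "bucket" are arbitrary choices of mine:

    import org.apache.spark.sql.functions.udf

    // Map the unique ID onto a fixed number of buckets so that partitionBy
    // creates a bounded set of partitions instead of one per distinct ID.
    val numBuckets = 1000
    val bucketOf = udf((id: String) =>
      ((id.hashCode % numBuckets) + numBuckets) % numBuckets)  // non-negative bucket

    df.withColumn("bucket", bucketOf(df("uniqueId")))
      .write
      .partitionBy("bucket")
      .parquet("hdfs:///tmp/out")

Note that each write task still keeps one open Parquet writer per bucket value it sees, so with many buckets per task the per-column buffers from the stack trace above can still add up.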