Re: Parquet partitioning for unique identifier

2015-09-04 Thread Cheng Lian
What version of Spark were you using? Have you tried increasing --executor-memory? This schema looks pretty normal, and Parquet stores all keys of a map in a single column.

Cheng

On 9/4/15 4:00 PM, Kohki Nishio wrote:
> The stack trace is this java.lang.OutOfMemoryError: Java heap space

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Kohki Nishio
The stack trace is this:

java.lang.OutOfMemoryError: Java heap space
    at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
    at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
    at

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Cheng Lian
Could you please provide the full stack trace of the OOM exception? Another common cause of Parquet OOM is super wide tables, say hundreds or thousands of columns. In that case, the number of rows is mostly irrelevant.

Cheng

On 9/4/15 1:24 AM, Kohki Nishio wrote:
> let's say I have a data
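
To illustrate why column count rather than row count tends to drive this kind of failure, here is a rough back-of-the-envelope sketch. It assumes each open Parquet writer keeps a separate in-memory buffer per column; the function name, the 1 MiB per-column figure, and the writer counts are hypothetical and chosen only for illustration, not taken from Parquet or Spark defaults.

```python
MIB = 1024 * 1024


def estimated_writer_heap_bytes(num_columns, open_writers, bytes_per_column_buffer):
    """Rough heap estimate for buffered column writers.

    Assumption: every open writer holds one buffer per column.
    Note that the number of rows does not appear anywhere in the estimate.
    """
    return num_columns * open_writers * bytes_per_column_buffer


# Hypothetical figures: one open writer, 1 MiB of buffer per column.
narrow = estimated_writer_heap_bytes(num_columns=4, open_writers=1,
                                     bytes_per_column_buffer=1 * MIB)
wide = estimated_writer_heap_bytes(num_columns=2000, open_writers=1,
                                   bytes_per_column_buffer=1 * MIB)

print(narrow // MIB, "MiB for a 4-column table")
print(wide // MIB, "MiB for a 2000-column table")
```

Under the same assumption, the estimate also grows linearly with the number of files being written at once in a single JVM, which is why the open_writers parameter is kept separate.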

Re: Parquet partitioning for unique identifier

2015-09-03 Thread Kohki Nishio
Let's say I have data like this:

ID    | Some1     | Some2     | Some3
A1    | kdsfajfsa | dsafsdafa | fdsfafa
A2    | dfsfafasd | 23jfdsjkj | 980dfs
A3    | 99989df   | jksdljas  | 48dsaas
..
Z00.. | fdsafdsfa | fdsdafdas | 89sdaff

My understanding is that if I

Re: Parquet partitioning for unique identifier

2015-09-03 Thread Adrien Mogenet
Any code / Parquet schema to provide? I'm not sure I understand which step fails here...

On 3 September 2015 at 04:12, Raghavendra Pandey <raghavendra.pan...@gmail.com> wrote:
> Did you specify partitioning column while saving data..
> On Sep 3, 2015 5:41 AM, "Kohki Nishio"

Parquet partitioning for unique identifier

2015-09-02 Thread Kohki Nishio
Hello experts,

I have a huge JSON file (> 40G) and am trying to use Parquet as a file format. Each entry has a unique identifier, but other than that there is no column with well-balanced values to partition by. Right now it just throws OOM, and I couldn't figure out what to do about it. It would be
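
One possible way to get a balanced column out of a unique identifier, sketched here as an assumption rather than something proposed in this thread, is to hash each ID into a fixed number of buckets and partition on the bucket instead of on the ID itself. The names bucket_for and NUM_BUCKETS and the value 64 are hypothetical; the sample IDs follow the example table elsewhere in the thread.

```python
import hashlib

NUM_BUCKETS = 64  # hypothetical bucket count, chosen only for illustration


def bucket_for(identifier, num_buckets=NUM_BUCKETS):
    """Map a unique ID to a stable bucket number in [0, num_buckets)."""
    # MD5 is used only as a deterministic, well-spread hash, not for security.
    digest = hashlib.md5(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


ids = ["A1", "A2", "A3", "Z00"]
rows = [{"ID": i, "bucket": bucket_for(i)} for i in ids]

for row in rows:
    print(row)
```

Because the bucket depends only on the ID, the same record always lands in the same bucket, and the number of distinct partition values is capped at the bucket count instead of growing with the number of rows.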

Re: Parquet partitioning for unique identifier

2015-09-02 Thread Raghavendra Pandey
Did you specify a partitioning column while saving the data?

On Sep 3, 2015 5:41 AM, "Kohki Nishio" wrote:
> Hello experts,
>
> I have a huge json file (> 40G) and trying to use Parquet as a file
> format. Each entry has a unique identifier but other than that, it doesn't
> have