let's say I have data like this:

ID     | Some1     | Some2     | Some3     | ...
A00001 | kdsfajfsa | dsafsdafa | fdsfafa   |
A00002 | dfsfafasd | 23jfdsjkj | 980dfs    |
A00003 | 99989df   | jksdljas  | 48dsaas   |
..
Z00..  | fdsafdsfa | fdsdafdas | 89sdaff   |

My understanding is that if I give the column 'ID' to partition on, it's going to generate one file per entry, since every value is unique, no? Using JSON, I create 1000 files, split up as specified by the parallelize parameter, but JSON is large and a bit slow, so I'd like to try Parquet and see what happens. Rough sketches of what I mean are below.
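Roughly what I'm planning to try, as an untested sketch from the spark-shell (which provides sqlContext); paths are placeholders:

    // Untested sketch; all paths are placeholders.
    val df = sqlContext.read.json("hdfs:///data/input.json")

    // My reading of partitionBy: one output directory per distinct value
    // of the column, so a unique ID would mean one tiny file per row.
    df.write.partitionBy("ID").parquet("hdfs:///data/out_by_id")

    // What I actually want: a fixed number of evenly sized Parquet files,
    // like the 1000 JSON files I get from parallelize today.
    df.repartition(1000).write.parquet("hdfs:///data/out_1000")

If my reading of partitionBy is right, the first write would create one directory per row, which can't be what I want.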
On Wed, Sep 2, 2015 at 11:15 PM, Adrien Mogenet <
adrien.moge...@contentsquare.com> wrote:

> Any code / Parquet schema to provide? I'm not sure I understand which
> step fails right there...
>
> On 3 September 2015 at 04:12, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> Did you specify a partitioning column while saving the data?
>> On Sep 3, 2015 5:41 AM, "Kohki Nishio" <tarop...@gmail.com> wrote:
>>
>>> Hello experts,
>>>
>>> I have a huge JSON file (> 40G) and am trying to use Parquet as the
>>> file format. Each entry has a unique identifier, but other than that it
>>> doesn't have a 'well balanced value' column to partition on. Right now
>>> it just throws OOM and I couldn't figure out what to do about it.
>>>
>>> It would be ideal if I could provide a partitioner based on the unique
>>> identifier value, computing its hash value or something. One of the
>>> options would be to produce a hash value and add it as a separate
>>> column, but that doesn't sound right to me. Are there any other ways I
>>> can try?
>>>
>>> Regards,
>>> --
>>> Kohki Nishio
>>>
>>
>
>
> --
>
> *Adrien Mogenet*
> Head of Backend/Infrastructure
> adrien.moge...@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
>

--
Kohki Nishio
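P.S. The hash-column workaround I mentioned in my first mail would look something like this (again untested; the bucket count, column name, and path are just placeholders):

    import org.apache.spark.sql.functions.udf

    // Hypothetical helper: bucket each row by a hash of its unique ID.
    // (& Int.MaxValue keeps the result non-negative.)
    val bucketOf = udf((id: String) => (id.hashCode & Int.MaxValue) % 1000)

    // df as in the sketch earlier in this mail; this writes one directory
    // per bucket (~1000 of them) instead of one per distinct ID.
    df.withColumn("bucket", bucketOf(df("ID")))
      .write.partitionBy("bucket")
      .parquet("hdfs:///data/out_bucketed")

It should work, but it bakes the extra 'bucket' column into the output, which is why it doesn't sound right to me.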