let's say I have data like this:

ID     | Some1     | Some2     | Some3     | ...
A00001 | kdsfajfsa | dsafsdafa | fdsfafa   |
A00002 | dfsfafasd | 23jfdsjkj | 980dfs    |
A00003 | 99989df   | jksdljas  | 48dsaas   |
..
Z00..  | fdsafdsfa | fdsdafdas | 89sdaff   |

My understanding is that if I give the column 'ID' to partition on, it's going to generate one file per entry, since every value is unique, no? Using JSON, I create 1000 files, split up as specified by the parallelize parameter, but JSON is large and a bit slow, so I'd like to try Parquet and see what happens. Rough sketches of what I mean are below.
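Roughly what I'm planning to try, as an untested sketch from the spark-shell (which provides sqlContext); paths are placeholders:

    // Untested sketch; all paths are placeholders.
    val df = sqlContext.read.json("hdfs:///data/input.json")

    // My reading of partitionBy: one output directory per distinct value
    // of the column, so a unique ID would mean one tiny file per row.
    df.write.partitionBy("ID").parquet("hdfs:///data/out_by_id")

    // What I actually want: a fixed number of evenly sized Parquet files,
    // like the 1000 JSON files I get from parallelize today.
    df.repartition(1000).write.parquet("hdfs:///data/out_1000")

If my reading of partitionBy is right, the first write would create one directory per row, which can't be what I want.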
On Wed, Sep 2, 2015 at 11:15 PM, Adrien Mogenet <
adrien.moge...@contentsquare.com> wrote:

> Any code / Parquet schema to provide? I'm not sure I understand which
> step fails right there...
>
> On 3 September 2015 at 04:12, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> Did you specify a partitioning column while saving the data?
>> On Sep 3, 2015 5:41 AM, "Kohki Nishio" <tarop...@gmail.com> wrote:
>>
>>> Hello experts,
>>>
>>> I have a huge JSON file (> 40G) and am trying to use Parquet as the
>>> file format. Each entry has a unique identifier, but other than that it
>>> doesn't have a 'well balanced value' column to partition on. Right now
>>> it just throws OOM and I couldn't figure out what to do about it.
>>>
>>> It would be ideal if I could provide a partitioner based on the unique
>>> identifier value, computing its hash value or something. One of the
>>> options would be to produce a hash value and add it as a separate
>>> column, but that doesn't sound right to me. Are there any other ways I
>>> can try?
>>>
>>> Regards,
>>> --
>>> Kohki Nishio
>>>
>>
>
>
> --
>
> *Adrien Mogenet*
> Head of Backend/Infrastructure
> adrien.moge...@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
>

--
Kohki Nishio
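P.S. The hash-column workaround I mentioned in my first mail would look something like this (again untested; the bucket count, column name, and path are just placeholders):

    import org.apache.spark.sql.functions.udf

    // Hypothetical helper: bucket each row by a hash of its unique ID.
    // (& Int.MaxValue keeps the result non-negative.)
    val bucketOf = udf((id: String) => (id.hashCode & Int.MaxValue) % 1000)

    // df as in the sketch earlier in this mail; this writes one directory
    // per bucket (~1000 of them) instead of one per distinct ID.
    df.withColumn("bucket", bucketOf(df("ID")))
      .write.partitionBy("bucket")
      .parquet("hdfs:///data/out_bucketed")

It should work, but it bakes the extra 'bucket' column into the output, which is why it doesn't sound right to me.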