Re: Best practice for writing to HFileOutputFormat(2) with multiple Column Families

Jianshi Huang Fri, 01 Aug 2014 09:38:26 -0700

I know HBase will set the TotalOrderPartitioner in MR, but in Spark, I need
to sort the rows myself.
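
A minimal sketch of the ordering in question, written in plain Python rather
than Spark so it runs on its own. It assumes each cell is a hypothetical
(rk, cf, cq, value) tuple of raw bytes and that comparing (rk, cf, cq) as
unsigned bytes is the ordering the HFile writer expects; timestamps and key
types are ignored here. The helper names and split keys are illustrative only
and are not part of any HBase or Spark API.

```python
from bisect import bisect_right


def sort_cells(cells):
    """Sort (rk, cf, cq, value) tuples by (rk, cf, cq).

    Python compares bytes objects lexicographically by unsigned byte value,
    so b"row10" sorts before b"row2".
    """
    return sorted(cells, key=lambda c: (c[0], c[1], c[2]))


def partition_index(row_key, split_keys):
    """Return the index of the key range that holds row_key.

    split_keys is a sorted list of range start keys (assumed here to play the
    role of region boundaries); a start key is inclusive for its own range.
    """
    return bisect_right(split_keys, row_key)


# Hypothetical sample cells spread over two column families.
cells = [
    (b"row10", b"cf2", b"a", b"v1"),
    (b"row2", b"cf1", b"b", b"v2"),
    (b"row10", b"cf1", b"z", b"v3"),
    (b"row10", b"cf1", b"a", b"v4"),
]

ordered = sort_cells(cells)
for rk, cf, cq, value in ordered:
    print(rk, cf, cq, value)
```

In a Spark job the same two pieces would roughly correspond to a custom
partitioner built from the split keys plus a sort within each partition, which
is the work TotalOrderPartitioner and the shuffle sort do in MR.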


Jianshi



On Sat, Aug 2, 2014 at 12:24 AM, Arun Allamsetty <arun.allamse...@gmail.com>
wrote:

> Hi Jianshi,
>
> Do you mean that you want to sort the row keys? If yes, then you don't have
> to worry about it because HBase sorts the row keys on its own but
> lexicographically.
>
> Cheers,
> Arun
>
> Sent from a mobile device. Please don't mind the typos.
> On Jul 30, 2014 9:02 PM, "Jianshi Huang" <jianshi.hu...@gmail.com> wrote:
>
> > I need to generate HFiles from a 2TB dataset and explode it into 4 Column
> > Families.
> >
> > The result dataset is likely to be 20TB or more. I'm currently using
> > Spark, so I sorted the (rk, cf, cq) tuples myself. It's huge and I'm
> > considering how to optimize it.
> >
> > My question is:
> > Should I sort and write each column family one by one, or should I put
> > them all together and then sort and write?
> >
> > Does my question make sense?
> >
> > --
> > Jianshi Huang
> >
> > LinkedIn: jianshi
> > Twitter: @jshuang
> > Github & Blog: http://huangjs.github.com/
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
