Best practice for writing to HFileOutputFormat(2) with multiple Column Families

Jianshi Huang Wed, 30 Jul 2014 20:02:24 -0700

I need to generate HFiles from a 2TB dataset, exploding it into 4 column
families.

The resulting dataset is likely to be 20TB or more. I'm currently using Spark,
so I sort the (rk, cf, cq) keys myself (row key, column family, column
qualifier). It's huge, and I'm looking at how to optimize it.
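
To make the ordering concrete, here is a minimal sketch in plain Python (not
Spark or HBase code). The Cell type, the sort_key helper, and the sample values
are hypothetical and only for illustration. It orders cells by row key, then
column family, then column qualifier, comparing each as raw bytes:

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical stand-in for one HBase cell (not an HBase class)."""
    row: bytes
    family: bytes
    qualifier: bytes
    value: bytes


def sort_key(cell: Cell) -> tuple:
    # Python compares bytes lexicographically by unsigned byte value,
    # so a tuple of bytes orders by row, then family, then qualifier.
    return (cell.row, cell.family, cell.qualifier)


# Made-up sample data, deliberately out of order.
cells = [
    Cell(b"row2", b"cf1", b"a", b"v1"),
    Cell(b"row1", b"cf2", b"b", b"v2"),
    Cell(b"row1", b"cf1", b"z", b"v3"),
    Cell(b"row1", b"cf1", b"a", b"v4"),
]

sorted_cells = sorted(cells, key=sort_key)

for cell in sorted_cells:
    print(cell.row, cell.family, cell.qualifier, cell.value)
```

In a real job the same key would drive a distributed sort rather than an
in-memory sorted() call; this sketch only pins down the comparison order.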

My question is:
Should I sort and write each column family one by one, or should I put them
all together and then sort and write?

Does my question make sense?

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
