I need to generate output from a 2TB dataset, exploding it into 4 column families.
The resulting dataset is likely to be 20TB or more. I'm currently using Spark, so I sort by (rk, cf, cq) myself. It's huge, and I'm looking for ways to optimize it. My question is: should I sort and write each column family one by one, or should I put them all together and then sort and write? Does my question make sense?

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/