Bump: on disk storage formats

2013-12-08 Thread Ankur Chauhan
Hi all, Sorry for posting this again, but I am interested in finding out what different on-disk data formats are available for storing timeline event and analytics aggregate data. Currently I am just using newline-delimited JSON files compressed with gzip. I was wondering if there were any recommendations. -- Ankur
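For readers unfamiliar with the layout described above, here is a minimal sketch of newline-delimited, gzipped JSON using only the Python standard library. The field names (`ts`, `event`, `video_id`), the file name, and the helper functions are hypothetical and are not taken from the original message.

```python
import gzip
import json
import os
import tempfile


def write_events(path, events):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def read_column(path, field):
    """Yield a single field from every record in the file.

    Every line is still decompressed and parsed in full, even though
    only one field is kept.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line).get(field)


# Hypothetical sample records, for illustration only.
events = [
    {"ts": 1386547200, "event": "play", "video_id": "a1"},
    {"ts": 1386547260, "event": "pause", "video_id": "a1"},
]

path = os.path.join(tempfile.mkdtemp(), "events.json.gz")
write_events(path, events)
print(list(read_column(path, "event")))
```

Note that pulling out one field means reading and parsing whole records, which is the cost a columnar format such as Parquet (suggested in the replies) is meant to avoid.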

Re: Bump: on disk storage formats

2013-12-08 Thread Andrew Ash
LZO compression at a minimum, and using Parquet as a second step, seems like the way to go, though I haven't tried either personally yet.

Re: Bump: on disk storage formats

2013-12-08 Thread Ankur Chauhan
Hi Patrick, I agree this is a very open-ended question, and I was trying to get a general answer anyway, but I think you did hint at some nuances. 1. My workload is definitely bottlenecked by disk I/O, because even with a projection on a single column (mostly 2-3 out of 20) there is a lot of

Re: Bump: on disk storage formats

2013-12-08 Thread Patrick Wendell
Parquet might be a good fit for you then... it's pretty new and I don't have a lot of direct experience working with it. But I've seen examples of people using Spark with Parquet. You might want to check out Matt Massie's post here: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ This

Re: Bump: on disk storage formats

2013-12-08 Thread Azuryy Yu
Thanks for sharing.