Thanks for sharing.

On 2013-12-09 11:50 AM, "Patrick Wendell" <pwend...@gmail.com> wrote:
> Parquet might be a good fit for you then... it's pretty new and I
> don't have a lot of direct experience working with it. But I've seen
> examples of people using Spark with Parquet. You might want to
> check out Matt Massie's post here:
>
> http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
>
> This gives an example of using the Parquet format with Spark.
>
> - Patrick
>
> On Sun, Dec 8, 2013 at 7:09 PM, Ankur Chauhan <achau...@brightcove.com> wrote:
> > Hi Patrick,
> >
> > I agree this is a very open-ended question, but I was trying to get a
> > general answer anyway, and I think you did hint at some nuances.
> >
> > 1. My workload is definitely bottlenecked by disk IO, because even
> > with a projection onto a single column (mostly 2-3 out of 20) there is
> > a lot of data to churn through.
> > 2. The fields are mostly headers and some known parameter fields from
> > an HTTP GET request, so analysis on, say, account id and user agent or
> > ip address is fairly selective.
> > 3. Flattening the fields and using CSV definitely looks like something
> > I can try out.
> >
> > I believe Parquet files can be created with a sorted column (for example
> > timestamp) that would make selecting the right segment of data easier
> > too (although I don't have any experience with Parquet files).
> > What is the recommended way of interacting (read/write) with Parquet
> > files?
> >
> > -- Ankur
> >
> > On 8 Dec 2013, at 17:38, Patrick Wendell <pwend...@gmail.com> wrote:
> >
> >> This is a very open-ended question, so it's hard to give a specific
> >> answer... it depends a lot on whether disk IO is a bottleneck in your
> >> workload and whether you tend to analyze all of each record or only
> >> certain fields. If you are doing a lot of disk IO and only touching a few
> >> fields, something like Parquet might help, or (simpler) just creating
> >> smaller projections of your data with only the fields you care about.
> >> Tab-delimited formats can have less serialization overhead than JSON,
> >> so flattening the data might also help. It really depends on your
> >> access patterns and data types.
> >>
> >> In many cases with Spark another important question is how the user
> >> stores the data in memory, not the on-disk format. It does depend on
> >> how they are using Spark, though.
> >>
> >> - Patrick
> >>
> >> On Sun, Dec 8, 2013 at 3:03 PM, Andrew Ash <and...@andrewash.com> wrote:
> >>> LZO compression at a minimum, and using Parquet as a second step,
> >>> seems like the way to go, though I haven't tried either personally yet.
> >>>
> >>> Sent from my mobile phone
> >>>
> >>> On Dec 8, 2013, at 16:54, Ankur Chauhan <achau...@brightcove.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Sorry for posting this again, but I am interested in finding out what
> >>>> different on-disk data formats there are for storing timeline event
> >>>> and analytics aggregate data.
> >>>>
> >>>> Currently I am just using newline-delimited JSON in gzipped files. I
> >>>> was wondering if there were any recommendations.
> >>>>
> >>>> -- Ankur
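
As a rough sketch of what reading and writing Parquet from Spark could look like: the snippet below uses the much newer SparkSession / DataFrame API (spark.read.parquet, df.write.parquet) rather than the Hadoop InputFormat + Avro route that Matt Massie's post above walks through, and all paths and column names (timestamp, account_id, user_agent) are placeholders, not anything confirmed in this thread.

    import org.apache.spark.sql.SparkSession

    // Sketch only: assumes a Spark 2.x+ build with the SQL/DataFrame API.
    // Paths and column names are placeholders.
    object ParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-sketch").getOrCreate()

        // Convert the existing newline-delimited, gzipped JSON into Parquet.
        val events = spark.read.json("hdfs:///data/events/*.json.gz")
        events
          .sortWithinPartitions("timestamp")   // placeholder column; helps row-group skipping
          .write
          .parquet("hdfs:///data/events_parquet")

        // Reads only scan the columns actually referenced, which is where a
        // columnar format pays off for selective queries.
        val parquetEvents = spark.read.parquet("hdfs:///data/events_parquet")
        parquetEvents
          .select("account_id", "user_agent")   // placeholder field names
          .where("account_id = '12345'")
          .show()

        spark.stop()
      }
    }

The sortWithinPartitions call is just one way to get the "sorted on timestamp" layout mentioned above: Parquet keeps min/max statistics per row group, so writing the data pre-sorted lets readers skip whole segments of a file when filtering on that column.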