Thanks for sharing.

On 2013-12-09 11:50 AM, "Patrick Wendell" <pwend...@gmail.com> wrote:
> Parquet might be a good fit for you then... it's pretty new and I
> don't have a lot of direct experience working with it. But I've seen
> examples of people using Spark with Parquet. You might want to
> check out Matt Massie's post here:
>
> http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
>
> This gives an example of using the Parquet format with Spark.
>
> - Patrick
>
> On Sun, Dec 8, 2013 at 7:09 PM, Ankur Chauhan <achau...@brightcove.com> wrote:
> > Hi Patrick,
> >
> > I agree this is a very open-ended question, but I was trying to get a
> > general answer anyway, and I think you did hint at some nuances.
> >
> > 1. My workload is definitely bottlenecked by disk IO, because even
> > with a projection onto a single column (mostly 2-3 out of 20) there is
> > a lot of data to churn through.
> > 2. The fields are mostly headers and some known parameter fields from
> > an HTTP GET request, so analysis on, say, account id and user agent or
> > ip address is fairly selective.
> > 3. Flattening the fields and using CSV definitely looks like something
> > I can try out.
> >
> > I believe Parquet files can be created with a sorted column (for example
> > timestamp) that would make selecting the right segment of data easier
> > too (although I don't have any experience with Parquet files).
> > What is the recommended way of interacting (read/write) with Parquet
> > files?
> >
> > -- Ankur
> >
> > On 8 Dec 2013, at 17:38, Patrick Wendell <pwend...@gmail.com> wrote:
> >
> >> This is a very open-ended question, so it's hard to give a specific
> >> answer... it depends a lot on whether disk IO is a bottleneck in your
> >> workload and whether you tend to analyze all of each record or only
> >> certain fields. If you are doing a lot of disk IO and only touching a few
> >> fields, something like Parquet might help, or (simpler) just creating
> >> smaller projections of your data with only the fields you care about.
> >> Tab-delimited formats can have less serialization overhead than JSON,
> >> so flattening the data might also help. It really depends on your
> >> access patterns and data types.
> >>
> >> In many cases with Spark another important question is how the user
> >> stores the data in memory, not the on-disk format. It does depend on
> >> how they are using Spark, though.
> >>
> >> - Patrick
> >>
> >> On Sun, Dec 8, 2013 at 3:03 PM, Andrew Ash <and...@andrewash.com> wrote:
> >>> LZO compression at a minimum, and using Parquet as a second step,
> >>> seems like the way to go, though I haven't tried either personally yet.
> >>>
> >>> Sent from my mobile phone
> >>>
> >>> On Dec 8, 2013, at 16:54, Ankur Chauhan <achau...@brightcove.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Sorry for posting this again, but I am interested in finding out what
> >>>> different on-disk data formats there are for storing timeline event
> >>>> and analytics aggregate data.
> >>>>
> >>>> Currently I am just using newline-delimited JSON in gzipped files. I
> >>>> was wondering if there were any recommendations.
> >>>>
> >>>> -- Ankur
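
As a rough sketch of what reading and writing Parquet from Spark could look like: the snippet below uses the much newer SparkSession / DataFrame API (spark.read.parquet, df.write.parquet) rather than the Hadoop InputFormat + Avro route that Matt Massie's post above walks through, and all paths and column names (timestamp, account_id, user_agent) are placeholders, not anything confirmed in this thread.

    import org.apache.spark.sql.SparkSession

    // Sketch only: assumes a Spark 2.x+ build with the SQL/DataFrame API.
    // Paths and column names are placeholders.
    object ParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-sketch").getOrCreate()

        // Convert the existing newline-delimited, gzipped JSON into Parquet.
        val events = spark.read.json("hdfs:///data/events/*.json.gz")
        events
          .sortWithinPartitions("timestamp")   // placeholder column; helps row-group skipping
          .write
          .parquet("hdfs:///data/events_parquet")

        // Reads only scan the columns actually referenced, which is where a
        // columnar format pays off for selective queries.
        val parquetEvents = spark.read.parquet("hdfs:///data/events_parquet")
        parquetEvents
          .select("account_id", "user_agent")   // placeholder field names
          .where("account_id = '12345'")
          .show()

        spark.stop()
      }
    }

The sortWithinPartitions call is just one way to get the "sorted on timestamp" layout mentioned above: Parquet keeps min/max statistics per row group, so writing the data pre-sorted lets readers skip whole segments of a file when filtering on that column.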