Hey Chris,

I was thinking along the same lines. I'm planning to run through
everything and get the status pages and other prep work in place this
weekend. I don't think we're that far off, and starting to ramp up for
this will be great.

-Jake

On Wed, Apr 8, 2015 at 5:17 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> So, along these lines, I was thinking about the community. You guys
> have released in the Incubator, you have added new PPMC/community
> members, and you are really behaving like an ASF top-level project.
> What do you see as the barriers to graduation? From my view,
> there aren’t any really. You would just need to pick a chair,
> devise a proposed PMC (suggestion: current PPMC+mentors, invite them
> all) and then discuss it and VOTE on it.
>
> Thoughts?
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Ryan Blue <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, April 8, 2015 at 1:54 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Doing the Spark shuffle on Parquet floors
>
> >"Parquet has been a core component of the system and we see compression
> >of ~20% compared to specialized genome file formats e.g. compressed BAM.
> >In short, we’re really happy with Parquet."
> >
> >Matt, can we tweet this? That's great!
> >
> >rb
> >
> >On 04/07/2015 11:49 AM, Matt Massie wrote:
> >> We are using Apache Parquet and Spark for a genome analysis platform,
> >> called ADAM <http://bdgenomics.org>, that allows researchers to quickly
> >> analyze large datasets of DNA, RNA, etc. Parquet has been a core
> >> component of the system, and we see compression of ~20% compared to
> >> specialized genome file formats, e.g. compressed BAM. In short, we’re
> >> really happy with Parquet.
> >>
> >> We are using Avro Specific classes for almost all the entities in our
> >> system, so Avro generates Java classes from our schema
> >> <https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl>.
> >> Since the AvroIndexedRecordConverter has dictionary support, our
> >> initial load from disk to memory is compact.
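> >>
> >> As a quick illustration (AlignmentRecord standing in for any of our
> >> generated classes):
> >>
> >>   import org.apache.avro.Schema
> >>   import org.bdgenomics.formats.avro.AlignmentRecord
> >>
> >>   // The generated class exposes its schema both statically (SCHEMA$)
> >>   // and per-instance (getSchema), so no external schema registry is
> >>   // needed at (de)serialization time.
> >>   val byStatic: Schema   = AlignmentRecord.SCHEMA$
> >>   val byInstance: Schema = new AlignmentRecord().getSchema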
> >>
> >> That’s the good news: a compact on-disk and initial in-memory
> >> representation.
> >>
> >> Here’s the problem: the Spark shuffle.
> >>
> >> In order to integrate Parquet with Spark, we use a KryoRegistrator to
> >> register Kryo serializers
> >> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala>
> >> for each of our Avro objects (see the Kryo Serializer interface
> >> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>).
> >> We are serializing each object into record-oriented Avro, which makes
> >> our intermediate shuffle files much larger than the corresponding
> >> column-oriented Parquet inputs. These large shuffle files are hurting
> >> our performance and limiting our scaling for some analyses.
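> >>
> >> Concretely, what we register today is roughly the following (a
> >> simplified sketch, not our exact code):
> >>
> >>   import java.io.ByteArrayOutputStream
> >>   import com.esotericsoftware.kryo.{Kryo, Serializer}
> >>   import com.esotericsoftware.kryo.io.{Input, Output}
> >>   import org.apache.avro.io.{DecoderFactory, EncoderFactory}
> >>   import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecord}
> >>
> >>   // Record-oriented Avro over Kryo: every record is encoded on its
> >>   // own, so there is no cross-record (columnar) compression at all.
> >>   class AvroSerializer[T <: SpecificRecord](klass: Class[T]) extends Serializer[T] {
> >>     private val schema = klass.newInstance().getSchema
> >>     private val writer = new SpecificDatumWriter[T](schema)
> >>     private val reader = new SpecificDatumReader[T](schema)
> >>
> >>     override def write(kryo: Kryo, out: Output, record: T): Unit = {
> >>       val bytes = new ByteArrayOutputStream()
> >>       val encoder = EncoderFactory.get().binaryEncoder(bytes, null)
> >>       writer.write(record, encoder)
> >>       encoder.flush()
> >>       // Length-prefix the payload so read() knows how much to consume.
> >>       out.writeInt(bytes.size(), true)
> >>       out.writeBytes(bytes.toByteArray)
> >>     }
> >>
> >>     override def read(kryo: Kryo, in: Input, k: Class[T]): T = {
> >>       val payload = in.readBytes(in.readInt(true))
> >>       val decoder = DecoderFactory.get().binaryDecoder(payload, null)
> >>       reader.read(null.asInstanceOf[T], decoder)
> >>     }
> >>   }
> >>
> >> Each class then gets wired up in the registrator, e.g.
> >> kryo.register(classOf[AlignmentRecord], new AvroSerializer(classOf[AlignmentRecord])).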
> >>
> >> Since the shuffle data is short-lived, there’s no need to store
> >> metadata, and we have immediate access to the schema through each Avro
> >> object. Each Avro specific class has a SCHEMA$ field which contains the
> >> Avro Schema for the object. There are utility functions in parquet-avro
> >> which can convert this Avro schema into a Parquet schema. We also don’t
> >> need index pages, only the dictionary and data pages. We don’t need
> >> predicate or projection functionality. Does anyone on this list see a
> >> way to create a Parquet Kryo serializer
> >> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>
> >> to read/write Avro Specific objects to/from a stream? Emitting
> >> column-oriented data will understandably incur memory and CPU costs on
> >> the map side, but it will be worth it to improve our shuffle
> >> performance.
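> >>
> >> To make the ask concrete, here is the rough shape I have in mind.
> >> Everything below is a sketch: RecordBatch is a hypothetical wrapper
> >> (Kryo hands us one object at a time, so we would shuffle small batches
> >> of records to have columns worth encoding), and staging through a
> >> local temp file is only a proof of concept, since the stock parquet-mr
> >> writer targets Hadoop Paths rather than in-memory buffers (it also
> >> still writes the footer metadata we don't need). Package names follow
> >> the org.apache.parquet layout; older releases use parquet.avro.
> >>
> >>   import java.io.File
> >>   import java.nio.file.Files
> >>   import scala.reflect.ClassTag
> >>   import com.esotericsoftware.kryo.{Kryo, Serializer}
> >>   import com.esotericsoftware.kryo.io.{Input, Output}
> >>   import org.apache.avro.specific.SpecificRecord
> >>   import org.apache.hadoop.fs.Path
> >>   import org.apache.parquet.avro.{AvroParquetReader, AvroParquetWriter}
> >>
> >>   // Hypothetical batch wrapper: groups of records get shuffled so
> >>   // there is something to encode column-wise on the map side.
> >>   case class RecordBatch[T <: SpecificRecord](records: Array[T])
> >>
> >>   class ParquetBatchSerializer[T <: SpecificRecord](klass: Class[T])
> >>       (implicit ct: ClassTag[T]) extends Serializer[RecordBatch[T]] {
> >>     // AvroParquetWriter derives the Parquet schema from this Avro
> >>     // schema internally, via parquet-avro's AvroSchemaConverter.
> >>     private val schema = klass.newInstance().getSchema
> >>
> >>     override def write(kryo: Kryo, out: Output, batch: RecordBatch[T]): Unit = {
> >>       val tmp = File.createTempFile("shuffle-batch", ".parquet")
> >>       tmp.delete() // the writer refuses to overwrite an existing file
> >>       val writer = new AvroParquetWriter[T](new Path(tmp.getAbsolutePath), schema)
> >>       batch.records.foreach(writer.write)
> >>       writer.close()
> >>       val bytes = Files.readAllBytes(tmp.toPath)
> >>       tmp.delete()
> >>       out.writeInt(bytes.length, true)
> >>       out.writeBytes(bytes)
> >>     }
> >>
> >>     override def read(kryo: Kryo, in: Input, k: Class[RecordBatch[T]]): RecordBatch[T] = {
> >>       val tmp = File.createTempFile("shuffle-batch", ".parquet")
> >>       Files.write(tmp.toPath, in.readBytes(in.readInt(true)))
> >>       val reader = new AvroParquetReader[T](new Path(tmp.getAbsolutePath))
> >>       val records = Iterator.continually(reader.read()).takeWhile(_ != null).toArray
> >>       reader.close()
> >>       tmp.delete()
> >>       RecordBatch(records)
> >>     }
> >>   }
> >>
> >> A real version would replace the temp file with an in-memory page
> >> store holding just dictionary and data pages, which is exactly the
> >> part I'm asking about.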
> >>
> >> This shuffle issue is slowing important research, so any advice you
> >> have to offer will be appreciated. Thank you.
> >>
> >> —
> >> Matt Massie
> >> UC Berkeley, AMPLab
> >>
> >
> >
> >--
> >Ryan Blue
> >Software Engineer
> >Cloudera, Inc.
>
>
