or even better, we retweet you :)

On Wed, Apr 8, 2015 at 2:00 PM, Matt Massie <[email protected]> wrote:
> Sure, you can tweet that.
>
> --
> Matt Massie <http://www.linkedin.com/in/mattmassie/>
> <http://www.twitter.com/matt_massie>
> UC Berkeley AMPLab <https://twitter.com/amplab>
>
> On Wed, Apr 8, 2015 at 1:54 PM, Ryan Blue <[email protected]> wrote:
>
> > "Parquet has been a core component of the system and we see compression
> > of ~20% compared to specialized genome file formats, e.g. compressed
> > BAM. In short, we're really happy with Parquet."
> >
> > Matt, can we tweet this? That's great!
> >
> > rb
> >
> > On 04/07/2015 11:49 AM, Matt Massie wrote:
> >
> >> We are using Apache Parquet and Spark for a genome analysis platform
> >> called ADAM <http://bdgenomics.org> that allows researchers to quickly
> >> analyze large datasets of DNA, RNA, etc. Parquet has been a core
> >> component of the system, and we see compression of ~20% compared to
> >> specialized genome file formats, e.g. compressed BAM. In short, we're
> >> really happy with Parquet.
> >>
> >> We are using Avro Specific classes for almost all of the entities in
> >> our system, so Avro generates Java classes from our schema
> >> <https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl>.
> >> Since the AvroIndexedRecordConverter has dictionary support, our
> >> initial load from disk to memory is compact.
> >>
> >> That's the good news: a compact on-disk and initial in-memory
> >> representation.
> >>
> >> Here's the problem: the Spark shuffle.
> >>
> >> In order to integrate Parquet with Spark, we use a KryoRegistrator to
> >> register Kryo serializers
> >> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala>
> >> for each of our Avro objects (see the Kryo Serializer interface
> >> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>).
> >> We serialize each object into record-oriented Avro, which makes our
> >> intermediate shuffle files much larger than the corresponding
> >> column-oriented Parquet inputs. These large shuffle files are hurting
> >> our performance and limiting our scaling for some analyses.
> >>
> >> Since the shuffle data is short-lived, there's no need to store
> >> metadata, and we have immediate access to the schema through each Avro
> >> object. Each Avro specific class has a SCHEMA$ field which contains the
> >> Avro Schema for the object. There are utility functions in parquet-avro
> >> which can convert this Avro schema into a Parquet schema. We also don't
> >> need index pages, only the dictionary and data pages. We don't need
> >> predicate or projection functionality. Does anyone on this list see a
> >> way to create a Parquet Kryo serializer
> >> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>
> >> to read/write Avro Specific objects to/from a stream? Emitting
> >> column-oriented data will understandably incur memory and CPU costs on
> >> the map side, but it will be worth it to improve our shuffle
> >> performance.
> >>
> >> This shuffle issue is slowing important research, so any advice you
> >> have to offer will be appreciated. Thank you.
> >>
> >> --
> >> Matt Massie
> >> UC Berkeley, AMPLab
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
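
For readers following the thread, here is a minimal sketch of the record-oriented setup Matt describes: a Kryo Serializer that round-trips an Avro Specific object through Avro binary encoding, plus a KryoRegistrator that registers it. This is an illustrative assumption of how such a registrator could look, not ADAM's exact code (the real ADAMKryoRegistrator is linked above); the names AvroKryoSerializer and ExampleKryoRegistrator are invented here, and AlignmentRecord stands in for any bdg-formats class.

import java.io.ByteArrayOutputStream

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecord}
import org.apache.spark.serializer.KryoRegistrator

import scala.reflect.ClassTag

// Record-oriented Avro serializer (hypothetical): each object is written as
// binary-encoded Avro with a length prefix, so shuffle output stays row-oriented.
class AvroKryoSerializer[T <: SpecificRecord](implicit tag: ClassTag[T])
    extends Serializer[T] {

  private val clazz  = tag.runtimeClass.asInstanceOf[Class[T]]
  private val writer = new SpecificDatumWriter[T](clazz)
  private val reader = new SpecificDatumReader[T](clazz)

  override def write(kryo: Kryo, output: Output, record: T): Unit = {
    val buffer  = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(buffer, null)
    writer.write(record, encoder)
    encoder.flush()
    val bytes = buffer.toByteArray
    output.writeInt(bytes.length, true) // length prefix so read() knows how much to consume
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, klazz: Class[T]): T = {
    val bytes   = input.readBytes(input.readInt(true))
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null.asInstanceOf[T], decoder)
  }
}

// Registrator wired up via spark.kryo.registrator; AlignmentRecord is only an
// example of an Avro Specific class generated from the bdg-formats schema.
class ExampleKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(
      classOf[org.bdgenomics.formats.avro.AlignmentRecord],
      new AvroKryoSerializer[org.bdgenomics.formats.avro.AlignmentRecord])
  }
}

Because every record carries the full row, the shuffle files written through this path grow with the record-oriented encoding rather than with the column-oriented Parquet input, which is the size problem the thread is asking about.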
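The parquet-avro utility Matt refers to for converting an Avro schema into a Parquet schema is AvroSchemaConverter. A small sketch of deriving a Parquet MessageType from a Specific class's SCHEMA$ field follows; the object name is illustrative, AlignmentRecord is again just an example class, and the package shown is the Apache one (older parquet-mr releases used the parquet.avro package instead).

import org.apache.avro.Schema
import org.apache.parquet.avro.AvroSchemaConverter
import org.apache.parquet.schema.MessageType
import org.bdgenomics.formats.avro.AlignmentRecord

object ShuffleSchemaExample {
  // SCHEMA$ is generated on every Avro Specific class, so a shuffle-side
  // writer can discover its schema without storing any metadata.
  val avroSchema: Schema = AlignmentRecord.SCHEMA$

  // Convert the Avro schema into the equivalent Parquet message type.
  val parquetSchema: MessageType = new AvroSchemaConverter().convert(avroSchema)
}

A column-oriented Kryo serializer of the kind asked about would presumably start from this converted schema, write only dictionary and data pages for a batch of records on the map side, and skip footers and index pages; whether that can be done cleanly against Parquet's writer APIs is exactly the open question in the thread.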
