Sure, you can tweet that.

--
Matt Massie <http://www.linkedin.com/in/mattmassie/> <http://www.twitter.com/matt_massie>
UC Berkeley AMPLab <https://twitter.com/amplab>
On Wed, Apr 8, 2015 at 1:54 PM, Ryan Blue <[email protected]> wrote:

> "Parquet has been a core component of the system and we see compression of
> ~20% compared to specialized genome file formats, e.g. compressed BAM. In
> short, we're really happy with Parquet."
>
> Matt, can we tweet this? That's great!
>
> rb
>
> On 04/07/2015 11:49 AM, Matt Massie wrote:
>
>> We are using Apache Parquet and Spark for a genome analysis platform,
>> called ADAM <http://bdgenomics.org>, that allows researchers to quickly
>> analyze large datasets of DNA, RNA, etc. Parquet has been a core component
>> of the system and we see compression of ~20% compared to specialized
>> genome file formats, e.g. compressed BAM. In short, we're really happy
>> with Parquet.
>>
>> We are using Avro Specific classes for almost all the entities in our
>> system, so Avro generates Java classes from our schema
>> <https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl>.
>> Since the AvroIndexedRecordConverter has dictionary support, our initial
>> load from disk to memory is compact.
>>
>> That's the good news: compact on-disk and initial in-memory
>> representation.
>>
>> Here's the problem: the Spark shuffle.
>>
>> In order to integrate Parquet with Spark, we use a KryoRegistrator to
>> register Kryo serializers
>> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala>
>> for each of our Avro objects (see the Kryo Serializer interface
>> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>).
>> We serialize each object into record-oriented Avro, which makes our
>> intermediate shuffle files much larger than the corresponding
>> column-oriented Parquet inputs. These large shuffle files are hurting our
>> performance and limiting our scaling for some analyses.
>>
>> Since the shuffle data is short-lived, there's no need to store metadata,
>> and we have immediate access to the schema through each Avro object. Each
>> Avro specific class has a SCHEMA$ field which contains the Avro Schema for
>> the object. There are utility functions in parquet-avro which can convert
>> this Avro schema into a Parquet schema. We also don't need index pages,
>> only the dictionary and data pages. We don't need predicate or projection
>> functionality. Does anyone on this list see a way to create a Parquet Kryo
>> serializer
>> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>
>> to read/write Avro Specific objects to/from a stream? Emitting
>> column-oriented data will understandably incur memory and CPU costs on the
>> map side, but it will be worth it to improve our shuffle performance.
>>
>> This shuffle issue is slowing important research, so any advice you have
>> to offer will be appreciated. Thank you.
>>
>> --
>> Matt Massie
>> UC Berkeley, AMPLab
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
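For reference, here is a minimal sketch of the record-oriented approach described in the quoted message: a Kryo Serializer that round-trips an Avro SpecificRecord through Avro binary encoding, registered per generated class in a Spark KryoRegistrator. The class names AvroSerializer and ExampleRegistrator and the commented registration line are illustrative, not the exact ADAMKryoRegistrator code linked above.

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecord}
import org.apache.spark.serializer.KryoRegistrator
import scala.reflect.ClassTag

// Serializes an Avro specific record as record-oriented Avro binary,
// length-prefixed so the read side knows how many bytes to consume.
class AvroSerializer[T <: SpecificRecord](implicit tag: ClassTag[T]) extends Serializer[T] {
  private val clazz = tag.runtimeClass.asInstanceOf[Class[T]]
  private val writer = new SpecificDatumWriter[T](clazz) // uses the class's SCHEMA$
  private val reader = new SpecificDatumReader[T](clazz)

  override def write(kryo: Kryo, output: Output, record: T): Unit = {
    val baos = new java.io.ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(baos, null)
    writer.write(record, encoder)
    encoder.flush()
    val bytes = baos.toByteArray
    output.writeInt(bytes.length, true)
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, klazz: Class[T]): T = {
    val bytes = input.readBytes(input.readInt(true))
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null.asInstanceOf[T], decoder)
  }
}

// One registration per generated record type, e.g. (illustrative):
class ExampleRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // kryo.register(classOf[AlignmentRecord], new AvroSerializer[AlignmentRecord])
  }
}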
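And a minimal sketch of the parquet-avro schema conversion mentioned above, taking the schema from a generated class's SCHEMA$ field. Note that AvroSchemaConverter lives under org.apache.parquet.avro in Parquet 1.7+ (parquet.avro in earlier releases), and the AlignmentRecord usage line is an assumption drawn from the bdg-formats schema.

import org.apache.avro.Schema
import org.apache.parquet.avro.AvroSchemaConverter
import org.apache.parquet.schema.MessageType

object AvroToParquetSchema {
  // Converts an Avro schema (e.g. one taken from a specific class's SCHEMA$
  // static field) into the equivalent Parquet MessageType.
  def convert(avroSchema: Schema): MessageType =
    new AvroSchemaConverter().convert(avroSchema)
}

// Usage (illustrative):
// val parquetSchema = AvroToParquetSchema.convert(AlignmentRecord.SCHEMA$)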
