or even better, we retweet you :)

On Wed, Apr 8, 2015 at 2:00 PM, Matt Massie <[email protected]> wrote:
> Sure, you can tweet that.
>
> --
> Matt Massie <http://www.linkedin.com/in/mattmassie/>
> <http://www.twitter.com/matt_massie>
> UC Berkeley AMPLab <https://twitter.com/amplab>
>
> On Wed, Apr 8, 2015 at 1:54 PM, Ryan Blue <[email protected]> wrote:
>
> > "Parquet has been a core component of the system and we see compression
> > of ~20% compared to specialized genome file formats, e.g. compressed
> > BAM. In short, we're really happy with Parquet."
> >
> > Matt, can we tweet this? That's great!
> >
> > rb
> >
> > On 04/07/2015 11:49 AM, Matt Massie wrote:
> >
> >> We are using Apache Parquet and Spark for a genome analysis platform
> >> called ADAM <http://bdgenomics.org> that allows researchers to quickly
> >> analyze large datasets of DNA, RNA, etc. Parquet has been a core
> >> component of the system, and we see compression of ~20% compared to
> >> specialized genome file formats, e.g. compressed BAM. In short, we're
> >> really happy with Parquet.
> >>
> >> We are using Avro Specific classes for almost all of the entities in
> >> our system, so Avro generates Java classes from our schema
> >> <https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl>.
> >> Since the AvroIndexedRecordConverter has dictionary support, our
> >> initial load from disk to memory is compact.
> >>
> >> That's the good news: a compact on-disk and initial in-memory
> >> representation.
> >>
> >> Here's the problem: the Spark shuffle.
> >>
> >> In order to integrate Parquet with Spark, we use a KryoRegistrator to
> >> register Kryo serializers
> >> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala>
> >> for each of our Avro objects (see the Kryo Serializer interface
> >> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>).
> >> We serialize each object into record-oriented Avro, which makes our
> >> intermediate shuffle files much larger than the corresponding
> >> column-oriented Parquet inputs. These large shuffle files are hurting
> >> our performance and limiting our scaling for some analyses.
> >>
> >> Since the shuffle data is short-lived, there's no need to store
> >> metadata, and we have immediate access to the schema through each Avro
> >> object. Each Avro specific class has a SCHEMA$ field which contains the
> >> Avro Schema for the object. There are utility functions in parquet-avro
> >> which can convert this Avro schema into a Parquet schema. We also don't
> >> need index pages, only the dictionary and data pages. We don't need
> >> predicate or projection functionality. Does anyone on this list see a
> >> way to create a Parquet Kryo serializer
> >> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>
> >> to read/write Avro Specific objects to/from a stream? Emitting
> >> column-oriented data will understandably incur memory and CPU costs on
> >> the map side, but it will be worth it to improve our shuffle
> >> performance.
> >>
> >> This shuffle issue is slowing important research, so any advice you
> >> have to offer will be appreciated. Thank you.
> >>
> >> --
> >> Matt Massie
> >> UC Berkeley, AMPLab
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
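
For readers following the thread, here is a minimal sketch of the record-oriented setup Matt describes: a Kryo Serializer that round-trips an Avro Specific object through Avro binary encoding, plus a KryoRegistrator that registers it. This is an illustrative assumption of how such a registrator could look, not ADAM's exact code (the real ADAMKryoRegistrator is linked above); the names AvroKryoSerializer and ExampleKryoRegistrator are invented here, and AlignmentRecord stands in for any bdg-formats class.

import java.io.ByteArrayOutputStream

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecord}
import org.apache.spark.serializer.KryoRegistrator

import scala.reflect.ClassTag

// Record-oriented Avro serializer (hypothetical): each object is written as
// binary-encoded Avro with a length prefix, so shuffle output stays row-oriented.
class AvroKryoSerializer[T <: SpecificRecord](implicit tag: ClassTag[T])
    extends Serializer[T] {

  private val clazz  = tag.runtimeClass.asInstanceOf[Class[T]]
  private val writer = new SpecificDatumWriter[T](clazz)
  private val reader = new SpecificDatumReader[T](clazz)

  override def write(kryo: Kryo, output: Output, record: T): Unit = {
    val buffer  = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(buffer, null)
    writer.write(record, encoder)
    encoder.flush()
    val bytes = buffer.toByteArray
    output.writeInt(bytes.length, true) // length prefix so read() knows how much to consume
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, klazz: Class[T]): T = {
    val bytes   = input.readBytes(input.readInt(true))
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null.asInstanceOf[T], decoder)
  }
}

// Registrator wired up via spark.kryo.registrator; AlignmentRecord is only an
// example of an Avro Specific class generated from the bdg-formats schema.
class ExampleKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(
      classOf[org.bdgenomics.formats.avro.AlignmentRecord],
      new AvroKryoSerializer[org.bdgenomics.formats.avro.AlignmentRecord])
  }
}

Because every record carries the full row, the shuffle files written through this path grow with the record-oriented encoding rather than with the column-oriented Parquet input, which is the size problem the thread is asking about.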
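The parquet-avro utility Matt refers to for converting an Avro schema into a Parquet schema is AvroSchemaConverter. A small sketch of deriving a Parquet MessageType from a Specific class's SCHEMA$ field follows; the object name is illustrative, AlignmentRecord is again just an example class, and the package shown is the Apache one (older parquet-mr releases used the parquet.avro package instead).

import org.apache.avro.Schema
import org.apache.parquet.avro.AvroSchemaConverter
import org.apache.parquet.schema.MessageType
import org.bdgenomics.formats.avro.AlignmentRecord

object ShuffleSchemaExample {
  // SCHEMA$ is generated on every Avro Specific class, so a shuffle-side
  // writer can discover its schema without storing any metadata.
  val avroSchema: Schema = AlignmentRecord.SCHEMA$

  // Convert the Avro schema into the equivalent Parquet message type.
  val parquetSchema: MessageType = new AvroSchemaConverter().convert(avroSchema)
}

A column-oriented Kryo serializer of the kind asked about would presumably start from this converted schema, write only dictionary and data pages for a batch of records on the map side, and skip footers and index pages; whether that can be done cleanly against Parquet's writer APIs is exactly the open question in the thread.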
