Sure, you can tweet that.

--
Matt Massie <http://www.linkedin.com/in/mattmassie/> <http://www.twitter.com/matt_massie>
UC Berkeley AMPLab <https://twitter.com/amplab>
On Wed, Apr 8, 2015 at 1:54 PM, Ryan Blue <[email protected]> wrote:

> "Parquet has been a core component of the system and we see compression of
> ~20% compared to specialized genome file formats, e.g. compressed BAM. In
> short, we're really happy with Parquet."
>
> Matt, can we tweet this? That's great!
>
> rb
>
> On 04/07/2015 11:49 AM, Matt Massie wrote:
>
>> We are using Apache Parquet and Spark for a genome analysis platform,
>> called ADAM <http://bdgenomics.org>, that allows researchers to quickly
>> analyze large datasets of DNA, RNA, etc. Parquet has been a core component
>> of the system and we see compression of ~20% compared to specialized
>> genome file formats, e.g. compressed BAM. In short, we're really happy
>> with Parquet.
>>
>> We are using Avro Specific classes for almost all the entities in our
>> system, so Avro generates Java classes from our schema
>> <https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl>.
>> Since the AvroIndexedRecordConverter has dictionary support, our initial
>> load from disk to memory is compact.
>>
>> That's the good news: compact on-disk and initial in-memory
>> representation.
>>
>> Here's the problem: the Spark shuffle.
>>
>> In order to integrate Parquet with Spark, we use a KryoRegistrator to
>> register Kryo serializers
>> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala>
>> for each of our Avro objects (see the Kryo Serializer interface
>> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>).
>> We serialize each object into record-oriented Avro, which makes our
>> intermediate shuffle files much larger than the corresponding
>> column-oriented Parquet inputs. These large shuffle files are hurting our
>> performance and limiting our scaling for some analyses.
>>
>> Since the shuffle data is short-lived, there's no need to store metadata,
>> and we have immediate access to the schema through each Avro object. Each
>> Avro specific class has a SCHEMA$ field which contains the Avro Schema for
>> the object. There are utility functions in parquet-avro which can convert
>> this Avro schema into a Parquet schema. We also don't need index pages,
>> only the dictionary and data pages. We don't need predicate or projection
>> functionality. Does anyone on this list see a way to create a Parquet Kryo
>> serializer
>> <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>
>> to read/write Avro Specific objects to/from a stream? Emitting
>> column-oriented data will understandably incur memory and CPU costs on the
>> map side, but it will be worth it to improve our shuffle performance.
>>
>> This shuffle issue is slowing important research, so any advice you have
>> to offer will be appreciated. Thank you.
>>
>> --
>> Matt Massie
>> UC Berkeley, AMPLab
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
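For reference, here is a minimal sketch of the record-oriented approach described in the quoted message: a Kryo Serializer that round-trips an Avro SpecificRecord through Avro binary encoding, registered per generated class in a Spark KryoRegistrator. The class names AvroSerializer and ExampleRegistrator and the commented registration line are illustrative, not the exact ADAMKryoRegistrator code linked above.

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecord}
import org.apache.spark.serializer.KryoRegistrator
import scala.reflect.ClassTag

// Serializes an Avro specific record as record-oriented Avro binary,
// length-prefixed so the read side knows how many bytes to consume.
class AvroSerializer[T <: SpecificRecord](implicit tag: ClassTag[T]) extends Serializer[T] {
  private val clazz = tag.runtimeClass.asInstanceOf[Class[T]]
  private val writer = new SpecificDatumWriter[T](clazz) // uses the class's SCHEMA$
  private val reader = new SpecificDatumReader[T](clazz)

  override def write(kryo: Kryo, output: Output, record: T): Unit = {
    val baos = new java.io.ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(baos, null)
    writer.write(record, encoder)
    encoder.flush()
    val bytes = baos.toByteArray
    output.writeInt(bytes.length, true)
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, klazz: Class[T]): T = {
    val bytes = input.readBytes(input.readInt(true))
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null.asInstanceOf[T], decoder)
  }
}

// One registration per generated record type, e.g. (illustrative):
class ExampleRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // kryo.register(classOf[AlignmentRecord], new AvroSerializer[AlignmentRecord])
  }
}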
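And a minimal sketch of the parquet-avro schema conversion mentioned above, taking the schema from a generated class's SCHEMA$ field. Note that AvroSchemaConverter lives under org.apache.parquet.avro in Parquet 1.7+ (parquet.avro in earlier releases), and the AlignmentRecord usage line is an assumption drawn from the bdg-formats schema.

import org.apache.avro.Schema
import org.apache.parquet.avro.AvroSchemaConverter
import org.apache.parquet.schema.MessageType

object AvroToParquetSchema {
  // Converts an Avro schema (e.g. one taken from a specific class's SCHEMA$
  // static field) into the equivalent Parquet MessageType.
  def convert(avroSchema: Schema): MessageType =
    new AvroSchemaConverter().convert(avroSchema)
}

// Usage (illustrative):
// val parquetSchema = AvroToParquetSchema.convert(AlignmentRecord.SCHEMA$)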
