The dictionary is in the first page of a column chunk (if dictionary encoded). It would not be very hard to add an API to read it. For a given column and row group:
- seek to the beginning of the column chunk
- read the dictionary page: https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ColumnChunkPageReadStore.java
- deserialize the dictionary: https://github.com/apache/incubator-parquet-mr/blob/3df3372a1ee7b6ea74af89f53a614895b8078609/parquet-column/src/main/java/parquet/column/impl/ColumnReaderImpl.java#L334
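
Roughly, a helper along these lines could work (an untested sketch; it arbitrarily reads only the first row group and a single column, and reuses the same initDictionary call that ColumnReaderImpl makes at the line linked above):

    import java.io.IOException;
    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    import parquet.column.ColumnDescriptor;
    import parquet.column.Dictionary;
    import parquet.column.page.DictionaryPage;
    import parquet.column.page.PageReadStore;
    import parquet.hadoop.ParquetFileReader;
    import parquet.hadoop.metadata.ParquetMetadata;
    import parquet.schema.MessageType;

    public class DictionaryReader {
      /**
       * Sketch: read the dictionary for one column of the first row group.
       * Returns null if the column chunk is not dictionary encoded.
       */
      public static Dictionary readDictionary(Configuration conf, Path file, String[] columnPath)
          throws IOException {
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
        MessageType schema = footer.getFileMetaData().getSchema();
        ColumnDescriptor column = schema.getColumnDescription(columnPath);

        // only ask the reader to load the one column chunk we care about
        ParquetFileReader reader = new ParquetFileReader(
            conf, file, footer.getBlocks(), Collections.singletonList(column));
        try {
          PageReadStore rowGroup = reader.readNextRowGroup();        // first row group
          DictionaryPage dictionaryPage =
              rowGroup.getPageReader(column).readDictionaryPage();   // null if not dictionary encoded
          return dictionaryPage == null
              ? null
              : dictionaryPage.getEncoding().initDictionary(column, dictionaryPage);
        } finally {
          reader.close();
        }
      }
    }
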
On Wed, Apr 8, 2015 at 1:28 PM, Matt Massie <[email protected]> wrote:

> Thanks for the ideas, Julien. I appreciate it.
>
> I suspected that I would need to direct my effort into a Parquet-based Spark shuffle but wanted to reach out on the Parquet list first to make sure that the simpler Parquet serialization approach wouldn't work.
>
> As far as I can tell, there isn't an external API for extracting dictionaries from Parquet files. Correct?
>
> -Matt
>
> On Tue, Apr 7, 2015 at 1:50 PM, Julien Le Dem <[email protected]> wrote:
>
> > You might want to look at Twitter Chill: https://github.com/twitter/chill
> > This is where Twitter keeps the Kryo serializers for this kind of use case.
> > The one for Thrift: https://github.com/twitter/chill/blob/develop/chill-thrift/src/main/java/com/twitter/chill/thrift/TBaseSerializer.java
> > Spark is using it already: https://github.com/apache/spark/blob/ee11be258251adf900680927ba200bf46512cc04/pom.xml#L320
> >
> > But the main problem is that you want to encode multiple records at a time, while the serializer gets a single key or value at a time. The output is not a single stream, since values get routed to a different node based on the key.
> >
> > Unless Spark first gathers the list of values that go to a specific node before encoding them (and provides an API for custom encoding of all those values together), I don't think you can implement a Parquet KryoSerializer. I would ask the Spark project whether it's possible to encode all the values that go to the same node together with a custom serializer.
> >
> > Another approach is to reuse dictionary encoding in your row-oriented serialization. It would not compress as much as Parquet but may help. You'd need to pass the dictionary around, but maybe you know all the possible values ahead of time (extracting the dictionaries from the Parquet file?). After all, your in-memory representation is compact because, after dictionary decoding, all Avro structs that use the same string value point to the same Java String instance, so you may be able to use this.
> >
> > I hope this helps
> >
> > Julien
> >
> > On Tue, Apr 7, 2015 at 12:07 PM, Matt Massie <[email protected]> wrote:
> >
> > > I forgot to mention that I'm happy to do the work here; I'm mostly looking for advice and pointers.
> > >
> > > Also, another interface that might be better for integration is the Spark Serializer <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/serializer/Serializer.scala>, since the SerializationStream and DeserializationStream have the flush() and close() functions that are necessary (instead of just read/write).
> > >
> > > --
> > > Matt <http://www.linkedin.com/in/mattmassie/> Massie <http://www.twitter.com/matt_massie>
> > > UC Berkeley AMPLab <https://twitter.com/amplab>
> > >
> > > On Tue, Apr 7, 2015 at 11:49 AM, Matt Massie <[email protected]> wrote:
> > >
> > > > We are using Apache Parquet and Spark for a genome analysis platform called ADAM <http://bdgenomics.org>, which allows researchers to quickly analyze large datasets of DNA, RNA, etc. Parquet has been a core component of the system, and we see compression of ~20% compared to specialized genome file formats, e.g. compressed BAM. In short, we're really happy with Parquet.
> > > >
> > > > We are using Avro Specific classes for almost all the entities in our system, so Avro generates Java classes from our schema <https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl>. Since the AvroIndexedRecordConverter has dictionary support, our initial load from disk to memory is compact.
> > > >
> > > > That's the good news: compact on-disk and initial in-memory representations.
> > > >
> > > > Here's the problem: the Spark shuffle.
> > > >
> > > > In order to integrate Parquet with Spark, we use a KryoRegistrator to register Kryo serializers <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala> for each of our Avro objects (see the Kryo Serializer interface <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java>). We are serializing each object into record-oriented Avro, which makes our intermediate shuffle files much larger than the corresponding column-oriented Parquet inputs. These large shuffle files are hurting our performance and limiting our scaling for some analyses.
> > > >
> > > > Since the shuffle data is short-lived, there's no need to store metadata, and we have immediate access to the schema through each Avro object. Each Avro specific class has a SCHEMA$ field which contains the Avro Schema for the object. There are utility functions in parquet-avro which can convert this Avro schema into a Parquet schema. We also don't need index pages, only the dictionary and data pages. We don't need predicate or projection functionality. Does anyone on this list see a way to create a Parquet Kryo serializer <https://github.com/EsotericSoftware/kryo/blob/master/src/com/esotericsoftware/kryo/Serializer.java> to read/write Avro Specific objects to/from a stream? Emitting column-oriented data will understandably incur memory and CPU costs on the map side, but it will be worth it to improve our shuffle performance.
> > > >
> > > > This shuffle issue is slowing important research, so any advice you have to offer will be appreciated. Thank you.
> > > >
> > > > —
> > > > Matt Massie
> > > > UC Berkeley, AMPLab
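
PS: for what it's worth, an Avro analogue of the chill-thrift TBaseSerializer quoted above would look roughly like the sketch below (untested; the class name and length-prefix framing are just illustrative, and it still writes one record per write() call, i.e. row-oriented):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.Serializer;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;

    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.specific.SpecificDatumReader;
    import org.apache.avro.specific.SpecificDatumWriter;
    import org.apache.avro.specific.SpecificRecord;

    // One Avro record per Kryo write() call, so still row-oriented.
    public class AvroSpecificKryoSerializer<T extends SpecificRecord> extends Serializer<T> {
      private final SpecificDatumWriter<T> writer;
      private final SpecificDatumReader<T> reader;

      public AvroSpecificKryoSerializer(Class<T> clazz) {
        this.writer = new SpecificDatumWriter<T>(clazz);
        this.reader = new SpecificDatumReader<T>(clazz);
      }

      @Override
      public void write(Kryo kryo, Output output, T record) {
        try {
          // encode the record with Avro, then length-prefix the bytes for Kryo
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
          writer.write(record, encoder);
          encoder.flush();
          byte[] serialized = bytes.toByteArray();
          output.writeInt(serialized.length, true);
          output.writeBytes(serialized);
        } catch (IOException e) {
          throw new RuntimeException("Avro serialization failed", e);
        }
      }

      @Override
      public T read(Kryo kryo, Input input, Class<T> clazz) {
        try {
          // read the length prefix, then decode the Avro bytes
          byte[] serialized = input.readBytes(input.readInt(true));
          BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(serialized, null);
          return reader.read(null, decoder);
        } catch (IOException e) {
          throw new RuntimeException("Avro deserialization failed", e);
        }
      }
    }

You would register one of these per generated class in a KryoRegistrator, which is essentially what the ADAMKryoRegistrator linked above already does; the open question in this thread is how to replace the per-record body with something dictionary- or column-aware.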
