Julien,

Thanks for the thoughtful response. It looks like we are talking about this ticket, which I am now tracking (and which also seems to have gotten a recent PR!) :) https://issues.apache.org/jira/browse/PARQUET-951
In the event that not all valid protobuf (or Avro, for that matter) schema evolution rules translate seamlessly into valid Parquet schema evolution rules, I was looking into moving away from writing my Parquet files with ProtoParquetWriter and instead managing the transformation between my protobufs and Parquet files myself. I'm having trouble finding documentation or best practices for doing that, apart from a comment somewhere that Parquet files are mainly meant to be generated directly from its Proto and Avro ParquetWriters. Can you comment on this? Are there any examples of writing Parquet files directly? Thanks again.

On 2017-04-20 14:43 (-0400), Julien Le Dem <[email protected]> wrote:
> Hi Michael,
>
> The default schema evolution in Parquet is to merge schemas by field name.
>
> Which means you can:
> - add a field with a name that is not used yet
>
> But you can not:
> - rename a field. Although it will treat it as removing the field and
> adding a new one. The old name and new name will be treated as 2 different
> columns
> - change the type of a field.
>
> In Protobuf and Thrift there is a field ID, and both support renaming a
> field while keeping the field id. So by default this is not supported in
> Parquet. You can add fields but not really rename existing ones. You can
> use Protobuf if you restrict yourself not to rename fields.
> Although we have added an optional id field in the parquet schema nodes
> specifically for that purpose.
> QinHui from Criteo is looking into the exact same thing related to
> protobuf.
> (check out the latest notes from the parquet sync on this list)
>
> To support renaming of fields we would need to:
> - populate the id fields in the Parquet schema when converting the
> Protobuf schema to Parquet by calling withId (
> https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java
> )
> - take the id into account when doing schema merging, when the id is
> available (possibly add a property to switch on the different behavior)
>
> That is a feature that totally makes sense and just needs someone to spend
> the time implementing it.
> I think QinHui mentioned he is interested in doing so. He also talked about
> dealing with the unknown fields coming from Protobuf (when you don't have
> the latest proto, for example).
> QinHui: am I correctly reflecting this? Did you create a JIRA for this?
>
> Mike: let me know if that helps.
>
> Cheers
> Julien
>
> On Wed, Apr 19, 2017 at 11:18 AM, Michael Moss <[email protected]>
> wrote:
>
> > Hi,
> >
> > I'd be curious to get the community's perspective on using the Parquet
> > format as the canonical source of truth for one's data. Are folks doing
> > this in practice, or ETLing from their source of truth into Parquet for
> > analytical use cases (storing more than once)?
> >
> > A reason I'm reluctant to store all my data once in Parquet is its lack
> > of support for common schema evolution scenarios, which largely seems
> > implementation specific (
> > http://stackoverflow.com/questions/37644664/schema-evolution-in-parquet-format
> > ).
> >
> > Two specific pain points I have are:
> >
> > Concern about the expense of Spark schema merging operations (
> > http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging)
> > vs. perhaps specifying a schema upfront (like Avro).
> >
> > We use protobufs, and I've found that ProtoReadSupport seems to choke on
> > what would otherwise be valid protobuf schema evolution rules, like
> > renaming fields, for example. This leaves us in a situation where perhaps
> > we can write our data with ProtoParquetWriter, but must read it back using
> > regular Parquet support vs. https://github.com/saurfang/sparksql-protobuf
> > or ProtoParquetReader, which means we lose all the nice type-safe POJO
> > features working with protobufs in Spark offers us.
> >
> > Appreciate any insights on the roadmap, or advice.
> >
> > Best,
> >
> > -Mike
>
> --
> Julien
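P.S. To make the name-based merging point above concrete, here is a minimal sketch (not from the thread — schema and field names are made up for illustration) using parquet-mr's schema API. `MessageType.union` is the by-name merge: a "rename" just shows up as two distinct columns.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Types;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

public class NameBasedMergeDemo {
  public static void main(String[] args) {
    // Schema v1 has a field "user_id"; in v2 it was "renamed" to "uid".
    MessageType v1 = Types.buildMessage()
        .optional(INT64).named("user_id")
        .named("record");
    MessageType v2 = Types.buildMessage()
        .optional(INT64).named("uid")
        .named("record");

    // Merging is by field name, so the rename is treated as dropping
    // "user_id" and adding "uid": the result contains both columns.
    MessageType merged = v1.union(v2);
    System.out.println(merged);
  }
}
```

Running this prints a merged message type containing both `user_id` and `uid`, which is exactly the "old name and new name treated as 2 different columns" behavior described above.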

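P.P.S. On my own question about writing Parquet files directly without ProtoParquetWriter: one route I've been experimenting with is parquet-mr's low-level "example" object model (`ExampleParquetWriter` / `Group`), building the `MessageType` by hand and mapping each protobuf message to a `Group` myself. A rough sketch, assuming parquet-mr 1.x and hadoop-client on the classpath — the schema, field ids, and path here are illustrative, not an official recipe:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.Types;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

public class DirectParquetWrite {
  public static void main(String[] args) throws Exception {
    // Build the Parquet schema by hand. The .id(...) calls attach the
    // optional field ids discussed above, mirroring protobuf field numbers.
    MessageType schema = Types.buildMessage()
        .required(INT64).id(1).named("user_id")
        .optional(BINARY).as(OriginalType.UTF8).id(2).named("email")
        .named("event");

    SimpleGroupFactory groups = new SimpleGroupFactory(schema);
    try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("/tmp/events.parquet"))  // illustrative path
        .withType(schema)
        .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
        .build()) {
      // In a real pipeline, each protobuf message would be mapped
      // to a Group here instead of hard-coding values.
      Group g = groups.newGroup();
      g.add("user_id", 42L);
      g.add("email", "[email protected]");
      writer.write(g);
    }
  }
}
```

This trades ProtoParquetWriter's automatic schema conversion for full control over the Parquet schema (including the field ids), at the cost of maintaining the proto-to-Group mapping yourself.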