Julien,

Thanks for the thoughtful response. It looks like we are talking about this ticket, which I am now tracking (and which also seems to have gotten a recent PR!) :) https://issues.apache.org/jira/browse/PARQUET-951
In the event that not all valid protobuf (or Avro, for that matter) schema evolution rules translate seamlessly into valid Parquet schema evolution rules, I was looking into moving away from writing my Parquet files with ProtoParquetWriter and instead managing the transformation between my protobufs and Parquet files myself. I'm having trouble finding documentation or best practices for doing that, apart from a comment somewhere that Parquet files are mainly meant to be generated directly from its Proto and Avro ParquetWriters. Can you comment on this? Are there any examples of writing Parquet files directly? Thanks again.

On 2017-04-20 14:43 (-0400), Julien Le Dem <[email protected]> wrote:
> Hi Michael,
>
> The default schema evolution in Parquet is to merge schemas by field name.
>
> Which means you can:
> - add a field with a name that is not used yet
>
> But you can not:
> - rename a field. Although it will treat it as removing the field and
> adding a new one. The old name and new name will be treated as 2 different
> columns
> - change the type of a field.
>
> In Protobuf and Thrift there is a field ID, and both support renaming a
> field while keeping the field id. So by default this is not supported in
> Parquet. You can add fields but not really rename existing ones. You can
> use Protobuf if you restrict yourself not to rename fields.
> Although we have added an optional id field in the parquet schema nodes
> specifically for that purpose.
> QinHui from Criteo is looking into the exact same thing related to
> protobuf.
> (check out the latest notes from the parquet sync on this list)
>
> To support renaming of fields we would need to:
> - populate the id fields in the Parquet schema when converting the
> Protobuf schema to Parquet by calling withId (
> https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java
> )
> - take the id into account when doing schema merging, when the id is
> available (possibly add a property to switch on the different behavior)
>
> That is a feature that totally makes sense and just needs someone to spend
> the time implementing it.
> I think QinHui mentioned he is interested in doing so. He also talked about
> dealing with the unknown fields coming from Protobuf (when you don't have
> the latest proto, for example).
> QinHui: am I correctly reflecting this? Did you create a JIRA for this?
>
> Mike: let me know if that helps.
>
> Cheers
> Julien
>
> On Wed, Apr 19, 2017 at 11:18 AM, Michael Moss <[email protected]>
> wrote:
>
> > Hi,
> >
> > I'd be curious to get the community's perspective on using the Parquet
> > format as the canonical source of truth for one's data. Are folks doing
> > this in practice, or ETLing from their source of truth into Parquet for
> > analytical use cases (storing more than once)?
> >
> > A reason I'm reluctant to store all my data once in Parquet is its lack
> > of support for common schema evolution scenarios, which largely seems
> > implementation specific (
> > http://stackoverflow.com/questions/37644664/schema-evolution-in-parquet-format
> > ).
> >
> > Two specific pain points I have are:
> >
> > Concern about the expense of Spark schema merging operations (
> > http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging)
> > vs. perhaps specifying a schema upfront (like Avro).
> >
> > We use protobufs, and I've found that ProtoReadSupport seems to choke on
> > what would otherwise be valid protobuf schema evolution rules, like
> > renaming fields, for example. This leaves us in a situation where perhaps
> > we can write our data with ProtoParquetWriter, but must read it back using
> > regular Parquet support vs. https://github.com/saurfang/sparksql-protobuf
> > or ProtoParquetReader, which means we lose all the nice type-safe POJO
> > features working with protobufs in Spark offers us.
> >
> > Appreciate any insights on the roadmap, or advice.
> >
> > Best,
> >
> > -Mike
>
> --
> Julien
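P.S. To make the name-based merging point above concrete, here is a minimal sketch (not from the thread — schema and field names are made up for illustration) using parquet-mr's schema API. `MessageType.union` is the by-name merge: a "rename" just shows up as two distinct columns.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Types;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

public class NameBasedMergeDemo {
  public static void main(String[] args) {
    // Schema v1 has a field "user_id"; in v2 it was "renamed" to "uid".
    MessageType v1 = Types.buildMessage()
        .optional(INT64).named("user_id")
        .named("record");
    MessageType v2 = Types.buildMessage()
        .optional(INT64).named("uid")
        .named("record");

    // Merging is by field name, so the rename is treated as dropping
    // "user_id" and adding "uid": the result contains both columns.
    MessageType merged = v1.union(v2);
    System.out.println(merged);
  }
}
```

Running this prints a merged message type containing both `user_id` and `uid`, which is exactly the "old name and new name treated as 2 different columns" behavior described above.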

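P.P.S. On my own question about writing Parquet files directly without ProtoParquetWriter: one route I've been experimenting with is parquet-mr's low-level "example" object model (`ExampleParquetWriter` / `Group`), building the `MessageType` by hand and mapping each protobuf message to a `Group` myself. A rough sketch, assuming parquet-mr 1.x and hadoop-client on the classpath — the schema, field ids, and path here are illustrative, not an official recipe:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.Types;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

public class DirectParquetWrite {
  public static void main(String[] args) throws Exception {
    // Build the Parquet schema by hand. The .id(...) calls attach the
    // optional field ids discussed above, mirroring protobuf field numbers.
    MessageType schema = Types.buildMessage()
        .required(INT64).id(1).named("user_id")
        .optional(BINARY).as(OriginalType.UTF8).id(2).named("email")
        .named("event");

    SimpleGroupFactory groups = new SimpleGroupFactory(schema);
    try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("/tmp/events.parquet"))  // illustrative path
        .withType(schema)
        .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
        .build()) {
      // In a real pipeline, each protobuf message would be mapped
      // to a Group here instead of hard-coding values.
      Group g = groups.newGroup();
      g.add("user_id", 42L);
      g.add("email", "[email protected]");
      writer.write(g);
    }
  }
}
```

This trades ProtoParquetWriter's automatic schema conversion for full control over the Parquet schema (including the field ids), at the cost of maintaining the proto-to-Group mapping yourself.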