Hi,

I'd be curious to get the community's perspective on using the Parquet format
as the canonical source of truth for one's data. Are folks doing this in
practice, or are they ETLing from their source of truth into Parquet for
analytical use cases (i.e. storing the data more than once)?

A reason I'm reluctant to store all my data once in Parquet is its lack of
support for common schema evolution scenarios, which largely seems
implementation-specific (
http://stackoverflow.com/questions/37644664/schema-evolution-in-parquet-format
).

Two specific pain points I have are:

I'm concerned about the expense of Spark's schema merging operations (
http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging)
vs. perhaps specifying a schema upfront (as with Avro).
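To illustrate the concern (a hedged sketch in plain Python, not Spark's actual implementation, with made-up field names): merging schemas across N Parquet files means reading every file's footer and unioning its fields into one schema, so the cost grows with the number of files:

```python
# Sketch of why schema merging scales with file count: each file's
# footer must be read and its (name, type) pairs unioned. Field names
# and types below are invented for illustration only.

def merge_schemas(file_schemas):
    """Union the (name, type) pairs from every file's schema,
    failing on conflicting types for the same field name."""
    merged = {}
    for schema in file_schemas:          # one footer read per file
        for name, typ in schema.items():
            if name in merged and merged[name] != typ:
                raise ValueError(f"incompatible types for {name!r}: "
                                 f"{merged[name]} vs {typ}")
            merged.setdefault(name, typ)
    return merged

# Three files written as the schema evolved over time:
files = [
    {"id": "int64"},
    {"id": "int64", "name": "binary"},
    {"id": "int64", "name": "binary", "score": "double"},
]
print(merge_schemas(files))
# {'id': 'int64', 'name': 'binary', 'score': 'double'}
```

A schema specified upfront avoids the per-file footer scan entirely, which as I understand it is why Spark ships with mergeSchema off by default.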

We use protobufs, and I've found that ProtoReadSupport seems to choke on
what would otherwise be valid protobuf schema evolution, such as renaming
fields. This leaves us in a situation where we can write our data with
ProtoParquetWriter, but must read it back using regular Parquet support
rather than https://github.com/saurfang/sparksql-protobuf
or ProtoParquetReader, which means we lose all the nice type-safe POJO
features that working with protobufs in Spark offers.

Appreciate any insights on the roadmap, or advice.

Best,

-Mike
