Hi, I'd be curious to get the community's perspective on using the Parquet format as the canonical source of truth for one's data. Are folks doing this in practice, or are you ETLing from your source of truth into Parquet for analytical use cases (i.e., storing the data more than once)?
One reason I'm reluctant to store all my data once, in Parquet, is its lack of support for common schema evolution scenarios, which largely seems implementation-specific (http://stackoverflow.com/questions/37644664/schema-evolution-in-parquet-format). Two specific pain points:

1. I'm concerned about the expense of Spark's schema merging operations (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging) versus specifying a schema upfront (as with Avro).

2. We use protobufs, and I've found that ProtoReadSupport seems to choke on what would otherwise be valid protobuf schema evolution, such as renaming fields. This leaves us in a situation where we can write our data with ProtoParquetWriter but must read it back using regular Parquet support rather than https://github.com/saurfang/sparksql-protobuf or ProtoParquetReader, which means we lose all the nice type-safe POJO features that working with protobufs in Spark offers.

Appreciate any insights on the roadmap, or advice.

Best,
-Mike
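P.S. To make the field-renaming point concrete: renaming is legal protobuf evolution because the wire format identifies fields by number, never by name, so a rename doesn't change the serialized bytes at all. Here's a minimal sketch of that rule in plain Python (hand-rolling the varint encoding from the protobuf encoding spec; the field name `user_id` vs `account_id` is just an illustrative example, not from any real schema):

```python
def encode_varint(value):
    """Encode an unsigned int as a protobuf varint (7 bits per byte, low bits first)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_int_field(field_number, value):
    """Encode a varint-typed field. The tag carries the field NUMBER and wire
    type only -- the field name never appears on the wire."""
    tag = (field_number << 3) | 0  # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)

# Whether the .proto calls field 1 `user_id` or `account_id`, the bytes are
# identical as long as the field number (1) is unchanged:
assert encode_int_field(1, 150) == b"\x08\x96\x01"
```

This is exactly why a rename round-trips fine through protobuf itself; the breakage we see with ProtoReadSupport presumably comes from Parquet matching columns by name rather than by protobuf field number.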
