A few clarifications:
> 1) High memory and cpu usage. This is because Parquet files can't be
> streamed into as records arrive. I have seen a lot of OOMs in reasonably
> sized MR/Spark containers that write out Parquet. When doing dynamic
> partitioning, where many writers are open at once, we’ve seen customers
> having trouble to make it work. This has made for some very confused ETL
> developers.

In Spark 1.6.1 we avoid having more than 2 files open per task, so this
should be less of a problem even for dynamic partitioning.

> 2) Parquet lags well behind Avro in schema evolution semantics. Can only
> add columns at the end? Deleting columns at the end is not recommended if
> you plan to add any columns in the future. Reordering is not supported in
> current release.

This may be true for Impala, but Spark SQL does schema merging by name, so
you can add / reorder columns, with the constraint that you cannot reuse a
name with an incompatible type.
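
To make the schema-merging point concrete, here is a minimal sketch using
the standard mergeSchema reader option; the paths and column names are made
up for illustration, and it assumes a local spark-shell-style setup:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(
      new SparkConf().setAppName("merge-demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Two sets of Parquet files with overlapping but differently ordered columns.
    sqlContext.range(5).selectExpr("id", "id * 2 as value")
      .write.parquet("/tmp/merge_demo/part=1")
    sqlContext.range(5).selectExpr("id * 3 as extra", "id")
      .write.parquet("/tmp/merge_demo/part=2")

    // mergeSchema reconciles the files by column name, regardless of order;
    // reusing a name with an incompatible type is what would fail here.
    val merged = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/tmp/merge_demo")
    merged.printSchema()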
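
And going back to the first point, this is the kind of dynamic partitioned
write that used to open one Parquet writer per partition value; again the
column names and output path are just illustrative, reusing the sqlContext
from the sketch above:

    import org.apache.spark.sql.SaveMode

    val events = sqlContext.range(1000).selectExpr(
      "id",
      "cast(id % 24 as int) as hour",     // 24 distinct partition values
      "cast(id % 7  as int) as weekday")  //  7 more -> up to 168 directories

    // Each distinct (hour, weekday) pair becomes its own output directory.
    // As mentioned above, in 1.6.1 a task no longer keeps a writer open for
    // every partition it sees, so this should not OOM the way it used to.
    events.write
      .mode(SaveMode.Overwrite)
      .partitionBy("hour", "weekday")
      .parquet("/tmp/events_partitioned")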