Re: AVRO vs Parquet

Guru Medasani Thu, 10 Mar 2016 15:21:16 -0800

Thanks, Michael, for clarifying this. My responses are inline.

Guru Medasani
gdm...@gmail.com

> On Mar 10, 2016, at 12:38 PM, Michael Armbrust <mich...@databricks.com> wrote:
> 
> A few clarifications:
>  
> 1) High memory and cpu usage. This is because Parquet files can't be streamed 
> into as records arrive. I have seen a lot of OOMs in reasonably sized 
> MR/Spark containers that write out Parquet. When doing dynamic partitioning, 
> where many writers are open at once, we’ve seen customers having trouble 
> making it work. This has made for some very confused ETL developers.
> 
> In Spark 1.6.1 we avoid having more than 2 files open per task, so this 
> should be less of a problem even for dynamic partitioning.

Thanks for fixing this. It looks like this is the JIRA going into Spark 1.6.1 
that fixes the memory issues during dynamic partitioning. I copied it here 
so the rest of the folks on the email thread can take a look. 

SPARK-12546 <https://issues.apache.org/jira/browse/SPARK-12546>

Writing to partitioned parquet table can fail with OOM

>  
> 2) Parquet lags well behind Avro in schema evolution semantics. Can only add 
> columns at the end? Deleting columns at the end is not recommended if you 
> plan to add any columns in the future. Reordering is not supported in current 
> release. 
> 
> This may be true for Impala, but Spark SQL does schema merging by name so you 
> can add / reorder columns with the constraint that you cannot reuse a name 
> with an incompatible type.
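
For readers who want to see the by-name rule concretely, here is a minimal 
sketch in plain Python. It is a toy model under stated assumptions, not Spark's 
implementation: `merge_schemas` is a hypothetical helper, and schemas are 
modeled as ordered name-to-type dicts. In Spark itself the behaviour is 
requested with the `mergeSchema` read option on the Parquet data source.

```python
def merge_schemas(*schemas):
    """Merge schemas by column name (hypothetical helper, not a Spark API).

    Each schema is an ordered dict of column name -> type name.
    Column order in the inputs does not matter; a name that reappears
    with a different type is rejected.
    """
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise ValueError(
                    f"column {name!r} has incompatible types: "
                    f"{merged[name]} vs {dtype}"
                )
            merged.setdefault(name, dtype)
    return merged


# Older files have two columns; newer files reorder them and add one.
old_schema = {"id": "long", "name": "string"}
new_schema = {"name": "string", "id": "long", "email": "string"}

merged = merge_schemas(old_schema, new_schema)
print(merged)
```

Reusing `id` with type `string` in a later file would raise `ValueError` in 
this sketch, which mirrors the constraint described above.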

As I mentioned in my previous email, this is something the user still needs to 
be aware of, since the initial question included the following:

"I also have to consider any file on HDFS may be accessed from other tools like 
Hive, Impala, HAWQ."

Regarding reordering, I also mentioned in my previous email that it will be 
supported in Impala in the next release as well. 

In the next release, there might be support for optionally matching Parquet 
file columns by name instead of order, like Hive does. Under this scheme, you 
cannot rename columns (since the files will retain the old name and will no 
longer be matched), but you can reorder them. (This is regarding Impala.)
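
To make the trade-off concrete, here is a small plain-Python sketch of the two 
matching schemes. It is only an illustration under assumptions: the function 
names and the tuple-based rows are hypothetical and do not reflect how Impala 
or Hive implement column resolution.

```python
def resolve_by_position(table_cols, row):
    """Bind the i-th table column to the i-th value stored in the file."""
    return {
        col: (row[i] if i < len(row) else None)
        for i, col in enumerate(table_cols)
    }


def resolve_by_name(table_cols, file_cols, row):
    """Bind each table column to the file column with the same name."""
    values = dict(zip(file_cols, row))
    return {col: values.get(col) for col in table_cols}


# A file written when the table was (id, name).
file_cols = ["id", "name"]
row = (1, "ann")

# The table definition is later reordered to (name, id).
reordered = ["name", "id"]
by_position = resolve_by_position(reordered, row)      # values land in the wrong columns
by_name = resolve_by_name(reordered, file_cols, row)   # values still line up

# The table column "name" is later renamed to "full_name".
renamed = ["id", "full_name"]
renamed_by_name = resolve_by_name(renamed, file_cols, row)  # full_name is not found

print(by_position, by_name, renamed_by_name)
```

Matching by name survives the reorder but returns nothing for the renamed 
column, because the file still carries the old name.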
