Thanks for the suggestion Cheng, I will try that today.
Are there any implications when reading the parquet data if there are
no summary files present?

Michael

On Sat, Jul 25, 2015 at 2:28 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
> The time is probably spent by ParquetOutputFormat.commitJob. While
> committing a successful write job, Parquet writes a pair of summary files,
> containing metadata like the schema, user-defined key-value metadata, and
> Parquet row group information. To gather all the necessary information,
> Parquet scans the footers of all the data files within the base directory and
> tries to merge them. The more data files there are, the longer it takes.
>
> One possible workaround is to disable summary files by setting "
> parquet.enable.summary-metadata" to false in sc.hadoopConfiguration.
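>
> A minimal sketch, assuming the same sc, df, and partitions from your
> snippet below (the names are reused from your command, nothing new):
>
>   // Skip writing the _metadata / _common_metadata summary files, so
>   // ParquetOutputFormat.commitJob no longer has to scan and merge the
>   // footers of every existing file under the base directory.
>   sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
>
>   // The write itself stays unchanged:
>   df.coalesce(partitions)
>     .write.mode(SaveMode.Append)
>     .partitionBy("dt", "outcome")
>     .parquet("s3n://root/parquet/dir/")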
>
> Cheng
>
>
> On 7/25/15 4:15 AM, Michael Kelly wrote:
>
> Hi,
>
>
> We are converting some CSV log files to Parquet, but the job is getting
> progressively slower the more files we add to the Parquet folder.
>
> The Parquet files are being written to S3; we are using a Spark
> standalone cluster running on EC2, and the Spark version is 1.4.1. The
> Parquet files are partitioned on two columns: first the date, then
> another column. We write the data one day at a time, and the final
> size of one day's data, once written out as Parquet, is about 150GB.
>
> We coalesce the data before it is written out, and in total we have
> 615 partitions/files written out to S3 per day. We use SaveMode.Append
> since we are always writing to the same directory. This is the command
> we use to write the data:
>
> df.coalesce(partitions).write.mode(SaveMode.Append).partitionBy("dt","outcome").parquet("s3n://root/parquet/dir/")
>
> Writing the Parquet files to an empty directory completes almost
> immediately, whereas after 12 days' worth of data has been written,
> each Parquet write takes up to 20 minutes (and there are 4 writes per
> day).
>
> Questions:
> Is there a more efficient way to write the data? I'm guessing that the
> update to the Parquet metadata is the issue, and that it happens in a
> serial fashion.
> Is there a way to write the metadata in the partitioned folders, and
> would this speed things up?
> Would this have any implications for reading in the data?
> I came across DirectParquetOutputCommitter, but the source for it says
> it cannot be used with Append mode; would this be useful? (A rough
> sketch of how I think it would be enabled is below.)
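>
> For reference, this is roughly how I think the committer would be
> plugged in; the config key and class path below are my assumption from
> skimming the 1.4 source, not something I have verified on 1.4.1:
>
>   // ASSUMPTION: the config key and committer class path come from my
>   // reading of the Spark 1.4 source tree; please double-check them.
>   // The source also suggests this committer is ignored with SaveMode.Append.
>   sc.hadoopConfiguration.set(
>     "spark.sql.parquet.output.committer.class",
>     "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")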
>
>
> I came across this issue -
> https://issues.apache.org/jira/browse/SPARK-8125 - and the corresponding
> pull request - https://github.com/apache/spark/pull/7396 - but it looks
> like they are geared more towards reading Parquet metadata in parallel
> than towards writing it. Is this the case?
>
> Any help would be much appreciated,
>
> Michael
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
