Thanks for the suggestion, Cheng. I will try that today. Are there any implications when reading the parquet data if there are no summary files present?
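
If I'm understanding the workaround correctly, the change on our side would look roughly like this (untested; the write itself is just our existing command from the thread below):

    import org.apache.spark.sql.SaveMode

    // Disable Parquet summary files before the partitioned append, as suggested,
    // so commitJob should no longer have to scan and merge the footers of every
    // file under the base directory.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    df.coalesce(partitions)
      .write
      .mode(SaveMode.Append)
      .partitionBy("dt", "outcome")
      .parquet("s3n://root/parquet/dir/")
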
Michael

On Sat, Jul 25, 2015 at 2:28 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
> The time is probably spent by ParquetOutputFormat.commitJob. While
> committing a successful write job, Parquet writes a pair of summary files,
> containing metadata like the schema, user-defined key-value metadata, and
> Parquet row group information. To gather all the necessary information,
> Parquet scans the footers of all the data files within the base directory
> and tries to merge them. The more data there is, the longer it takes.
>
> One possible workaround is to disable summary files by setting
> "parquet.enable.summary-metadata" to false in sc.hadoopConfiguration.
>
> Cheng
>
> On 7/25/15 4:15 AM, Michael Kelly wrote:
>> Hi,
>>
>> We are converting some csv log files to parquet, but the job is getting
>> progressively slower the more files we add to the parquet folder.
>>
>> The parquet files are being written to s3. We are using a spark
>> standalone cluster running on ec2, and the spark version is 1.4.1. The
>> parquet files are partitioned on two columns, first the date, then
>> another column. We write the data one day at a time, and the final
>> size of the data for one day, once it is written out to parquet, is
>> about 150GB.
>>
>> We coalesce the data before it is written out, and in total per day we
>> have 615 partitions/files written out to s3. We use SaveMode.Append
>> since we are always writing to the same directory. This is the command
>> we use to write the data:
>>
>> df.coalesce(partitions).write.mode(SaveMode.Append).partitionBy("dt","outcome").parquet("s3n://root/parquet/dir/")
>>
>> Writing the parquet data to an empty directory completes almost
>> immediately, whereas after 12 days' worth of data has been written,
>> each parquet write takes up to 20 minutes (and there are 4 writes per
>> day).
>>
>> Questions:
>> Is there a more efficient way to write the data? I'm guessing that the
>> update to the parquet metadata is the issue, and that it happens in a
>> serial fashion.
>> Is there a way to write the metadata in the partitioned folders, and
>> would this speed things up?
>> Would this have any implications for reading in the data?
>> I came across DirectParquetOutputCommitter, but the source for it says
>> it cannot be used with Append mode. Would this be useful?
>>
>> I came across this issue -
>> https://issues.apache.org/jira/browse/SPARK-8125 - and the corresponding
>> pull request - https://github.com/apache/spark/pull/7396 - but it looks
>> like they are more geared towards reading parquet metadata in parallel
>> than writing it. Is this the case?
>>
>> Any help would be much appreciated,
>>
>> Michael
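
For context, this is roughly how we read the data back at the moment (the date value below is just an example). I'm assuming the read path stays the same without the summary files, with the schema picked up from the data file footers instead, but please correct me if that's wrong:

    // Read the whole partitioned dataset back from s3; without summary files
    // the schema is presumably discovered from the footers of the part files.
    val readBack = sqlContext.read.parquet("s3n://root/parquet/dir/")

    // Partition pruning on the directory columns should still work as before.
    val oneDay = readBack.filter("dt = '2015-07-12'")
    oneDay.count()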