Parquet does a lot of serial metadata operations on the driver, which makes
it really slow when you have a very large number of files (especially if
you are reading from something like S3).  This is something we are aware of
and something I'd really like to improve in 1.3.

You might try the (brand new and very experimental) Parquet support that I
added to 1.2 at the last minute in an attempt to make our metadata handling
more efficient.

Basically, you load the Parquet files through the new data sources API
instead of calling parquetFile:

CREATE TEMPORARY TABLE data
USING org.apache.spark.sql.parquet
OPTIONS (
  path 'path/to/parquet'
)

This will at least parallelize the retrieval of the file status objects, but
there is a lot more optimization that I hope to do.
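
For example, from Scala you could issue the same DDL through sqlContext.sql
and then work with the registered table as usual.  This is just a rough
sketch assuming Spark 1.2 with an existing SQLContext named sqlContext; the
table name and paths are placeholders:

// Register the Parquet directory through the new data sources API
sqlContext.sql("""
  CREATE TEMPORARY TABLE data
  USING org.apache.spark.sql.parquet
  OPTIONS (
    path 'path/to/parquet'
  )
""")

// Query it like any other table; metadata discovery goes through the
// data source code path instead of parquetFile
val merged = sqlContext.sql("SELECT * FROM data")
merged.coalesce(2).saveAsParquetFile("path/to/output")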

On Sat, Nov 22, 2014 at 1:53 PM, Daniel Haviv <danielru...@gmail.com> wrote:

> Hi,
> I'm ingesting a lot of small JSON files and converting them to unified
> parquet files, but even the unified files are fairly small (~10MB).
> I want to run a merge operation every hour on the existing files, but it
> takes a lot of time for such a small amount of data: about 3 GB spread over
> 3000 parquet files.
>
> Basically what I'm doing is loading the files in the existing directory,
> coalescing them and saving to the new dir:
> val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")
>
> parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")
>
> Doing this takes over an hour on my 3 node cluster...
>
> Is there a better way to achieve this ?
> Any ideas what can cause such a simple operation take so long?
>
> Thanks,
> Daniel
>
