Hi,
I'm ingesting a lot of small JSON files and converting them to unified parquet
files, but even the unified files are fairly small (~10 MB).
I want to run a merge operation every hour on the existing files, but it
takes a lot of time for such a small amount of data: about 3 GB spread over
3,000 parquet files.

Basically what I'm doing is loading the files from the existing directory,
coalescing them, and saving to a new directory:
val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")

parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")
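
For reference, the whole hourly step is roughly the following (the
SparkContext/SQLContext setup and the currday value are simplified here,
not my exact code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("merge-parquet"))
val sqlContext = new SQLContext(sc)

// e.g. "2015-06-01"; in the real job this comes from the scheduler
val currday = "2015-06-01"

// read all the small parquet files written so far
val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")

// collapse into 2 partitions and write a single merged directory
parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")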

Doing this takes over an hour on my 3-node cluster...

Is there a better way to achieve this?
Any ideas what could cause such a simple operation to take so long?

Thanks,
Daniel
