Hi, I'm ingesting a lot of small JSON files and converting them to unified parquet files, but even the unified files are fairly small (~10 MB each). I want to run a merge operation every hour on the existing files, but it takes a long time for such a small amount of data: about 3 GB spread over 3,000 parquet files.
Basically what I'm doing is loading the files from the existing directory, coalescing them, and saving to a new dir:

    val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")
    parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")

Doing this takes over an hour on my 3-node cluster... Is there a better way to achieve this? Any ideas what could cause such a simple operation to take so long?

Thanks,
Daniel
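P.S. One thing I've been wondering about, in case it's relevant: since coalesce(2) avoids a shuffle, my understanding is that it collapses the read stage itself down to 2 tasks, so two executors end up opening all 3,000 files one by one. Would switching to repartition (which does shuffle, but keeps the read stage parallel) be the right fix? This untested sketch, using the same paths as above, is what I have in mind:

    // Untested idea: keep one read task per input file,
    // then shuffle the data down to 2 output files.
    val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")
    parquetFiles
      .repartition(2) // full shuffle instead of coalesce
      .saveAsParquetFile(s"/requests_merged/$currday")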