Hi,
Thanks for your reply. I'm trying to do what you suggested but I get:

scala> sqlContext.sql("CREATE TEMPORARY TABLE data USING org.apache.spark.sql.parquet OPTIONS (path '/requests_parquet.toomany')")
java.lang.RuntimeException: Failed to load class for data source: org.apache.spark.sql.parquet
        at scala.sys.package$.error(package.scala:27)

Any idea why?
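One possible cause, for reference: the org.apache.spark.sql.parquet data source only ships with Spark 1.2 and later (Michael notes below that it was added in 1.2), so a shell running an older build would fail to load the class. A minimal check, assuming a standard spark-shell session and the same path as above:

// Sketch only: confirm the running version; the data source API used by the
// CREATE TEMPORARY TABLE ... USING statement is new in Spark 1.2.
println(sc.version)

// Fallback that works on earlier 1.x releases (same path as above):
val parquetData = sqlContext.parquetFile("/requests_parquet.toomany")
parquetData.registerTempTable("data")   // registerAsTable on older 1.0.x shells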
Thanks,
Daniel

On Mon, Nov 24, 2014 at 11:30 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Parquet does a lot of serial metadata operations on the driver, which makes
> it really slow when you have a very large number of files (especially if
> you are reading from something like S3). This is something we are aware of
> and that I'd really like to improve in 1.3.
>
> You might try the (brand new and very experimental) parquet support that I
> added into 1.2 at the last minute in an attempt to make our metadata
> handling more efficient.
>
> Basically, you load the parquet files using the new data source API instead
> of using parquetFile:
>
> CREATE TEMPORARY TABLE data
> USING org.apache.spark.sql.parquet
> OPTIONS (
>   path 'path/to/parquet'
> )
>
> This will at least parallelize the retrieval of file status objects, but
> there is a lot more optimization that I hope to do.
>
> On Sat, Nov 22, 2014 at 1:53 PM, Daniel Haviv <danielru...@gmail.com> wrote:
>
>> Hi,
>> I'm ingesting a lot of small JSON files and converting them to unified
>> parquet files, but even the unified files are fairly small (~10MB).
>> I want to run a merge operation every hour on the existing files, but it
>> takes a lot of time for such a small amount of data: about 3 GB spread
>> over 3000 parquet files.
>>
>> Basically, what I'm doing is loading the files in the existing directory,
>> coalescing them and saving to the new dir:
>>
>> val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")
>> parquetFiles.coalesce(2).saveAsParquetFile("/requests_merged/$currday")
>>
>> Doing this takes over an hour on my 3-node cluster...
>>
>> Is there a better way to achieve this?
>> Any ideas what could cause such a simple operation to take so long?
>>
>> Thanks,
>> Daniel
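For reference, a sketch of how the hourly merge could be expressed through the data source route Michael suggests, assuming Spark 1.2, the /requests_merged paths from the thread, and a currday date string defined elsewhere:

// Sketch only: register the small-file directory through the new data source
// API (so file-status retrieval is parallelized), then read it back as one
// table and write it out as a small number of larger parquet files.
sqlContext.sql(
  """CREATE TEMPORARY TABLE inproc
    |USING org.apache.spark.sql.parquet
    |OPTIONS (path '/requests_merged/inproc')""".stripMargin)

val merged = sqlContext.sql("SELECT * FROM inproc")   // SchemaRDD in 1.2
merged.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")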