You'll need to be running a very recent version of Spark SQL (1.2 or later), as this feature was just added.
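As a quick check, sc.version reports the version of the running SparkContext (the value below is just what a 1.2 build would print):

scala> sc.version
res0: String = 1.2.0

If this prints anything older than 1.2.0, the org.apache.spark.sql.parquet data source won't exist in your build, which matches the "Failed to load class for data source" error quoted below.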
On Tue, Nov 25, 2014 at 1:01 AM, Daniel Haviv <danielru...@gmail.com> wrote:

> Hi,
> Thanks for your reply. I'm trying to do what you suggested, but I get:
>
> scala> sqlContext.sql("CREATE TEMPORARY TABLE data USING
> org.apache.spark.sql.parquet OPTIONS (path '/requests_parquet.toomany')")
>
> java.lang.RuntimeException: Failed to load class for data source:
> org.apache.spark.sql.parquet
>         at scala.sys.package$.error(package.scala:27)
>
> Any idea why?
>
> Thanks,
> Daniel
>
> On Mon, Nov 24, 2014 at 11:30 PM, Michael Armbrust
> <mich...@databricks.com> wrote:
>
>> Parquet does a lot of serial metadata operations on the driver, which
>> makes it really slow when you have a very large number of files
>> (especially if you are reading from something like S3). This is
>> something we are aware of and that I'd really like to improve in 1.3.
>>
>> You might try the (brand new and very experimental) parquet support
>> that I added into 1.2 at the last minute, in an attempt to make our
>> metadata handling more efficient.
>>
>> Basically, you load the parquet files using the new data source API
>> instead of parquetFile:
>>
>> CREATE TEMPORARY TABLE data
>> USING org.apache.spark.sql.parquet
>> OPTIONS (
>>   path 'path/to/parquet'
>> )
>>
>> This will at least parallelize the retrieval of file status objects,
>> but there is a lot more optimization that I hope to do.
>>
>> On Sat, Nov 22, 2014 at 1:53 PM, Daniel Haviv <danielru...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I'm ingesting a lot of small JSON files and converting them to unified
>>> parquet files, but even the unified files are fairly small (~10MB).
>>> I want to run a merge operation every hour on the existing files, but
>>> it takes a lot of time for such a small amount of data: about 3 GB
>>> spread over 3,000 parquet files.
>>>
>>> Basically what I'm doing is load the files in the existing directory,
>>> coalesce them, and save them to the new dir:
>>>
>>> val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")
>>> parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")
>>>
>>> Doing this takes over an hour on my 3-node cluster...
>>>
>>> Is there a better way to achieve this?
>>> Any idea what can cause such a simple operation to take so long?
>>>
>>> Thanks,
>>> Daniel
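Putting the two suggestions together, a minimal sketch of the hourly merge on Spark 1.2 (the table name and paths follow the thread; currday is a hypothetical placeholder you would derive from your scheduler):

// Spark 1.2 shell; sqlContext is the usual SQLContext.
// Register the source directory through the data source API so the
// file status objects are retrieved in parallel rather than serially.
sqlContext.sql(
  """CREATE TEMPORARY TABLE inproc
    |USING org.apache.spark.sql.parquet
    |OPTIONS (path '/requests_merged/inproc')""".stripMargin)

// Read the registered table back, collapse it to two partitions, and
// write the merged result to the day's output directory.
val currday = "2014-11-25"  // hypothetical; derive from your scheduler
sqlContext.sql("SELECT * FROM inproc")
  .coalesce(2)
  .saveAsParquetFile(s"/requests_merged/$currday")

The coalesce-and-save step is the same as before; the gain comes from the table registration, which parallelizes the file-status retrieval that parquetFile performs serially on the driver.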