Hi,
Thanks for your reply. I'm trying to do what you suggested but I get:

scala> sqlContext.sql("CREATE TEMPORARY TABLE data USING org.apache.spark.sql.parquet OPTIONS (path '/requests_parquet.toomany')")
java.lang.RuntimeException: Failed to load class for data source: org.apache.spark.sql.parquet
        at scala.sys.package$.error(package.scala:27)

Any idea why?
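One possible cause, for reference: the org.apache.spark.sql.parquet data source only ships with Spark 1.2 and later (Michael notes below that it was added in 1.2), so a shell running an older build would fail to load the class. A minimal check, assuming a standard spark-shell session and the same path as above:

// Sketch only: confirm the running version; the data source API used by the
// CREATE TEMPORARY TABLE ... USING statement is new in Spark 1.2.
println(sc.version)

// Fallback that works on earlier 1.x releases (same path as above):
val parquetData = sqlContext.parquetFile("/requests_parquet.toomany")
parquetData.registerTempTable("data")   // registerAsTable on older 1.0.x shells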
Thanks,
Daniel

On Mon, Nov 24, 2014 at 11:30 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Parquet does a lot of serial metadata operations on the driver, which makes
> it really slow when you have a very large number of files (especially if
> you are reading from something like S3). This is something we are aware of
> and that I'd really like to improve in 1.3.
>
> You might try the (brand new and very experimental) parquet support that I
> added into 1.2 at the last minute in an attempt to make our metadata
> handling more efficient.
>
> Basically, you load the parquet files using the new data source API instead
> of using parquetFile:
>
> CREATE TEMPORARY TABLE data
> USING org.apache.spark.sql.parquet
> OPTIONS (
>   path 'path/to/parquet'
> )
>
> This will at least parallelize the retrieval of file status objects, but
> there is a lot more optimization that I hope to do.
>
> On Sat, Nov 22, 2014 at 1:53 PM, Daniel Haviv <danielru...@gmail.com> wrote:
>
>> Hi,
>> I'm ingesting a lot of small JSON files and converting them to unified
>> parquet files, but even the unified files are fairly small (~10MB).
>> I want to run a merge operation every hour on the existing files, but it
>> takes a lot of time for such a small amount of data: about 3 GB spread
>> over 3000 parquet files.
>>
>> Basically, what I'm doing is loading the files in the existing directory,
>> coalescing them and saving to the new dir:
>>
>> val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")
>> parquetFiles.coalesce(2).saveAsParquetFile("/requests_merged/$currday")
>>
>> Doing this takes over an hour on my 3-node cluster...
>>
>> Is there a better way to achieve this?
>> Any ideas what could cause such a simple operation to take so long?
>>
>> Thanks,
>> Daniel
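For reference, a sketch of how the hourly merge could be expressed through the data source route Michael suggests, assuming Spark 1.2, the /requests_merged paths from the thread, and a currday date string defined elsewhere:

// Sketch only: register the small-file directory through the new data source
// API (so file-status retrieval is parallelized), then read it back as one
// table and write it out as a small number of larger parquet files.
sqlContext.sql(
  """CREATE TEMPORARY TABLE inproc
    |USING org.apache.spark.sql.parquet
    |OPTIONS (path '/requests_merged/inproc')""".stripMargin)

val merged = sqlContext.sql("SELECT * FROM inproc")   // SchemaRDD in 1.2
merged.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")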