Hi guys
Regarding Parquet files: I have Spark 1.2.0, and reading 27 Parquet files
(250 MB/file) takes 4 minutes.
I have a cluster with 4 nodes, and this seems too slow to me.
The load function is not available in Spark 1.2, so I can't test it.
Regards.
Miguel.
On Mon, Apr 13, 2015 at 8:12 PM,
Hi guys
Does anyone know how to stop Spark from opening all Parquet files before
starting a job? This is quite a show stopper for me, since I have 5000 Parquet
files on S3.
Recap of what I tried:
1. Disable schema merging with: sqlContext.load("parquet", Map("mergeSchema" ->
"false", "path" ->
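A complete form of that call in Spark 1.3 would look roughly like the line below; the S3 path here is only a placeholder, not the poster's actual path:

  sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "s3n://my-bucket/logs/"))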
You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1
Cheers
On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom
eric.eijkelenb...@gmail.com wrote:
Hi guys
I’ve got:
- 180 days of log data in Parquet.
- Each day is stored in a separate folder in S3.
- Each day consists of 20-30 Parquet files of 256 MB each.
- Spark 1.3 on Amazon EMR.
This makes approximately 5000 Parquet files with a total size of 1.5 TB.
My code:
val in =
Thanks for the report. We improved the speed here in 1.3.1, so it would be
interesting to know if this helps. You should also try disabling schema
merging if you do not need that feature (i.e. all of your files have the
same schema).
sqlContext.load(path, "parquet", Map("mergeSchema" -> "false"))
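For reference, a minimal end-to-end sketch of this suggestion; it assumes an existing SparkContext named sc and uses a placeholder S3 path, so it is an illustration of the API rather than the original job:

  import org.apache.spark.sql.SQLContext

  // Assumes an existing SparkContext `sc`; the S3 path is a placeholder.
  val sqlContext = new SQLContext(sc)

  // Load Parquet through the Spark 1.3 data source API with schema merging disabled.
  val logs = sqlContext.load(
    "parquet",
    Map("mergeSchema" -> "false", "path" -> "s3n://my-bucket/logs/"))

  // Trigger a small job to compare startup/planning time against the default settings.
  println(logs.count())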
On Wed,
Hi Eric - Would you mind trying either disabling schema merging, as Michael
suggested, or disabling the new Parquet data source with
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")?
Cheng
On 4/9/15 2:43 AM, Michael Armbrust wrote:
Thanks for the report. We improved the speed ...
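A hedged sketch of the second workaround Cheng mentions, i.e. falling back to the pre-1.3 Parquet code path; the S3 path is again a placeholder and sqlContext is assumed to exist as in the snippets above:

  // Disable the new Parquet data source, then read through the older parquetFile API.
  sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
  val logsOld = sqlContext.parquetFile("s3n://my-bucket/logs/")
  println(logsOld.count())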
We noticed similar perf degradation using Parquet (outside of Spark), and it
happened due to merging of multiple schemas. It would be good to know if
disabling schema merging (if the schemas are the same), as Michael suggested,
helps in your case.
On Wed, Apr 8, 2015 at 11:43 AM, Michael Armbrust