Re: Opening many Parquet files = slow

2015-04-15 Thread Masf
Hi guys. Regarding Parquet files: I have Spark 1.2.0, and reading 27 Parquet files (250 MB/file) takes 4 minutes. I have a cluster with 4 nodes, and that seems too slow to me. The load function is not available in Spark 1.2, so I can't test it. Regards, Miguel. On Mon, Apr 13, 2015 at 8:12 PM,
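For reference, Spark 1.2 can still read Parquet through the older parquetFile API, so the workload is testable without load. A minimal sketch, assuming an existing SparkContext and a hypothetical S3 path:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
    // parquetFile is the pre-1.3 entry point for reading Parquet
    val logs = sqlContext.parquetFile("s3n://my-bucket/logs/")  // hypothetical path
    logs.registerTempTable("logs")
    sqlContext.sql("SELECT COUNT(*) FROM logs").collect().foreach(println)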

Re: Opening many Parquet files = slow

2015-04-13 Thread Eric Eijkelenboom
Hi guys. Does anyone know how to stop Spark from opening all Parquet files before starting a job? This is quite a showstopper for me, since I have 5000 Parquet files on S3. Recap of what I tried: 1. Disable schema merging with: sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" ->
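The truncated call above matches the Map-based load overload in Spark 1.3, which takes the path through the options map. A minimal sketch of the complete pattern, with a hypothetical bucket path:

    // Disable schema merging through the data source options map
    val df = sqlContext.load(
      "parquet",
      Map("mergeSchema" -> "false",
          "path"        -> "s3n://my-bucket/logs/"))  // hypothetical path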

Re: Opening many Parquet files = slow

2015-04-08 Thread Ted Yu
You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1 Cheers. On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom eric.eijkelenb...@gmail.com wrote: Hi guys *I've got:* - 180 days of log data in Parquet. - Each day is stored in a separate folder in S3. - Each day

Opening many Parquet files = slow

2015-04-08 Thread Eric Eijkelenboom
Hi guys. I've got: 180 days of log data in Parquet. Each day is stored in a separate folder in S3. Each day consists of 20-30 Parquet files of 256 MB each. Spark 1.3 on Amazon EMR. This makes approximately 5000 Parquet files with a total size of 1.5 TB. My code: val in =
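The snippet is cut off, but a minimal reconstruction of reading all 180 day folders at once on Spark 1.3 might look like the sketch below; the bucket layout and glob are hypothetical:

    // Read every day folder in one pass; the path glob expands to all folders
    val in = sqlContext.parquetFile("s3n://my-bucket/logs/*/")  // hypothetical layout
    println(in.count())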

Re: Opening many Parquet files = slow

2015-04-08 Thread Michael Armbrust
Thanks for the report. We improved the speed here in 1.3.1, so it would be interesting to know if that helps. You should also try disabling schema merging if you do not need that feature (i.e. all of your files have the same schema): sqlContext.load(path, "parquet", Map("mergeSchema" -> "false")) On Wed,
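An equivalent form passes the path through the options map, which matches the load(source, options) signature in Spark 1.3; a minimal sketch, assuming path holds the S3 location:

    // Same idea, with the path supplied via the options map
    val df = sqlContext.load("parquet",
      Map("path" -> path, "mergeSchema" -> "false"))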

Re: Opening many Parquet files = slow

2015-04-08 Thread Cheng Lian
Hi Eric - Would you mind trying either disabling schema merging, as Michael suggested, or disabling the new Parquet data source with: sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false") Cheng On 4/9/15 2:43 AM, Michael Armbrust wrote: Thanks for the report. We improved the speed
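Cheng's fallback switches Spark 1.3 back to the pre-data-source Parquet code path; a minimal sketch combining both steps, with a hypothetical path:

    // Disable the new Parquet data source, then read as before
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
    val logs = sqlContext.parquetFile("s3n://my-bucket/logs/")  // hypothetical path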

Re: Opening many Parquet files = slow

2015-04-08 Thread Prashant Kommireddi
We noticed similar performance degradation using Parquet (outside of Spark), and it happened due to merging of multiple schemas. It would be good to know whether disabling schema merging (if the schemas are the same), as Michael suggested, helps in your case. On Wed, Apr 8, 2015 at 11:43 AM, Michael Armbrust