Re: Opening many Parquet files = slow

2015-04-08 Thread Prashant Kommireddi
We noticed similar perf degradation using Parquet (outside of Spark), and it
happened due to merging of multiple schemas. Would be good to know if
disabling schema merging (if all your schemas are the same), as Michael
suggested, helps in your case.

On Wed, Apr 8, 2015 at 11:43 AM, Michael Armbrust mich...@databricks.com
wrote:

 Thanks for the report.  We improved the speed here in 1.3.1 so it would be
 interesting to know if this helps.  You should also try disabling schema
 merging if you do not need that feature (i.e. all of your files have the
 same schema).

 sqlContext.load(path, "parquet", Map("mergeSchema" -> "false"))

 On Wed, Apr 8, 2015 at 7:35 AM, Ted Yu yuzhih...@gmail.com wrote:

 You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1

 Cheers

 On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom 
 eric.eijkelenb...@gmail.com wrote:

 Hi guys

 *I’ve got:*

- 180 days of log data in Parquet.
- Each day is stored in a separate folder in S3.
- Each day consists of 20-30 Parquet files of 256 MB each.
- Spark 1.3 on Amazon EMR

 This makes approximately 5000 Parquet files with a total size of 1.5 TB.
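[Editor's note] The numbers above already hint at the cause: with ~5000 files and a 25-minute startup delay, each sequential footer open over s3n costs roughly a third of a second. A back-of-the-envelope sketch (the 28 files/day figure is an assumption within the stated 20-30 range):

```scala
// Back-of-the-envelope: why opening ~5000 Parquet footers one by one
// over s3n adds up to the observed ~25-minute delay before the first stage.
val filesPerDay = 28                 // assumed, within the stated 20-30 range
val files = 180 * filesPerDay        // ~5040 files across 180 days
val delaySeconds = 25 * 60           // observed 25-minute startup delay
val secondsPerOpen = delaySeconds.toDouble / files
println(f"$files files -> ~$secondsPerOpen%.2f s per footer read")
```

At ~0.3 s per open, the delay is dominated by per-request S3 latency, not data volume, which is why skipping the footer reads (disabling schema merging) helps.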

 *My code*:
 val in = sqlContext.parquetFile("day1", "day2", …, "day180")

 *Problem*:
 Before the very first stage is started, Spark spends about 25 minutes
 printing the following:
 ...
 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key
 'logs/=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-59' for
 reading at position '258305902'
 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key
 'logs/=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-72'
 for reading at position '260897108'
 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening '
 s3n://adt-timelord-daily-logs-pure/logs/=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124'
 for reading
 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening key
 'logs/=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for
 reading at position '261259189'
 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening '
 s3n://adt-timelord-daily-logs-pure/logs/=2014/mm=10/dd=15/bc9c8fdf-dc67-441a-8eda-9a06f032158f-000102'
 for reading
 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening '
 s3n://adt-timelord-daily-logs-pure/logs/=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-60'
 for reading
 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening '
 s3n://adt-timelord-daily-logs-pure/logs/=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-73'
 for reading
 … etc

 It looks like Spark is opening each file before it actually does any
 work. This means a delay of 25 minutes when working with Parquet files.
 Previously, we used LZO files and did not experience this problem.

 *Bonus info: *
 This also happens when I use auto partition discovery (i.e.
 sqlContext.parquetFile("/path/to/logsroot/")).

 What can I do to avoid this?

 Thanks in advance!

 Eric Eijkelenboom






Job submission API

2015-04-07 Thread Prashant Kommireddi
Hello folks,

Newbie here! Just had a quick question - is there a job submission API, such
as the one in Hadoop
https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/Job.html#submit()
to submit Spark jobs to a YARN cluster? I see in the examples that
bin/spark-submit is what's out there, but I couldn't find any APIs around it.
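
[Editor's note] As of Spark 1.3 there is no stable public submission API; bin/spark-submit is the supported entry point. A common workaround is to assemble the spark-submit command programmatically and launch it as a subprocess. A minimal sketch, where the spark-submit path, main class, jar path, and settings are all hypothetical:

```scala
import scala.collection.JavaConverters._

// Hypothetical example: build (but not yet launch) a spark-submit
// invocation targeting a YARN cluster.
val cmd = Seq(
  "/opt/spark/bin/spark-submit",   // path to spark-submit (assumed)
  "--master", "yarn-cluster",
  "--class", "com.example.MyApp",  // hypothetical main class
  "--num-executors", "4",
  "/opt/jobs/my-app.jar"           // hypothetical application jar
)
val pb = new ProcessBuilder(cmd.asJava)
// pb.start() would submit the job; in yarn-cluster mode the returned
// Process's exit status reflects the submission, not the job itself.
println(pb.command().asScala.mkString(" "))
```

Worth noting: a programmatic launcher (org.apache.spark.launcher.SparkLauncher, SPARK-4924) was added for this purpose in Spark 1.4, wrapping essentially this pattern behind a builder-style API.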

Thanks,
Prashant