Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Deenar Toraskar
Daniel, can you elaborate on why you are using a broadcast variable to concatenate many Avro files into a single ORC file? Look at wholeTextFiles on the Spark context. SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename,

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
I agree, but it's a constraint I have to deal with. The idea is to load these files and merge them into ORC. When using Hive on Tez it takes less than a minute. Daniel > On 22 Sep 2015, at 16:00, Jonathan Coveney wrote: > > having a file per record is pretty inefficient on

RE: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread java8964
Your performance problem sounds like it is in the driver, which is trying to broadcast 10k files all by itself and so becomes the bottleneck. What you want is just to convert the data from Avro, file by file, into another format. In MR, most likely each mapper processes one file, and you utilized the

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Jonathan Coveney
Having a file per record is pretty inefficient on almost any file system. On Tuesday, 22 September 2015, Daniel Haviv <daniel.ha...@veracity-group.com> wrote: > Hi, > We are trying to load around 10k Avro files (each file holds only one > record) using spark-avro but it takes over 15