I agree, but it's a constraint I have to deal with.
The idea is to load these files and merge them into ORC.
When using Hive on Tez it takes less than a minute.
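
For context, the load-and-merge step is roughly the following (a minimal sketch, assuming a spark-shell with Spark 1.5 and the spark-avro package on the classpath; the paths and the coalesce factor are illustrative, not my actual values):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Read the whole directory of small Avro files in a single call,
// rather than building a separate DataFrame per file.
val avroDf = hiveContext.read
  .format("com.databricks.spark.avro")
  .load("/data/incoming/avro")   // directory holding the ~10k single-record files

// Collapse the many tiny partitions before writing, so the ORC output
// is a handful of reasonably sized files instead of thousands of fragments.
avroDf.coalesce(8)
  .write
  .format("orc")
  .save("/data/merged/orc")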

Daniel

> On Sep 22, 2015, at 16:00, Jonathan Coveney <jcove...@gmail.com> wrote:
> 
> Having a file per record is pretty inefficient on almost any file system.
> 
> On Tuesday, September 22, 2015, Daniel Haviv 
> <daniel.ha...@veracity-group.com> wrote:
>> Hi,
>> We are trying to load around 10k Avro files (each file holds only one 
>> record) using spark-avro, but it takes over 15 minutes to load.
>> It seems that most of the work is being done at the driver, where it creates 
>> a broadcast variable for each file.
>> 
>> Any idea why it is behaving that way?
>> Thank you.
>> Daniel
