I agree, but it's a constraint I have to deal with. The idea is to load these files and merge them into ORC. When using Hive on Tez it takes less than a minute.
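For reference, a minimal sketch of what I'm doing (assuming Spark 1.5.x with the com.databricks:spark-avro package on the classpath; the paths are placeholders). The point is to load all the files through one glob path in a single job rather than one DataFrame per file:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("avro-to-orc")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// One load over a glob picks up all ~10k files in a single job,
// instead of creating a DataFrame per input file.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("/path/to/avro/dir/*.avro")

// Coalesce to cut the number of output files before writing ORC.
df.coalesce(1).write.format("orc").save("/path/to/orc/out")
```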
Daniel

> On 22 Sep 2015, at 16:00, Jonathan Coveney <jcove...@gmail.com> wrote:
>
> having a file per record is pretty inefficient on almost any file system
>
> On Tuesday, 22 September 2015, Daniel Haviv <daniel.ha...@veracity-group.com> wrote:
>> Hi,
>> We are trying to load around 10k Avro files (each file holds only one record) using spark-avro, but it takes over 15 minutes to load.
>> It seems that most of the work is being done at the driver, where it creates a broadcast variable for each file.
>>
>> Any idea why it is behaving that way?
>> Thank you.
>> Daniel