Your performance problem sounds like it is in the driver, which is trying to broadcast 10k files all by itself, so it becomes the bottleneck. What you want is simply to transform the data from Avro format to another format, one file at a time. In MR, most likely each mapper processes one file, so you utilize the whole cluster instead of just the driver. I am not sure exactly how to help you, but to do that in Spark:

1) Avoid the broadcast from the driver, and let each Spark task process one file. Maybe use something like Hadoop's NLineInputFormat on a text file that lists all the filenames of your data, so each Spark task receives the HDFS location of one file and then starts the transform logic. In this case, you transform all your small files concurrently, using all the available cores of your executors.

2) If the above sounds too complex, you need to find a way to stop the Spark driver from broadcasting the small files. That sounds like the normal way to handle small files, but I cannot find a configuration to force Spark to disable it.

Yong
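Spark specifics aside, the one-file-per-task fan-out described above can be sketched in plain Python with a thread pool: each worker "task" converts exactly one small file, and no single process touches all 10k inputs sequentially. This is only an illustration of the pattern, not the Spark solution itself; the Avro/ORC readers are outside the standard library, so a trivial text transform stands in for the real conversion, and all names (`convert_one`, `convert_all`) are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def convert_one(path: str) -> str:
    """'Task' that converts exactly one small file.

    Stand-in for a real Avro -> ORC conversion (those readers are not in
    the stdlib): here we just upper-case the contents and write a sibling
    .out file next to the input."""
    with open(path) as f:
        data = f.read()
    out_path = path + ".out"
    with open(out_path, "w") as f:
        f.write(data.upper())
    return out_path

def convert_all(paths, workers=8):
    """Fan the file list out across workers, one file per task, mirroring
    one-mapper-per-file in MR (or one Spark task per HDFS path). The work
    is I/O-bound, so threads are enough for the sketch."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so outputs line up with paths.
        return list(pool.map(convert_one, paths))

if __name__ == "__main__":
    import os, tempfile
    # Simulate a directory of many tiny input files (one record each).
    d = tempfile.mkdtemp()
    paths = []
    for i in range(100):
        p = os.path.join(d, f"part-{i:05d}.txt")
        with open(p, "w") as f:
            f.write(f"record {i}")
        paths.append(p)
    outputs = convert_all(paths)
    print(len(outputs))
```

In Spark terms, `paths` would come from a file listing HDFS locations (the NLineInputFormat idea), and the pool would be replaced by tasks running on the executors' cores.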
From: daniel.ha...@veracity-group.com
Subject: Re: spark-avro takes a lot time to load thousands of files
Date: Tue, 22 Sep 2015 16:54:26 +0300
CC: user@spark.apache.org
To: jcove...@gmail.com

I agree, but it's a constraint I have to deal with. The idea is to load these files and merge them into ORC. When using Hive on Tez it takes less than a minute.

Daniel

On 22 Sep 2015, at 16:00, Jonathan Coveney <jcove...@gmail.com> wrote:

having a file per record is pretty inefficient on almost any file system

On Tuesday, September 22, 2015, Daniel Haviv <daniel.ha...@veracity-group.com> wrote:

Hi,
We are trying to load around 10k avro files (each file holds only one record) using spark-avro, but it takes over 15 minutes to load. It seems that most of the work is being done at the driver, where it creates a broadcast variable for each file. Any idea why it is behaving that way?
Thank you.
Daniel