Daniel,

Can you elaborate on why you are using a broadcast variable to concatenate many Avro files into a single ORC file? Have a look at wholeTextFiles on SparkContext.
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files and returns each of them as a (filename, content) pair. This is in contrast with textFile, which returns one record per line in each file. You can then process this RDD in parallel over the cluster, convert it to a DataFrame and save it as an ORC file.

Regards,
Deenar
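
P.S. A rough sketch of what I mean, in PySpark. Treat it as an illustration under assumptions rather than a drop-in solution: it assumes Spark 2.x or later (for SparkSession and built-in ORC writing), the names to_row and main and the column names are made up for the example, and the input and output paths are whatever you pass in. It also treats the inputs as text, which is what wholeTextFiles expects; for binary Avro input, SparkContext.binaryFiles or an Avro data source is probably a better fit.

```python
import os


def to_row(pair):
    """Turn one (path, content) pair from wholeTextFiles into a flat row.

    Hypothetical helper: keeps only the base file name and the raw content.
    """
    path, content = pair
    return (os.path.basename(path), content)


def main(input_dir, output_dir):
    # Imported here so the helper above can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("small-files-to-orc").getOrCreate()

    # One record per file: (full path, entire file content).
    pairs = spark.sparkContext.wholeTextFiles(input_dir)

    # The map runs in parallel on the executors; no broadcast variable is involved.
    df = pairs.map(to_row).toDF(["filename", "content"])

    # coalesce(1) is an assumption: it forces a single output part file,
    # which is only sensible if the combined data fits in one task.
    df.coalesce(1).write.mode("overwrite").orc(output_dir)

    spark.stop()
```

You would call main with your own input directory and output location. Any real parsing of the file content would go in to_row (or a further map step) before the conversion to a DataFrame.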