Daniel

Can you elaborate on why you are using a broadcast variable to concatenate
many Avro files into a single ORC file? Have a look at wholeTextFiles on
SparkContext.

SparkContext.wholeTextFiles lets you read a directory containing multiple
small text files, and returns each of them as (filename, content) pairs.
This is in contrast with textFile, which would return one record per line
in each file.

You can then process this RDD in parallel across the cluster, convert it to
a DataFrame and save it as an ORC file.
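
Something along these lines might work as a starting point. It is only a
rough sketch: it assumes Spark 2.x or later with PySpark, plain text inputs
with one record per line, and made-up names (file_to_rows, small_files_to_orc,
the source_file and line columns). Binary formats such as Avro would need a
different reader than wholeTextFiles, so treat the parsing step as a
placeholder.

```python
import os


def file_to_rows(pair):
    """Turn one (path, content) pair into (file name, line) rows.

    Hypothetical helper: assumes one record per non-blank line.
    """
    path, content = pair
    name = os.path.basename(path)
    return [(name, line) for line in content.splitlines() if line.strip()]


def small_files_to_orc(input_dir, output_dir):
    """Read a directory of small text files and write them out as ORC."""
    # Imported here so file_to_rows can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("small-files-to-orc").getOrCreate()

    # One record per file: (path, full file content).
    pairs = spark.sparkContext.wholeTextFiles(input_dir)

    # Parsing runs in parallel on the executors, one file at a time.
    rows = pairs.flatMap(file_to_rows)

    # Column names are illustrative assumptions.
    df = spark.createDataFrame(rows, ["source_file", "line"])

    # coalesce(1) yields a single part file inside output_dir.
    df.coalesce(1).write.mode("overwrite").orc(output_dir)

    spark.stop()
```

You would call small_files_to_orc with your own input and output paths. The
coalesce(1) is only there to get a single output file; with a lot of data you
may prefer to drop it and accept several part files.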

Regards
Deenar
