Hi,

I have a program that loads a single Avro file using Spark SQL, queries it,
transforms it, and then writes out the result. The file is loaded with:

import com.databricks.spark.avro._

val records = sqlContext.avroFile(filePath)
records.registerTempTable("data")
...


Now I want to run it over tens of thousands of Avro files (all with schemas
that contain the fields I'm interested in).

Is it possible to load multiple Avro files recursively from a top-level
directory using wildcards? All my Avro files are stored under
s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
them.
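
Something like the following is what I was hoping would work. This is just
a sketch: I'm assuming here that avroFile accepts Hadoop-style glob
patterns, which I haven't been able to confirm:

// Hypothetical: a single glob covering every dated subdirectory
val records = sqlContext.avroFile("s3://my-bucket/avros/*/DATE/*.avro")
records.registerTempTable("data")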

If that's not possible, is there some way to load multiple Avro files into
the same table/RDD so the whole dataset can be processed? In that case I'd
have to supply the path to each file explicitly, which I *really* want to
avoid.
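
For illustration, I mean something along these lines (the hard-coded paths
are placeholders; in practice there would be tens of thousands of them, and
I'm assuming the SchemaRDDs returned by avroFile support unionAll):

// Placeholder paths, just to show the shape of the approach
val paths = Seq(
  "s3://my-bucket/avros/a/DATE/part-0.avro",
  "s3://my-bucket/avros/b/DATE/part-0.avro"
)
// Load each file separately, then union everything into one SchemaRDD
val all = paths.map(p => sqlContext.avroFile(p)).reduce(_ unionAll _)
all.registerTempTable("data")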

Thanks
