I've tried this now. Spark can load multiple Avro files from the same
directory if you pass the path to that directory. However, passing multiple
paths separated by commas didn't work.

Is there any way to load all avro files in multiple directories using
sqlContext.avroFile?
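
One possible workaround, sketched here under stated assumptions rather than
confirmed against spark-avro: load each directory separately and union the
results. The helper name loadAll is hypothetical, and it is written
generically so that it runs without Spark. With spark-avro it would presumably
be called as shown in the trailing comment, assuming every directory yields
the same schema, since unionAll expects matching columns.

```scala
object LoadMany {
  // Hypothetical helper: apply `load` to every path, then fold the results
  // together with `union`. Generic in A, so it has no Spark dependency.
  def loadAll[A](paths: Seq[String])(load: String => A)(union: (A, A) => A): A = {
    require(paths.nonEmpty, "at least one path is required")
    paths.map(load).reduce(union)
  }
}

// Assumed usage with Spark SQL (not verified against spark-avro):
//   val dirs = Seq(dirA, dirB)   // placeholder directory paths
//   val records = LoadMany.loadAll(dirs)(p => sqlContext.avroFile(p))(_ unionAll _)
//   records.registerTempTable("data")
```

Each load adds one node to the query plan, so this is probably only sensible
for a modest number of directories, not for tens of thousands of individual
files.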

On Wed, Jan 14, 2015 at 3:53 PM, David Jones <letsnumsperi...@gmail.com>
wrote:

> Should I be able to pass multiple paths separated by commas? I haven't
> tried but didn't think it'd work. I'd expected a function that accepted a
> list of strings.
>
> On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>
>> If the wildcard path you have doesn't work you should probably open a bug
>> -- I had a similar problem with Parquet and it was a bug which recently got
>> closed. Not sure if sqlContext.avroFile shares a codepath with
>> .parquetFile... you can try running with bits that have the fix for
>> .parquetFile or look at the source...
>> Here was my question for reference:
>>
>> http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E
>>
>> On Wed, Jan 14, 2015 at 4:34 AM, David Jones <letsnumsperi...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have a program that loads a single avro file using spark SQL, queries
>>> it, transforms it and then outputs the data. The file is loaded with:
>>>
>>> val records = sqlContext.avroFile(filePath)
>>> records.registerTempTable("data")
>>> ...
>>>
>>>
>>> Now I want to run it over tens of thousands of Avro files (all with
>>> schemas that contain the fields I'm interested in).
>>>
>>> Is it possible to load multiple avro files recursively from a top-level
>>> directory using wildcards? All my avro files are stored under
>>> s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
>>> these on EMR.
>>>
>>> If that's not possible, is there some way to load multiple avro files
>>> into the same table/RDD so the whole dataset can be processed (and in that
>>> case I'd supply paths to each file concretely, but I *really* don't want to
>>> have to do that).
>>>
>>> Thanks
>>> David
>>>
>>
>>
>
