Thank you Doug, this worked like a charm!
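
For the archives, here's roughly what I ended up with: a small helper that
streams just the Avro container header from GCS (a sketch; `bucketName` and
`objectName` are placeholders, and it assumes the google-cloud-storage
client):

```
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

Schema readSchema(String bucketName, String objectName) throws IOException {
  Storage storage = StorageOptions.getDefaultInstance().getService();
  Blob blob = storage.get(BlobId.of(bucketName, objectName));
  // blob.reader() returns a channel that streams the object lazily, so
  // only the header bytes are fetched before getSchema() returns.
  InputStream input = Channels.newInputStream(blob.reader());
  try {
    return new DataFileStream<>(input, new GenericDatumReader<GenericRecord>())
        .getSchema();
  } finally {
    input.close();
  }
}
```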

-R

On Thu, May 3, 2018 at 2:09 PM, Doug Cutting <cutt...@gmail.com> wrote:

> You might instead try using the blob's reader method?
>
> Something like:
>
> InputStream input = Channels.newInputStream(blob.reader());
> try {
>   // DataFileStream reads just the container header before getSchema(),
>   // so the rest of the object is never fetched.
>   return new DataFileStream<>(input, new GenericDatumReader<GenericRecord>())
>       .getSchema();
> } finally {
>   input.close();
> }
>
> Doug
>
> On Wed, May 2, 2018 at 4:30 PM Rodrigo Ipince <rodr...@leanplum.com>
> wrote:
>
>> Hi,
>>
>>
>> (Disclaimer: I'm new to Avro and Beam)
>>
>>
>> Question: *is there a way to read the schema from an Avro file in GCS
>> without having to read the entire file?*
>>
>>
>> Context:
>>
>> I have a bunch of large files in GCS.
>>
>> I want to process them with
>> AvroIO.readGenericRecords(theSchema).from(filePattern)
>> (from the Apache Beam SDK). However, I don’t know the schema up
>> front.
>>
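>> For reference, the read step would look roughly like this (a sketch;
>> theSchema and filePattern stand in for real values):
>>
>> ```
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.io.AvroIO;
>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>
>> Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());
>> // readGenericRecords needs the schema when the pipeline is built,
>> // which is exactly what I don't have yet.
>> pipeline.apply(AvroIO.readGenericRecords(theSchema).from(filePattern));
>> pipeline.run();
>> ```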
>>
>> Now, I can read one of the files and extract the schema from it up front,
>> sort of like this:
>>
>> ```
>> Blob avroFile = … // get Blob from GCS
>>
>> SeekableInput seekableInput =
>>     new SeekableByteArrayInput(avroFile.getContent());
>> DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
>>
>> try (DataFileReader<GenericRecord> dataFileReader =
>>     new DataFileReader<>(seekableInput, datumReader)) {
>>   String schema = dataFileReader.getSchema().toString();
>> }
>> ```
>>
>>
>> but the file is really large and my nodes are really tiny, so they run
>> out of memory. Is there a way to extract the schema without reading the
>> entire file?
>>
>> Thanks!
>>
>>
