Thank you Doug, this worked like a charm! -R
On Thu, May 3, 2018 at 2:09 PM, Doug Cutting <cutt...@gmail.com> wrote:
> You might instead try using the blob's reader method?
>
> Something like:
>
>     InputStream input = Channels.newInputStream(blob.reader());
>     try {
>       return new DataFileStream(input, new GenericDatumReader()).getSchema();
>     } finally {
>       input.close();
>     }
>
> Doug
>
> On Wed, May 2, 2018 at 4:30 PM Rodrigo Ipince <rodr...@leanplum.com> wrote:
>>
>> Hi,
>>
>> (Disclaimer: I'm new to Avro and Beam)
>>
>> Question: *is there a way to read the schema from an Avro file in GCS
>> without having to read the entire file?*
>>
>> Context:
>>
>> I have a bunch of large files in GCS.
>>
>> I want to process them by doing
>> AvroIO.readGenericRecords(theSchema).from(filePattern)
>> (this is from the Apache Beam SDK). However, I don't know the schema up
>> front.
>>
>> Now, I can read one of the files and extract the schema from it up front,
>> sort of like this:
>>
>> ```
>> Blob avroFile = … // get Blob from GCS
>>
>> SeekableInput seekableInput = new SeekableByteArrayInput(avroFile.getContent());
>>
>> DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
>>
>> try (DataFileReader<GenericRecord> dataFileReader =
>>     new DataFileReader<>(seekableInput, datumReader)) {
>>
>>   String schema = dataFileReader.getSchema().toString();
>>
>> }
>> ```
>>
>> but the file is really large, and my nodes are really tiny, so they run
>> out of memory. Is there a way to not have to read the entire file in order
>> to extract the schema?
>>
>> Thanks!
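Doug's suggestion works because an Avro object container file stores its schema entirely in the file header: the 4-byte magic `Obj\1`, then a string-to-bytes metadata map containing an `avro.schema` entry, then a 16-byte sync marker. A streaming reader can therefore stop after a few kilobytes, no matter how large the file is. As an illustration of that point (not a replacement for `DataFileStream`), here is a minimal, dependency-free sketch that parses just the header by hand; the class name and helpers are mine, and it assumes a well-formed container file:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class AvroHeaderSchema {

    // Decode an Avro long: zig-zag value in variable-length (varint) bytes.
    static long readLong(InputStream in) throws IOException {
        long n = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) throw new IOException("unexpected EOF");
            n |= (long) (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1); // undo zig-zag encoding
    }

    static byte[] readFully(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int off = 0;
        while (off < len) {
            int r = in.read(buf, off, len - off);
            if (r < 0) throw new IOException("unexpected EOF");
            off += r;
        }
        return buf;
    }

    /** Returns the JSON schema from the header, reading nothing past it. */
    public static String readSchema(InputStream in) throws IOException {
        byte[] magic = readFully(in, 4);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1) {
            throw new IOException("not an Avro container file");
        }
        String schema = null;
        long count;
        while ((count = readLong(in)) != 0) {      // metadata map, block by block
            if (count < 0) {                        // negative count: block byte size follows
                count = -count;
                readLong(in);
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readFully(in, (int) readLong(in)), StandardCharsets.UTF_8);
                byte[] value = readFully(in, (int) readLong(in));
                if (key.equals("avro.schema")) {
                    schema = new String(value, StandardCharsets.UTF_8);
                }
            }
        }
        if (schema == null) throw new IOException("no avro.schema entry in header");
        return schema;
    }
}
```

In practice you would still pass `Channels.newInputStream(blob.reader())` to the real Avro `DataFileStream` as Doug shows; this sketch just makes it clear why that never pulls the whole object into memory.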