Hi,

You've got it right: DataFileReader and DataFileStream read one block
at a time, so "fileReader.tell()" stays at the sync marker between
blocks while records are being read from the current block. As you
probably know, DataFileReader can only seek to block boundaries.

The entire block is read from disk and used as the source of the next
N records, so the value from tell() literally *is* the number of bytes
that had been read from the file when the current record was emitted.
It reflects the file compression, if any, rather than the strict size
of the binary-encoded record.

The number of bytes accumulated while decoding each record doesn't
look like it's exposed, but it might be accessible through the binary
decoder used in the DatumReader.  If that doesn't work, consider filing
a JIRA feature request to expose this information -- I can see it being
useful for metrics like yours.
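
In the meantime, Jeremy's earlier suggestion of passing your own stream
to DataFileStream could be sketched roughly as below. This is only an
illustration using the JDK: the class name CountingInputStream and its
getBytesRead() method are made up for this example and are not part of
Avro. The count is the number of bytes pulled from the underlying
stream, so it still reflects compressed, block-level (and possibly
buffered read-ahead) reads rather than per-record sizes.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical helper (not an Avro class): counts the bytes that a
 * consumer pulls through the wrapped stream.
 */
class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++; // one byte consumed; -1 means end of stream
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n; // count only bytes actually delivered
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        bytesRead += skipped; // skipped bytes were consumed too
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would make the count misleading
    }

    /** Total bytes consumed from the underlying stream so far. */
    long getBytesRead() {
        return bytesRead;
    }
}
```

Assuming this fits your setup, you would wrap a FileInputStream in it,
hand it to "new DataFileStream<>(countingStream, reader)", and print
getBytesRead() after each next(). DataFileStream is not seekable, so
for random access the same counting idea would have to be applied to a
custom SeekableInput passed to DataFileReader instead.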

I hope this helps. Let us know if you find a solution!

Ryan



On Mon, Jul 27, 2020 at 7:46 PM Jeremy Custenborder
<jcustenbor...@gmail.com> wrote:
>
> Not sure off hand. I thought you were just reading sequentially.
>
> On Sun, Jul 26, 2020 at 12:15 AM Julien Phalip <jpha...@gmail.com> wrote:
> >
> > Hi Jeremy,
> >
> > Thanks for your reply. I'm currently using DataFileReader because I also 
> > need to use random access/seeks. Would that be possible with DataFileStream 
> > as well? Or is there another technique that could work?
> >
> > Julien
> >
> > On Sat, Jul 25, 2020 at 9:36 PM Jeremy Custenborder 
> > <jcustenbor...@gmail.com> wrote:
> >>
> >> Could you use DataFileStream and pass in your own stream? Then you
> >> could get bytes read.
> >>
> >> [1] 
> >> https://avro.apache.org/docs/1.9.2/api/java/org/apache/avro/file/DataFileStream.html
> >>
> >> On Sat, Jul 25, 2020 at 7:42 PM Julien Phalip <jpha...@gmail.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'd like to keep track of the number of bytes read as I'm reading 
> >> > through the records of an Avro file.
> >> >
> >> > See this sample code:
> >> >
> >> > File file = new File("mydata.avro");
> >> > DatumReader<GenericRecord> reader = new GenericDatumReader<>();
> >> > DataFileReader<GenericRecord> fileReader = new DataFileReader<>(file, reader);
> >> > GenericRecord record = new GenericData.Record(fileReader.getSchema());
> >> > long counter = 0;
> >> > while (fileReader.hasNext()) {
> >> >     fileReader.next(record);
> >> >     counter += // Magic happens here
> >> >     System.out.println("Bytes read so far: " + counter);
> >> > }
> >> >
> >> > I can't seem to find a way to extract that information from the 
> >> > `fileReader` or  `record` objects. I figured maybe `fileReader.tell()` 
> >> > might help here, but that value seems to stay stuck on the current 
> >> > block's position.
> >> >
> >> > Is this possible?
> >> >
> >> > Thanks!
> >> >
> >> > Julien
