Interesting question. The spec says a file has to start with a header:
http://avro.apache.org/docs/current/spec.html#Object+Container+Files
However, it may still be appropriate to behave consistently with the tools/Java implementation. We could discuss on the dev list whether to amend the spec so that it is clear about this case either way.

Also, the duplicate __next__ is surely a mistake. Would you please open a ticket and consider making a pull request as well?

https://avro.apache.org/issue_tracking.html

On Tue, Jun 18, 2019 at 23:03 David Beswick <david.besw...@bupa.com.au> wrote:

> Hello,
>
> I'm getting this problem with the pip package avro-python3-1.9.0.
>
> The package seems to have an issue with raw codec files that contain no
> records (just a '0' block count) but then follow the empty block with a
> sync marker. I've attached an example file but I'm not sure if it'll come
> through - let me know if you'd like it. It's been written by a process
> external to us.
>
> The "avro-tools" package reads these kinds of files fine.
>
> The problem files generate this traceback and assertion. Example code and
> traceback:
>
>
> from avro.datafile import DataFileReader, DataFileWriter
> from avro.io import DatumReader
>
> with DataFileReader(open("28.avro", 'rb'), DatumReader()) as r:
>     print(r.meta)
>     for rec in r:
>         print(rec)
>
>
> Traceback (most recent call last):
>   File "./test.py", line 31, in <module>
>     for rec in r:
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/datafile.py", line 526, in __next__
>     datum = self.datum_reader.read(self.datum_decoder)
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 489, in read
>     return self.read_data(self.writer_schema, self.reader_schema, decoder)
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 534, in read_data
>     return self.read_record(writer_schema, reader_schema, decoder)
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 734, in read_record
>     field_val = self.read_data(field.type, readers_field.type, decoder)
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 512, in read_data
>     return decoder.read_utf8()
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 257, in read_utf8
>     input_bytes = self.read_bytes()
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 249, in read_bytes
>     assert (nbytes >= 0), nbytes
> AssertionError: -11
>
>
> I think the issue is in the __next__ function of DataFileReader, which
> seems to assume that a datum will always follow a block header read. The
> following implementation fixes the bug for me. Is it correct?
>
> def __next__(self):
>     """Return the next datum in the file."""
>     while True:
>         if self.block_count == 0:
>             if self.is_EOF():
>                 raise StopIteration
>             elif self._skip_sync():
>                 pass
>             else:
>                 self._read_block_header()
>         else:
>             datum = self.datum_reader.read(self.datum_decoder)
>             self._block_count -= 1
>             return datum
>
>
> Please also note that it seems two __next__ methods have been mistakenly
> put in this class.
>
> Regards,
> David
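
For discussion, here is a minimal, self-contained sketch of the behaviour in question. It does not use the avro package. It models only the block layout the spec describes (a long object count, a long byte size, the serialized objects, then the 16-byte sync marker) and treats every datum as a plain Avro string. The names write_long, read_long, make_block and iter_strings, and the SYNC value, are made up for this illustration. The point is that a reader which loops over the count, rather than assuming a datum follows every block header, passes over a zero-count block cleanly.

```python
import io

# Hypothetical 16-byte sync marker, used only in this sketch.
SYNC = bytes(range(16))


def write_long(n):
    """Encode an int as an Avro long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(stream):
    """Decode one Avro long (varint, then zigzag) from a byte stream."""
    shift = 0
    z = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of data while reading a long")
        byte = chunk[0]
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def make_block(strings):
    """Build one data block: count, byte size, string datums, sync marker."""
    encoded = [s.encode("utf-8") for s in strings]
    payload = b"".join(write_long(len(e)) + e for e in encoded)
    return write_long(len(strings)) + write_long(len(payload)) + payload + SYNC


def iter_strings(data):
    """Yield every string datum, tolerating blocks whose count is 0."""
    stream = io.BytesIO(data)
    while stream.tell() < len(data):
        count = read_long(stream)
        read_long(stream)  # block size in bytes; not needed in this sketch
        # With count == 0 this loop body never runs, so no datum is read.
        for _ in range(count):
            nbytes = read_long(stream)
            yield stream.read(nbytes).decode("utf-8")
        if stream.read(16) != SYNC:
            raise ValueError("sync marker mismatch")


if __name__ == "__main__":
    data = make_block([]) + make_block(["a", "b"]) + make_block([])
    print(list(iter_strings(data)))
```

In this model, an empty block before and after a two-record block iterates as just "a" and "b". As an aside, read_long(io.BytesIO(b"\x15")) gives -11, so a negative length like the one in the assertion is the sort of value that appears when bytes that are not a datum get decoded as a length. Which bytes were actually involved in the attached file is not something this sketch establishes.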