Interesting question. The spec says a file has to start with a header.

http://avro.apache.org/docs/current/spec.html#Object+Container+Files

However, it may still be appropriate to behave consistently with the
tools/java implementation. Either way, we could discuss on the dev list
whether to amend the spec to be clearer about this case.
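
For what it's worth, here is a minimal sketch of the bytes involved,
assuming the block layout from the spec (a long count, a long size, the
serialized objects, then the 16-byte sync marker) and zigzag varint
encoding for longs. The helper names and the sync marker value are made up
for illustration; I have not inspected the attached file. It shows how a
reader that expects a datum right after the header of an empty block could
decode a sync marker byte as a negative string length:

```python
import io


def encode_long(n: int) -> bytes:
    """Zigzag + variable-length encoding, as Avro uses for int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(stream) -> int:
    """Inverse of encode_long, reading one byte at a time."""
    shift = 0
    z = 0
    while True:
        b = stream.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


# Hypothetical 16-byte sync marker, chosen to start with 0x15 for the demo.
SYNC = bytes(range(0x15, 0x25))

# An empty block: count = 0, size = 0, no data, then the sync marker.
empty_block = encode_long(0) + encode_long(0) + SYNC

stream = io.BytesIO(empty_block)
count = decode_long(stream)
size = decode_long(stream)
# A reader that assumes a datum follows would read the next long as a length.
bogus_length = decode_long(stream)
print(count, size, bogus_length)  # prints: 0 0 -11
```

With a different marker the bogus length would differ, so the -11 here only
shows the mechanism, not what is actually in the file.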

Also, the duplicate __next__ is surely a mistake. Would you please open a
ticket and consider making a pull request as well?

https://avro.apache.org/issue_tracking.html
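
On the duplicate definition: when a class body defines the same method
name twice, Python keeps only the last one, so the earlier __next__ would
be dead code. A minimal sketch with a hypothetical class (not the real
DataFileReader):

```python
class Reader:
    """Hypothetical stand-in used only to show the name rebinding."""

    def __iter__(self):
        return self

    def __next__(self):
        return "first definition"

    def __next__(self):  # rebinding the same name replaces the method above
        return "second definition"


print(next(Reader()))  # prints: second definition
```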


On Tue, Jun 18, 2019 at 23:03 David Beswick <david.besw...@bupa.com.au>
wrote:

> Hello,
>
> I'm getting this problem with the PIP package avro-python3-1.9.0.
>
> The package seems to have an issue with raw codec files that contain a
> block with no records (just a '0' block count) followed by a sync marker.
> I've attached an example file, but I'm not sure if it'll come through; let
> me know if you'd like it. It was written by a process external to us.
>
> The "avro-tools" package reads these kinds of files fine.
>
> The problem files generate this traceback and assertion. Example code and
> traceback:
>
>
> from avro.datafile import DataFileReader
> from avro.io import DatumReader
>
> # Iterate over every record; the empty block triggers the assertion below.
> with DataFileReader(open("28.avro", 'rb'), DatumReader()) as r:
>     print(r.meta)
>     for rec in r:
>         print(rec)
>
>
> Traceback (most recent call last):
>   File "./test.py", line 31, in <module>
>     for rec in r:
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/datafile.py", line 526, in __next__
>     datum = self.datum_reader.read(self.datum_decoder)
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 489, in read
>     return self.read_data(self.writer_schema, self.reader_schema, decoder)
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 534, in read_data
>     return self.read_record(writer_schema, reader_schema, decoder)
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 734, in read_record
>     field_val = self.read_data(field.type, readers_field.type, decoder)
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 512, in read_data
>     return decoder.read_utf8()
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 257, in read_utf8
>     input_bytes = self.read_bytes()
>   File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 249, in read_bytes
>     assert (nbytes >= 0), nbytes
> AssertionError: -11
>
>
> I think the issue is in the __next__ function of DataFileReader, which
> seems to assume that a datum will always follow a block header read. The
> following implementation fixes the bug for me. Is it correct?
>
> def __next__(self):
>     """Return the next datum in the file."""
>     while True:
>         if self.block_count == 0:
>             if self.is_EOF():
>                 raise StopIteration
>             elif self._skip_sync():
>                 pass
>             else:
>                 self._read_block_header()
>         else:
>             datum = self.datum_reader.read(self.datum_decoder)
>             self._block_count -= 1
>             return datum
>
>
> Please also note that it seems two __next__ methods have been mistakenly
> put in this class.
>
> Regards,
> David
