Hello,

I'm writing some code to split Avro datafiles into smaller files. In one
of the approaches I read blocks from a DataFileStream and, for each
block, call appendEncoded on a DataFileWriter until a certain number of
blocks has been written, then start a new writer, continuing until every
block has been transferred to one of the smaller files.
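In outline, the code looks roughly like this (simplified sketch, error
handling stripped; the part-file naming and the exact roll-over check are
just illustrative, not my real code):

```java
import java.io.File;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SplitSketch {

  // Split 'input' into parts of roughly 'recordsPerPart' records each,
  // copying raw block bytes with appendEncoded instead of re-encoding
  // each record.
  static void split(File input, int recordsPerPart) throws Exception {
    try (DataFileStream<GenericRecord> in = new DataFileStream<>(
        new FileInputStream(input), new GenericDatumReader<GenericRecord>())) {
      int part = 0;
      long written = 0;
      DataFileWriter<GenericRecord> out = open(in, input, part++);
      while (in.hasNext()) {
        long inBlock = in.getBlockCount();  // records in the pending block
        ByteBuffer block = in.nextBlock();  // raw bytes of that block
        out.appendEncoded(block);           // <-- the call in question
        written += inBlock;
        if (written >= recordsPerPart) {    // roll to the next output file
          out.close();
          out = open(in, input, part++);
          written = 0;
        }
      }
      out.close();
    }
  }

  // Start a new output datafile with the same schema as the input.
  static DataFileWriter<GenericRecord> open(
      DataFileStream<GenericRecord> in, File input, int part) throws Exception {
    DataFileWriter<GenericRecord> w =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(in.getSchema()));
    return w.create(in.getSchema(), new File(input.getPath() + ".part" + part));
  }
}
```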

In a test case it appears to append all of the blocks without
exceptions, but when I attempt to read the resulting data back, the
reader throws the following after the first record:

org.apache.avro.AvroRuntimeException: java.io.IOException: Block read
partially, the data may be corrupt
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)

This particular test case has an input datafile containing 1 block of
100 records, and the splitter is set to roll once it has seen 200
records, so it ends up producing one new datafile which should be
equivalent to the input.

I'm not that familiar with the low-level internals of Avro, so I'm
wondering whether there is anything I'm missing that I should be doing
when appending the blocks.

This ticket looks like a similar issue:
https://issues.apache.org/jira/browse/AVRO-1093, but after reading it I
still don't see anything wrong with my approach.

Any pointers would be appreciated, thanks. I can share more of the code
if it helps.

-Bryan
