[ https://issues.apache.org/jira/browse/AVRO-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790840#action_12790840 ]
Doug Cutting commented on AVRO-160:
-----------------------------------

Scott> I'm concerned about silent data loss during processing mostly - a M/R job runs, a collision (or any form of corruption) occurs, a block gets skipped silently, and some calculation runs with missing tuples.

Who's silently skipping blocks? That sounds like the source of the problem. Normally, if corruption is detected an exception should be thrown and the task should fail. Hopefully the task will succeed on another non-corrupt replica. If you truly require that things run to completion on corrupt data, then you'll necessarily miss some tuples. It should be possible to configure things this way, but it should not be the default.

Since I never expect to see a collision, I don't feel an urgent need to add code to recover from one. Detecting them and failing might be wise, just in case.

Philip> Are we cool with not having checksums in these files?

I am.

> file format should be friendly to streaming
> -------------------------------------------
>
>                 Key: AVRO-160
>                 URL: https://issues.apache.org/jira/browse/AVRO-160
>             Project: Avro
>          Issue Type: Improvement
>          Components: spec
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>
> It should be possible to stream through an Avro data file without seeking to
> the end.
> Currently the interpretation is that schemas written to the file apply to all
> entries before them. If this were changed so that they instead apply to all
> entries that follow, and the initial schema is written at the start of the
> file, then streaming could be supported.
> Note that the only change permitted to a schema as a file is written is to,
> if it is a union, to add new branches at the end of that union. If it is not
> a union, no changes may be made. So it is still the case that the final
> schema in a file can read every entry in the file and thus may be used to
> randomly access the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
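
For illustration only, the schema-change rule in the quoted issue description might be sketched as below. This is a rough sketch, not Avro's implementation: schemas are assumed to be already-parsed JSON values (a union is a JSON array, so a Python list), and the names `is_permitted_change` and `stream_entries`, as well as the `("schema", ...)` / `("entry", ...)` item model, are hypothetical and do not reflect the actual file layout.

```python
def is_permitted_change(old_schema, new_schema):
    """Check the rule from the issue: a union may only gain branches at the end;
    a non-union schema may not change at all."""
    if not isinstance(old_schema, list):
        # Not a union: no changes may be made.
        return new_schema == old_schema
    if not isinstance(new_schema, list):
        # A union cannot turn into a non-union.
        return False
    # Existing branches must be kept, in order, as a prefix of the new union.
    return new_schema[:len(old_schema)] == old_schema


def stream_entries(items):
    """Yield (entry, schema) pairs in one forward pass, with no seek to the end.

    `items` is a hypothetical in-memory stand-in for a file: an iterable of
    ("schema", value) and ("entry", value) tuples. Each schema applies to the
    entries that follow it, as proposed in the issue.
    """
    current = None
    for kind, value in items:
        if kind == "schema":
            if current is not None and not is_permitted_change(current, value):
                # Fail loudly rather than skip data silently.
                raise ValueError("schema change not permitted mid-file")
            current = value
        else:
            if current is None:
                raise ValueError("entry seen before the initial schema")
            yield value, current
```

Because every permitted change keeps the earlier branches as a prefix, the last schema seen can still read every earlier entry, which is the random-access property the issue description relies on.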