For Avro Container Files, the schema is always at the beginning. Starting each split task by reading the schema and then seeking to a particular block has worked well enough for MapReduce over the life of the project, so I would stick with doing the same thing.
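
That access pattern (read the header once, jump to the first block of the split, stop at the split's end) can be sketched in Java. This is only a toy model under stated assumptions: FakeReader is a hypothetical in-memory stand-in, not an Avro class, and the block offsets and records are made up. Its sync, pastSync, hasNext and next methods borrow the names of the corresponding DataFileReader methods, and the 16-byte sync marker size comes from the container file spec.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    // Sync markers are 16 bytes long per the Avro container file spec.
    static final int SYNC_SIZE = 16;

    /** Hypothetical in-memory stand-in for a container file reader; not an Avro class. */
    static final class FakeReader {
        private final long[] blockStarts; // offset where each block's data begins (just after a sync marker)
        private final String[][] blocks;  // records stored in each block
        private int block;                // index of the current block
        private int pos;                  // index of the next record in the current block

        FakeReader(long[] blockStarts, String[][] blocks) {
            this.blockStarts = blockStarts;
            this.blocks = blocks;
        }

        /** Position at the first block whose preceding sync marker starts at or after position. */
        void sync(long position) {
            block = 0;
            pos = 0;
            while (block < blocks.length && blockStarts[block] - SYNC_SIZE < position) {
                block++;
            }
        }

        /** Skip exhausted blocks, then report whether any record remains. */
        boolean hasNext() {
            while (block < blocks.length && pos >= blocks[block].length) {
                block++;
                pos = 0;
            }
            return block < blocks.length;
        }

        String next() {
            if (!hasNext()) throw new java.util.NoSuchElementException();
            return blocks[block][pos++];
        }

        /** True once the current block's sync marker starts at or after position (it belongs to a later split). */
        boolean pastSync(long position) {
            return block >= blocks.length || blockStarts[block] >= position + SYNC_SIZE;
        }
    }

    /** Read every record owned by the split [start, end). */
    static List<String> readSplit(FakeReader reader, long start, long end) {
        List<String> records = new ArrayList<>();
        reader.sync(start);                                  // jump to the split's first block
        while (reader.hasNext() && !reader.pastSync(end)) {  // stop once a later split owns the block
            records.add(reader.next());
        }
        return records;
    }
}
```

With blocks starting at offsets 100, 300 and 500 and splits [0, 200), [200, 400) and [400, 650), each split picks up exactly one block, so every record is read once no matter where the split boundaries fall. With the real library, the same loop would run against a DataFileReader opened on the file.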

If you are handling split work yourself, you can use DataFileReader[1]: use seek/sync with your desired split offset (sync moves to the next synchronization point after the given position) and pastSync with the split's end offset to tell when your work is done. This accesses the file the same way MapReduce currently does: a small read at the start, followed by a seek and then deserialization of a particular task's work.

[1]: http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileReader.html

On Fri, Jun 26, 2015 at 10:38 AM, Mike Stanley <m...@mikestanley.org> wrote:

> Not 100% on this --- but I'm pretty sure the only other thing you need to
> take into consideration is the schema. The avro schema is sometimes
> located at the beginning of the container (or external). If you expect it
> at the beginning of the container and are using it to "introspect" an avro
> file, then splitting it could be problematic for consumer code. If you
> plan on splitting it, then it's likely best to manage the schema externally
> to the container.
>
> On Fri, Jun 26, 2015 at 10:11 AM, Sean Busbey <bus...@cloudera.com> wrote:
>
>> Avro Container Files are always splittable[1]. They're the way you will
>> commonly interact with Avro serialized data.
>>
>> Data serialized as Avro's binary encoding is not splittable by itself,
>> because the encoding includes no markers[2]. This may be the source of the
>> disconnect you're finding in online docs.
>>
>> [1]: http://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files
>> [2]: http://avro.apache.org/docs/1.7.7/spec.html#Data+Serialization
>>
>> On Thu, Jun 25, 2015 at 12:54 AM, Ankur Jain <ankur.j...@yash.com> wrote:
>>
>>> Hello,
>>>
>>> I am reading various forums and docs, somewhere it is mentioned that avro
>>> is splittable and somewhere non-splittable.
>>>
>>> So which one is right??
>>>
>>> Regards,
>>>
>>> Ankur
>>
>> --
>> Sean

--
Sean