For Avro Container Files the schema is always at the beginning. Having
each split task read the schema first and then seek to a particular block
has worked well enough for MapReduce over the life of the project, so I
would just stick with doing the same thing.

If you are handling split work yourself, you can use DataFileReader[1]:
call sync with your split's start offset to move to the first block
boundary after it, and check pastSync against the split's end offset to
tell when your work is done. This accesses the file essentially the same
way MapReduce currently does: a small read at the start, followed by a
seek and then deserialization of a particular task's work.
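
A rough sketch of that approach, in case it helps. The class name, the
helper names (countRecordsInSplit, writeSampleFile), the one-field "Row"
schema and the tiny sync interval are all made up for illustration; only
the DataFileReader/DataFileWriter calls come from Avro itself. Treat it as
a starting point under those assumptions, not as what MapReduce literally
runs.

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SplitReadSketch {

    // Hypothetical helper: counts the records that belong to the split
    // covering the byte range [start, start + length) of the file.
    static long countRecordsInSplit(File file, long start, long length) throws IOException {
        long end = start + length;
        long count = 0;
        // Opening the reader does the small read of the header (schema, sync marker).
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            // Move to the first sync point at or after the split's start offset.
            reader.sync(start);
            GenericRecord record = null;
            // Stop once the reader has moved past the sync point that follows 'end'.
            while (reader.hasNext() && !reader.pastSync(end)) {
                record = reader.next(record); // reuse the record object between reads
                count++;
            }
        }
        return count;
    }

    // Hypothetical helper: writes n records to a temp container file so the
    // sketch has something to read. The schema is made up for the demo.
    static File writeSampleFile(int n) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
        File file = File.createTempFile("split-sketch", ".avro");
        file.deleteOnExit();
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // Small sync interval so even a tiny file contains many blocks.
            writer.setSyncInterval(1024);
            writer.create(schema, file);
            for (int i = 0; i < n; i++) {
                GenericRecord row = new GenericData.Record(schema);
                row.put("id", (long) i);
                writer.append(row);
            }
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        File file = writeSampleFile(5000);
        long half = file.length() / 2;
        // Two "tasks", each handed an arbitrary byte range of the same file.
        long first = countRecordsInSplit(file, 0, half);
        long second = countRecordsInSplit(file, half, file.length() - half);
        System.out.println(first + " + " + second + " = " + (first + second));
    }
}
```

The split offsets do not need to line up with block boundaries: a block
goes to whichever split contains the sync marker in front of it, so
adjacent splits cover every record exactly once.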

[1]: http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileReader.html

On Fri, Jun 26, 2015 at 10:38 AM, Mike Stanley <m...@mikestanley.org> wrote:

> Not 100% on this --- but I'm pretty sure the only other thing you need to
> take into consideration is the schema.  The avro schema is sometimes
> located at the beginning of the container (or external).  If you expect it
> at the beginning of the container and are using it to "introspect" an avro
> file, then splitting it could be problematic for consumer code.  If you
> plan on splitting it, then it's likely best to manage the schema externally
> to the container.
>
> On Fri, Jun 26, 2015 at 10:11 AM, Sean Busbey <bus...@cloudera.com> wrote:
>
>> Avro Container Files are always splittable[1]. They're the way you will
>> commonly interact with Avro serialized data.
>>
>> Data serialized as Avro's binary encoding is not splittable by itself,
>> because the encoding includes no markers[2]. This may be the source of the
>> disconnect you're finding in online docs.
>>
>>
>>
>> [1]: http://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files
>> [2]: http://avro.apache.org/docs/1.7.7/spec.html#Data+Serialization
>>
>> On Thu, Jun 25, 2015 at 12:54 AM, Ankur Jain <ankur.j...@yash.com> wrote:
>>
>>>  Hello,
>>>
>>>
>>>
>>> I am reading various forums and docs; some say that Avro is splittable
>>> and others that it is not.
>>>
>>> So which one is right??
>>>
>>>
>>>
>>> Regards,
>>>
>>> Ankur
>>>
>>>
>>
>>
>>
>> --
>> Sean
>>
>
>


-- 
Sean
