[jira] Commented: (AVRO-493) hadoop mapreduce support for avro data

Scott Carey (JIRA) Wed, 31 Mar 2010 13:25:50 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852081#action_12852081
 ]

Scott Carey commented on AVRO-493:
----------------------------------

JIRA's "You can reply to this email to add a comment to the issue online." 
does not appear to work via the Apache mailing lists, so I have put the email 
exchange in the quote below:

{quote}
Scott Carey wrote:
> It's too bad that the intermediate files can't use the Avro file format. 
> Performance will suffer until that API changes either to allow custom file 
> formats or to support a feature like the decoder's inputStream() method to 
> allow buffering of chained or interleaved readers.

The intermediate files are part of the mapreduce kernel.  The buffering, 
sorting, transmission and merging of this data is a critical part of 
mapreduce.  So I don't think it is as simple as just permitting a 
pluggable file format.

> FYI, Avro does not work with Hadoop 0.20 for CDH2 or CDH3 (I have not tried 
> plain 0.20) because they include jackson 1.0.1 and you'll get an exception 
> like this:

Can't one update the version of Jackson in one's Hadoop cluster to fix 
this?  However, that might not work with Amazon's Elastic MapReduce, 
where you don't get to update the cluster (which runs Hadoop 0.18).

Should we avoid using org.codehaus.jackson.JsonFactory.enable() to make 
Avro compatible with older versions of Jackson?

Doug
{quote}

Jackson is in Hadoop due to HADOOP-6184 ("Provide a configuration dump in JSON 
format").  
In my case, I just removed the jar completely from Hadoop because I don't use 
that feature.  We could make sure our use of the Jackson API is 1.0.1 
compatible, but at some point we will probably require the newer version.  
There might be bugs in 1.0.1 that affect Avro, and it will be troublesome if 
1.0.1 is silently used and causes bugs or other issues.
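
One way to notice a silently used older jar is to ask the JVM where a class 
was loaded from and whether a given method exists on it.  The following is a 
minimal sketch using only standard reflection.  The class LibraryProbe and 
its helper methods are hypothetical names for illustration, and the 
JsonParser.Feature parameter type for enable() is an assumption about the 
Jackson 1.x API, not something stated in this thread.

```java
import java.security.CodeSource;

class LibraryProbe {

    /** Returns the location a class was loaded from, or a placeholder if the JVM does not report one. */
    public static String loadedFrom(String className) throws ClassNotFoundException {
        Class<?> c = Class.forName(className);
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            // Typical for classes loaded by the bootstrap classloader.
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    /** Returns true if the named class has a public method with the given name and parameter types. */
    public static boolean hasMethod(String className, String methodName, String... paramClassNames) {
        try {
            Class<?>[] params = new Class<?>[paramClassNames.length];
            for (int i = 0; i < paramClassNames.length; i++) {
                params[i] = Class.forName(paramClassNames[i]);
            }
            Class.forName(className).getMethod(methodName, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String factory = "org.codehaus.jackson.JsonFactory";
        // Assumed parameter type of enable(); adjust if the API differs.
        String feature = "org.codehaus.jackson.JsonParser$Feature";
        try {
            System.out.println("JsonFactory loaded from: " + loadedFrom(factory));
            System.out.println("enable() available: " + hasMethod(factory, "enable", feature));
        } catch (ClassNotFoundException e) {
            System.out.println("Jackson is not on the classpath");
        }
    }
}
```

Running this inside a task on the cluster, rather than on a development 
machine, would show which jar the Hadoop classloader actually supplies.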

In the short term we could run our unit tests with 1.0.1 and stop using 
enable() and anything else that we are using that is not 1.0.1 compatible.
We can even change the Maven dependency to a range of supported versions (for 
example, the version range [1.0.1,2.0) is 1.0.1 inclusive to 2.0 exclusive).

In the long run Hadoop needs to keep its libraries more up to date given its 
classloader status, and/or implement some classloader partitioning to prevent 
hadoop system and user code class conflicts, especially due to small features 
like HADOOP-6184.

> hadoop mapreduce support for avro data
> --------------------------------------
>
>                 Key: AVRO-493
>                 URL: https://issues.apache.org/jira/browse/AVRO-493
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-493.patch, AVRO-493.patch
>
>
> Avro should provide support for using Hadoop MapReduce over Avro data files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
