[ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934691#action_12934691 ]

Scott Carey commented on HADOOP-6685:
-------------------------------------

Tom brought up a lot of good points.  

{quote}
I have two serious issues with the current patch, which I have mentioned above. 
However, given that they have not been adequately addressed I feel I have no 
option but to vote -1.

The first is that no change is needed in SequenceFile unless we want to support 
Avro, but, given that Avro data files were designed for this, and are 
multi-lingual, why change the SequenceFile format solely to support Avro? Are 
Avro data files insufficient? Note that Thrift and Protocol Buffers can be 
stored in today's SequenceFiles.

The second is that this patch adds new serializations which introduce into the 
core a new dependency on a particular version of each of Avro, Thrift, and PB, 
in a non-pluggable way.

This type of dependency is qualitatively different to other dependencies. 
Hadoop depends on log4j for instance, so if a user's code does too, then it 
needs to use the same version. A recent JIRA made it possible to specify a 
different version of log4j in the job, but this only works if the version the 
user specifies is compatible with both their code and the Hadoop kernel code.
{quote}

This is HUGE.  Hadoop should not place ANY jar on its classpath that is likely 
to cause such conflicts.  

The answer to this problem is actually simple, though not mentioned here: 
Hadoop should re-package such dependencies.  Use the Maven Shade plugin 
(http://maven.apache.org/plugins/maven-shade-plugin/index.html), or its Ant 
equivalent jarjar (http://code.google.com/p/jarjar/), to _embed_ such 
dependencies under relocated package names so they cannot conflict with user 
libraries.  A sketch of what that looks like is below.

There are a half-dozen other jars sitting in Hadoop's /lib directory that 
deserve the same treatment.  If a jar is NOT part of the API that Hadoop is 
exposing, Hadoop should NOT stuff it on the user's classpath.
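As a rough sketch of the Maven route, a pom.xml fragment like the following 
rewrites Thrift's packages (and all bytecode references to them) at build 
time.  The org.apache.hadoop.shaded prefix is an illustrative choice of mine, 
not an existing Hadoop convention:

{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Rewrite org.apache.thrift.* classes and every reference to
               them, so a user's own libthrift can never collide with the
               copy Hadoop bundles. -->
          <relocation>
            <pattern>org.apache.thrift</pattern>
            <shadedPattern>org.apache.hadoop.shaded.org.apache.thrift</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}

The Ant route with jarjar takes an equivalent one-line rule, something like: 
rule org.apache.thrift.** org.apache.hadoop.shaded.org.apache.thrift.@1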


-------------
As for Avro file format performance and features related to Hadoop -- I'd be 
happy to work on some of them.  There haven't been any significant requests 
for Avro on the user or dev mailing lists so far.

I'm not sure how important the overhead of writing a two-record file is (what 
use case calls for that?), but surely there is room for performance gains 
there, since the initial focus was all about streaming bulk data.  I have a 
lot of ideas that could improve performance for many things in Avro, as do 
other committers.  A sketch of where that small-file overhead comes from is 
below.
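For context, the fixed cost of an Avro data file is its header: the magic 
bytes, a metadata map embedding the writer's full schema as JSON, and a 
16-byte sync marker; the two records themselves are only a few bytes each.  A 
minimal sketch using Avro's generic API (the Pair schema is just my 
illustration):

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class TinyAvroFile {
  public static void main(String[] args) throws IOException {
    // The full schema JSON below is copied into the file header, which
    // dominates the size of a two-record file.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":["
        + "{\"name\":\"key\",\"type\":\"string\"},"
        + "{\"name\":\"value\",\"type\":\"long\"}]}");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("tiny.avro")); // writes magic + metadata + sync marker

    for (long i = 0; i < 2; i++) {
      GenericRecord r = new GenericData.Record(schema);
      r.put("key", "k" + i);
      r.put("value", i);
      writer.append(r); // each record is only a handful of bytes
    }
    writer.close(); // flushes the final data block
  }
}
{code}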

Commenting on the side about issues rather than filing bugs or requests for 
improvement isn't constructive.  Please file JIRA issues against Avro.



> Change the generic serialization framework API to use serialization-specific 
> bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, 
> serial6.patch, serial7.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for 
> the serialization specific configuration. Since this data is really internal 
> to the specific serialization, I think we should change it to be an opaque 
> binary blob. This will simplify the interface for defining specific 
> serializations for different contexts (MAPREDUCE-1462). It will also move us 
> toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
