[ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802916#action_12802916 ]
Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

Not really. I see now what you are trying to accomplish, but I think it is the wrong model. While the FileInputFormat is similar in structure, the issues are:
1. the analogy isn't precise because you aren't
2. the users care about the types that come out of their map
3. the users aren't likely to change the serialization format from the default for the types they are using
4. we end up with 3 configuration knobs per class instead of the current 1. There are 6 classes in the MR pipeline, so before this is done we have 18 methods to set the class/schema/serializer for the various types. That is *way* too complicated.

I thought that you were going to create a marker interface for AvroRecord that has a getSchema method. That would have enabled the AvroWriter to get the schema from the types rather than get the types from the schema. I also think that removing the type checks from the collector and ifile code is a bad plan and will allow a lot of errors to reach much further into the system.

Let's consider the proposal that Arun has been discussing. Instead of doing:
{noformat}
FileInputFormat.setInputPath(job, new Path("/foo"));
job.setInputFormatClass(TextInputFormat.class);
{noformat}
you do:
{noformat}
TextInputFormat input = new TextInputFormat();
input.setInputPath(new Path("/foo"));
job.setInputFormat(input);
{noformat}
Clearly the job needs to serialize the InputFormat object and reconstruct it on the other side. This is much, much easier for users to understand than the current model and can probably be done in a backwards-compatible manner. Notice that this adds another 5 types that we are going to want to serialize (InputFormat, Mapper, Combiner, Reducer, OutputFormat) per job. With the current proposal that means we get 11 * 3 = 33 serialization methods.

I think that:
* we need to use the global serialization/deserialization factory that we already have.
* moving the {set,get}MapOutput{Key,Value}Class methods is a non-starter. As a general rule, if you need to modify all of the examples, we need to carefully discuss the issues.
* the metadata should not be user visible, and it would be better if it was just used to communicate within the serializer. Why is the metadata a Map? I'd rather have it be an opaque blob that is serializable.
* we can debate whether the type restrictions on map outputs should be loosened, but certainly we need to check on the map side whether the type the map is outputting is correct. If you are going to loosen it, the class methods should become deprecated and vestigial, and you need to support union types in Writable too.
* I'm not wild about having Configuration.setMap. Having a function in StringUtils that converts a Map<String,String> to and from a String seems more appropriate.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch
>
> Currently the key comparator is defined as a Java class. Instead we should use the Serialization API to create key comparators. This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.
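The core idea in the issue description, that the shuffle should ask the serialization for a raw (byte-level) comparator instead of instantiating a configured comparator class, can be sketched as below. This is a minimal toy model: the RawComparator and Serialization interfaces and the IntSerialization class here are illustrative assumptions, not Hadoop's actual API.

```java
import java.util.Comparator;

// Illustrative sketch, not Hadoop's real interfaces: the serialization itself
// supplies a comparator that works directly on serialized bytes, so the
// shuffle never needs a separately configured comparator class.
interface RawComparator<T> extends Comparator<T> {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

interface Serialization<T> {
    byte[] serialize(T t);
    RawComparator<T> getRawComparator();
}

// Toy serialization for ints: big-endian encoding with the sign bit flipped,
// so an unsigned byte-wise comparison agrees with the natural int ordering.
class IntSerialization implements Serialization<Integer> {
    public byte[] serialize(Integer v) {
        int x = v;
        return new byte[] {
            (byte) ((x >> 24) ^ 0x80),  // flip sign bit
            (byte) (x >> 16), (byte) (x >> 8), (byte) x
        };
    }
    public RawComparator<Integer> getRawComparator() {
        return new RawComparator<Integer>() {
            public int compare(Integer a, Integer b) { return a.compareTo(b); }
            public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                // Compare the serialized forms without deserializing.
                for (int i = 0; i < 4; i++) {
                    int d = (b1[s1 + i] & 0xff) - (b2[s2 + i] & 0xff);
                    if (d != 0) return d;
                }
                return 0;
            }
        };
    }
}

public class ComparatorFromSerialization {
    public static void main(String[] args) {
        Serialization<Integer> ser = new IntSerialization();
        RawComparator<Integer> cmp = ser.getRawComparator();
        byte[] a = ser.serialize(-5);
        byte[] b = ser.serialize(7);
        // -5 sorts before 7 even when comparing raw bytes.
        System.out.println(cmp.compare(a, 0, 4, b, 0, 4) < 0);
    }
}
```

An Avro-backed serialization would implement the byte-level compare from the schema's sort order, which is what makes sorting complex types possible without hand-writing a Java RawComparator.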