[ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802916#action_12802916 ]
Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

Not really. I see now what you are trying to accomplish, but I think it is the wrong model. While the FileInputFormat is similar in structure, the issues are:
1. the analogy isn't precise because you aren't
2. the users care about the types that come out of their map
3. the users aren't likely to change the serialization format from the default for the types they are using
4. we end up with 3 configuration knobs per class instead of the current 1. There are 6 classes in the MR pipeline, so before this is done we have 18 methods to set the class/schema/serializer for the various types. That is *way* too complicated.

I thought that you were going to create a marker interface for AvroRecord that has a getSchema method. That would have enabled the AvroWriter to get the schema from the types rather than get the types from the schema. I also think that removing the type checks from the collector and ifile code is a bad plan and will allow a lot of errors to reach much further into the system.

Let's consider the proposal that Arun has been discussing. Instead of doing:
{noformat}
FileInputFormat.setInputPath(job, new Path("/foo"));
job.setInputFormatClass(TextInputFormat.class);
{noformat}
you do:
{noformat}
TextInputFormat input = new TextInputFormat();
input.setInputPath(new Path("/foo"));
job.setInputFormat(input);
{noformat}
Clearly the job needs to serialize the InputFormat object and reconstruct it on the other side. This is much, much easier for users to understand than the current model and can probably be done in a backwards-compatible manner. Notice that this adds another 5 types that we are going to want to serialize (InputFormat, Mapper, Combiner, Reducer, OutputFormat) per job. With the current proposal that means we get 11 * 3 = 33 serialization methods.

I think that:
* we need to use the global serialization/deserialization factory that we already have.
* moving the {set,get}MapOutput{Key,Value}Class methods is a non-starter. As a general rule, if you need to modify all of the examples, we need to carefully discuss the issues.
* the metadata should not be user visible, and it would be better if it was just used to communicate within the serializer. Why is the metadata a Map? I'd rather have it be an opaque blob that is serializable.
* we can debate whether the type restrictions on map outputs should be loosened, but certainly we need to check on the map side whether the type the map is outputting is correct. If you are going to loosen it, the class methods should become deprecated and vestigial, and you need to support union types in Writable too.
* I'm not wild about having Configuration.setMap. Having a function in StringUtils that converts a Map<String,String> to and from a String seems more appropriate.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch
>
> Currently the key comparator is defined as a Java class. Instead we should use the Serialization API to create key comparators. This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.
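The core idea in the issue description, that the shuffle should ask the serialization for a raw (byte-level) comparator instead of instantiating a configured comparator class, can be sketched as below. This is a minimal toy model: the RawComparator and Serialization interfaces and the IntSerialization class here are illustrative assumptions, not Hadoop's actual API.

```java
import java.util.Comparator;

// Illustrative sketch, not Hadoop's real interfaces: the serialization itself
// supplies a comparator that works directly on serialized bytes, so the
// shuffle never needs a separately configured comparator class.
interface RawComparator<T> extends Comparator<T> {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

interface Serialization<T> {
    byte[] serialize(T t);
    RawComparator<T> getRawComparator();
}

// Toy serialization for ints: big-endian encoding with the sign bit flipped,
// so an unsigned byte-wise comparison agrees with the natural int ordering.
class IntSerialization implements Serialization<Integer> {
    public byte[] serialize(Integer v) {
        int x = v;
        return new byte[] {
            (byte) ((x >> 24) ^ 0x80),  // flip sign bit
            (byte) (x >> 16), (byte) (x >> 8), (byte) x
        };
    }
    public RawComparator<Integer> getRawComparator() {
        return new RawComparator<Integer>() {
            public int compare(Integer a, Integer b) { return a.compareTo(b); }
            public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                // Compare the serialized forms without deserializing.
                for (int i = 0; i < 4; i++) {
                    int d = (b1[s1 + i] & 0xff) - (b2[s2 + i] & 0xff);
                    if (d != 0) return d;
                }
                return 0;
            }
        };
    }
}

public class ComparatorFromSerialization {
    public static void main(String[] args) {
        Serialization<Integer> ser = new IntSerialization();
        RawComparator<Integer> cmp = ser.getRawComparator();
        byte[] a = ser.serialize(-5);
        byte[] b = ser.serialize(7);
        // -5 sorts before 7 even when comparing raw bytes.
        System.out.println(cmp.compare(a, 0, 4, b, 0, 4) < 0);
    }
}
```

An Avro-backed serialization would implement the byte-level compare from the schema's sort order, which is what makes sorting complex types possible without hand-writing a Java RawComparator.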