[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868714#action_12868714 ]

Doug Cutting commented on MAPREDUCE-815:

I'd like to close this as redundant with AVRO-493. This will be included in the upcoming 1.4.0 release of Avro which will support mapreduce over Avro data files using Hadoop 0.20 or greater. Objections?

> Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
> --
>
>                 Key: MAPREDUCE-815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-815
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Ravi Gummadi
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-815.2.patch, MAPREDUCE-815.3.patch, MAPREDUCE-815.4.patch, MAPREDUCE-815.5.patch, MAPREDUCE-815.patch
>
>
> MapReduce needs AvroInputFormat similar to other InputFormats like TextInputFormat to be able to use avro serialization in hadoop. Similarly AvroOutputFormat is needed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805336#action_12805336 ]

Chris Douglas commented on MAPREDUCE-815:

It looks like some changes to JobTracker were accidentally included in the latest patch.
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800477#action_12800477 ]

Doug Cutting commented on MAPREDUCE-815:

In the end-to-end-test validation, you don't need a seekable input and can read the entire file with something like:

{code}
reader = new DataFileStream(istream, datumReader);
for (int value : reader) {
  ...
}
{code}

This will save a number of lines and provide a better example of non-split Avro file usage.
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800401#action_12800401 ]

Doug Cutting commented on MAPREDUCE-815:

- Those javadoc @deprecated lines don't look right. Did you run javadoc and look at the output?
- AvroInputFormat's value field isn't used and could be removed.
- Should we add an end-to-end test somewhere that runs an MR job with Avro input, Avro comparison, and Avro output? New issue? localrunner would probably suffice.
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800312#action_12800312 ]

Doug Cutting commented on MAPREDUCE-815:

> That having been said, we can't use null or we'll break the identity mapper.

It seems to me that we should be able to pass null end-to-end as a value. If we can't, then we perhaps haven't yet removed all of the Writable assumptions, no?
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800309#action_12800309 ]

Aaron Kimball commented on MAPREDUCE-815:

The only reason I could think of to use the position would be building some sort of index over an Avro file. I think this probably doesn't make much sense here.

That having been said, we can't use null or we'll break the identity mapper. (The MapOutputBuffer expects non-null keys and values only. A {{context.write(k, null)}} from the mapper will throw NullPointerException.) This is why Writables included NullWritable, I think.

We could add a type, e.g. "Empty", which implements AvroReflectSerializable and whose toString method returns the empty string; this would work fairly transparently, I think, and be entirely Avro-based.
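Editorial sketch of the "Empty" placeholder idea from Aaron Kimball's comment above. This is a plain-Java model under stated assumptions, not the patch's code: the class names `Empty` and `NonNullBuffer` and their methods are hypothetical, the AvroReflectSerializable marker is left out so the example has no dependencies, and `NonNullBuffer` only imitates the "no null keys or values" rule attributed to the MapOutputBuffer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a "no value" datum; not part of Hadoop or Avro. */
final class Empty {
    private static final Empty INSTANCE = new Empty();

    private Empty() {}

    /** Returns the single shared instance. */
    static Empty get() {
        return INSTANCE;
    }

    /** Renders as the empty string, as proposed in the comment. */
    @Override
    public String toString() {
        return "";
    }
}

/** Toy buffer that rejects nulls, imitating the constraint described above. */
final class NonNullBuffer {
    private final List<String> lines = new ArrayList<>();

    /** Stores "key TAB value"; throws if either argument is null. */
    void collect(Object key, Object value) {
        if (key == null || value == null) {
            throw new NullPointerException("null key or value");
        }
        lines.add(key + "\t" + value);
    }

    List<String> lines() {
        return lines;
    }

    public static void main(String[] args) {
        NonNullBuffer buffer = new NonNullBuffer();
        // Passing Empty.get() instead of null keeps the buffer happy.
        buffer.collect("record-1", Empty.get());
        System.out.println(buffer.lines().size());
    }
}
```

With such a placeholder, a mapper could emit `(key, Empty.get())` where it would otherwise want to emit `(key, null)`, which is the role NullWritable plays for Writables.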
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800267#action_12800267 ]

Doug Cutting commented on MAPREDUCE-815:

This looks great! A few nits:

- In javadoc comments, use "@deprecated use #foo()" to link to the new implementation.
- AvroSeekableStream is likely to be reused by other applications that use Avro with HDFS. It might be named AvroFSInput, and it might better belong in common than in mapreduce.
- Why use LongWritable? Could we instead use java.lang.Long? Or perhaps just null for these values? Does anyone ever make use of the position? If not, let's use null. If we can avoid a dependency on Writable here, that'd be good. Or does this provide some important compatibility?
- I don't think SYNC_DISTANCE is needed: DataFileWriter syncs automatically every 100k or so.
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799422#action_12799422 ]

Aaron Kimball commented on MAPREDUCE-815:

Doug:
* I agree; I'll make this be the key. The value will be the byte offset.
* My current implementation gives a log message at level WARN the first time a non-null value is received; it then ignores the value and continues operating.
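Editorial sketch of the warn-once behaviour described in the comment above: write the key, drop the value, and complain only about the first non-null value. The class `KeyOnlyWriter` and its methods are hypothetical, and warnings are collected in a list rather than sent to a logger so the behaviour is easy to inspect; the actual patch may well differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical key-only writer: stores keys, drops values, warns once. */
final class KeyOnlyWriter<K, V> {
    private final List<K> written = new ArrayList<>();
    private final List<String> warnings = new ArrayList<>();
    private boolean warned = false;

    /** Writes the key; a non-null value is ignored, with one warning in total. */
    void write(K key, V value) {
        if (value != null && !warned) {
            warned = true;
            warnings.add("WARN: ignoring non-null value; only keys are written");
        }
        written.add(key);
    }

    List<K> written() {
        return written;
    }

    List<String> warnings() {
        return warnings;
    }

    public static void main(String[] args) {
        KeyOnlyWriter<String, Long> writer = new KeyOnlyWriter<>();
        writer.write("a", null); // no warning: value is null
        writer.write("b", 10L);  // first non-null value: one warning
        writer.write("c", 20L);  // later non-null values are dropped silently
        System.out.println(writer.written() + " warnings=" + writer.warnings().size());
    }
}
```

Warning once rather than on every record keeps the log readable for jobs that routinely emit values they do not need, which matches the "err towards utility" position taken elsewhere in the thread.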
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799307#action_12799307 ]

Doug Cutting commented on MAPREDUCE-815:

FWIW, the file-position-as-text-map-input-key convention came from the original Google MapReduce paper, but I don't think it's ever proven useful.
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799299#action_12799299 ]

Tom White commented on MAPREDUCE-815:

bq. If, in the InputFormat we populated the key rather than the value, then one would not even need to specify InverseMapper: by default, MapReduce would simply partition and sort Avro data.

I like this. It is the approach I was taking on MAPREDUCE-252.
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799288#action_12799288 ]

Doug Cutting commented on MAPREDUCE-815:

Aaron, this sounds good. A few questions:

- If, in the InputFormat, we populated the key rather than the value, then one would not even need to specify InverseMapper: by default, MapReduce would simply partition and sort Avro data. Making values optional in both input and output seems more consistent, but does break compatibility with TextInputFormat. Thoughts?
- In the OutputFormat, should we check if values are non-null or just drop them? Just dropping them may cause some confusion, but is probably useful in many cases, so I guess we err towards utility?
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798970#action_12798970 ]

Aaron Kimball commented on MAPREDUCE-815:

Now that MAPREDUCE-1126 is in, I'm going to attack this and complete the loop.

Given that TextInputFormat yields a semi-arbitrary key and encapsulates the file contents in the value, I plan to follow suit here -- the value produced by the AvroRecordReader will contain the next object in the file.

As for output: I think that it's best to leave the output format accepting a single value only (rather than explicitly making a hybrid of the key and value). Users can implement their own UnionAvroOutputFormat (or whatever) if they need both, but I think the basic version should only do the most straightforward thing. I plan to make this write the user's key to the file, and drop the value. That way InverseMapper -> IdentityReducer should emit it all in sorted order.
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790949#action_12790949 ]

Doug Cutting commented on MAPREDUCE-815:

> What about the AvroOutputFormat?

I suggest we treat it similarly: require either keys or values to be null.

An output format could combine output keys and values into a compound record, and one could define an input format that splits each input datum into a separate key and value, but I don't think the basic AvroOutputFormat should do this. If we add a MapFile-like abstraction for Avro, then its input and output formats should probably do this.
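Editorial sketch of the compound-record alternative mentioned in the comment above, which the basic AvroOutputFormat is not meant to implement. It only illustrates the combine and split steps in plain Java; `KeyValueRecord`, `combine`, and `split` are hypothetical names, and no Avro schema or file handling is modelled.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical compound record holding one key and one value. */
final class KeyValueRecord<K, V> {
    final K key;
    final V value;

    private KeyValueRecord(K key, V value) {
        this.key = key;
        this.value = value;
    }

    /** What a pair-writing output format would do before writing a datum. */
    static <K, V> KeyValueRecord<K, V> combine(K key, V value) {
        return new KeyValueRecord<K, V>(key, value);
    }

    /** What the matching input format would do after reading a datum. */
    Map.Entry<K, V> split() {
        return new SimpleEntry<K, V>(key, value);
    }

    public static void main(String[] args) {
        KeyValueRecord<String, Integer> datum = combine("apple", 3);
        Map.Entry<String, Integer> pair = datum.split();
        System.out.println(pair.getKey() + " -> " + pair.getValue());
    }
}
```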
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790947#action_12790947 ]

Jacob Rideout commented on MAPREDUCE-815:

> only one of key and value will be useful and the other will be null.

Thanks Doug, that makes sense. What about the AvroOutputFormat? Does the same condition apply? I can see ignoring keys for the output of the reducer, but what about the output of a map?
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790835#action_12790835 ]

Doug Cutting commented on MAPREDUCE-815:

> What is the current line of thought on how keys and values will interact with the schema for an avro file?

I think, similar to TextInputFormat and TextOutputFormat, only one of key and value will be useful and the other will be null. It doesn't matter much which. If input keys have the Avro datum and values are null, then the default identity mapper can be used to sort data, while if input values contain the Avro datum and keys are null, then InverseMapper must be specified.
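Editorial sketch of the key-placement trade-off described in the comment above, using plain Java collections rather than the Hadoop API. All names here (`KeyPlacementDemo`, `datumInKey`, `datumInValue`, `inverse`, `sortedKeys`) are hypothetical; `inverse` stands in for what InverseMapper does, and `sortedKeys` stands in for the sort performed by the shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Plain-Java model of key placement; none of these names are Hadoop or Avro APIs. */
final class KeyPlacementDemo {

    /** Input where each datum is the key and the value is null. */
    static List<Map.Entry<String, Void>> datumInKey(String... data) {
        List<Map.Entry<String, Void>> pairs = new ArrayList<>();
        for (String d : data) {
            pairs.add(new SimpleEntry<String, Void>(d, null));
        }
        return pairs;
    }

    /** Input where each datum is the value and the key is null. */
    static List<Map.Entry<Void, String>> datumInValue(String... data) {
        List<Map.Entry<Void, String>> pairs = new ArrayList<>();
        for (String d : data) {
            pairs.add(new SimpleEntry<Void, String>(null, d));
        }
        return pairs;
    }

    /** Swaps key and value in every pair, the job an inverse mapper does. */
    static <A, B> List<Map.Entry<B, A>> inverse(List<Map.Entry<A, B>> pairs) {
        List<Map.Entry<B, A>> swapped = new ArrayList<>();
        for (Map.Entry<A, B> p : pairs) {
            swapped.add(new SimpleEntry<B, A>(p.getValue(), p.getKey()));
        }
        return swapped;
    }

    /** Stands in for the shuffle: returns the keys in sorted order. */
    static <K extends Comparable<K>, V> List<K> sortedKeys(List<Map.Entry<K, V>> pairs) {
        List<K> keys = new ArrayList<>();
        for (Map.Entry<K, V> p : pairs) {
            keys.add(p.getKey());
        }
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        // Datum already in the key: the identity map is enough before sorting.
        System.out.println(sortedKeys(datumInKey("pear", "apple", "fig")));
        // Datum in the value: swap first, then sort.
        System.out.println(sortedKeys(inverse(datumInValue("pear", "apple", "fig"))));
    }
}
```

Both calls in `main` yield the same sorted data; the only difference is the extra swap step needed when the datum arrives in the value.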
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790583#action_12790583 ]

Jacob Rideout commented on MAPREDUCE-815:

What is the current line of thought on how keys and values will interact with the schema for an avro file? Is the intention that there would be a master schema that encapsulated the key/values similar to:

{code}
{
  "type" : "record",
  "fields" : [
    { "name" : "KEY", "type" : "record" },
    { "name" : "VALUE", "type" : "record" }
  ]
}
{code}

What about files created without this "master" schema; would the key return a null object? Byte offset in a schema of type "long"?
[jira] Commented: (MAPREDUCE-815) Add AvroInputFormat and AvroOutputFormat so that hadoop can use Avro Serialization
[ https://issues.apache.org/jira/browse/MAPREDUCE-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736993#action_12736993 ]

Ravi Gummadi commented on MAPREDUCE-815:

This could have something like

    public class AvroInputFormat extends FileInputFormat {
      @Override
      public RecordReader createRecordReader(InputSplit split,
                                             TaskAttemptContext context) {
        return new AvroRecordReader();
      }
      //...
    }

and

    public class AvroRecordReader extends RecordReader {
      // implements the methods of RecordReader for KEY and VALUE of avro types
    }

Does this look fine?