Re: Output from AVRO mapper
Thanks Russell. This looks like a much easier solution, which I will look at more carefully in the near future. But at this point I don't want to walk away from the Java M/R solution just because I can't work it out. I know it works; I'm just missing something basic in my understanding. -Terry

P.S. Cool font, too.

On 12/21/12 7:42 PM, Russell Jurney wrote:
Re: Output from AVRO mapper
I don't mean to harp, but this is a few lines in Pig:

/* Load Avro jars and define shortcut */
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

/* Load Avros */
input = load 'my.avro' using AvroStorage();

/* Verify input */
describe input;
illustrate input;

/* Convert Avros to JSON */
store input into 'my.json' using com.twitter.elephantbird.pig.store.JsonStorage();
store input into 'my.json.lzo' using com.twitter.elephantbird.pig.store.LzoJsonStorage();

/* Convert simple Avros to TSV */
store input into 'my.tsv';

/* Convert Avros to SequenceFiles */
REGISTER '/path/to/elephant-bird.jar';
store input into 'my.seq' using com.twitter.elephantbird.pig.store.SequenceFileStorage(
    /* example: */
    '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
    '-c com.twitter.elephantbird.pig.util.TextConverter');

/* Convert Avros to Protobufs */
store input into 'input.protobuf' using com.twitter.elephantbird.examples.proto.pig.store.ProfileProtobufB64LinePigStorage();

/* Convert Avros to a Lucene Index */
store input into 'input.lucene' using LuceneIndexStorage('com.example.MyPigLuceneIndexOutputFormat');

There are also drivers for most NoSQLish databases...

Russell Jurney
http://datasyndrome.com

On Dec 20, 2012, at 9:33 AM, Terry Healy the...@bnl.gov wrote:
Output from AVRO mapper
I'm just getting started using AVRO within Map/Reduce and trying to convert some existing non-AVRO code to use AVRO input. So far the data that previously was stored in tab-delimited files has been converted to .avro successfully, as checked with avro-tools. Where I'm getting hung up extending beyond my book-based examples is in attempting to read from AVRO (using generic records) where the mapper output is NOT in AVRO format. I can't seem to reconcile extending AvroMapper and NOT using AvroCollector.

Here are snippets of code that show my non-AVRO M/R code and my [failing] attempts to make this change. If anyone can help me along it would be very much appreciated. -Terry

Pre-Avro version (works fine with .tsv input format):

public static class HdFlowMapper extends MapReduceBase
        implements Mapper<Text, HdFlowWritable, LongPair, HdFlowWritable> {

    @Override
    public void map(Text key, HdFlowWritable value,
            OutputCollector<LongPair, HdFlowWritable> output,
            Reporter reporter) throws IOException {
        // ...
        outKey = new LongPair(value.getSrcIp(), value.getFirst());
        HdFlowWritable outValue = value; // pass it all through
        output.collect(outKey, outValue);
    }
}

AVRO attempt:

conf.setOutputFormat(TextOutputFormat.class);
conf.setOutputKeyClass(LongPair.class);
conf.setOutputValueClass(AvroFlowWritable.class);

SCHEMA = new Schema.Parser().parse(NetflowSchema);
AvroJob.setInputSchema(conf, SCHEMA);
//AvroJob.setOutputSchema(conf, SCHEMA);
AvroJob.setMapperClass(conf, AvroFlowMapper.class);
AvroJob.setReducerClass(conf, AvroFlowReducer.class);

public static class AvroFlowMapper<K> extends AvroMapper<K, OutputCollector> {

    @Override // ** IDE: "Method does not override or implement a method from a supertype"
    public void map(K datum, OutputCollector<LongPair, AvroFlowWritable> collector,
            Reporter reporter) throws IOException {
        GenericRecord record = (GenericRecord) datum;
        afw = new AvroFlowWritable(record);
        // ...
        collector.collect(outKey, afw);
    }
}
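The IDE error above falls out of the AvroMapper contract: its map method takes an AvroCollector, so a map signature declaring an OutputCollector parameter never overrides it. A common workaround in the old mapred API (a sketch under assumptions, not code from this thread) is to skip AvroMapper entirely: set AvroInputFormat as the job's input format and write a plain Mapper that receives each record as an AvroWrapper<GenericRecord> key with a NullWritable value. The field names "srcIp" and "first" are placeholders standing in for the actual Netflow schema; Hadoop and Avro jars are assumed on the classpath.

```java
// Sketch: plain Mapper over Avro input, non-Avro (LongPair/AvroFlowWritable) output.
public static class AvroFlowMapper extends MapReduceBase
        implements Mapper<AvroWrapper<GenericRecord>, NullWritable,
                          LongPair, AvroFlowWritable> {

    @Override
    public void map(AvroWrapper<GenericRecord> wrapper, NullWritable ignored,
            OutputCollector<LongPair, AvroFlowWritable> output,
            Reporter reporter) throws IOException {
        GenericRecord record = wrapper.datum();
        // "srcIp" and "first" are assumed field names for illustration only.
        LongPair outKey = new LongPair((Long) record.get("srcIp"),
                                       (Long) record.get("first"));
        output.collect(outKey, new AvroFlowWritable(record));
    }
}

// Driver side: only the input is Avro, so only the input schema is set,
// and the mapper is registered the ordinary Hadoop way rather than via AvroJob.
conf.setInputFormat(AvroInputFormat.class);
AvroJob.setInputSchema(conf, SCHEMA);
conf.setMapperClass(AvroFlowMapper.class);      // not AvroJob.setMapperClass
conf.setMapOutputKeyClass(LongPair.class);
conf.setMapOutputValueClass(AvroFlowWritable.class);
```

With this arrangement the reducer can stay a plain Hadoop Reducer as well; AvroMapper/AvroCollector are only needed when the map output itself is Avro-serialized.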