Re: Output from AVRO mapper

2012-12-22 Thread Terry Healy

  
  
Thanks Russell. This looks like a much easier solution, which I will
look at more carefully in the near future. But at this point I don't
want to walk away from the Java M/R solution just because I can't
work it out. I know it works - I just am missing something basic in
my understanding.

-Terry

P.S. Cool font too.

On 12/21/12 7:42 PM, Russell Jurney wrote:

I don't mean to harp, but this is a few lines in Pig: [...]
On Dec 20, 2012, at 9:33 AM, Terry Healy the...@bnl.gov wrote:

I'm just getting started using AVRO within Map/Reduce and trying to
convert some existing non-AVRO code to use AVRO input. [...]

Re: Output from AVRO mapper

2012-12-21 Thread Russell Jurney
I don't mean to harp, but this is a few lines in Pig:

/* Load Avro jars and define shortcut */
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

/* Load Avros */
input = load 'my.avro' using AvroStorage();

/* Verify input */
describe input;
illustrate input;

/* Convert Avros to JSON */
store input into 'my.json' using
com.twitter.elephantbird.pig.store.JsonStorage();
store input into 'my.json.lzo' using
com.twitter.elephantbird.pig.store.LzoJsonStorage();

/* Convert simple Avros to TSV */
store input into 'my.tsv';

/* Convert Avros to SequenceFiles */
register '/path/to/elephant-bird.jar';
store input into 'my.seq' using
com.twitter.elephantbird.pig.store.SequenceFileStorage(
    /* example: */
    '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
    '-c com.twitter.elephantbird.pig.util.TextConverter'
);

/* Convert Avros to Protobufs */
store input into 'input.protobuf' using
com.twitter.elephantbird.examples.proto.pig.store.ProfileProtobufB64LinePigStorage();

/* Convert Avros to a Lucene Index */
store input into 'input.lucene' using
LuceneIndexStorage('com.example.MyPigLuceneIndexOutputFormat');

There are also drivers for most NoSQLish databases...

Russell Jurney http://datasyndrome.com

On Dec 20, 2012, at 9:33 AM, Terry Healy the...@bnl.gov wrote:

I'm just getting started using AVRO within Map/Reduce and trying to
convert some existing non-AVRO code to use AVRO input. [...]


Output from AVRO mapper

2012-12-20 Thread Terry Healy
I'm just getting started using AVRO within Map/Reduce and trying to
convert some existing non-AVRO code to use AVRO input. So far, the data
that was previously stored in tab-delimited files has been converted to
.avro successfully, as checked with avro-tools.

Where I'm getting hung up, going beyond my book-based examples, is in
attempting to read from AVRO (using generic records) where the mapper
output is NOT in AVRO format. I can't seem to reconcile extending
AvroMapper with NOT using AvroCollector.

Here are snippets of code that show my non-AVRO M/R code and my
[failing] attempts to make this change. If anyone can help me along it
would be very much appreciated.

-Terry

code
Pre-Avro version: (Works fine with .tsv input format)

    public static class HdFlowMapper extends MapReduceBase
            implements Mapper<Text, HdFlowWritable, LongPair, HdFlowWritable> {

        @Override
        public void map(Text key, HdFlowWritable value,
                OutputCollector<LongPair, HdFlowWritable> output,
                Reporter reporter) throws IOException {

            // ...
            outKey = new LongPair(value.getSrcIp(), value.getFirst());

            HdFlowWritable outValue = value; // pass it all through
            output.collect(outKey, outValue);
        }



AVRO attempt:


conf.setOutputFormat(TextOutputFormat.class);
conf.setOutputKeyClass(LongPair.class);
conf.setOutputValueClass(AvroFlowWritable.class);

SCHEMA = new Schema.Parser().parse(NetflowSchema);
AvroJob.setInputSchema(conf, SCHEMA);
//AvroJob.setOutputSchema(conf, SCHEMA);
AvroJob.setMapperClass(conf, AvroFlowMapper.class);
AvroJob.setReducerClass(conf, AvroFlowReducer.class);



    public static class AvroFlowMapper<K> extends AvroMapper<K,
            OutputCollector> {

        @Override
        // ** IDE: "Method does not override or implement a method from a supertype"
        public void map(K datum, OutputCollector<LongPair,
                AvroFlowWritable> collector, Reporter reporter) throws IOException {

            GenericRecord record = (GenericRecord) datum;
            afw = new AvroFlowWritable(record);
            // ...
            collector.collect(outKey, afw);
        }

/code
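
[Editor's note: one way to reconcile Avro input with non-Avro map output under the old org.apache.avro.mapred API is to skip AvroMapper entirely: AvroInputFormat delivers each record as an AvroWrapper<GenericRecord> key with a NullWritable value, so a plain Hadoop Mapper works. The sketch below assumes the LongPair and AvroFlowWritable classes from the snippets above; the "srcIp" and "first" field names are guesses inferred from the getters and are illustrative only, not a drop-in fix.]

```java
// Sketch: Avro input, non-Avro intermediate output, old-style mapred API.
// LongPair, AvroFlowWritable, and the record field names are assumptions
// taken from the snippets in this thread.
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroInputFormat;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class AvroFlowMapper extends MapReduceBase
        implements Mapper<AvroWrapper<GenericRecord>, NullWritable,
                          LongPair, AvroFlowWritable> {

    @Override
    public void map(AvroWrapper<GenericRecord> wrapper, NullWritable ignore,
            OutputCollector<LongPair, AvroFlowWritable> output,
            Reporter reporter) throws IOException {
        // Unwrap the Avro record and emit ordinary Writables.
        GenericRecord record = wrapper.datum();
        AvroFlowWritable afw = new AvroFlowWritable(record);
        LongPair outKey = new LongPair((Long) record.get("srcIp"),   // hypothetical field
                                       (Long) record.get("first"));  // hypothetical field
        output.collect(outKey, afw);
    }
}

// Driver side: only the *input* is Avro, so use the plain Hadoop setters
// rather than AvroJob.setMapperClass:
//     AvroJob.setInputSchema(conf, SCHEMA);
//     conf.setInputFormat(AvroInputFormat.class);
//     conf.setMapperClass(AvroFlowMapper.class);
//     conf.setMapOutputKeyClass(LongPair.class);
//     conf.setMapOutputValueClass(AvroFlowWritable.class);
```

This is a framework fragment, not a standalone program; it needs the Avro and Hadoop jars plus the user classes from the thread to compile.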