Does Avro GenericData.Record violate the .equals contract?

2012-02-09 Thread Andrew Kenworthy
Hallo,

I'm working with avro as the serialization framework for my hadoop map-reduce
jobs, and am emitting GenericRecord/null as my K/V values from my mapper
classes. Having looked at the code, I see that the key objects (i.e. my
records) are only recognised as distinct by my reducer if the .equals() method
called on the records shows a distinction. However, if the schema is the same
(which it is for most of my mappers), then .equals() calls .compare(), which in
turn depends on the ORDER attributes set on the fields. This means that if I
have no sort order defined in my schema, all records are treated as equal to
one another. Have I understood this correctly, and if so, is that not a
violation of the equals contract? (For one thing, it would mean GenericRecord
objects will often cause confusion when used with maps and other containers.)
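
A minimal sketch of the sharpest case (my assumptions: Avro 1.6.x, with the
field explicitly marked "order": "ignore" so that compare() skips its value
entirely):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;

    public class EqualsContractSketch {
        public static void main(String[] args) {
            // Single int field whose order attribute tells compare() to skip it.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
                + "[{\"name\":\"id\",\"type\":\"int\",\"order\":\"ignore\"}]}");

            GenericData.Record a = new GenericData.Record(schema);
            a.put("id", 1);
            GenericData.Record b = new GenericData.Record(schema);
            b.put("id", 2);

            // equals() delegates to compare(), which skips ignored fields,
            // so two records holding different values report as equal here.
            System.out.println(a.equals(b));
        }
    }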


Regards,

Andrew

Avro Map-Reduce and ChainMapper

2012-02-01 Thread Andrew Kenworthy
Hallo,

Is it possible to chain Avro MR jobs using the ChainMapper? I'm looking to 
chain two map tasks and a reducer, but haven't been able to find any examples:

Chain summary:
a) first map task: takes non-avro input and produces K/V output in the form of
AvroKey<Record>, NullWritable
b) second map task: taking the output of the first task as its input [mapper
extends AvroMapper<Record, Pair<Record, NullWritable>>]
c) reducer: AvroReducer

In particular, how would I specify the input and output schemas - simply 
calling AvroJob.setInputSchema/setOutputSchema on the individual chained job 
conf objects?
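
For reference, here is how those two calls look on a single (non-chained)
avro-mapred job (a sketch; whether simply repeating them on each chained
link's conf works is exactly the open question):

    import org.apache.avro.Schema;
    import org.apache.avro.mapred.AvroJob;
    import org.apache.avro.mapred.Pair;
    import org.apache.hadoop.mapred.JobConf;

    public class SchemaConfigSketch {
        public static void configure(JobConf conf, Schema recordSchema) {
            AvroJob.setInputSchema(conf, recordSchema);
            // Output schema Pair(Record, null), matching step (b) above.
            AvroJob.setOutputSchema(conf,
                Pair.getPairSchema(recordSchema,
                                   Schema.create(Schema.Type.NULL)));
        }
    }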

Thanks,

Andrew

How does Avro mark (string) field delimitation?

2012-01-23 Thread Andrew Kenworthy
I have looked at the Avro 1.6.0 code and am not sure how Avro distinguishes 
between field boundaries when reading null values.

The BinaryEncoder class (which is where I land when debugging my code) has an 
empty method for writeNull: how does the parser then distinguish between 
adjacent nullable fields when reading that data?

Thanks in advance,

Andrew

Re: How does Avro mark (string) field delimitation?

2012-01-23 Thread Andrew Kenworthy
I don't have a specific use-case that is problematic, but was trying to 
understand how it all works internally. Following your comment about indexes I 
looked in GenericDatumWriter and sure enough the union is tagged so we know 
which part of the union was written:

case UNION:
        int index = data.resolveUnion(schema, datum);
        out.writeIndex(index);
        write(schema.getTypes().get(index), datum, out);
        break;

That's the bit I was missing! Thanks for the input.
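
For the archives, a small end-to-end sketch of the same point (assuming Avro
1.6.x): encoding a null and a string against the union ["null","string"]
shows that the branch index is all that marks the null.

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.util.Utf8;

    public class UnionIndexSketch {
        public static void main(String[] args) throws Exception {
            Schema union = new Schema.Parser().parse("[\"null\",\"string\"]");
            GenericDatumWriter<Object> writer =
                new GenericDatumWriter<Object>(union);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            writer.write(null, enc);          // branch index 0 only: one byte
            writer.write(new Utf8("x"), enc); // index 1, then length + bytes
            enc.flush();
            System.out.println(out.size());   // 4 bytes in total
        }
    }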

Andrew




 From: Harsh J ha...@cloudera.com
To: user@avro.apache.org; Andrew Kenworthy adwkenwor...@yahoo.com 
Sent: Monday, January 23, 2012 4:04 PM
Subject: Re: How does Avro mark (string) field delimitation?
 
The read part is empty as well, when the decoder is asked to read a
'null' type. For null-carrying unions, I believe an index is written
out, so if the index evaluates to a null, the same logic works yet again.

Does not matter if there are two nulls adjacent to one another,
therefore. How do you imagine this ends up being a problem? What
trouble are you running into?

On Mon, Jan 23, 2012 at 8:08 PM, Andrew Kenworthy
adwkenwor...@yahoo.com wrote:
 I have looked at the Avro 1.6.0 code and am not sure how Avro distinguishes
 between field boundaries when reading null values.

 The BinaryEncoder class (which is where I land when debugging my code) has
 an empty method for writeNull: how does the parser then distinguish between
 adjacent nullable fields when reading that data?

 Thanks in advance,

 Andrew



-- 
Harsh J
Customer Ops. Engineer, Cloudera




Re: Collecting union-ed Records in AvroReducer

2011-12-13 Thread Andrew Kenworthy
Thank you, Scott. That has cleared up some misunderstanding on my part. I want 
to emit both records as a Pair,
and have now implemented that by using a Record schema holding two sub-records, 
one for type A and one for type B,
so I can just write the relevant datum to the correct sub-record, which gives 
me exactly what I need.

Andrew




 From: Scott Carey scottca...@apache.org
To: user@avro.apache.org user@avro.apache.org; Andrew Kenworthy 
adwkenwor...@yahoo.com 
Sent: Thursday, December 8, 2011 6:45 PM
Subject: Re: Collecting union-ed Records in AvroReducer
 


On 12/8/11 4:10 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote:


Hallo,

is it possible to write/collect a union-ed record from an avro reducer?

I have a reduce class (extending AvroReducer), and the output schema is a
union schema of record type A and record type B. In the reduce logic I
want to combine instances of A and B in the same datum, passing it to my
AvroCollector. My code looks a bit like this:




If both records were created in the reducer, you can call collect twice,
once with each record.  Collect in general can be called as many times as
you wish.

If you want to combine two records into a single datum rather than emit
multiple datums, you do not want a union, you need a Record. A union is a
single datum that may be only one of its branches at a time.

In short, do you want to emit both records individually or as a pair? If
it is a pair, you need a Record; if it is multiple outputs or either/or,
it is a Union.




Record unionRecord = new GenericData.Record(myUnionSchema); // not legal!
unionRecord.put("type A", recordA);
unionRecord.put("type B", recordB);

collector.collect(unionRecord);

but the GenericData.Record constructor expects a record schema. How can I
write both records such that they appear in the same output datum?

If your output is either one type or another, see Doug's answer.

For multiple datums, the output schema is a union of the two records (a datum
is either one or the other):

["RecordA", "RecordB"]

then the code is:

collector.collect(recordA);
collector.collect(recordB);


If you want a single datum that contains both a RecordA and a RecordB you
need to have your output schema be a Record with two fields:

{"type": "record", "name": "RecordPair", "fields": [
  {"name": "recordA", "type": "RecordA"},
  {"name": "recordB", "type": "RecordB"}
]}

And you would use this record schema to create the GenericRecord, and then
populate the fields with the inner records, then call collect once with
the outer record.
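
A sketch of that wiring (assuming the schema above, with the made-up name
"RecordPair" added since a record schema requires one; recordA/recordB are
the already-built inner records):

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroCollector;

    public class PairEmitSketch {
        static void emitPair(Schema outerSchema, GenericRecord recordA,
                             GenericRecord recordB,
                             AvroCollector<GenericRecord> collector)
                throws IOException {
            // One outer datum carrying both inner records.
            GenericRecord outer = new GenericData.Record(outerSchema);
            outer.put("recordA", recordA);
            outer.put("recordB", recordB);
            collector.collect(outer);
        }
    }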

Another choice is to make the output an avro array of the union type,
which may hold any number of RecordA and RecordB instances in a single datum.


Andrew






Re: Reduce-side joins in Avro M/R

2011-12-13 Thread Andrew Kenworthy
I'm currently using a UNION schema to map two different types of data (read
from two different input paths) to a common record in my reducer. This works
fine, but - if I have understood the mechanism correctly - it means that
Avro has to check each and every record against my UNION schema. With a
normal reduce-side join, I could use MultipleInputs to specify a mapper for
each input, thus letting them run independently (since each mapper knows its
input) with presumably less overhead.


Is it possible with Avro to avoid the overhead of checking each input row 
against the union schema?

Thanks,

Andrew




 From: Scott Carey scottca...@apache.org
To: user@avro.apache.org user@avro.apache.org; Andrew Kenworthy 
adwkenwor...@yahoo.com 
Sent: Wednesday, December 7, 2011 7:40 PM
Subject: Re: Reduce-side joins in Avro M/R
 

This should be conceptually the same as a normal map-reduce join of the same 
type.  Avro handles the serialization, but not the map-reduce algorithm or 
strategy.   
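
As a rough shape of the standard approach (a sketch only: the schemas and
the "joinKey" field are assumptions, and the actual join logic is elided),
each mapper keys its records by the join field and the reducer then sees
all records sharing a key together:

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroCollector;
    import org.apache.avro.mapred.AvroMapper;
    import org.apache.avro.mapred.AvroReducer;
    import org.apache.avro.mapred.Pair;
    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper: tag each record with its join key.
    class JoinMapper extends AvroMapper<GenericRecord, Pair<Utf8, GenericRecord>> {
        @Override
        public void map(GenericRecord datum,
                        AvroCollector<Pair<Utf8, GenericRecord>> collector,
                        Reporter reporter) throws IOException {
            Utf8 key = (Utf8) datum.get("joinKey"); // hypothetical field
            collector.collect(new Pair<Utf8, GenericRecord>(
                key, Schema.create(Schema.Type.STRING),
                datum, datum.getSchema()));
        }
    }

    // Reducer: all records for one join key arrive in a single call.
    class JoinReducer extends AvroReducer<Utf8, GenericRecord, GenericRecord> {
        @Override
        public void reduce(Utf8 key, Iterable<GenericRecord> values,
                           AvroCollector<GenericRecord> collector,
                           Reporter reporter) throws IOException {
            for (GenericRecord r : values) {
                collector.collect(r); // replace with real join/merge logic
            }
        }
    }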

On 12/6/11 8:43 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote:


Hi,


I'd like to use reduce-side joins in an avro M/R job, and am not sure how to 
do it: are there any best-practice tips or outlines of what one would have to 
implement in order to make this possible?


Thanks,


Andrew Kenworthy



Collecting union-ed Records in AvroReducer

2011-12-08 Thread Andrew Kenworthy
Hallo,

is it possible to write/collect a union-ed record from an avro reducer?

I have a reduce class (extending AvroReducer), and the output schema is a union 
schema of record type A and record type B. In the reduce logic I want to 
combine instances of A and B in the same datum, passing it to my AvroCollector. 
My code looks a bit like this:

Record unionRecord = new GenericData.Record(myUnionSchema); // not legal!
unionRecord.put("type A", recordA);
unionRecord.put("type B", recordB);

collector.collect(unionRecord);

but the GenericData.Record constructor expects a record schema. How can I write
both records such that they appear in the same output datum?

Andrew

Reduce-side joins in Avro M/R

2011-12-06 Thread Andrew Kenworthy
Hi,

I'd like to use reduce-side joins in an avro M/R job, and am not sure how to do 
it: are there any best-practice tips or outlines of what one would have to 
implement in order to make this possible?

Thanks,

Andrew Kenworthy

Re: Records inside records

2011-12-05 Thread Andrew Kenworthy
Hi Nanda,

If you are in a java environment you can test this and similar scenarios in a 
JUnit test using the Schema.Parser object. Here's an example:

    @Test
    public void testNestedRecordFromString() {
        String json = "{\"type\" : \"record\",\"name\": \"TYPE_A\",\"fields\" : "
                + "[{\"name\": \"one\", \"type\": {\"type\": \"record\", "
                + "\"name\": \"TYPE_B\",\"fields\" : "
                + "[ {\"name\" : \"inside_one\",\"type\" : \"string\"}]}}]}";
        Schema schema = new Schema.Parser().parse(json);
        assertTrue(schema.getFields().get(0).schema()
                .getFields().get(0).name().equalsIgnoreCase("inside_one"));
    }

This should be OK in avro (the test above passes for me), but will not
work with the Avro storage package for pig (see the limitations described here:
https://cwiki.apache.org/confluence/display/PIG/AvroStorage).

Andrew




 From: nanda gaurav...@gmail.com
To: user@avro.apache.org 
Sent: Monday, December 5, 2011 12:45 PM
Subject: Records inside records
 
Hi,

Is it possible to generate following kind of data object:

{
'type' : 'record',
'name': 'TYPE_A',
'fields' : [
       {'name': 'one', 'type': {'type': 'record', 'name': 'TYPE_B',
'fields' : [ {'name' : 'inside_one',
                          'type' : 'string'}
                        ]
                   }}
      ]
}


Basically my requirement is to send timely updates from the server to
clients (in various languages), which might look something like this
(a dynamic map):

{'field_1' : value_1_type_int,
'field_2' : value_2_type_string,
'field_3' : {'field_4' : value_4_long, 'field_5' : {another map..}}
}

Why I am inclined to use avro is because I never know in advance what the
message structure of an update is going to be; it can be any number of
fields with any depth of nesting.
'Record' seems to be a viable option here, but I am not sure how I can use a
nested structure here.
Could someone please help?

Thanks,
Gaurav Nanda






Re: Exposing a constant in an Avro Schema

2011-11-15 Thread Andrew Kenworthy
Hi Scott,

it's the latter I'm after: simply the ability to pass metadata along with my
schema, so the user property is just what I need.

Thanks for your help!

Andrew




From: Scott Carey scottca...@apache.org
To: user@avro.apache.org user@avro.apache.org; Andrew Kenworthy 
adwkenwor...@yahoo.com
Sent: Monday, November 14, 2011 9:09 PM
Subject: Re: Exposing a constant in an Avro Schema


Named types (records, fields, fixed, enum) can store arbitrary user properties
attached to the schema (similar to doc, but with no special meaning).


Do you want this constant to be in every instance of your data object? If so,
the enum is one way to do it.
If you simply want to push metadata along with the schema, use the schema
properties; they are name-value pairs. For example you can have a myVersion
property attached to your schema for a record:


{"type": "record", "name": "bar.baz.FooRecord", "myVersion": "1.1", "fields": [
    {"name": "field1", "type": "int"},
    …
  ]
}
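
Reading it back afterwards is a single call on the parsed schema (sketch;
getProp returns the property's string value, or null when it is absent):

    import org.apache.avro.Schema;

    public class SchemaPropSketch {
        static String readVersion(String schemaJson) {
            Schema schema = new Schema.Parser().parse(schemaJson);
            // Custom properties travel with the schema itself.
            return schema.getProp("myVersion"); // "1.1" for the schema above
        }
    }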

On 11/14/11 8:03 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote:


Hi,


I would like to embed a schema version number in the schema that I use for 
writing data: it would be read-only so that I can determine later on which 
version of my avro schema was used. The best I could come up with is to 
(ab)use an enum with a single value like this, as I couldn't find any way to 
define a constant:


{"type": "enum", "name": "version_1_1",
 "doc": "enum indicating avro write schema version 1.1",
 "symbols": ["VERSION_1_1"]}


Is there a better way to register a constant value that has no meaning within 
the avro data file, other than to expose some kind of meta information?


Thanks,


Andrew Kenworthy





Exposing a constant in an Avro Schema

2011-11-14 Thread Andrew Kenworthy
Hi,

I would like to embed a schema version number in the schema that I use for 
writing data: it would be read-only so that I can determine later on which 
version of my avro schema was used. The best I could come up with is to (ab)use 
an enum with a single value like this, as I couldn't find any way to define a 
constant:

{"type": "enum", "name": "version_1_1",
 "doc": "enum indicating avro write schema version 1.1",
 "symbols": ["VERSION_1_1"]}

Is there a better way to register a constant value that has no meaning within 
the avro data file, other than to expose some kind of meta information?
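
One other possibility, sketched below on the assumption that the data is
written as Avro data files: the file container itself accepts arbitrary
key/value metadata via DataFileWriter.setMeta, independent of the schema.

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class FileMetaSketch {
        static DataFileWriter<GenericRecord> open(Schema schema, File out)
                throws IOException {
            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(
                    new GenericDatumWriter<GenericRecord>(schema));
            // File-level metadata; must be set before create().
            writer.setMeta("myVersion", "1.1");
            return writer.create(schema, out);
        }
    }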

Thanks,

Andrew Kenworthy


Re: ThriftDatumReader and null values (Tag 1.6.0-rc0)

2011-11-10 Thread Andrew Kenworthy
Thanks for the quick response and fix!

Andrew




From: Doug Cutting cutt...@apache.org
To: user@avro.apache.org
Sent: Friday, October 28, 2011 8:36 PM
Subject: Re: ThriftDatumReader and null values (Tag 1.6.0-rc0)

This looks like a bug.

I have a proposed fix in https://issues.apache.org/jira/browse/AVRO-948.

Doug

On 10/28/2011 12:59 AM, Andrew Kenworthy wrote:
 Hallo,
 
 I'm trying out the latest Avro tag (1.6.0-rc0) as the new
 ThriftDatumReader/Writer classes look really interesting (we currently
 receive thrift files as input for our hadoop jobs and would like to
 convert them to avro format as early as possible, and then use avro
 (de-)serialisation throughout our job stack).
 
 I have tried out the test case (TestThrift) and it works fine until I
 comment out the line:
 
 test.setStringField("foo");
 
 at which point the test fails as null values don't seem to be allowed.
 Is this intentional or is there something basic that I have not
 understood?  
 
 Thanks,
 
 Andrew Kenworthy




Avro-mapred and new Java MapReduce API (org.apache.hadoop.mapreduce)

2011-11-10 Thread Andrew Kenworthy
Hi,

I see that the avro-mapred classes (AvroMapper, AvroInputFormat etc.) work 
against the old mapreduce API (org.apache.hadoop.mapred). Are there plans to 
extend it to work with org.apache.hadoop.mapreduce as well?

Thanks,

Andrew

ThriftDatumReader and null values (Tag 1.6.0-rc0)

2011-10-28 Thread Andrew Kenworthy
Hallo,

I'm trying out the latest Avro tag (1.6.0-rc0) as the new 
ThriftDatumReader/Writer classes look really interesting (we currently receive 
thrift files as input for our hadoop jobs and would like to convert them to 
avro format as early as possible, and then use avro (de-)serialisation 
throughout our job stack).

I have tried out the test case (TestThrift) and it works fine until I comment 
out the line:

test.setStringField("foo");


at which point the test fails as null values don't seem to be allowed. Is this 
intentional or is there something basic that I have not understood?  

Thanks,

Andrew Kenworthy