Does Avro GenericData.Record violate the .equals contract?
Hello, I'm working with Avro as the serialization framework for my Hadoop map-reduce jobs, and am emitting GenericRecord/null as the K/V values from my mapper classes. Having looked at the code, I see that the key objects (i.e. my records) are only recognised as distinct by my reducer if the .equals() method called on the records shows a distinction. However, if the schema is the same (which it is for most of my mappers), then .equals() calls .compare(), which in turn depends on the "order" attributes set on the fields. This means that if I have no sort order defined in my schema, all records are treated as equal to one another. Have I understood this correctly, and if so, is that not a violation of the equals contract? (For one thing, it would mean GenericRecord objects will often cause confusion when used with maps and other containers.) Regards, Andrew
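A stdlib-only sketch of the concern raised above (plain Python rather than the Avro Java API; the class, field names, and "ignore" ordering are illustrative stand-ins for Avro's field "order" attribute, not the library itself). If equals() delegates to an order-aware compare() that skips every field, two records with different payloads compare equal, which is exactly the behaviour that confuses hash-based containers:

```python
# Illustrative sketch only: mimics an equals() defined via a
# field-order-aware compare(). Not the Avro implementation.

class Record:
    def __init__(self, orders, values):
        self.orders = orders    # field name -> "ascending" | "ignore"
        self.values = values

    def compare(self, other):
        for field, order in self.orders.items():
            if order == "ignore":       # ordered comparison skips this field
                continue
            a, b = self.values[field], other.values[field]
            if a != b:
                return -1 if a < b else 1
        return 0                        # no comparable fields -> "equal"

    def __eq__(self, other):            # equals() delegates to compare()
        return self.compare(other) == 0

    def __hash__(self):
        return 0  # constant hash, consistent with __eq__ for the sketch

orders = {"name": "ignore"}             # no sortable fields in the schema
r1 = Record(orders, {"name": "alice"})
r2 = Record(orders, {"name": "bob"})

print(r1 == r2)       # True: distinct payloads compare "equal"
print(len({r1, r2}))  # 1: a set collapses the two records into one entry
```

The point of the sketch is the last two lines: any container keyed on such records (a reducer's grouping, a HashMap) sees one record where there are really two.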
Avro Map-Reduce and ChainMapper
Hello, Is it possible to chain Avro MR jobs using the ChainMapper? I'm looking to chain two map tasks and a reducer, but haven't been able to find any examples. Chain summary: a) first map task: takes non-Avro input and produces K/V output in the form of AvroKey<Record>, NullWritable; b) second map task: takes the output of the first task as its input (the mapper extends AvroMapper<Record, Pair<Record, NullWritable>>); c) reducer: AvroReducer. In particular, how would I specify the input and output schemas - simply by calling AvroJob.setInputSchema/setOutputSchema on the individual chained job conf objects? Thanks, Andrew
How does Avro mark (string) field delimitation?
I have looked at the Avro 1.6.0 code and am not sure how Avro distinguishes between field boundaries when reading null values. The BinaryEncoder class (which is where I land when debugging my code) has an empty method for writeNull: how does the parser then distinguish between adjacent nullable fields when reading that data? Thanks in advance, Andrew
Re: How does Avro mark (string) field delimitation?
I don't have a specific use case that is problematic; I was trying to understand how it all works internally. Following your comment about indexes I looked in GenericDatumWriter, and sure enough the union is tagged so we know which branch of the union was written: case UNION: int index = data.resolveUnion(schema, datum); out.writeIndex(index); write(schema.getTypes().get(index), datum, out); break; That's the bit I was missing! Thanks for the input. Andrew From: Harsh J ha...@cloudera.com To: user@avro.apache.org; Andrew Kenworthy adwkenwor...@yahoo.com Sent: Monday, January 23, 2012 4:04 PM Subject: Re: How does Avro mark (string) field delimitation? The read part is empty as well, when the decoder is asked to read a 'null' type. For null-carrying unions, I believe an index is written out, so if the index evals to a null, the same logic works yet again. It therefore does not matter if there are two nulls adjacent to one another. How do you imagine this ends up being a problem? What trouble are you running into? On Mon, Jan 23, 2012 at 8:08 PM, Andrew Kenworthy adwkenwor...@yahoo.com wrote: I have looked at the Avro 1.6.0 code and am not sure how Avro distinguishes between field boundaries when reading null values. The BinaryEncoder class (which is where I land when debugging my code) has an empty method for writeNull: how does the parser then distinguish between adjacent nullable fields when reading that data? Thanks in advance, Andrew -- Harsh J Customer Ops. Engineer, Cloudera
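The index mechanism discussed above can be seen at the byte level. Per the Avro specification, a union is encoded as the zero-based branch index written as an Avro long (zigzag then varint), followed by the branch's value; null itself contributes zero bytes. A stdlib-only Python sketch of that wire format (not the Avro library):

```python
def encode_long(n: int) -> bytes:
    """Avro long: zigzag-encode, then emit as a little-endian 7-bit varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes -> small codes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_union_null_string(value) -> bytes:
    """Encode a datum against the union ["null", "string"]."""
    if value is None:
        return encode_long(0)  # branch 0: index only, null adds no bytes
    data = value.encode("utf-8")
    # branch 1: index, then the string as length-prefixed UTF-8
    return encode_long(1) + encode_long(len(data)) + data

# Two adjacent nullable fields: each null still contributes its index byte,
# so the reader never has to guess where one field ends and the next begins.
print(encode_union_null_string(None).hex())  # 00
print(encode_union_null_string("hi").hex())  # 02046869
```

So even a run of adjacent nulls is unambiguous: every union datum carries its branch index, and the schema (not a delimiter) tells the reader what comes next.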
Re: Collecting union-ed Records in AvroReducer
Thank you, Scott. That has cleared up some misunderstanding on my part. I want to emit both records as a Pair, and have now implemented that by using a Record schema holding two sub-records, one for type A and one for type B, so I can just write the relevant datum to the correct sub-record, which gives me exactly what I need. Andrew From: Scott Carey scottca...@apache.org To: user@avro.apache.org user@avro.apache.org; Andrew Kenworthy adwkenwor...@yahoo.com Sent: Thursday, December 8, 2011 6:45 PM Subject: Re: Collecting union-ed Records in AvroReducer On 12/8/11 4:10 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote: Hello, is it possible to write/collect a union-ed record from an Avro reducer? I have a reduce class (extending AvroReducer), and the output schema is a union schema of record type A and record type B. In the reduce logic I want to combine instances of A and B in the same datum, passing it to my AvroCollector. My code looks a bit like this: If both records were created in the reducer, you can call collect twice, once with each record. Collect in general can be called as many times as you wish. If you want to combine two records into a single datum rather than emit multiple datums, you do not want a union, you need a Record. A union datum may be only one of its branches at a time. In short, do you want to emit both records individually or as a pair? If it is a pair, you need a Record; if it is multiple outputs or either/or, it is a union. Record unionRecord = new GenericData.Record(myUnionSchema); // not legal! unionRecord.put("type A", recordA); unionRecord.put("type B", recordB); collector.collect(unionRecord); but the GenericData.Record constructor expects a Record schema. How can I write both records such that they appear in the same output datum? If your output is either one type or another, see Doug's answer.
For multiple datums, the output schema is a union of two records (a datum is either one or the other): [RecordA, RecordB] and the code is: collector.collect(recordA); collector.collect(recordB); If you want a single datum that contains both a RecordA and a RecordB, you need to have your output schema be a Record with two fields: {"type":"record", "fields":[ {"name":"recordA", "type":"RecordA"}, {"name":"recordB", "type":"RecordB"} ]} You would use this record schema to create the GenericRecord, populate the fields with the inner records, then call collect once with the outer record. Another choice is to make the output an Avro array of the union type, which may hold any number of RecordA and RecordB datums in a single element. Andrew
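The wrapper-record approach from the reply above, sketched with Python's stdlib json module and plain dicts standing in for GenericRecord (the outer name "PairRecord" and the inner datums are hypothetical; "RecordA"/"RecordB" are assumed defined elsewhere):

```python
import json

# Outer record schema with one field per inner record type. Inner schemas
# "RecordA" and "RecordB" are referenced by name, as in the reply above.
pair_schema = json.loads("""
{"type": "record", "name": "PairRecord", "fields": [
    {"name": "recordA", "type": "RecordA"},
    {"name": "recordB", "type": "RecordB"}
]}
""")

record_a = {"id": 1}       # hypothetical RecordA datum
record_b = {"label": "x"}  # hypothetical RecordB datum

# One outer datum carries both inner records, so collect() is called once
# with pair_datum rather than twice with the individual records.
pair_datum = {"recordA": record_a, "recordB": record_b}

field_names = [f["name"] for f in pair_schema["fields"]]
print(field_names)  # ['recordA', 'recordB']
```

This mirrors the distinction in the thread: a union schema [RecordA, RecordB] means either/or per datum, while a record with two fields means both-at-once.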
Re: Reduce-side joins in Avro M/R
I'm currently using a union schema to map two different types of data (read from two different input paths) in my reducer to a common record. This works fine, but - if I have understood the mechanism correctly - it means that Avro has to check each and every record against my union schema. With a normal reduce-side join, I could use MultipleInputs to specify a mapper for each input, letting each run independently (since each mapper knows its input) with presumably less overhead. Is it possible with Avro to avoid the overhead of checking each input row against the union schema? Thanks, Andrew From: Scott Carey scottca...@apache.org To: user@avro.apache.org user@avro.apache.org; Andrew Kenworthy adwkenwor...@yahoo.com Sent: Wednesday, December 7, 2011 7:40 PM Subject: Re: Reduce-side joins in Avro M/R This should be conceptually the same as a normal map-reduce join of the same type. Avro handles the serialization, but not the map-reduce algorithm or strategy. On 12/6/11 8:43 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote: Hi, I'd like to use reduce-side joins in an Avro M/R job, and am not sure how to do it: are there any best-practice tips or outlines of what one would have to implement in order to make this possible? Thanks, Andrew Kenworthy
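For readers unfamiliar with the pattern being discussed, here is the generic reduce-side join shape in a stdlib-only Python sketch (no Avro or Hadoop; the mapper/reducer functions and sample rows are hypothetical). Each "mapper" knows its own input and tags records by source, which is what MultipleInputs gives you, and the "reducer" groups by key and joins the two sides:

```python
from collections import defaultdict

def map_side_a(rows):  # one mapper per input path, as with MultipleInputs
    for key, payload in rows:
        yield key, ("A", payload)  # tag records with their source

def map_side_b(rows):
    for key, payload in rows:
        yield key, ("B", payload)

def reduce_join(tagged):
    """Group tagged records by key, then pair every A with every B."""
    groups = defaultdict(lambda: {"A": [], "B": []})
    for key, (side, payload) in tagged:
        groups[key][side].append(payload)
    for key, sides in sorted(groups.items()):
        for a in sides["A"]:
            for b in sides["B"]:
                yield key, (a, b)

a_rows = [(1, "a1"), (2, "a2")]  # hypothetical sample inputs
b_rows = [(1, "b1")]
joined = list(reduce_join(list(map_side_a(a_rows)) + list(map_side_b(b_rows))))
print(joined)  # [(1, ('a1', 'b1'))]
```

The tag plays the role the union branch plays in the Avro version of the job: it tells the reducer which side each record came from.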
Collecting union-ed Records in AvroReducer
Hello, is it possible to write/collect a union-ed record from an Avro reducer? I have a reduce class (extending AvroReducer), and the output schema is a union schema of record type A and record type B. In the reduce logic I want to combine instances of A and B in the same datum, passing it to my AvroCollector. My code looks a bit like this: Record unionRecord = new GenericData.Record(myUnionSchema); // not legal! unionRecord.put("type A", recordA); unionRecord.put("type B", recordB); collector.collect(unionRecord); but the GenericData.Record constructor expects a Record schema. How can I write both records such that they appear in the same output datum? Andrew
Reduce-side joins in Avro M/R
Hi, I'd like to use reduce-side joins in an avro M/R job, and am not sure how to do it: are there any best-practice tips or outlines of what one would have to implement in order to make this possible? Thanks, Andrew Kenworthy
Re: Records inside records
Hi Nanda, If you are in a Java environment you can test this and similar scenarios in a JUnit test using the Schema.Parser object. Here's an example: @Test public void testNestedRecordFromString() { String json = "{\"type\": \"record\", \"name\": \"TYPE_A\", \"fields\": " + "[{\"name\": \"one\", \"type\": {\"type\": \"record\", \"name\": \"TYPE_B\", \"fields\": " + "[{\"name\": \"inside_one\", \"type\": \"string\"}]}}]}"; Schema schema = new Schema.Parser().parse(json); assertTrue(schema.getFields().get(0).schema().getFields().get(0).name().equalsIgnoreCase("inside_one")); } This should be OK in Avro (the test above passes for me), but will not work with the Avro storage package for Pig (see the limitations described here: https://cwiki.apache.org/confluence/display/PIG/AvroStorage). Andrew From: nanda gaurav...@gmail.com To: user@avro.apache.org Sent: Monday, December 5, 2011 12:45 PM Subject: Records inside records Hi, Is it possible to generate the following kind of data object? { 'type' : 'record', 'name': 'TYPE_A', 'fields' : [ {'name': 'one', 'type': {'type': 'record', 'name': 'TYPE_B', 'fields' : [ {'name' : 'inside_one', 'type' : 'string'} ] }} ] } Basically my requirement is to send timely updates from a server to clients (in various languages), which might look something like (a dynamic map): {'field_1' : value_1_type_int, 'field_2' : value_2_type_string, 'field_3' : {'field_4' : value_4_long, 'field_5' : {another map..}} } Why I am inclined to use Avro is because I never know in advance what the message structure of an update is going to be; it can be any number of fields with any amount of nesting. 'Record' seems to be a viable option here, but I'm not sure how I can use a nested structure here. Could someone please help? Thanks, Gaurav Nanda
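The same check as the JUnit test above, sketched in stdlib-only Python for readers outside a Java environment (the json module is used here in place of Avro's Schema.Parser, so this validates only the JSON shape, not Avro's schema rules):

```python
import json

# Parse the nested-record schema and walk into the inner record's first
# field, mirroring schema.getFields().get(0).schema().getFields().get(0).
schema = json.loads("""
{"type": "record", "name": "TYPE_A", "fields":
  [{"name": "one", "type":
     {"type": "record", "name": "TYPE_B", "fields":
        [{"name": "inside_one", "type": "string"}]}}]}
""")

inner = schema["fields"][0]["type"]  # the nested TYPE_B record schema
print(inner["name"])                 # TYPE_B
print(inner["fields"][0]["name"])    # inside_one
```

Nesting works by putting a full record schema in the "type" position of a field, which can be repeated to any depth.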
Re: Exposing a constant in an Avro Schema
Hi Scott, it's the latter I need; simply the ability to pass meta-data with my schema, so the user property is just what I need. Thanks for your help! Andrew From: Scott Carey scottca...@apache.org To: user@avro.apache.org user@avro.apache.org; Andrew Kenworthy adwkenwor...@yahoo.com Sent: Monday, November 14, 2011 9:09 PM Subject: Re: Exposing a constant in an Avro Schema Named types (records, fields, fixed, enum) can store arbitrary user properties attached to the schema (similar to doc, but with no special meaning). Do you want this constant to be in every instance of your data object? If so, the enum is one way to do it. If you simply want to push metadata along with the schema, use the schema properties; they are name-value pairs. For example you can have myVersion attached to your schema for a record: {"type":"record", "name":"bar.baz.FooRecord", "myVersion":"1.1", "fields": [ {"name":"field1", "type":"int"}, … ] } On 11/14/11 8:03 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote: Hi, I would like to embed a schema version number in the schema that I use for writing data: it would be read-only so that I can determine later on which version of my Avro schema was used. The best I could come up with is to (ab)use an enum with a single value like this, as I couldn't find any way to define a constant: {"type":"enum","name":"version_1_1","doc":"enum indicating avro write schema version 1.1","symbols":["VERSION_1_1"]} Is there a better way to register a constant value that has no meaning within the Avro data file, other than to expose some kind of meta information? Thanks, Andrew Kenworthy
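A quick stdlib-only illustration of the schema-property approach from this thread: the metadata travels as an extra name/value pair on the record schema and is readable without touching the data at all. Here the json module stands in for an Avro schema parser, and "myVersion" is the example property from the reply above:

```python
import json

# A record schema carrying a user property ("myVersion") alongside the
# reserved attributes; unknown pairs ride along with the schema.
schema = json.loads("""
{"type": "record", "name": "bar.baz.FooRecord", "myVersion": "1.1",
 "fields": [{"name": "field1", "type": "int"}]}
""")

print(schema["myVersion"])  # 1.1
print(schema["name"])       # bar.baz.FooRecord
```

Unlike the single-value-enum workaround, the property adds nothing to each serialized datum; it lives only in the schema text stored with the file.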
Exposing a constant in an Avro Schema
Hi, I would like to embed a schema version number in the schema that I use for writing data: it would be read-only so that I can determine later on which version of my Avro schema was used. The best I could come up with is to (ab)use an enum with a single value like this, as I couldn't find any way to define a constant: {"type":"enum","name":"version_1_1","doc":"enum indicating avro write schema version 1.1","symbols":["VERSION_1_1"]} Is there a better way to register a constant value that has no meaning within the Avro data file, other than to expose some kind of meta information? Thanks, Andrew Kenworthy
Re: ThriftDatumReader and null values (Tag 1.6.0-rc0)
Thanks for the quick response and fix! Andrew From: Doug Cutting cutt...@apache.org To: user@avro.apache.org Sent: Friday, October 28, 2011 8:36 PM Subject: Re: ThriftDatumReader and null values (Tag 1.6.0-rc0) This looks like a bug. I have a proposed fix in https://issues.apache.org/jira/browse/AVRO-948. Doug On 10/28/2011 12:59 AM, Andrew Kenworthy wrote: Hello, I'm trying out the latest Avro tag (1.6.0-rc0), as the new ThriftDatumReader/Writer classes look really interesting (we currently receive Thrift files as input for our Hadoop jobs and would like to convert them to Avro format as early as possible, and then use Avro (de-)serialisation throughout our job stack). I have tried out the test case (TestThrift) and it works fine until I comment out the line: test.setStringField("foo"); at which point the test fails, as null values don't seem to be allowed. Is this intentional, or is there something basic that I have not understood? Thanks, Andrew Kenworthy
Avro-mapred and new Java MapReduce API (org.apache.hadoop.mapreduce)
Hi, I see that the avro-mapred classes (AvroMapper, AvroInputFormat, etc.) work against the old MapReduce API (org.apache.hadoop.mapred). Are there plans to extend them to work with org.apache.hadoop.mapreduce as well? Thanks, Andrew
ThriftDatumReader and null values (Tag 1.6.0-rc0)
Hello, I'm trying out the latest Avro tag (1.6.0-rc0), as the new ThriftDatumReader/Writer classes look really interesting (we currently receive Thrift files as input for our Hadoop jobs and would like to convert them to Avro format as early as possible, and then use Avro (de-)serialisation throughout our job stack). I have tried out the test case (TestThrift) and it works fine until I comment out the line: test.setStringField("foo"); at which point the test fails, as null values don't seem to be allowed. Is this intentional, or is there something basic that I have not understood? Thanks, Andrew Kenworthy