How does Avro mark (string) field delimitation?
I have looked at the Avro 1.6.0 code and am not sure how Avro distinguishes between field boundaries when reading null values. The BinaryEncoder class (which is where I land when debugging my code) has an empty method for writeNull: how does the parser then distinguish between adjacent nullable fields when reading that data? Thanks in advance, Andrew
Re: How does Avro mark (string) field delimitation?
I don't have a specific use case that is problematic; I was trying to understand how it all works internally. Following your comment about indexes I looked in GenericDatumWriter, and sure enough the union is tagged so we know which branch of the union was written:

    case UNION:
      int index = data.resolveUnion(schema, datum);
      out.writeIndex(index);
      write(schema.getTypes().get(index), datum, out);
      break;

That's the bit I was missing! Thanks for the input.

Andrew

From: Harsh J ha...@cloudera.com
To: user@avro.apache.org; Andrew Kenworthy adwkenwor...@yahoo.com
Sent: Monday, January 23, 2012 4:04 PM
Subject: Re: How does Avro mark (string) field delimitation?

The read part is empty as well, when the decoder is asked to read a 'null' type. For null-carrying unions, I believe an index is written out, so if the index evaluates to null, the same logic works yet again. It therefore does not matter if there are two nulls adjacent to one another. How do you imagine this ends up being a problem? What trouble are you running into?

--
Harsh J
Customer Ops. Engineer, Cloudera
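To make the mechanism above concrete, here is a minimal, self-contained sketch of why adjacent nulls never confuse the reader. This is not Avro's actual encoder (the class and method names here are made up for illustration); it just mimics the wire format: a value of a ["null", "string"] union is prefixed with its branch index as a zig-zag varint, and the null branch contributes nothing beyond that index, so every null still occupies exactly one byte in the stream.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class UnionIndexDemo {

    // Avro's binary encoding writes ints as zig-zag varints.
    static void writeInt(ByteArrayOutputStream out, int n) {
        long v = ((long) n << 1) ^ (n >> 31); // zig-zag encode
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Encode one value of a ["null", "string"] union.
    // Branch 0 is null: only the index is written, no payload.
    // Branch 1 is string: index, then length-prefixed UTF-8 bytes.
    static byte[] encodeNullableString(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (s == null) {
            writeInt(out, 0);
        } else {
            writeInt(out, 1);
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            writeInt(out, utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Each null is one byte (the branch index), so two adjacent
        // nullable fields that are both null are still unambiguous.
        System.out.println(encodeNullableString(null).length);  // 1
        System.out.println(encodeNullableString("hi").length);  // 4: index + length + 2 bytes
    }
}
```

Because the reader walks the same schema as the writer, it knows a union comes next, reads the index, and (for the null branch) reads nothing further before moving on to the next field.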
using avro schemas to select columns (abusing versioning?)
We are working on a very sparse table with, say, 500 columns. Our batch uploads typically contain only a subset of the columns (say 100), and we run multiple map-reduce queries on subsets of the columns (typically fewer than 50 columns go into a single map-reduce job). My question is the following: if I use Avro, do I ever actually need to use the full schema of the table? If I understand Avro correctly, the batch uploads could simply add Avro files whose schema reflects the columns that are in the file (as opposed to first inserting many nulls into the data and then saving it with the full schema). The queries could likewise read with a schema that reflects only the columns the query needs (as opposed to reading with the full 500-column schema and then picking out the relevant columns). As long as I provide defaults of null in the query schemas, I think this would work. Correct? Is this considered abuse of Avro's versioning capabilities?
Re: using avro schemas to select columns (abusing versioning?)
On 01/23/2012 02:18 PM, Koert Kuipers wrote:
> is this considered abuse of avro's versioning capabilities?

Not at all. Using a subset of the fields in Avro is called projection and can provide significant performance improvements.

Doug
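As a sketch of what such a projection schema might look like (the record and field names here are hypothetical, not from the thread): a file written with the full 500-column schema can be read with a reader schema listing only the needed columns. Fields present in the file but absent from the reader schema are skipped during decoding; nullable fields the writer may have omitted get a null default. Note that for a null default, Avro requires "null" to be the first branch of the union.

```
{
  "type": "record",
  "name": "Row",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "clicks",  "type": ["null", "long"], "default": null}
  ]
}
```

Passing this as the reader (expected) schema to a DatumReader gives you exactly the column selection described above, without ever materializing the other ~498 columns.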
AVRO-981 - Removed snappy as requirement
https://issues.apache.org/jira/browse/AVRO-981

I took Joe Crobak's advice and removed snappy as a dependency in the Python client for Avro. With the patch in AVRO-981 applied, Avro installs, builds, and functions on Mac OS X.

--
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com