Re: Avro file Compression
The file format compresses in blocks, and the block size is configurable. Compression is applied across all objects in a block, so it works for small objects as well as large ones, as long as the total block size is large enough. I have found that I can increase the compression ratio by ordering the objects carefully so that neighboring records have more in common. From: Bill Baird bill.ba...@traxtech.com Reply-To: user@avro.apache.org user@avro.apache.org Date: Thursday, August 22, 2013 7:47 AM To: user@avro.apache.org user@avro.apache.org Subject: Re: Avro file Compression As with any compression, how much you get depends on the size and nature of the data. I have objects that take 4 or 5k unserialized and serialize to 1.5 to 3k, or about 2 to 1. However, for the same object structure (which contains several nested arrays ... lots of strings, numbers ... basic business data), when the uncompressed data is 17MB, it deflates to 1MB (or 17 to 1). For very small objects, deflate will actually produce a larger output, but it does quite well as the size of the data being deflated grows. Bill On Wed, Aug 21, 2013 at 11:31 PM, Harsh J ha...@cloudera.com wrote: Can you share your test? There is an example at http://svn.apache.org/repos/asf/avro/trunk/lang/c/examples/quickstop.c which has the right calls for using a file writer with a deflate codec - is yours similar? On Mon, Aug 19, 2013 at 9:42 PM, amit nanda amit...@gmail.com wrote: I am trying to compress the Avro files that I am writing. For that I am using the latest Avro C with the deflate option, but I am not able to see any difference in the file size. Is there any special type of data that this works on, or is there any other setting that needs to be configured for this to work? -- Harsh J
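The behaviour described in this thread (deflate doing poorly on very small inputs and much better once many similar records share a block) can be seen with a minimal Python sketch. It uses only the standard `zlib` module, not any Avro API, and the record contents and the block size of 250 records are made up for illustration.

```python
import json
import zlib

def make_records(n):
    # Hypothetical business-like records; the field names are illustrative only.
    return [
        json.dumps({"id": i, "customer": "ACME Corp", "status": "SHIPPED",
                    "items": [{"sku": "A-100", "qty": i % 5}]}).encode("utf-8")
        for i in range(n)
    ]

def compressed_size_per_record(records):
    # Compress each record on its own, as if every record were its own block.
    return sum(len(zlib.compress(r)) for r in records)

def compressed_size_per_block(records, records_per_block):
    # Compress records in groups, the way a block-based container would.
    total = 0
    for start in range(0, len(records), records_per_block):
        block = b"".join(records[start:start + records_per_block])
        total += len(zlib.compress(block))
    return total

records = make_records(1000)
raw = sum(len(r) for r in records)
alone = compressed_size_per_record(records)
blocked = compressed_size_per_block(records, 250)
print(f"raw={raw} per-record={alone} per-block={blocked}")
```

If a file written with the deflate codec shows no size difference, comparing the per-record and per-block totals for your own data in this way may help tell whether the data is simply too small per block or whether the codec was never enabled.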
Re: Avro Schema to SQL
Not all Avro schemas can be converted to SQL. Primarily, unions can pose challenges, as can recursive references. Nested types are a mixed bag: some SQL-related systems have rich support for nested types and/or JSON (e.g. PostgreSQL), which can make this easier, while others are more crude (MySQL, Hive). With unions, in some cases a union field can be expanded/flattened into multiple fields, of which only one is not null. Recursive types can be transformed into key references. In general, all of these transformation strategies require decisions by the user and potentially custom work depending on what database is involved. Traversing an Avro Schema in Java is done via the Schema API; the Javadoc explains it and there are many examples in the Avro source code. The type of each schema must be checked, and for each nested type a different descent into its contained types can occur. From: Avinash Dongre dongre.avin...@gmail.com Reply-To: user@avro.apache.org user@avro.apache.org Date: Wednesday, June 19, 2013 2:31 AM To: user@avro.apache.org user@avro.apache.org Subject: Avro Schema to SQL Is there any known tool/framework available to convert an Avro Schema into SQL? If not, how do I iterate over the schema to find out what records and enums are there? I can think of how to achieve this with a simple schema, but I am not able to figure out a way for nested schemas. Thanks Avinash
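The recursive descent described in the reply (check the type of each schema, then descend into whatever it contains) can be sketched over the schema's JSON form. This is a minimal Python illustration using only the `json` module, not the Java Schema API mentioned above; the function name `find_named_types` and the sample `Order` schema are hypothetical.

```python
import json

def find_named_types(schema, found=None):
    """Recursively collect (kind, name) pairs for records and enums."""
    if found is None:
        found = []
    if isinstance(schema, list):
        # A JSON array is a union: visit every branch.
        for branch in schema:
            find_named_types(branch, found)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            found.append(("record", schema["name"]))
            for field in schema.get("fields", []):
                find_named_types(field["type"], found)
        elif kind == "enum":
            found.append(("enum", schema["name"]))
        elif kind == "array":
            find_named_types(schema["items"], found)
        elif kind == "map":
            find_named_types(schema["values"], found)
        else:
            # e.g. {"type": {...}} wrappers: descend into the inner type.
            find_named_types(kind, found)
    # A plain string is a primitive or a reference to an already-defined name.
    return found

schema = json.loads("""
{"type": "record", "name": "Order", "fields": [
  {"name": "status", "type": {"type": "enum", "name": "Status", "symbols": ["NEW", "DONE"]}},
  {"name": "lines", "type": {"type": "array", "items":
    {"type": "record", "name": "Line", "fields": [{"name": "sku", "type": "string"}]}}},
  {"name": "note", "type": ["null", "string"]}
]}
""")
print(find_named_types(schema))
```

A real converter would also have to decide what to do at each step (for example, emit a table per record and a key reference for a recursive name), which is where the user decisions mentioned in the reply come in.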
Re: Reader / Writer terminology
It can be a view, or a transformation. You might view data_a with schema_b. Or, you might take binary data, conforming to schema A and directly re-write it to binary data, conforming to schema B. Most Avro APIs don't yet handle workflows that are not 'read' and 'write' -- transformations to or from object representations to serialized forms. The general case includes all transformation classes as well as views. On 6/8/13 10:16 PM, Gregory (Grisha) Trubetskoy gri...@apache.org wrote: On Sat, 8 Jun 2013, Scott Carey wrote: In a more general sense it is simply from and to -- One might move from schema A to B without serialization at all, transforming a data structure, or simply want a view of data in the form of A as if it was in B. I'd like to zoom in on this specific point for a little, if I may. I think serialization is a red herring. It's always a transformation of one data structure to another, because a claim could be made that one cannot transform a serialized form without loading it into a data structure first. In fact, I think it's always the latter case, a *view*, as you aptly described it. Which makes it not so much a from and to, but more a view A as B? Something like: value_b = value_a.view_as(schema_b) Just my late-night $0.02. Grisha
Re: Reader / Writer terminology
I'm about to make all of this even more confusing. For pair-wise resolution when the operation is deserialization, reader and writer make sense. In a more general sense it is simply from and to -- One might move from schema A to B without serialization at all, transforming a data structure, or simply want a view of data in the form of A as if it was in B. There aren't any clear naming winners and many sound good for one use case but worse for others: 'source' and 'destination', 'source' and 'sink', 'original' and 'target', 'expected' and 'actual', 'reader' and 'writer', 'resolver' and 'resolvee', 'sender' and 'receiver'. As part of AVRO-1124 I have recently met in person with a few folks who needed enhancements to that ticket (the discussion and conclusion will be added there shortly, prior to the next patch version). The result is that two names are not enough, because expressing resolution of _sets_ of schemas is more complicated than pairs. When describing a set of schemas that represent some sort of data that may have been persisted, six states are needed. The six states are made up of two dimensions. * The reader dimension is binary, and represents whether a schema is used for reading or not (is ever a to, reader, or target). * The write dimension has three states: Writer (an active from or source), Written (persisted data, not actively written), and None (not used for writing). The naming of these will be confusing; as part of AVRO-1124 we'll have to have names that are as clear as possible. Currently I have enumerations: ReadState.READER and ReadState.NONE; WriteState.WRITER, WriteState.WRITTEN, and WriteState.NONE. I am not a big fan of these names, and am open to suggestions. A consistent approach in naming is important. For example, I previously had WriteState.WRITTEN named WriteState.READABLE. That represents the idea of what the state is best, but is extremely confusing. 
These six states are related by one schema resolution rule: Schemas in state ReadState.READER must be able to read all schemas with WriteState.WRITER or WriteState.WRITTEN. and one rule for persisting data: Data must not be persisted unless the corresponding schema is in state WriteState.WRITER. Without going into the details, this allows for any schema evolution use case over a set of schemas with both ephemeral data and persisted data. Schemas can transition from one state to another, as long as the constraint rules above are met at all times. Reader and Writer have been useful because they correlate with other meaningful names well -- hypothetically: boolean mySchema.canRead(Schema writer) and boolean mySchema.canBeReadWith(Schema reader) A naming scheme for describing schema resolution and evolution will need to work across many use cases and be useful for describing relationships between schemas. Describing only the pair-wise resolution is not enough. On 6/8/13 12:44 AM, Doug Cutting cutt...@apache.org wrote: Originally I used the term 'actual' for the schema of the data written and 'expected' for the schema that the reader of the data wished to see it as. Some found those terms confusing and suggested that 'writer' and 'reader' were more intuitive, so we started using those instead. That unfortunately seems not to have resolved the confusion entirely. Perhaps we should improve the documentation around this? Do you have any specific suggestions about how that might be done? Doug On Jun 7, 2013 10:12 PM, Gregory (Grisha) Trubetskoy gri...@apache.org wrote: I'm curious how the Reader and Writer terminology came about, and, most importantly, whether it's as confusing to the rest of you as it is to me? As I understand it, the principal analogy here is from the RPC world - a process A writes some Avro to process B, in which case A is the writer and B is the reader. 
And there is the possibility that the schema which B may be expecting isn't what A is providing, thus B may have to do some conversion on its end to grok it, and Avro schema resolution rules may make this possible. So far so good. This is where it becomes confusing. I am lost on how the act of reading or writing is relevant to the task at hand, which is conversion of a value from one schema to another. As I read stuff on the lists and the docs, I couldn't help noticing words such as original, first, second, actual, expected being used alongside reader and writer as clarification. What would be wrong with source and destination schemas? Consider the following line (from Avro-C): writer_iface = avro_resolved_writer_new(writer_schema, reader_schema); Here writer in resolved_writer and writer_schema are unrelated. The former refers to the fact that this interface will be modifying (writing to) an object, the latter is referring to the writer (source, original, a.k.a actual) schema. Wouldn't this read better as: writer_iface =
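The two-dimensional state model and the two rules from the reply earlier in this thread (readers must be able to read every WRITER or WRITTEN schema; data may only be persisted under a WRITER schema) can be written down as a small Python sketch. The enum names follow the ones proposed in the message; everything else, including `check_resolution_rule`, `may_persist`, and the `can_read` callback, is a hypothetical illustration and not the AVRO-1124 API.

```python
from enum import Enum

class ReadState(Enum):
    READER = "reader"
    NONE = "none"

class WriteState(Enum):
    WRITER = "writer"
    WRITTEN = "written"
    NONE = "none"

def check_resolution_rule(schemas, can_read):
    """Return the (reader, writer) name pairs that violate the resolution rule.

    schemas: dict of name -> (ReadState, WriteState)
    can_read: hypothetical callable (reader_name, writer_name) -> bool
    """
    readers = [n for n, (r, _) in schemas.items() if r is ReadState.READER]
    writers = [n for n, (_, w) in schemas.items()
               if w in (WriteState.WRITER, WriteState.WRITTEN)]
    return [(r, w) for r in readers for w in writers if not can_read(r, w)]

def may_persist(schemas, name):
    # Data may only be persisted with a schema in state WRITER.
    return schemas[name][1] is WriteState.WRITER

schemas = {
    "v1": (ReadState.NONE, WriteState.WRITTEN),   # old data still on disk
    "v2": (ReadState.READER, WriteState.WRITER),  # current schema
}
compatible = {("v2", "v1"), ("v2", "v2")}
violations = check_resolution_rule(schemas, lambda r, w: (r, w) in compatible)
print(violations, may_persist(schemas, "v2"), may_persist(schemas, "v1"))
```

Two read states times three write states gives the six combinations the message refers to; a state transition would be allowed only if `check_resolution_rule` still returns an empty list afterwards.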
Re: Compressed Avro vs. compressed Sequence - unexpected results?
For your Avro files, double check that snappy is used (use avro-tools to peek at the metadata in the file, or simply view the head in a text editor; the compression codec used will be in the header). Snappy is very fast, so most likely the time to read is dominated by deserialization. Avro will be slower than a trivial deserializer (but more compact), but being many times slower is not expected. I am not entirely sure how Hive's Avro SerDe works -- it is possible there is a performance issue there. If you were able to get a handful of stack traces (kill -3 or jstack) from the mapper tasks (or a profiler output), it would be very insightful. On 5/23/13 12:42 AM, nir_zamir nir.za...@gmail.com wrote: Hi, We're examining the storage of our data in Snappy-compressed files. Since we want the data's structure to be self-contained, we checked it with Avro and with Sequence (both are splittable, which should best utilize our cluster). We tested the performance on a 12GB (CSV) data file and a 4-node cluster in a production environment (very strong machines). Compression What we did here (for test simplicity) is create two Hive tables: Avro-based and Sequence-based. Then we enabled Snappy compression and INSERTed the data from the RAW table (consisting of the 12GB file). In terms of compression rate, Avro was better: 72% vs. 57%. In both cases there were 45 mappers, and CPU/Mem were very far from their limit on all machines. Since there was no reduce operator, this created 45 files. Compression time for Avro took longer: 1.75 minutes vs. 1.2 minutes for sequence files. Decompression What we did here was this Hive query: SELECT COUNT(1) FROM table-name; Here was the real difference: it took Avro about *75% longer* to perform this (3 minutes vs. 0.5 minute). This was very surprising since for our strong machines the I/O would be expected to be the bottleneck, and since Avro files are smaller, we expected them to be faster to decompress. 
The number of mappers in both cases was similar (14 vs. 17) and again, CPU/Mem didn't seem to be exhausted. Since our most critical time is reading, this issue makes it hard for us to be using Avro. Maybe we're doing something wrong - your input would be much appreciated! Thanks, Nir
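The first suggestion in the reply, checking which codec the file header actually records, can also be done without avro-tools. The sketch below is a minimal Python reader for the header of an Avro object container file as laid out in the Avro specification (magic bytes, then a metadata map with zig-zag varint lengths); it uses only the standard library, the function names are hypothetical, and the sample header is built by hand rather than taken from a real file.

```python
import io

MAGIC = b"Obj\x01"

def read_long(stream):
    # Avro longs are variable-length, zig-zag encoded integers.
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated varint")
        accum |= (byte[0] & 0x7F) << shift
        if not (byte[0] & 0x80):
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)

def read_bytes(stream):
    return stream.read(read_long(stream))

def read_header_metadata(stream):
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:
            count = -count
            read_long(stream)  # block size in bytes, not needed here
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta

def codec_of(stream):
    # Per the specification, a missing avro.codec entry means "null" (no compression).
    return read_header_metadata(stream).get("avro.codec", b"null").decode("utf-8")

# Hand-built header: one metadata entry, end-of-map marker, 16-byte sync marker.
header = MAGIC + b"\x02" + b"\x14avro.codec" + b"\x0csnappy" + b"\x00" + b"\x00" * 16
print(codec_of(io.BytesIO(header)))
```

For a real file, `codec_of(open(path, "rb"))` would be the equivalent call; if it reports `null` for the Hive-written files, the Snappy setting never took effect.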
Re: using Avro unions with HIVE
The Hive mailing list would have more info on the Avro SerDe usage. In general, a system that does not have union types like Hive (or Pig, etc) has to expand a union into multiple fields if there is more than one non-null type -- and at most one branch of the union is not null. For example a record with fields: {name:timestamp, type:long, default:-1} {name:ipAddress, type:[IPv4, IPv6]} where IPv4 and IPv6 are previously defined types, would have to expand to three fields timestamp, ipAddress:IPv4, and ipAddress:IPv6, where only one of the last two is not null in any given record. I do not know what Hive's Avro SerDe does with unions. On 5/23/13 7:15 AM, Ran S r...@liveperson.com wrote: Hi, We started to work with Avro in CDH4 and to query the Avro files using Hive. This does work fine for us, except for unions. We do not understand how to query the data inside a union using Hive. For example, let's look at the following schema: { type:record, name:event, namespace:com.mysite, fields:[ { name:header, type:{ type:record, name:CommonHeader, fields:[{ name:eventTimeStamp, type:long, default:-1 }, { name:globalUserId, type:[null, string], default:null } ] }, default:null }, { name:eventbody, type:{ type:record, name:eventbody, fields:[ { name:body, type:[ null, { type:record, name:event1, fields:[ { name:event1Header, type:[null, { type:array, items:string }], default:null }, { name:event1Body, type:[null, { type:array, items:string }], default:null } ] }, { type:record, name:event2, fields:[ { name:page, type:{ type:record, name:URL, fields:[{ name:url, type:string }] }, default:null }, { name:referrer, type:string, default:null } ] } ], default:null } ] }, default:null } ]} Note that body is a union of three types: null, event1 and event2. So if I want to query fields inside event1, I first need to access it. I then write a HiveQL query like this: SELECT eventbody.body.??? from SRC My question is: what should I put in the ??? above to make this work? 
Thank you, Ran
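The general expansion strategy described in the reply (one column per non-null branch of a union, with at most one of them populated per record) can be sketched as follows. This is a minimal Python illustration of that idea only; it says nothing about what Hive's Avro SerDe actually does, and `expand_union_fields` and the `field:branch` column naming are hypothetical, chosen to match the timestamp/ipAddress example in the reply.

```python
def expand_union_fields(fields):
    """Flatten record fields so each non-null union branch gets its own column."""
    columns = []
    for field in fields:
        ftype = field["type"]
        if isinstance(ftype, list):
            branches = [b for b in ftype if b != "null"]
            if len(branches) > 1:
                for branch in branches:
                    # Named types are referenced by name; inline types fall back to their kind.
                    if isinstance(branch, str):
                        branch_name = branch
                    else:
                        branch_name = branch.get("name", branch.get("type"))
                    columns.append(f"{field['name']}:{branch_name}")
                continue
        # Non-unions and simple nullable fields stay as a single column.
        columns.append(field["name"])
    return columns

fields = [
    {"name": "timestamp", "type": "long", "default": -1},
    {"name": "ipAddress", "type": ["IPv4", "IPv6"]},
]
print(expand_union_fields(fields))
```

Applied to the `body` field in the question, the same approach would yield two columns, one for `event1` and one for `event2`, with the `null` branch represented by both being empty.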
Re: Newb question on importing JSON and defaults
On 5/22/13 2:26 PM, Gregory (Grisha) Trubetskoy gri...@apache.org wrote: Hello! I have a test.json file that looks like this: {first:John, last:Doe, middle:C} {first:John, last:Doe} (Second line does NOT have a middle element). And I have a test.schema file that looks like this: {name:test, type:record, fields: [ {name:first, type:string}, {name:middle, type:string, default:}, {name:last, type:string} ]} I then try to use fromjson, as follows, and it chokes on the second line: $ java -jar avro-tools-1.7.4.jar fromjson --schema-file test.schema test.json test.avro Exception in thread main org.apache.avro.AvroTypeException: Expected field name not found: middle at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.JsonDecoder.advance(JsonDecoder.java:139) at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:219) at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:214) at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107 ) at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.j ava:348) at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.j ava:341) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:15 4) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.j ava:177) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:14 8) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:13 9) at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:105) at org.apache.avro.tool.Main.run(Main.java:80) at org.apache.avro.tool.Main.main(Main.java:69) The short story is - I need to convert a bunch of JSON where an element may not be present sometimes, in which case I'd want it to default to something sensible, e.g. blank or null. 
According to the Schema Resolution rules, if the reader's record schema has a field that contains a default value, and the writer's schema does not have a field with the same name, then the reader should use the default value from its field. I'm clearly missing something obvious, any help would be appreciated! There are two things that seem to be missing here: 1. The fromjson tool is configuring the writer's schema (and the reader's) to be the one you provided. Avro is expecting every JSON fragment you are giving it to have the same schema. 2. The tool will not work for all arbitrary JSON; it expects JSON in the format that the Avro JSON Encoder writes. There are a few differences with expectations, primarily when disambiguating union types and maps from records. To perform schema evolution while reading, you may need to separate JSON fragments missing middle from those that have it, and run the tool twice, with corresponding schemas for each case. Alternatively the tool could be modified to handle schema resolution or deal with different JSON encodings as well (tools/src/main/java/org/apache/avro/tool/DataFileWriteTool). Alternatively, you can avoid schema resolution and write two files, one with data in each schema after separating the records. Then you can deal with schema resolution in a later pass through the data with other tools (e.g. data file reader + writer), or only lazily when reading resolve the data into the schema you wish. Grisha
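Another way to work around the first point, not mentioned in the thread, is to normalize the JSON before handing it to fromjson so that every line carries every field. The Python sketch below fills in schema defaults for missing top-level fields using only the `json` module. It assumes the stripped-out default for `middle` was an empty string, `apply_defaults` is a hypothetical helper, and the output would still have to be in the form the Avro JSON encoder expects (the second point above).

```python
import json

def apply_defaults(schema, record):
    """Return a copy of record with schema defaults filled in for missing fields."""
    filled = dict(record)
    for field in schema["fields"]:
        if field["name"] not in filled:
            if "default" not in field:
                raise ValueError(f"missing field without default: {field['name']}")
            filled[field["name"]] = field["default"]
    return filled

# The schema from the question, with quoting restored; the "" default is an assumption.
schema = json.loads("""
{"name": "test", "type": "record", "fields": [
  {"name": "first", "type": "string"},
  {"name": "middle", "type": "string", "default": ""},
  {"name": "last", "type": "string"}
]}
""")

lines = [
    '{"first": "John", "last": "Doe", "middle": "C"}',
    '{"first": "John", "last": "Doe"}',
]
for line in lines:
    print(json.dumps(apply_defaults(schema, json.loads(line))))
```

This only handles flat records; nested records would need the same treatment applied recursively.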
Re: Best practices for java enums...?
It would be nice to be able to reference an existing class when using the specific compiler. If you have an existing com.mycompany.Foo enum (or SpecificRecord, or Fixed type), then provide the specific compiler with the type prior to parsing the schema, it could accept a reference: {type:record, name:com.mycompany.Rec, fields: [ {name:fooField, type:com.mycompany.Foo} ]} Ordinarily, this would fail to compile, but given a reference to an existing compatible type, such as an enum, it could work. -Scott On 5/9/13 4:39 PM, Felix GV fe...@mate1inc.com wrote: Hello, I'm currently writing an avro schema which includes an enum field that I already have as a java enum in my application. At first, I named the avro field with the same fully qualified name (package name dot enum name) as my existing java enum. I then ran the avro compiler and found that it overwrote my existing java enum with an avro-generated enum. I find this slightly annoying because my java enum had comments documenting the purpose of each enum value, and the avro-generated enum doesn't have this. I see two or three potential solutions: 1. Accepting to replace my current enum with the avro-generated one in my code base, which I feel I cannot document properly (since I have access to just one doc attribute for the whole enum, instead of per symbol). On a side note, I haven't found any way to have a multi-line doc attribute in an avro schema, so that makes things slightly more annoying still. I wouldn't mind settling on using the avro-generated enums without documentation per symbol if at least I could have one big doc/comment that documents all symbols at once, but since it seems the doc attribute must be a one-liner, this is starting to be a little too messy for my taste... 2. Maintaining two separate enums: my manually written (and documented) enum as well as the avro-generated enum. 
For now, I think this is what I'm going to do, because those enums have little chance of changing anyway, but from a maintenance standpoint, it seems pretty horrendous... 3. I guess there's a third way, which would involve creating a script that backs up my enums, compiles all my schemas, and then restores my backed up enums, but this also seems ultra messy :( ... I haven't tested if it'd work (since the manually written enum is missing the $SCHEMA field), but I guess it would... Am I being OCD about this? Or is this a concern that others have bumped into? How do you guys deal with this? Did I miss anything in the way Avro works? P.S.: I've seen that reflect mappings may be able to work with arbitrary Java enums, but since they seemed discouraged for performance reasons, I haven't dug much in this direction. I'd like to keep using .avsc files if possible, but if there's a better way, I can certainly try it. P.P.S.: We're currently using Avro 1.6.1, but if the latest version provides a nice way of handling my use case, then I guess I could get us to upgrade... Thanks a lot :) ! -- Felix
Re: Jackson and Avro, nested schema
It appears that you will need to modify the JSON decoder in Avro to achieve this. The JSON decoder in Avro was built to encode any Avro schema into JSON with 100% fidelity, so that the decoder can read it back. The decoder does not work with any arbitrary JSON. This is because there are ambiguities: In your example: { id: doc1, fields: { foo: bar, spam: eggs, answer: 42, x: {a: 1} } } This can be interpreted by Avro in several ways. Is the value of fields a map or a record with four fields? is the value of x a map or a record with one field? Is answer an int, long, float, or double? is a string doc1 a string or a bytes literal? If you want to bake in the assumption that it is maps, all the way down, you'll need to extend / modify the JSON Decoder. It would be a useful contribution to have a generic JSON schema and decoder for it. We could have a JSON schema record (one field, a union of null, string, double, and map of string to self) and this type's field would automatically be un-nested by the special JSON decoder and not interpreted as a record. -Scott On 5/8/13 11:49 AM, David Arthur mum...@gmail.com wrote: I'm attempting to use Jackson and Avro together to map JSON documents to a generated Avro class. I have looked at the Json schema included with Avro, but this requires a top-level value element which I don't want. Essentially, I have JSON documents that have a few typed top level fields, and one field called fields which is more or less arbitrary JSON. I've reduced this down to strings and ints for simplicity My first attempt was: { type: record, name: Json, fields: [ { name: value, type: [ string, int, {type: map, values: Json} ] } ] }, { name: Document, type: record, fields: [ { name: id, type: string }, { name: fields, type: {type: map, values: [string, int, {type: map, values: Json}]} } ] } Given a JSON document like: { id: doc1, fields: { foo: bar, spam: eggs, answer: 42, x: {a: 1} } } this seems to work, but it doesn't. 
When I turn around and try to serialize this object with Avro, I get the following exception: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.avro.generic.IndexedRecord at org.apache.avro.generic.GenericData.getField(GenericData.java:526) at org.apache.avro.generic.GenericData.getField(GenericData.java:541) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter. java:104) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:6 6) at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.jav a:173) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:6 9) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:7 3) at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.jav a:173) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:6 9) at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter. java:106) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:6 6) at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:5 8) My best guess is that since the fields field is a union, the representation of it in the generate class is an Object which Jackson happily throws whatever into. If I change my schema to explicitly use int instead of the Json type, it works fine for my test document type: {type: map, values: [string, int, {type: map, values: int}]} However now I need to enumerate the types for each level of nesting I want. This is not ideal, and limits me to a fixed level of nesting To be clear, my issue is not modelling my schema in Avro, but rather getting Jackson to map JSON onto the generated classes without too much pain. I have also tried https://github.com/FasterXML/jackson-dataformat-avro without much luck. Any help is appreciated -David
Re: avro.java.string vs utf8 compatibility in recent pig and hive versions
The change in the Pig loader in PIG-3297 seems correct: they must use CharSequence, not Utf8. I suspect that the Avro 1.5.3 jar does not respect the avro.java.string property and is using Utf8 (for the API that Pig is using), but have not confirmed it. avro.java.string is an optional hint for the Java implementation. On the Avro side, we may be able to make a modification that allows one to configure a decoder or encoder to ignore the avro.java.string property. Perhaps it could look for a system property as an override to help with cases like this. On 5/10/13 3:16 PM, Michael Moss michael.m...@gmail.com wrote: Hello, It looks like representing Avro strings as Utf8 provides some interesting performance enhancements, but I'm wondering if folks out there are actually using it in practice, or have had any issues with it. We have recently run into an issue where our Avro files which represent strings as avro.java.string are causing ClassCastExceptions because Pig and Hive are expecting them to be Utf8. The exceptions occur when using avro-1.7.x.jar, but disappear when using version avro-1.5.3.jar. I'm wondering if this is something that should be addressed in the Avro jar, or in Pig and Hive like this thread suggests: https://issues.apache.org/jira/browse/PIG-3297 Here are the exceptions we are seeing: Hive: Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8 at org.apache.hadoop.hive.serde2.avro.AvroDeserializer.deserializeMap(AvroDeserializer.java:253) Pig: Caused by: java.io.IOException: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.avro.util.Utf8 at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:275) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:194) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532) Thanks. -Mike
Re: Hadoop serialization DatumReader/Writer
Making the DatumReader/Writers configurable would be a welcome addition. Ideally, much more of what goes on there could be: 1. configuration driven 2. pre-computed to avoid repeated work during decoding/encoding We do some of both already. The trick is to do #1 without impacting performance and #2 requires a bigger overhaul. If you would like, a contribution including a Clojure related maven module or two that depends on the Java stuff would be a welcome addition and allow us to identify compatibility issues as we change the Java library over time. On 5/8/13 3:33 PM, Marshall Bockrath-Vandegrift llas...@gmail.com wrote: Hi all: Is there a reason Avro's Hadoop serialization classes don't allow configuration of the DatumReader and DatumWriter classes? My use-case is that I'm implementing Clojure DatumReader and -Writer classes which produce and consume Clojure's data structures directly. I'd like to then extend that to Hadoop MapReduce jobs which operate in terms of Clojure data, with Avro handling all de/serialization directly to/from that Clojure data. Am I going around this in a backwards fashion, or would a patch to allow configuration of the Hadoop serialization DatumReader/Writers be welcome? -Marshall
Re: map/reduce of compressed Avro
Martin said it already, but I will emphasize: Avro data files are splittable and can support multiple mappers no matter what codec is used for compression. This is because Avro files are block based, and only use the compression within the block. I recommend starting with deflate (gzip-style) compression, and moving to snappy only if deflate compression level '1' is not fast enough. For more information on Avro data files, see: http://avro.apache.org/docs/current/spec.html#Object+Container+Files On 4/22/13 11:47 PM, nir_zamir nir.za...@gmail.com wrote: Thanks Martin. What will happen if I try to use an indexed LZO-compressed Avro file? Will it work and utilize the index to allow multiple mappers? I think that for Snappy for example, the file is splittable and can use multiple mappers, but I haven't tested it yet - would be glad if anyone has any experience with that. Thanks! Nir.
Re: Could specific records implement the generic API as well?
Which aspect of the generic API are you most interested in? The builder, getters, or setters? Most people that use Specific records do so for compile time type safety, so adding 'set(foo, fooval)' is not desired for those users. On the other hand it is certainly possible to add it. The code generated by the specific code generation utility uses templates, one can add a template that extends what is produced to include generic API bits. -Scott On 4/15/13 11:23 AM, Christophe Taton ta...@wibidata.com wrote: Hi, Is there a reason for specific records to not implement the generic API? I didn't find any obvious technical reason, but maybe I missed something. Thanks, C.
Re: Could specific records implement the generic API as well?
I would like to figure out how to make SpecificRecord and GenericRecord immutable in the longer term (or as an option with the code generation and/or builder). The builder is the first step, but setters are the enemy. Is there a way to do this that does not introduce new mutators for all SpecificRecords? On 4/15/13 3:43 PM, Doug Cutting cutt...@apache.org wrote: On Mon, Apr 15, 2013 at 2:21 PM, Christophe Taton ta...@wibidata.com wrote: If you think it's a meaningful addition, I'm happy to make the change. The two methods I wrote above could be added to SpecificRecordBase and it could then be declared to implement GenericRecord. I think GenericRecordBuilder could be used to build specific records with a few additional changes: - change the type of the 'record' field from GenericData.Record to GenericRecord. - replace the call to 'new GenericData.Record()' to '(GenericRecord)data().newRecord(null, schema())' - add a constructor that accepts a GenericData instance, instead of calling GenericData.get(). Then you could use new GenericRecordBuilder(SpecificData.get(), schema) to create specific records. Doug
Re: Issue writing union in avro?
It is well documented in the specification: http://avro.apache.org/docs/current/spec.html#json_encoding I know others have overridden this behavior by extending GenericData and/or the JsonDecoder/Encoder. It wouldn't conform to the Avro Specification JSON, but you can extend Avro to do what you need it to. The reason for this encoding is to make sure that round-tripping data from binary to JSON and back results in the same data. Additionally, unions can be more complicated and contain multiple records each with different names. Disambiguating the value requires more information since several Avro data types map to the same JSON data type. If the schema is a union of bytes and string, is "hello" a string, or a bytes literal? If it is a union of a map and a record, is {"state":"CA", "city":"Pittsburgh"} a record with two string fields, or a map? There are other approaches, and for some users perfect transmission of types is not critical. Generally speaking, if you want to output Avro data as JSON and consume as JSON, the extra data is not helpful. If you want to read it back in as Avro, you're going to need the info to know which branch of the union to take. On 4/6/13 6:49 PM, Jonathan Coveney jcove...@gmail.com wrote: Err, it's the output format that deserializes the JSON and then writes it in the binary format, not the input format. But either way the general flow is the same. As a general aside, is it the case that the Java case is correct in that when writing a union it should be {"string": "hello"} or whatnot? Seems like we should probably add that to the documentation if it is a requirement. 2013/4/7 Jonathan Coveney jcove...@gmail.com Scott, Thanks for the input. The use case is that a number of our batch processes are built on python streaming. Currently, the reducer will output a JSON string as a value, and then the input format will deserialize the JSON, and then write it in binary format. 
Given that our use of python streaming isn't going away, any suggestions on how to make this better? Is there a better way to go from json string -> writing binary avro data? Thanks again Jon 2013/4/6 Scott Carey scottca...@apache.org This is due to using the JSON encoding for avro and not the binary encoding. It would appear that the Python version is a little bit lax on the spec. Some have built variations of the JSON encoding that do not label the union, but there are drawbacks to this too, as the type can be ambiguous in a very large number of cases without a label. Why are you using the JSON encoding for Avro? The primary purpose of the JSON serialization form as it is now is for transforming the binary to human readable form. Instead of building your GenericRecord from a JSON string, try using GenericRecordBuilder. -Scott On 4/5/13 4:59 AM, Jonathan Coveney jcove...@gmail.com wrote: Ok, I figured out the issue: If you make string c the following: String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}, \"favorite_color\": {\"string\": \"blue\"}}"; Then this works. This represents a divergence between the python and the Java implementation... the above does not work in Python, but it does work in Java. And of course, vice versa. I think I know how to fix this (and can file a bug with my reproduction and the fix), but I'm not sure which one is the expected case? Which implementation is wrong? Thanks 2013/4/5 Jonathan Coveney jcove...@gmail.com Correction: the issue is when reading the string according to the avro schema, not on writing. it fails before I get a chance to write :) 2013/4/5 Jonathan Coveney jcove...@gmail.com I implemented essentially the Java avro example but using the GenericDatumWriter and GenericDatumReader and hit an issue. https://gist.github.com/jcoveney/5317904 This is the error: Exception in thread "main" java.lang.RuntimeException: org.apache.avro.AvroTypeException: Expected start-union. 
Got VALUE_NUMBER_INT at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45) Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697) at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441) at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139) at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38) Am I doing something wrong? Is this a bug? I'm digging in now
Re: Issue writing union in avro?
This is due to using the JSON encoding for avro and not the binary encoding. It would appear that the Python version is a little bit lax on the spec. Some have built variations of the JSON encoding that do not label the union, but there are drawbacks to this too, as the type can be ambiguous in a very large number of cases without a label. Why are you using the JSON encoding for Avro? The primary purpose of the JSON serialization form as it is now is for transforming the binary to human readable form. Instead of building your GenericRecord from a JSON string, try using GenericRecordBuilder. -Scott On 4/5/13 4:59 AM, Jonathan Coveney jcove...@gmail.com wrote: Ok, I figured out the issue: If you make string c the following: String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}, \"favorite_color\": {\"string\": \"blue\"}}"; Then this works. This represents a divergence between the python and the Java implementation... the above does not work in Python, but it does work in Java. And of course, vice versa. I think I know how to fix this (and can file a bug with my reproduction and the fix), but I'm not sure which one is the expected case? Which implementation is wrong? Thanks 2013/4/5 Jonathan Coveney jcove...@gmail.com Correction: the issue is when reading the string according to the avro schema, not on writing. it fails before I get a chance to write :) 2013/4/5 Jonathan Coveney jcove...@gmail.com I implemented essentially the Java avro example but using the GenericDatumWriter and GenericDatumReader and hit an issue. https://gist.github.com/jcoveney/5317904 This is the error: Exception in thread "main" java.lang.RuntimeException: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45) Caused by: org.apache.avro.AvroTypeException: Expected start-union. 
Got VALUE_NUMBER_INT at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697) at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441) at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139) at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38) Am I doing something wrong? Is this a bug? I'm digging in now but am curious if anyone has seen this before? I get the feeling I am working with Avro in a way that most people do not :)
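To make the ambiguity discussed in this thread concrete, here is a minimal sketch in plain Java. It does not use Avro at all: the class UnionLabelDemo and its helpers are hypothetical names made up for illustration, and the quoting is deliberately simplified. It assumes only the spec's rule that bytes are written as JSON strings, so an unlabeled string and an unlabeled bytes value can look identical, while the labeled forms cannot be confused.

```java
import java.nio.charset.StandardCharsets;

class UnionLabelDemo {
    // Simplified JSON string quoting: only backslash and double quote are escaped.
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Bytes rendered as a JSON string, one character per byte (ISO-8859-1 mapping).
    static String bytesAsJson(byte[] b) {
        return jsonString(new String(b, StandardCharsets.ISO_8859_1));
    }

    // Wrap a union value in a single-entry object that names its branch.
    static String labeled(String branch, String jsonValue) {
        return "{\"" + branch + "\":" + jsonValue + "}";
    }

    public static void main(String[] args) {
        String asString = jsonString("hello");
        String asBytes = bytesAsJson("hello".getBytes(StandardCharsets.ISO_8859_1));
        // Without labels the two values are the same JSON text.
        System.out.println("unlabeled forms identical: " + asString.equals(asBytes));
        // With labels a reader can tell which union branch was written.
        System.out.println(labeled("string", asString));
        System.out.println(labeled("bytes", asBytes));
    }
}
```

The label costs a few bytes per value, which is the price of being able to read the JSON back into the same Avro types.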
Re: Has anyone developed a utility to tell what is missing from a record?
Try GenericRecordBuilder. For the Specific API, there are builders that will not let you construct an object that can not be serialized. The Generic API should have the same thing, but I am not 100% sure the builder there covers it. I have always avoided using any API that allows me to create an object that is unsafe to serialize since finding out at serialization time is a huge pain (and in my case, is often on a separate thread from the place it was created). On 4/4/13 6:58 AM, Jonathan Coveney jcove...@gmail.com wrote: I'm working on migrating an internally developed serialization format to Avro. In the process, there have been many cases where I made a mistake migrating the schema (I've automated it), and then avro cries that a record I'm trying to serialize doesn't match the schema. Generally, the error it gives doesn't help find the actual issue, and for a big enough record finding the issue can be tedious. I've thought about making a tool which, given the schema and the record would tell you what the issue is, but I'm wondering if this already exists? I suppose the error message could also include this information... Thanks Jon
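The kind of tool Jon describes can be sketched in a few lines. This is a toy in plain Java, not Avro code: the "schema" is just a map from field name to required Java class, the record is a map of values, and RecordCheckSketch and problems are hypothetical names. A real version would walk an Avro Schema instead, but the reporting idea is the same: name every offending field rather than failing on the first one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecordCheckSketch {
    // expected: field name -> required Java type; record: field name -> value.
    // Returns one human-readable line per field that is missing or has the wrong type.
    static List<String> problems(Map<String, Class<?>> expected, Map<String, Object> record) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Class<?>> e : expected.entrySet()) {
            Object value = record.get(e.getKey());
            if (value == null) {
                out.add(e.getKey() + ": missing");
            } else if (!e.getValue().isInstance(value)) {
                out.add(e.getKey() + ": expected " + e.getValue().getSimpleName()
                        + ", got " + value.getClass().getSimpleName());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Class<?>> expected = new LinkedHashMap<>();
        expected.put("name", String.class);
        expected.put("age", Integer.class);

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("age", "12");  // wrong type, and "name" was never set

        for (String problem : problems(expected, record)) {
            System.out.println(problem);
        }
    }
}
```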
Re: Support for char[] and short[] - Java
You can cast both short and char safely to int and back, and use Avro's int type. These will be variable-length integer encoded and take 1 to 3 bytes in binary form per short/char. It will be clunky for a user to wrap char[] or short[] into List<Integer> or int[], however. Another option would be to extend the reader to look for special meta-data in the schema that indicates that an array of int is to be interpreted as shorts or chars. Can you give an example where a char[] converted to utf8 bytes and back results in a loss of data? I was under the impression that UTF-16 surrogate pairs are converted to proper UTF-8 sequences and back to surrogate pairs. Or, are you using char to represent something else, as a two byte unsigned quantity where interpreting as UTF-16 causes the problem? On 12/23/12 10:30 PM, Tarun Gupta tarun.gu...@technogica.com wrote: Hi, I am new to Avro but I did some basic research regarding how we support data types like Char arrays and Short arrays while defining the Avro schema. Issue # AVRO-249 sounded somewhat relevant but it's about supporting Short using the reflection API. We are planning to use Avro for a Java based Client Server data exchange use case, note that our data model is expected to have large arrays of Short and Char, and performance is our 'key concern'. We can't use a string to store char[], because what we get back is different than what you put in, because of UTF-16 normalization. Thanks in Advance. Tarun Gupta
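Both points in the reply can be checked with plain Java. The sketch below does not use Avro; ShortCharAsInt and its methods are hypothetical names, and zigZagVarintLength is a hand-written model of the zigzag variable-length scheme the Avro spec describes for int, not a call into the library. It shows that every short or char fits in 1 to 3 bytes, that widening to int and back is lossless, and that a UTF-8 round trip preserves a valid surrogate pair but not an unpaired surrogate, which is one way a char[] used as raw 16-bit data could come back changed.

```java
import java.nio.charset.StandardCharsets;

class ShortCharAsInt {
    // Number of bytes a zigzag varint needs for n.
    static int zigZagVarintLength(int n) {
        int z = (n << 1) ^ (n >> 31);  // zigzag: small magnitudes map to small unsigned values
        int len = 1;
        while ((z & ~0x7F) != 0) {     // more than 7 bits remain, so another byte is needed
            z >>>= 7;
            len++;
        }
        return len;
    }

    // Widen each char to an int; nothing is lost.
    static int[] widen(char[] chars) {
        int[] ints = new int[chars.length];
        for (int i = 0; i < chars.length; i++) {
            ints[i] = chars[i];
        }
        return ints;
    }

    // Narrow back to char; exact for values produced by widen().
    static char[] narrow(int[] ints) {
        char[] chars = new char[ints.length];
        for (int i = 0; i < ints.length; i++) {
            chars[i] = (char) ints[i];
        }
        return chars;
    }

    // Push a char[] through a UTF-8 encode/decode cycle, as a string field would.
    static char[] utf8RoundTrip(char[] chars) {
        byte[] utf8 = new String(chars).getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).toCharArray();
    }

    public static void main(String[] args) {
        char[] pair = {'\uD83D', '\uDE00'};  // a valid surrogate pair
        char[] lone = {'\uD800'};            // an unpaired high surrogate
        System.out.println("bytes for Short.MAX_VALUE: " + zigZagVarintLength(Short.MAX_VALUE));
        System.out.println("bytes for Character.MAX_VALUE: " + zigZagVarintLength(Character.MAX_VALUE));
        System.out.println("pair survives UTF-8: " + java.util.Arrays.equals(pair, utf8RoundTrip(pair)));
        System.out.println("lone surrogate survives UTF-8: " + java.util.Arrays.equals(lone, utf8RoundTrip(lone)));
        System.out.println("lone surrogate survives int cast: " + java.util.Arrays.equals(lone, narrow(widen(lone))));
    }
}
```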
Re: Appending to .avro log files
A sync marker delimits each block in the avro file. If you want to start reading data from the middle of a 100GB file, DataFileReader will seek to the middle and find the next sync marker. Each block can be individually compressed, and by default when writing a file the writer will not compress the block and flush to disk until a block has gotten as large as the sync interval in bytes. Alternatively, you can manually sync(). If you have a 100 byte sync interval, you may not see any data reach disk until that many bytes have been written (or sync() is called manually). Your problem is likely that the first block in the file has not been flushed to disk yet, and therefore the file is corrupt and missing a trailing sync marker. On 1/3/13 12:36 PM, Terry Healy the...@bnl.gov wrote: Hello- I'm upgrading a logging program to append GenericRecords to a .avro file instead of text (.tsv). I have a working schema that is used to convert existing .tsv of the same format into .avro and that works fine. When I run a test writing 30,000 bogus records, it runs but when I try to use avro-tools-1.7.3.jar tojson on the output file, it reports: AvroRuntimeException: java.io.IOException: Invalid sync! The file is still open at this point since the logging program is running. Is this expected behavior because it is still open? (getmeta and getschema work fine). I'm not sure if it has any bearing, since I never really understood the function of the AVRO sync interval; in this and the working programs it is set to 100. Any ideas appreciated. -Terry
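The buffering behaviour described in the reply can be modelled with a toy writer. This is a sketch in plain Java, not Avro's DataFileWriter: BlockBufferSketch is a hypothetical class, the "file" is an in-memory stream, the marker is 16 placeholder bytes (the length of a real Avro sync marker), and the per-block count and size fields of the real format are left out. It shows why a reader opening the file mid-write can find fewer bytes than were appended, and no trailing marker for the pending block.

```java
import java.io.ByteArrayOutputStream;

class BlockBufferSketch {
    private static final int SYNC_MARKER_LENGTH = 16;

    private final ByteArrayOutputStream file = new ByteArrayOutputStream();   // stands in for the file on disk
    private final ByteArrayOutputStream block = new ByteArrayOutputStream();  // records not yet written out
    private final int syncInterval;

    BlockBufferSketch(int syncInterval) {
        this.syncInterval = syncInterval;
    }

    // Buffer one record; write the block out only once it reaches the sync interval.
    void append(byte[] record) {
        block.write(record, 0, record.length);
        if (block.size() >= syncInterval) {
            sync();
        }
    }

    // Force the current block out, followed by a sync marker.
    void sync() {
        if (block.size() == 0) {
            return;
        }
        byte[] data = block.toByteArray();
        file.write(data, 0, data.length);
        file.write(new byte[SYNC_MARKER_LENGTH], 0, SYNC_MARKER_LENGTH);
        block.reset();
    }

    int bytesInFile() {
        return file.size();
    }

    int bytesPending() {
        return block.size();
    }

    public static void main(String[] args) {
        BlockBufferSketch writer = new BlockBufferSketch(100);
        writer.append(new byte[40]);
        writer.append(new byte[40]);
        System.out.println("in file: " + writer.bytesInFile() + ", pending: " + writer.bytesPending());
        writer.append(new byte[40]);  // crosses the interval, so the block is written
        System.out.println("in file: " + writer.bytesInFile() + ", pending: " + writer.bytesPending());
    }
}
```

For a long-running logger, the practical consequence is to force blocks out at points where readers are expected to look at the file, or to close the file before reading it.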
Re: Embedding schema with binary encoding
Calling toString() on a Schema will print it in JSON form. However, you most likely do not want to invent your own file format for Avro data. Use DataFileWriter, which will manage the schema for you, along with compression, metadata, and the ability to seek to the middle of the file. Additionally, the file is then readable by several other languages and tools. On 1/7/13 4:42 AM, Pratyush Chandra chandra.praty...@gmail.com wrote: I am able to serialize with binary encoding to a file using following : FileOutputStream outputStream = new FileOutputStream(file); Encoder e = EncoderFactory.get().binaryEncoder(outputStream, null); DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); GenericRecord message1 = new GenericData.Record(schema); message1.put("to", "Alyssa"); datumWriter.write(message1, e); e.flush(); outputStream.close(); But the output file contains only serialized data and not schema. How can I add schema also ? Thanks Pratyush Chandra
Re: Setters and getters
No. However each API (Specific, Reflect, Generic in Java) has different limitations and use cases. You'll have to provide more information about your use cases and expectations for more specific guidance. On 1/7/13 11:21 AM, Tanya Bansal tanyapban...@gmail.com wrote: Is it necessary to write setters and getters for all member variables for a class that is going to be serialized by Avro? Thanks -Tanya
Re: Sync() between records? How do we recover from a bad record, using DataFileReader?
For the corruption test, try corrupting the records, not the sync marker. The features added to DataFileReader for corruption recovery were for the case when decoding a record fails (corrupted record), not for when a sync marker is corrupted. Perhaps we should add that too, but it does not surprise me that that case has a bug. On 1/6/13 7:38 PM, Russell Jurney russell.jur...@gmail.com wrote: We are trying to recover, report bad record, and move to the next record of an Avro file in PIG-3015 and PIG-3059. It seems that sync blocks don't exist between files, however. How should we recover from a bad record using Avro's DataFileReader? https://issues.apache.org/jira/browse/PIG-3015 https://issues.apache.org/jira/browse/PIG-3059 Russell Jurney http://datasyndrome.com
Re: Serializing json against a schema
You could use the ReflectDatumWriter to write a simple java data class to Avro, and you can create instances of such classes from JSON using a library like Jackson. There is a JSON encoding for Avro; if your data conformed to that format (which would be more verbose than what you have below) you could use that to decode it, then re-encode it to binary. Lastly you can use the SpecificDatum API, generate Java classes from your schema, then set the data from the json with its type-safe builder pattern APIs instead of the loose Generic API. On 1/7/13 3:46 AM, Pratyush Chandra chandra.praty...@gmail.com wrote: Hi, I am new to Avro. I was going through examples and figured out that GenericRecord can be appended to DataFileWriter and then serialized. Example: record.avsc is { "namespace": "example.proto", "name": "Message", "type": "record", "fields": [ {"name": "to", "type": ["string","null"]} ] } and my code snippet is : DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); dataFileWriter.create(schema, file); GenericRecord message1 = new GenericData.Record(schema); message1.put("to", "Alyssa"); dataFileWriter.append(message1); dataFileWriter.close(); My question is : Suppose I am receiving a json from server, and based on schema I would like to serialize it directly, without parsing it. For example : Input received is {"to": "Alyssa"} Is there a way, I can serialize above json with record.avsc schema instead of appending GenericRecord ? -- Pratyush Chandra
Re: issue with writing an array of records
On 1/7/13 8:35 AM, Alan Miller alan.mill...@gmail.com wrote: Hi, I have a schema with an array of records (I'm open to other suggestions too) field called ifnet to store misc attribute name/values for a host's network interfaces. e.g. { "type": "record", "namespace": "com.company.avro.data", "name": "MyRecord", "doc": "My Data Record.", "fields": [ // (required) fields {"name": "time_stamp", "type": "long" }, {"name": "hostname", "type": "string" }, // (optional) array of ifnet instances {"name": "ifnet", "type": ["null", { "type": "array", "items": { "type": "record", "name": "Ifnet", "namespace": "com.company.avro.data", "fields": [ {"name": "name", "type": "string"}, {"name": "send_bps", "type": "long" }, {"name": "recv_bps", "type": "long" } ] } } ] } ] } First thought: Why the union of null and the array? It may be easier to simply have an empty list when there are no Ifnet data. I can write the records, (time_stamp and hostname are correct) but my array of records field (ifnet) only contains the last element of my java List. Am I writing the field correctly? I'm trying to write the ifnet field with a java.util.List<com.company.avro.data.Ifnet> Here's the related code lines that write the ifnet field. (Yes, I'm attempting to use reflection because Ifnet is only 1 of approx 11 other array of record fields I'm trying to implement.) Class[] paramObj = new Class[1]; paramObj[0] = Ifnet.class; Method method = cls.getMethod(methodName, List.class); jsonObj = new Ifnet(); listOfObj = new ArrayList<Ifnet>(); ... // in a loop building the List<Ifnet>... LOG.info(String.format(" [%s] %s %s(%s) as %s", name, k,methNm,v,types[j].toString())); ... LOG.info(String.format(" [%s] setting name to %s", name, name)); ... listOfObj.add(jsonObj); ... 
// then finally I call invoke with a List of Ifnet records if (method != null) { method.invoke(obj, listOfObj); } LOG.info(String.format( invoking %s.%s, method.getClass().getSimpleName(), method.getName())); LOG.info(String.format( param: listObj%s with %d entries , jsonObj.getClass().getName(), listOfObj.size())); and the respective output 20130107T172303 INFO c.c.a.d.MyDriver - Setifnet json via setIfnet(Ifnet object) 20130107T172303 INFO c.c.a.d.MyDriver -[e0c] setting name to e0c 20130107T172303 INFO c.c.a.d.MyDriver -[e0c] send_bps setSendBps(0) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[e0c] setting name to e0c 20130107T172303 INFO c.c.a.d.MyDriver -[e0c] recv_bps setRecvBps(0) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[e0d] setting name to e0d 20130107T172303 INFO c.c.a.d.MyDriver -[e0d] send_bps setSendBps(0) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[e0d] setting name to e0d 20130107T172303 INFO c.c.a.d.MyDriver -[e0d] recv_bps setRecvBps(0) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[e0a] setting name to e0a 20130107T172303 INFO c.c.a.d.MyDriver -[e0a] send_bps setSendBps(170720) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[e0a] setting name to e0a 20130107T172303 INFO c.c.a.d.MyDriver -[e0a] recv_bps setRecvBps(244480) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[e0b] setting name to e0b 20130107T172303 INFO c.c.a.d.MyDriver -[e0b] send_bps setSendBps(0) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[e0b] setting name to e0b 20130107T172303 INFO c.c.a.d.MyDriver -[e0b] recv_bps setRecvBps(0) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[e0P] setting name to e0P 20130107T172303 INFO c.c.a.d.MyDriver -[e0P] send_bps setSendBps(0) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[e0P] setting name to e0P 20130107T172303 INFO c.c.a.d.MyDriver -[e0P] recv_bps setRecvBps(0) as class 
java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[losk] setting name to losk 20130107T172303 INFO c.c.a.d.MyDriver -[losk] send_bps setSendBps(0) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver -[losk] setting name to losk 20130107T172303 INFO c.c.a.d.MyDriver -[losk] recv_bps setRecvBps(0) as class java.lang.Long 20130107T172303 INFO c.c.a.d.MyDriver - invoking Method.setIfnet 20130107T172303 INFO c.c.a.d.MyDriver - param: listObjcom.synopsys.iims.be.storage.Ifnet with 6 entries 20130107T172303 INFO c.c.a.d.MyDriver - Set time_stamp integer via setTimeStamp to 1357513251 When I dump the records I see an array of 6 entries but the values all reflect the last last entry in my java.util.List. The
Re: any movement on JSON encoding for RPC?
Avro can serialize in JSON, however most users use the compact binary serialization for performance and data storage reasons (JSON is typically 10x larger), and use the JSON format for debugging or export to other systems. I do not know if anyone is planning work on the JSON encoding in combination with Avro RPC, the best place to find out is the dev mailing list and JIRA tickets. On 11/21/12 1:31 PM, Brian Lee leeb...@yahoo.com wrote: I found a message from last year that JSON encoding for RPC was not yet implemented. Is this still the case? If so, this would be very bad as one of the selling points we were using is that Avro serialized its messages in JSON format. Brian
Re: Backwards compatible - Optional fields
A reader must always have the schema of the written data to decode it. When creating your DatumReader, you must pass both the reader's schema and the schema as written. Once given this pair, Avro can know to skip data as written if the reader does not need it, or to inject default values for the reader if the writer did not provide it (which requires the reader's schema to declare a default for that field). The flaw in your code is here where you only provide the reader's schema: new SpecificDatumReader<A>(a.getSchema()); Use the two-argument constructor, which takes the writer's schema followed by the reader's schema, instead. On 10/2/12 2:04 PM, Gabriel Ki gab...@gmail.com wrote: Hi all, I had an impression that reader works with older version object as long as the new fields are optional. Is that true? If not, what would you recommend? Thanks a lot in advance. For example: { "namespace": "org.apache.avro.examples", "protocol": "MyProtocol", "types": [ { "name": "Metadata", "type": "record", "fields": [ {"name": "S1", "type": "string"} ]}, { "name": "Metadatav2", "type": "record", "fields": [ {"name": "S1", "type": "string"}, {"name": "S2", "type": ["string", "null"]} // optional field in the new version ]} ] } public static <A extends SpecificRecordBase> A parseAvroObject(final A a, final byte[] bb) throws IOException { if (bb == null) { return null; } ByteArrayInputStream bais = new ByteArrayInputStream(bb); DatumReader<A> dr = new SpecificDatumReader<A>(a.getSchema()); Decoder d = DecoderFactory.get().binaryDecoder(bais, null); return dr.read(a, d); } public static void main(String[] args) throws IOException { Metadata.Builder mb = Metadata.newBuilder(); mb.setS1("S1 value"); byte[] bs = toBytes(mb.build()); Metadata m = parseAvroObject(new Metadata(), bs); System.out.println("parse as Metadata " + m); // This I thought it worked with older avro Metadatav2 m2 = parseAvroObject(new Metadatav2(), bs); System.out.println("parse as Metadatav2 " + m2); } Exception in thread "main" java.io.EOFException at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145) at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:405) at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229) at 
org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129) Thanks -gabe
Re: Schema resolution failure when the writer's schema is a primitive type and the reader's schema is a union
My understanding of the spec is that promotion to a union should work as long as the prior type is a member of the union. What happens if the union in the reader schema union order is reversed? This may be a bug. -Scott On 8/16/12 5:59 PM, Alexandre Normand alexandre.norm...@gmail.com wrote: Hey, I've been running into this case where I have a field of type int but I need to allow for null values. To do this, I now have a new schema that defines that field as a union of null and int such as: type: [ null, int ] According to my interpretation of the spec, avro should resolve this correctly. For reference, this reads like this (from http://avro.apache.org/docs/current/spec.html#Schema+Resolution): if reader's is a union, but writer's is not The first schema in the reader's union that matches the writer's schema is recursively resolved against it. If none match, an error is signaled.) However, when trying to do this, I get this: org.apache.avro.AvroTypeException: Attempt to process a int when a union was expected. 
I've written a simple test that illustrates what I'm saying: @Test public void testReadingUnionFromValueWrittenAsPrimitive() throws Exception { Schema writerSchema = new Schema.Parser().parse("{\n" + "\"type\":\"record\",\n" + "\"name\":\"NeighborComparisons\",\n" + "\"fields\": [\n" + "{\"name\": \"test\",\n" + "\"type\": \"int\" }]} "); Schema readersSchema = new Schema.Parser().parse(" {\n" + "\"type\":\"record\",\n" + "\"name\":\"NeighborComparisons\",\n" + "\"fields\": [ {\n" + "\"name\": \"test\",\n" + "\"type\": [\"null\", \"int\"],\n" + "\"default\": null } ] }"); GenericData.Record record = new GenericData.Record(writerSchema); record.put("test", Integer.valueOf(10)); ByteArrayOutputStream output = new ByteArrayOutputStream(); JsonEncoder jsonEncoder = EncoderFactory.get().jsonEncoder(writerSchema, output); GenericDatumWriter<GenericData.Record> writer = new GenericDatumWriter<GenericData.Record>(writerSchema); writer.write(record, jsonEncoder); jsonEncoder.flush(); output.flush(); System.out.println(output.toString()); JsonDecoder jsonDecoder = DecoderFactory.get().jsonDecoder(readersSchema, output.toString()); GenericDatumReader<GenericData.Record> reader = new GenericDatumReader<GenericData.Record>(writerSchema, readersSchema); GenericData.Record read = reader.read(null, jsonDecoder); assertEquals(10, read.get("test")); } Am I misunderstanding how avro should handle such a case of schema resolution or is the problem in the implementation? Cheers! -- Alex
Re: Pig with Avro and HBase
I am using Pig on Avro data files, and Avro in HBase. Can you elaborate on what you mean by 'auto-load the schema'? In the sense that a Pig LOAD statement doesn't have to declare the schema? I do this with avro data files to some extent (with limitations). A working implementation of https://issues.apache.org/jira/browse/AVRO-1124 seems to be the way to go for tracking a mapping from something like a Table or known file type to a sequence of schemas (and the most recent schema). Then a pig loader could load from HBase using the most recent schema from a named schema group, or read the same thing from files that represent the same schema group with an avro file loader. On 8/22/12 8:37 PM, Russell Jurney russell.jur...@gmail.com wrote: Is anyone using Pig with Avro as the datatype in HBase? I want to auto-load the schema, and this seems the most direct way to do it. -- Russell Jurney twitter.com/rjurney http://twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com http://datasyndrome.com/
Re: Suggestions when using Pair.getPairSchema for Reduce-Side Joins in MR2
It sounds like we need to be extra clear in the documentation on Pair, and perhaps have a different class or flavor that serves the purpose you needed. (KeyPair?) In Avro's MRV1 API, there is no key schema or value schema for map output, but only one map output schema that must be a Pair: a pair of key and value, where only the key is used for the sort. -Scott On 6/27/12 3:09 PM, Jacob Metcalf jacob_metc...@hotmail.com wrote: I spent an hour or so of today debugging some map reduce jobs I had developed in Avro 1.7 and Map Reduce 2 and thought it might be constructive to share. I needed to do a reduce side join for which you need a composite key. The key consists of the key you are actually grouping by and an integer which is just used for sorting (the technique is described in many places but there is a nice picture on page 24 of http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf). For this I thought it would be ideal to use Avro pair class which has a handy function for creating its own schema so I could configure the shuffle something like this: Schema joinKeySchema = Pair.getPairSchema( Schema.create( Schema.Type.STRING ), Schema.create( Schema.Type.INT )); AvroJob.setMapOutputKeySchema( joinKeySchema ); I then planned to use the standard AvroKeyComparator for sorting and a specialised comparator for grouping/partitioning which would ignore the integer part. However it did not work as the sort on the integer did not appear to take place and my map output would arrive in the wrong order at the reducer. 
I finally tracked the issue down to the fact that the pair schema by default ignores the second part of the pair: private static Schema makePairSchema(Schema key, Schema value) { Schema pair = Schema.createRecord(PAIR, null, null, false); List<Field> fields = new ArrayList<Field>(); fields.add(new Field(KEY, key, "", null)); fields.add(new Field(VALUE, value, "", null, Field.Order.IGNORE)); pair.setFields(fields); return pair; } In the end it was easy enough to work around by creating my own pair schema. I am not an expert but I suspect there is a very valid application for this ignore in MR1. As a suggestion it may help going forwards if a second version with a boolean to toggle the ignore were introduced to make the semantics clearer. Jacob
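The composite-key technique Jacob describes depends on two different orderings over the same key, which is exactly what an ignored second field breaks. The sketch below shows the idea in plain Java with no Hadoop or Avro dependency: JoinKeySketch, JoinKey, SORT and GROUP are hypothetical names, and the comparators stand in for the sort and grouping comparators a job would configure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class JoinKeySketch {
    static final class JoinKey {
        final String key;  // the value actually being joined on
        final int order;   // secondary tag, e.g. 0 for the left input and 1 for the right

        JoinKey(String key, int order) {
            this.key = key;
            this.order = order;
        }

        @Override
        public String toString() {
            return key + "#" + order;
        }
    }

    // Sort comparator: both parts take part, so within one key the tag decides arrival order.
    static final Comparator<JoinKey> SORT =
            Comparator.comparing((JoinKey k) -> k.key).thenComparingInt(k -> k.order);

    // Grouping comparator: only the join key counts, so both tags reach the same reduce call.
    static final Comparator<JoinKey> GROUP = Comparator.comparing((JoinKey k) -> k.key);

    static List<JoinKey> sorted(List<JoinKey> keys) {
        List<JoinKey> copy = new ArrayList<>(keys);
        copy.sort(SORT);
        return copy;
    }

    public static void main(String[] args) {
        List<JoinKey> keys = Arrays.asList(
                new JoinKey("b", 1), new JoinKey("a", 1), new JoinKey("b", 0), new JoinKey("a", 0));
        System.out.println(sorted(keys));
        System.out.println("same group: " + (GROUP.compare(keys.get(1), keys.get(3)) == 0));
    }
}
```

If the sort comparator skipped the order field, as a schema field marked with Field.Order.IGNORE does, records for one key could reach the reducer with the two inputs interleaved, which matches the symptom reported above.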
Re: C/C++ parsing vs. Java parsing.
The schema provided is a union of several schemas. Java supports parsing this, C++ may not. Does it work if you make it one single schema, and nest NA, acomplex and retypes inside of object ? It only needs to be defined the first time it is referenced. If it does not, then it is certainly a bug. Either way I would file a bug in JIRA. The spec does not say whether a file should be parseable if it contains a union rather than a record, but it probably should be. -Scott On 6/24/12 11:17 PM, Saptarshi Guha sg...@mozilla.com wrote: I have a avro scheme found here: http://sguha.pastebin.mozilla.org/1677671 I tried java -jar avro-tools-1.7.0.jar compile schema ~/tmp/robject.avro foo and it worked. This failed: avrogencpp --input ~/tmp/robject.avro --output ~/tmp/h2 Segmentation fault: 11 This failed: avro_schema_t *person_schema = (avro_schema_t*)malloc(sizeof(avro_schema_t)); (avro_schema_from_json_literal(string.of.avro.file), person_schema) with Error was Error parsing JSON: string or '}' expected near end of file Q1: Does C and C++ API support all schemas the Java one supports? Q2: Is it yes to Q1 and this is a bug? Regards Saptarshi
Re: Paranamer issue
On 6/6/12 10:33 AM, Peter Cameron peter.came...@2icworld.com wrote: The BSD license is a problem for our clients, whereas the Apache 2 license is not. Go figure. That's the situation! ASL 2.0 is a derivative of the BSD license, after all... Apache projects regularly depend on other items that are MIT or BSD licensed since these are the least restrictive open source licenses around. So what is the answer for us when we don't want to ship the avro tools JAR but need the Paranamer classes from it. What can we do to stay consistent with Apache 2 e.g. create my own Paranamer JAR containing just those classes from the tools JAR? As Doug said, packaging doesn't affect anything license-wise, but you can repackage things fairly easily into a single jar that contains what you need using maven-shade-plugin. The avro-tools.jar uses this to repackage all dependencies inside of it. You can do the same thing for the base avro.jar and explicitly include only the few jars you need (or exclude those you do not) by adding a maven-shade-plugin configuration to lang/java/avro/pom.xml and rebuilding. Peter On 06/06/2012 18:30, Doug Cutting wrote: On 06/06/2012 06:51 AM, Peter Cameron wrote: I've only just discovered the dependancy of Avro upon the thoughtworks Paranamer classes. We use reflection at runtime with a schema and encountered the usual ClassNotFoundException for Paranamer after I'd been rationalising our codebase -- which included the removal of the avro-tools-1.6.3 JAR. The tools JAR contains the Paranamer classes which I was unaware of. We operate in a very lightweight environment so the 10Mb tools JAR is not suitable for us to deploy. I went looking for the Paranamer JAR and eventually found version 2.5. However, this is BSD licensed. BSD is not suitable for us. Only Apache 2.0. How is BSD a problem? 
BSD is less restrictive than Apache 2.0 and thus is generally not considered to alter the requirements of one re-distributing software that includes BSD within an Apache-licensed project. Doug
Re: Scala API
This would be fantastic. I would be excited to see it. It would be great to see a Scala language addition to the project if you wish to contribute. I believe there have been a few other Scala Avro attempts by others over time. I recall one where all records were case classes (but this broke at 22 fields). Another thing to look at is: http://code.google.com/p/avro-scala-compiler-plugin/ Perhaps we can get a few of the other people who have developed Scala Avro tools to review/comment or contribute as well? On 5/29/12 11:04 PM, Christophe Taton ta...@wibidata.com wrote: Hi people, Is there interest in a custom Scala API for Avro records and protocols? I am currently working on an schema compiler for Scala, but before I go deeper, I would really like to have external feedback. I would especially like to hear from anyone who has opinions on how to map Avro types onto Scala types. Here are a few hints on what I've been trying so far: * Records are compiled into two forms: mutable and immutable. Very nice. * To avoid collisions with Java generated classes, scala classes are generated in a .scala sub-package. * Avro arrays are translated to Seq/List when immutable and Buffer/ArrayBuffer when mutable. * Avro maps are translated to immutable or mutable Map/HashMap. * Bytes/Fixed are translated to Seq[Byte] when immutable and Buffer[Byte] when mutable. * Avro unions are currently translated into Any, but I plan to: * translate union{null, X} into Scala Option[X] * compile union {T1, T2, T3} into a custom case classes to have proper type checking and pattern matching. If you have a record R1, it compiles to a Scala class. If you put it in a union of {T1, String}, what does the case class for the union look like? Is it basically a wrapper like a specialized Either[T1, String] ? 
Maybe Scala will get Union types later to push this into the compiler instead of object instances :) * Scala records provide a method encode(encoder) to serialize as binary into a byte stream (appears ~30% faster than SpecificDatumWriter). * Scala mutable records provide a method decode(decoder) to deserialize a byte stream (appears ~25% faster than SpecificDatumReader). I have some plans to improve {Generic,Specific}Datum{Reader,Writer} in Java, I would be interested in seeing how the Scala one here works. The Java one is slowed by traversing too many data structures that represent decisions that could be pre-computed rather than repeatedly parsed for each record. * Scala records implement the SpecificRecord Java interface (with some overhead), so one may still use the SpecificDatumReader/Writer when the custom encoder/decoder methods cannot be used. * Mutable records can be converted to immutable (ie. can act as builders). Thanks, Christophe
Re: How represent abstract in Schemas
Avro schemas can represent Union types, but not abstract types. It does not make sense to serialize an abstract class, since its data members are not known. By definition, an abstract type does not define all of the possible sub types in advance, which presents another problem -- in order to make sense of serialized data, the universe of types serialized needs to be known. You can model an abstract type with union types with a little bit of work. For example, if you have type AbstractThing, with children Concrete1 and Concrete2, you can serialize these as a union of Concrete1 and Concrete2. When reading the element with this union, you will need to check the instance type at runtime and cast, or, if you know the super type is AbstractThing, you can blindly cast to AbstractThing. As new types are added, your schema will change to include more branches in the union. If you remove a type, you will need to provide a default in case the removed type is encountered while reading data. If you are using the Java Specific API, the above will not work without wrapper classes that contain the hierarchy, and the ability to create these from the serialized types. Serialization deals only with data stored in member variables, and interfaces have no data. An Avro Protocol maps to a Java Interface, but it is never serialized; it represents a contract for exchanging serialized data. -Scott On 5/6/12 9:55 PM, Gavin YAO gavin.ming@gmail.com wrote: Hello: I am very new to the Apache Avro community so I hope I am doing right in just sending a mail to this address. Is it possible to represent abstract types, as we can in the Java language with an abstract class or interface? Thanks a lot!
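The runtime check Scott describes can be sketched in plain Java. This is only an illustration and does not use any Avro API: AbstractThing, Concrete1 and Concrete2 are the names from the message, while the fields, the label() method and the describe helpers are hypothetical. It assumes a value read from a union field arrives typed as Object.

```java
class UnionDispatchSketch {
    // Stand-ins for the hierarchy named in the message; the fields are hypothetical.
    static abstract class AbstractThing {
        abstract String label();
    }

    static final class Concrete1 extends AbstractThing {
        final int count;
        Concrete1(int count) { this.count = count; }
        String label() { return "Concrete1(" + count + ")"; }
    }

    static final class Concrete2 extends AbstractThing {
        final String name;
        Concrete2(String name) { this.name = name; }
        String label() { return "Concrete2(" + name + ")"; }
    }

    // Option 1: check the instance type at runtime and cast to the concrete branch.
    static String describe(Object datum) {
        if (datum instanceof Concrete1) {
            return "branch 0: " + ((Concrete1) datum).label();
        } else if (datum instanceof Concrete2) {
            return "branch 1: " + ((Concrete2) datum).label();
        }
        throw new IllegalArgumentException("not a branch of the union: " + datum);
    }

    // Option 2: every branch is known to extend AbstractThing, so cast to the super type.
    static String describeBlindly(Object datum) {
        return ((AbstractThing) datum).label();
    }

    public static void main(String[] args) {
        Object first = new Concrete1(3);
        Object second = new Concrete2("x");
        System.out.println(describe(first));
        System.out.println(describeBlindly(second));
    }
}
```

Adding a Concrete3 later would mean one more branch in the union schema and one more instanceof case in describe; describeBlindly would keep working unchanged.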
Re: Nested schema issue
On 5/1/12 9:47 AM, Peter Cameron pe...@pbpartnership.com wrote: I'm having a problem with nesting schemas. A very brief overview of why we're using Avro (successfully so far) is: o code generation not required o small binary format o dynamic use of schemas at runtime We're doing a flavour of RPC, and the reason we're not using Avro's IDL and flavour of RPC is because the endpoint is not necessarily a Java platform (C# and Java for our purposes), and only the Java implementation of Avro has RPC. Hence no Avro RPC for us. I'm aware that Avro doesn't import nested schemas out of the box. We need that functionality as we're exposed to schemas over which we have no control, and in the interests of maintainability, these schemas are nicely partitioned and are referenced as types from within other schemas. So, for example, a address schema refers to a some.domain.location object by having a field of type some.domain.location. Note that our runtime has no knowledge of any some.domain package (e.g. address or location objects). Only the endpoints know about some.domain. (A layer at our endpoint runtime serialises any unknown i.e. non-primitive objects as bytestreams.) I implemented a schema cache which intelligently imports schemas on the fly, so adding the address schema to the cache, automatically adds the location schema that it refers to. The cache uses Avro's schema to parse an added schema, catches parse exceptions, looks at the exception message to see whether or not the error is due to a missing or undefined type, and thus goes off to import the needed schema. Brittle, I know, but no other way for us. We need this functionality, and nothing else comes close to Avro. On the Java side, recent versions have a Parser that can deal with schema import. It requires that a schema be defined before use however. Perhaps we can add a callback to the API for returning undefined schemas as they are found. So far so good, until today when I hit a corner case. 
Say I have an address object that has two fields, called position1 and position2. If position1 and position2 are non-primitive types, then the address schema doesn't parse so presumably is an invalid Avro schema. The error concerns redefining the location type. Here's the example: location schema == { "name": "location", "type": "record", "namespace": "some.domain", "fields": [ { "name": "latitude", "type": "float" }, { "name": "longitude", "type": "float" } ] } address schema == { "name": "address", "type": "record", "namespace": "some.domain", "fields": [ { "name": "street", "type": "string" }, { "name": "city", "type": "string" }, { "name": "position1", "type": "some.domain.location" }, { "name": "position2", "type": "some.domain.location" } ] } Now, an answer of having a list of positions as a field is not an answer for us, as we need to solve the general issue of a schema with more than one instance of the same nested type i.e. my problem is not with an address or location schema. Can this be done? This is potentially a blocker for us. This should be possible. A named type can be used for multiple differently named fields in a record. Is the parse error in C# or Java? What is the error? cheers, Peter
Re: Nested schema issue (with munged invalid schema)
On 5/1/12 9:55 AM, Peter Cameron pe...@pbpartnership.com wrote: I'm having a problem with nesting schemas. A very brief overview of why we're using Avro (successfully so far) is: o code generation not required o small binary format o dynamic use of schemas at runtime We're doing a flavour of RPC, and the reason we're not using Avro's IDL and flavour of RPC is because the endpoint is not necessarily a Java platform (C# and Java for our purposes), and only the Java implementation of Avro has RPC. Hence no Avro RPC for us. I'm aware that Avro doesn't import nested schemas out of the box. We need that functionality as we're exposed to schemas over which we have no control, and in the interests of maintainability, these schemas are nicely partitioned and are referenced as types from within other schemas. So, for example, a address schema refers to a some.domain.location object by having a field of type some.domain.location. Note that our runtime has no knowledge of any some.domain package (e.g. address or location objects). Only the endpoints know about some.domain. (A layer at our endpoint runtime serialises any unknown i.e. non-primitive objects as bytestreams.) I implemented a schema cache which intelligently imports schemas on the fly, so adding the address schema to the cache, automatically adds the location schema that it refers to. The cache uses Avro's schema to parse an added schema, catches parse exceptions, looks at the exception message to see whether or not the error is due to a missing or undefined type, and thus goes off to import the needed schema. Brittle, I know, but no other way for us. We need this functionality, and nothing else comes close to Avro. So far so good, until today when I hit a corner case. Say I have an address object that has two fields, called position1 and position2. If position1 and position2 are non-primitive types, then the address schema doesn't parse so presumably is an invalid Avro schema. 
The error concerns redefining the location type. Here's the example: location schema == { "name": "location", "type": "record", "namespace": "some.domain", "fields": [ { "name": "latitude", "type": "float" }, { "name": "longitude", "type": "float" } ] } address schema == { "name": "address", "type": "record", "namespace": "some.domain", "fields": [ { "name": "street", "type": "string" }, { "name": "city", "type": "string" }, { "name": "position1", "type": "some.domain.location" }, { "name": "position2", "type": "some.domain.location" } ] } Now, an answer of having a list of positions as a field is not an answer for us, as we need to solve the general issue of a schema with more than one instance of the same nested type i.e. my problem is not with an address or location schema. The problematic schema constructed by my schema cache is: { "name": "address2", "type": "record", "namespace": "some.domain", "fields": [ { "name": "street", "type": "string" }, { "name": "city", "type": "string" }, { "name": "position1", "type": {"type":"record","name":"location","namespace":"some.domain","fields":[{"name":"latitude","type":"float"},{"name":"longitude","type":"float"}]} }, { "name": "position2", "type": {"type":"record","name":"location","namespace":"some.domain","fields":[{"name":"latitude","type":"float"},{"name":"longitude","type":"float"}]} } ] } The second time that location is used, it should be used by reference, and not re-defined. I believe that "name": "position2", "type": "some.domain.location" should work, provided the type some.domain.location is defined previously in the schema, as it is in position1. Can this be done? This is potentially a blocker for us. cheers, Peter
Re: Support for Serialization and Externalization?
On 4/23/12 10:37 AM, Joe Gamache gama...@cabotresearch.com wrote: Hello, We have been using Avro successfully to serialize many of our objects, using binary encoding, for storage and retrieval. Although the documentation about the Reflect Mapping states: "This API is not recommended except as a stepping stone for systems that currently uses Java interfaces to define RPC protocols." we used this mapping as that recommendation did not seem to apply. We do not use the serialized data for RPC (or any other messaging system). In fact, this part has been in place for a while and works exceptionally well. Now we would like to externalize a smaller subset of the objects for interaction with a WebApp. Here we would like to use the JSON encoding and the specific mapping. We tried having this set of objects implement GenericRecord; however, this then breaks the use of Reflection on these objects. [The ReflectData.createSchema method checks for this condition.] Can Avro be used to serialize objects one way, and externalize them another? [The externalized objects are a subset of the serialized ones.] Perhaps more generally, my question is: can both binary encoding and JSON encoding be supported on overlapping objects using different mappers? If yes, what is the best way to accomplish this? That should be possible. If not, I think it is a bug. The Java reflect API is supposed to be able to handle Specific and Generic records, or at least there is supposed to be a way to use them both. What is the specific error, from what API call? Perhaps it is a simple fix and you can submit a patch and test to JIRA? Thanks, -Scott Thanks for any help - I am still quite a noob here so I greatly appreciate any additional details! Joe Gamache
Re: Specific/GenericDatumReader performance and resolving decoders
I think this approach makes sense, reader=writer is common. In addition to record fields, unions are affected. I have been thinking for a while about the issue that resolving records is slower than not resolving. In theory, it could be just as fast because you can pre-compute the steps needed and bake that into the reading logic. This seems like a reasonable way to avoid the cost for the case where the schemas are equal. Please open a JIRA ticket and put your preliminary thoughts there. It is a good place to discuss the technical bits of the issue even before you have a patch. On 4/19/12 2:09 AM, Irving, Dave dave.irv...@baml.com wrote: Hi, Recently I've been looking at the performance of Avro's SpecificDatumReaders/Writers. In our use cases, when deserializing, we find it quite usual for reader / writer schemas to be identical. Interestingly, GenericDatumReader bakes in the use of ResolvingDecoders right into its core. So even if constructed with a single (reader/writer) schema, a ResolvingDecoder is still used. I experimented a little, and wrote a SpecificDatumReader which, instead of being hard wired with a ResolvingDecoder, uses a DecodeStrategy, leaving the reader only dealing with Decoders directly. Details follow, but for 'same schema' decodes the performance difference is impressive. For the types of records I deal with, a decode with reader schema == writer schema using this approach is about 1.6x faster than a standard SpecificDatumReader decode. interface DecodeStrategy { Decoder configureForRead(Decoder in) throws IOException; void readComplete() throws IOException; void decodeRecordFields(Object old, SpecificRecord record, Schema expected, Decoder in, SpecificDatumReader2 reader) throws IOException; } The idea is that when we hit a record, instead of getting field order from a ResolvingDecoder directly, we just let the decode strategy do it for us (calling back for each field to the reader allowing recursion). For e.g. 
when we know reader / writer schemas are identical, and we don't want validation, an IdentitySchemaDecodeStrategy#decodeRecordFields can just pull the fields direct from the provided record schema (calling back on the reader for each one): ... void decodeRecordFields(..) { List<Field> fields = expected.getFields(); for (int i = 0, len = fields.size(); i < len; ++i) { reader.readField(old, in, fields.get(i), record); } } ... The resolving decoder impl of this strategy just does a 'readFieldOrder' like GenericDatumReader does today. For each read (given a Decoder), the datum reader lets the decode strategy return back the actual decoder to be used (via #configureForRead). This means that a resolving implementation can use this hook to configure the ResolvingDecoder and return this. The result is that the datum reader can work with same schema / validated schema / resolved schemas seamlessly without caring about the difference. I thought I'd share the approach before working on a full patch: Is this an approach you'd be interested in taking back to core avro? Or is it a little niche? :) Cheers, Dave
[ANNOUNCE] New Apache Avro PMC Member: Douglas Creager
The Apache Avro PMC is pleased to announce that Douglas Creager is now part of the PMC. Congratulations and Thanks!
Re: Sync Marker Issue while reading AVRO files writen with FLUME with PIG
I have not seen this issue before with 100 TB of Avro files, but am not using Flume to write them. We have moved on to Avro 1.6.x but were on the 1.5.x line for quite some time. Perhaps while writing there was an exception of some sort that was not handled correctly in Avro or Flume. Looking at the DataFileWriter code, I can see how a file could get truncated without a sync marker if the writing process crashes, but not how it could successfully write two blocks in a row without a sync between. You should be able to modify the file reader to recover and re-write the data if it is only a missing sync marker, or skip over the block if it is corrupt. On 4/3/12 1:28 AM, Markus Resch markus.re...@adtech.de wrote: Hey everyone, we're facing a problem while reading AVRO files written with FLUME using the AVRO Java API 1.5.4 into a HADOOP cluster. The Avro Data Store complains about a missing sync marker. Investigating the problem shows us that this is perfectly right: the sync marker is missing. Thus we have a block of double the size. Our software packages: rpm -qa | grep hadoop hadoop-0.20-namenode-0.20.2+923.142-1 hadoop-0.20-0.20.2+923.142-1 hadoop-0.20-native-0.20.2+923.142-1 hadoop-hive-0.7.1+42.27-2 hadoop-pig-0.8.1+28.18-1 This is pretty much all a basic cloudera CDH3 Update 2 Packaging installation with a patched PIG version which is CDH3 Update 3. Has anyone had a similar issue? Does this ring a bell? Thanks Markus
Re: avro compression using snappy and deflate
On 3/30/12 12:08 PM, Shirahatti, Nikhil snik...@telenav.com wrote: Hello All, I think I figured out where I goofed up. I was flushing on every record, so basically this was compression per record, so it had metadata with each record. This was adding more data to the output when compared to avro. So now I have better figures: at least they look realistic; I still need to find out if it is map-reduceable. Avro = 12G Avro+Deflate = 4.5G Deflate is affected quite a bit by the compression level selected (1 to 9) in both performance and level of compression. However, in my experience anything past level 6 is only very slightly smaller and much slower, while the difference between levels 1 to 3 is large on both fronts. Avro+Snappy = 5.5G Have others tried Avro + LZO? I have not heard of anyone doing this. LZO is not Apache license compatible, and there are now several alternatives that are in the same class of compression algorithm available, including Snappy. Thanks, Nikhil On 3/30/12 12:54 AM, Shirahatti, Nikhil snik...@telenav.com wrote: The original data file (a text file) is 40GB, the avro file is about 12GB, avro snappy is 13GB! Thanks, Nikhil
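The effect Nikhil ran into (compressing each record on its own versus compressing a whole block) and the influence of the deflate level can be sketched with the JDK alone. This is a rough illustration, not Avro's codec: it uses java.util.zip.Deflater directly, and the record layout and the names record, deflatedSize, perRecordTotal and blockTotal are made up for the example. Actual sizes depend heavily on the data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class DeflateBlockSketch {
    // Hypothetical record: a small JSON-like line whose field names repeat across records.
    static byte[] record(int i) {
        String s = "{\"user\":\"user" + (i % 50) + "\",\"city\":\"Sunnyvale\",\"seq\":" + i + "}\n";
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Deflate data in one shot at the given level (1..9) and return the compressed size in bytes.
    static int deflatedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Compress every record separately, similar to flushing the writer after each record.
    static int perRecordTotal(int n, int level) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += deflatedSize(record(i), level);
        }
        return total;
    }

    // Compress all records together as a single block.
    static int blockTotal(int n, int level) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (int i = 0; i < n; i++) {
            byte[] r = record(i);
            block.write(r, 0, r.length);
        }
        return deflatedSize(block.toByteArray(), level);
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("per-record, level 6: " + perRecordTotal(n, 6) + " bytes");
        for (int level : new int[] {1, 6, 9}) {
            System.out.println("one block, level " + level + ": " + blockTotal(n, level) + " bytes");
        }
    }
}
```

Compressing per record gives deflate nothing to share between records and adds fixed overhead to each one, which is consistent with the larger files seen when flushing after every record.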
Re: BigInt / longlong
On 3/28/12 11:01 AM, Meyer, Dennis dennis.me...@adtech.com wrote: Hi, What type refers to a Java Bigint or C long long? Or is there any other type in Avro that maps a 64 bit unsigned int? I unfortunately could only find smaller types in the docs: Primitive Types The set of primitive type names is: * string: unicode character sequence * bytes: sequence of 8-bit bytes * int: 32-bit signed integer * long: 64-bit signed integer * float: single precision (32-bit) IEEE 754 floating-point number * double: double precision (64-bit) IEEE 754 floating-point number * boolean: a binary value * null: no value Anyway, in the encoding section there's some 64 bit unsigned. Can I use them somehow by a type? An unsigned value fits in a signed one. They are both 64 bits. Each language that supports a long unsigned type has its own way to convert from one to the other without loss of data. A workaround might be to use the 52 significant bits of a double, but that seems like a hack and of course loses some more number space compared to uint64. I'd like to get around any other self-encoding hacks as I'd like to also use Hadoop/PIG/HIVE on top of AVRO, so would like to keep functionality on numbers if possible. Java does not have an unsigned 64 bit type. Hadoop/Pig/Hive all only have signed 64 bit integer quantities. Luckily, multiplication and addition on two's complement signed values is identical to the operations on unsigned ints, so for many operations there is no loss in fidelity as long as you pass the raw bits on to something that interprets the number as an unsigned quantity. That is, if you take the raw bits of a set of unsigned 64 bit numbers, and treat those bits as if they are signed 64 bit quantities, then do addition, subtraction, and multiplication on them, then treat the raw bit result as an unsigned 64 bit value, it is as if you did the whole thing unsigned. 
http://en.wikipedia.org/wiki/Two%27s_complement Avro only has signed 32 and 64 bit integer quantities because they can be mapped to unsigned ones in most cases without a problem and many (actually, most) languages do not support unsigned integers. If you want various precision quantities you can use an Avro Fixed type to map to any type you choose. For example you can use a 16 byte fixed to map to 128 bit unsigned ints. Thanks, Dennis
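The two's complement argument above can be tried out in plain Java. The sketch below is an assumption-laden illustration rather than anything from Avro: it relies on the unsigned helper methods on java.lang.Long (available since Java 8, so newer than this thread), and the method names sum and product are made up. The values are carried in an ordinary signed long, which is what an Avro long maps to in Java.

```java
class UnsignedLongSketch {
    // Add two unsigned 64 bit decimal strings using ordinary signed long arithmetic.
    static String sum(String x, String y) {
        long bits = Long.parseUnsignedLong(x) + Long.parseUnsignedLong(y);
        return Long.toUnsignedString(bits);
    }

    // Multiply two unsigned 64 bit decimal strings; the result wraps modulo 2^64.
    static String product(String x, String y) {
        long bits = Long.parseUnsignedLong(x) * Long.parseUnsignedLong(y);
        return Long.toUnsignedString(bits);
    }

    public static void main(String[] args) {
        long big = Long.parseUnsignedLong("9223372036854775808"); // 2^63, stored as Long.MIN_VALUE

        // Addition and multiplication need no special handling.
        System.out.println(sum("9223372036854775808", "5"));
        System.out.println(product("4294967296", "4294967295"));

        // Ordering and division do depend on the interpretation, so use the unsigned helpers.
        System.out.println("signed compare:   " + Long.compare(big, 1L));
        System.out.println("unsigned compare: " + Long.compareUnsigned(big, 1L));
        System.out.println("unsigned divide:  " + Long.toUnsignedString(Long.divideUnsigned(big, 2L)));
    }
}
```

Addition, subtraction and multiplication work on the raw bits as described in the message; comparison and division are the operations where the signed and unsigned readings differ, which matters for sorting or range filters in tools that only know signed longs.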
Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem
Avro Java's file writer[1] (the last several versions) rewinds its buffer if there is an exception during writing, so if there are writes afterwards the file will not be corrupt. However, most tools are not so careful. [1] DataFileWriter.append() http://svn.apache.org/repos/asf/avro/trunk/lang/java/avro/src/main/java/org/apache/avro/file/DataFileWriter.java On 3/23/12 8:27 PM, Russell Jurney russell.jur...@gmail.com wrote: Ok, now I have a followup question... how does one recover from an exception writing an Avro? The incomplete record is being written, which is crashing the reader. On Fri, Mar 23, 2012 at 8:01 PM, Russell Jurney russell.jur...@gmail.com wrote: Thanks Scott, looking at the raw data it seems to have been a truncated record due to UTF problems. Russell Jurney http://datasyndrome.com On Mar 23, 2012, at 7:59 PM, Scott Carey scottca...@apache.org wrote: It appears to be reading a union index and failing in there somehow. If it did not have any of the pig AvroStorage stuff in there I could tell you more. What does avro-tools.jar's 'tojson' tool do? (java -jar avro-tools-1.6.3.jar tojson file | your_favorite_text_reader) What version of Avro is the java stack trace below? On 3/23/12 7:01 PM, Russell Jurney russell.jur...@gmail.com wrote: I have a problem record I've written in Avro that crashes anything which tries to read it :( Can anyone make sense of these errors? 
The exception in Pig/AvroStorage is this: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:275) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532) at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364) at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142) at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220) at org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80) at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:273) ... 
7 more When reading the record in Python: File /me/Collecting-Data/src/python/cat_avro, line 21, in module for record in df_reader: File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/datafile.py, line 354, in next datum = self.datum_reader.read(self.datum_decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 445, in read return self.read_data(self.writers_schema, self.readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 490, in read_data return self.read_record(writers_schema, readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 690, in read_record field_val = self.read_data(field.type, readers_field.type, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 488, in read_data return self.read_union(writers_schema, readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 650, in read_union raise SchemaResolutionException(fail_msg, writers_schema, readers_schema) avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches When reading the record in Ruby: /Users/peyomp/.rvm/gems/ruby-1.8.7-p352/gems/avro-1.6.1/lib/avro/io.rb:298 :in `read_data': Writer's schema and Reader's schema [string,null] do not match. 
(Avro::IO::SchemaMatchException) -- Russell Jurney twitter.com/rjurney http://twitter.com/rjurney russell.jur...@gmail.com mailto:russell.jur...@gmail.com datasyndrome.com http://datasyndrome.com/ -- Russell Jurney twitter.com/rjurney http://twitter.com
Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem
It appears to be reading a union index and failing in there somehow. If it did not have any of the pig AvroStorage stuff in there I could tell you more. What does avro-tools.jar's 'tojson' tool do? (java -jar avro-tools-1.6.3.jar tojson file | your_favorite_text_reader) What version of Avro is the java stack trace below? On 3/23/12 7:01 PM, Russell Jurney russell.jur...@gmail.com wrote: I have a problem record I've written in Avro that crashes anything which tries to read it :( Can anyone make sense of these errors? The exception in Pig/AvroStorage is this: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:275) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532) at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364) at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142) at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220) at org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80) at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:273) ... 
7 more When reading the record in Python: File /me/Collecting-Data/src/python/cat_avro, line 21, in module for record in df_reader: File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/datafile.py, line 354, in next datum = self.datum_reader.read(self.datum_decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 445, in read return self.read_data(self.writers_schema, self.readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 490, in read_data return self.read_record(writers_schema, readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 690, in read_record field_val = self.read_data(field.type, readers_field.type, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 488, in read_data return self.read_union(writers_schema, readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 650, in read_union raise SchemaResolutionException(fail_msg, writers_schema, readers_schema) avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches When reading the record in Ruby: /Users/peyomp/.rvm/gems/ruby-1.8.7-p352/gems/avro-1.6.1/lib/avro/io.rb:298:in `read_data': Writer's schema and Reader's schema [string,null] do not match. (Avro::IO::SchemaMatchException) -- Russell Jurney twitter.com/rjurney http://twitter.com/rjurney russell.jur...@gmail.com mailto:russell.jur...@gmail.com datasyndrome.com http://datasyndrome.com/
Re: Globbing several AVRO files with different (extended) schemes
I'm assuming you are using Pig's AvroStorage function. It appears that it does not support schema migration, but it certainly could do so. A collection of avro files can be 'viewed' as if they all are of one schema provided they can all resolve to it. I have several tools that do this successfully with MapReduce/Pig/Hive. The Pig AvroStorage tool is maintained by the Apache Pig project; you will need to inquire there in order to get more details. -Scott On 3/20/12 2:27 AM, Markus Resch markus.re...@adtech.de wrote: Hi guys, Thanks again for your awesome hint about sqoop. I have another question: The data I'm working with is stored as AVRO files in Hadoop. When I try to glob them everything works just perfectly. But when I change the schema of a single data file (e.g. I add a new data field) while the others remain, everything gets wrecked: "currently we assume all avro files under the same location share the same schema and will throw exception if not." Expected behavior for me would be: If I'm globbing several files with slightly different schemas, the result of the LOAD would either be an intersection of all valid fields that are common to all schemas, or the atoms of the missing fields would be nulled. How could I handle this properly? Thanks Markus
Re: a possible bug in Avro MapReduce
Perhaps it is https://issues.apache.org/jira/browse/AVRO-1045 Are you creating a copy of the GenericRecord? -Scott On 3/19/12 3:34 PM, ey-chih chow eyc...@hotmail.com wrote: Hi, We got an Avro MapReduce job with the signature of the map function as follows: public void map(ByteBuffer input, AvroCollector<Pair<Utf8, GenericRecord>> collector, Reporter reporter) throws IOException; However, the position of the ByteBuffer input, i.e. input.position(), is always set to 0 when map() gets invoked. With this, we can not extract data from input. This is for the version of avro 1.5.4. For the older versions of avro, input.position() is set to the end of the input data. Does anybody know why this gets set to 0? Or is this a bug? Ey-Chih Chow
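As background on the ByteBuffer question, here is a small JDK-only sketch of reading a buffer by its position and limit rather than by an assumed position. It rests on an assumption the thread does not confirm: that the buffer passed to map() has its limit at the end of the data, so a position of 0 would mean "ready to read". The helper name readableBytes is made up.

```java
import java.nio.ByteBuffer;

class ByteBufferReadSketch {
    // Copy the readable bytes (position..limit) without moving the caller's position.
    static byte[] readableBytes(ByteBuffer input) {
        ByteBuffer view = input.duplicate(); // shares content, has its own position and limit
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        // wrap() yields position 0 and limit 3, i.e. three readable bytes.
        ByteBuffer input = ByteBuffer.wrap(new byte[] {10, 20, 30});
        byte[] bytes = readableBytes(input);
        System.out.println(bytes.length + " bytes read, position is " + input.position());
    }
}
```

Code that instead treats position() as the length of the data (for example by copying from index 0 up to position()) would see nothing in such a buffer, so it may be worth checking which convention the job's extraction code assumes.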
Re: Java MapReduce Avro Jackson Error
What version of Avro are you using? You may want to try Avro 1.6.3 + Jackson 1.8.8. This is related, but is not your exact problem. https://issues.apache.org/jira/browse/AVRO-1037 You are likely pulling in some other version of Jackson somehow. You may want to use 'mvn dependency:tree' on your project to see where all the dependencies are coming from. That may help identify the culprit. -Scott On 3/19/12 5:06 PM, Deepak Nettem deepaknet...@gmail.com wrote: Sorry, I meant, I added the jackson-core-asl dependency, and still get the error. <dependency> <groupId>org.codehaus.jackson</groupId> <artifactId>jackson-core-asl</artifactId> <version>1.5.2</version> <scope>compile</scope> </dependency> On Mon, Mar 19, 2012 at 8:05 PM, Deepak Nettem deepaknet...@gmail.com wrote: Hi Tatu, I added the dependency: <dependency> <groupId>org.codehaus.jackson</groupId> <artifactId>jackson-mapper-asl</artifactId> <version>1.5.2</version> <scope>compile</scope> </dependency> But that still gives me this error: Error: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory; Any other ideas? On Mon, Mar 19, 2012 at 7:27 PM, Tatu Saloranta tsalora...@gmail.com wrote: On Mon, Mar 19, 2012 at 4:20 PM, Deepak Nettem deepaknet...@gmail.com wrote: I found that the Hadoop lib directory contains jackson-core-asl-1.0.1.jar and jackson-mapper-asl-1.0.1.jar. I removed these, but got this error: hadoop Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException I am using Maven as a build tool, and my pom.xml has this dependency: <dependency> <groupId>org.codehaus.jackson</groupId> <artifactId>jackson-mapper-asl</artifactId> <version>1.5.2</version> <scope>compile</scope> </dependency> Any help on this issue would be greatly appreciated. You may want to add a similar entry for jackson-core-asl -- mapper does require core, and although there is a transitive dependency from mapper, Maven does not necessarily enforce the correct version. 
So it is best to add an explicit dependency so that the version of core is also 1.5.x; you may otherwise just get 1.0.1 of that one. -+ Tatu +-
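When an old Jackson is suspected of "lurking somewhere", one way to check at runtime is to ask the JVM where a class was actually loaded from. The following is a small sketch using only standard JDK calls; `ClasspathProbe` and `locate` are hypothetical names, and the class name probed in `main` is simply the one from the error message above.

```java
import java.security.CodeSource;

/** Reports which jar (or directory) a class was actually loaded from. */
final class ClasspathProbe {

    static String locate(String className) {
        try {
            // Load without initializing, so static initializers do not run.
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes from the JDK's bootstrap loader typically have no code source.
            return source == null
                    ? "no code source (JDK class)"
                    : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Run this with the same classpath as the failing job.
        System.out.println(locate("org.codehaus.jackson.JsonFactory"));
    }
}
```

If the printed location is a jar other than the version declared in pom.xml (for example one inside the Hadoop lib directory or a shaded jar), that jar is the one winning on the classpath.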
Re: Java MapReduce Avro Jackson Error
If you are using avro-tools, beware that it is a shaded jar with all dependencies inside of it, for use as a command line tool (java -jar avro-tools-VERSION.jar). If you are using avro-tools in your project for some reason (there are really only command line utilities in it), use the nodeps classifier: <classifier>nodeps</classifier> http://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.6.3/ Note the nodeps jar is 47K, while the default jar is 10MB. For what it is worth, I removed the Jackson jar from our Hadoop install long ago. It is used to dump configuration files to JSON there, a peripheral feature we don't use. Another thing that you may want to do is change your Hadoop dependency scope to <scope>provided</scope>, since Hadoop will be put on your classpath by the Hadoop environment. Short of this, excluding the chained Hadoop dependencies you aren't using (most likely: jetty, kfs, and the tomcat:jasper and eclipse:jdt stuff) may help. On 3/19/12 6:23 PM, Deepak Nettem deepaknet...@gmail.com wrote: Hi Tatu / Scott, Thanks for your replies. 
I replaced the earlier dependencies with these: <dependency> <groupId>org.apache.avro</groupId> <artifactId>avro-tools</artifactId> <version>1.6.3</version> </dependency> <dependency> <groupId>org.apache.avro</groupId> <artifactId>avro</artifactId> <version>1.6.3</version> </dependency> <dependency> <groupId>org.codehaus.jackson</groupId> <artifactId>jackson-mapper-asl</artifactId> <version>1.8.8</version> <scope>compile</scope> </dependency> <dependency> <groupId>org.codehaus.jackson</groupId> <artifactId>jackson-core-asl</artifactId> <version>1.8.8</version> <scope>compile</scope> </dependency> And this is my app dependency tree:
[INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ AvroTest ---
[INFO] org.avrotest:AvroTest:jar:1.0-SNAPSHOT
[INFO] +- junit:junit:jar:3.8.1:test (scope not updated to compile)
[INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO] +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
[INFO] +- net.sf.json-lib:json-lib:jar:jdk15:2.3:compile
[INFO] |  +- commons-beanutils:commons-beanutils:jar:1.8.0:compile
[INFO] |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO] |  +- commons-lang:commons-lang:jar:2.4:compile
[INFO] |  +- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] |  \- net.sf.ezmorph:ezmorph:jar:1.0.6:compile
[INFO] +- org.apache.avro:avro-tools:jar:1.6.3:compile
[INFO] |  \- org.slf4j:slf4j-api:jar:1.6.4:compile
[INFO] +- org.apache.avro:avro:jar:1.6.3:compile
[INFO] |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
[INFO] |  \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
[INFO] \- org.apache.hadoop:hadoop-core:jar:0.20.2:compile
[INFO]    +- commons-cli:commons-cli:jar:1.2:compile
[INFO]    +- xmlenc:xmlenc:jar:0.52:compile
[INFO]    +- commons-httpclient:commons-httpclient:jar:3.0.1:compile
[INFO]    +- commons-codec:commons-codec:jar:1.3:compile
[INFO]    +- commons-net:commons-net:jar:1.4.1:compile
[INFO]    +- org.mortbay.jetty:jetty:jar:6.1.14:compile
[INFO]    +- org.mortbay.jetty:jetty-util:jar:6.1.14:compile
[INFO]    +- tomcat:jasper-runtime:jar:5.5.12:compile
[INFO]    +- tomcat:jasper-compiler:jar:5.5.12:compile
[INFO]    +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
[INFO]    +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
[INFO]    |  \- ant:ant:jar:1.6.5:compile
[INFO]    +- commons-el:commons-el:jar:1.0:compile
[INFO]    +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO]    +- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile
[INFO]    +- net.sf.kosmosfs:kfs:jar:0.3:compile
[INFO]    +- hsqldb:hsqldb:jar:1.8.0.10:compile
[INFO]    +- oro:oro:jar:2.0.8:compile
[INFO]    \- org.eclipse.jdt:core:jar:3.1.1:compile
I still get the same error. Is there anything specific I need to do other than changing dependencies in pom.xml to make this error go away? On Mon, Mar 19, 2012 at 9:12 PM, Tatu Saloranta tsalora...@gmail.com wrote: On Mon, Mar 19, 2012 at 6:06 PM, Scott Carey scottca...@apache.org wrote: What version of Avro are you using? You may want to try Avro 1.6.3 + Jackson 1.8.8. This is related, but is not your exact problem. https://issues.apache.org/jira/browse/AVRO-1037 You are likely pulling in some other version of jackson somehow. You may want to use 'mvn dependency:tree' on your project to see where all the dependencies are coming from. That may help identify the culprit. This sounds like a good idea, and I agree in that this is probably still due to an old version lurking somewhere. -+ Tatu +-
Re: Make a copy of an avro record
We should be generating Java 1.6 compatible code. What version were you testing? 1.6.3 is near release, the RC is available here: http://mail-archives.apache.org/mod_mbox/avro-dev/201203.mbox/%3C4F514F22.8 =070...@apache.org%3E Does it have the same problem? On 3/12/12 9:27 AM, Jeremy Lewi jer...@lewi.us wrote: Thanks James and Doug. I was able to simply cast the output of SpecificData...deepCopy to my type and it seems to bypass the problematic methods decorated with @Override. What about the potential incompatibility with earlier versions of Java due to the change in semantics of @Override? If this is really an issue, it seems like it would affect a lot of users, particularly people using Avro MapReduce on a cluster where upgrading Java is not a trivial proposition. In my particular case, the reduce processing requires loading all values associated with the key into memory, which necessitates a deep copy because the iterable object passed to the reducer seems to be reusing the same instance. Using SpecificData.get().deepCopy(record) seems like a viable workaround. Nonetheless, it does seem a bit problematic if the compiler is generating code that is incompatible with earlier versions of Java. J On Mon, Mar 12, 2012 at 9:05 AM, Doug Cutting cutt...@apache.org wrote: On 03/11/2012 10:22 PM, James Baldassari wrote: If you want to make a deep copy of a specific record, the easiest way is probably to use the Builder API, e.g. GraphNodeData.newBuilder(recordToCopy).build(). SpecificData.get().deepCopy(record) should work too. Doug
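The reason Jeremy needs a copy can be shown without Avro at all. The sketch below is a plain-Java illustration under stated assumptions: `Node` stands in for a generated record class and `reusingValues` imitates an iterable that hands out the same object on every step, which is what the reducer input appears to do. None of these names come from Avro.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Shows why buffering values from an instance-reusing iterator needs a copy. */
final class ReusedInstanceDemo {

    /** Mutable record, standing in for a generated Avro class. */
    static final class Node {
        int id;

        Node copy() {
            Node n = new Node();
            n.id = id;
            return n;
        }
    }

    /** Yields ids 0..count-1 but returns the SAME Node object every time. */
    static Iterable<Node> reusingValues(final int count) {
        return () -> new Iterator<Node>() {
            private final Node shared = new Node();
            private int next = 0;

            public boolean hasNext() { return next < count; }

            public Node next() {
                shared.id = next++; // overwrite the shared instance in place
                return shared;
            }
        };
    }

    /** Buffers all values in memory, optionally copying each one first. */
    static List<Integer> bufferIds(Iterable<Node> values, boolean copyFirst) {
        List<Node> buffered = new ArrayList<>();
        for (Node n : values) {
            buffered.add(copyFirst ? n.copy() : n);
        }
        List<Integer> ids = new ArrayList<>();
        for (Node n : buffered) {
            ids.add(n.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(bufferIds(reusingValues(3), false)); // all entries alias one object
        System.out.println(bufferIds(reusingValues(3), true));  // copies keep distinct values
    }
}
```

In a real reducer the `copy()` call would be replaced by one of the approaches named in the thread, the Builder API or the deepCopy method on SpecificData.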
Re: parsing Avro data files in JavaScript
See also the discussion about a JavaScript Avro implementation from last week: http://search-hadoop.com/m/MiNCyvLts/HttpTranceiversubj=HttpTranceiver+and+JSON+encoded+Avro+ On 2/21/12 7:56 AM, Carriere, Jeromy jero...@x.com wrote: We're working on one to support the X.commerce Fabric: https://github.com/xcommerce/node-avro Which is based on: http://code.google.com/p/javascript-avro/ -- Jeromy Carriere Chief Architect X.commerce From: Kevin Meinert ke...@subatomicglue.com Reply-To: user@avro.apache.org Date: Tue, 21 Feb 2012 09:51:49 -0600 To: user@avro.apache.org Subject: parsing Avro data files in JavaScript Does anyone have an example of an Avro binary parser for JavaScript? I don't see a JS implementation in the downloads. Also, I'm curious whether anyone has written a simple parser for Avro binary in any other language, or has tips for writing one. --- kevin meinert | http://www.subatomiclabs.com
Re: Order of the schema in Union
As for why the union does not seem to match: the union schemas are not the same as the one in the error; the one in the error does not have a namespace. It finds AVRO_NCP_ICM, but the union has only merced.AVRO_NCP_ICM and merced.AVRO_IVR_BY_CALLID. The namespace and name must both match. Is your output schema correct? It looks like you are setting both your MapOutputSchema and OutputSchema to be a Pair schema. I suspect you only want the Pair schema as the map output and reducer input, but I cannot be sure from the below. From the below, your reducer must create Pair objects and output them, and maybe that is related to the error below. It may also be related to the combiner; does it happen without it? On 2/12/12 11:01 PM, Serge Blazhievsky easyv...@gmail.com wrote: Hi all, I am running into an interesting problem with Union. It seems that the order of the schemas in the union must be the same as the order of the input paths for the different files. This does not look like the right behavior. The code and exception are below. The moment I change the order in the union it works. 
Thanks Serge

public int run(String[] strings) throws Exception {
    JobConf job = new JobConf();
    job.setNumMapTasks(map);
    job.setNumReduceTasks(reduce);
    // Uncomment to run locally in a single process
    // job.set("mapred.job.tracker", "local");

    // Read the schema of the first input file
    File file = new File(input);
    DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
    DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
    Schema s = dataFileReader.getSchema();

    // Read the schema of the second input file
    File lfile = new File(linput);
    DatumReader<GenericRecord> lreader = new GenericDatumReader<GenericRecord>();
    DataFileReader<GenericRecord> ldataFileReader = new DataFileReader<GenericRecord>(lfile, lreader);
    Schema s2 = ldataFileReader.getSchema();

    // The input schema is the union of both file schemas
    List<Schema> slist = new ArrayList<Schema>();
    slist.add(s2);
    slist.add(s);
    System.out.println(s.toString(true));
    System.out.println(s2.toString(true));
    Schema s_union = Schema.createUnion(slist);
    AvroJob.setInputSchema(job, s_union);

    // The output record has the same field names as s, all typed as string
    List<Schema.Field> fields = s.getFields();
    List<Schema.Field> outfields = new ArrayList<Schema.Field>();
    for (Schema.Field f : fields) {
        outfields.add(new Schema.Field(f.name(), Schema.create(Type.STRING), null, null));
    }
    boolean b = false;
    Schema outschema = Schema.createRecord("AVRO_IVR_BY_CALLID", "AVRO_IVR_BY_CALLID", "merced", b);
    outschema.setFields(outfields);

    Schema STRING_SCHEMA = Schema.create(Schema.Type.STRING);
    Schema OUT_SCHEMA = new Pair<String, GenericRecord>("", STRING_SCHEMA, new GenericData.Record(outschema), outschema).getSchema();
    AvroJob.setMapOutputSchema(job, OUT_SCHEMA);
    AvroJob.setOutputSchema(job, OUT_SCHEMA);
    AvroJob.setMapperClass(job, MapImpl.class);
    AvroJob.setCombinerClass(job, ReduceImpl.class);
    AvroJob.setReducerClass(job, ReduceImpl.class);

    // FileInputFormat.setInputPaths(job, new Path(input));
    FileInputFormat.addInputPath(job, new Path(linput));
    FileInputFormat.addInputPath(job, new Path(input));
    // MultipleInputs.addInputPath(job, new Path(input), AvroInputFormat<GenericRecord>.class, MapImpl.class);
    FileOutputFormat.setOutputPath(job, new Path(output));
    FileOutputFormat.setCompressOutput(job, true);

    int res = 255;
    RunningJob runJob = JobClient.runJob(job);
    if (runJob != null) {
        res = runJob.isSuccessful() ? 0 : 1;
    }
    return res;
}

2/02/12 22:56:52 WARN mapred.LocalJobRunner: job_local_0001 org.apache.avro.AvroTypeException: Found { "type" : "record", "name" : "AVRO_NCP_ICM", "fields" : [ { "name" : "DATADATE", "type" : "string" }, { "name" : "ICM_CALLID", "type" : "string" }, { "name" : "AGENT_ELID", "type" : "string" }, { "name" : "AGENT_NAME", "type" : "string" }, { "name" : "AGENT_SITE", "type" : "string" }, { "name" : "AGENT_SVIEW_USER_ID", "type" : "string" }, { "name" : "AGENT_UNIT_ID", "type" : "string" }, { "name" : "ANI", "type" : "string" }, { "name" : "CALL_CTR_UNIT_ID", "type" : "string" }, { "name" : "CALL_FA_ID", "type" : "string" }, { "name" : "CALL_FUNCTIONALAREA", "type" : "string" }, { "name" : "CTI_CALL_IDENTIFIER", "type" : "string" }, { "name" : "CALLDISPOSITION", "type" : "string" }, { "name" : "AGENTPERIPHERALNUMBER", "type" : "string" }, { "name" :
Re: HttpTranceiver and JSON-encoded Avro?
See https://issues.apache.org/jira/browse/AVRO-485 for some discussion on JavaScript for Avro. Please comment in that ticket with your needs and use case. The project would welcome a JavaScript implementation. On 2/15/12 2:07 PM, Frank Grimes frankgrime...@gmail.com wrote: Are there any fast and stable ones you might recommend? On 2012-02-15, at 4:22 PM, Russell Jurney wrote: FWIW, there are avro libs for JavaScript and node on github. Russell Jurney http://datasyndrome.com On Feb 15, 2012, at 7:32 AM, Frank Grimes frankgrime...@gmail.com wrote: Hi All, Is there any way to send Avro data over HTTP encoded in JSON? We want to integrate with Node.js and JSON seems to be the best/simplest way to do so. Thanks, Frank Grimes
Re: Writing Unsolicited Messages to a Connected Netty Client
For certain kinds of data it would be useful to continuously stream data from server to client (or vice versa). This can be represented as an Avro array response or request where each array element triggers a callback at the receiving end. This likely requires an extension to the Avro spec, but is much more capable than a polling solution. It is related to Comet in the sense that the RPC request is long lived, but is effectively a sequence of smaller inverse RPCs. Polling in general has built-in race conditions for many types of information exchange and should be avoided whenever such race conditions exist. For streaming large volumes of data, this would be much more efficient than an individual RPC per item. For example, if the RPC is "I need to know every state change in X", polling is not an option, but streaming is. If the requirement is "I need to know when the next state change occurs, but do not need to know all changes", polling is OK, and streaming may send too much data. On 1/20/12 11:25 AM, Armin Garcia armin.gar...@arrayent.com wrote: Hi James, I see your point. On a different NIO framework, I implemented exactly the same message handling procedure (i.e. message routing) you just described. I guess I was pushing the NettyTransceiver a bit beyond its intended scope. I'll take a look at the Comet pattern and see what I can do with it. Again, thanks Shaun and James. -Armin On Fri, Jan 20, 2012 at 10:15 AM, James Baldassari jbaldass...@gmail.com wrote: Hi Armin, First I'd like to explain why the server-initiated messages are problematic. Allowing the server to send unsolicited messages back to the client may work for some Transceiver implementations (possibly PHP), but this change would not be compatible with NettyTransceiver. When the NettyTransceiver receives a message from the server, it needs to know which callback to invoke in order to pass the message back correctly to the client. 
There could be several RPCs in flight concurrently, so one of NettyTransceiver's jobs is to match up the response with the request that initiated it. If the client didn't initiate the RPC then NettyTransceiver won't know where to deliver the message, unless there were some catch-all callback that would be invoked whenever one of these unsolicited messages were received. So although you're probably only interested in the PHP client, allowing the server to send these unsolicited messages would potentially break NettyTransceiver (and possibly other implementations as well). Shaun's idea of having the client poll the server periodically would definitely work. What we want to do is have the client receive notifications from the server as they become available on the server side, but we also don't want the client to be polling with such a high frequency that a lot of CPU and bandwidth resources are wasted. I think we can get the best of both worlds by copying the Comet pattern, i.e. the long poll but using the Avro RPC layer instead of (or on top of) HTTP. First we'll start with Shaun's update listener interface: protocol WeatherUpdateListener { WeatherUpdate listenForUpdate(); } The PHP client would invoke this RPC against the server in a tight loop. On the server side, the RPC will block until there is an update that is ready to be sent to the client. When the client does receive an event from the server (or some timeout occurs), the client will immediately send another poll to the server and block until the next update is received. In this way the client will not be flooding the server with RPCs, but the client will also get updates in a timely manner. See the following for more info about Comet: http://www.javaworld.com/javaworld/jw-03-2008/jw-03-asynchhttp.html?page=6 -James On Fri, Jan 20, 2012 at 12:44 PM, Armin Garcia armin.gar...@arrayent.com wrote: Hi Shaun, This is definitely another way. I share your same concern. 
I have to keep an eye out for high availability and high throughput. I'll be depending on this connection to support a massive amount of data. -Armin On Fri, Jan 20, 2012 at 9:25 AM, Shaun Williams shaun_willi...@apple.com wrote: Another solution is to use the response leg of a transaction to push messages to the client, e.g. provide a server protocol like this: WeatherUpdate listenForUpdate(); This would essentially block until an update is available. The only problem is that if the client is expecting a series of updates, it would need to call this method again after receiving each update. This is not an ideal solution, but it might solve your problem. -Shaun On Jan 20, 2012, at 8:24 AM, Armin Garcia wrote: Hi James, First, thank you for your response. Yes, you are right. I am trying to set up a bi-directional communication link. Your suggestion would definitely accomplish this requirement. I was hoping the same channel could be
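The long-poll pattern that James and Shaun describe in this thread can be sketched in plain Java, with a blocking queue standing in for the RPC layer. This is only an illustration of the control flow: `UpdateServer`, `publish`, and `drain` are hypothetical names, updates are plain strings instead of a WeatherUpdate record, and no Avro transceiver is involved.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Long-poll sketch: the call blocks until an update exists or a timeout expires. */
final class LongPollSketch {

    /** Hypothetical server side of listenForUpdate(). */
    static final class UpdateServer {
        private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

        void publish(String update) {
            pending.add(update);
        }

        /** Returns the next update, or null if none arrived within the timeout. */
        String listenForUpdate(long timeoutMillis) {
            try {
                return pending.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return null;
            }
        }
    }

    /**
     * Client loop: issue the next poll as soon as the previous one returns.
     * This sketch stops at the first timeout; a real client would poll again.
     */
    static int drain(UpdateServer server, int expected, long timeoutMillis) {
        int received = 0;
        while (received < expected) {
            String update = server.listenForUpdate(timeoutMillis);
            if (update == null) {
                break;
            }
            received++;
        }
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        UpdateServer server = new UpdateServer();
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                server.publish("update-" + i);
            }
        });
        producer.start();
        System.out.println("received " + drain(server, 3, 1000) + " updates");
        producer.join();
    }
}
```

The client never spins: each call parks inside `poll` until there is something to deliver, which is the property that keeps CPU and bandwidth use low compared with high-frequency polling.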
Re: AVRO Path
There are no plans that I know of currently, although the topic came up two times in separate conversations last night at the SF Hadoop MeetUp. I think an ability to extract a subset of a schema from a larger one and read/write/transform data accordingly makes a lot of sense. Currently, the Avro spec allows for schema resolution which is sort of a degenerate schema extraction/transformation at the record level without the ability to address or extract nested elements. An addition to the spec for describing other schema extractions may be useful. Further discussion should probably be in a JIRA ticket or at least on the dev list. -Scott On 1/10/12 1:02 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Are there plans for (or is there already) an AVRO Path implementation (like XPath, or JSON Path). Thanks!
Re: Can spill to disk be in compressed Avro format to reduce I/O?
On 1/12/12 11:24 AM, Frank Grimes frankgrime...@gmail.com wrote: Hi Scott, If I have a map-only job, would I want only one mapper running to pull all the records from the source input files and stream/append them to the target avro file? Would that be no different (or more efficient) than doing hadoop dfs -cat file1 file2 file3 and piping the output to append to a hadoop dfs -put combinedFile? In that case, my only question is how would I combine the avro files into a new file without deserializing them? It would be different. An Avro file has a header that contains the Schema and compression codec info along with other metadata, followed by data blocks. Each data block has a record count and size prefix and a 16-byte delimiter. You cannot simply concatenate them together because the schema or compression codec may differ, a header in the middle of the file is not allowed, and the delimiter may differ. http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html DataFileWriter can append a pre-existing file with the same schema, in particular look at the documentation for appendAllFrom() http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#appendAllFrom%28org.apache.avro.file.DataFileStream,%20boolean%29 Thanks, Frank Grimes On 2012-01-12, at 1:14 PM, Scott Carey wrote: On 1/12/12 8:27 AM, Frank Grimes frankgrime...@gmail.com wrote: Hi All, We have Avro data files in HDFS which are compressed using the Deflate codec. We have written an M/R job using the Avro Mapred API to combine those files. It seems to be working fine, however when we run it we notice that the temporary work area (spills, etc) seem to be uncompressed. We're thinking we might see a speedup due to reduced I/O if the temporary files are compressed as well. If all you want to do is combine the files, there is no reason to deserialize and reserialize the contents, and a map-only job could suffice. 
If this is the case, you might want to consider one of two options: 1. Use a map-only job, with a combined file input. This will produce one file per mapper and no intermediate data. 2. Use the Avro data file API to append to a file. I am not sure if this will work with HDFS without some modifications to Avro, but it should be possible since the data file APIs can take InputStream/OutputStream. The data file API has the ability to append data blocks from the file if the schemas are an exact match. This can be done without deserialization, and optionally can change the compression level or leave it alone. Is there a way to enable mapred.compress.map.output in such a way that those temporary files are compressed as Avro/Deflate? I tried simply setting conf.setBoolean("mapred.compress.map.output", true); but it didn't seem to have any effect. I am not sure, as I haven't tried it myself. However, the Avro M/R should be able to leverage all of the Hadoop compressed intermediate forms. LZO/Snappy are fast and in our cluster Snappy is the default. Deflate can be a lot slower but much more compact. Note that in order to avoid unnecessary sorting overhead, I made each key a constant (1L) so that the logs are combined but ordering isn't necessarily preserved. (we don't care about ordering) In that case, I think you can use a map-only job. There may be some work to get a single mapper to read many files however. FYI, here are my mapper and reducer. 
public static class AvroReachMapper extends AvroMapper<DeliveryLogEvent, Pair<Long, DeliveryLogEvent>> {
    public void map(DeliveryLogEvent levent, AvroCollector<Pair<Long, DeliveryLogEvent>> collector, Reporter reporter) throws IOException {
        // Constant key: all events end up in one group, so ordering is not preserved
        collector.collect(new Pair<Long, DeliveryLogEvent>(1L, levent));
    }
}

public static class Reduce extends AvroReducer<Long, DeliveryLogEvent, DeliveryLogEvent> {
    @Override
    public void reduce(Long key, Iterable<DeliveryLogEvent> values, AvroCollector<DeliveryLogEvent> collector, Reporter reporter) throws IOException {
        for (DeliveryLogEvent event : values) {
            collector.collect(event);
        }
    }
}

Also, I'm setting the following:

AvroJob.setInputSchema(conf, DeliveryLogEvent.SCHEMA$);
AvroJob.setMapperClass(conf, Mapper.class);
AvroJob.setMapOutputSchema(conf, SCHEMA);
AvroJob.setOutputSchema(conf, DeliveryLogEvent.SCHEMA$);
AvroJob.setOutputCodec(conf, DataFileConstants.DEFLATE_CODEC);
AvroOutputFormat.setDeflateLevel(conf, 9);
AvroOutputFormat.setSyncInterval(conf, 1024 * 256);
AvroJob.setReducerClass(conf, Reducer.class);
JobClient.runJob(conf);

Thanks, Frank Grimes
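Scott's point in this thread about why data files cannot simply be concatenated can be made concrete with a toy model of the layout. The sketch below is a deliberate simplification and not the Avro implementation: the header here holds only the magic bytes and the sync marker (real files also store schema and codec metadata), and counts and sizes are written as fixed 8-byte longs instead of Avro's variable-length encoding. `BlockFramingSketch` and its methods are made-up names.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Simplified container layout: header, then blocks that each end in the file's sync marker. */
final class BlockFramingSketch {
    static final byte[] MAGIC = {'O', 'b', 'j', 1};
    static final int SYNC_SIZE = 16;

    /** Builds a 16-byte marker from a seed; real writers pick one per file. */
    static byte[] syncOf(int seed) {
        byte[] sync = new byte[SYNC_SIZE];
        Arrays.fill(sync, (byte) seed);
        return sync;
    }

    /** Writes header (magic + sync) followed by one framed block per payload. */
    static byte[] writeFile(byte[] sync, byte[]... payloads) {
        int size = MAGIC.length + SYNC_SIZE;
        for (byte[] p : payloads) {
            size += 8 + 8 + p.length + SYNC_SIZE;
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        out.put(MAGIC).put(sync);
        for (byte[] p : payloads) {
            out.putLong(1L);       // record count (always 1 in this sketch)
            out.putLong(p.length); // payload size in bytes
            out.put(p).put(sync);  // payload, then the block delimiter
        }
        return out.array();
    }

    /** Counts blocks without decoding them; throws if the framing does not hold. */
    static int countBlocks(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);
        byte[] magic = new byte[MAGIC.length];
        byte[] sync = new byte[SYNC_SIZE];
        in.get(magic).get(sync);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IllegalStateException("Not a data file");
        }
        int blocks = 0;
        while (in.hasRemaining()) {
            in.getLong(); // record count, unused here
            long size = in.getLong();
            if (size < 0 || size > in.remaining() - SYNC_SIZE) {
                throw new IllegalStateException("Invalid block size");
            }
            in.position(in.position() + (int) size); // skip the payload bytes
            byte[] marker = new byte[SYNC_SIZE];
            in.get(marker);
            if (!Arrays.equals(marker, sync)) {
                throw new IllegalStateException("Invalid sync!");
            }
            blocks++;
        }
        return blocks;
    }

    /** Byte-level concatenation, the equivalent of `cat a b`. */
    static byte[] concat(byte[] a, byte[] b) {
        byte[] joined = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }

    public static void main(String[] args) {
        byte[] a = writeFile(syncOf(1), new byte[] {10, 11}, new byte[] {12});
        byte[] b = writeFile(syncOf(2), new byte[] {13});
        System.out.println("blocks in a: " + countBlocks(a));
        try {
            countBlocks(concat(a, b));
        } catch (RuntimeException e) {
            System.out.println("cat a b is not readable: " + e.getMessage());
        }
    }
}
```

The reader takes the second file's header for a block and the framing breaks. A correct merge reads each source header separately and rewrites every block with the target file's own marker, which is the job the thread assigns to appendAllFrom().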
Re: Can spill to disk be in compressed Avro format to reduce I/O?
On 1/12/12 12:35 PM, Frank Grimes frankgrime...@gmail.com wrote: So I decided to try writing my own AvroStreamCombiner utility and it seems to choke when passing multiple input files: hadoop dfs -cat hdfs://hadoop/machine1.log.avro hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh > combined.log.avro Exception in thread "main" java.io.IOException: Invalid sync! at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293) at org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329) at DeliveryLogAvroStreamCombiner.main(Unknown Source) Here's the code in question: public class DeliveryLogAvroStreamCombiner { /** * @param args */ public static void main(String[] args) throws Exception { DataFileStream<DeliveryLogEvent> dfs = null; DataFileWriter<DeliveryLogEvent> dfw = null; try { dfs = new DataFileStream<DeliveryLogEvent>(System.in, new SpecificDatumReader<DeliveryLogEvent>()); OutputStream stdout = System.out; dfw = new DataFileWriter<DeliveryLogEvent>(new SpecificDatumWriter<DeliveryLogEvent>()); dfw.setCodec(CodecFactory.deflateCodec(9)); dfw.setSyncInterval(1024 * 256); dfw.create(DeliveryLogEvent.SCHEMA$, stdout); dfw.appendAllFrom(dfs, false); dfs is from System.in, which has multiple files one after the other. Each file will need its own DataFileStream (has its own header and metadata). In Java you could get the list of files, and for each file use HDFS's API to open the file stream, and append that to your one file. In bash you could loop over all the source files and append one at a time (the above fails on the second file). However, in order to append to the end of a pre-existing file the only API now takes a File, not a seekable stream, so Avro would need a patch to allow that in HDFS (also, only an HDFS version that supports appends would work). Other things of note: You will probably get better total file size compression by using a larger sync interval (1M to 4M) than deflate level 9. 
Deflate 9 is VERY slow and almost never compresses more than 1% better than deflate 6, which is much faster. I suggest experimenting with the 'recodec' option on some of your files to see what the best size / performance tradeoff is. I doubt that 256K (pre-compression) blocks with level 9 compression is the way to go. For reference: http://tukaani.org/lzma/benchmarks.html (gzip uses deflate compression) -Scott } finally { if (dfs != null) try {dfs.close();} catch (Exception e) {e.printStackTrace();} if (dfw != null) try {dfw.close();} catch (Exception e) {e.printStackTrace();} } } } Is there any way this could be made to work without needing to download the individual files to disk and calling append for each of them? Thanks, Frank Grimes On 2012-01-12, at 2:24 PM, Frank Grimes wrote: Hi Scott, If I have a map-only job, would I want only one mapper running to pull all the records from the source input files and stream/append them to the target avro file? Would that be no different (or more efficient) than doing hadoop dfs -cat file1 file2 file3 and piping the output to append to a hadoop dfs -put combinedFile? In that case, my only question is how would I combine the avro files into a new file without deserializing them? Thanks, Frank Grimes On 2012-01-12, at 1:14 PM, Scott Carey wrote: On 1/12/12 8:27 AM, Frank Grimes frankgrime...@gmail.com wrote: Hi All, We have Avro data files in HDFS which are compressed using the Deflate codec. We have written an M/R job using the Avro Mapred API to combine those files. It seems to be working fine, however when we run it we notice that the temporary work area (spills, etc) seem to be uncompressed. We're thinking we might see a speedup due to reduced I/O if the temporary files are compressed as well. If all you want to do is combine the files, there is no reason to deserialize and reserialize the contents, and a map-only job could suffice. If this is the case, you might want to consider one of two optoins: 1. 
Use a map only job, with a combined file input. This will produce one file per mapper and no intermediate data. 2. Use the Avro data file API to append to a file. I am not sure if this will work with HDFS without some modifications to Avro, but it should be possible since the data file APIs can take InputStream/OutputStream. The data file API has the ability to append data blocks from the file if the schemas are an exact match. This can be done without deserialization, and optionally can change the compression level or leave it alone. Is there a way to enable mapred.compress.map.output in such a way that those temporary files are compressed as Avro/Deflate? I tried simply setting conf.setBoolean(mapred.compress.map.output, true); but it didn't seem to have any effect. I am not sure, as I haven't tried it myself. However, the Avro M/R should be able to leverage all
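Scott's advice in this thread about deflate levels and block sizes is easy to try outside of Avro. The sketch below uses only java.util.zip and compresses the same bytes in independent blocks, roughly the way a block-structured file does. The sample data is made up for this sketch and the block sizes are arbitrary, so the numbers it prints say nothing about real log files; measuring on your own data (for example with the recodec tool mentioned above) is what matters.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Compares deflate output size across levels and block sizes using only java.util.zip. */
final class DeflateBlockSketch {

    /** Compresses one slice as a single deflate stream and returns the compressed size. */
    static int deflatedSize(byte[] data, int offset, int length, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data, offset, length);
        deflater.finish();
        byte[] scratch = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(scratch); // output is discarded, only counted
        }
        deflater.end();
        return total;
    }

    /** Compresses the data in independent blocks of at most blockSize bytes. */
    static int deflatedSizeInBlocks(byte[] data, int blockSize, int level) {
        int total = 0;
        for (int off = 0; off < data.length; off += blockSize) {
            total += deflatedSize(data, off, Math.min(blockSize, data.length - off), level);
        }
        return total;
    }

    /** Deterministic, log-like sample data (invented for this sketch). */
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2012-01-12 host").append(i % 7)
              .append(" delivered id=").append(i * 31).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(20000);
        System.out.println("raw bytes:            " + data.length);
        System.out.println("level 6, 4K blocks:   " + deflatedSizeInBlocks(data, 4 * 1024, 6));
        System.out.println("level 6, 256K blocks: " + deflatedSizeInBlocks(data, 256 * 1024, 6));
        System.out.println("level 9, 256K blocks: " + deflatedSizeInBlocks(data, 256 * 1024, 9));
    }
}
```

Each block starts with an empty dictionary, so very small blocks lose most of the redundancy between neighbouring records. Timing the level 6 and level 9 runs on real data is the quickest way to see whether the extra CPU of level 9 buys anything.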
Re: Can spill to disk be in compressed Avro format to reduce I/O?
The Recodec tool may be useful, and the source code is a good reference.

java -jar avro-tools-VERSION.jar

http://svn.apache.org/viewvc/avro/tags/release-1.6.1/lang/java/tools/src/main/java/org/apache/avro/tool/RecodecTool.java?view=co
https://issues.apache.org/jira/browse/AVRO-684

On 1/12/12 12:53 PM, Scott Carey scottca...@apache.org wrote:

On 1/12/12 12:35 PM, Frank Grimes frankgrime...@gmail.com wrote:

So I decided to try writing my own AvroStreamCombiner utility and it seems to choke when passing multiple input files:

    hadoop dfs -cat hdfs://hadoop/machine1.log.avro hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh > combined.log.avro

    Exception in thread "main" java.io.IOException: Invalid sync!
        at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
        at org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329)
        at DeliveryLogAvroStreamCombiner.main(Unknown Source)

Here's the code in question:

    public class DeliveryLogAvroStreamCombiner {
        /**
         * @param args
         */
        public static void main(String[] args) throws Exception {
            DataFileStream<DeliveryLogEvent> dfs = null;
            DataFileWriter<DeliveryLogEvent> dfw = null;
            try {
                dfs = new DataFileStream<DeliveryLogEvent>(System.in, new SpecificDatumReader<DeliveryLogEvent>());
                OutputStream stdout = System.out;
                dfw = new DataFileWriter<DeliveryLogEvent>(new SpecificDatumWriter<DeliveryLogEvent>());
                dfw.setCodec(CodecFactory.deflateCodec(9));
                dfw.setSyncInterval(1024 * 256);
                dfw.create(DeliveryLogEvent.SCHEMA$, stdout);
                dfw.appendAllFrom(dfs, false);

dfs is from System.in, which has multiple files one after the other. Each file will need its own DataFileStream (each has its own header and metadata). In Java you could get the list of files, and for each file use HDFS's API to open the file stream, and append that to your one file. In bash you could loop over all the source files and append one at a time (the above fails on the second file).

However, in order to append to the end of a pre-existing file, the only API now takes a File, not a seekable stream, so Avro would need a patch to allow that in HDFS (also, only an HDFS version that supports appends would work).

Other things of note: You will probably get better total file size compression by using a larger sync interval (1M to 4M) than deflate level 9. Deflate 9 is VERY slow and almost never compresses more than 1% better than deflate 6, which is much faster. I suggest experimenting with the 'recodec' option on some of your files to see what the best size / performance tradeoff is. I doubt that 256K (pre-compression) blocks with level 9 compression is the way to go.

For reference: http://tukaani.org/lzma/benchmarks.html (gzip uses deflate compression)

-Scott

            } finally {
                if (dfs != null) try {dfs.close();} catch (Exception e) {e.printStackTrace();}
                if (dfw != null) try {dfw.close();} catch (Exception e) {e.printStackTrace();}
            }
        }
    }

Is there any way this could be made to work without needing to download the individual files to disk and calling append for each of them?

Thanks, Frank Grimes

On 2012-01-12, at 2:24 PM, Frank Grimes wrote:

Hi Scott, If I have a map-only job, would I want only one mapper running to pull all the records from the source input files and stream/append them to the target avro file? Would that be no different (or more efficient) than doing hadoop dfs -cat file1 file2 file3 and piping the output to append to a hadoop dfs -put combinedFile? In that case, my only question is how would I combine the avro files into a new file without deserializing them?

Thanks, Frank Grimes

On 2012-01-12, at 1:14 PM, Scott Carey wrote:

On 1/12/12 8:27 AM, Frank Grimes frankgrime...@gmail.com wrote:

Hi All, We have Avro data files in HDFS which are compressed using the Deflate codec. We have written an M/R job using the Avro Mapred API to combine those files. It seems to be working fine, however when we run it we notice that the temporary work area (spills, etc) seems to be uncompressed. We're thinking we might see a speedup due to reduced I/O if the temporary files are compressed as well.

If all you want to do is combine the files, there is no reason to deserialize and reserialize the contents, and a map-only job could suffice. If this is the case, you might want to consider one of two options:

1. Use a map only job, with a combined file input. This will produce one file per mapper and no intermediate data.
2. Use the Avro data file API to append to a file. I am not sure if this will work with HDFS without some modifications to Avro, but it should be possible since the data file APIs can take
Re: Can spill to disk be in compressed Avro format to reduce I/O?
On 1/12/12 5:52 PM, Frank Grimes frankgrime...@gmail.com wrote:

Hi Scott, I've looked into this some more and I now see what you mean about appending to HDFS directly not being possible with the current DataFileWriter API. That's unfortunate because we really would like to avoid needing to hit disk just to write temporary files (and the associated cleanup). I kinda like the notion of not requiring HDFS APIs to achieve this merging of Avro files/streams. Assuming we wanted to be able to stream multiple files as in my example... could DataFileStream easily be changed to support that use case? i.e. allow it to skip/ignore subsequent header and metadata in the stream or not error out with "Invalid sync!"?

That may be possible; open a JIRA to discuss further. It should be modified to 'reset' to the start of a new file or stream and continue from there, since it needs to read the header, find the new sync value, and validate that the schemas match and the codec is compatible. It may be possible to detect the end of one file and the start of another if the files are streamed back to back, but perhaps not reliably. The avro-tools could be extended to have a command line tool that takes a list of files (HDFS or local) and writes a new file (HDFS or local) concatenated and possibly recodec'd.

Thanks, Frank Grimes
Re: encoding problem for ruby client
This sounds like the Ruby implementation does not correctly use UTF-8 on your platform for encoding strings. It may be a bug, but I am not knowledgeable enough on the Ruby implementation to know for sure. The Avro specification states that a string is encoded as a long followed by that many bytes of UTF-8 encoded character data (http://avro.apache.org/docs/current/spec.html#binary_encode_primitive). If you think that the Ruby implementation does not adhere to the spec, please file a bug in JIRA. Thanks! -Scott

On 1/4/12 3:59 AM, kafka0102 kafka0102 yujianjia0...@gmail.com wrote:

Hi. I use Avro's Java and Ruby clients. When they communicate, the Ruby client always encodes (and decodes) multi-byte UTF-8 characters as Latin-1. For now, when the data contains multi-byte characters, I first convert it with Iconv.conv("UTF8", "LATIN1", data) in the Ruby client, and then decode it with Utils.conv(data, "ISO-8859-1", "UTF-8") in the Java server. It works, but it is too ugly. I see the Avro Ruby client uses StringIO to pack the data, but I cannot find a way to make it support multi-byte characters. Can anyone help me?
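To make the spec sentence quoted above concrete, here is a minimal sketch of that string encoding using only the JDK. It assumes the length prefix is written the way the Avro specification describes a long (zig-zag, then variable-length, low-order bits first); the class and method names are illustrative and are not part of any Avro library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class AvroStringEncodingSketch {

    /** Writes n as a zig-zag, variable-length long: 7 bits per byte, low-order bits first. */
    static void writeLong(long n, ByteArrayOutputStream out) {
        long z = (n << 1) ^ (n >> 63);            // zig-zag: small magnitudes get small codes
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write((int) z);
    }

    /** Encodes a string as its UTF-8 byte count (as a long) followed by the UTF-8 bytes. */
    static byte[] encodeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(utf8.length, out);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8 but one byte in Latin-1.
        for (byte b : encodeString("\u00e9")) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```

A writer that used Latin-1 instead would emit a different length prefix and a single 0xE9 byte for the same character, which a UTF-8 reader cannot decode correctly; that mismatch is consistent with the symptom described in the question.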
Re: Collecting union-ed Records in AvroReducer
On 12/8/11 4:10 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote:

Hallo, is it possible to write/collect a union-ed record from an avro reducer? I have a reduce class (extending AvroReducer), and the output schema is a union schema of record type A and record type B. In the reduce logic I want to combine instances of A and B in the same datum, passing it to my AvroCollector. My code looks a bit like this:

If both records were created in the reducer, you can call collect twice, once with each record. Collect in general can be called as many times as you wish. If you want to combine two records into a single datum rather than emit multiple datums, you do not want a union, you need a Record. A union is a single datum that may be only one of its branches at a time. In short, do you want to emit both records individually or as a pair? If it is a pair, you need a Record; if it is multiple outputs or either/or, it is a Union.

    Record unionRecord = new GenericData.Record(myUnionSchema); // not legal!
    unionRecord.put("type A", recordA);
    unionRecord.put("type B", recordB);
    collector.collect(unionRecord);

but GenericData.Record constructor expects a Record Schema. How can I write both records such that they appear in the same output datum?

If your output is either one type or another, see Doug's answer. For multiple datums, if the output schema is a union of two records (a datum is either one or the other):

    ["RecordA", "RecordB"]

then the code is:

    collector.collect(recordA);
    collector.collect(recordB);

If you want a single datum that contains both a RecordA and a RecordB, you need to have your output schema be a Record with two fields:

    {"type":"record", "fields":[
      {"name":"recordA", "type":"RecordA"},
      {"name":"recordB", "type":"RecordB"}
    ]}

And you would use this record schema to create the GenericRecord, then populate the fields with the inner records, then call collect once with the outer record.

Another choice is for the output to be an Avro array of the union type, which may hold any number of RecordA and RecordB instances in a single datum.

Andrew
Re: Map having string, Object
The best practice is usually to use the flexible schema with the union value rather than transmit schemas each time. This restricts the possibilities to the set defined, and the type selected in the branch is available on the decoding side. In the case above the number of variants is not too large for this approach to be unwieldy, and there may be benefits of knowing the type on the other side without inspecting the value. You can construct an Avro schema that represents all possible data variants, effectively tagging the types of every field during serialization using unions. However none of the Avro APIs are (yet) optimized for this use case, it would be somewhat clumsy to work with, and is less space efficient. Other serialization systems are a better fit for completely open-ended data schemas. One can look at Avro as a serialization system, but I see it more as a system for describing your data. It provides tools for describing and transforming data that exists in related forms (e.g. older or newer schema versions) to the form you want to see (e.g. current schema). Whether this data is serialized or an object graph is less important than that it conforms to a schema. A transformation between a serialized form and an object graph is one use case of many possibilities. Think about your use case from that perspective. Ask whether this is data that gains benefit from describing it with an Avro Schema and then interpreting it as conforming to that schema. If it is completely open ended there may be little benefit and significant overhead. You can also embed JSON or binary JSON in Avro data fairly easily using Jackson JSON. On 12/7/11 9:10 AM, Gaurav Nanda gaurav...@gmail.com wrote: I agree that in this case Json would be equally helpful. But In my application there is one more type of message, where untagged data can provide compact data encoding. So to maintain consistency, I preferred to send these kind of messages also using avro. 
@where untagged data can provide compact data encoding. In that case also, my schema has to be dynamically generated (i.e. at runtime), so it has to be passed to the client. So would Avro be better than compressed JSON in that case? Thanks, Gaurav Nanda

On Wed, Dec 7, 2011 at 9:17 PM, Tatu Saloranta tsalora...@gmail.com wrote:

On Wed, Dec 7, 2011 at 5:16 AM, Gaurav gaurav...@gmail.com wrote:

Hi, We have a requirement to send typed (key-value) pairs from server to clients (in various languages). The value can be one of the primitive types or a map of the same (string, Object) type. One option is to construct a record schema on the fly, and the second option is to use unions to write the schema in a general way. The problem with option 1 is that we have to construct the schema every time depending upon the keys and then attach the entire string schema to a relatively small record. But with the second option, you don't need to write the schema on the wire, as it is present with the client also. I have written one such sample schema:

    {"type":"map","values":["int","long","float","double","string","boolean",
      {"type":"map","values":["int","long","float","double","string","boolean"]}]}

Do you guys think writing something of this sort makes sense or is there any better approach to this?

For this kind of loose data, perhaps JSON would serve you better, unless you absolutely have to use Avro? -+ Tatu +-
Re: Reduce-side joins in Avro M/R
This should be conceptually the same as a normal map-reduce join of the same type. Avro handles the serialization, but not the map-reduce algorithm or strategy. On 12/6/11 8:43 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote: Hi, I'd like to use reduce-side joins in an avro M/R job, and am not sure how to do it: are there any best-practice tips or outlines of what one would have to implement in order to make this possible? Thanks, Andrew Kenworthy
Re: Importing in avdl from classpath of project
I think that at minimum, it would be useful to have an option to 'also look in the classpath' in the maven plugin, and have the option to do so in general with the IDL compiler. I would gladly review the patch in a JIRA. -Scott

On 12/7/11 10:13 AM, Chau, Victor vic...@x.com wrote:

Hello, I am trying to address a shortcoming of the way that the import feature works in IDL. Currently, it looks like the only option is to place the file being imported inside the same directory as that of the importing avdl. In our setup, we have avdl's that are spread among several maven projects that are owned by different teams. I would like to be able to just create a dependency on another jar that contains the avdl I want to import and have Avro be smart enough to look for it in the classpath of the project containing the avdl. The main problem is to make all of this work with the avro-maven-plugin. The plugin's runtime classpath is not the same as that of the maven project's classpath. Through the magic of Stackoverflow, I figured out how to get the project's classpath and construct a new classloader and pass it to the Idl compiler for it to look up the file if it is not available in the local directory. Is this a feature that people think would be useful? Essentially, the IDL syntax would not change but the behavior is:

1. If the imported file is available locally (in the current input path), use it.
2. Else look for it on the project's classpath.

If so, I have a working patch that needs some cleanup but I can submit it as a feature request in JIRA.
Re: Best practice for versioning IDLs?
I don't think there are yet best practices for what you are trying to do. However, I suggest you first consider embedding the version as metadata in the schema, rather than data. If you put it in a Record, it will be data serialized with every record. If you put it as schema metadata, it will only exist in the schemas and not the data. In raw JSON schema form, the metadata can be added to any named type: Record, Fixed, Enum, Protocol. The doc field is a special named metadata field; you can use it or add your own:

    {
      "namespace": "com.acme",
      "protocol": "HelloWorld",
      "doc": "Protocol Greetings",
      "acme.version": "1.22.3",
      "types": [
        {"name": "Greeting", "type": "record", "fields": [
          {"name": "message", "type": "string"}]},
        {"name": "Curse", "type": "error", "fields": [
          {"name": "message", "type": "string"}]}
      ],
      "messages": {
        "hello": {
          "doc": "Say hello.",
          "request": [{"name": "greeting", "type": "Greeting" }],
          "response": "Greeting",
          "errors": ["Curse"]
        }
      }
    }

http://avro.apache.org/docs/current/spec.html#Protocol+Declaration

For IDL, it should be possible to add a property using the @propname(propval) annotation on the protocol. http://avro.apache.org/docs/current/idl.html#defining_protocol I have not tried this myself however. If I had the setup to test it now, I would try to see if the below Avro IDL creates an empty protocol with the acme.version property set:

    @acme.version("1.22.3")
    @namespace("com.acme")
    protocol HelloWorld { }

On 11/29/11 9:20 AM, George Fletcher gffle...@aol.com wrote:

Hi, I'd like to incorporate a semver.org style versioning structure for the IDL files we are using. The IDLs represent interfaces of services (a la SOA). We currently manage our IDL files separately from the implementation as multiple services might use the same IDL. This makes it critical to have the IDLs understand their version. I'd like to see our build process be able to inject into the IDL the version from the build environment (currently maven). Another option would be to define the version within the IDL. However, the only way I can think of to do this is to create a Version Record within each IDL and then maybe have the Record contain 3 string fields (major, minor, patch). Just wondering if there are any best practices already established for this kind of IDL versioning. Thanks, George
Re: Overriding default velocity templates
To the best of my recollection, the IDL custom template bits you mention below have not been wired up through all of the tooling. Please feel free to submit JIRA tickets and patches to improve it. Thanks! -Scott

On 11/28/11 7:01 AM, George Fletcher gffle...@aol.com wrote:

Hi, I'm looking for a way to override the default velocity templates used to generate java sources from IDL files. I know that I can do this by passing a command line argument to override 'org.apache.avro.specific.templates' but that doesn't work well with our build process. We want a standard set of templates used by many developers. What is the best way to override the system property? It appears from the avro code that while SpecificCompiler.java supports a setTemplateDir() method, nothing in the avro-maven-plugin calls this method. Thanks, George
Re: Avro-mapred and new Java MapReduce API (org.apache.hadoop.mapreduce)
I have heard some suggestions that it would be useful if we could somehow model Avro's interaction with mapreduce using composition rather than inheritance. Has anyone tried that? Or would it be too clumsy? A good relationship with the mapreduce/mapred api via composition might require changes on the hadoop side however. On 11/13/11 5:04 AM, Friso van Vollenhoven fvanvollenho...@xebia.com wrote: Hi, I use my own set of classes for this. I mostly copied from / modeled after the Avro mapred support for the old API. My approach is slightly different, though. The existing MR support fully abstracts / wraps away the Hadoop MR API and only exposes the Avro one. The only Hadoop API that the Avro classes see is the Configuration object. Problem is that in the new API, the Configuration object is kept within a context instance and you'd need to wrap the whole context thing and give the wrapper to the Avro mapper and reducer. This felt a bit overkill so I chose to just make mapper and reducer subclasses that handle the Avro work and then call a protected method to do the actual mapping or reducing. Problem is that you lose the property of a bare mapper or reducer being the identity function, but you could reintroduce this in a generic way, I think. I just don't use the identity functions a lot in practice, so I didn't bother. I pushed the code here: https://github.com/friso/avro-mapreduce. There is a unit test with some usage examples. Cheers, Friso On 11 nov. 2011, at 20:43, Doug Cutting wrote: On 11/10/2011 12:38 AM, Andrew Kenworthy wrote: Are there plans to extend it to work with org.apache.hadoop.mapreduce as well? There's an issue in Jira for this: https://issues.apache.org/jira/browse/AVRO-593 I don't know of anyone actively working on this at present. It would be a great addition to Avro and I am hopeful someone will resume work on it soon. Doug
Re: Does extending union break compatibility
On 11/3/11 4:56 PM, Neil Davudo neil_dav...@yahoo.com wrote:

I have a record defined as follows

    // version 1
    record SomeRecord {
      union { null, TypeA } unionOfTypes;
    }

I change the record to the following

    // version 2
    record SomeRecord {
      union { null, TypeA, TypeB } unionOfTypes;
    }

Does the change break compatibility? Would data encoded using version 1 of the record definition be decodable using version 2 of the record definition?

Readers with the second schema should be able to read data written with the first schema, provided they use the API properly (both schemas must be provided to the reader, so that it can translate from one to the other). The reverse, reading data written in the latter schema with the first schema, is possible as well provided that the first schema contains a default value so that if the reader encounters a union branch it does not know about, it can substitute the default value.

TIA Neil
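The two directions discussed above can be pictured with a deliberately simplified model. The sketch below assumes union branches can be identified by name alone and only computes which writer branches a reader would recognise; it is not Avro's resolver and uses no Avro classes, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class UnionResolutionSketch {

    /**
     * For each branch of the writer's union, returns the index of the branch with the
     * same name in the reader's union, or -1 if the reader has no such branch.
     */
    static int[] mapBranches(List<String> writerUnion, List<String> readerUnion) {
        int[] mapping = new int[writerUnion.size()];
        for (int i = 0; i < mapping.length; i++) {
            mapping[i] = readerUnion.indexOf(writerUnion.get(i));
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<String> v1 = Arrays.asList("null", "TypeA");
        List<String> v2 = Arrays.asList("null", "TypeA", "TypeB");
        // Data written with version 1, read with version 2: every branch is found.
        System.out.println(Arrays.toString(mapBranches(v1, v2)));
        // Data written with version 2, read with version 1: TypeB has no counterpart.
        System.out.println(Arrays.toString(mapBranches(v2, v1)));
    }
}
```

The -1 entry in the second mapping marks the case the reply describes as needing special handling: a version 1 reader meeting a TypeB value has no branch to put it in. The model also shows why both schemas must be supplied to the reader, since the mapping cannot be computed from one schema alone.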
Re: How to add optional new record fields and/or new methods in avro-ipc?
On 10/18/11 9:47 AM, Doug Cutting cutt...@apache.org wrote:

On 10/17/2011 08:14 PM, 常冰琳 wrote:

What I do in the demo is add a new nullable string in server side, not change a string to nullable string. I add a new field with default value using specific, and it works fine, so I suspect the reason that reflect doesn't work is that I didn't add default value to the nullable string field. Perhaps the default value for nullable field should be null by default?

Reflect by default assumes that all values are not nullable. This is perhaps a bug, but the alternative is to make every non-numeric value nullable, which would result in verbose schemas. To amend this, you can use Avro's @Nullable annotation:

http://avro.apache.org/docs/current/api/java/org/apache/avro/reflect/Nullable.html

This can be applied to parameters, return types and fields. For example:

    import org.apache.avro.reflect.Nullable;

    public class Foo {
      @Nullable String x;
      public void setX(@Nullable String x) { this.x = x; }
      @Nullable public String getX() { return x; }
    }

The problem is that this does not provide the ability to evolve schemas if you add a field, since you would need @Default or something similar as well: @Nullable @Default(null). Does reflect have any concept of default values?

Doug
Re: How to add optional new record fields and/or new methods in avro-ipc?
On 10/18/11 10:38 AM, Doug Cutting cutt...@apache.org wrote:

On 10/18/2011 10:09 AM, Scott Carey wrote:

On 10/18/11 9:47 AM, Doug Cutting cutt...@apache.org wrote:

To amend this, you can use Avro's @Nullable annotation:

The problem is that this does not provide the ability to evolve schemas if you add a field, since you would need @Default or something similar as well: @Nullable @Default(null)

I don't think this is required. The default value for a union is the default value for its first branch. A null schema needs no default. So the schema ["null", "string"] needs to specify no default value while the schema ["string", "null"] does. Thus the best practice for nullable values is to place the null first in the union. This is what is done by the @Nullable annotation. Perhaps we should clarify this in the Specification? We might state that a null schema implicitly has a default value of null since that's the only value it's ever permitted to have anyway.

Good to know. So, any ideas what is causing the original User's problem? @Nullable is in use with Reflect (does not work), Specific works (with default values but not without; it appears to have null first but not confirmed). I suspect there is something else going on.

Does reflect have any concept of default values?

No. We could add an @Default annotation, I suppose. But I don't think this is needed for nullable stuff.

Doug
Re: Avro mapred: How to avoid schema specification in job.xml?
I'm not all that familiar with how Oozie interacts with Avro. The Job must set its avro.input.schema and avro.output.schema properties; this can be done in code (see the unit tests in the Avro mapred project for examples), and if you are using SpecificRecords and DataFiles the schema is available to the code where necessary.

On 10/10/11 5:41 AM, Julien Muller julien.mul...@ezako.com wrote:

Hello, I have been using avro with hadoop and oozie for months now and I am very happy with the results. The only point I see as a limitation now is that we specify avro schemas in workflow.xml (job.xml):

- avro.input.schema
- avro.output.schema

Since this info is already provided in Mapper/Reducer signatures, I see this as redundant. The schema is also present in all my serialized files, which means that the schema is specified in 3 different places. From a run point of view, this is a pain, since any schema modification (let's say a simple optional field added) forces me to update many job files. This task is very error prone and since we have a large number of jobs, it generates a lot of work. The only solution I see now would be to find/replace in the build script, but I hope I could find a better solution by providing some generic schemas to the job file, or find a way to deactivate schema validation in the job. Any help will be appreciated! -- Julien Muller
Re: Avro mapred: How to avoid schema specification in job.xml?
On 10/10/11 11:41 AM, Julien Muller julien.mul...@ezako.com wrote:

Hello, Thanks for your answer, let me try to clarify my context a bit:

I'm not all that familiar with how Oozie interacts with Avro.

Let's get oozie out of the picture. I use job.xml files to configure Jobs. This means I do not have any JobConf object and I cannot use AvroJob. Therefore I directly write the job properties (as what AvroJob outputs).

The Job must set its avro.input.schema and avro.output.schema properties; this can be done in code (see the unit tests in the Avro mapred project for examples),

The solution I have now is basically based on the Avro mapred unit tests. But in my context, it is not an option to code (using the SCHEMA$ property) at the job configuration level. Where you code:

    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));

I have to copy the entire schema into the job.xml file. And I have to update it every time my schema gets updated. I hope I can find a better solution.

I suppose that in AvroJob we could transmit only the class name in a property, and use that to look up the schema for generated classes using reflection. Could you do something similar? I don't think it is possible to avoid configuring at least some sort of pointer to where the schema is. This could be via a property, or if you already have the job class, an annotation on that class.

and if you are using SpecificRecords and DataFiles the schema is available to the code where necessary.

I am not sure what you mean here. I am using SpecificRecords and would like to avoid specifying avro.input.schema, since this info is already here in the specific record.

Potentially the AvroMapper / AvroReducer could have a fall-back for obtaining the schema if the property is not set: reflection on a class name or an annotation. If this looks like it is an enhancement request for Avro (or a bug) please file a JIRA ticket. Thanks!
Thanks, Julien Muller
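The idea floated in this thread, passing only a class name in the job configuration and recovering the schema from the generated class by reflection, might look roughly like the sketch below. It assumes the generated class exposes a public static SCHEMA$ field, as the specific classes quoted elsewhere in this archive do; to stay free of dependencies, the stand-in class here holds a String rather than an org.apache.avro.Schema, and ExampleRecord and schemaFor are hypothetical names.

```java
import java.lang.reflect.Field;

final class SchemaLookupSketch {

    /** Stand-in for a generated specific class; a real one would hold an Avro Schema object. */
    static final class ExampleRecord {
        public static final String SCHEMA$ =
            "{\"type\":\"record\",\"name\":\"ExampleRecord\",\"fields\":[]}";
    }

    /** Loads the named class and reads its public static SCHEMA$ field via reflection. */
    static Object schemaFor(String className) {
        try {
            Class<?> type = Class.forName(className);
            Field field = type.getField("SCHEMA$");
            return field.get(null); // static field, so no instance is needed
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No SCHEMA$ field found on " + className, e);
        }
    }

    public static void main(String[] args) {
        // In a job, this name would be read from a configuration property instead.
        String className = ExampleRecord.class.getName();
        System.out.println(schemaFor(className));
    }
}
```

With something like this in place, job.xml would carry a short class name rather than the full schema text, so a schema change would only require rebuilding the generated classes.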
Re: Data incompatibility between Avro 1.4.1 and 1.5.4
AVRO-793 was not a bug in the encoded data or its format. It was a bug in how schema resolution worked for certain projection corner cases during deserialization. Is your data readable with the same schema that wrote it? (For example, if it is an avro data file, you can use avro-tools.jar to print it out with its own schema.) If the error only occurs when you try to read with a different schema than the one the data was written with, it is most likely a bug in the schema resolution process. If so, file a bug. We will need to reproduce it, so the more information you can give us about the schemas the better. Best would be a reproducible test case but that may not be trivial. At minimum the stack trace you get with 1.5.4 could be enlightening. Thanks! -Scott

On 10/3/11 3:32 PM, W.P. McNeill bill...@gmail.com wrote:

I have a bunch of data that I serialized using the Avro 1.4.1 library. I wanted to use projection schemas with this data but I can't because of bug 793 (https://issues.apache.org/jira/browse/AVRO-793). So I changed my code to use Avro 1.5.4. When I try to deserialize the Avro 1.4.1 data with the new code built with Avro 1.5.4, I get the same runtime deserialization errors described in JIRA 793. Is this expected? Is there any way around it beyond reserializing all my data using Avro 1.5.4? (I think I'm asking whether JIRA 793 is just a problem with deserialization or a problem with the binary serialization format.)
Re: In Java, how can I create an equivalent of an Apache Avro container file without being forced to use a File as a medium?
In addition to Joe's comments: On the write side, DataFileWriter.create() can take a file or an output stream.

http://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/file/DataFileWriter.html

On the read side, DataFileStream can be used if the input does not have random access and can be represented with an InputStream. If the input has random access, implement SeekableInput and then construct a DataFileReader with it:

http://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/file/SeekableInput.html
http://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/file/DataFileReader.html

On 9/24/11 2:14 AM, Bernard Liang liang.bern...@gmail.com wrote:

Hello, This is somewhat of a more advanced question regarding the Java implementation of Avro. The details are located at the following link:

http://stackoverflow.com/questions/7537959/in-java-how-can-i-create-an-equivalent-of-an-apache-avro-container-file-without

If there is anyone that might be able to assist me with this, I would like to get in contact with you. Best regards, Bernard Liang
Re: Compression and splittable Avro files in Hadoop
Yes, Avro Data Files are always splittable. You may want to up the default block size in the files if this is for MapReduce. The block size can often have a bigger impact on the compression ratio than the compression level setting. If you are sensitive to the write performance, you might want lower deflate compression levels as well. The read performance is relatively constant for deflate as the compression level changes (except for uncompressed level 0), but the write performance varies quite a bit between compression level 1 and 9 -- typically a factor of 5 or 6.

On 9/30/11 6:42 PM, Eric Hauser ewhau...@gmail.com wrote:

A coworker and I were having a conversation today about choosing a compression algorithm for some data we are storing in Hadoop. We have been using avro-utils (https://github.com/tomslabs/avro-utils) for our Map/Reduce jobs and Haivvreo for integration with Hive. By default, the avro-utils OutputFormat uses deflate compression. Even though deflate/zlib/gzip files are not splittable, we decided that Avro data files are always splittable because individual blocks within the file are compressed instead of the entire file. Is this accurate? Thanks.
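The trade-off between block size and compression level described above can be explored without Avro at all. The following is a minimal sketch, assuming plain java.util.zip.Deflater as a stand-in for Avro's deflate codec and a made-up record layout; BlockSizeVsLevelSketch, compressedSize and sampleRecords are illustrative names. It deflates the same bytes in independent blocks, the way a container file compresses each block separately, and prints the total compressed size for several block sizes and levels.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class BlockSizeVsLevelSketch {

    /** Deflates data in independent blocks of blockSize bytes; returns the total compressed size. */
    static long compressedSize(byte[] data, int blockSize, int level) {
        long total = 0;
        byte[] buffer = new byte[8192];
        for (int offset = 0; offset < data.length; offset += blockSize) {
            int length = Math.min(blockSize, data.length - offset);
            Deflater deflater = new Deflater(level); // fresh state: blocks share no history
            deflater.setInput(data, offset, length);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            deflater.end();
        }
        return total;
    }

    /** Builds repetitive, log-like sample data; the record layout is invented for this sketch. */
    static byte[] sampleRecords(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("user=alice,action=click,id=").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleRecords(20000);
        int[] blockSizes = {4 * 1024, 256 * 1024, 1024 * 1024};
        int[] levels = {1, 6, 9};
        for (int blockSize : blockSizes) {
            for (int level : levels) {
                System.out.printf("block=%d level=%d bytes=%d%n",
                    blockSize, level, compressedSize(data, blockSize, level));
            }
        }
    }
}
```

The numbers depend heavily on the data, so this is best run against a sample of real records rather than the synthetic ones here. Wrapping the calls in System.nanoTime() measurements would also show the write-time cost of the higher levels.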
Re: Avro versioning and SpecificDatum's
That looks like a bug. What happens if there is no aliasing/renaming involved? Aliasing is a newer feature than field addition, removal, and promotion. This should be easy to reproduce, can you file a JIRA ticket? We should discuss this further there. Thanks! On 9/19/11 6:14 PM, Alex Holmes grep.a...@gmail.com wrote: OK, I was able to reproduce the exception. v1: {name: Record, type: record, fields: [ {name: name, type: string}, {name: id, type: int} ] } v2: {name: Record, type: record, fields: [ {name: name_rename, type: string, aliases: [name]} ] } Step 1. Write Avro file using v1 generated class Step 2. Read Avro file using v2 generated class Exception in thread main org.apache.avro.AvroRuntimeException: Bad index at Record.put(Unknown Source) at org.apache.avro.generic.GenericData.setField(GenericData.java:463) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.j ava:166) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:13 8) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:12 9) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220) at Read.readFromAvro(Unknown Source) at Read.main(Unknown Source) The code to write/read the avro file didn't change from below. On Mon, Sep 19, 2011 at 9:08 PM, Alex Holmes grep.a...@gmail.com wrote: I'm trying to put together a simple test case to reproduce the exception. 
While I was creating the test case, I hit this behavior which doesn't seem right, but maybe it's my misunderstanding on how forward/backward compatibility should work:

Schema v1:

    {"name": "Record", "type": "record", "fields": [
      {"name": "name", "type": "string"},
      {"name": "id", "type": "int"}
    ]}

Schema v2:

    {"name": "Record", "type": "record", "fields": [
      {"name": "name_rename", "type": "string", "aliases": ["name"]},
      {"name": "new_field", "type": "int", "default": 0}
    ]}

In the 2nd version I:

- removed field id
- renamed field name to name_rename
- added field new_field

I write the v1 data file:

    public static Record createRecord(String name, int id) {
      Record record = new Record();
      record.name = name;
      record.id = id;
      return record;
    }

    public static void writeToAvro(OutputStream outputStream) throws IOException {
      DataFileWriter<Record> writer =
          new DataFileWriter<Record>(new SpecificDatumWriter<Record>());
      writer.create(Record.SCHEMA$, outputStream);
      writer.append(createRecord("r1", 1));
      writer.append(createRecord("r2", 2));
      writer.close();
      outputStream.close();
    }

I wrote a version-agnostic Read class:

    public static void readFromAvro(InputStream is) throws IOException {
      DataFileStream<Record> reader = new DataFileStream<Record>(
          is, new SpecificDatumReader<Record>());
      for (Record a : reader) {
        System.out.println(ToStringBuilder.reflectionToString(a));
      }
      IOUtils.cleanup(null, is);
      IOUtils.cleanup(null, reader);
    }

Running the Read code against the v1 data file, and including the v1 code-generated classes in the classpath produced:

    Record@6a8c436b[name=r1,id=1]
    Record@6baa9f99[name=r2,id=2]

If I run the same code, but use just the v2 generated classes in the classpath I get:

    Record@39dd3812[name_rename=r1,new_field=1]
    Record@27b15692[name_rename=r2,new_field=2]

The name_rename field seems to be good, but why would new_field inherit the values of the deleted field id?
Cheers, Alex On Mon, Sep 19, 2011 at 12:35 PM, Doug Cutting cutt...@apache.org wrote: On 09/19/2011 05:12 AM, Alex Holmes wrote: I then modified my original schema by adding, deleting and renaming some fields, creating version 2 of the schema. After re-creating the Java classes I attempted to read the version 1 file using the DataFileStream (with a SpecificDatumReader), and this is throwing an exception. This should work. Can you provide more detail? What is the exception? A reproducible test case would be great to have. Thanks, Doug
Re: Avro versioning and SpecificDatum's
As Doug mentioned in the ticket, the problem is likely:

    new SpecificDatumReader<Record>()

This should be

    new SpecificDatumReader<Record>(Record.class)

Which sets the reader to resolve to the schema found in Record.class

On 9/20/11 3:44 AM, Alex Holmes grep.a...@gmail.com wrote:

Created the following ticket: https://issues.apache.org/jira/browse/AVRO-891 Thanks, Alex

On Tue, Sep 20, 2011 at 6:26 AM, Alex Holmes grep.a...@gmail.com wrote:

Thanks, I'll add a bug. As a FYI, even without the alias (retaining the original field name), just removing the id field yields the exception.

On Tue, Sep 20, 2011 at 2:22 AM, Scott Carey scottca...@apache.org wrote:

That looks like a bug. What happens if there is no aliasing/renaming involved? Aliasing is a newer feature than field addition, removal, and promotion. This should be easy to reproduce, can you file a JIRA ticket? We should discuss this further there. Thanks!

On 9/19/11 6:14 PM, Alex Holmes grep.a...@gmail.com wrote:

OK, I was able to reproduce the exception.

v1:

    {"name": "Record", "type": "record", "fields": [
      {"name": "name", "type": "string"},
      {"name": "id", "type": "int"}
    ]}

v2:

    {"name": "Record", "type": "record", "fields": [
      {"name": "name_rename", "type": "string", "aliases": ["name"]}
    ]}

Step 1. Write Avro file using v1 generated class
Step 2. Read Avro file using v2 generated class

    Exception in thread "main" org.apache.avro.AvroRuntimeException: Bad index
        at Record.put(Unknown Source)
        at org.apache.avro.generic.GenericData.setField(GenericData.java:463)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
        at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
        at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
        at Read.readFromAvro(Unknown Source)
        at Read.main(Unknown Source)

The code to write/read the avro file didn't change from below.
On Mon, Sep 19, 2011 at 9:08 PM, Alex Holmes grep.a...@gmail.com wrote:

I'm trying to put together a simple test case to reproduce the exception. While I was creating the test case, I hit this behavior which doesn't seem right, but maybe it's my misunderstanding on how forward/backward compatibility should work:

Schema v1:

    {"name": "Record", "type": "record", "fields": [
      {"name": "name", "type": "string"},
      {"name": "id", "type": "int"}
    ]}

Schema v2:

    {"name": "Record", "type": "record", "fields": [
      {"name": "name_rename", "type": "string", "aliases": ["name"]},
      {"name": "new_field", "type": "int", "default": 0}
    ]}

In the 2nd version I:

- removed field id
- renamed field name to name_rename
- added field new_field

I write the v1 data file:

    public static Record createRecord(String name, int id) {
      Record record = new Record();
      record.name = name;
      record.id = id;
      return record;
    }

    public static void writeToAvro(OutputStream outputStream) throws IOException {
      DataFileWriter<Record> writer =
          new DataFileWriter<Record>(new SpecificDatumWriter<Record>());
      writer.create(Record.SCHEMA$, outputStream);
      writer.append(createRecord("r1", 1));
      writer.append(createRecord("r2", 2));
      writer.close();
      outputStream.close();
    }

I wrote a version-agnostic Read class:

    public static void readFromAvro(InputStream is) throws IOException {
      DataFileStream<Record> reader = new DataFileStream<Record>(
          is, new SpecificDatumReader<Record>());
      for (Record a : reader) {
        System.out.println(ToStringBuilder.reflectionToString(a));
      }
      IOUtils.cleanup(null, is);
      IOUtils.cleanup(null, reader);
    }

Running the Read code against the v1 data file, and including the v1 code-generated classes in the classpath produced:

    Record@6a8c436b[name=r1,id=1]
    Record@6baa9f99[name=r2,id=2]

If I run the same code, but use just the v2 generated classes in the classpath I get:

    Record@39dd3812[name_rename=r1,new_field=1]
    Record@27b15692[name_rename=r2,new_field=2]

The name_rename field seems to be good, but why would new_field inherit the values of the deleted field id?
Cheers, Alex On Mon, Sep 19, 2011 at 12:35 PM, Doug Cutting cutt...@apache.org wrote: On 09/19/2011 05:12 AM, Alex Holmes wrote: I then modified my original schema by adding, deleting and renaming some fields, creating version 2 of the schema. After re-creating the Java classes I attempted to read the version 1 file using the DataFileStream (with a SpecificDatumReader), and this is throwing an exception. This should work. Can you provide more detail? What is the exception? A reproducible test case would be great to have. Thanks, Doug
Re: Avro versioning and SpecificDatum's
I version with SpecificDatum objects using avro data files and it works fine. I have seen problems arise if a user is configuring or reconfiguring the schemas on the DatumReader passed into the construction of the DataFileReader. In the case of SpecificDatumReader, it is as simple as:

    DatumReader<T> reader = new SpecificDatumReader<T>(T.class);
    DataFileReader<T> fileReader = new DataFileReader<T>(file, reader);

On 9/19/11 5:12 AM, Alex Holmes grep.a...@gmail.com wrote:

Hi, I'm starting to play with how I can support versioning with Avro. I created an initial schema, code-generated some Java classes using "org.apache.avro.tool.Main compile protocol", and then used the DataFileWriter (with a SpecificDatumWriter) to serialize my objects to a file. I then modified my original schema by adding, deleting and renaming some fields, creating version 2 of the schema. After re-creating the Java classes I attempted to read the version 1 file using the DataFileStream (with a SpecificDatumReader), and this is throwing an exception. Is versioning supported in conjunction with the SpecificDatum* reader/writer classes, or do I have to work at the GenericDatum level for this to work? Many thanks, Alex
Re: How should I migrate 1.4 code to avro 1.5?
The javadoc for the deprecated method directs users to the replacement. BinaryEncoder and BinaryDecoder are well documented, with docs available via maven for IDE's to consume easily, or via the Apache Avro website:

http://avro.apache.org/docs/1.5.3/api/java/org/apache/avro/io/BinaryEncoder.html
http://avro.apache.org/docs/1.5.3/api/java/org/apache/avro/io/DecoderFactory.html

    defaultFactory() (http://avro.apache.org/docs/1.5.3/api/java/org/apache/avro/io/DecoderFactory.html#defaultFactory%28%29)
    Deprecated. use the equivalent get() (http://avro.apache.org/docs/1.5.3/api/java/org/apache/avro/io/DecoderFactory.html#get%28%29) instead

Generally, when using Avro you will have an easier time if you have the docs available in your IDE or at least available for reference in a browser. There are not a lot of blog posts and examples out there, but the javadoc is mostly decent and we try hard to make sure all public and protected methods and constructors have documentation. Many classes and packages have solid documentation as well. Please report any documentation bugs or suggestions for improvement. Thanks! -Scott

On 9/2/11 2:41 PM, W.P. McNeill bill...@gmail.com wrote:

I'm new to Avro. Since I'm having trouble finding simple examples online I'm writing one of my own that I'm putting on github. https://github.com/wpm/AvroExample Hopefully, this will be of help to people like me who are also having trouble finding simple code examples. I want to get this compiling without a hitch in Maven. I had it running with a 1.4 version of Avro, but when I changed that to 1.5, some of the code no longer works. Specifically, BinaryEncoder can no longer be instantiated directly because it is now an abstract class (AvroExample.java: line 33) and DecoderFactory.defaultFactory is deprecated (AvroExample.java: line 41). How should I modify this code so that it works with the latest and greatest version of Avro? I looked through the Release Notes, but the answers weren't obvious. Thanks.
Re: How should I migrate 1.4 code to avro 1.5?
Are you still having trouble with this? I noticed that the code has changed and you are using MyPair instead of Pair. Was there a naming conflict bug with Avro's Pair.java? -Scott

On 9/2/11 3:46 PM, W.P. McNeill bill...@gmail.com wrote:

I made changes that got rid of all the deprecated calls. I think I am using the 1.5 interface correctly. However, I get a runtime error when I try to deserialize into a class using a SpecificDatumReader. The problem starts at line 62 of AvroExample.java (https://github.com/wpm/AvroExample/blob/master/src/main/java/wpmcn/AvroExample.java#L62). The code looks like this:

    DatumReader<Pair> reader = new SpecificDatumReader<Pair>(Pair.class);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    Pair result = reader.read(null, decoder);
    System.out.printf("Left: %s, Right: %s\n", result.left, result.right);

Where Pair is a SpecificRecord class that I have in this project. When I deserialize with reader.read() I get the following runtime error:

    Exception in thread "main" java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to wpmcn.Pair
        at wpmcn.AvroExample.serializeSpecific(AvroExample.java:64)
        at wpmcn.AvroExample.main(AvroExample.java:73)

When I step into the debugger I see that the GenericDatumReader.read() function has type D as GenericData. Presumably I'm calling something wrong but I can't figure out what.

On Fri, Sep 2, 2011 at 3:02 PM, Philip Zeyliger phi...@cloudera.com wrote:

EncoderFactory.get().binaryEncoder(...). I encourage you to file a JIRA and submit a patch to AVRO. Having example code in the code base seems like a win to me. -- Philip

On Fri, Sep 2, 2011 at 2:41 PM, W.P. McNeill bill...@gmail.com wrote:

I'm new to Avro. Since I'm having trouble finding simple examples online I'm writing one of my own that I'm putting on github. https://github.com/wpm/AvroExample Hopefully, this will be of help to people like me who are also having trouble finding simple code examples. I want to get this compiling without a hitch in Maven. I had it running with a 1.4 version of Avro, but when I changed that to 1.5, some of the code no longer works. Specifically, BinaryEncoder can no longer be instantiated directly because it is now an abstract class (AvroExample.java: line 33) and DecoderFactory.defaultFactory is deprecated (AvroExample.java: line 41). How should I modify this code so that it works with the latest and greatest version of Avro? I looked through the Release Notes, but the answers weren't obvious. Thanks.
Re: simultaneous read + write?
AvroDataFile deals with this for some cases. Is it an acceptable API for your use case? You can configure the block size to be very small and/or flush() regularly. If you do this on your own, you will need to track the position that you start to read a record at, and if there is a failure, rewind and reset the reader to that position. -Scott

On 8/25/11 7:17 PM, Yang tedd...@gmail.com wrote:

I'm trying to implement an on-disk queue which contains avro records (SpecificRecord). My queue implementation basically contains a SpecificDatumWriter and a SpecificDatumReader pointing to the same file. The problem is that when the reader reaches EOF, I can no longer use it again: even after I append more records to the file, if I call the same SpecificDatumReader.read() again, it gives me exceptions:

    -------------------------------------------------------------------------------
    Test set: blah.MyTest
    -------------------------------------------------------------------------------
    Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.257 sec <<< FAILURE!
    testBasic(blah.MyTest)  Time elapsed: 0.24 sec  <<< ERROR!
    java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.avro.io.BinaryDecoder$ByteSource.compactAndFill(BinaryDecoder.java:670)
        at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:453)
        at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:120)
        at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:405)
        at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
        at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
        at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
        at blah.DiskEventsQueue.dequeue2(MyTest.java:55)
        at blah.MyTest.testBasic(MyTest.java:85)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)

Thanks Yang
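The "track the position, rewind on failure" advice can be sketched without Avro at all. The example below is only an illustration under stated assumptions: it uses a made-up framing (a 4-byte length prefix followed by UTF-8 bytes) and a hypothetical RewindingReader class built on java.io.RandomAccessFile. With Avro records you would additionally have to reset or recreate the decoder after seeking, since a buffering decoder may already have read ahead.

```java
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Reads length-prefixed records from a file that another handle may still be appending to. */
final class RewindingReader implements AutoCloseable {
    private final RandomAccessFile file;

    RewindingReader(String path) throws IOException {
        this.file = new RandomAccessFile(path, "r");
    }

    /** Returns the next record, or null if a complete record is not available yet. */
    String poll() throws IOException {
        long start = file.getFilePointer();   // remember where this record begins
        try {
            int length = file.readInt();      // hypothetical framing: 4-byte length prefix
            byte[] payload = new byte[length];
            file.readFully(payload);
            return new String(payload, StandardCharsets.UTF_8);
        } catch (EOFException e) {
            file.seek(start);                 // rewind so the next poll retries this record
            return null;
        }
    }

    /** Appends one record using the same framing (helper for the writer side). */
    static void append(String path, String record) throws IOException {
        byte[] payload = record.getBytes(StandardCharsets.UTF_8);
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path, true))) {
            out.writeInt(payload.length);
            out.write(payload);
        }
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Because the reader never advances past a record it could not read completely, hitting EOF is harmless: a later poll() simply retries from the same offset once the writer has appended more data.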
Re: How should I migrate 1.4 code to avro 1.5?
Start with a JIRA ticket and we can discuss and refine there. What we accept into the project must be attached as a patch to the JIRA ticket with the sign-off to Apache and proper license headers on the content. Thanks! -Scott

On 9/2/11 5:53 PM, W.P. McNeill bill...@gmail.com wrote:

I've got a building version with Chris Wilkes' changes. I'd be happy to include this in an Avro distribution. Should I just open a JIRA to that effect and point to this github project?

On Fri, Sep 2, 2011 at 5:28 PM, Chris Wilkes cwil...@gmail.com wrote:

Oh and I'm the one that did the pull request. I changed the name of the avro class to MyPair as I was confused when reading it with avro's own Pair class. What I usually do is put all of my avro schemas into a separate project with nothing else in it. Then I have all my other projects depend on that one; in this case AvroExample.java would be in a separate project from MyPair.avsc. This gets around weirdness with mvn install vs Eclipse seeing the updated files, etc.

On Fri, Sep 2, 2011 at 5:20 PM, W.P. McNeill bill...@gmail.com wrote:

Still having trouble with this. The name change was part of merging the pull request on github. My last email details where I'm at right now. The pull request code looks correct; I'm just trying to get it to build in my Maven environment.

On Fri, Sep 2, 2011 at 5:19 PM, Scott Carey scottca...@apache.org wrote:

Are you still having trouble with this? I noticed that the code has changed and you are using MyPair instead of Pair. Was there a naming conflict bug with Avro's Pair.java? -Scott

On 9/2/11 3:46 PM, W.P. McNeill bill...@gmail.com wrote:

I made changes that got rid of all the deprecated calls. I think I am using the 1.5 interface correctly. However, I get a runtime error when I try to deserialize into a class using a SpecificDatumReader. The problem starts at line 62 of AvroExample.java.
The code looks like this:

    DatumReader<Pair> reader = new SpecificDatumReader<Pair>(Pair.class);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    Pair result = reader.read(null, decoder);
    System.out.printf("Left: %s, Right: %s\n", result.left, result.right);

Where Pair is a SpecificRecord class that I have in this project. When I deserialize with reader.read() I get the following runtime error:

    Exception in thread "main" java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to wpmcn.Pair
        at wpmcn.AvroExample.serializeSpecific(AvroExample.java:64)
        at wpmcn.AvroExample.main(AvroExample.java:73)

When I step into the debugger I see that the GenericDatumReader.read() function has type D as GenericData. Presumably I'm calling something wrong but I can't figure out what.

On Fri, Sep 2, 2011 at 3:02 PM, Philip Zeyliger phi...@cloudera.com wrote:

EncoderFactory.get().binaryEncoder(...). I encourage you to file a JIRA and submit a patch to AVRO. Having example code in the code base seems like a win to me. -- Philip

On Fri, Sep 2, 2011 at 2:41 PM, W.P. McNeill bill...@gmail.com wrote:

I'm new to Avro. Since I'm having trouble finding simple examples online I'm writing one of my own that I'm putting on github. https://github.com/wpm/AvroExample Hopefully, this will be of help to people like me who are also having trouble finding simple code examples. I want to get this compiling without a hitch in Maven. I had it running with a 1.4 version of Avro, but when I changed that to 1.5, some of the code no longer works. Specifically, BinaryEncoder can no longer be instantiated directly because it is now an abstract class (AvroExample.java: line 33) and DecoderFactory.defaultFactory is deprecated (AvroExample.java: line 41). How should I modify this code so that it works with the latest and greatest version of Avro? I looked through the Release Notes, but the answers weren't obvious. Thanks.
Re: avro BinaryDecoder bug ?
Looks like a bug to me. Can you file a JIRA ticket? Thanks!

On 8/29/11 1:24 PM, Yang tedd...@gmail.com wrote:

If I read an empty file with BinaryDecoder, I get EOF, which is good. But with the current code, if I read it again with the same decoder, I get an IndexOutOfBoundsException, not EOF. It seems that always giving EOF would be the more desirable behavior. You can see this from the following test code:

    import static org.junit.Assert.assertEquals;
    import java.io.IOException;
    import org.apache.avro.specific.SpecificRecord;
    import org.junit.Test;
    import myavro.Apple;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.Encoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.specific.SpecificDatumReader;
    import org.apache.avro.specific.SpecificDatumWriter;

    class MyWriter {
      SpecificDatumWriter<SpecificRecord> wr;
      Encoder enc;
      OutputStream ostream;

      public MyWriter() throws FileNotFoundException {
        wr = new SpecificDatumWriter<SpecificRecord>(new Apple().getSchema());
        ostream = new FileOutputStream(new File("/tmp/testavro"));
        enc = EncoderFactory.get().binaryEncoder(ostream, null);
      }

      public synchronized void dump(SpecificRecord event) throws IOException {
        wr.write(event, enc);
        enc.flush();
      }
    }

    class MyReader {
      SpecificDatumReader<SpecificRecord> rd;
      Decoder dec;
      InputStream istream;

      public MyReader() throws FileNotFoundException {
        rd = new SpecificDatumReader<SpecificRecord>(new Apple().getSchema());
        istream = new FileInputStream(new File("/tmp/testavro"));
        dec = DecoderFactory.get().binaryDecoder(istream, null);
      }

      public synchronized SpecificRecord read() throws IOException {
        Object r = rd.read(null, dec);
        return (SpecificRecord) r;
      }
    }

    public class AvroWriteAndReadSameTime {
      @Test
      public void testWritingAndReadingAtSameTime() throws Exception {
        MyWriter dumper = new MyWriter();
        final Apple apple = new Apple();
        apple.taste = "sweet";
        dumper.dump(apple);
        final MyReader rd = new MyReader();
        rd.read();
        try {
          rd.read();
        } catch (Exception e) {
          e.printStackTrace();
        }
        // the second one somehow generates a NPE, we hope to get EOF...
        try {
          rd.read();
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }

The issue is in BinaryDecoder.readInt(): right now, even when it hits EOF, it still advances the pos pointer. All the other APIs (readLong, readFloat, ...) do not do this. Changing it to the following makes it work:

    @Override
    public int readInt() throws IOException {
      ensureBounds(5); // won't throw index out of bounds
      int len = 1;
      int b = buf[pos] & 0xff;
      int n = b & 0x7f;
      if (b > 0x7f) {
        b = buf[pos + len++] & 0xff;
        n ^= (b & 0x7f) << 7;
        if (b > 0x7f) {
          b = buf[pos + len++] & 0xff;
          n ^= (b & 0x7f) << 14;
          if (b > 0x7f) {
            b = buf[pos + len++] & 0xff;
            n ^= (b & 0x7f) << 21;
            if (b > 0x7f) {
              b = buf[pos + len++] & 0xff;
              n ^= (b & 0x7f) << 28;
              if (b > 0x7f) {
                throw new IOException("Invalid int encoding");
              }
            }
          }
        }
      }
      if (pos+len > limit) {
        throw new EOFException();
      }
      pos += len; // <== CHANGE, used to be above the EOF throw
      return (n >>> 1) ^ -(n & 1); // back to two's-complement
    }
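The idea behind the proposed change (decide whether the value is complete before moving the read position) can be shown in isolation. The sketch below is not Avro's BinaryDecoder; VarIntReader is a hypothetical stand-alone class over a fixed byte array, and it uses a loop rather than the unrolled form quoted above. It decodes the same zig-zag variable-length int format and leaves the position untouched whenever it throws EOFException, so repeated reads at end of input keep reporting EOF.

```java
import java.io.EOFException;
import java.io.IOException;

/** Decodes zig-zag varints from a byte array; the position only moves on success. */
final class VarIntReader {
    private final byte[] buf;
    private final int limit;
    private int pos;

    VarIntReader(byte[] data) {
        this.buf = data;
        this.limit = data.length;
    }

    int position() {
        return pos;
    }

    int readInt() throws IOException {
        int n = 0;
        int len = 0;
        int b;
        do {
            if (len == 5) {
                throw new IOException("Invalid int encoding"); // more than 5 bytes
            }
            if (pos + len >= limit) {
                throw new EOFException();   // pos has not been advanced yet
            }
            b = buf[pos + len] & 0xff;
            n |= (b & 0x7f) << (7 * len);   // low 7 bits carry data
            len++;
        } while (b > 0x7f);                 // high bit set: more bytes follow
        pos += len;                         // advance only after a complete value
        return (n >>> 1) ^ -(n & 1);        // zig-zag back to two's-complement
    }
}
```

For example, the bytes 0xAC 0x02 decode to 150 and the single byte 0x01 decodes to -1; any read after the last complete value throws EOFException without changing position().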
Re: Map output records/reducer input records mismatch
We have had one other report of something similar happening. https://issues.apache.org/jira/browse/AVRO-782 What Avro version is this happening with? What JVM version? On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args if it is Sun and JRE 6u21 or later? (some issues in loop predicates affect Java 6 too, just not as many as the recent news on Java7). Otherwise, it may likely be the same thing as AVRO-782. Any extra information related to that issue would be welcome. Thanks! -Scott On 8/16/11 8:39 AM, Vyacheslav Zholudev vyacheslav.zholu...@gmail.com wrote: Hi, I'm having multiple hadoop jobs that use the avro mapred API. Only in one of the jobs I have a visible mismatch between a number of map output records and reducer input records. Does anybody encountered such a behavior? Can anybody think of possible explanations of this phenomenon? Any pointers/thoughts are highly appreciated! Best, Vyacheslav
Re: Compiling multiple input schemas
What about leveraging shell expansion? This would mean we would need inverse syntax, like tar or zip (destination, list of sources in reverse dependency order). Then your examples are

    avro-tools-1.6.0.jar compile schema tmp/ input/position.avsc input/player.avsc
    avro-tools-1.6.0.jar compile schema tmp/ input/*

That would be incompatible, since we reversed argument order. But it would be more like other unix command line tools that take lists of files and output results somewhere else. (Or as I see Doug has just replied the last argument can be the destination)

On 8/16/11 1:38 PM, Bill Graham billgra...@gmail.com wrote:

Hi, With Avro-874, multiple inter-dependent schema files can be parsed. I've written a patch to the SpecificCompilerTool to allow the same when producing java from multiple schemas that I'd like to contribute for consistency if there's interest. It allows you to pass multiple input files like this:

    $ java -cp avro-tools-1.6.0.jar org.apache.avro.tool.Main compile schema input/position.avsc,input/player.avsc tmp/

While I was at it, it seemed useful to parse an entire directory of schema files as well so I implemented this:

    $ java -cp avro-tools-1.6.0.jar org.apache.avro.tool.Main compile schema input/ tmp/

The latter approach will not work properly for files with dependencies, since the file order probably isn't in reverse dependency order. If that's the case, a combination of files and directories can be used to force an ordering. So if b depends on a and other files depend on either of them you could do this:

    $ java -cp avro-tools-1.6.0.jar org.apache.avro.tool.Main compile schema input/a.avsc,input/b.avsc,input/ tmp/

Let me know if some or all of this seems useful to contribute. The first example is really the main one that I need. I've done the same for Protocol as well btw. thanks, Bill
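If the dependency order has to be worked out before building the argument list, a small depth-first ordering is enough. This is only a sketch under the assumption that you already know which file references which; the SchemaOrder class and the dependency map are hypothetical and are not part of avro-tools or of the patch discussed above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Orders schema files so that every file comes after the files it depends on. */
final class SchemaOrder {

    /** dependsOn maps a file to the files whose types it references. */
    static List<String> dependencyOrder(Map<String, List<String>> dependsOn) {
        List<String> ordered = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        Set<String> inProgress = new LinkedHashSet<>();
        for (String file : dependsOn.keySet()) {
            visit(file, dependsOn, done, inProgress, ordered);
        }
        return ordered;
    }

    private static void visit(String file, Map<String, List<String>> dependsOn,
                              Set<String> done, Set<String> inProgress,
                              List<String> ordered) {
        if (done.contains(file)) {
            return;
        }
        if (!inProgress.add(file)) {
            throw new IllegalArgumentException("Circular dependency involving " + file);
        }
        List<String> deps = dependsOn.getOrDefault(file, Collections.<String>emptyList());
        for (String dep : deps) {
            visit(dep, dependsOn, done, inProgress, ordered); // dependencies first
        }
        inProgress.remove(file);
        done.add(file);
        ordered.add(file);
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("input/player.avsc", Collections.singletonList("input/position.avsc"));
        deps.put("input/position.avsc", Collections.<String>emptyList());
        System.out.println(String.join(",", dependencyOrder(deps)));
    }
}
```

The resulting list can be joined with commas or spaces, depending on which argument syntax the tool ends up accepting.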
Re: Map output records/reducer input records mismatch
On 8/16/11 3:56 PM, Vyacheslav Zholudev vyacheslav.zholu...@gmail.com wrote:

Hi, Scott, thanks for your reply.

What Avro version is this happening with? What JVM version?

We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have to look up.

On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args if it is Sun and JRE 6u21 or later? (some issues in loop predicates affect Java 6 too, just not as many as the recent news on Java7). Otherwise, it may likely be the same thing as AVRO-782. Any extra information related to that issue would be welcome.

I will have to collect it. In the meanwhile, do you have any reasonable explanations of the issue besides it being something like AVRO-782?

What is your key type (map output schema, first type argument of Pair)? Is your key a Utf8 or String? I don't have a reasonable explanation at this point, I haven't looked into it in depth with a good reproducible case. I have my suspicions with how recycling of the key works since Utf8 is mutable and its backing byte[] can end up shared.

Thanks a lot, Vyacheslav

Thanks! -Scott

On 8/16/11 8:39 AM, Vyacheslav Zholudev vyacheslav.zholu...@gmail.com wrote:

Hi, I'm having multiple hadoop jobs that use the avro mapred API. Only in one of the jobs I have a visible mismatch between a number of map output records and reducer input records. Does anybody encountered such a behavior? Can anybody think of possible explanations of this phenomenon? Any pointers/thoughts are highly appreciated! Best, Vyacheslav

Best, Vyacheslav
Re: why Utf8 (vs String)?
Also, Utf8 caches the result of toString(), so that if you call toString() many times, it only allocates the String once. It also implements the CharSequence interface, and many libraries in the JRE accept CharSequence. Note that Utf8 is mutable and exposes its backing store (byte array). String is immutable. Be careful with how you use Utf8 objects if you hold on to them for a long time or pass them to other code -- users should not expect similar characteristics to String for general use. On 8/11/11 5:08 PM, Yang tedd...@gmail.com wrote: Thanks a lot Doug On Thu, Aug 11, 2011 at 5:02 PM, Doug Cutting cutt...@apache.org wrote: This is for performance. A Utf8 may be efficiently compared to other Utf8's, e.g., when sorting, without decoding the UTF-8 bytes into characters. A Utf8 may also be reused, so when iterating through a large number of values (e.g., in a MapReduce job) only a single instance need be allocated, while String would require an allocation per iteration. Note that String may be used when writing data, but that data is generally read as Utf8. The toString() method may be called whenever a String is required. If only equality or ordering is needed, and not substring operations, then leaving values as Utf8 is generally faster than converting to String. Doug On 08/11/2011 04:36 PM, Yang wrote: if I declare a field to be string, the generated java implementation uses avro..Utf8 for that, I was wondering what is the thinking behind this, and what is the proper way to use the Utf8 value - oftentimes in my logic, I need to compare the value against other String's, or store them into other databases , which of course do not know about Utf8, so that I'd have to transform them into String's. so it seems being Utf8 unnecessarily asks for a lot of transformations. or I guess I'm not getting the correct usage ? Thanks Yang
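The caution about holding on to a reused, mutable value can be illustrated without Avro. In the sketch below a StringBuilder stands in for a reused Utf8 instance; that substitution, and the ReuseHazard class itself, are assumptions made purely for illustration. The point is that storing the reused object keeps a reference to something that will change, while storing the result of toString() keeps an immutable snapshot.

```java
import java.util.ArrayList;
import java.util.List;

/** Shows why a reused mutable text object must be copied before it is retained. */
final class ReuseHazard {

    /** Stores the reused instance itself: every element aliases the same object. */
    static List<CharSequence> collectWithoutCopy(String[] values) {
        StringBuilder reused = new StringBuilder(); // stand-in for a reused Utf8
        List<CharSequence> out = new ArrayList<>();
        for (String v : values) {
            reused.setLength(0);
            reused.append(v);      // the "reader" overwrites the same instance
            out.add(reused);       // bug: keeps a reference, not the value
        }
        return out;
    }

    /** Stores an immutable String snapshot of each value. */
    static List<CharSequence> collectWithCopy(String[] values) {
        StringBuilder reused = new StringBuilder();
        List<CharSequence> out = new ArrayList<>();
        for (String v : values) {
            reused.setLength(0);
            reused.append(v);
            out.add(reused.toString()); // copy before the next iteration mutates it
        }
        return out;
    }
}
```

With the input {"a", "b", "c"}, the first method returns a list whose three elements all read as "c", while the second returns "a", "b", "c" as expected.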
Re: Combining schemas
On 8/9/11 11:15 AM, Bill Graham billgra...@gmail.com wrote:

Hi, I'm trying to create a schema that references a type defined in another schema and I'm having some troubles. Is there an easy way to do this? My test schemas look like this:

    $ cat position.avsc
    {"type": "enum", "name": "Position", "namespace": "avro.examples.baseball",
     "symbols": ["P", "C", "B1", "B2", "B3", "SS", "LF", "CF", "RF", "DH"]
    }

    $ cat player.avsc
    {"type": "record", "name": "Player", "namespace": "avro.examples.baseball",
     "fields": [
       {"name": "number", "type": "int"},
       {"name": "first_name", "type": "string"},
       {"name": "last_name", "type": "string"},
       {"name": "position", "type": {"type": "array", "items": "avro.examples.baseball.Position"}}
     ]
    }

I've read this thread (http://apache-avro.679487.n3.nabble.com/How-to-reference-previously-defined-enum-in-avsc-file-td2663512.html) and tried using IDL like so with no luck:

    $ cat baseball.avdl
    @namespace("avro.examples.baseball")
    protocol Baseball {
      import schema "position.avsc";
      import schema "player.avsc";
    }

    $ java -jar avro-tools-1.5.1.jar idl baseball.avdl baseball.avpr
    Exception in thread "main" org.apache.avro.SchemaParseException: Undefined name: "avro.examples.baseball.Position"
        at org.apache.avro.Schema.parse(Schema.java:979)
        at org.apache.avro.Schema.parse(Schema.java:1052)
        at org.apache.avro.Schema.parse(Schema.java:1021)
        at org.apache.avro.Schema.parse(Schema.java:884)
        at org.apache.avro.compiler.idl.Idl.ImportSchema(Idl.java:388)
        at org.apache.avro.compiler.idl.Idl.ProtocolBody(Idl.java:320)
        at org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:206)
        at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:84)
        ...

I agree that the documentation indicates that this should work. I suspect that it may not be able to resolve dependencies among imports. That is, if Baseball depends on position, and on player, it works. But since player depends on position, it does not. The import statement pulls in each item individually for use in composite things in the AvroIDL, but does not allow for interdependencies in the imports. This seems worthy of a JIRA enhancement request. I'm sure the project will accept a patch that adds this.

I also saw this blog post (http://www.infoq.com/articles/ApacheAvro#_ftnref6_7758) where the author had to write some nasty String.replace(..) code to combine schemas, but there's got to be a better way than this.

We need to improve the ability to import multiple files when parsing. Using the lower level Avro API you can parse the files yourself in an order that will work. I have simply put all my types in one file. If you made one avsc file with both Position and Player in a JSON array it will compile. It would look like:

    [ position schema here, player schema here ]

Also FYI, it seems enum values can't start with numbers (i.e. '1B'). Is this a known issue or a feature? I haven't seen it documented anywhere. You get an error like this if the value starts with a number:

    org.apache.avro.SchemaParseException: Illegal initial character

Enums are a named type. The enum names must start with [A-Za-z_] and subsequently contain only [A-Za-z0-9_]. http://avro.apache.org/docs/1.5.1/spec.html#Names However, the spec does not say that the values must have such restrictions. This may be a bug, can you file a JIRA ticket? Thanks! -Scott

thanks, Bill
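The naming rule quoted from the spec (first character in [A-Za-z_], remaining characters in [A-Za-z0-9_]) is easy to check up front. The helper below is a hypothetical stand-alone sketch using java.util.regex; it is not an Avro API, and whether enum symbols are meant to be held to the same rule is exactly the open question raised above.

```java
import java.util.regex.Pattern;

/** Checks a candidate name against the pattern given in the Avro spec's Names section. */
final class AvroNames {
    private static final Pattern NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isValidName(String candidate) {
        return candidate != null && NAME.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"B1", "1B", "_1B", "SS"}) {
            System.out.println(s + " -> " + isValidName(s));
        }
    }
}
```

Under this rule "B1" and "_1B" pass while "1B" fails, which matches the "Illegal initial character" error reported for symbols that start with a digit.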
Re: Hadoop and org.apache.avro.file.DataFileReader sez Not an Avro data file
An avro data file is not created with a FileOutputStream. That will write = avro binary data to a file, but not in the avro file format (which is split= table and contains header metadata). The API for Avro Data Files is here: http://avro.apache.org/docs/current/api/java/org/apache/avro/file/package- http://avro.apache.org/docs/current/api/java/org/apache/avro/file/package- s=summary.html On 7/20/11 2:35 PM, Peter Wolf opus...@gmail.com wrote: Hello, anyone out there know about AVRO file formats and/or Hadoop support? My Hadoop AvroJob code does not recognize the AVRO files created by my other code. It seems that the MAGIC number is wrong. What is going on? How many different ways of encoding AVRO files are there, and how do I make sure they match. I am creating the input files like this... static public void write(String file, GenericRecord record, Schema schema) throws IOException { OutputStream o = new FileOutputStream(file); GenericDatumWriter w = new GenericDatumWriter(schema); Encoder e = EncoderFactory.get().binaryEncoder(o, null); w.write(record, e); e.flush(); } Hadoop is reading them using org.apache.avro.file.DataFileReader Here is where it breaks. I checked, and it really is trying to read the right file... /** Open a reader for a file. */ public static D FileReaderD openReader(SeekableInput in, DatumReaderD reader) throws IOException { if (in.length() MAGIC.length) throw new IOException(Not an Avro data file); // read magic header byte[] magic = new byte[MAGIC.length]; in.seek(0); for (int c = 0; c magic.length; c = in.read(magic, c, magic.length-c)) {} in.seek(0); if (Arrays.equals(MAGIC, magic)) // current format return new DataFileReaderD(in, reader); if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format return new DataFileReader12D(in, reader); throw new IOException(Not an Avro data file); } Some background... I am trying to write my first AVRO Hadoop application. 
I am using Hadoop Cloudera 20.2-737 and AVRO 1.5.1.

I followed the instructions here... http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/package-summary.html#package_description

The sample code here... http://svn.apache.org/viewvc/avro/tags/release-1.5.1/lang/java/mapred/src/test/java/org/apache/avro/mapred/TestWordCount.java?view=markup

Here is my code which breaks with a "Not an Avro data file" error.

    public static class MapImpl extends AvroMapper<Account, Pair<Utf8, Long>> {
        @Override
        public void map(Account account, AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
            StringTokenizer tokens = new StringTokenizer(account.timestamp.toString());
            while (tokens.hasMoreTokens())
                collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
        }
    }

    public static class ReduceImpl extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
        @Override
        public void reduce(Utf8 word, Iterable<Long> counts, AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
            long sum = 0;
            for (long count : counts)
                sum += count;
            collector.collect(new Pair<Utf8, Long>(word, sum));
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: " + getClass().getName() + " input output");
            System.exit(2);
        }
        JobConf job = new JobConf(this.getClass());
        Path outputPath = new Path(args[1]);
        outputPath.getFileSystem(job).delete(outputPath);
        //WordCountUtil.writeLinesFile();
        job.setJobName(this.getClass().getName());
        AvroJob.setInputSchema(job, Account.schema); //Schema.create(Schema.Type.STRING));
        AvroJob.setOutputSchema(job, new Pair<Utf8, Long>(new Utf8(), 0L).getSchema());
        AvroJob.setMapperClass(job, MapImpl.class);
        AvroJob.setCombinerClass(job, ReduceImpl.class);
        AvroJob.setReducerClass(job, ReduceImpl.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
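To make the failure in this thread concrete, here is a minimal sketch of the check that rejects the file. It assumes only what the Avro specification states about object container files: they begin with the four bytes 'O', 'b', 'j', 1. The class `AvroMagicCheck` and its method are hypothetical helpers written for illustration, not part of the Avro API. A file produced by a bare binary encoder starts directly with record data, so the check fails; writing through the data-file classes in `org.apache.avro.file` (for example `DataFileWriter`) is what adds the header.

```java
import java.util.Arrays;

/** Illustrative helper (hypothetical, not part of Avro). */
final class AvroMagicCheck {
    // Container files start with ASCII 'O', 'b', 'j' followed by the byte 1.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Returns true if the data starts with the container-file magic bytes. */
    static boolean looksLikeDataFile(byte[] data) {
        if (data.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(MAGIC, Arrays.copyOf(data, MAGIC.length));
    }

    public static void main(String[] args) {
        byte[] container = {'O', 'b', 'j', 1, 0, 0};
        // A raw binary-encoded string "hello": zig-zag length prefix 10, then the bytes.
        byte[] rawRecord = {10, 'h', 'e', 'l', 'l', 'o'};
        System.out.println(looksLikeDataFile(container));
        System.out.println(looksLikeDataFile(rawRecord));
    }
}
```

Running a check like this against the first few bytes of an input file is a quick way to tell a container file from raw encoder output before handing it to a Hadoop job.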
Re: Hadoop and org.apache.avro.file.DataFileReader sez Not an Avro data file
Let me try that again, without the odd formatting:

An avro data file is not created with a FileOutputStream. That will write avro binary data to a file, but not in the avro file format (which is splittable and contains header metadata). The API for Avro Data Files is here: http://avro.apache.org/docs/current/api/java/org/apache/avro/file/package-summary.html

On 7/20/11 5:38 PM, Scott Carey scottca...@apache.org wrote:

An avro data file is not created with a FileOutputStream. That will write avro binary data to a file, but not in the avro file format (which is splittable and contains header metadata). The API for Avro Data Files is here: http://avro.apache.org/docs/current/api/java/org/apache/avro/file/package-summary.html

On 7/20/11 2:35 PM, Peter Wolf opus...@gmail.com wrote:

Hello, anyone out there know about AVRO file formats and/or Hadoop support? My Hadoop AvroJob code does not recognize the AVRO files created by my other code. It seems that the MAGIC number is wrong. What is going on? How many different ways of encoding AVRO files are there, and how do I make sure they match?

I am creating the input files like this...

    static public void write(String file, GenericRecord record, Schema schema) throws IOException {
        OutputStream o = new FileOutputStream(file);
        GenericDatumWriter w = new GenericDatumWriter(schema);
        Encoder e = EncoderFactory.get().binaryEncoder(o, null);
        w.write(record, e);
        e.flush();
    }

Hadoop is reading them using org.apache.avro.file.DataFileReader

Here is where it breaks. I checked, and it really is trying to read the right file...

    /** Open a reader for a file. */
    public static <D> FileReader<D> openReader(SeekableInput in, DatumReader<D> reader) throws IOException {
        if (in.length() < MAGIC.length)
            throw new IOException("Not an Avro data file");
        // read magic header
        byte[] magic = new byte[MAGIC.length];
        in.seek(0);
        for (int c = 0; c < magic.length; c = in.read(magic, c, magic.length-c)) {}
        in.seek(0);
        if (Arrays.equals(MAGIC, magic)) // current format
            return new DataFileReader<D>(in, reader);
        if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
            return new DataFileReader12<D>(in, reader);
        throw new IOException("Not an Avro data file");
    }

Some background... I am trying to write my first AVRO Hadoop application. I am using Hadoop Cloudera 20.2-737 and AVRO 1.5.1.

I followed the instructions here... http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/package-summary.html#package_description

The sample code here... http://svn.apache.org/viewvc/avro/tags/release-1.5.1/lang/java/mapred/src/test/java/org/apache/avro/mapred/TestWordCount.java?view=markup

Here is my code which breaks with a "Not an Avro data file" error.

    public static class MapImpl extends AvroMapper<Account, Pair<Utf8, Long>> {
        @Override
        public void map(Account account, AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
            StringTokenizer tokens = new StringTokenizer(account.timestamp.toString());
            while (tokens.hasMoreTokens())
                collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
        }
    }

    public static class ReduceImpl extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
        @Override
        public void reduce(Utf8 word, Iterable<Long> counts, AvroCollector<Pair<Utf8, Long>> collector, Reporter reporter) throws IOException {
            long sum = 0;
            for (long count : counts)
                sum += count;
            collector.collect(new Pair<Utf8, Long>(word, sum));
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: " + getClass().getName() + " input output");
            System.exit(2);
        }
        JobConf job = new JobConf(this.getClass());
        Path outputPath = new Path(args[1]);
        outputPath.getFileSystem(job).delete(outputPath);
        //WordCountUtil.writeLinesFile();
        job.setJobName(this.getClass().getName());
        AvroJob.setInputSchema(job, Account.schema
Re: Schema with multiple Record types Java API
Try out the Reflect API. It may not be flexible enough yet, but the intended use case is to serialize pre-existing classes. If more annotations are required for your use case, create a JIRA ticket. http://avro.apache.org/docs/current/api/java/org/apache/avro/reflect/package-summary.html Thanks! -Scott

On 7/15/11 4:54 AM, Peter Wolf opus...@gmail.com wrote:

Thanks again Scott, Yes, I am using AVRO to serialize existing Java classes, so tools to generate code will not help me. Are there tools that go the other way, such as JAXB for XML? I really want to point to a root Java object, and say "serialize this, and everything it points to, as AVRO". BTW AVRO Rocks! My objects contain large amounts of data, and I am *very* impressed with the speed of serialization/deserialization. Cheers P

On 7/14/11 10:10 PM, Scott Carey wrote:

AvroIDL can handle imports, but it generates classes. The Avro APIs for this can be used to generate Schemas without making objects if you wish. The Avro schema compiler (*.avsc, *.avpr) does not support imports; it is a feature requested by many but not contributed by anyone. You may be interested in the code-gen capabilities of Avro, which has a Velocity templating engine to create Java classes based on schemas. This can be customized to generate classes in custom ways. However, if you are using Avro to serialize objects that have pre-existing classes, the Reflect API or an enhancement of it may be more suitable. More information on your use case may help to point you in the right direction. -Scott

On 7/14/11 6:43 PM, Peter Wolf opus...@gmail.com wrote:

Many thanks Scott, I am looking for the equivalent of #include or import. I want to make a complicated schema with many record types, but manage it in separate strings. In my application, I am using AVRO to serialize a tree of connected Java objects. The record types mirror Java classes. The schema descriptions live in the different Java classes, and reference each other. My current code looks like this...

    public class Foo {
        static String schemaDescription =
            "{" +
            "\"namespace\": \"foo\", " +
            "\"name\": \"Foo\", " +
            "\"type\": \"record\", " +
            "\"fields\": [" +
            "{\"name\": \"notes\", \"type\": \"string\" }," +
            "{\"name\": \"timestamp\", \"type\": \"string\" }," +
            "{\"name\": \"bah\", \"type\": " + Bah.schemaDescription + " }," +
            "{\"name\": \"zot\", \"type\": " + Zot.schemaDescription + " }" +
            "]" +
            "}";
        static Schema schema = Schema.parse(schemaDescription);

So, I am referencing by copying the schemaDescriptions. The top level schemaDescription strings therefore get really big. Is there already a clean coding pattern for doing this? I can't be the first. Is there a document describing best practices? Thanks P

On 7/14/11 7:02 PM, Scott Carey wrote:

The name and namespace are part of any named schema (Type.RECORD, Type.FIXED, Type.ENUM). We don't currently have an API to search a schema for subschemas that match names. It would be useful; you might want to create a JIRA ticket explaining your use case. So it would be a little more complex.

    Schema schema = Schema.parse(schemaDescription);
    Schema.Type type = schema.getType();
    switch (type) {
        case RECORD:
            String name = schema.getName();
            String namespace = schema.getNamespace();
            List<Field> fields = schema.getFields();
    }
    etc.

In general, I have created SpecificRecord objects from schemas using the specific compiler (and the ant task or maven plugin) and then within those generated classes there is a static SCHEMA variable to reference. Avro IDL is also an easier way to define related schemas. Currently there are only build tools that generate code from these, though there are APIs to extract schemas. -Scott

On 7/13/11 10:43 AM, Peter Wolf opus...@gmail.com wrote:

Hello, this is a dumb question, but I can not find the answer in the docs. I want to have a complicated schema with lots of Records referencing other Records. Like this...

    { "namespace": "com.foobah", "name": "Bah", "type": "record", "fields": [ {"name": "value", "type": "int"} ] }
    { "namespace": "com.foobah", "name": "Foo", "type": "record", "fields": [ {"name": "bah", "type": "Bah"} ] }

Using the Java API, how do I reference types within a schema? Let's say I want to make a Foo object, I want to do something like this...

    Schema schema = Schema.parse(schemaDescription);
    Schema foo = schema.getSchema("com.foobah.Foo");
    GenericData o = new GenericData( foo );

Many thanks in advance Peter
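One way to keep the composed schema strings from growing, sketched under the assumption that all records end up in a single schema document: Avro lets a named type be defined once, at its first use, and referenced by its full name afterwards. The class `SchemaStrings` and the second field `otherBah` are hypothetical and exist only to show a by-name reference; this is not an established Avro coding pattern from the thread.

```java
/** Hypothetical holder class; only the string composition is the point. */
final class SchemaStrings {
    // Bah is a named record, so its full definition needs to appear only once.
    static final String BAH =
        "{\"namespace\": \"com.foobah\", \"name\": \"Bah\", \"type\": \"record\", "
        + "\"fields\": [{\"name\": \"value\", \"type\": \"int\"}]}";

    // Foo embeds Bah's definition at the first use and refers to it by name afterwards.
    static final String FOO =
        "{\"namespace\": \"com.foobah\", \"name\": \"Foo\", \"type\": \"record\", "
        + "\"fields\": ["
        + "{\"name\": \"bah\", \"type\": " + BAH + "}, "
        + "{\"name\": \"otherBah\", \"type\": \"com.foobah.Bah\"}"
        + "]}";

    public static void main(String[] args) {
        System.out.println(FOO);
    }
}
```

The resulting string could then be handed to `Schema.parse(...)` as in the code quoted above; each nested definition is copied once rather than at every place the type is used.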
Re: Schema with multiple Record types Java API
The name and namespace are part of any named schema (Type.RECORD, Type.FIXED, Type.ENUM). We don't currently have an API to search a schema for subschemas that match names. It would be useful; you might want to create a JIRA ticket explaining your use case. So it would be a little more complex.

    Schema schema = Schema.parse(schemaDescription);
    Schema.Type type = schema.getType();
    switch (type) {
        case RECORD:
            String name = schema.getName();
            String namespace = schema.getNamespace();
            List<Field> fields = schema.getFields();
    }
    etc.

In general, I have created SpecificRecord objects from schemas using the specific compiler (and the ant task or maven plugin) and then within those generated classes there is a static SCHEMA variable to reference. Avro IDL is also an easier way to define related schemas. Currently there are only build tools that generate code from these, though there are APIs to extract schemas. -Scott

On 7/13/11 10:43 AM, Peter Wolf opus...@gmail.com wrote:

Hello, this is a dumb question, but I can not find the answer in the docs. I want to have a complicated schema with lots of Records referencing other Records. Like this...

    { "namespace": "com.foobah", "name": "Bah", "type": "record", "fields": [ {"name": "value", "type": "int"} ] }
    { "namespace": "com.foobah", "name": "Foo", "type": "record", "fields": [ {"name": "bah", "type": "Bah"} ] }

Using the Java API, how do I reference types within a schema? Let's say I want to make a Foo object, I want to do something like this...

    Schema schema = Schema.parse(schemaDescription);
    Schema foo = schema.getSchema("com.foobah.Foo");
    GenericData o = new GenericData( foo );

Many thanks in advance Peter
Re: Schema with multiple Record types Java API
AvroIDL can handle imports, but it generates classes. The Avro APIs for this can be used to generate Schemas without making objects if you wish. The Avro schema compiler (*.avsc, *.avpr) does not support imports; it is a feature requested by many but not contributed by anyone. You may be interested in the code-gen capabilities of Avro, which has a Velocity templating engine to create Java classes based on schemas. This can be customized to generate classes in custom ways. However, if you are using Avro to serialize objects that have pre-existing classes, the Reflect API or an enhancement of it may be more suitable. More information on your use case may help to point you in the right direction. -Scott

On 7/14/11 6:43 PM, Peter Wolf opus...@gmail.com wrote:

Many thanks Scott, I am looking for the equivalent of #include or import. I want to make a complicated schema with many record types, but manage it in separate strings. In my application, I am using AVRO to serialize a tree of connected Java objects. The record types mirror Java classes. The schema descriptions live in the different Java classes, and reference each other. My current code looks like this...

    public class Foo {
        static String schemaDescription =
            "{" +
            "\"namespace\": \"foo\", " +
            "\"name\": \"Foo\", " +
            "\"type\": \"record\", " +
            "\"fields\": [" +
            "{\"name\": \"notes\", \"type\": \"string\" }," +
            "{\"name\": \"timestamp\", \"type\": \"string\" }," +
            "{\"name\": \"bah\", \"type\": " + Bah.schemaDescription + " }," +
            "{\"name\": \"zot\", \"type\": " + Zot.schemaDescription + " }" +
            "]" +
            "}";
        static Schema schema = Schema.parse(schemaDescription);

So, I am referencing by copying the schemaDescriptions. The top level schemaDescription strings therefore get really big. Is there already a clean coding pattern for doing this? I can't be the first. Is there a document describing best practices? Thanks P

On 7/14/11 7:02 PM, Scott Carey wrote:

The name and namespace are part of any named schema (Type.RECORD, Type.FIXED, Type.ENUM). We don't currently have an API to search a schema for subschemas that match names. It would be useful; you might want to create a JIRA ticket explaining your use case. So it would be a little more complex.

    Schema schema = Schema.parse(schemaDescription);
    Schema.Type type = schema.getType();
    switch (type) {
        case RECORD:
            String name = schema.getName();
            String namespace = schema.getNamespace();
            List<Field> fields = schema.getFields();
    }
    etc.

In general, I have created SpecificRecord objects from schemas using the specific compiler (and the ant task or maven plugin) and then within those generated classes there is a static SCHEMA variable to reference. Avro IDL is also an easier way to define related schemas. Currently there are only build tools that generate code from these, though there are APIs to extract schemas. -Scott

On 7/13/11 10:43 AM, Peter Wolf opus...@gmail.com wrote:

Hello, this is a dumb question, but I can not find the answer in the docs. I want to have a complicated schema with lots of Records referencing other Records. Like this...

    { "namespace": "com.foobah", "name": "Bah", "type": "record", "fields": [ {"name": "value", "type": "int"} ] }
    { "namespace": "com.foobah", "name": "Foo", "type": "record", "fields": [ {"name": "bah", "type": "Bah"} ] }

Using the Java API, how do I reference types within a schema? Let's say I want to make a Foo object, I want to do something like this...

    Schema schema = Schema.parse(schemaDescription);
    Schema foo = schema.getSchema("com.foobah.Foo");
    GenericData o = new GenericData( foo );

Many thanks in advance Peter
Re: Classpath for java
I suspect that you will need to go into the module with the Pair class. When executing a maven plugin directly from the command line (exec:exec) the maven 'scope' is very restricted, and when you do this on the top level project it executes on that project only by default. The surefire test plugin occurs in the test phase, after it has finished all of the prior phases including compiling and constructing all the paths required for testing. On 6/26/11 9:53 AM, Jeremy Lewi jer...@lewi.us wrote: Hi, I'm having trouble understanding how the class path is being set by maven for java. When I run a unit test using the maven surefire plugin cd lang/java mvn -Dtest=org.apache.avro.mapred.TestWordCount test -X The output shows the following directories are on the classpath. lang/java/mapred/target/test-classes lang/java/mapred/target/classes lang/java/ipc/target/classes lang/java/avro/target/classes But when I try to execute a class (I put a main method in lang/java/.../Pair.java for testing) mvn exec:exec -Dexec.mainClass=Pair -X Only lang/java/target/classes is on the path. So I'm trying to determine how to configure the exec plugin to properly set the class path so that I can execute programs. If anyone has any pointers I would greatly appreciate it. Thanks J
Re: Avro and Hadoop streaming
Hadoop has an old version of Avro in it. You must place the 1.6.0 jar (and relevant dependencies, or the avro-tools.jar with all dependencies bundled) in a location that gets picked up first in the task classpath. Packaging it in the job jar works. I'm not sure if putting it in the distributed cache and loading it as a library that way would.

On 6/15/11 9:30 AM, Matt Pouttu-Clarke matt.pouttu-cla...@icrossing.com wrote:

You have to package it in the job jar file under a /lib directory.

On 6/15/11 9:26 AM, Miki Tebeka miki.teb...@gmail.com wrote:

Still didn't work. I'm pretty new to hadoop world, I probably need to place the avro jar somewhere on the classpath of the nodes, however I have no idea how to do that.

On Wed, Jun 15, 2011 at 3:33 AM, Harsh J ha...@cloudera.com wrote:

Miki, You'll need to provide the entire canonical class name (org.apache.avro.mapredS).

On Wed, Jun 15, 2011 at 5:31 AM, Miki Tebeka miki.teb...@gmail.com wrote:

Greetings, I've tried to run a job with the following command:

    hadoop jar ./hadoop-streaming-0.20.2-cdh3u0.jar \
        -input /in/avro \
        -output $out \
        -mapper avro-mapper.py \
        -reducer avro-reducer.py \
        -file avro-mapper.py \
        -file avro-reducer.py \
        -cacheArchive /cache/avro-mapred-1.6.0-SNAPSHOT.jar \
        -inputformat AvroAsTextInputFormat

However I get

    -inputformat : class not found : AvroAsTextInputFormat

I'm probably missing something obvious to do. Any ideas? Thanks! -- Miki

On Fri, Jun 3, 2011 at 1:43 AM, Doug Cutting cutt...@apache.org wrote:

Miki, Have you looked at AvroAsTextInputFormat? http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/AvroAsTextInputFormat.html Also, release 1.5.2 will include AvroTextOutputFormat: https://issues.apache.org/jira/browse/AVRO-830 Are these perhaps what you're looking for? Doug

On 06/02/2011 11:30 PM, Miki Tebeka wrote:

Greetings, I'd like to use hadoop streaming with Avro files. My plan is to write an inputformat class that emits json records, one per line. This way the streaming application can read one record per line. (http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#Specifying+Other+Plugins+for+Jobs) I couldn't find any documentation/help about writing inputformat classes. Can someone point me to the right direction? Thanks, -- Miki

-- Harsh J
Re: avro object reuse
Corruption can occur in I/O busses and RAM. Does this tend to fail on the same nodes, or any node randomly? Since it does not fail consistently, this makes me suspect some sort of corruption even more. I suggest turning on stack traces for fatal throwables. This shouldn't hurt production performance since they don't happen regularly and break the task anyway. Of the heap dumps seen so far, the primary consumption is byte[] and no more than 300MB. How large are your java heaps?

On 6/10/11 10:53 AM, ey-chih chow eyc...@hotmail.com wrote:

Since this was in production, we did not turn on stack trace. Also, it was highly unlikely that there was any data corrupted because, if one mapper failed due to out of memory, the system started another one and went through all the data.

From: sc...@richrelevance.com
To: user@avro.apache.org
Date: Thu, 9 Jun 2011 17:43:02 -0700
Subject: Re: avro object reuse

If the exception is happening while decoding, it could be due to corrupt data. Avro allocates a List preallocated to the size encoded, and I've seen corrupted data cause attempted allocations of arrays too large for the heap.

On 6/9/11 4:58 PM, Scott Carey sc...@richrelevance.com wrote:

What is the stack trace on the out of memory exception?

On 6/9/11 4:45 PM, ey-chih chow eyc...@hotmail.com wrote:

We configure more than 100MB for MapReduce to do sorting. Memory we allocate for doing other things in the mapper actually is larger, but, for this job, we always get out-of-memory exceptions and the job can not complete. We try to find out if there is a way to avoid this problem. Ey-Chih Chow

From: sc...@richrelevance.com
To: user@avro.apache.org
Date: Thu, 9 Jun 2011 15:42:10 -0700
Subject: Re: avro object reuse

The most likely candidate for creating many instances of BufferAccessor and ByteArrayByteSource is BinaryData.compare() and BinaryData.hashCode(). Each call will create one of each (hash) or two of each (compare). These are only 32 bytes per instance and quickly become garbage that is easily cleaned up by the GC. The below have only 32 bytes each and 8MB total. On the other hand, the byte[]'s appear to be about 24K each on average and are using 100MB. Is this the size of your configured MapReduce sort MB?

On 6/9/11 3:08 PM, ey-chih chow eyc...@hotmail.com wrote:

We did more monitoring. At one instance, we got the following histogram via Jmap. The question is why there are so many instances of BinaryDecoder$BufferAccessor and BinaryDecoder$ByteArrayByteSource. How to avoid this? Thanks.

Object Histogram:

    num   #instances   #bytes      Class description
    ------------------------------------------------
     1:   4199         100241168   byte[]
     2:   272948       8734336     org.apache.avro.io.BinaryDecoder$BufferAccessor
     3:   272945       8734240     org.apache.avro.io.BinaryDecoder$ByteArrayByteSource
     4:   2093         5387976     int[]
     5:   23762        2822864     * ConstMethodKlass
     6:   23762        1904760     * MethodKlass
     7:   39295        1688992     * SymbolKlass
     8:   2127         1216976     * ConstantPoolKlass
     9:   2127         882760      * InstanceKlassKlass
    10:   1847         742936      * ConstantPoolCacheKlass
    11:   9602         715608      char[]
    12:   1072         299584      * MethodDataKlass
    13:   9698         232752      java.lang.String
    14:   2317         222432      java.lang.Class
    15:   3288         204440      short[]
    16:   3167         156664      * System ObjArray
    17:   2401         57624       java.util.HashMap$Entry
    18:   666          53280       java.lang.reflect.Method
    19:   161          52808       * ObjArrayKlassKlass
    20:   1808         43392       java.util.Hashtable$Entry

From: eyc...@hotmail.com
To: user@avro.apache.org
Subject: RE: avro object reuse
Date: Wed, 1 Jun 2011 15:14:03 -0700

We make a lot of toString() calls on the avro Utf8 object. Will this cause Jackson calls? Thanks. Ey-Chih

From: sc...@richrelevance.com
To: user@avro.apache.org
Date: Wed, 1 Jun 2011 13:38:39 -0700
Subject: Re: avro object reuse

This is great info. Jackson should only be used once when the file is opened, so this is confusing from that point of view. Is something else using Jackson or initializing an Avro JsonDecoder frequently? There are over 100,000 Jackson DeserializationConfig objects. Another place that parses the schema is in AvroSerialization.java. Does the Hadoop getDeserializer() API method get called once per job
Re: avro object reuse
The most likely candidate for creating many instances of BufferAccessor and ByteArrayByteSource is BinaryData.compare() and BinaryData.hashCode(). Each call will create one of each (hash) or two of each (compare). These are only 32 bytes per instance and quickly become garbage that is easily cleaned up by the GC. The below have only 32 bytes each and 8MB total. On the other hand, the byte[]'s appear to be about 24K each on average and are using 100MB. Is this the size of your configured MapReduce sort MB?

On 6/9/11 3:08 PM, ey-chih chow eyc...@hotmail.com wrote:

We did more monitoring. At one instance, we got the following histogram via Jmap. The question is why there are so many instances of BinaryDecoder$BufferAccessor and BinaryDecoder$ByteArrayByteSource. How to avoid this? Thanks.

Object Histogram:

    num   #instances   #bytes      Class description
    ------------------------------------------------
     1:   4199         100241168   byte[]
     2:   272948       8734336     org.apache.avro.io.BinaryDecoder$BufferAccessor
     3:   272945       8734240     org.apache.avro.io.BinaryDecoder$ByteArrayByteSource
     4:   2093         5387976     int[]
     5:   23762        2822864     * ConstMethodKlass
     6:   23762        1904760     * MethodKlass
     7:   39295        1688992     * SymbolKlass
     8:   2127         1216976     * ConstantPoolKlass
     9:   2127         882760      * InstanceKlassKlass
    10:   1847         742936      * ConstantPoolCacheKlass
    11:   9602         715608      char[]
    12:   1072         299584      * MethodDataKlass
    13:   9698         232752      java.lang.String
    14:   2317         222432      java.lang.Class
    15:   3288         204440      short[]
    16:   3167         156664      * System ObjArray
    17:   2401         57624       java.util.HashMap$Entry
    18:   666          53280       java.lang.reflect.Method
    19:   161          52808       * ObjArrayKlassKlass
    20:   1808         43392       java.util.Hashtable$Entry

From: eyc...@hotmail.com
To: user@avro.apache.org
Subject: RE: avro object reuse
Date: Wed, 1 Jun 2011 15:14:03 -0700

We make a lot of toString() calls on the avro Utf8 object. Will this cause Jackson calls? Thanks. Ey-Chih

From: sc...@richrelevance.com
To: user@avro.apache.org
Date: Wed, 1 Jun 2011 13:38:39 -0700
Subject: Re: avro object reuse

This is great info. Jackson should only be used once when the file is opened, so this is confusing from that point of view. Is something else using Jackson or initializing an Avro JsonDecoder frequently? There are over 100,000 Jackson DeserializationConfig objects. Another place that parses the schema is in AvroSerialization.java. Does the Hadoop getDeserializer() API method get called once per job, or per record? If this is called more than once per map job, it might explain this. In principle, Jackson is only used by a mapper during initialization. The below indicates that this may not be the case or that something outside of Avro is causing a lot of Jackson JSON parsing. Are you using something that is converting the Avro data to Json form? toString() on most Avro datum objects will do a lot of work with Jackson, for example, but the below are deserializer objects, not serializer objects, so that is not likely the issue.

On 6/1/11 11:34 AM, ey-chih chow eyc...@hotmail.com wrote:

We ran jmap on one of our mappers and found the top usage as follows:

    num   #instances   #bytes      Class description
    ------------------------------------------------
     1:   24405        291733256   byte[]
     2:   6056         40228984    int[]
     3:   388799       19966776    char[]
     4:   101779       16284640    org.codehaus.jackson.impl.ReaderBasedParser
     5:   369623       11827936    java.lang.String
     6:   111059       8769424     java.util.HashMap$Entry[]
     7:   204083       8163320     org.codehaus.jackson.impl.JsonReadContext
     8:   211374       6763968     java.util.HashMap$Entry
     9:   102551       5742856     org.codehaus.jackson.util.TextBuffer
    10:   105854       5080992     java.nio.HeapByteBuffer
    11:   105821       5079408     java.nio.HeapCharBuffer
    12:   104578       5019744     java.util.HashMap
    13:   102551       4922448     org.codehaus.jackson.io.IOContext
    14:   101782       4885536     org.codehaus.jackson.map.DeserializationConfig
    15:   101783       4071320     org.codehaus.jackson.sym.CharsToNameCanonicalizer
    16:   101779       4071160     org.codehaus.jackson.map.deser.StdDeserializationContext
    17:   101779       4071160     java.io.StringReader
    18:   101754       4070160     java.util.HashMap$KeyIterator

It looks like Jackson eats up a lot of memory. Our mapper reads in files of the avro format. Does avro use Jackson a lot in reading the avro files? Is there any way to improve this? Thanks. Ey-Chih Chow

From: sc...@richrelevance.com
To:
Re: avro object reuse
No, that should not trigger Jackson parsing. Schema.parse() and Protocol.parse() do.

On 6/2/11 10:23 AM, ey-chih chow eyc...@hotmail.com wrote:

We create GenericData.Record a lot in our code via new GenericData.Record(schema). Will this generate Jackson calls? Thanks. Ey-Chih Chow

From: sc...@richrelevance.com
To: user@avro.apache.org
Date: Wed, 1 Jun 2011 18:48:15 -0700
Subject: Re: avro object reuse

One thing we do right now that might be related is the following: We keep Avro default Schema values as JsonNode objects. While traversing the JSON Avro schema representation using ObjectMapper.readTree() we remember JsonNodes that are default properties on fields and keep them on the Schema object. If these keep references to the parent (and the whole JSON tree, or worse, the ObjectMapper and input stream) it would be poor use of Jackson by us; although we'd need a way to keep a detached JsonNode or equivalent. However, even if that is the case (which it does not seem to be -- the jmap output has no JsonNode instances), it doesn't explain why we would be calling ObjectMapper frequently. We only call ObjectMapper.readTree(JsonParser) when creating a Schema from JSON. We call JsonNode methods from extracted fragments for everything else. This brings me to the following suspicion based on the data: Somewhere, Schema objects are being created frequently via one of the Schema.parse() or Protocol.parse() static methods.

On 6/1/11 5:48 PM, Tatu Saloranta tsalora...@gmail.com wrote:

On Wed, Jun 1, 2011 at 5:45 PM, Scott Carey sc...@richrelevance.com wrote:

It would be useful to get a 'jmap -histo:live' report as well, which will only have items that remain after a full GC. However, a high churn of short lived Jackson objects is not expected here unless the user is reading Json serialized files and not Avro binary.
Avro Data Files only contain binary encoded Avro content. It would be surprising to see many Jackson objects here if reading Avro Data Files, because we expect to use Jackson to parse an Avro schema from json only once or twice per file. After the schema is parsed, Jackson shouldn't be used. A hundred thousand DeserializationConfig instances means that isn't the case. Right -- it indicates that something (else) is using Jackson; and there will typically be one instance of DeserializationConfig for each data-binding call (ObjectMapper.readValue()), as a read-only copy is made for operation. ... or if something is reading schema that many times, that sounds like a problem in itself. -+ Tatu +-
Re: mixed schema avro data file?
Two options:

* Different files per schema
* One schema that is a union of all schemas you want in the file

Which is best depends on your use case.

On 6/1/11 4:02 PM, Yang tedd...@gmail.com wrote:

our use case is that we have many different types of events, with different schemas. I was thinking to dump them into one file, for easier maintenance of the files. but then I found that all the DataFileWriter, JsonEncoder/Decoder require a schema to be present, so each file can have really only one schema. of course I can create a separate encoder/writer for each record I write. but then there would be no way to parse out the file later. such a mixed schema file can be useful only to humans at best. so generally what is your experience in dealing with serializing objects of different types? do you put them in different files? Thanks Yang
Re: avro object reuse
One thing we do right now that might be related is the following: We keep Avro default Schema values as JsonNode objects. While traversing the JSON Avro schema representation using ObjectMapper.readTree() we remember JsonNodes that are default properties on fields and keep them on the Schema object. If these keep references to the parent (and the whole JSON tree, or worse, the ObjectMapper and input stream) it would be poor use of Jackson by us; although we'd need a way to keep a detached JsonNode or equivalent. However, even if that is the case (which it does not seem to be -- the jmap output has no JsonNode instances), it doesn't explain why we would be calling ObjectMapper frequently. We only call ObjectMapper.readTree(JsonParser) when creating a Schema from JSON. We call JsonNode methods from extracted fragments for everything else. This brings me to the following suspicion based on the data: Somewhere, Schema objects are being created frequently via one of the Schema.parse() or Protocol.parse() static methods. On 6/1/11 5:48 PM, Tatu Saloranta tsalora...@gmail.com wrote: On Wed, Jun 1, 2011 at 5:45 PM, Scott Carey sc...@richrelevance.com wrote: It would be useful to get a 'jmap -histo:live' report as well, which will only have items that remain after a full GC. However, a high churn of short lived Jackson objects is not expected here unless the user is reading Json serialized files and not Avro binary. Avro Data Files only contain binary encoded Avro content. It would be surprising to see many Jackson objects here if reading Avro Data Files, because we expect to use Jackson to parse an Avro schema from json only once or twice per file. After the schema is parsed, Jackson shouldn't be used. A hundred thousand DeserializationConfig instances means that isn't the case. 
Right -- it indicates that something (else) is using Jackson; and there will typically be one instance of DeserializationConfig for each data-binding call (ObjectMapper.readValue()), as a read-only copy is made for operation. ... or if something is reading schema that many times, that sounds like a problem in itself. -+ Tatu +-
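If schemas are indeed being re-parsed on every call, one common remedy is to parse each distinct schema text once and reuse the result. A minimal sketch, assuming nothing about where the repeated parsing actually happens in the user's code: ParsedSchemaCache is a hypothetical helper, and the parser function passed in stands for whatever does the parsing (with Avro it could be one of the Schema.parse() methods mentioned above).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper: parses each distinct schema text once and
// returns the same parsed object on every later request.
final class ParsedSchemaCache<S> {
    private final Map<String, S> cache = new ConcurrentHashMap<>();
    private final Function<String, S> parser;
    private final AtomicInteger parseCount = new AtomicInteger();

    ParsedSchemaCache(Function<String, S> parser) {
        this.parser = parser;
    }

    // computeIfAbsent runs the parser only when the key is not cached yet.
    S get(String schemaJson) {
        return cache.computeIfAbsent(schemaJson, json -> {
            parseCount.incrementAndGet();
            return parser.apply(json);
        });
    }

    // Number of times the parser was actually invoked; useful to confirm
    // that parsing is not happening once per record or per call.
    int parseCount() {
        return parseCount.get();
    }
}
```

Watching parseCount() grow with the number of records read would support the suspicion that something is creating Schema objects far more often than once or twice per file.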
Re: I have written a layout for log4j using avro
To read and write an Avro Data File, use the classes in org.apache.avro.file: http://avro.apache.org/docs/current/api/java/index.html The classes in tools are command line tools that wrap Avro Java APIs. The source code of these can be used as examples for using these APIs. On 5/30/11 8:01 AM, harisgx . hari...@gmail.com wrote: Hi, I have written a layout for log4j using avro. http://bytescrolls.blogspot.com/2011/05/using-avro-to-serialize-logs-in-log4j.html https://github.com/harisgx/avro-log4j But if I want to convert the records to an Avro data file in compressed form, the docs mention using DataFileWriteTool to read newline-delimited JSON records. In the method run(InputStream stdin, PrintStream out, PrintStream err, List<String> args), args must be non-null. How do we populate the args values? thanks -haris
Re: inheritance implementation?
You can do this a few ways. The composition you list will work; the member variable should be of type Fruit. Or you can put the type object inside the fruit: record Fruit { int size; string color; int weight; union { Apple, Orange } type; } record Orange { string skin_thickness; } record Apple { string skin_pattern; } However, Avro's IDL language and the Specific compiler in Java will not compile this into a class hierarchy. You can use a wrapper class in Java to do that. A factory method to create a specific Fruit subclass by inspecting a Fruit would use instanceof to determine the union type and create the corresponding object. One way or the other, you do need to do some instanceof / casting depending on what you are accessing. I have used the pattern above, with the 'type' inside the outer general object. On 5/31/11 11:33 AM, Yang tedd...@gmail.com wrote: I understand that Avro does not have inheritance now, so I am wondering what is the best way to achieve the following goal: I define Apple, Orange, and Fruit. Apple and Orange should ideally derive from Fruit, but since there is no built-in mechanism, we create an internal member for both Apple and Orange, encapsulating the contents of Fruit: Apple: { Fruit: fruit_member, string: pattern_on_skin } Orange: { Fruit: fruit_member, string: skin_thickness } Fruit: { int: size, string: color, int: weight } Say I want to pass objects of both Apple and Orange to some scale to measure the total weight. I can pass them just as Objects: int findTotalWeight(List<Object> l) { int result = 0; for (Object o : l) { result += ??? somehow get access to the fruit_member var ?? } } So what is the best way to fill in the line above? Doing a lot of instanceof is kind of cumbersome. Thanks Yang
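The wrapper-plus-factory idea from the reply can be sketched roughly as below. The Fruit, Apple and Orange classes here are simplified, hand-written stand-ins for what the Specific compiler would generate (real generated classes look different, and their string fields are not necessarily java.lang.String). FruitView, AppleView, OrangeView and FruitScale are hypothetical names introduced only for this illustration.

```java
import java.util.List;

// Simplified stand-ins for the generated record classes (assumption).
final class Apple {
    final String skinPattern;
    Apple(String skinPattern) { this.skinPattern = skinPattern; }
}

final class Orange {
    final String skinThickness;
    Orange(String skinThickness) { this.skinThickness = skinThickness; }
}

final class Fruit {
    final int size;
    final String color;
    final int weight;
    final Object type; // union { Apple, Orange }
    Fruit(int size, String color, int weight, Object type) {
        this.size = size; this.color = color; this.weight = weight; this.type = type;
    }
}

// Hand-written wrapper hierarchy; the instanceof checks live in one factory.
abstract class FruitView {
    protected final Fruit fruit;
    FruitView(Fruit fruit) { this.fruit = fruit; }
    int weight() { return fruit.weight; }
    abstract String describe();

    static FruitView of(Fruit fruit) {
        if (fruit.type instanceof Apple) return new AppleView(fruit, (Apple) fruit.type);
        if (fruit.type instanceof Orange) return new OrangeView(fruit, (Orange) fruit.type);
        throw new IllegalArgumentException("Unknown fruit type: " + fruit.type);
    }
}

final class AppleView extends FruitView {
    private final Apple apple;
    AppleView(Fruit fruit, Apple apple) { super(fruit); this.apple = apple; }
    String describe() { return "apple with " + apple.skinPattern + " skin"; }
}

final class OrangeView extends FruitView {
    private final Orange orange;
    OrangeView(Fruit fruit, Orange orange) { super(fruit); this.orange = orange; }
    String describe() { return "orange with " + orange.skinThickness + " skin"; }
}

final class FruitScale {
    // With the type nested inside Fruit, the shared fields need no casting.
    static int findTotalWeight(List<Fruit> fruits) {
        int result = 0;
        for (Fruit f : fruits) {
            result += f.weight;
        }
        return result;
    }
}
```

With this layout the scale takes a List of Fruit and reads weight directly, so instanceof is only needed in FruitView.of() when the Apple- or Orange-specific fields are wanted.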