Re: Avro file Compression

2013-08-22 Thread Scott Carey
The file format compresses in blocks, and the block size is configurable.
This will compress across objects in a block, so it works for small objects
as well as large ones -- as long as the total block size is large enough.

I have found that I can increase the ratio of compression by ordering the
objects carefully so that neighbor records have more in common.
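
For reference, a minimal Java sketch (file names, schema, and block size are illustrative, not from this thread) of configuring the codec and block size on a data file writer:

  Schema schema = new Schema.Parser().parse(new File("record.avsc"));
  DataFileWriter<GenericRecord> writer =
      new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
  writer.setCodec(CodecFactory.deflateCodec(6));  // each block is compressed as a unit
  writer.setSyncInterval(1024 * 1024);            // ~1MB blocks: compression spans many records
  writer.create(schema, new File("records.avro"));
  // append records here, ideally ordered so neighbor records have more in common
  writer.close();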

From:  Bill Baird bill.ba...@traxtech.com
Reply-To:  user@avro.apache.org user@avro.apache.org
Date:  Thursday, August 22, 2013 7:47 AM
To:  user@avro.apache.org user@avro.apache.org
Subject:  Re: Avro file Compression

As with any compression, how much you get depends on the size and nature of
the data.  I have objects that take 4 or 5k unserialized and serialize to
1.5 to 3k, or about 2 to 1.  However, for the same object structure (which
contains several nested arrays ... lots of strings, numbers ... basic
business data), a payload that is 17MB uncompressed deflates to 1MB (about
17 to 1).  For very small objects, deflate will actually produce a larger
output, but it does quite well as the size of the data being deflated grows.

Bill


On Wed, Aug 21, 2013 at 11:31 PM, Harsh J ha...@cloudera.com wrote:
 Can you share your test? There is an example at
 http://svn.apache.org/repos/asf/avro/trunk/lang/c/examples/quickstop.c
 which has the right calls for using a file writer with a deflate codec
 - is yours similar?
 
 On Mon, Aug 19, 2013 at 9:42 PM, amit nanda amit...@gmail.com wrote:
  I am trying to compress the Avro files that I am writing; for that I am using
  the latest Avro C with the deflate option, but I am not able to see any
  difference in the file size.
 
  Is there any special type of data that this works on, or is there any more
  setting that needs to be done for this to work?
 
 
 
 
 
 --
 Harsh J





Re: Avro Schema to SQL

2013-06-28 Thread Scott Carey
Not all Avro schemas can be converted to SQL.  Primarily, unions can pose
challenges, as well as recursive references.

Nested types are a mixed bag -- some SQL-related systems have rich support
for nested types and/or JSON (e.g. PostgreSQL), which can make this easier,
while others are more crude (MySQL, Hive).

With Unions, in some cases a union field can be expanded/flattened into
multiple fields, of which only one is not null.  Recursive types can be
transformed into key references.

In general, all of these transformation strategies require decisions by the
user and potentially custom work depending on what database is involved.

Traversing an Avro Schema in Java is done via the Schema API; the Javadoc
explains it and there are many examples in the Avro source code.  The type
of schema must be checked, and for each nested type a different descent into
its contained types can occur.
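
As a rough illustration (not from the original message), a minimal recursive walk over a Schema could look like the following; note that recursive record references need a visited set, which is omitted here:

  import org.apache.avro.Schema;
  import org.apache.avro.Schema.Field;

  public class SchemaWalker {
    // Visit every contained type, switching on the schema's type.
    // A real schema-to-SQL converter would emit DDL here instead of printing.
    static void walk(Schema s, String path) {
      switch (s.getType()) {
        case RECORD:
          for (Field f : s.getFields()) walk(f.schema(), path + "." + f.name());
          break;
        case UNION:
          for (Schema branch : s.getTypes()) walk(branch, path);
          break;
        case ARRAY:
          walk(s.getElementType(), path + "[]");
          break;
        case MAP:
          walk(s.getValueType(), path + "{}");
          break;
        case ENUM:
          System.out.println(path + " enum " + s.getEnumSymbols());
          break;
        default:
          System.out.println(path + " " + s.getType());  // primitive leaf
      }
    }
  }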

From:  Avinash Dongre dongre.avin...@gmail.com
Reply-To:  user@avro.apache.org user@avro.apache.org
Date:  Wednesday, June 19, 2013 2:31 AM
To:  user@avro.apache.org user@avro.apache.org
Subject:  Avro Schema to SQL

Is there a known tool/framework available to convert an Avro Schema into SQL?
If not, how do I iterate over the schema to find out what records and enums
are there?  I can think of how to achieve this with a simple schema, but I am
not able to figure out a way for nested schemas.



Thanks
Avinash





Re: Reader / Writer terminology

2013-06-10 Thread Scott Carey
It can be a view, or a transformation.  You might view data_a with
schema_b.  Or, you might take binary data, conforming to schema A and
directly re-write it to binary data, conforming to schema B.  Most Avro
APIs don't yet handle workflows that are not 'read' and 'write' --
transformations between object representations and serialized forms.

The general case includes all transformation classes as well as views.

On 6/8/13 10:16 PM, Gregory (Grisha) Trubetskoy gri...@apache.org
wrote:



On Sat, 8 Jun 2013, Scott Carey wrote:

 In a more general sense it is simply from and to -- One might move
 from schema A to B without serialization at all, transforming a data
 structure, or simply want a view of data in the form of A as if it was
 in B.

I'd like to zoom in on this specific point for a little, if I may.

I think serialization is a red herring. It's always a transformation of
one data structure to another, because a claim could be made that one
cannot transform a serialized form without loading it into a data
structure first.

In fact, I think it's always the latter case, a *view*, as you aptly
described it. Which makes it not so much a from and to, but more a
view A as B?

Something like:

value_b = value_a.view_as(schema_b)

Just my late-night $0.02.

Grisha




Re: Reader / Writer terminology

2013-06-08 Thread Scott Carey
I'm about to make all of this even more confusing...

For pair-wise resolution when the operation is deserialization, 'reader' and
'writer' make sense.  In a more general sense it is simply 'from' and 'to'
-- one might move from schema A to B without serialization at all,
transforming a data structure, or simply want a view of data in the form of
A as if it was in B.   There aren't any clear naming winners and many sound
good for one use case but worse for others:  'source' and 'destination',
'source' and 'sink', 'original' and 'target', 'expected' and 'actual',
'reader' and 'writer', 'resolver' and 'resolvee', 'sender' and 'receiver'.

As part of AVRO-1124 I have recently met in person with a few folks who
needed enhancements to that ticket (the discussion and conclusion will be
added there shortly, prior to the next patch version).
The result is that two names are not enough, because expressing resolution
of _sets_ of schemas is more complicated than pairs.

When describing a set of schemas that represent some sort of data that may
have been persisted, six states are needed.   The six states are made up of
two dimensions.
* The reader dimension is binary, and represents whether a schema is used
for reading or not (is ever a 'to', 'reader', or 'target').
* The write dimension has three states in the 'write' spectrum:  Writer
(an active 'from' or 'source'), Written (persisted data, not actively
written), and None (not used for writing).

The naming of these will be confusing, as part of AVRO-1124 we'll have to
have names that are as clear as possible.  Currently I have enumerations:
ReadState.READER and ReadState.NONE;  WriteState.WRITER, WriteState.WRITTEN,
and WriteState.NONE.   I am not a big fan of these names, and am open to
suggestions.   A consistent approach in naming is important.   For example,
I previously had WriteState.WRITTEN named WriteState.READABLE.  That name
best represents the idea of what the state is for, but is extremely
confusing.

These six states relate with one schema resolution rule:
Schemas in state ReadState.READER must be able to read all schemas with
WriteState.WRITER or WriteState.WRITTEN.

and one rule for persisting data:
Data must not be persisted unless the corresponding schema is in state
WriteState.WRITER.

Without going into the details, this allows for any schema evolution use
case over a set of schemas with both ephemeral data and persisted data.
Schemas can transition from one state to another, as long as the constraint
rules above are met at all times.


Reader and Writer have been useful because they correlate with other
meaningful names well -- hypothetically:
   boolean mySchema.canRead(Schema writer) and
   boolean mySchema.canBeReadWith(Schema reader)

A naming scheme for describing schema resolution and evolution will need to
work across many use cases and be useful for describing relationships
between schemas.  Describing only the pair-wise resolution is not enough.
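
(As an aside, recent releases of the Java library -- around 1.7.2 and later, if I recall correctly -- include a SchemaCompatibility helper that approximates the hypothetical canRead above for the pair-wise case; a minimal sketch:)

  // Returns true if 'reader' can read data written with 'writer'.
  static boolean canRead(Schema reader, Schema writer) {
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
    return result.getType()
        == SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
  }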

On 6/8/13 12:44 AM, Doug Cutting cutt...@apache.org wrote:

 Originally I used the term 'actual' for the schema of the data written and
 'expected' for the schema that the reader of the data wished to see it as.
 Some found those terms confusing and suggested that 'writer' and 'reader' were
 more intuitive, so we started using those instead. That unfortunately seems
 not to have resolved the confusion entirely.
 
 Perhaps we should improve the documentation around this? Do you have any
 specific suggestions about how that might be done?
 
 Doug
 
 On Jun 7, 2013 10:12 PM, Gregory (Grisha) Trubetskoy gri...@apache.org
 wrote:
 
 I'm curious how the Reader and Writer terminology came about, and, most
 importantly, whether it's as confusing to the rest of you as it is to me?
 
 As I understand it, the principal analogy here is from the RPC world - a
 process A writes some Avro to process B, in which case A is the writer and B
 is the reader.
 
 And there is the possibility that the schema which B may be expecting isn't
 what A is providing, thus B may have to do some conversion on its end to grok
 it, and Avro schema resolution rules may make this possible.
 
 So far so good. This is where it becomes confusing. I am lost on how the act
 of reading or writing is relevant to the task at hand, which is conversion of
 a value from one schema to another.
 
 As I read stuff on the lists and the docs, I couldn't help noticing words
 such as original, first, second, actual, expected being used
 alongside reader and writer as clarification.
 
 What would be wrong with 'source' and 'destination' schemas?
 
 Consider the following line (from Avro-C):
 
 writer_iface = avro_resolved_writer_new(writer_schema, reader_schema);
 
 Here writer in resolved_writer and writer_schema are unrelated. The former
 refers to the fact that this interface will be modifying (writing to) an
 object, the latter is referring to the writer (source, original, a.k.a
 actual) schema.
 
 Wouldn't this read better as:
 
 writer_iface = 

Re: Compressed Avro vs. compressed Sequence - unexpected results?

2013-05-23 Thread Scott Carey
For your avro files, double check that snappy is used (use avro-tools to
peek at the metadata in the file, or simply view the head in a text
editor; the compression codec used will be in the header).

Snappy is very fast, most likely the time to read is dominated by
deserialization.  Avro will be slower than a trivial deserializer (but
more compact), but being many times slower is not expected.  I am not
entirely sure how Hive's Avro serDe works -- it is possible there is a
performance issue there.  If you were able to get a handful of stack
traces (kill -3 or jstack) from the mapper tasks (or a profiler output),
it would be very insightful.
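
(A quick way to check is the avro-tools 'getmeta' command, which prints the header metadata including avro.codec. Programmatically, a minimal sketch -- the file name is illustrative:)

  DataFileReader<GenericRecord> r = new DataFileReader<GenericRecord>(
      new File("part-00000.avro"), new GenericDatumReader<GenericRecord>());
  // "avro.codec" is recorded in the file header; null means no codec entry
  System.out.println("codec: " + r.getMetaString("avro.codec"));
  r.close();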


On 5/23/13 12:42 AM, nir_zamir nir.za...@gmail.com wrote:

Hi,

We're examining the storage of our data in Snappy-compressed files. Since
we
want the data's structure to be self contained, we checked it with Avro
and
with Sequence (both are splittable, which should best utilize our
cluster).

We tested the performance on a 12GB data (CSV) file, and a 4-nodes cluster
on production environment (very strong machines).

Compression

What we did here (for test simplicity) is create two Hive tables:
Avro-based
and Sequence-based. Then we enabled Snappy compression and INSERTed the
data
from the RAW table (consisting of the 12GB file).

In terms of compression rate, Avro was better: 72% vs. 57%.
In both cases there were 45 mappers, and CPU/Mem were very far from their
limit on all machines.
Since there was no reduce operator, this created 45 files.

Compression time for Avro took longer: 1.75 minutes vs. 1.2 minutes for
sequence files.

Decompression

What we did here was this Hive query:
SELECT COUNT(1) FROM table-name;

Here was the real difference: it took Avro about *75% longer* to perform
this (3 minutes vs. 0.5 minute).
This was very surprising since for our strong machines the I/O would be
expected to be the bottleneck, and since Avro files are smaller, we expected
them to be faster to decompress.
The number of mappers in both cases was similar (14 vs. 17) and again,
CPU/Mem didn't seem to be exhausted.
Since our most critical time is reading, this issue makes it hard for us to
be using Avro.

Maybe we're doing something wrong - your input would be much appreciated!

Thanks,
Nir







Re: using Avro unions with HIVE

2013-05-23 Thread Scott Carey
The Hive mailing list would have more info on the Avro SerDe usage.

In general, a system that does not have union types, like Hive (or Pig,
etc.), has to expand a union into multiple fields if there is more than one
non-null type -- at most one branch of the union is not null.

For example a record with fields:

  {"name": "timestamp", "type": "long", "default": -1}
  {"name": "ipAddress", "type": ["IPv4", "IPv6"]}

where IPv4 and IPv6 are previously defined types, would have to expand to
three fields "timestamp", "ipAddress:IPv4", and "ipAddress:IPv6", where only
one of the last two is not null in any given record.

I do not know what Hive's Avro SerDe does with unions.

On 5/23/13 7:15 AM, Ran S r...@liveperson.com wrote:

Hi,
We started to work with Avro in CDH4 and to query the Avro files using
Hive.
This does work fine for us, except for unions.
We do not understand how to query the data inside a union using Hive.

For example, let's look at the following schema:

{
  "type": "record",
  "name": "event",
  "namespace": "com.mysite",
  "fields": [
    {
      "name": "header",
      "type": {
        "type": "record", "name": "CommonHeader",
        "fields": [
          { "name": "eventTimeStamp", "type": "long", "default": -1 },
          { "name": "globalUserId", "type": ["null", "string"], "default": null }
        ]
      },
      "default": null
    },
    {
      "name": "eventbody",
      "type": {
        "type": "record", "name": "eventbody",
        "fields": [
          {
            "name": "body",
            "type": [
              "null",
              {
                "type": "record",
                "name": "event1",
                "fields": [
                  { "name": "event1Header",
                    "type": ["null", { "type": "array", "items": "string" }],
                    "default": null },
                  { "name": "event1Body",
                    "type": ["null", { "type": "array", "items": "string" }],
                    "default": null }
                ]
              },
              {
                "type": "record",
                "name": "event2",
                "fields": [
                  { "name": "page",
                    "type": { "type": "record", "name": "URL",
                              "fields": [{ "name": "url", "type": "string" }] },
                    "default": null },
                  { "name": "referrer", "type": "string", "default": null }
                ]
              }
            ],
            "default": null
          }
        ]
      },
      "default": null
    }
  ]
}

Note that body is a union of three types:
null, event1 and event2

So if I want to query fields inside event1, I first need to access it.
I then set a HiveQL like this:
SELECT eventbody.body.??? from SRC

My question is: what should I put in the ??? above to make this work?

Thank you,
Ran







Re: Newb question on importing JSON and defaults

2013-05-23 Thread Scott Carey


On 5/22/13 2:26 PM, Gregory (Grisha) Trubetskoy gri...@apache.org
wrote:


Hello!

I have a test.json file that looks like this:

{"first": "John", "last": "Doe", "middle": "C"}
{"first": "John", "last": "Doe"}

(Second line does NOT have a middle element).

And I have a test.schema file that looks like this:

{"name": "test",
 "type": "record",
 "fields": [
    {"name": "first",  "type": "string"},
    {"name": "middle", "type": "string", "default": ""},
    {"name": "last",   "type": "string"}
]}

I then try to use fromjson, as follows, and it chokes on the second line:

$ java -jar avro-tools-1.7.4.jar fromjson --schema-file test.schema \
    test.json > test.avro
Exception in thread main org.apache.avro.AvroTypeException: Expected
field name not found: middle
 at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
 at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
 at org.apache.avro.io.JsonDecoder.advance(JsonDecoder.java:139)
 at 
org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:219)
 at 
org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:214)
 at 
org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107
)
 at 
org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.j
ava:348)
 at 
org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.j
ava:341)
 at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:15
4)
 at 
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.j
ava:177)
 at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:14
8)
 at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:13
9)
 at 
org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:105)
 at org.apache.avro.tool.Main.run(Main.java:80)
 at org.apache.avro.tool.Main.main(Main.java:69)


The short story is - I need to convert a bunch of JSON where an element
may not be present sometimes, in which case I'd want it to default to
something sensible, e.g. blank or null.

According to the Schema Resolution if the reader's record schema has a
field that contains a default value, and writer's schema does not have a
field with the same name, then the reader should use the default value
from its field.

I'm clearly missing something obvious, any help would be appreciated!

There are two things that seem to be missing here:
 1. The fromjson tool is configuring the writer's schema (and the reader's)
to be the one you provided.   Avro is expecting every
JSON fragment you are giving it to have the same schema.
 2. The tool will not work for all arbitrary JSON; it expects JSON in the
format that the Avro JSON Encoder writes.  There are a few differences
in expectations, primarily when disambiguating union types and maps from
records.

To perform schema evolution while reading, you may need to separate JSON
fragments missing middle from those that have it, and run the tool
twice, with corresponding schemas for each case.
Alternatively the tool could be modified to handle schema resolution or
deal with different JSON encodings as well
(tools/src/main/java/org/apache/avro/tool/DataFileWriteTool).

Alternatively, you can avoid schema resolution and write two files, one
with data in each schema after separating the records.   Then you can deal
with schema resolution in a later pass through the data with other tools
(e.g. data file reader + writer), or resolve the data into the schema you
wish only lazily when reading.
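
As a sketch of that lazy, resolve-on-read approach (the data file name is hypothetical; test.schema is the reader's schema from above):

  Schema readerSchema = new Schema.Parser().parse(new File("test.schema"));
  // the writer's schema comes from the data file itself; the reader's schema
  // supplies the default for the missing "middle" field during resolution
  GenericDatumReader<GenericRecord> datumReader =
      new GenericDatumReader<GenericRecord>(null, readerSchema);
  DataFileReader<GenericRecord> fileReader =
      new DataFileReader<GenericRecord>(new File("no-middle.avro"), datumReader);
  for (GenericRecord rec : fileReader) {
    System.out.println(rec);
  }
  fileReader.close();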




Grisha





Re: Best practices for java enums...?

2013-05-13 Thread Scott Carey
It would be nice to be able to reference an existing class when using the
specific compiler.

If you have an existing com.mycompany.Foo enum (or SpecificRecord, or
Fixed type), then provide the specific compiler with the type prior to
parsing the schema, it could accept a reference:

{"type": "record", "name": "com.mycompany.Rec", "fields": [
  {"name": "fooField", "type": "com.mycompany.Foo"}
]}

Ordinarily, this would fail to compile, but given a reference to an existing
compatible type, such as an enum, it could work.

-Scott

On 5/9/13 4:39 PM, Felix GV fe...@mate1inc.com wrote:

 Hello, 
 
 I'm currently writing an avro schema which includes an enum field that I
 already have as a java enum in my application.
 
 At first, I named the avro field with the same fully qualified name (package
 name dot enum name) as my existing java enum. I then ran the avro compiler and
 found that it overwrote my existing java enum with an avro-generated enum.
 
 I find this slightly annoying because my java enum had comments documenting
 the purpose of each enum value, and the avro-generated enum doesn't have this.
 
 I see two or three potential solutions:
 1. Accepting to replace my current enum with the avro-generated one in my code
 base, which I feel I cannot document properly (since I have access to just one
 doc attribute for the whole enum, instead of per symbol). On a side note, I
 haven't found any way to have a multi-line doc attribute in an avro schema, so
 that makes things slightly more annoying still. I wouldn't mind settling on
 using the avro-generated enums without documentation per symbol if at least I
 could have one big doc/comment that documents all symbols at once, but since
 it seems the doc attribute must be a one-liner, this is starting to be a
 little too messy for my taste...
 2. Maintaining two separate enums: my manually written (and documented) enum
 as well as the avro-generated enum. For now, I think this is what I'm going to
 do, because those enums have little chances of changing anyway, but from a
 maintenance standpoint, it seems pretty horrendous...
 3. I guess there's a third way, which would involve creating a script that
 backs up my enums, compiles all my schemas, and then restores my backed up
 enums, but this also seems ultra messy :( ... I haven't tested if it'd work
 (since the manually written enum is missing the $SCHEMA field), but I guess it
 would... 
 Am I being OCD about this? or is this a concern that others have bumped into?
 How do you guys deal with this? Did I miss anything in the way avro works?
 
 P.S.: I've seen that reflect mappings may be able to work with arbitrary java
 enums, but since they seem to be discouraged for performance reasons, I haven't
 dug much in this direction. I'd like to keep using .avsc files if possible,
 but if there's a better way, I can certainly try it.
 
 P.P.S.: We're currently using avro 1.6.1, but if the latest version provides a
 nice way of handling my use case, then I guess I could get us to upgrade...
 
 Thanks a lot :) !
 
 --
 Felix




Re: Jackson and Avro, nested schema

2013-05-13 Thread Scott Carey
It appears that you will need to modify the JSON decoder in Avro to
achieve this.

The JSON encoding in Avro was built to encode any Avro schema into JSON
with 100% fidelity, so that the decoder can read it back.  The decoder
does not work with arbitrary JSON.

This is because there are ambiguities:

In your example:
{
  "id": "doc1",
  "fields": {
    "foo": "bar",
    "spam": "eggs",
    "answer": 42,
    "x": {"a": 1}
  }
}


This can be interpreted by Avro in several ways.  Is the value of "fields"
a map or a record with four fields?  Is the value of "x" a map or a record
with one field?  Is "answer" an int, long, float, or double?  Is "doc1" a
string or a bytes literal?

If you want to bake in the assumption that it is maps, all the way down,
you'll need to extend / modify the JSON Decoder.

It would be a useful contribution to have a generic JSON schema and
decoder for it.  We could have a JSON schema record (one field, a union
of null, string, double, and map of string to self) and this type's field
would automatically be un-nested by the special JSON decoder and not
interpreted as a record.

-Scott

On 5/8/13 11:49 AM, David Arthur mum...@gmail.com wrote:

I'm attempting to use Jackson and Avro together to map JSON documents to
a generated Avro class. I have looked at the Json schema included with
Avro, but this requires a top-level value element which I don't want.
Essentially, I have JSON documents that have a few typed top level
fields, and one field called fields which is more or less arbitrary
JSON.

I've reduced this down to strings and ints for simplicity

My first attempt was:

  {
    "type": "record",
    "name": "Json",
    "fields": [
      {
        "name": "value",
        "type": [ "string", "int", {"type": "map", "values": "Json"} ]
      }
    ]
  },

  {
    "name": "Document",
    "type": "record",
    "fields": [
      {
        "name": "id",
        "type": "string"
      },
      {
        "name": "fields",
        "type": {"type": "map",
                 "values": ["string", "int", {"type": "map", "values": "Json"}]}
      }
    ]
  }

Given a JSON document like:

{
   "id": "doc1",
   "fields": {
     "foo": "bar",
     "spam": "eggs",
     "answer": 42,
     "x": {"a": 1}
   }
}

this seems like it should work, but it doesn't. When I turn around and try to
serialize this object with Avro, I get the following exception:

java.lang.ClassCastException: java.lang.Integer cannot be cast to
org.apache.avro.generic.IndexedRecord
 at org.apache.avro.generic.GenericData.getField(GenericData.java:526)
 at org.apache.avro.generic.GenericData.getField(GenericData.java:541)
 at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.
java:104)
 at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:6
6)
 at 
org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.jav
a:173)
 at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:6
9)
 at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:7
3)
 at 
org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.jav
a:173)
 at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:6
9)
 at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.
java:106)
 at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:6
6)
 at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:5
8)

My best guess is that since the fields field is a union, the
representation of it in the generated class is an Object, which Jackson
happily throws whatever into.

If I change my schema to explicitly use int instead of the Json
type, it works fine for my test document

 "type": {"type": "map", "values": ["string", "int", {"type":
"map", "values": "int"}]}

However now I need to enumerate the types for each level of nesting I
want. This is not ideal, and limits me to a fixed level of nesting

To be clear, my issue is not modelling my schema in Avro, but rather
getting Jackson to map JSON onto the generated classes without too much
pain. I have also tried
https://github.com/FasterXML/jackson-dataformat-avro without much luck.

Any help is appreciated

-David









Re: avro.java.string vs utf8 compatibility in recent pig and hive versions

2013-05-13 Thread Scott Carey
The change in the Pig loader in PIG-3297 seems correct -- they must use
CharSequence, not Utf8.

I suspect that the Avro 1.5.3.jar does not respect the avro.java.string
property and is using Utf8 (for the API that Pig is using), but have not
confirmed it.  avro.java.string is an optional hint for the Java
implementation.

On the Avro side, we may be able to make a modification that allows one to
configure a decoder or encoder to ignore the avro.java.string property.
Perhaps it could look for a system property as an override to help with
cases like this.


On 5/10/13 3:16 PM, Michael Moss michael.m...@gmail.com wrote:

 Hello, 
 
 It looks like representing avro strings as Utf8 provides some interesting
 performance enhancements, but I'm wondering if folks out there are actually
 using it in practice, or have had any issues with it.
 
 We have recently run into an issue where our avro files which represents
 strings as avro.java.string are causing ClassCastExceptions because Pig and
 Hive are expecting them to be Utf8. The exceptions occur when using
 avro-1.7.x.jar, but dissapear when using version avro-1.5.3.jar.
 
 I'm wondering if this is something that should be addressed in the avro jar,
 or in pig and hive like this thread suggests:
 https://issues.apache.org/jira/browse/PIG-3297
 
 Here are the exceptions we are seeing:
 Hive:
 Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to
 org.apache.avro.util.Utf8at
 org.apache.hadoop.hive.serde2.avro.AvroDeserializer.deserializeMap(AvroDeseria
 lizer.java:253)
 
 Pig:
 Caused by: java.io.IOException: java.lang.ClassCastException: java.lang.String
 cannot be cast to org.apache.avro.util.Utf8
 at 
 
org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:275
)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.n
 extKeyValue(PigRecordReader.java:194)
 at 
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.
 java:532)
 
 Thanks.
 
 -Mike
 
 




Re: Hadoop serialization DatumReader/Writer

2013-05-13 Thread Scott Carey
Making the DatumReader/Writers configurable would be a welcome addition.

Ideally, much more of what goes on there could be:
 1. configuration driven
 2. pre-computed to avoid repeated work during decoding/encoding

We do some of both already.  The trick is to do #1 without impacting
performance and #2 requires a bigger overhaul.

If you would like, a contribution including a Clojure related maven module
or two that depends on the Java stuff would be a welcome addition and
allow us to identify compatibility issues as we change the Java library
over time.


On 5/8/13 3:33 PM, Marshall Bockrath-Vandegrift llas...@gmail.com
wrote:

Hi all:

Is there a reason Avro's Hadoop serialization classes don't allow
configuration of the DatumReader and DatumWriter classes?

My use-case is that I'm implementing Clojure DatumReader and -Writer
classes which produce and consume Clojure's data structures directly.
I'd like to then extend that to Hadoop MapReduce jobs which operate in
terms of Clojure data, with Avro handling all de/serialization directly
to/from that Clojure data.

Am I going around this in a backwards fashion, or would a patch to allow
configuration of the Hadoop serialization DatumReader/Writers be
welcome?

-Marshall





Re: map/reduce of compressed Avro

2013-04-29 Thread Scott Carey
Martin said it already, but I will emphasize:

Avro data files are splittable and can support multiple mappers no matter
what codec is used for compression.  This is because avro files are block
based, and only use the compression within the block.  I recommend
starting with gzip compression, and moving to snappy only if deflate
compression level '1' is not fast enough.

For more information on avro data files, see:
http://avro.apache.org/docs/current/spec.html#Object+Container+Files
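
A minimal sketch of choosing the codec on the writer (variable names are illustrative; deflate level and snappy shown):

  DataFileWriter<GenericRecord> writer =
      new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
  writer.setCodec(CodecFactory.deflateCodec(1));   // fast deflate; levels 1-9 trade speed for ratio
  // writer.setCodec(CodecFactory.snappyCodec());  // faster still, usually a larger file
  writer.create(schema, new File("data.avro"));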



On 4/22/13 11:47 PM, nir_zamir nir.za...@gmail.com wrote:

Thanks Martin.

What will happen if I try to use an indexed LZO-compressed avro file? Will
it work and utilize the index to allow multiple mappers?

I think that for Snappy for example, the file is splittable and can use
multiple mappers, but I haven't tested it yet - would be glad if anyone
has
any experience with that.

Thanks!
Nir.







Re: Could specific records implement the generic API as well?

2013-04-15 Thread Scott Carey
Which aspect of the generic API are you most interested in?  The builder,
getters, or setters?

Most people that use Specific records do so for compile time type safety, so
adding 'set(foo, fooval)' is not desired for those users.   On the other
hand it is certainly possible to add it.

The code generated by the specific code generation utility uses templates,
one can add a template that extends what is produced to include generic API
bits.

-Scott

On 4/15/13 11:23 AM, Christophe Taton ta...@wibidata.com wrote:

 Hi, 
 Is there a reason for specific records to not implement the generic API?
 I didn't find any obvious technical reason, but maybe I missed something.
 Thanks,
 C.




Re: Could specific records implement the generic API as well?

2013-04-15 Thread Scott Carey
I would like to figure out how to make SpecificRecord and GenericRecord
immutable in the longer term (or as an option with the code generation
and/or builder).  The builder is the first step, but setters are the
enemy.  Is there a way to do this that does not introduce new mutators for
all SpecificRecords?



On 4/15/13 3:43 PM, Doug Cutting cutt...@apache.org wrote:

On Mon, Apr 15, 2013 at 2:21 PM, Christophe Taton ta...@wibidata.com
wrote:
 If you think it's a meaningful addition, I'm happy to make the change.

The two methods I wrote above could be added to SpecificRecordBase and
it could then be declared to implement GenericRecord.

I think GenericRecordBuilder could be used to build specific records
with a few additional changes:
 - change the type of the 'record' field from GenericData.Record to
GenericRecord.
 - replace the call to 'new GenericData.Record()' to
'(GenericRecord)data().newRecord(null, schema())'
 - add a constructor that accepts a GenericData instance, instead of
calling GenericData.get().

Then you could use new GenericRecordBuilder(SpecificData.get(),
schema) to create specific records.

Doug




Re: Issue writing union in avro?

2013-04-07 Thread Scott Carey
It is well documented in the specification:
http://avro.apache.org/docs/current/spec.html#json_encoding

I know others have overridden this behavior by extending GenericData and/or
the JsonDecoder/Encoder.  It wouldn't conform to the Avro Specification
JSON, but you can extend Avro to do what you need it to.

The reason for this encoding is to make sure that round-tripping data from
binary to JSON and back results in the same data.  Additionally, unions can
be more complicated and contain multiple records, each with different names.
Disambiguating the value requires more information, since several Avro data
types map to the same JSON data type.  If the schema is a union of bytes and
string, is "hello" a string or a bytes literal?  If it is a union of a map and
a record, is {"state": "CA", "city": "Pittsburgh"} a record with two string
fields, or a map?   There are other approaches, and for some users perfect
transmission of types is not critical.  Generally speaking, if you want to
output Avro data as JSON and consume it as JSON, the extra data is not helpful.
If you want to read it back in as Avro, you're going to need the info to
know which branch of the union to take.
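
For reference, with a schema of ["null", "string"] the spec's JSON encoding labels the non-null branch, e.g.:

  null                     (the null branch)
  {"string": "hello"}      (the string branch)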

On 4/6/13 6:49 PM, Jonathan Coveney jcove...@gmail.com wrote:

 Err, it's the output format that deserializes the json and then writes it in
 the binary format, not the input format. But either way the general flow is
 the same. 
 
 As a general aside, is it the case that the Java behavior is correct, in that
 when writing a union it should be {"string": "hello"} or whatnot? Seems like we
 should probably add that to the documentation if it is a requirement.
 
 
 2013/4/7 Jonathan Coveney jcove...@gmail.com
 Scott, 
 
 Thanks for the input. The use case is that a number of our batch processes
 are built on python streaming. Currently, the reducer will output a json
 string as a value, and then the input format will deserialize the json, and
 then write it in binary format.
 
 Given that our use of python streaming isn't going away, any suggestions on
 how to make this better? Is there a better way to go from a json string to
 writing binary avro data?
 
 Thanks again
 Jon
 
 
 2013/4/6 Scott Carey scottca...@apache.org
 This is due to using the JSON encoding for avro and not the binary encoding.
 It would appear that the Python version is a little bit lax on the spec.
 Some have built variations of the JSON encoding that do not label the union,
 but there are drawbacks to this too, as the type can be ambiguous in a very
 large number of cases without a label.
 
 Why are you using the JSON encoding for Avro?  The primary purpose of the
 JSON serialization form as it is now is for transforming the binary to human
 readable form. 
 Instead of building your GenericRecord from a JSON string, try using
 GenericRecordBuilder.
 
 -Scott
 
 On 4/5/13 4:59 AM, Jonathan Coveney jcove...@gmail.com wrote:
 
 Ok, I figured out the issue:
 
 If you make string c the following:
  String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256},
  \"favorite_color\": {\"string\": \"blue\"}}";
 
 Then this works.
 
 This represents a divergence between the python and the Java
 implementation... the above does not work in Python, but it does work in
 Java. And of course, vice versa.
 
 I think I know how to fix this (and can file a bug with my reproduction and
 the fix), but I'm not sure which one is the expected case? Which
 implementation is wrong?
 
 Thanks
 
 
 2013/4/5 Jonathan Coveney jcove...@gmail.com
 Correction: the issue is when reading the string according to the avro
 schema, not on writing. it fails before I get a chance to write :)
 
 
 2013/4/5 Jonathan Coveney jcove...@gmail.com
 I implemented essentially the Java avro example but using the
 GenericDatumWriter and GenericDatumReader and hit an issue.
 
 https://gist.github.com/jcoveney/5317904
 
 This is the error:
 Exception in thread main java.lang.RuntimeException:
 org.apache.avro.AvroTypeException: Expected start-union. Got
 VALUE_NUMBER_INT
 at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45)
 Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got
 VALUE_NUMBER_INT
 at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
 at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
 at 
 org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
 at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
 at 
 org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:1
 52)
 at 
 org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.
 java:177)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:1
 48)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:1
 39)
 at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38)
 
 Am I doing something wrong? Is this a bug? I'm digging in now

Re: Issue writing union in avro?

2013-04-06 Thread Scott Carey
This is due to using the JSON encoding for avro and not the binary encoding.
It would appear that the Python version is a little bit lax on the spec.
Some have built variations of the JSON encoding that do not label the union,
but there are drawbacks to this too, as the type can be ambiguous in a very
large number of cases without a label.

Why are you using the JSON encoding for Avro?  The primary purpose of the
JSON serialization form as it is now is for transforming the binary to human
readable form. 
Instead of building your GenericRecord from a JSON string, try using
GenericRecordBuilder.

-Scott

On 4/5/13 4:59 AM, Jonathan Coveney jcove...@gmail.com wrote:

 Ok, I figured out the issue:
 
 If you make string c the following:
 String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256},
 \"favorite_color\": {\"string\": \"blue\"}}";
 
 Then this works.
 
 This represents a divergence between the python and the Java implementation...
 the above does not work in Python, but it does work in Java. And of course,
 vice versa.
 
 I think I know how to fix this (and can file a bug with my reproduction and
 the fix), but I'm not sure which one is the expected case? Which
 implementation is wrong?
 
 Thanks
 
 
 2013/4/5 Jonathan Coveney jcove...@gmail.com
 Correction: the issue is when reading the string according to the avro
 schema, not on writing. it fails before I get a chance to write :)
 
 
 2013/4/5 Jonathan Coveney jcove...@gmail.com
 I implemented essentially the Java avro example but using the
 GenericDatumWriter and GenericDatumReader and hit an issue.
 
 https://gist.github.com/jcoveney/5317904
 
 This is the error:
 Exception in thread main java.lang.RuntimeException:
 org.apache.avro.AvroTypeException: Expected start-union. Got
 VALUE_NUMBER_INT
 at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45)
 Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got
 VALUE_NUMBER_INT
 at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
 at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
 at 
 org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
 at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
 at 
 org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
 at 
 org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.jav
 a:177)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
 at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38)
 
 Am I doing something wrong? Is this a bug? I'm digging in now but am curious
 if anyone has seen this before?
 
 I get the feeling I am working with Avro in a way that most people do not :)
 
 
 




Re: Has anyone developed a utility to tell what is missing from a record?

2013-04-06 Thread Scott Carey
Try GenericRecordBuilder.

For the Specific API, there are builders that will not let you construct an
object that can not be serialized.
The Generic API should have the same thing, but I am not 100% sure the
builder there covers it.

I have always avoided using any API that allows me to create an object that
is unsafe to serialize since finding out at serialization time is a huge
pain (and in my case, is often on a separate thread from the place it was
created).
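
A minimal sketch of that fail-fast behavior with the Generic API (schema and field names are illustrative):

  GenericRecordBuilder builder = new GenericRecordBuilder(schema);
  builder.set("first", "John");
  builder.set("last", "Doe");
  // build() throws here -- not later at serialization time -- if a field
  // without a default value has not been set
  GenericRecord rec = builder.build();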

On 4/4/13 6:58 AM, Jonathan Coveney jcove...@gmail.com wrote:

 I'm working on migrating an internally developed serialization format to Avro.
 In the process, there have been many cases where I made a mistake migrating
 the schema (I've automated it), and then avro cries that a record I'm trying
 to serialize doesn't match the schema. Generally, the error it gives doesn't
 help find the actual issue, and for a big enough record finding the issue can
 be tedious.
 
 I've thought about making a tool which, given the schema and the record would
 tell you what the issue is, but I'm wondering if this already exists? I
 suppose the error message could also include this information...
 
 Thanks
 Jon




Re: Support for char[] and short[] - Java

2013-01-08 Thread Scott Carey
You can cast both short and char safely to int and back, and use Avro's int
type.  These will be variable length integer encoded and take 1 to 3 bytes
in binary form per short/char.
This will be clunky for a user to wrap char[] or short[] into a List<Integer>
or int[], however.  Another option would be to extend the reader to look for
special meta-data in the schema that indicates that an array of int is to be
interpreted as shorts or chars.
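
A rough sketch of that widening round trip (purely illustrative):

  short[] data = {1, -2, 300};
  List<Integer> encoded = new ArrayList<Integer>(data.length);
  for (short s : data) {
    encoded.add((int) s);                // widen: lossless for short and char
  }
  // ... write 'encoded' as an Avro array of int, read it back ...
  short restored = (short) encoded.get(0).intValue();   // narrow on the way out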

Can you give an example where a char[] converted to utf8 bytes and back
results in a loss of data?  I was under the impression that UTF-16 surrogate
pairs are converted to proper UTF-8 sequences and back to surrogate pairs.
Or, are you using char to represent something else, as a two byte unsigned
quantity where interpreting as UTF-16 causes the problem?

On 12/23/12 10:30 PM, Tarun Gupta tarun.gu...@technogica.com wrote:

 Hi, 
 
 I am new to Avro, but I did some basic research regarding how we can support
 data types like char arrays and short arrays while defining the Avro schema.
 Issue AVRO-249 sounded somewhat relevant, but it's about supporting Short using
 the reflection API.
 
 We are planning to use Avro for a Java-based client/server data exchange use
 case; note that our data model is expected to have large arrays of Short and
 Char, and performance is our key concern. We can't use a string to store
 char[], because what we get back is different from what we put in, because of
 UTF-16 normalization.
 
 Thanks in Advance.
 Tarun Gupta




Re: Appending to .avro log files

2013-01-08 Thread Scott Carey
A sync marker delimits each block in the avro file.  If you want to start
reading data from the middle of a 100GB file, DataFileReader will seek to
the middle and find the next sync marker.  Each block can be individually
compressed, and by default when writing a file the writer will not
compress the block and flush to disk until the block has gotten as large as
the sync interval in bytes.   Alternatively, you can manually sync().

If you have a 100 byte sync interval, you may not see any data reach
disk until that many bytes have been written (or sync() is called
manually).

Your problem is likely that the first block in the file has not been
flushed to disk yet, and therefore the file is corrupt and missing a
trailing sync marker.
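
A minimal sketch of forcing buffered records to disk so a concurrent reader sees complete blocks (variable names are illustrative):

  DataFileWriter<GenericRecord> writer =
      new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
  writer.setSyncInterval(64 * 1024);   // flush a block roughly every 64KB of records
  writer.create(schema, new File("events.avro"));
  writer.append(record);
  writer.flush();                      // completes and writes the current block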

On 1/3/13 12:36 PM, Terry Healy the...@bnl.gov wrote:

Hello-

I'm upgrading a logging program to append GenericRecords to a .avro file
instead of text (.tsv). I have a working schema that is used to convert
existing .tsv of the same format into .avro and that works fine.

When I run a test writing 30,000 bogus records, it runs but when I try
to use avro-tools-1.7.3.jar tojson on the output file, it reports:

AvroRuntimeException: java.io.IOException: Invalid sync!

The file is still open at this point since the logging program is
running. Is this expected behavior because it is still open? (getmeta
and getschema work fine).

I'm not sure if it has any bearing, since I never really understood the
function of the AVRO sync interval; in this and the working programs
it is set to 100.

Any ideas appreciated.

-Terry




Re: Embedding schema with binary encoding

2013-01-08 Thread Scott Carey
Calling toJson() on a Schema will print it in JSON form.  However, you most
likely do not want to invent your own file format for Avro data.

Use DataFileWriter, which will manage the schema for you, along with
compression, metadata, and the ability to seek to the middle of the file.
Additionally, the file is then readable by several other languages and tools.
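
A minimal sketch of the same write using DataFileWriter, which embeds the schema in the file header (adapted from the snippet quoted below):

  DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
  DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
  fileWriter.create(schema, file);        // writes the schema into the header
  GenericRecord message1 = new GenericData.Record(schema);
  message1.put("to", "Alyssa");
  fileWriter.append(message1);
  fileWriter.close();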

On 1/7/13 4:42 AM, Pratyush Chandra chandra.praty...@gmail.com wrote:

 I am able to serialize with binary encoding to a file using the following:
 FileOutputStream outputStream = new FileOutputStream(file);
 Encoder e = EncoderFactory.get().binaryEncoder(outputStream, null);
 DatumWriter<GenericRecord> datumWriter = new
 GenericDatumWriter<GenericRecord>(schema);
 GenericRecord message1 = new GenericData.Record(schema);
 message1.put("to", "Alyssa");
 datumWriter.write(message1, e);
 e.flush();
 outputStream.close();
 
 But the output file contains only the serialized data and not the schema. How
 can I add the schema as well?
 
 Thanks
 Pratyush Chandra




Re: Setters and getters

2013-01-08 Thread Scott Carey
No.   However each API (Specific, Reflect, Generic in Java) has different
limitations and use cases.  You'll have to provide more information about
your use cases and expectations for more specific guidance.

On 1/7/13 11:21 AM, Tanya Bansal tanyapban...@gmail.com wrote:

 Is it necessary to write setters and getters for all member variables for a
 class that is going to be serialized by Avro?
 
 Thanks 
 -Tanya




Re: Sync() between records? How do we recover from a bad record, using DataFileReader?

2013-01-08 Thread Scott Carey
For the corruption test, try corrupting the records, not the sync marker.
The features added to DataFileReader for corruption recovery were for the
case when decoding a record fails (corrupted record), not for when a sync
marker is corrupted.  Perhaps we should add that too, but it does not
surprise me that that case has a bug.


On 1/6/13 7:38 PM, Russell Jurney russell.jur...@gmail.com wrote:


We are trying to recover, report bad record, and move to the next record
of an Avro file in PIG-3015 and PIG-3059. It seems that sync blocks don't
exist between files, however.

How should we recover from a bad record using Avro's DataFileReader?

https://issues.apache.org/jira/browse/PIG-3015
https://issues.apache.org/jira/browse/PIG-3059

Russell Jurney http://datasyndrome.com




Re: Serializing json against a schema

2013-01-08 Thread Scott Carey
You could use the ReflectDatumWriter to write a simple Java data class to
Avro, and you can create instances of such classes from JSON using a
library like Jackson.   There is a JSON encoding for Avro; if your data
conformed to that format (which would be more verbose than what you have
below) you could use that to decode it, then re-encode it to binary.
Lastly, you can use the SpecificDatum API, generate Java classes from your
schema, then set the data from the JSON with its type-safe builder pattern
APIs instead of the loose Generic API.

On 1/7/13 3:46 AM, Pratyush Chandra chandra.praty...@gmail.com wrote:



Hi,

I am new to Avro. I was going through examples and figured out that
GenericRecord can be appended to DataFileWriter and then serialized.

Example:
record.avsc is 
{
"namespace": "example.proto",
"name": "Message", "type": "record",
"fields": [
  {"name": "to",   "type": ["string", "null"]}
]
}

and my code snippet is:
DatumWriter<GenericRecord> datumWriter = new
GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new
DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
GenericRecord message1 = new GenericData.Record(schema);
message1.put("to", "Alyssa");
dataFileWriter.append(message1);
dataFileWriter.close();

My question is: suppose I am receiving JSON from a server, and based on the
schema I would like to serialize it directly, without parsing it.
For example:
Input received is {"to": "Alyssa"}
Is there a way I can serialize the above JSON with the record.avsc schema
instead of appending a GenericRecord?

-- 
Pratyush Chandra




Re: issue with writing an array of records

2013-01-08 Thread Scott Carey

On 1/7/13 8:35 AM, Alan Miller alan.mill...@gmail.com wrote:


Hi, I have a schema with an array of records (I'm open to other
suggestions too) field
called ifnet to store misc attribute name/values for a host's network
interfaces.
e.g. 


{ "type": "record",
  "namespace": "com.company.avro.data",
  "name": "MyRecord",
  "doc": "My Data Record.",
  "fields": [
    // (required) fields
    {"name": "time_stamp", "type": "long"},
    {"name": "hostname",   "type": "string"},

    // (optional) array of ifnet instances
    {"name": "ifnet",
     "type": ["null", {
        "type": "array",
        "items": { "type": "record", "name": "Ifnet",
                   "namespace": "com.company.avro.data",
                   "fields": [ {"name": "name",     "type": "string"},
                               {"name": "send_bps", "type": "long"},
                               {"name": "recv_bps", "type": "long"}
                             ]
                 }
       }
      ]
    }
  ]
}

First thought:  Why the union of null and the array?  It may be easier to
simply  have an empty list when there are no Ifnet data.





I can write the records, (time_stamp and hostname are correct) but
my array of records field (ifnet) only contains the last element of my
java List.

Am I writing the field correctly?  I'm trying to write the ifnet field
with a
java.util.List<com.company.avro.data.Ifnet>

Here's the related code lines that write the ifnet field. (Yes, I'm
attempting to use reflection
because Ifnet is only 1 of approx 11 other array of record fields I'm
trying to implement.)

   Class[] paramObj = new Class[1];
   paramObj[0] = Ifnet.class;
   Method method = cls.getMethod(methodName, List.class);
   jsonObj = new Ifnet();
   listOfObj = new ArrayList<Ifnet>();
   ...   


   // in a loop building the ListIfnet...

LOG.info(String.format("   [%s] %s %s(%s) as %s", name,
k, methNm, v, types[j].toString()));
   ...

LOG.info(String.format("   [%s] setting name to %s", name, name));

   ...   

   listOfObj.add(jsonObj);

   ...

  // then finally I call invoke with a List of Ifnet records

  if (method != null) { method.invoke(obj, listOfObj); }
  LOG.info(String.format("  invoking %s.%s",
method.getClass().getSimpleName(), method.getName()));
  LOG.info(String.format("  param: listObj %s with %d entries",
jsonObj.getClass().getName(), listOfObj.size()));


and the respective output
20130107T172303  INFO c.c.a.d.MyDriver - Setifnet
json via setIfnet(Ifnet object)
20130107T172303  INFO c.c.a.d.MyDriver -[e0c] setting name to e0c
20130107T172303  INFO c.c.a.d.MyDriver -[e0c] send_bps setSendBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[e0c] setting name to e0c
20130107T172303  INFO c.c.a.d.MyDriver -[e0c] recv_bps setRecvBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[e0d] setting name to e0d
20130107T172303  INFO c.c.a.d.MyDriver -[e0d] send_bps setSendBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[e0d] setting name to e0d
20130107T172303  INFO c.c.a.d.MyDriver -[e0d] recv_bps setRecvBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[e0a] setting name to e0a
20130107T172303  INFO c.c.a.d.MyDriver -[e0a] send_bps
setSendBps(170720) as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[e0a] setting name to e0a
20130107T172303  INFO c.c.a.d.MyDriver -[e0a] recv_bps
setRecvBps(244480) as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[e0b] setting name to e0b
20130107T172303  INFO c.c.a.d.MyDriver -[e0b] send_bps setSendBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[e0b] setting name to e0b
20130107T172303  INFO c.c.a.d.MyDriver -[e0b] recv_bps setRecvBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[e0P] setting name to e0P
20130107T172303  INFO c.c.a.d.MyDriver -[e0P] send_bps setSendBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[e0P] setting name to e0P
20130107T172303  INFO c.c.a.d.MyDriver -[e0P] recv_bps setRecvBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[losk] setting name to losk
20130107T172303  INFO c.c.a.d.MyDriver -[losk] send_bps setSendBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -[losk] setting name to losk
20130107T172303  INFO c.c.a.d.MyDriver -[losk] recv_bps setRecvBps(0)
as class java.lang.Long
20130107T172303  INFO c.c.a.d.MyDriver -   invoking Method.setIfnet
20130107T172303  INFO c.c.a.d.MyDriver -   param:
listObjcom.synopsys.iims.be.storage.Ifnet with 6 entries

20130107T172303  INFO c.c.a.d.MyDriver - Set   time_stamp
integer via setTimeStamp to 1357513251


When I dump the records I see an array of 6 entries, but the values all
reflect the last entry in my java.util.List.
The 

Re: any movement on JSON encoding for RPC?

2012-11-27 Thread Scott Carey
Avro can serialize in JSON, however most users use the compact binary
serialization for performance and data storage reasons (JSON is typically
10x larger), and use the JSON format for debugging or export to other
systems.

I do not know if anyone is planning work on the JSON encoding in combination
with Avro RPC,  the best place to find out is the dev mailing list and JIRA
tickets.

On 11/21/12 1:31 PM, Brian Lee leeb...@yahoo.com wrote:

 I found a message from last year that JSON encoding for RPC was not yet
 implemented. Is this still the case? If so, this would be very bad as one of
 the selling points we were using is that Avro serialized its messages in JSON
 format.
 
 Brian




Re: Backwards compatible - Optional fields

2012-10-03 Thread Scott Carey
A reader must always have the schema of the written data to decode it.

When creating your Decoder, you must pass both the reader's schema and the
schema as written.

Once given this pair, Avro can know to skip data as written if the reader
does not need it, or to inject default values for the reader if the writer
did not provide it.

The flaw in your code is here where you only provide the reader's schema:

new SpecificDatumReader<A>(a.getSchema());
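
A minimal sketch of passing both schemas, using the SCHEMA$ fields of the generated classes from the code below:

  DatumReader<Metadatav2> dr = new SpecificDatumReader<Metadatav2>(
      Metadata.SCHEMA$,        // writer's schema: how the bytes were written
      Metadatav2.SCHEMA$);     // reader's schema: what we want back
  Decoder d = DecoderFactory.get().binaryDecoder(bs, null);
  Metadatav2 m2 = dr.read(null, d);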




On 10/2/12 2:04 PM, Gabriel Ki gab...@gmail.com wrote:

 Hi all,
 
 I had the impression that a reader works with an older-version object as long
 as the new fields are optional.  Is that true?  If not,
 what would you recommend?  Thanks a lot in advance.
 
 For example:
 
 {
   "namespace": "org.apache.avro.examples",
   "protocol": "MyProtocol",

   "types": [
     { "name": "Metadata", "type": "record", "fields": [
       {"name": "S1", "type": "string"}
     ]},

     { "name": "Metadatav2", "type": "record", "fields": [
       {"name": "S1", "type": "string"},
       {"name": "S2", "type": ["string", "null"]}  // optional field in the new version
     ]}
   ]
 }
 
 public static <A extends SpecificRecordBase> A parseAvroObject(final A a,
     final byte[] bb) throws IOException {
   if (bb == null) {
     return null;
   }
   ByteArrayInputStream bais = new ByteArrayInputStream(bb);
   DatumReader<A> dr = new SpecificDatumReader<A>(a.getSchema());
   Decoder d = DecoderFactory.get().binaryDecoder(bais, null);
   return dr.read(a, d);
 }
 
 
 public static void main(String[] args) throws IOException {

   Metadata.Builder mb = Metadata.newBuilder();
   mb.setS1("S1 value");
   byte[] bs = toBytes(mb.build());

   Metadata m = parseAvroObject(new Metadata(), bs);
   System.out.println("parse as Metadata " + m);

   // This I thought worked with the older version
   Metadatav2 m2 = parseAvroObject(new Metadatav2(), bs);
   System.out.println("parse as Metadatav2 " + m2);
 }
 
 
 Exception in thread main java.io.EOFException
 at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
 at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:405)
 at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
 at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
 at 
 org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
 at 
 org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:
 166)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
 
 
 Thanks
 -gabe




Re: Schema resolution failure when the writer's schema is a primitive type and the reader's schema is a union

2012-08-31 Thread Scott Carey
My understanding of the spec is that promotion to a union should work as
long as the prior type is a member of the union.

What happens if the order of the union in the reader schema is reversed?

This may be a bug.

-Scott

On 8/16/12 5:59 PM, Alexandre Normand alexandre.norm...@gmail.com
wrote:


Hey, 
I've been running into this case where I have a field of type int but I
need to allow for null values. To do this, I now have a new schema that
defines that field as a union of
null and int such as:
type: [ null, int ]
According to my interpretation of the spec, avro should resolve this
correctly. For reference, this reads like this (from
http://avro.apache.org/docs/current/spec.html#Schema+Resolution):

if
 reader's is a union, but writer's is not
The first schema in the reader's union that matches the writer's schema
is recursively resolved against it. If none match, an error is signaled.)


However, when trying to do this, I get this:
org.apache.avro.AvroTypeException: Attempt to process a int when a union
was expected.

I've written a simple test that illustrates what I'm saying:
@Test
public void testReadingUnionFromValueWrittenAsPrimitive() throws
Exception {
Schema writerSchema = new Schema.Parser().parse("{\n" +
    "\"type\":\"record\",\n" +
    "\"name\":\"NeighborComparisons\",\n" +
    "\"fields\": [\n" +
    "  {\"name\": \"test\",\n" +
    "  \"type\": \"int\" }]}");
Schema readersSchema = new Schema.Parser().parse("{\n" +
    "\"type\":\"record\",\n" +
    "\"name\":\"NeighborComparisons\",\n" +
    "\"fields\": [ {\n" +
    "  \"name\": \"test\",\n" +
    "  \"type\": [\"null\", \"int\"],\n" +
    "  \"default\": null } ]}");
GenericData.Record record = new GenericData.Record(writerSchema);
record.put("test", Integer.valueOf(10));

ByteArrayOutputStream output = new ByteArrayOutputStream();
JsonEncoder jsonEncoder =
    EncoderFactory.get().jsonEncoder(writerSchema, output);
GenericDatumWriter<GenericData.Record> writer = new
    GenericDatumWriter<GenericData.Record>(writerSchema);
writer.write(record, jsonEncoder);
jsonEncoder.flush();
output.flush();

System.out.println(output.toString());

JsonDecoder jsonDecoder =
    DecoderFactory.get().jsonDecoder(readersSchema, output.toString());
GenericDatumReader<GenericData.Record> reader =
    new GenericDatumReader<GenericData.Record>(writerSchema,
        readersSchema);
GenericData.Record read = reader.read(null, jsonDecoder);

assertEquals(10, read.get("test"));
}

Am I misunderstanding how avro should handle such a case of schema
resolution or is the problem in the implementation?

Cheers!

-- 
Alex




Re: Pig with Avro and HBase

2012-08-30 Thread Scott Carey
I am using Pig on Avro data files, and Avro in HBase.

Can you elaborate on what you mean by 'auto-load the schema'?  In the
sense that a Pig LOAD statement doesn't have to declare the schema?  I do
this with avro data files to some extent (with limitations).

A working implementation of
https://issues.apache.org/jira/browse/AVRO-1124 seems to be the way to go
for tracking a mapping from something like a Table or known file type to a
sequence of schemas (and the most recent schema).  Then a pig loader could
load from HBase using the most recent schema from a named schema group, or
read the same thing from files that represent the same schema group with
an avro file loader.


On 8/22/12 8:37 PM, Russell Jurney russell.jur...@gmail.com wrote:



Is anyone using Pig with Avro as the datatype in HBase? I want to
auto-load the schema, and this seems the most direct way to do it.

-- 
Russell Jurney twitter.com/rjurney http://twitter.com/rjurney
russell.jur...@gmail.com datasyndrome.com http://datasyndrome.com/




Re: Suggestions when using Pair.getPairSchema for Reduce-Side Joins in MR2

2012-06-28 Thread Scott Carey
It sounds like we need to be extra clear in the documentation on Pair, and
perhaps have a different class or flavor that serves the purpose you needed.
(KeyPair?)

In Avro's MRV1 API, there is no key schema or value schema for map output,
but only one map output schema that must be a Pair: a pair of key and
value, where only the key is used for the sort.
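
For reference, a sketch of building a composite key schema the same way Pair
does, but without the IGNORE order on the second field, so that the standard
AvroKeyComparator sorts on both parts (names here are illustrative, not part
of Avro; it uses the same Schema/Field API as the makePairSchema shown below):

  private static Schema makeJoinKeySchema(Schema groupKey, Schema sortKey) {
    Schema key = Schema.createRecord("JoinKey", null, "example", false);
    List<Field> fields = new ArrayList<Field>();
    fields.add(new Field("group", groupKey, "", null));   // default order is ascending
    fields.add(new Field("order", sortKey, "", null, Field.Order.ASCENDING));
    key.setFields(fields);
    return key;
  }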

-Scott

On 6/27/12 3:09 PM, Jacob Metcalf jacob_metc...@hotmail.com wrote:

 
 
 I spent an hour or so of today debugging some map reduce jobs I had developed
 in Avro 1.7 and Map Reduce 2 and thought it might be constructive to share. I
 needed to do a reduce side join for which you need a composite key. The key
 consists of the key you are actually grouping by and an integer which is just
 used for sorting (the technique is described in many places but there is a
 nice picture on page 24 of
 http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf).
 
 
 For this I thought it would be ideal to use Avro pair class which has a handy
 function for creating its own schema so I could configure the shuffle
 something like this:
 
 
 Schema joinKeySchema = Pair.getPairSchema( Schema.create( Schema.Type.STRING
 ), Schema.create( Schema.Type.INT ));
 AvroJob.setMapOutputKeySchema( joinKeySchema );
  
 I then planned to use the standard AvroKeyComparator for sorting and a
 specialised comparator for grouping/partitioning which would ignore the
 integer part. However it did not work as the sort on the integer did not
 appear to take place and my map output would arrive in the wrong order at the
 reducer. I finally tracked the issue down to the fact that the pair schema by
 default ignores the second part of the pair:
 
 
 private static Schema makePairSchema(Schema key, Schema value) {
 Schema pair = Schema.createRecord(PAIR, null, null, false);
 List<Field> fields = new ArrayList<Field>();
 fields.add(new Field(KEY, key, "", null));
 fields.add(new Field(VALUE, value, "", null, Field.Order.IGNORE));
 pair.setFields(fields);
 return pair;
   }
 
 
 In the end it was easy enough to work around by creating my own pair schema. I
 am not an expert but I suspect there is a very valid application for this
 ignore in MR1. As a suggestion, it may help going forward if a second version
 with a boolean to toggle the ignore were introduced to make the semantics
 clearer.
 
 
 
 Jacob




Re: C/C++ parsing vs. Java parsing.

2012-06-25 Thread Scott Carey
The schema provided is a union of several schemas.  Java supports parsing
this, C++ may not.  Does it work if you make it one single schema, and
nest "NA", "acomplex" and "retypes" inside of "object"?  It only needs to
be defined the first time it is referenced.  If it does not, then it is
certainly a bug.

Either way I would file a bug in JIRA.  The spec does not say whether a
file should be parseable if it contains a union rather than a record, but
it probably should be.

-Scott

On 6/24/12 11:17 PM, Saptarshi Guha sg...@mozilla.com wrote:

I have a avro scheme found here: http://sguha.pastebin.mozilla.org/1677671

I tried

java -jar avro-tools-1.7.0.jar  compile schema ~/tmp/robject.avro foo

and it worked.

This failed:

avrogencpp --input ~/tmp/robject.avro --output ~/tmp/h2
Segmentation fault: 11


This failed:

 avro_schema_t *person_schema =
(avro_schema_t*)malloc(sizeof(avro_schema_t));
(avro_schema_from_json_literal(string.of.avro.file), person_schema)

with

Error was Error parsing JSON: string or '}' expected near end of file

Q1: Does C and C++ API support all schemas the Java one supports?
Q2: Is it yes to Q1 and this is a bug?

Regards
Saptarshi




Re: Paranamer issue

2012-06-18 Thread Scott Carey


On 6/6/12 10:33 AM, Peter Cameron peter.came...@2icworld.com wrote:

The BSD license is a problem for our clients, whereas the Apache 2
license is not. Go figure. That's the situation!

ASL 2.0 is a derivative of the BSD license, after all...
Apache projects regularly depend on other items that are MIT or BSD
licensed since these are the least restrictive open source licenses around.



So what is the answer for us when we don't want to ship the avro tools
JAR but need the Paranamer classes from it. What can we do to stay
consistent with Apache 2 e.g. create my own Paranamer JAR containing
just those classes from the tools JAR?


As Doug said, packaging doesn't affect anything license-wise, but you can
repackage things fairly easily into a single jar that contains what you
need using maven-shade-plugin.
The avro-tools.jar uses this to repackage all dependencies inside of it.
You can do the same thing for the base avro.jar and explicitly include
only the few jars you need (or exclude those you do not) by adding a
maven-shade-plugin configuration to lang/java/avro/pom.xml
and rebuilding.




Peter


On 06/06/2012 18:30, Doug Cutting wrote:
 On 06/06/2012 06:51 AM, Peter Cameron wrote:
 I've only just discovered the dependancy of Avro upon the thoughtworks
 Paranamer classes. We use reflection at runtime with a schema and
 encountered the usual ClassNotFoundException for Paranamer after I'd
 been rationalising our codebase -- which included the removal of the
 avro-tools-1.6.3 JAR. The tools JAR contains the Paranamer classes
which
 I was unaware of. We operate in a very lightweight environment so the
 10Mb tools JAR is not suitable for us to deploy.

 I went looking for the Paranamer JAR and eventually found version 2.5.
 However, this is BSD licensed. BSD is not suitable for us. Only
 Apache 2.0.

 How is BSD a problem?  BSD is less restrictive than Apache 2.0 and is
 thus generally not considered to alter the requirements of one
 re-distributing software that includes BSD within an Apache-licensed
 project.

 Doug





Re: Scala API

2012-05-30 Thread Scott Carey
This would be fantastic.  I would be excited to see it.  It would be great
to see a Scala language addition to the project if you wish to contribute.

I believe there have been a few other Scala Avro attempts by others over
time.   I recall one where all records were case classes (but this broke at
22 fields).
Another thing to look at is:
http://code.google.com/p/avro-scala-compiler-plugin/

Perhaps we can get a few of the other people who have developed Scala Avro
tools to review/comment or contribute as well?

On 5/29/12 11:04 PM, Christophe Taton ta...@wibidata.com wrote:

 Hi people, 
 
 Is there interest in a custom Scala API for Avro records and protocols?
 I am currently working on an schema compiler for Scala, but before I go
 deeper, I would really like to have external feedback.
 I would especially like to hear from anyone who has opinions on how to map
 Avro types onto Scala types.
 Here are a few hints on what I've been trying so far:
 * Records are compiled into two forms: mutable and immutable.
Very nice.
 * To avoid collisions with Java generated classes, scala classes are generated
 in a .scala sub-package.
 * Avro arrays are translated to Seq/List when immutable and Buffer/ArrayBuffer
 when mutable.
 * Avro maps are translated to immutable or mutable Map/HashMap.
 * Bytes/Fixed are translated to Seq[Byte] when immutable and Buffer[Byte] when
 mutable. 
 * Avro unions are currently translated into Any, but I plan to:
 * translate union{null, X} into Scala Option[X]
 * compile union {T1, T2, T3} into a custom case classes to have proper type
 checking and pattern matching.
If you have a record R1, it compiles to a Scala class.  If you put it in a
union of {T1, String}, what does the case class for the union look like?  Is
it basically a wrapper like a specialized Either[T1, String] ?   Maybe Scala
will get Union types later to push this into the compiler instead of object
instances :)
 * Scala records provide a method encode(encoder) to serialize as binary into a
 byte stream (appears ~30% faster than SpecificDatumWriter).
 * Scala mutable records provide a method decode(decoder) to deserialize a byte
 stream (appears ~25% faster than SpecificDatumReader).
I have some plans to improve {Generic,Specific}Datum{Reader,Writer}  in
Java, I would be interested in seeing how the Scala one here works.  The
Java one is slowed by traversing too many data structures that represent
decisions that could be pre-computed rather than repeatedly parsed for each
record.
 * Scala records implement the SpecificRecord Java interface (with some
 overhead), so one may still use the SpecificDatumReader/Writer when the custom
 encoder/decoder methods cannot be used.
 * Mutable records can be converted to immutable (ie. can act as builders).
 Thanks,
 Christophe
 




Re: How represent abstract in Schemas

2012-05-07 Thread Scott Carey
Avro schemas can represent Union types, but not abstract types.  It does not
make sense to serialize an abstract class, since its data members are not
known.
By definition, an abstract type does not define all of the possible sub
types in advance, which presents another problem -- in order to make sense
of serialized data, the universe of serialized types needs to be known.

You can model an abstract type with union types with a little bit of work.
For example, if you have type AbstractThing, with children Concrete1 and
Concrete2, you can serialize these as a union of Concrete1 and Concrete2.
When reading the element with this union, you will need to check the
instance type at runtime and cast, or, if you know the super type is
AbstractThing, you can blindly cast to AbstractThing.  As new types are
added, your schema will change to include more branches in the union.  If
you remove a type, you will need to provide a default in case the removed
type is encountered while reading data.
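
As a sketch, a field holding that hierarchy would be declared with a union of
the concrete types (names follow the hypothetical example above):

  { "name": "thing",
    "type": [ "Concrete1", "Concrete2" ] }

where Concrete1 and Concrete2 are record schemas defined earlier in the same
schema or protocol.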

If you are using the Java Specific API the above will not work without
wrapper classes that contain the hierarchy, and the ability to create these
from the serialized types.

Serialization deals only with data stored in member variables, and
interfaces have no data.  An Avro Protocol maps to a Java interface, but it
is never serialized; it represents a contract for exchanging serialized
data.

-Scott

On 5/6/12 9:55 PM, Gavin YAO gavin.ming@gmail.com wrote:

 Hello: 
 I am very new to the Apache Avro community so I hope I am doing right
 in just sending a mail to this address.
 Is it possible to represent an abstract type, as in the Java language we can
 do with an abstract class or interface?
 
 Thanks a lot!
  




Re: Nested schema issue

2012-05-01 Thread Scott Carey


On 5/1/12 9:47 AM, Peter Cameron pe...@pbpartnership.com wrote:

I'm having a problem with nesting schemas. A very brief overview of why
we're using Avro (successfully so far) is:

o code generation not required
o small binary format
o dynamic use of schemas at runtime

We're doing a flavour of RPC, and the reason we're not using Avro's IDL
and flavour of RPC is because the endpoint is not necessarily a Java
platform (C# and Java for our purposes), and only the Java
implementation of Avro has RPC. Hence no Avro RPC for us.

I'm aware that Avro doesn't import nested schemas out of the box. We
need that functionality as we're exposed to schemas over which we have
no control, and in the interests of maintainability, these schemas are
nicely partitioned and are referenced as types from within other
schemas. So, for example, a address schema refers to a
some.domain.location object by having a field of type
some.domain.location. Note that our runtime has no knowledge of any
some.domain package (e.g. address or location objects). Only the
endpoints know about some.domain. (A layer at our endpoint runtime
serialises any unknown i.e. non-primitive objects as bytestreams.)

I implemented a schema cache which intelligently imports schemas on the
fly, so adding the address schema to the cache, automatically adds the
location schema that it refers to. The cache uses Avro's schema to parse
an added schema, catches parse exceptions, looks at the exception
message to see whether or not the error is due to a missing or undefined
type, and thus goes off to import the needed schema. Brittle, I know,
but no other way for us. We need this functionality, and nothing else
comes close to Avro.

On the Java side, recent versions have a Parser that can deal with schema
import.  It requires that a schema be defined before use, however.  Perhaps
we can add a callback to the API for returning undefined schemas as they
are found.
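
A minimal sketch of that on the Java side, where locationSchemaJson and
addressSchemaJson stand in for the schema text shown below:

  Schema.Parser parser = new Schema.Parser();
  parser.parse(locationSchemaJson);                  // defines some.domain.location
  Schema address = parser.parse(addressSchemaJson);  // may now reference it by name

The same Parser instance remembers the types it has already parsed, so later
schemas can refer to them by full name.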


So far so good, until today when I hit a corner case.

Say I have an address object that has two fields, called position1 and
position2. If position1 and position2 are non-primitive types, then the
address schema doesn't parse so presumably is an invalid Avro schema.
The error concerns redefining the location type. Here's the example:

location schema
==

{
 name: location,
 type: record,
 namespace : some.domain,
 fields :
 [
 {
 name: latitude,
 type: float
 },
 {
 name: longitude,
 type: float
 }
 ]
}

address schema
==

{
 name: address,
 type: record,
 namespace : some.domain,
 fields :
 [
 {
 name: street,
 type: string
 },
 {
 name: city,
 type: string
 },
 {
 name: position1,
 type: some.domain.location
 },
 {
 name: position2,
 type: some.domain.location
 }
 ]
}


Now, an answer of having a list of positions as a field is not an answer
for us, as we need to solve the general issue of a schema with more than
one instance of the same nested type i.e. my problem is not with an
address or location schema.

Can this be done? This is potentially a blocker for us.

This should be possible.  A named type can be used for multiple
differently named fields in a record. Is the parse error in C# or Java?
What is the error?
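
For example, a fragment like the following (a sketch based on your schemas)
should parse, with the second field referring to the earlier definition by
name:

  { "name": "position1",
    "type": { "type": "record", "name": "location", "namespace": "some.domain",
              "fields": [ { "name": "latitude", "type": "float" },
                          { "name": "longitude", "type": "float" } ] } },
  { "name": "position2",
    "type": "some.domain.location" }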


cheers,
Peter





Re: Nested schema issue (with munged invalid schema)

2012-05-01 Thread Scott Carey


On 5/1/12 9:55 AM, Peter Cameron pe...@pbpartnership.com wrote:

 I'm having a problem with nesting schemas. A very brief overview of why we're
 using Avro (successfully so far) is:
 
 o code generation not required
 o small binary format
 o dynamic use of schemas at runtime
 
 We're doing a flavour of RPC, and the reason we're not using Avro's IDL and
 flavour of RPC is because the endpoint is not necessarily a Java platform (C#
 and Java for our purposes), and only the Java implementation of Avro has RPC.
 Hence no Avro RPC for us.
 
 I'm aware that Avro doesn't import nested schemas out of the box. We need that
 functionality as we're exposed to schemas over which we have no control, and
 in the interests of maintainability, these schemas are nicely partitioned and
 are referenced as types from within other schemas. So, for example, a address
 schema refers to a some.domain.location object by having a field of type
 some.domain.location. Note that our runtime has no knowledge of any
 some.domain package (e.g. address or location objects). Only the endpoints
 know about some.domain. (A layer at our endpoint runtime serialises any
 unknown i.e. non-primitive objects as bytestreams.)
 
 I implemented a schema cache which intelligently imports schemas on the fly,
 so adding the address schema to the cache, automatically adds the location
 schema that it refers to. The cache uses Avro's schema to parse an added
 schema, catches parse exceptions, looks at the exception message to see
 whether or not the error is due to a missing or undefined type, and thus goes
 off to import the needed schema. Brittle, I know, but no other way for us. We
 need this functionality, and nothing else comes close to Avro.
 
 So far so good, until today when I hit a corner case.
 
 Say I have an address object that has two fields, called position1 and
 position2. If position1 and position2 are non-primitive types, then the
 address schema doesn't parse so presumably is an invalid Avro schema. The
 error concerns redefining the location type. Here's the example:
 
 location schema 
 == 
 
 { 
 name: location,
 type: record,
 namespace : some.domain,
 fields : 
 [ 
 { 
 name: latitude,
 type: float
 }, 
 { 
 name: longitude,
 type: float
 } 
 ] 
 } 
 
 address schema 
 == 
 
 { 
 name: address,
 type: record,
 namespace : some.domain,
 fields : 
 [ 
 { 
 name: street,
 type: string
 }, 
 { 
 name: city,
 type: string
 }, 
 { 
 name: position1,
 type: some.domain.location
 }, 
 { 
 name: position2,
 type: some.domain.location
 } 
 ] 
 } 
 
 
 Now, an answer of having a list of positions as a field is not an answer for
 us, as we need to solve the general issue of a schema with more than one
 instance of the same nested type i.e. my problem is not with an address or
 location schema.
 
 The problematic schema constructed by my schema cache is:
 
 {
 name: address2,
 type: record,
 namespace : some.domain,
 fields : 
 [
 {
 name: street,
 type: string
 },
 {
 name: city,
 type: string
 },
 {
 name: position1,
 type:
 {type:record,name:location,namespace:some.domain,fields:[{name
 :latitude,type:float},{name:longitude,type:float}]}
 },
 {
 name: position2,
 type:
 {type:record,name:location,namespace:some.domain,fields:[{name
 :latitude,type:float},{name:longitude,type:float}]}
 }
 ]
 }

The second time that location is used, it should be used by reference, and
not re-defined.  I believe that
  "name": "position2",
  "type": "some.domain.location"
should work, provided the type some.domain.location is defined previously in
the schema, as it is in position1.

 
 
 Can this be done? This is potentially a blocker for us.
 
 cheers, 
 Peter 
 




Re: Support for Serialization and Externalization?

2012-05-01 Thread Scott Carey


On 4/23/12 10:37 AM, Joe Gamache gama...@cabotresearch.com wrote:

 Hello,
 
 We have been using  Avro successfully to serialize many of our objects, using
 binary encoding, for storage and retrieval.  Although the documentation about
 the Reflect Mapping states:
 This API is not recommended except as a stepping stone for systems that
 currently uses Java interfaces to define RPC protocols.
 we used this mapping as that recommendation did not seem to apply.  We do not
 use the serialized data for RPC (or any other messaging system).In fact,
 this part has in-place for a while and works exceptionally well.
 
 Now we would like to externalize a smaller subset of the objects for
 interaction with a WebApp.  Here we would like to use the JSON encoding and
 the specific mapping.We tried having this set of objects implement
 GenericRecord, however, this then breaks the use of Reflection on these
 objects.  [The ReflectData.createSchema method checks for this condition.]
 
 Can Avro be used to serialize objects one way, and externalize them another?
 [The externalized objects are a subset of the serialized ones.]   Perhaps more
 generally, my question is: can both binary encoding and JSON encoding be
 supported on overlapping objects using different mappers?   If yes, what is
 the best way to accomplish this?

That should be possible.  If not, I think it is a bug.  The Java reflect API
is supposed to be able to handle Specific and Generic records, or at least
there is supposed to be a way to use them both.
What is the specific error, from what API call?  Perhaps it is a simple fix
and you can submit a patch and test to JIRA?

Thanks,

-Scott 

 
 Thanks for any help - I am still quite a noob here so I greatly appreciate any
 additional details!
 
 Joe Gamache




Re: Specific/GenericDatumReader performance and resolving decoders

2012-04-19 Thread Scott Carey
I think this approach makes sense, reader=writer is common.  In addition to
record fields, unions are affected.

I have been thinking about the issue that resolving records is slower than
not for a while.  In theory, it could be just as fast because you can
pre-compute the steps needed and bake that into the reading logic.  This
seems like a reasonable way to avoid the cost for the case where schemas
equal.

Please open a JIRA ticket and put your preliminary thoughts there.  It is a
good place to discuss the technical bits of the issue even before you have a
patch.

On 4/19/12 2:09 AM, Irving, Dave dave.irv...@baml.com wrote:

 Hi,
  
 Recently I've been looking at the performance of Avro's
 SpecificDatumReaders/Writers. In our use cases, when deserializing, we find it
 quite usual for reader / writer schemas to be identical. Interestingly,
 GenericDatumReader bakes in the use of ResolvingDecoders right into its core.
 So even if constructed with a single (reader/writer) schema, a
 ResolvingDecoder is still used.
 I experimented a little, and wrote a SpecificDatumReader which, instead of
 being hard wired with a ResolvingDecoder, uses a DecodeStrategy, leaving the
 reader only dealing with Decoders directly.
 Details follow, but for 'same schema' decodes the performance difference is
 impressive. For the types of records I deal with, a decode with reader schema
 == writer schema using this approach is about 1.6x faster than a standard
 SpecificDatumReader decode.
  
  
 interface DecodeStrategy
 {
   Decoder configureForRead(Decoder in) throws IOException;
  
   void readComplete() throws IOException;
  
   void decodeRecordFields(Object old, SpecificRecord record, Schema expected,
 Decoder in, SpecificDatumReader2 reader) throws IOException;
 }
  
 The idea is that when we hit a record, instead of getting field order from a
 ResolvingDecoder directly, we just let the decode strategy do it for us
 (calling back to the reader for each field, allowing recursion).
 For example, when we know reader / writer schemas are identical and we don't
 want validation, an IdentitySchemaDecodeStrategy#decodeRecordFields can just
 pull the fields directly from the provided record schema (calling back on the
 reader for each one):
  
 ...
  
 void decodeRecordFields(..)
 {
   List<Field> fields = expected.getFields();
   for (int i = 0, len = fields.size(); i < len; ++i)
   {
     Field field = fields.get(i);
     reader.readField(old, in, field, record);
   }
 }
  
 ...
  
 The resolving decoder impl of this strategy just does a 'readFieldOrder' like
 GenericDatumReader does today.
  
 For each read (given a Decoder), the datum reader lets the decode strategy
 return back the actual decoder to be used (via #configureForRead). This means
 that a resolving implementation can use this hook to configure the
 ResolvingDecoder and return this.
 The result is that the datum reader can work with same schema / validated
 schema / resolved schemas seamlessly without caring about the difference.
  
 I thought I'd share the approach before working on a full patch: is this an
 approach you'd be interested in taking back to core Avro? Or is it a little
 niche? :)
  
 Cheers,
  
 Dave
  
 

[ANNOUNCE] New Apache Avro PMC Member: Douglas Creager

2012-04-10 Thread Scott Carey
The Apache Avro PMC is pleased to announce that Douglas Creager is now
part of the PMC.

Congratulations and Thanks!






Re: Sync Marker Issue while reading AVRO files writen with FLUME with PIG

2012-04-03 Thread Scott Carey
I have not seen this issue before with 100 TB of Avro files, but am not
using Flume to write them.  We have moved on to Avro 1.6.x but were on the
1.5.x line for quite some time.  Perhaps while writing there was an
exception of some sort that was not handled correctly in Avro or Flume.

Looking at the DataFileWriter code, I can see how a file could get
truncated without a sync marker if the writing process crashes, but not
how it could successfully write two blocks in a row without a sync between.

You should be able to modify the file reader to recover and re-write the
data if it is only a missing sync marker, or skip over the block if it is
corrupt.

On 4/3/12 1:28 AM, Markus Resch markus.re...@adtech.de wrote:

Hey everyone,

we're facing a problem while reading AVRO files written with FLUME using
the AVRO Java API 1.5.4 into a HADOOP cluster. The Avro Data Store
complains about missing sync marker. Investigating the problem shows us,
that's perfectly right. The sync marker is missing. Thus we have a block
of the double size.

Our software packets:
 rpm -qa | grep hadoop
hadoop-0.20-namenode-0.20.2+923.142-1
hadoop-0.20-0.20.2+923.142-1
hadoop-0.20-native-0.20.2+923.142-1
hadoop-hive-0.7.1+42.27-2
hadoop-pig-0.8.1+28.18-1

This is pretty much all a basic cloudera
CDH3 Update 2 Packaging installation with a patched PIG version which is
CDH3 Update 3.

Did anyone had a similar issue? Does this ring a bell?

Thanks

Markus






Re: avro compression using snappy and deflate

2012-04-02 Thread Scott Carey


On 3/30/12 12:08 PM, Shirahatti, Nikhil snik...@telenav.com wrote:

Hello All,

I think I figured our where I goofed up.

I was flushing on every record, so basically this was compression per
record, so it had metadata with each record. This was adding more data
to the output when compared to uncompressed Avro.

So now I have better figures: at least they look realistic; I still need to
find out if they are map-reduceable.
Avro= 12G
Avro+Defalte= 4.5G

Deflate is affected quite a bit by the compression level selected (1 to 9)
in both performance and level of compression.  However, in my experience
anything past level 6 is only very slightly smaller and much slower, while
the difference between levels 1 to 3 is large on both fronts.
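
A sketch of selecting the level when writing a data file (level 3 here is just
an example; the schema variable is assumed to exist already):

  DataFileWriter<GenericRecord> writer =
      new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
  writer.setCodec(CodecFactory.deflateCodec(3));   // 1 = fastest ... 9 = smallest
  writer.create(schema, new File("data.avro"));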

Avro+Snappy = 5.5G

Have others tried Avro + LZO?

I have not heard of anyone doing this.  LZO is not Apache license
compatible, and there are now several alternatives that are in the same
class of compression algorithm available, including Snappy.


Thanks,
Nikhil


On 3/30/12 12:54 AM, Shirahatti, Nikhil snik...@telenav.com wrote:

The original data file (a text file) is 40GB, the avro file is about
12GB,
avro snappy is 13GB!

Thanks,
Nikhil

--
View this message in context:
http://apache-avro.679487.n3.nabble.com/avro-compression-using-snappy-and
-
deflate-tp3870167p3870184.html
Sent from the Avro - Users mailing list archive at Nabble.com.





Re: BigInt / longlong

2012-03-28 Thread Scott Carey
On 3/28/12 11:01 AM, Meyer, Dennis dennis.me...@adtech.com wrote:

 Hi,
 
 What type refers to a Java BigInt or a C long long? Or is there any other
 type in Avro that maps to a 64-bit unsigned int?
 
 I unfortunately could only find smaller types in the docs:
 Primitive Types
 The set of primitive type names is:
 * string: unicode character sequence
 * bytes: sequence of 8-bit bytes
 * int: 32-bit signed integer
 * long: 64-bit signed integer
 * float: single precision (32-bit) IEEE 754 floating-point number
 * double: double precision (64-bit) IEEE 754 floating-point number
 * boolean: a binary value
 * null: no value
 
 Anyway, in the encoding section there's some 64-bit unsigned encoding. Can I
 use it somehow via a type?

An unsigned value fits in a signed one.  They are both 64 bits.  Each
language that supports a long unsigned type has its own way to convert from
one to the other without loss of data.

 A workaround might be to use the 52 significant bits of a double, but that
 seems like a hack and of course loses some number space compared to uint64.
 I'd like to get around any other self-encoding hacks as I'd like to also use
 Hadoop/Pig/Hive on top of Avro, so I would like to keep functionality on
 numbers if possible.

Java does not have an unsigned 64 bit type.  Hadoop/Pig/Hive all only have
signed 64 bit integer quantities.

Luckily, multiplication and addition on two's complement signed values are
identical to the operations on unsigned ints, so for many operations there
is no loss in fidelity as long as you pass the raw bits on to something that
interprets the number as an unsigned quantity.

That is, if you take the raw bits of a set of unsigned 64 bit numbers, and
treat those bits as if they are a signed 64 bit quantities, then do
addition, subtraction, and multiplication on them, then treat the raw bit
result as an unsigned 64 bit value, it is as if you did the whole thing
unsigned.

http://en.wikipedia.org/wiki/Two%27s_complement
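
If you do need the full unsigned range as a number on the Java side, rather
than just carrying the bits through, a sketch of reinterpreting the stored
Avro long (getLongField() is a hypothetical accessor; uses java.nio.ByteBuffer
and java.math.BigInteger):

  long raw = record.getLongField();                // the Avro long carries the raw 64 bits
  byte[] bits = ByteBuffer.allocate(8).putLong(raw).array();
  BigInteger unsigned = new BigInteger(1, bits);   // signum 1: treat the bits as unsigned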

Avro only has signed 32 and 64 bit integer quantities because they can be
mapped to unsigned ones in most cases without a problem and many (actually,
most) languages do not support unsigned integers.

If you want various precision quantities you can use an Avro Fixed type to
map to any type you choose.  For example you can use a 16 byte fixed to map
to 128 bit unsigned ints.
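
For example, a sketch of such a fixed type:

  { "type": "fixed", "name": "uint128", "size": 16 }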

 
 Thanks,
 Dennis




Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem

2012-03-26 Thread Scott Carey
Avro Java's file writer[1] (the last several versions) rewinds its buffer if
there is an exception during writing, so if there are writes afterwards the
file will not be corrupt.  However, most tools are not so careful.

[1] DataFileWriter.append()
http://svn.apache.org/repos/asf/avro/trunk/lang/java/avro/src/main/java/org/
apache/avro/file/DataFileWriter.java
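
A sketch of relying on that behavior, where writer and records are assumed to
already exist; a datum that fails to encode is skipped and the file stays
consistent for later appends:

  for (GenericRecord r : records) {
    try {
      writer.append(r);          // rewinds its internal buffer if the write fails
    } catch (Exception badDatum) {
      // log and skip the bad record, then keep appending
    }
  }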


On 3/23/12 8:27 PM, Russell Jurney russell.jur...@gmail.com wrote:

 Ok, now I have a followup question...
 
 how does one recover from an exception writing an Avro?  The incomplete record
 is being written, which is crashing the reader.
 
 On Fri, Mar 23, 2012 at 8:01 PM, Russell Jurney russell.jur...@gmail.com
 wrote:
 Thanks Scott, looking at the raw data it seems to have been a truncated
 record due to UTF problems.
 
 Russell Jurney http://datasyndrome.com
 
 On Mar 23, 2012, at 7:59 PM, Scott Carey scottca...@apache.org wrote:
 
 
 It appears to be reading a union index and failing in there somehow.  If it
 did not have any of the pig AvroStorage stuff in there I could tell you
 more.
 
What does avro-tools.jar's 'tojson' tool do?  (java -jar
 avro-tools-1.6.3.jar tojson file | your_favorite_text_reader)
 What version of Avro is the java stack trace below?
 
 
 On 3/23/12 7:01 PM, Russell Jurney russell.jur...@gmail.com wrote:
 
 I have a problem record I've written in Avro that crashes anything which
 tries to read it :(
 
 Can anyone make sense of these errors?
 
 The exception in Pig/AvroStorage is this:
 
 java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64
 at 
 org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java
 :275)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordRead
 er.nextKeyValue(PigRecordReader.java:187)
 at 
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapT
 ask.java:532)
 at 
 org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
 at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
 at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
 at 
 org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:14
 2)
 at 
 org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvr
 oDatumReader.java:67)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:13
 8)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:12
 9)
 at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
 at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
 at 
 org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(
 PigAvroRecordReader.java:80)
 at 
 org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java
 :273)
 ... 7 more
 
 When reading the record in Python:
 
 File /me/Collecting-Data/src/python/cat_avro, line 21, in module
 for record in df_reader:
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6
 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/datafile.py, line 354,
 in next
 datum = self.datum_reader.read(self.datum_decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6
 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 445, in
 read
 return self.read_data(self.writers_schema, self.readers_schema,
 decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6
 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 490, in
 read_data
 return self.read_record(writers_schema, readers_schema, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6
 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 690, in
 read_record
 field_val = self.read_data(field.type, readers_field.type, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6
 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 488, in
 read_data
 return self.read_union(writers_schema, readers_schema, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6
 /site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 650, in
 read_union
 raise SchemaResolutionException(fail_msg, writers_schema,
 readers_schema)
 avro.io.SchemaResolutionException: Can't access branch index 64 for union
 with 2 branches
 
 When reading the record in Ruby:
 
 /Users/peyomp/.rvm/gems/ruby-1.8.7-p352/gems/avro-1.6.1/lib/avro/io.rb:298
 :in `read_data': Writer's schema  and Reader's schema [string,null] do
 not match. (Avro::IO::SchemaMatchException)
 
 -- 
 Russell Jurney twitter.com/rjurney http://twitter.com/rjurney
 russell.jur...@gmail.com mailto:russell.jur...@gmail.com
 datasyndrome.com http://datasyndrome.com/
 
 
 
 -- 
 Russell Jurney twitter.com/rjurney http://twitter.com

Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem

2012-03-23 Thread Scott Carey

It appears to be reading a union index and failing in there somehow.  If it
did not have any of the pig AvroStorage stuff in there I could tell you
more.

What does avro-tools.jar's 'tojson' tool do?  (java -jar
avro-tools-1.6.3.jar tojson file | your_favorite_text_reader)
What version of Avro is the java stack trace below?


On 3/23/12 7:01 PM, Russell Jurney russell.jur...@gmail.com wrote:

 I have a problem record I've written in Avro that crashes anything which tries
 to read it :(
 
 Can anyone make sense of these errors?
 
 The exception in Pig/AvroStorage is this:
 
 java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64
 at 
 org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:27
 5)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.
 nextKeyValue(PigRecordReader.java:187)
 at 
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask
 .java:532)
 at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
 at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
 at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
 at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
 at 
 org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDa
 tumReader.java:67)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
 at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
 at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
 at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
 at 
 org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(Pig
 AvroRecordReader.java:80)
 at 
 org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:27
 3)
 ... 7 more
 
 When reading the record in Python:
 
 File /me/Collecting-Data/src/python/cat_avro, line 21, in module
 for record in df_reader:
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si
 te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/datafile.py, line 354, in
 next
 datum = self.datum_reader.read(self.datum_decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si
 te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 445, in read
 return self.read_data(self.writers_schema, self.readers_schema, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si
 te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 490, in read_data
 return self.read_record(writers_schema, readers_schema, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si
 te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 690, in
 read_record
 field_val = self.read_data(field.type, readers_field.type, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si
 te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 488, in read_data
 return self.read_union(writers_schema, readers_schema, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/si
 te-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 650, in
 read_union
 raise SchemaResolutionException(fail_msg, writers_schema, readers_schema)
 avro.io.SchemaResolutionException: Can't access branch index 64 for union
 with 2 branches
 
 When reading the record in Ruby:
 
 /Users/peyomp/.rvm/gems/ruby-1.8.7-p352/gems/avro-1.6.1/lib/avro/io.rb:298:in
 `read_data': Writer's schema  and Reader's schema [string,null] do not
 match. (Avro::IO::SchemaMatchException)
 
 -- 
 Russell Jurney twitter.com/rjurney http://twitter.com/rjurney
 russell.jur...@gmail.com mailto:russell.jur...@gmail.com  datasyndrome.com
 http://datasyndrome.com/




Re: Globbing several AVRO files with different (extended) schemes

2012-03-20 Thread Scott Carey
I'm assuming you are using Pig's AvroStorage function. It appears that it
does not support schema migration, but it certainly could do so.  A
collection of avro files can be 'viewed' as if they all are of one schema
provided they can all resolve to it.  I have several tools that do this
successfully with MapReduce/Pig/Hive.
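
A sketch of that outside of Pig, where every file is opened with the same
target schema (newestSchema and avroFile are assumed to exist):

  GenericDatumReader<GenericRecord> datumReader =
      new GenericDatumReader<GenericRecord>(null, newestSchema);  // writer schema comes from the file
  DataFileReader<GenericRecord> fileReader =
      new DataFileReader<GenericRecord>(avroFile, datumReader);
  for (GenericRecord r : fileReader) {
    // fields missing from an older writer schema appear with their declared defaults
  }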

The Pig AvroStorage tool is maintained by the Apache Pig project, you will
need to inquire there in order to get more details.

-Scott



On 3/20/12 2:27 AM, Markus Resch markus.re...@adtech.de wrote:

Hi guys,

Thanks again for your awesome hint about sqoop.

I have another question: The Data I'm working with is stored as AVRO
Files in the Hadoop. When I try to glob them everything works just
perfectly. But. When I add the schema of a single data file while the
others remain everything gets wrecked:

currently we assume all avro files under the same location
 * share the same schema and will throw exception if not.

(e.g. I add a new data field.) Expected behavior for me would be: if I'm
globbing several files with slightly different schemas, the result of the
LOAD would either return the intersection of all valid fields that are
common to both schemas, or the atoms of the missing fields would be nulled.

How could I handle this properly?

Thanks 

Markus








Re: a possible bug in Avro MapReduce

2012-03-20 Thread Scott Carey
Perhaps it is 
https://issues.apache.org/jira/browse/AVRO-1045

Are you creating a copy of the GenericRecord?

-Scott


On 3/19/12 3:34 PM, ey-chih chow eyc...@hotmail.com wrote:

 Hi,
 
 We got an Avro MapReduce job with the signature of the map function as
 follows:
 
 
 public void map(ByteBuffer input, AvroCollectorPairUtf8, GenericRecord
 collector, Reporter reporter) throws IOException;
 
 
 However, the position of the ByteBuffer input, i.e. input.position(), is
 always set to 0 when map() gets invoked.  With this, we can not extract data
 from input.  This is for the version of avro 1.5.4.  For the older versions of
 avro, input.position() is set to the end of the input data.  Is there anybody
 knows why this gets set to 0?  Or is this a bug?
 
 Ey-Chih Chow





Re: Java MapReduce Avro Jackson Error

2012-03-19 Thread Scott Carey
What version of Avro are you using?

You may want to try Avro 1.6.3 + Jackson 1.8.8.

This is related, but is not your exact problem.
https://issues.apache.org/jira/browse/AVRO-1037
 
You are likely pulling in some other version of jackson somehow.  You may
want to use 'mvn dependency:tree' on your project to see where all the
dependencies are coming from.  That may help identify the culprit.

-Scott

On 3/19/12 5:06 PM, Deepak Nettem deepaknet...@gmail.com wrote:


Sorry,

I meant, I added the jackson-core-asl dependency, and still get the error.

<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-core-asl</artifactId>
  <version>1.5.2</version>
  <scope>compile</scope>
</dependency>


On Mon, Mar 19, 2012 at 8:05 PM, Deepak Nettem deepaknet...@gmail.com
wrote:

Hi Tatu,

I added the dependency:

<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <version>1.5.2</version>
  <scope>compile</scope>
</dependency>


But that still gives me this error:

Error: 
org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$F
eature;)Lorg/codehaus/jackson/JsonFactory;

Any other ideas?


On Mon, Mar 19, 2012 at 7:27 PM, Tatu Saloranta tsalora...@gmail.com
wrote:

On Mon, Mar 19, 2012 at 4:20 PM, Deepak Nettem deepaknet...@gmail.com
wrote:
 I found that the Hadoop lib directory contains
jackson-core-asl-1.0.1.jar
 and jackson-mapper-asl-1.0.1.jar.

 I removed these, but got this error:
 hadoop Exception in thread main java.lang.NoClassDefFoundError:
 org/codehaus/jackson/map/JsonMappingException

 I am using Maven as a build tool, and my pom.xml has this dependency:

 <dependency>
 <groupId>org.codehaus.jackson</groupId>
   <artifactId>jackson-mapper-asl</artifactId>
   <version>1.5.2</version>
   <scope>compile</scope>
 </dependency>

 Any help would on this issue would be greatly appreciated.


You may want to add similar entry for jackson-core-asl -- mapper does
require core, and although there is transient dependency from mapper,
Maven does not necessarily enforce correct version.
So it is best to add explicit dependency so that version of core is
also 1.5.x; you may otherwise just get 1.0.1 of that one.

-+ Tatu +-












Re: Java MapReduce Avro Jackson Error

2012-03-19 Thread Scott Carey
If you are using avro-tools, beware it is a shaded jar with all dependencies
inside of it for use as a command line tool (java -jar
avro-tools-VERSION.jar).

If you are using avro-tools in your project for some reason (there is really
only command line utilities in it) use the nodeps classifier:

<classifier>nodeps</classifier>

http://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.6.3/

Note the nodeps jar is 47K, while the default jar is 10MB.


For what it is worth, I removed the Jackson jar from our hadoop install long
ago.  It is used to dump configuration files to JSON there, a peripheral
feature we don't use.

Another thing that you may want to do is change your Hadoop dependency scope
to <scope>provided</scope> since Hadoop will be put on your classpath by the
Hadoop environment.  Short of this, excluding the chained Hadoop
dependencies you aren't using (most likely: jetty, kfs, and the
tomcat:jasper and eclipse:jdt stuff) may help.

On 3/19/12 6:23 PM, Deepak Nettem deepaknet...@gmail.com wrote:

 Hi Tatu / Scott,
 
 Thanks for your replies. I replaced the earlier dependencies with these:
 
   <dependency>
 <groupId>org.apache.avro</groupId>
 <artifactId>avro-tools</artifactId>
 <version>1.6.3</version>
 </dependency>
 
 <dependency>
 <groupId>org.apache.avro</groupId>
 <artifactId>avro</artifactId>
 <version>1.6.3</version>
 </dependency>
 
 <dependency>
 <groupId>org.codehaus.jackson</groupId>
   <artifactId>jackson-mapper-asl</artifactId>
   <version>1.8.8</version>
   <scope>compile</scope>
 </dependency>
 
 <dependency>
 <groupId>org.codehaus.jackson</groupId>
   <artifactId>jackson-core-asl</artifactId>
   <version>1.8.8</version>
   <scope>compile</scope>
 </dependency>
 
 And this is my app dependency tree:
 
 [INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ AvroTest ---
 [INFO] org.avrotest:AvroTest:jar:1.0-SNAPSHOT
 [INFO] +- junit:junit:jar:3.8.1:test (scope not updated to compile)
 [INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
 [INFO] +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
 [INFO] +- net.sf.json-lib:json-lib:jar:jdk15:2.3:compile
 [INFO] |  +- commons-beanutils:commons-beanutils:jar:1.8.0:compile
 [INFO] |  +- commons-collections:commons-collections:jar:3.2.1:compile
 [INFO] |  +- commons-lang:commons-lang:jar:2.4:compile
 [INFO] |  +- commons-logging:commons-logging:jar:1.1.1:compile
 [INFO] |  \- net.sf.ezmorph:ezmorph:jar:1.0.6:compile
 [INFO] +- org.apache.avro:avro-tools:jar:1.6.3:compile
 [INFO] |  \- org.slf4j:slf4j-api:jar:1.6.4:compile
 [INFO] +- org.apache.avro:avro:jar:1.6.3:compile
 [INFO] |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
 [INFO] |  \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
 [INFO] \- org.apache.hadoop:hadoop-core:jar:0.20.2:compile
 [INFO]+- commons-cli:commons-cli:jar:1.2:compile
 [INFO]+- xmlenc:xmlenc:jar:0.52:compile
 [INFO]+- commons-httpclient:commons-httpclient:jar:3.0.1:compile
 [INFO]+- commons-codec:commons-codec:jar:1.3:compile
 [INFO]+- commons-net:commons-net:jar:1.4.1:compile
 [INFO]+- org.mortbay.jetty:jetty:jar:6.1.14:compile
 [INFO]+- org.mortbay.jetty:jetty-util:jar:6.1.14:compile
 [INFO]+- tomcat:jasper-runtime:jar:5.5.12:compile
 [INFO]+- tomcat:jasper-compiler:jar:5.5.12:compile
 [INFO]+- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
 [INFO]+- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
 [INFO]|  \- ant:ant:jar:1.6.5:compile
 [INFO]+- commons-el:commons-el:jar:1.0:compile
 [INFO]+- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
 [INFO]+- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile
 [INFO]+- net.sf.kosmosfs:kfs:jar:0.3:compile
 [INFO]+- hsqldb:hsqldb:jar:1.8.0.10:compile
 [INFO]+- oro:oro:jar:2.0.8:compile
 [INFO]\- org.eclipse.jdt:core:jar:3.1.1:compile
 
 I still get the same error. Is there anything specific I need to do other than
 changing dependencies in pom.xml to make this error go away?
 
 On Mon, Mar 19, 2012 at 9:12 PM, Tatu Saloranta tsalora...@gmail.com wrote:
 On Mon, Mar 19, 2012 at 6:06 PM, Scott Carey scottca...@apache.org wrote:
  What version of Avro are you using?
 
  You may want to try Avro 1.6.3 + Jackson 1.8.8.
 
  This is related, but is not your exact problem.
  https://issues.apache.org/jira/browse/AVRO-1037
 
  You are likely pulling in some other version of jackson somehow.  You may
  want to use 'mvn dependency:tree' on your project to see where all the
  dependencies are coming from.  That may help identify the culprit.
 
 This sounds like a good idea, and I agree in that this is probably
 still due to an old version lurking somewhere.
 
 -+ Tatu +-
 
 




Re: Make a copy of an avro record

2012-03-12 Thread Scott Carey
We should be generating Java 1.6 compatible code.

What version were you testing?

1.6.3 is near release, the RC is available here:
http://mail-archives.apache.org/mod_mbox/avro-dev/201203.mbox/%3C4F514F22.8
=070...@apache.org%3E

Does it have the same problem?



On 3/12/12 9:27 AM, Jeremy Lewi jer...@lewi.us wrote:


Thanks James and Doug. I was able to simply cast the output of
SpecificData...deepCopy to my type and it seems to bypass the problematic
methods decorated with @override.
What about the potential incompatibility with earlier versions of java
due to the change in semantics of @override? If this is really an issue
this seems like it would affect a lot of users particularly people using
Avro MapReduce on a cluster where upgrading java is not a trivial
proposition. In my particular case, the reduce processing requires
loading all values associated with the key into memory, which
necessitates a deep copy because the iterable object passed to the
reducer seems to be reusing the same instance.

Using SpecificData.get().deepCopy(record) seems like a viable workaround.
Nonetheless, it does seem a bit problematic if the compiler is generating
code that is incompatible with earlier versions of java.
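
For what it's worth, the copy in the reducer looks something like this sketch
(GraphNodeData as in James's example; values is the reduce Iterable):

  List<GraphNodeData> buffered = new ArrayList<GraphNodeData>();
  for (GraphNodeData value : values) {
    // copy before keeping it, since the Iterable reuses one underlying instance
    buffered.add(SpecificData.get().deepCopy(value.getSchema(), value));
  }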

J

On Mon, Mar 12, 2012 at 9:05 AM, Doug Cutting cutt...@apache.org wrote:

On 03/11/2012 10:22 PM, James Baldassari wrote:
 If you want to make a deep copy of a specific record, the easiest way is
 probably to use the Builder API,
 e.g. GraphNodeData.newBuilder(recordToCopy).build().


SpecificData.get().deepCopy(record) should work too.

Doug







Re: parsing Avro data files in JavaScript

2012-02-21 Thread Scott Carey
See also the discussion about a JavaScript Avro implementation from last
week:

http://search-hadoop.com/m/MiNCyvLts/HttpTranceiversubj=HttpTranceiver+and
+JSON+encoded+Avro+


On 2/21/12 7:56 AM, Carriere, Jeromy jero...@x.com wrote:

We're working on one to support the X.commerce Fabric:
https://github.com/xcommerce/node-avro

Which is based on:
http://code.google.com/p/javascript-avro/

--
Jeromy Carriere
Chief Architect
X.commerce

From: Kevin Meinert
ke...@subatomicglue.com
Reply-To: user@avro.apache.org
Date: Tue, 21 Feb 2012 09:51:49 -0600
To: user@avro.apache.org
Subject: parsing Avro data files in JavaScript


Does anyone have an example of a avro binary parser for Javascript?
I don't see a JS implementation in the downloads.

Also, curious if anyone has written a simple parser for avro binary
before in any other language. Or have tips for writing one.


---
kevin meinert | http://www.subatomiclabs.com




Re: Order of the schema in Union

2012-02-21 Thread Scott Carey
As for why the union does not seem to match:
The union schemas are not the same as the one in the error: the one in the
error does not have a namespace.  It finds AVRO_NCP_ICM but the union has
only merced.AVRO_NCP_ICM and merced.AVRO_IVR_BY_CALLID.
The namespace and name must both match.

Is your output schema correct?  It looks like you are setting both your
MapOutputSchema and OutputSchema to be a Pair schema.  I suspect you only
want the Pair schema as a map output and reducer input, but cannot be sure
from the below.

From the below, your reducer must create Pair objects and output them, and
maybe that is related to the error below.  It may also be related to the
combiner, does it happen without it?



On 2/12/12 11:01 PM, Serge Blazhievsky easyv...@gmail.com wrote:

 Hi all,
 
 I am running into an interesting problem with Union. It seems that the order
 of the schemas in the union must match the order of the input paths for the
 different files.
 
 This does not look like right behavior. The code and exception are below.
 
 The moment I change the order in union it works.
 
 
 Thanks
 Serge
 
 
public int run(String[] strings) throws Exception {
 
 JobConf job = new JobConf();
 
 
 job.setNumMapTasks(map);
 job.setNumReduceTasks(reduce);
 
 
 // Uncomment to run locally in a single process
 job.set(mapred.job.tracker, local);
 
 File file = new File(input);
 DatumReader<GenericRecord> reader = new
 GenericDatumReader<GenericRecord>();
 DataFileReader<GenericRecord> dataFileReader = new
 DataFileReader<GenericRecord>(file, reader);
 
 Schema s = dataFileReader.getSchema();
 
 
 
   
 
 File lfile = new File(linput);
 DatumReader<GenericRecord> lreader = new
 GenericDatumReader<GenericRecord>();
 DataFileReader<GenericRecord> ldataFileReader = new
 DataFileReader<GenericRecord>(lfile, lreader);
 
 Schema s2 = ldataFileReader.getSchema();
 
   

List<Schema> slist = new ArrayList<Schema>();

slist.add(s2);
slist.add(s);



System.out.println(s.toString(true));
System.out.println(s2.toString(true));



 Schema s_union=Schema.createUnion(slist);
   
 

 AvroJob.setInputSchema(job, s_union);
 
 
 
 List<Schema.Field> fields = s.getFields();
 
 List<Schema.Field> outfields = new ArrayList<Schema.Field>();
 
 
 for (Schema.Field f : fields) {
 
 outfields.add(new Schema.Field(f.name http://f.name (),
 Schema.create(Type.STRING), null, null));
 }
 
 boolean b = false;
 Schema outschema = Schema.createRecord(AVRO_IVR_BY_CALLID,
 AVRO_IVR_BY_CALLID, merced, b);
 
 outschema.setFields(outfields);
 
 
 
 Schema STRING_SCHEMA = Schema.create(Schema.Type.STRING);
 
 
 Schema OUT_SCHEMA = new PairString, GenericRecord(, STRING_SCHEMA,
 new GenericData.Record(outschema), outschema).getSchema();
 

 AvroJob.setMapOutputSchema(job, OUT_SCHEMA);
 AvroJob.setOutputSchema(job, OUT_SCHEMA);
 
 AvroJob.setMapperClass(job, MapImpl.class);
 AvroJob.setCombinerClass(job, ReduceImpl.class);
 AvroJob.setReducerClass(job, ReduceImpl.class);
 
// FileInputFormat.setInputPaths(job, new Path(input));
 
 
 FileInputFormat.addInputPath(job, new Path(linput));
 FileInputFormat.addInputPath(job, new Path(input));
 
 
 
 
// MultipleInputs.addInputPath(job, new Path(input),
 AvroInputFormatGenericRecord.class, MapImpl.class);
 
 FileOutputFormat.setOutputPath(job, new Path(output));
 FileOutputFormat.setCompressOutput(job, true);
 
 int res = 255;
 RunningJob runJob = JobClient.runJob(job);
 if (runJob != null) {
 res = runJob.isSuccessful() ? 0 : 1;
 }
 return res;
 }
 
 
 2/02/12 22:56:52 WARN mapred.LocalJobRunner: job_local_0001
 org.apache.avro.AvroTypeException: Found {
   "type" : "record",
   "name" : "AVRO_NCP_ICM",
   "fields" : [ {
     "name" : "DATADATE",
     "type" : "string"
   }, {
     "name" : "ICM_CALLID",
     "type" : "string"
   }, {
     "name" : "AGENT_ELID",
     "type" : "string"
   }, {
     "name" : "AGENT_NAME",
     "type" : "string"
   }, {
     "name" : "AGENT_SITE",
     "type" : "string"
   }, {
     "name" : "AGENT_SVIEW_USER_ID",
     "type" : "string"
   }, {
     "name" : "AGENT_UNIT_ID",
     "type" : "string"
   }, {
     "name" : "ANI",
     "type" : "string"
   }, {
     "name" : "CALL_CTR_UNIT_ID",
     "type" : "string"
   }, {
     "name" : "CALL_FA_ID",
     "type" : "string"
   }, {
     "name" : "CALL_FUNCTIONALAREA",
     "type" : "string"
   }, {
     "name" : "CTI_CALL_IDENTIFIER",
     "type" : "string"
   }, {
     "name" : "CALLDISPOSITION",
     "type" : "string"
   }, {
     "name" : "AGENTPERIPHERALNUMBER",
     "type" : "string"
   }, {
     "name" : 

Re: HttpTranceiver and JSON-encoded Avro?

2012-02-15 Thread Scott Carey
See https://issues.apache.org/jira/browse/AVRO-485 for some discussion on
JavaScript for Avro.  Please comment in that ticket with your needs and
use case.  The project would welcome a JavaScript implementation.

On 2/15/12 2:07 PM, Frank Grimes frankgrime...@gmail.com wrote:

Are there any fast and stable ones you might recommend?


On 2012-02-15, at 4:22 PM, Russell Jurney wrote:

 FWIW, there are avro libs for JavaScript and node on github.
 
 Russell Jurney http://datasyndrome.com
 
 On Feb 15, 2012, at 7:32 AM, Frank Grimes frankgrime...@gmail.com
wrote:
 
 Hi All,
 
 Is there any way to send Avro data over HTTP encoded in JSON?
 We want to integrate with Node.js and JSON seems to be the
best/simplest way to do so.
 
 Thanks,
 
 Frank Grimes





Re: Writing Unsolicited Messages to a Connected Netty Client

2012-01-20 Thread Scott Carey
For certain kinds of data it would be useful to continuously stream data
from server to client (or vice-versa).  This can be represented as an Avro
array response or request where each array element triggers a callback at
the receiving end.  This likely requires an extension to the avro spec, but
is much more capable than a polling solution.  It is related to Comet in the
sense that the RPC request is long lived, but is effectively a sequence of
smaller inverse RPCs.  Polling in general has built-in race conditions for
many types of information exchange and should be avoided whenever such race
conditions exist.

For streaming large volumes of data, this would be much more efficient than
an individual RPC per item.  For example, if the RPC is I need to know
every state change in X polling is not an option, but streaming is.  If the
requirement is I need to know when the next state change occurs, but do not
need to know all changes polling is OK, and streaming may send too much
data.



On 1/20/12 11:25 AM, Armin Garcia armin.gar...@arrayent.com wrote:

 Hi James,
 
 I see your point. On a different NIO framework, I implemented exactly the same
 message handling procedure (ie message routing) you just described. I guess I
 was pushing the NettyTransceiver a bit beyond its intended scope.
 
 I'll take a look at the comet pattern and see what I can do with it.
 
 Again, thanks Shaun & James.
 
 -Armin
 
 
 On Fri, Jan 20, 2012 at 10:15 AM, James Baldassari jbaldass...@gmail.com
 wrote:
 Hi Armin,
 
 First I'd like to explain why the server-initiated messages are problematic.
 Allowing the server to send unsolicited messages back to the client may work
 for some Transceiver implementations (possibly PHP), but this change would
 not be compatible with NettyTransceiver.  When the NettyTransceiver receives
 a message from the server, it needs to know which callback to invoke in order
 to pass the message back correctly to the client.  There could be several
 RPCs in flight concurrently, so one of NettyTransceiver's jobs is to match
 up the response with the request that initiated it.  If the client didn't
 initiate the RPC then NettyTransceiver won't know where to deliver the
 message, unless there were some catch-all callback that would be invoked
 whenever one of these unsolicited messages were received.  So although
 you're probably only interested in the PHP client, allowing the server to
 send these unsolicited messages would potentially break NettyTransceiver (and
 possibly other implementations as well).
 
 Shaun's idea of having the client poll the server periodically would
 definitely work.  What we want to do is have the client receive notifications
 from the server as they become available on the server side, but we also
 don't want the client to be polling with such a high frequency that a lot of
 CPU and bandwidth resources are wasted.  I think we can get the best of both
 worlds by copying the Comet pattern, i.e. the long poll but using the Avro
 RPC layer instead of (or on top of) HTTP.  First we'll start with Shaun's
 update listener interface:
 
 protocol WeatherUpdateListener {
   WeatherUpdate listenForUpdate();
 }
 
 The PHP client would invoke this RPC against the server in a tight loop.  On
 the server side, the RPC will block until there is an update that is ready to
 be sent to the client.  When the client does receive an event from the server
 (or some timeout occurs), the client will immediately send another poll to
 the server and block until the next update is received.  In this way the
 client will not be flooding the server with RPCs, but the client will also
 get updates in a timely manner.
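 A rough Java sketch of that client-side loop (the PHP client would be
 analogous; the address, the running flag and handleUpdate() are placeholders,
 while WeatherUpdateListener and WeatherUpdate come from the protocol above):

   NettyTransceiver client =
       new NettyTransceiver(new InetSocketAddress("localhost", 65111));
   WeatherUpdateListener proxy =
       SpecificRequestor.getClient(WeatherUpdateListener.class, client);
   while (running) {
     // blocks on the server side until an update (or a timeout) is available
     WeatherUpdate update = proxy.listenForUpdate();
     handleUpdate(update);
   }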
 
 See the following for more info about Comet:
 http://www.javaworld.com/javaworld/jw-03-2008/jw-03-asynchhttp.html?page=6
 
 -James
 
 
 
 On Fri, Jan 20, 2012 at 12:44 PM, Armin Garcia armin.gar...@arrayent.com
 wrote:
 Hi Shaun,
 
 This is definitely another way.  I share your same concern.  I have to keep
 an eye out for high availablilty and high throughput.  I'll be depending on
 this connection to support a massive amount of data.
 
-Armin
 
 
 On Fri, Jan 20, 2012 at 9:25 AM, Shaun Williams shaun_willi...@apple.com
 wrote:
 Another solution is to use the response leg of a transaction to push
 messages to the client, e.g. provide a server protocol like this:
 
 WeatherUpdate listenForUpdate();
 
 This would essentially block until an update is available.  The only
 problem is that if the client is expecting a series of updates, it would
 need to call this method again after receiving each update.
 
 This is not an ideal solution, but it might solve your problem.
 
 -Shaun
 
 
 
 On Jan 20, 2012, at 8:24 AM, Armin Garcia wrote:
 
 
 Hi James,
  
 
 First, thank you for your response.
  
 
 Yes, you are right.  I am trying to setup a bi-directional communication
 link.  Your suggestion would definitely accomplish this requirement.  I
 was hoping the same channel could be 

Re: AVRO Path

2012-01-12 Thread Scott Carey
There are no plans that I know of currently, although the topic came up
two times in separate conversations last night at the SF Hadoop MeetUp.

I think an ability to extract a subset of a schema from a larger one and
read/write/transform data accordingly makes a lot of sense. Currently, the
Avro spec allows for schema resolution which is sort of a degenerate
schema extraction/transformation at the record level without the ability
to address or extract nested elements.  An addition to the spec for
describing other schema extractions may be useful.  Further discussion
should probably be in a JIRA ticket or at least on the dev list.

-Scott

On 1/10/12 1:02 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

Are there plans for (or is there already) an AVRO Path implementation
(like XPath, or JSON Path).

Thanks!




Re: Can spill to disk be in compressed Avro format to reduce I/O?

2012-01-12 Thread Scott Carey


On 1/12/12 11:24 AM, Frank Grimes frankgrime...@gmail.com wrote:

 Hi Scott,
 
 If I have a map-only job, would I want only one mapper running to pull all the
 records from the source input files and stream/append them to the target avro
 file?
 Would that be no different (or more efficient) than doing hadoop dfs -cat
 file1 file2 file3 and piping the output to append to a hadoop dfs -put
 combinedFile?
 In that case, my only question is how would I combine the avro files into a
 new file without deserializing them?

It would be different.  An Avro file has a header that contains the Schema
and compression codec info along with other metadata, followed by data
blocks.  Each data block has a record count and size prefix and a 16 byte
delimiter.  You cannot simply concatenate them together because the schema
or compression codec may differ, a header in the middle of the file is not
allowed, and the delimiter may differ.

http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWr
iter.html

DataFileWriter can append a pre-existing file with the same schema, in
particular look at the documentation for appendAllFrom()
http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWr
iter.html#appendAllFrom%28org.apache.avro.file.DataFileStream,%20boolean%29
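For example, something along these lines (a sketch only; file names are
placeholders, and both files must share the same schema):

  DataFileStream<GenericRecord> source = new DataFileStream<GenericRecord>(
      new FileInputStream("part2.avro"), new GenericDatumReader<GenericRecord>());

  DataFileWriter<GenericRecord> target =
      new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
  target.appendTo(new File("combined.avro"));  // re-open an existing file for appending
  target.appendAllFrom(source, false);         // false = do not recompress the copied blocks
  target.close();
  source.close();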



 
 Thanks,
 
 Frank Grimes
 
 
 On 2012-01-12, at 1:14 PM, Scott Carey wrote:
 
 
 
 On 1/12/12 8:27 AM, Frank Grimes frankgrime...@gmail.com wrote:
 
 Hi All,
 
 We have Avro data files in HDFS which are compressed using the Deflate
 codec.
 We have written an M/R job using the Avro Mapred API to combine those files.
 
 It seems to be working fine, however when we run it we notice that the
 temporary work area (spills, etc) seem to be uncompressed.
 We're thinking we might see a speedup due to reduced I/O if the temporary
 files are compressed as well.
 
 If all you want to do is combine the files, there is no reason to deserialize
 and reserialize the contents, and a map-only job could suffice.
 If this is the case, you might want to consider one of two optoins:
 1.  Use a map only job, with a combined file input.  This will produce one
 file per mapper and no intermediate data.
 2.  Use the Avro data file API to append to a file.  I am not sure if this
 will work with HDFS without some modifications to Avro, but it should be
 possible since the data file APIs can take InputStream/OutputStream.  The
 data file API has the ability to append data blocks from the file if the
 schemas are an exact match.  This can be done without deserialization, and
 optionally can change the compression level or leave it alone.
 
 
 Is there a way to enable mapred.compress.map.output in such a way that
 those temporary files are compressed as Avro/Deflate?
 I tried simply setting conf.setBoolean(mapred.compress.map.output, true);
 but it didn't seem to have any effect.
 
 I am not sure, as I haven't tried it myself.  However, the Avro M/R should be
 able to leverage all of the Hadoop compressed intermediate forms.  LZO/Snappy
 are fast and in our cluster Snappy is the default.  Deflate can be a lot
 slower but much more compact.
 
 
 Note that in order to avoid unnecessary sorting overhead, I made each key a
 constant (1L) so that the logs are combined but ordering isn't necessarily
 preserved. (we don't care about ordering)
 
 In that case, I think you can use a map only job.  There may be some work to
 get a single mapper to read many files however.
 
 
 FYI, here are my mapper and reducer.
 
 
 public static class AvroReachMapper extends AvroMapper<DeliveryLogEvent,
     Pair<Long, DeliveryLogEvent>> {
   public void map(DeliveryLogEvent levent,
       AvroCollector<Pair<Long, DeliveryLogEvent>> collector, Reporter reporter)
       throws IOException {

     collector.collect(new Pair<Long, DeliveryLogEvent>(1L, levent));
   }
 }

 public static class Reduce extends AvroReducer<Long, DeliveryLogEvent,
     DeliveryLogEvent> {

   @Override
   public void reduce(Long key, Iterable<DeliveryLogEvent> values,
       AvroCollector<DeliveryLogEvent> collector, Reporter reporter)
       throws IOException {

     for (DeliveryLogEvent event : values) {
       collector.collect(event);
     }
   }
 }
 
 Also, I'm setting the following:
 
 AvroJob.setInputSchema(conf, DeliveryLogEvent.SCHEMA$);
 AvroJob.setMapperClass(conf, Mapper.class);
 AvroJob.setMapOutputSchema(conf, SCHEMA);
 
 AvroJob.setOutputSchema(conf, DeliveryLogEvent.SCHEMA$);
 AvroJob.setOutputCodec(conf, DataFileConstants.DEFLATE_CODEC);
 AvroOutputFormat.setDeflateLevel(conf, 9);
 AvroOutputFormat.setSyncInterval(conf, 1024 * 256);
 
 AvroJob.setReducerClass(conf, Reducer.class);
 
 JobClient.runJob(conf);
 
 
 Thanks,
 
 Frank Grimes
 




Re: Can spill to disk be in compressed Avro format to reduce I/O?

2012-01-12 Thread Scott Carey


On 1/12/12 12:35 PM, Frank Grimes frankgrime...@gmail.com wrote:

 So I decided to try writing my own AvroStreamCombiner utility and it seems to
 choke when passing multiple input files:
 
 hadoop dfs -cat hdfs://hadoop/machine1.log.avro
 hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh 
 combined.log.avro
 
 Exception in thread main java.io.IOException: Invalid sync!
 at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
 at org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329)
 at DeliveryLogAvroStreamCombiner.main(Unknown Source)
 
 
 Here's the code in question:
 
 public class DeliveryLogAvroStreamCombiner {

     /**
      * @param args
      */
     public static void main(String[] args) throws Exception {
         DataFileStream<DeliveryLogEvent> dfs = null;
         DataFileWriter<DeliveryLogEvent> dfw = null;

         try {
             dfs = new DataFileStream<DeliveryLogEvent>(System.in,
                 new SpecificDatumReader<DeliveryLogEvent>());

             OutputStream stdout = System.out;

             dfw = new DataFileWriter<DeliveryLogEvent>(
                 new SpecificDatumWriter<DeliveryLogEvent>());
             dfw.setCodec(CodecFactory.deflateCodec(9));
             dfw.setSyncInterval(1024 * 256);
             dfw.create(DeliveryLogEvent.SCHEMA$, stdout);

             dfw.appendAllFrom(dfs, false);

dfs is from System.in, which has multiple files one after the other.  Each
file will need its own DataFileStream (has its own header and metadata).

In Java you could get the list of files, and for each file use HDFS's API to
open the file stream, and append that to your one file.
In bash you could loop over all the source files and append one at a time
(the above fails on the second file).  However, in order to append to the
end of a pre-existing file the only API now takes a File, not a seekable
stream, so Avro would need a patch to allow that in HDFS (also, only an HDFS
version that supports appends would work).
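A sketch of that per-file approach (conf, sourcePaths and the output path are
placeholders; DeliveryLogEvent is the generated class from the code above):

  FileSystem fs = FileSystem.get(conf);
  DataFileWriter<DeliveryLogEvent> out =
      new DataFileWriter<DeliveryLogEvent>(new SpecificDatumWriter<DeliveryLogEvent>());
  out.create(DeliveryLogEvent.SCHEMA$, fs.create(new Path("/logs/combined.log.avro")));
  for (Path src : sourcePaths) {  // e.g. the machine*.log.avro files
    DataFileStream<DeliveryLogEvent> in = new DataFileStream<DeliveryLogEvent>(
        fs.open(src), new SpecificDatumReader<DeliveryLogEvent>());
    out.appendAllFrom(in, false);  // copy data blocks without deserializing records
    in.close();
  }
  out.close();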

Other things of note:
You will probably get better total file size compression by using a larger
sync interval (1M to 4 M) than deflate level 9.  Deflate 9 is VERY slow and
almost never compresses more than 1% better than deflate 6, which is much
faster.  I suggest experimenting with the 'recodec' option on some of your
files to see what the best size / performance tradeoff is.  I doubt that
256K (pre-compression) blocks with level 9 compression is the way to go.

For reference: http://tukaani.org/lzma/benchmarks.html
(gzip uses deflate compression)

-Scott


         }
         finally {
             if (dfs != null) try { dfs.close(); } catch (Exception e) { e.printStackTrace(); }
             if (dfw != null) try { dfw.close(); } catch (Exception e) { e.printStackTrace(); }
         }
     }

 }
 
 Is there any way this could be made to work without needing to download the
 individual files to disk and calling append for each of them?
 
 Thanks,
 
 Frank Grimes
 
 
 On 2012-01-12, at 2:24 PM, Frank Grimes wrote:
 
 Hi Scott,
 
 If I have a map-only job, would I want only one mapper running to pull all
 the records from the source input files and stream/append them to the target
 avro file?
 Would that be no different (or more efficient) than doing hadoop dfs -cat
 file1 file2 file3 and piping the output to append to a hadoop dfs -put
 combinedFile?
 In that case, my only question is how would I combine the avro files into a
 new file without deserializing them?
 
 Thanks,
 
 Frank Grimes
 
 
 On 2012-01-12, at 1:14 PM, Scott Carey wrote:
 
 
 
 On 1/12/12 8:27 AM, Frank Grimes frankgrime...@gmail.com wrote:
 
 Hi All,
 
 We have Avro data files in HDFS which are compressed using the Deflate
 codec.
 We have written an M/R job using the Avro Mapred API to combine those
 files.
 
 It seems to be working fine, however when we run it we notice that the
 temporary work area (spills, etc) seem to be uncompressed.
 We're thinking we might see a speedup due to reduced I/O if the temporary
 files are compressed as well.
 
 If all you want to do is combine the files, there is no reason to
 deserialize and reserialize the contents, and a map-only job could suffice.
 If this is the case, you might want to consider one of two optoins:
 1.  Use a map only job, with a combined file input.  This will produce one
 file per mapper and no intermediate data.
 2.  Use the Avro data file API to append to a file.  I am not sure if this
 will work with HDFS without some modifications to Avro, but it should be
 possible since the data file APIs can take InputStream/OutputStream.  The
 data file API has the ability to append data blocks from the file if the
 schemas are an exact match.  This can be done without deserialization, and
 optionally can change the compression level or leave it alone.
 
 
 Is there a way to enable mapred.compress.map.output in such a way that
 those temporary files are compressed as Avro/Deflate?
 I tried simply setting conf.setBoolean(mapred.compress.map.output, true);
 but it didn't seem to have any effect.
 
 I am not sure, as I haven't tried it myself.  However, the Avro M/R should
 be able to leverage all

Re: Can spill to disk be in compressed Avro format to reduce I/O?

2012-01-12 Thread Scott Carey
The Recodec tool may be useful, and the source code is a good reference.

java -jar avro-tools-VERSION.jar
http://svn.apache.org/viewvc/avro/tags/release-1.6.1/lang/java/tools/src/ma
in/java/org/apache/avro/tool/RecodecTool.java?view=co

https://issues.apache.org/jira/browse/AVRO-684



On 1/12/12 12:53 PM, Scott Carey scottca...@apache.org wrote:




On 1/12/12 12:35 PM, Frank Grimes frankgrime...@gmail.com wrote:


So I decided to try writing my own AvroStreamCombiner utility and it
seems to choke when passing multiple input files:


hadoop dfs -cat hdfs://hadoop/machine1.log.avro
hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh 
combined.log.avro




Exception in thread main java.io.IOException: Invalid sync!

at 
org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
at 
org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329
)
at DeliveryLogAvroStreamCombiner.main(Unknown Source)




Here's the code in question:

public class DeliveryLogAvroStreamCombiner {


  /**
   * @param args
   */
  public static void main(String[] args) throws Exception {
  DataFileStreamDeliveryLogEvent dfs = null;
  DataFileWriterDeliveryLogEvent dfw = null;


  try {
  dfs = new DataFileStreamDeliveryLogEvent(System.in, 
 new
SpecificDatumReaderDeliveryLogEvent());


  OutputStream stdout = System.out;


  dfw = new DataFileWriterDeliveryLogEvent(new
SpecificDatumWriterDeliveryLogEvent());
  dfw.setCodec(CodecFactory.deflateCodec(9));
  dfw.setSyncInterval(1024 * 256);
  dfw.create(DeliveryLogEvent.SCHEMA$, stdout);

  dfw.appendAllFrom(dfs, false);




dfs is from System.in, which has multiple files one after the other.
Each file will need its own DataFileStream (has its own header and
metadata).   

In Java you could get the list of files, and for each file use HDFS's API
to open the file stream, and append that to your one file.
In bash you could loop over all the source files and append one at a time
(the above fails on the second file).  However, in order to append to the
end of a pre-existing file the only API now takes a File, not a seekable
stream, so Avro would need a patch to allow that in HDFS (also, only an
HDFS version that supports appends would work).

Other things of note:
You will probably get better total file size compression by using a
larger sync interval (1M to 4 M) than deflate level 9.  Deflate 9 is VERY
slow and almost never compresses more than 1% better than deflate 6,
which is much faster.  I suggest experimenting with the 'recodec' option
on some of your files to see what the best size / performance tradeoff
is.  I doubt that 256K (pre-compression) blocks with level 9 compression
is the way to go.

For reference: http://tukaani.org/lzma/benchmarks.html
(gzip uses deflate compression)

-Scott



  }
  finally {
  if (dfs != null) try {dfs.close();} catch (Exception e)
{e.printStackTrace();}
  if (dfw != null) try {dfw.close();} catch (Exception e)
{e.printStackTrace();}
  }
  }

}


Is there any way this could be made to work without needing to download
the individual files to disk and calling append for each of them?

Thanks,

Frank Grimes


On 2012-01-12, at 2:24 PM, Frank Grimes wrote:


Hi Scott,

If I have a map-only job, would I want only one mapper running to pull
all the records from the source input files and stream/append them to
the target avro file?
Would that be no different (or more efficient) than doing hadoop dfs
-cat file1 file2 file3 and piping the output to append to a hadoop dfs
-put combinedFile?
In that case, my only question is how would I combine the avro files
into a new file without deserializing them?

Thanks,

Frank Grimes


On 2012-01-12, at 1:14 PM, Scott Carey wrote:




On 1/12/12 8:27 AM, Frank Grimes frankgrime...@gmail.com wrote:


Hi All,
We have Avro data files in HDFS which are compressed using the Deflate
codec.
We have written an M/R job using the Avro Mapred API to combine those
files.

It seems to be working fine, however when we run it we notice that the
temporary work area (spills, etc) seem to be uncompressed.
We're thinking we might see a speedup due to reduced I/O if the
temporary files are compressed as well.



If all you want to do is combine the files, there is no reason to
deserialize and reserialize the contents, and a map-only job could
suffice.
If this is the case, you might want to consider one of two optoins:
1.  Use a map only job, with a combined file input.  This will produce
one file per mapper and no intermediate data.
2.  Use the Avro data file API to append to a file.  I am not sure if
this will work with HDFS without some modifications to Avro, but it
should be possible since the data file APIs can take

Re: Can spill to disk be in compressed Avro format to reduce I/O?

2012-01-12 Thread Scott Carey


On 1/12/12 5:52 PM, Frank Grimes frankgrime...@gmail.com wrote:

 Hi Scott,
 
 I've looked into this some more and I now see what you mean about appending to
 HDFS directly not being possible with the current DataFileWriter API.
 
 That's unfortunate because we really would like to avoid needing to hit disk
 just to write temporary files. (and the associated cleanup)
 
 I kinda like the notion of not requiring HDFS APIs to achieve this merging of
 Avro files/streams.
 
 Assuming we wanted to be able to stream multiple files as in my example...
 could DataFileStream easily be changed to support that use case?
 i.e. allow it to skip/ignore subsequent header and metadata in the stream or
 not error out with Invalid sync!?

That may be possible, open a JIRA to discuss further.  It should be modified
to 'reset' to the start of a new file or stream and continue from there,
since it needs to read the header and find the new sync value and validate
that the schemas match and the codec is compatible.  It may be possible to
detect the end of one file and the start of another if the files are
streamed back to back, but perhaps not reliably.
The avro-tools could be extended to have a command line tool that takes a
list of files (HDFS or local) and writes a new file (HDFS or local)
concatenated and possibly recodec'd.

 
 Thanks,
 
 Frank Grimes
 
 
 On 2012-01-12, at 3:53 PM, Scott Carey wrote:
 
 
 
 On 1/12/12 12:35 PM, Frank Grimes frankgrime...@gmail.com wrote:
 
 So I decided to try writing my own AvroStreamCombiner utility and it seems
 to choke when passing multiple input files:
 
 hadoop dfs -cat hdfs://hadoop/machine1.log.avro
 hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh 
 combined.log.avro
 
 Exception in thread main java.io.IOException: Invalid sync!
 at 
 org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
 at 
 org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329)
 at DeliveryLogAvroStreamCombiner.main(Unknown Source)
 
 
 Here's the code in question:
 
 public class DeliveryLogAvroStreamCombiner {
 
 /**
  * @param args
  */
 public static void main(String[] args) throws Exception {
 DataFileStreamDeliveryLogEvent dfs = null;
 DataFileWriterDeliveryLogEvent dfw = null;
 
 try {
 dfs = new DataFileStreamDeliveryLogEvent(System.in, new
 SpecificDatumReaderDeliveryLogEvent());
 
 OutputStream stdout = System.out;
 
 dfw = new DataFileWriterDeliveryLogEvent(new
 SpecificDatumWriterDeliveryLogEvent());
 dfw.setCodec(CodecFactory.deflateCodec(9));
 dfw.setSyncInterval(1024 * 256);
 dfw.create(DeliveryLogEvent.SCHEMA$, stdout);
 
 dfw.appendAllFrom(dfs, false);
 
 dfs is from System.in, which has multiple files one after the other.  Each
 file will need its own DataFileStream (has its own header and metadata).
 
 In Java you could get the list of files, and for each file use HDFS's API to
 open the file stream, and append that to your one file.
 In bash you could loop over all the source files and append one at a time
 (the above fails on the second file).  However, in order to append to the end
 of a pre-existing file the only API now takes a File, not a seekable stream,
 so Avro would need a patch to allow that in HDFS (also, only an HDFS version
 that supports appends would work).
 
 Other things of note:
 You will probably get better total file size compression by using a larger
 sync interval (1M to 4 M) than deflate level 9.  Deflate 9 is VERY slow and
 almost never compresses more than 1% better than deflate 6, which is much
 faster.  I suggest experimenting with the 'recodec' option on some of your
 files to see what the best size / performance tradeoff is.  I doubt that 256K
 (pre-compression) blocks with level 9 compression is the way to go.
 
 For reference: http://tukaani.org/lzma/benchmarks.html
 (gzip uses deflate compression)
 
 -Scott
 
 
 }
 finally {
 if (dfs != null) try {dfs.close();} catch (Exception e)
 {e.printStackTrace();}
 if (dfw != null) try {dfw.close();} catch (Exception e)
 {e.printStackTrace();}
 }
 }
 
 }
 
 Is there any way this could be made to work without needing to download the
 individual files to disk and calling append for each of them?
 
 Thanks,
 
 Frank Grimes
 
 
 On 2012-01-12, at 2:24 PM, Frank Grimes wrote:
 
 Hi Scott,
 
 If I have a map-only job, would I want only one mapper running to pull all
 the records from the source input files and stream/append them to the
 target avro file?
 Would that be no different (or more efficient) than doing hadoop dfs -cat
 file1 file2 file3 and piping the output to append to a hadoop dfs -put
 combinedFile?
 In that case, my only question is how would I combine the avro files into a
 new file without deserializing them?
 
 Thanks,
 
 Frank Grimes
 
 
 On 2012-01-12, at 1:14 PM, Scott Carey wrote:
 
 
 
 On 1/12/12 8:27 AM, Frank Grimes frankgrime...@gmail.com wrote:
 
 Hi All,
 
 We have Avro data files in HDFS which are compressed using the Deflate
 codec

Re: encoding problem for ruby client

2012-01-05 Thread Scott Carey
This sounds like the Ruby implementation does not correctly use UTF-8 on
your platform for encoding strings.  It may be a bug, but I am not
knowledgeable enough on the Ruby implementation to know for sure.

The Avro specification states that a string is encoded as a long followed
by that many bytes of UTF-8 encoded character data.
(http://avro.apache.org/docs/current/spec.html#binary_encode_primitive).
If you think that the Ruby implementation does not adhere to the spec,
please file a bug in JIRA.

Thanks!

-Scott

On 1/4/12 3:59 AM, kafka0102 kafka0102 yujianjia0...@gmail.com wrote:

 Hi.
 I use Avro's Java and Ruby clients. When they communicate, the Ruby client
 always encodes (and decodes) multi-byte UTF-8 characters as Latin-1. For now,
 when the data contains multi-byte characters, I first encode it with
 Iconv.conv("UTF8", "LATIN1", data) in the Ruby client, and then decode it with
 Utils.conv(data, "ISO-8859-1", "UTF-8") in the Java server. It works, but it
 is too ugly. I see the Avro Ruby client uses StringIO to pack the data, but I
 cannot find a way to make it support multi-byte characters.
 Can anyone help me?




Re: Collecting union-ed Records in AvroReducer

2011-12-08 Thread Scott Carey


On 12/8/11 4:10 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote:


Hallo,

is it possible to write/collect a union-ed record from an avro reducer?

I have a reduce class (extending AvroReducer), and the output schema is a
union schema of record type A and record type B. In the reduce logic I
want to combine instances of A and B in the same datum, passing it to my
Avrocollector. My code looks a bit like this:




If both records were created in the reducer, you can call collect twice,
once with each record.  Collect in general can be called as many times as
you wish.

If you want to combine two records into a single datum rather than emit
multiple datums, you do not want a union, you need a Record.  A union is a
single datum that may be only one of its branches in a single datum.

In short, do you want to emit both records individually or as a pair?  If
it is a pair, you need a Record, if it is multiple outputs or either/or,
it is a Union.
 



Record unionRecord = new GenericData.Record(myUnionSchema); // not legal!
unionRecord.put("type A", recordA);
unionRecord.put("type B", recordB);

collector.collect(unionRecord);

but GenericData.Record constructor expects a Record Schema. How can I
write both records such that they appear in the same output
 datum?

If your output is either one type or another, see Doug's answer.

for multiple datums, it is

output schema is a union of two records  (a datum is either one or the
other):
["RecordA", "RecordB"]
then the code is:

collector.collect(recordA);
collector.collect(recordB);


If you want a single datum that contains both a RecordA and a RecordB you
need to have your output schema be a Record with two fields:

{"type": "record", "fields": [
  {"name": "recordA", "type": "RecordA"},
  {"name": "recordB", "type": "RecordB"}
]}

And you would use this record schema to create the GenericRecord, and then
populate the fields with the inner records, then call collect once with
the outer record.
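Roughly (a sketch; wrapperSchema stands for the parsed two-field record schema
above):

  GenericRecord outer = new GenericData.Record(wrapperSchema);
  outer.put("recordA", recordA);
  outer.put("recordB", recordB);
  collector.collect(outer);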

Another choice is to output the output be an avro array of the union type
that may have any number of RecordA and RecordB's in a single datum.


Andrew




Re: Map having string, Object

2011-12-07 Thread Scott Carey
The best practice is usually to use the flexible schema with the union
value rather than transmit schemas each time.  This restricts the
possibilities to the set defined, and the type selected in the branch is
available on the decoding side.  In the case above the number of variants
is not too large for this approach to be unwieldy, and there may be
benefits of knowing the type on the other side without inspecting the
value.

You can construct an Avro schema that represents all possible data
variants, effectively tagging the types of every field during
serialization using unions.  However none of the Avro APIs are (yet)
optimized for this use case, it would be somewhat clumsy to work with, and
is less space efficient.  Other serialization systems are a better fit for
completely open-ended data schemas.

One can look at Avro as a serialization system, but I see it more as a
system for describing your data.  It provides tools for describing and
transforming data that exists in related forms (e.g. older or newer schema
versions) to the form you want to see (e.g. current schema).  Whether this
data is serialized or an object graph is less important than that it
conforms to a schema.  A transformation between a serialized form and an
object graph is one use case of many possibilities.

Think about your use case from that perspective.  Ask whether this is data
that gains benefit from describing it with an Avro Schema and then
interpreting it as conforming to that schema.  If it is completely open
ended there may be little benefit and significant overhead.

You can also embed JSON or binary JSON in Avro data fairly easily using
Jackson JSON.


On 12/7/11 9:10 AM, Gaurav Nanda gaurav...@gmail.com wrote:

I agree that in this case JSON would be equally helpful. But in my
application there is one more type of message, where untagged data can
provide compact data encoding. So to maintain consistency, I preferred
to send these kind of messages also using avro.

@where untagged data can provide compact data encoding.
In that case also, my schema has to be dynamically generated (i.e. at
runtime), so it has to be passed to the client. Would Avro still be better
than compressed JSON in that case?

Thanks,
Gaurav Nanda

On Wed, Dec 7, 2011 at 9:17 PM, Tatu Saloranta tsalora...@gmail.com
wrote:
 On Wed, Dec 7, 2011 at 5:16 AM, Gaurav gaurav...@gmail.com wrote:
 Hi,

 We have a requirement to send typed(key-value) pairs from server to
clients
 (in various languages).
 Value can be one of primitive types or a map of same (string, Object)
type.

 One option is to construct record schema on the fly and second option
is to
 use unions to write schema in a general way.

 Problems with 1 is that we have to construct schema everytime
depending upon
 keys and then attach the entire string schema to a relatively small
record.

 But in second schema, u don't need to write schema on the wire as it is
 present with client also.

 I have written one such sample schema:
 
{"type": "map", "values": ["int", "long", "float", "double", "string", "boolean",
  {"type": "map", "values": ["int", "long", "float", "double", "string", "boolean"]}]}

 Do you guys think writing something of this sort makes sense or is
there any
 better approach to this?

 For this kind of loose data, perhaps JSON would serve you better,
 unless you absolutely have to use Avro?

 -+ Tatu +-




Re: Reduce-side joins in Avro M/R

2011-12-07 Thread Scott Carey
This should be conceptually the same as a normal map-reduce join of the same
type.  Avro handles the serialization, but not the map-reduce algorithm or
strategy.   

On 12/6/11 8:43 AM, Andrew Kenworthy adwkenwor...@yahoo.com wrote:

 Hi,
 
 I'd like to use reduce-side joins in an avro M/R job, and am not sure how to
 do it: are there any best-practice tips or outlines of what one would have to
 implement in order to make this possible?
 
 Thanks,
 
 Andrew Kenworthy




Re: Importing in avdl from classpath of project

2011-12-07 Thread Scott Carey
I think that at minimum, it would be useful to have an option to 'also look
in the classpath' in the maven plugin, and have the option to do so in
general with the IDL compiler.   I would gladly review the patch in a JIRA.

-Scott

On 12/7/11 10:13 AM, Chau, Victor vic...@x.com wrote:

 Hello,
  
 I am trying to address a shortcoming of the way that the import feature works
 in IDL.  Currently, it looks like the only option is to place the file being
 imported inside the same directory as that of the importing avdl.
  
 In our setup, we have avdls that are spread among several maven projects that
 are owned by different teams.  I would like to be able to just create a
 dependency on another jar that contains the avdl I want to import and have
 Avro be smart enough to look for it in the classpath of the project containing
 the avdl.
  
 The main problem is to make all of this work with the avro-maven-plugin.  The
 plugin's runtime classpath is not the same as that of the maven project's
 classpath.  Through the magic of Stackoverflow, I figured out how to get the
 project's classpath, construct a new classloader, and pass it to the IDL
 compiler for it to look up the file if it is not available in the local
 directory.
  
 Is this a feature that people think would be useful?  Essentially, the IDL
 syntax would not change but the behavior is:
  
 1.   If imported file is available locally (in the current input path),
 use it
 
 2.   Else, look for it on the project's classpath.
 
  
 If so, I have a working patch that needs some cleanup but I can submit it as a
 feature request in JIRA.




Re: Best practice for versioning IDLs?

2011-11-29 Thread Scott Carey
I don't think there are yet best practices for what you are trying to do.

However, I suggest you first consider embedding the version as metadata in
the schema, rather than data.  If you put it in a Record, it will be data
serialized with every record.  If you put it as schema metadata, it will
only exist in the schemas and not the data.

In raw JSON schema form, the metadata can be added to any named type:
Record, Fixed, Enum, Protocol.   The doc field is a special named
metadata field, you can use it or add your own:

{
  "namespace": "com.acme",
  "protocol": "HelloWorld",
  "doc": "Protocol Greetings",
  "acme.version": "1.22.3",

  "types": [
    {"name": "Greeting", "type": "record", "fields": [
      {"name": "message", "type": "string"}]},
    {"name": "Curse", "type": "error", "fields": [
      {"name": "message", "type": "string"}]}
  ],

  "messages": {
    "hello": {
      "doc": "Say hello.",
      "request": [{"name": "greeting", "type": "Greeting"}],
      "response": "Greeting",
      "errors": ["Curse"]
    }
  }
}

http://avro.apache.org/docs/current/spec.html#Protocol+Declaration

For IDL, it should be possible to add a property using the
@propname(propval) annotation on the protocol.
http://avro.apache.org/docs/current/idl.html#defining_protocol

I have not tried this myself however.

If I had the setup to test it now, I would try to see if the below AvroIDL
creates an empty protocol with the acme.version property set:

@acme.version("1.22.3")
@namespace("com.acme")
protocol HelloWorld {

}
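One way to check (a sketch; the file name is a placeholder) is to parse the
IDL and read the property back from the resulting Protocol:

  Idl idl = new Idl(new File("helloworld.avdl"));
  Protocol p = idl.CompilationUnit();
  System.out.println(p.getProp("acme.version"));  // expect 1.22.3 if the annotation is carried through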



On 11/29/11 9:20 AM, George Fletcher gffle...@aol.com wrote:



  


  
  
Hi,

I'd like to incorporate a semver.org style versioning structure for the IDL
files we are using. The IDLs represent interfaces of services (a la SOA).

We currently manage our IDL files separately from the implementation, as
multiple services might use the same IDL. This makes it critical to have the
IDLs understand their version. I'd like to see our build process be able to
inject into the IDL the version from the build environment (currently maven).
Another option would be to define the version within the IDL. However, the
only way I can think of to do this is to create a Version record within each
IDL and then maybe have the record contain 3 string fields (major, minor,
patch).

Just wondering if there are any best practices already established for this
kind of IDL versioning.

Thanks,
George
  




Re: Overriding default velocity templates

2011-11-28 Thread Scott Carey
To the best of my recollection, the IDL custom template bits you mention
below have not been wired up through all of the tooling.  Please feel free
to submit JIRA tickets and patches to improve it.

Thanks!

-Scott

On 11/28/11 7:01 AM, George Fletcher gffle...@aol.com wrote:

 
  Hi,
  
  I'm looking for a way to override the default velocity templates used to
 generate java sources from IDL files. I know that I can do this by passing a
 command-line argument to override 'org.apache.avro.specific.templates' but
 that doesn't work well with our build process. We want a standard set of
 templates used by many developers.
  
  What is the best way to override the system property? It appears from the
 avro code that while the SpecificCompiler.java supports a setTemplateDir()
 method, nothing in the avro-maven-plugin calls this method.
  
  Thanks,
  George 




Re: Avro-mapred and new Java MapReduce API (org.apache.hadoop.mapreduce)

2011-11-13 Thread Scott Carey
I have heard some suggestions that it would be useful if we could somehow
model Avro's interaction with mapreduce using composition rather than
inheritance.  Has anyone tried that?  Or would it be too clumsy?  A good
relationship with the mapreduce/mapred api via composition might require
changes on the hadoop side however.

On 11/13/11 5:04 AM, Friso van Vollenhoven fvanvollenho...@xebia.com
wrote:

 Hi, 
 
 I use my own set of classes for this. I mostly copied from / modeled after the
 Avro mapred support for the old API.
 
 My approach is slightly different, though. The existing MR support fully
 abstracts / wraps away the Hadoop MR API and only exposes the Avro one. The
 only Hadoop API that the Avro classes see is the Configuration object. Problem
 is that in the new API, the Configuration object is kept within a context
 instance and you'd need to wrap the whole context thing and give the wrapper
 to the Avro mapper and reducer. This felt a bit overkill so I chose to just
 make mapper and reducer subclasses that handle the Avro work and then call a
 protected method to do the actual mapping or reducing. Problem is that you
 lose the property of a bare mapper or reducer being the identity function, but
 you could reintroduce this in a generic way, I think. I just don't use the
 identity functions a lot in practice, so I didn't bother.
 
 I pushed the code here: https://github.com/friso/avro-mapreduce. There is a
 unit test with some usage examples.
 
 
 Cheers,
 Friso
 
 
 
 On 11 nov. 2011, at 20:43, Doug Cutting wrote:
 
 On 11/10/2011 12:38 AM, Andrew Kenworthy wrote:
 Are there plans to extend it to work with org.apache.hadoop.mapreduce as
 well?
 
 There's an issue in Jira for this:
 
 https://issues.apache.org/jira/browse/AVRO-593
 
 I don't know of anyone actively working on this at present.  It would be
 a great addition to Avro and I am hopeful someone will resume work on it
 soon.
 
 Doug
 




Re: Does extending union break compatibility

2011-11-03 Thread Scott Carey

On 11/3/11 4:56 PM, Neil Davudo neil_dav...@yahoo.com wrote:

I have a record defined as follows

// version 1
record SomeRecord
{
union { null, TypeA } unionOfTypes;
}

I change the record to the following

// version 2
record SomeRecord
{
union { null, TypeA, TypeB } unionOfTypes;
}

Does the change break compatibility? Would data encoded using version 1
of the record definition be decodable using version 2 of the record
definition?

Readers with the second schema should be able to read data written with
the first schema, provided they use the API properly (both schemas must be
provided to the reader, so that it can translate from one to the other).

The reverse, reading data written in the latter schema with the first
schema, is possible as well provided that the first schema contains a
default value so that if the reader encounters a union branch it does not
know about, it can substitute the default value.
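In code, the forward case reads roughly like this with the generic API (a
sketch; schemaV1 and schemaV2 are the parsed record schemas above, and bytesV1
is binary data written with version 1):

  DatumReader<GenericRecord> reader =
      new GenericDatumReader<GenericRecord>(schemaV1, schemaV2);  // writer schema, reader schema
  Decoder decoder = DecoderFactory.get().binaryDecoder(bytesV1, null);
  GenericRecord record = reader.read(null, decoder);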



TIA

Neil




Re: How to add optional new record fields and/or new methods in avro-ipc?

2011-10-18 Thread Scott Carey
On 10/18/11 9:47 AM, Doug Cutting cutt...@apache.org wrote:

On 10/17/2011 08:14 PM, 常冰琳 wrote:
 What I do in the demo is add a new nullable string in server side, not
 change a string to nullable string.
 I add a new field with default value using specific, and it works fine,
 so I suspect the reason that reflect doesn't work is that I didn't add
 default value to the nullable string field.
 Perhaps the default value for nullable field should be null by default?

Reflect by default assumes that all values are not nullable.  This is
perhaps a bug, but the alternative is to make every non-numeric value
nullable, which would result in verbose schemas.

To amend this, you can use Avro's @Nullable annotation:

http://avro.apache.org/docs/current/api/java/org/apache/avro/reflect/Nulla
ble.html

This can be applied to parameters, return types and fields.

For example:

import org.apache.avro.reflect.Nullable;

public class Foo {
  @Nullable String x;
  public void setX(@Nullable String x) { this.x = x; }
  @Nullable public String getX() { return x; }
}


The problem is that this does not provide the ability to evolve schemas if
you add a field
since you would need @Default or something similar, as well:
@Nullable
@Default(null)

Does reflect have any concept of default values?




Doug




Re: How to add optional new record fields and/or new methods in avro-ipc?

2011-10-18 Thread Scott Carey


On 10/18/11 10:38 AM, Doug Cutting cutt...@apache.org wrote:

On 10/18/2011 10:09 AM, Scott Carey wrote:
 On 10/18/11 9:47 AM, Doug Cutting cutt...@apache.org wrote:
 To amend this, you can use Avro's @Nullable annotation:
 
 The problem is that this does not provide the ability to evolve schemas
if
 you add a field
 since you would need @Default or something similar, as well:
 @Nullable
 @Default(null)

I don't think this is required.  The default value for a union is the
default value for its first branch.  A null schema needs no default.
So the schema [null, string] needs to specify no default value while
the schema [string, null] does.  Thus the best practice for nullable
values is to place the null first in the union.  This is what is done by
the @Nullable annotation.

Perhaps we should clarify this in the Specification?  We might state
that a null schema implicitly has a default value of null since that's
the only value its ever permitted to have anyway.

Good to know.

So, any ideas what is causing the original User's problem?  @Nullable is
in use with Reflect (does not work), Specific works (with default values
but not without -- it appears to have null first but not confirmed).
I suspect there is something else going on.


 Does reflect have any concept of default values?

No.  We could add an @Default annotation, I suppose.  But I don't think
this is needed for nullable stuff.

Doug




Re: Avro mapred: How to avoid schema specification in job.xml?

2011-10-10 Thread Scott Carey
I'm not all that familiar with how Oozie interacts with Avro.

The Job must set its avro.input.schema and avro.output.schema properties;
this can be done in code (see the unit tests in the Avro mapred project for
examples), and if you are using SpecificRecords and DataFiles the schema is
available to the code where necessary.



On 10/10/11 5:41 AM, Julien Muller julien.mul...@ezako.com wrote:

 Hello,
 
 I have been using avro with hadoop and oozie for months now and I am very
 happy with the results.
 
 The only point I see as a limitation now is that we specify avro schemes in
 workflow.xml (job.xml):
 - avro.input.schema
 - avro.output.schema
 Since this info is already provided in Mapper/Reducer signatures, I see this
 as redundant. The schema is also present in all my serialized files, which
 means that the schema is specified in 3 different places.
 
 From a run point of view, this is a pain, since any schema modification (let's
 say a simple optional field added) forces me to update many job files. This
 task is very error prone and since we have a large amount of jobs, it
 generates a lot of work.
 
 The only solution I see now would be to find/replace in the build script, but
 I hope I could find a better solution by providing some generic schemes to the
 job file, or find a way to deactivate schema validation in the job. Any help
 will be appreciated!
 
 -- 
 Julien Muller




Re: Avro mapred: How to avoid schema specification in job.xml?

2011-10-10 Thread Scott Carey
On 10/10/11 11:41 AM, Julien Muller julien.mul...@ezako.com wrote:

 Hello,
 
 Thanks for your answer, let me try to clarify my context a bit:
 
 I'm not all that familiar with how Oozie interacts with Avro.
 Let's get oozie out of the picture. I use job.xml files to configure Jobs.
 This means I do not have any JobConf object and I cannot use AvroJob.
 Therefore I directly write the job properties (as what AvroJob outputs).
 
 The Job must set its avro.input.schema and avro.output.schema properties;
 this can be done in code (see the unit tests in the Avro mapred project for
 examples), 
 The solution I have now is basically based on the Avro mapred unit tests. But
 in my context, it is not an option to code (using the $SCHEMA property) at the
 job configuration level.
 where you code:
 AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
 I have to copy the entire schema in job.xml file. And I have to update it
 every time my schema get updated.
 I hope I can find a better solution.

I suppose that in AvroJob we could transmit only the class name in a
property, and use that to look up the schema for generated classes using
reflection.  Could you do something similar?  I don't think it is possible
to avoid configuring at least some sort of pointer to where the schema is.
This could be via a property, or if you already have the job class, an
annotation on that class.

 
 and if you are using SpecificRecords and DataFiles the schema is available to
 the code where necessary.
 I am not sure what you mean here. I am using SpecificRecords and would like to
 avoid specifying avro.input.schema, since this info is already here in the
 specific record.

Potentially the AvroMapper / AvroReducer could have a fall-back for
obtaining the schema if the property is not set: reflection on a class name
or an annotation.  If this looks like it is an enhancement request for Avro
(or a bug) please file a JIRA ticket.  Thanks!

 
 Thanks,
 
 Julien Muller
 
 2011/10/10 Scott Carey scottca...@apache.org
 I'm not all that familiar with how Oozie interacts with Avro.
 
 The Job must set its avro.input.schema and avro.output.schema properties;
 this can be done in code (see the unit tests in the Avro mapred project for
 examples), and if you are using SpecificRecords and DataFiles the schema is
 available to the code where necessary.
 
 
 
 On 10/10/11 5:41 AM, Julien Muller julien.mul...@ezako.com wrote:
 
 Hello,
 
 I have been using avro with hadoop and oozie for months now and I am very
 happy with the results.
 
 The only point I see as a limitation now is that we specify avro schemes in
 workflow.xml (job.xml):
 - avro.input.schema
 - avro.output.schema
 Since this info is already provided in Mapper/Reducer signatures, I see this
 as redundant. The schema is also present in all my serialized files, which
 means that the schema is specified in 3 different places.
 
 From a run point of view, this is a pain, since any schema modification
 (let's say a simple optional field added) forces me to update many job
 files. This task is very error prone and since we have a large amount of
 jobs, it generates a lot of work.
 
 The only solution I see now would be to find/replace in the build script,
 but I hope I could find a better solution by providing some generic schemes
 to the job file, or find a way to deactivate schema validation in the job.
 Any help will be appreciated!
 
 -- 
 Julien Muller
 




Re: Data incompatibility between Avro 1.4.1 and 1.5.4

2011-10-03 Thread Scott Carey
AVRO-793 was not a bug in the encoded data or its format.  It was a bug in
how schema resolution worked for certain projection corner cases during
deserialization.

Is your data readable with the same schema that wrote it?  (for example,
if it is an avro data file, you can use avro-tools.jar to print it out
with its own schema).
If the error only occurs when you try to use a different schema to read
than it was written with, it is most likely a bug with the schema
resolution process.  If so, file a bug.  We will need to reproduce it, so
the more information you can give us about the schemas the better.  Best
would be a reproducible test case but that may not be trivial.  At minimum
the stack trace you get with 1.5.4 could be enlightening.

Thanks!

-Scott


On 10/3/11 3:32 PM, W.P. McNeill bill...@gmail.com wrote:


I have a bunch of data that I serialized using the Avro 1.4.1 library. I
wanted use projection schemas with this data but I can't because of bug
793 (https://issues.apache.org/jira/browse/AVRO-793). So I changed my
code to use Avro 1.5.4. When I try to deserialize the Avro 1.4.1 data
with the new code built with Avro 1.5.4, I get the same runtime
deserialization errors described in JIRA 793.
Is this expected? Is there any way around it beyond reserializing all my
data using Avro 1.5.4?

(I think I'm asking whether JIRA 793 is just a problem with
deserialization or a problem with the binary serialization format.)




Re: In Java, how can I create an equivalent of an Apache Avro container file without being forced to use a File as a medium?

2011-10-03 Thread Scott Carey
In addition to Joe's comments:

On the write side, DataFileWriter.create() can take a file or an output
stream.
http://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/file/DataFileWrit
er.html

On the read side, DataFileStream can be used if the input does not have
random access and can be represented with an InputStream.
If the input has random access, implement SeekableInput and then construct a
DataFileReader with it:
http://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/file/SeekableInpu
t.html
http://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/file/DataFileRead
er.html
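Putting those together, a small sketch that round-trips through an in-memory
stream with no File involved (schema and someRecord are assumed to exist):

  ByteArrayOutputStream bytes = new ByteArrayOutputStream();
  DataFileWriter<GenericRecord> writer =
      new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
  writer.create(schema, bytes);
  writer.append(someRecord);
  writer.close();

  DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
      new ByteArrayInputStream(bytes.toByteArray()),
      new GenericDatumReader<GenericRecord>());
  for (GenericRecord r : reader) {
    System.out.println(r);
  }
  reader.close();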


On 9/24/11 2:14 AM, Bernard Liang liang.bern...@gmail.com wrote:

 Hello,
 
 This is somewhat of a more advanced question regarding the Java implementation
 of Avro. The details are located at the following link:
 
 http://stackoverflow.com/questions/7537959/in-java-how-can-i-create-an-equival
 ent-of-an-apache-avro-container-file-without
 
 If there is anyone that might be able to assist me with this, I would like to
 get in contact with you.
 
 Best regards,
 Bernard Liang




Re: Compression and splittable Avro files in Hadoop

2011-09-30 Thread Scott Carey
Yes, Avro Data Files are always splittable.

You may want to up the default block size in the files if this is for
MapReduce.  The block size can often have a bigger impact on the
compression ratio than the compression level setting.

If you are sensitive to the write performance, you might want lower
deflate compression levels as well.  The read performance is relatively
constant for deflate as the compression level changes (except for
uncompressed level 0), but the write performance varies a quite a bit
between compression level 1 and 9 -- typically a factor of 5 or 6.
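With the avro-mapred output format both knobs are set on the job
configuration, e.g. (values are only illustrative):

  AvroJob.setOutputCodec(conf, DataFileConstants.DEFLATE_CODEC);
  AvroOutputFormat.setDeflateLevel(conf, 6);
  AvroOutputFormat.setSyncInterval(conf, 1024 * 1024);  // larger blocks often compress better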

On 9/30/11 6:42 PM, Eric Hauser ewhau...@gmail.com wrote:

A coworker and I were having a conversation today about choosing a
compression algorithm for some data we are storing in Hadoop.  We have
been using (https://github.com/tomslabs/avro-utils) for our Map/Reduce
jobs and Haivvreo for integration with Hive.  By default, the
avro-utils OutputFormat uses deflate compression.  Even though
default/zlib/gzip files are not splittable, we decided that Avro data
files are always splittable because individual blocks within the file
are compressed instead of the entire file.

Is this accurate?  Thanks.




Re: Avro versioning and SpecificDatum's

2011-09-20 Thread Scott Carey
That looks like a bug.  What happens if there is no aliasing/renaming
involved?  Aliasing is a newer feature than field addition, removal, and
promotion.

This should be easy to reproduce, can you file a JIRA ticket?  We should
discuss this further there.

Thanks!


On 9/19/11 6:14 PM, Alex Holmes grep.a...@gmail.com wrote:

OK, I was able to reproduce the exception.

v1:
{"name": "Record", "type": "record",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "int"}
  ]
}

v2:
{"name": "Record", "type": "record",
  "fields": [
    {"name": "name_rename", "type": "string", "aliases": ["name"]}
  ]
}

Step 1.  Write Avro file using v1 generated class
Step 2.  Read Avro file using v2 generated class

Exception in thread main org.apache.avro.AvroRuntimeException: Bad index
   at Record.put(Unknown Source)
   at org.apache.avro.generic.GenericData.setField(GenericData.java:463)
   at 
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.j
ava:166)
   at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:13
8)
   at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:12
9)
   at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
   at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
   at Read.readFromAvro(Unknown Source)
   at Read.main(Unknown Source)

The code to write/read the avro file didn't change from below.

On Mon, Sep 19, 2011 at 9:08 PM, Alex Holmes grep.a...@gmail.com wrote:
 I'm trying to put together a simple test case to reproduce the
 exception.  While I was creating the test case, I hit this behavior
 which doesn't seem right, but maybe it's my misunderstanding on how
 forward/backward compatibility should work:

 Schema v1:

 {"name": "Record", "type": "record",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "int"}
  ]
 }

 Schema v2:

 {"name": "Record", "type": "record",
  "fields": [
    {"name": "name_rename", "type": "string", "aliases": ["name"]},
    {"name": "new_field", "type": "int", "default": 0}
  ]
 }

 In the 2nd version I:

 - removed field id
 - renamed field name to name_rename
 - added field new_field

 I write the v1 data file:

  public static Record createRecord(String name, int id) {
Record record = new Record();
record.name = name;
record.id = id;
return record;
  }

  public static void writeToAvro(OutputStream outputStream)
  throws IOException {
    DataFileWriter<Record> writer =
        new DataFileWriter<Record>(new SpecificDatumWriter<Record>());
writer.create(Record.SCHEMA$, outputStream);

    writer.append(createRecord("r1", 1));
    writer.append(createRecord("r2", 2));

writer.close();
outputStream.close();
  }

 I wrote a version-agnostic Read class:

  public static void readFromAvro(InputStream is) throws IOException {
    DataFileStream<Record> reader = new DataFileStream<Record>(
        is, new SpecificDatumReader<Record>());
for (Record a : reader) {
  System.out.println(ToStringBuilder.reflectionToString(a));
}
IOUtils.cleanup(null, is);
IOUtils.cleanup(null, reader);
  }

 Running the Read code against the v1 data file, and including the v1
 code-generated classes in the classpath produced:

 Record@6a8c436b[name=r1,id=1]
 Record@6baa9f99[name=r2,id=2]

 If I run the same code, but use just the v2 generated classes in the
 classpath I get:

 Record@39dd3812[name_rename=r1,new_field=1]
 Record@27b15692[name_rename=r2,new_field=2]

 The name_rename field seems to be good, but why would new_field
 inherit the values of the deleted field id?

 Cheers,
 Alex







 On Mon, Sep 19, 2011 at 12:35 PM, Doug Cutting cutt...@apache.org
wrote:
 On 09/19/2011 05:12 AM, Alex Holmes wrote:
 I then modified my original schema by adding, deleting and renaming
 some fields, creating version 2 of the schema.  After re-creating the
 Java classes I attempted to read the version 1 file using the
 DataFileStream (with a SpecificDatumReader), and this is throwing an
 exception.

 This should work.  Can you provide more detail?  What is the exception?
  A reproducible test case would be great to have.

 Thanks,

 Doug






Re: Avro versioning and SpecificDatum's

2011-09-20 Thread Scott Carey
As Doug mentioned in the ticket, the problem is likely:

new SpecificDatumReader<Record>()


This should be

new SpecificDatumReader<Record>(Record.class)


Which sets the reader to resolve to the schema found in Record.class
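
For example (a sketch, assuming the generated class is named Record as above):

    DatumReader<Record> datumReader = new SpecificDatumReader<Record>(Record.class);
    DataFileStream<Record> stream = new DataFileStream<Record>(is, datumReader);
    for (Record r : stream) {
      // fields are resolved against Record.SCHEMA$, so renamed, removed and
      // added-with-default fields from older files are handled here
    }

Passing Record.class gives the reader its expected (reader) schema; the writer's
schema comes from the file header.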



On 9/20/11 3:44 AM, Alex Holmes grep.a...@gmail.com wrote:

Created the following ticket:

https://issues.apache.org/jira/browse/AVRO-891

Thanks,
Alex

On Tue, Sep 20, 2011 at 6:26 AM, Alex Holmes grep.a...@gmail.com wrote:
 Thanks, I'll add a bug.

 As a FYI, even without the alias (retaining the original field name),
 just removing the id field yields the exception.

 On Tue, Sep 20, 2011 at 2:22 AM, Scott Carey scottca...@apache.org
wrote:
 That looks like a bug.  What happens if there is no aliasing/renaming
 involved?  Aliasing is a newer feature than field addition, removal,
and
 promotion.

 This should be easy to reproduce, can you file a JIRA ticket?  We
should
 discuss this further there.

 Thanks!


 On 9/19/11 6:14 PM, Alex Holmes grep.a...@gmail.com wrote:

OK, I was able to reproduce the exception.

v1:
{name: Record, type: record,
  fields: [
{name: name, type: string},
{name: id, type: int}
  ]
}

v2:
{name: Record, type: record,
  fields: [
{name: name_rename, type: string, aliases: [name]}
  ]
}

Step 1.  Write Avro file using v1 generated class
Step 2.  Read Avro file using v2 generated class

Exception in thread main org.apache.avro.AvroRuntimeException: Bad
index
   at Record.put(Unknown Source)
   at 
org.apache.avro.generic.GenericData.setField(GenericData.java:463)
   at
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReade
r.j
ava:166)
   at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java
:13
8)
   at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java
:12
9)
   at 
org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
   at 
org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
   at Read.readFromAvro(Unknown Source)
   at Read.main(Unknown Source)

The code to write/read the avro file didn't change from below.

On Mon, Sep 19, 2011 at 9:08 PM, Alex Holmes grep.a...@gmail.com
wrote:
 I'm trying to put together a simple test case to reproduce the
 exception.  While I was creating the test case, I hit this behavior
 which doesn't seem right, but maybe it's my misunderstanding on how
 forward/backward compatibility should work:

 Schema v1:

 {name: Record, type: record,
  fields: [
{name: name, type: string},
{name: id, type: int}
  ]
 }

 Schema v2:

 {name: Record, type: record,
  fields: [
{name: name_rename, type: string, aliases: [name]},
{name: new_field, type: int, default:0}
  ]
 }

 In the 2nd version I:

 - removed field id
 - renamed field name to name_rename
 - added field new_field

 I write the v1 data file:

  public static Record createRecord(String name, int id) {
Record record = new Record();
record.name = name;
record.id = id;
return record;
  }

  public static void writeToAvro(OutputStream outputStream)
  throws IOException {
DataFileWriterRecord writer =
new DataFileWriterRecord(new SpecificDatumWriterRecord());
writer.create(Record.SCHEMA$, outputStream);

writer.append(createRecord(r1, 1));
writer.append(createRecord(r2, 2));

writer.close();
outputStream.close();
  }

 I wrote a version-agnostic Read class:

  public static void readFromAvro(InputStream is) throws IOException {
DataFileStreamRecord reader = new DataFileStreamRecord(
is, new SpecificDatumReaderRecord());
for (Record a : reader) {
  System.out.println(ToStringBuilder.reflectionToString(a));
}
IOUtils.cleanup(null, is);
IOUtils.cleanup(null, reader);
  }

 Running the Read code against the v1 data file, and including the v1
 code-generated classes in the classpath produced:

 Record@6a8c436b[name=r1,id=1]
 Record@6baa9f99[name=r2,id=2]

 If I run the same code, but use just the v2 generated classes in the
 classpath I get:

 Record@39dd3812[name_rename=r1,new_field=1]
 Record@27b15692[name_rename=r2,new_field=2]

 The name_rename field seems to be good, but why would new_field
 inherit the values of the deleted field id?

 Cheers,
 Alex







 On Mon, Sep 19, 2011 at 12:35 PM, Doug Cutting cutt...@apache.org
wrote:
 On 09/19/2011 05:12 AM, Alex Holmes wrote:
 I then modified my original schema by adding, deleting and renaming
 some fields, creating version 2 of the schema.  After re-creating
the
 Java classes I attempted to read the version 1 file using the
 DataFileStream (with a SpecificDatumReader), and this is throwing
an
 exception.

 This should work.  Can you provide more detail?  What is the
exception?
  A reproducible test case would be great to have.

 Thanks,

 Doug










Re: Avro versioning and SpecificDatum's

2011-09-19 Thread Scott Carey
I version with SpecificDatum objects using avro data files and it works
fine.

I have seen problems arise if a user is configuring or reconfiguring the
schemas on the DatumReader passed into the construction of the
DataFileReader.


In the case of SpecificDatumReader, it is as simple as:

DatumReader<T> reader = new SpecificDatumReader<T>(T.class);
DataFileReader<T> fileReader = new DataFileReader<T>(file, reader);



On 9/19/11 5:12 AM, Alex Holmes grep.a...@gmail.com wrote:

Hi,

I'm starting to play with how I can support versioning with Avro.  I
created an initial schema, code-generated some some Java classes using
org.apache.avro.tool.Main compile protocol, and then used the
DataFileWriter (with a SpecificDatumWriter) to serialize my objects to
a file.

I then modified my original schema by adding, deleting and renaming
some fields, creating version 2 of the schema.  After re-creating the
Java classes I attempted to read the version 1 file using the
DataFileStream (with a SpecificDatumReader), and this is throwing an
exception.

Is versioning supported in conjunction with the SpecificDatum*
reader/writer classes, or do I have to work at the GenericDatum level
for this to work?

Many thanks,
Alex




Re: How should I migrate 1.4 code to avro 1.5?

2011-09-02 Thread Scott Carey
The javadoc for the deprecated method directs users to the replacement.

BinaryEncoder and BinaryDecoder are well documented, with docs available via
maven for IDE's to consume easily, or via the Apache Avro website:
http://avro.apache.org/docs/1.5.3/api/java/org/apache/avro/io/BinaryEncoder.html
http://avro.apache.org/docs/1.5.3/api/java/org/apache/avro/io/DecoderFactory.html

DecoderFactory.defaultFactory() is deprecated; use the equivalent
DecoderFactory.get() instead:
http://avro.apache.org/docs/1.5.3/api/java/org/apache/avro/io/DecoderFactory.html#get%28%29
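
In short, the 1.5 pattern for in-memory binary encoding looks like this (a
sketch; schema and datum stand for whatever you were already serializing, with
imports from org.apache.avro.io and org.apache.avro.generic):

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(datum, encoder);
    encoder.flush();

    DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord roundTripped = reader.read(null, decoder);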

Generally, when using Avro you will have an easier time if you have the docs
available in your IDE or at least available for reference in a browser.
There are not a lot of blog posts and examples out there, but the javadoc is
mostly decent and we try hard to make sure all public and protected methods
and constructors have documentation.  Many classes and packages have solid
documentation as well.  Please report any documentation bugs or suggestions
for improvement.

Thanks!

-Scott

On 9/2/11 2:41 PM, W.P. McNeill bill...@gmail.com wrote:

 I'm new to Avro. Since I'm having trouble finding simple examples online I'm
 writing one of my own that I'm putting on github.
 
 https://github.com/wpm/AvroExample
 
 Hopefully, this will be of help to people like me who are also having trouble
 finding simple code examples.
 
 I want to get this compiling without a hitch in Maven. I had it running with
 a 1.4 version of Avro, but when I changed that to 1.5, some of the code no
 longer works. Specifically, BinaryEncoder can no longer be instantiated
 directly because it is now an abstract class (AvroExample.java: line 33) and
 DecoderFactory.defaultFactory is deprecated (AvroExample.java: line 41).
 
 How should I modify this code so that it works with the latest and greatest
 version of Avro?  I looked through the Release Notes, but the answers weren't
 obvious.
 
 Thanks.
 




Re: How should I migrate 1.4 code to avro 1.5?

2011-09-02 Thread Scott Carey
Are you still having trouble with this?  I noticed that the code has changed
and you are using MyPair instead of Pair.  Was there a naming conflict bug
with Avro's Pair.java?

-Scott

On 9/2/11 3:46 PM, W.P. McNeill bill...@gmail.com wrote:

 I made changes that got rid of all the deprecated calls.  I think I am using
 the 1.5 interface correctly.  However, I get a runtime error when I try to
 deserialize into a class using a SpecificDataumReader.  The problem starts at
 line 62 of AvroExample.java
 https://github.com/wpm/AvroExample/blob/master/src/main/java/wpmcn/AvroExampl
 e.java#L62 .  The code looks like this:
 
   DatumReader<Pair> reader = new SpecificDatumReader<Pair>(Pair.class);
   BinaryDecoder decoder =
       DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
   Pair result = reader.read(null, decoder);
   System.out.printf("Left: %s, Right: %s\n", result.left, result.right);
 
 Where Pair is an object I have SpecificRecord that I have in this project.
 When I deserialize with reader.read() I get the following runtime error:
 
 Exception in thread main java.lang.ClassCastException:
 org.apache.avro.generic.GenericData$Record cannot be cast to wpmcn.Pair
 at wpmcn.AvroExample.serializeSpecific(AvroExample.java:64)
 at wpmcn.AvroExample.main(AvroExample.java:73)
 
 When I step into the debugger I see that the GenericDatumReader.read()
 function has type D as GenericData.
 
 Presumably I'm calling something wrong but I can't figure out what.
 
 On Fri, Sep 2, 2011 at 3:02 PM, Philip Zeyliger phi...@cloudera.com wrote:
 EncoderFactory.get().binaryEncoder(...).
 
 I encourage you to file a JIRA and submit a patch to AVRO.  Having example
 code in the code base seems like a win to me.
 
 -- Philip
 
 
 On Fri, Sep 2, 2011 at 2:41 PM, W.P. McNeill bill...@gmail.com wrote:
 I'm new to Avro. Since I'm having trouble finding simple examples online I'm
 writing one of my own that I'm putting on github.
 
 https://github.com/wpm/AvroExample
 
 Hopefully, this will be of help to people like me who are also having
 trouble finding simple code examples.
 
 I want to get this compiling without of hitch in Maven. I had it running
 with a 1.4 version of Avro, but when I changed that to 1.5, some of the code
 no longer works. Specifically, BinaryEncoder can no longer be instantiated
 directly because it is now an abstract class (AvroExample.java: line 33) and
 DecoderFactory.defaultFactory is deprecated (AvroExample.java: line 41).
 
 How should I modify this code so that it works with the latest and greatest
 version of Avro?  I looked through the Release Notes, but the answers
 weren't obvious.
 
 Thanks.
 
 
 




Re: simultaneous read + write?

2011-09-02 Thread Scott Carey
AvroDataFile deals with this for some cases.  Is it an acceptable API for
your use case?  You can configure the block size to be very small and/or
flush() regularly.

If you do this on your own, you will need to track the position that you
start to read a record at, and if there is a failure, rewind and reset the
reader to that position.
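
For example, the write side might look like this (a rough, untested sketch;
MyRecord is a placeholder for your generated class):

    DataFileWriter<MyRecord> writer = new DataFileWriter<MyRecord>(
        new SpecificDatumWriter<MyRecord>(MyRecord.class));
    writer.setSyncInterval(2048);  // small blocks
    writer.create(MyRecord.SCHEMA$, new File("queue.avro"));
    writer.append(record);
    writer.flush();  // push the current block out so a concurrent reader can see it

The reader side can then use DataFileStream/DataFileReader on the same file;
data becomes visible to it a block at a time as the writer flushes.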

-Scott


On 8/25/11 7:17 PM, Yang tedd...@gmail.com wrote:

I'm trying to implement an on-disk queue, which contains avro records,
SpecificRecord

my queue implementation basically contains a
SpecificDatumWriter, and a SpecificDatumReader  pointing to the same file
.

the problem is that when the reader reaches EOF, I can no longer
use it again,
even after I append more records to the file.  If I call the same
SpecificDatumReader.read() again,
it gives me exceptions:


--
-
Test set: blah.MyTest
--
-
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.257
sec  FAILURE!
testBasic(blah.MyTest)  Time elapsed: 0.24 sec   ERROR!
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at 
org.apache.avro.io.BinaryDecoder$ByteSource.compactAndFill(BinaryDecoder.j
ava:670)
at 
org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:453)
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:120)
at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:405)
at 
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at 
org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:14
2)
at 
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.j
ava:166)
at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:13
8)
at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:12
9)
at blah.DiskEventsQueue.dequeue2(MyTest.java:55)
at blah.MyTest.testBasic(MyTest.java:85)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm
pl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)




Thanks
Yang




Re: How should I migrate 1.4 code to avro 1.5?

2011-09-02 Thread Scott Carey
Start with a JIRA ticket and we can discuss and refine there.

What we accept into the project must be attached as a patch to the JIRA
ticket with the sign-off to Apache and proper license headers on the
content.

Thanks!

-Scott

On 9/2/11 5:53 PM, W.P. McNeill bill...@gmail.com wrote:

 I've got a building version with Chris Wilkes' changes. I'd be happy to
 include this in an Avro distribution. Should I just open a JIRA to that effect
 and point to this github project?
 
 On Fri, Sep 2, 2011 at 5:28 PM, Chris Wilkes cwil...@gmail.com wrote:
 Oh and I'm the one that did the pull request.  I changed the name of
 the avro class to MyPair as I was confused when reading it with
 avro's own Pair class.
 
 What I usually do is put all of my avro schemas into a separate
 project with nothing else in it.  Then I have all my other projects
 depend on that one, in this case AvroExample.java would be a in a
 separate project from MyPair.avsc.  This gets around weirdness with
 mvn install vs Eclipse seeing the updated files, etc.
 
 On Fri, Sep 2, 2011 at 5:20 PM, W.P. McNeill bill...@gmail.com wrote:
  Still having trouble with this.
  The name change was part of merging the pull request on github. My last
  email details where I'm at right now. The pull request code looks correct;
  I'm just trying to get it to build in my Maven environment.
 
  On Fri, Sep 2, 2011 at 5:19 PM, Scott Carey scottca...@apache.org wrote:
 
  Are you still having trouble with this?  I noticed that the code has
  changed and you are using MyPair instead of Pair.  Was there a naming
  conflict bug with Avro's Pair.java?
  -Scott
  On 9/2/11 3:46 PM, W.P. McNeill bill...@gmail.com wrote:
 
  I made changes that got rid of all the deprecated calls.  I think I am
  using the 1.5 interface correctly.  However, I get a runtime error when
I
  try to deserialize into a class using a SpecificDataumReader.  The
 problem
  starts at line 62 of AvroExample.java.  The code looks like this:
DatumReaderPair reader = new
  SpecificDatumReaderPair(Pair.class);
BinaryDecoder decoder =
  DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
Pair result = reader.read(null, decoder);
System.out.printf(Left: %s, Right: %s\n, result.left,
  result.right);
  Where Pair is an object I have SpecificRecord that I have in this
 project.
  When I deserialize with reader.read() I get the following runtime error:
  Exception in thread main java.lang.ClassCastException:
  org.apache.avro.generic.GenericData$Record cannot be cast to wpmcn.Pair
  at wpmcn.AvroExample.serializeSpecific(AvroExample.java:64)
  at wpmcn.AvroExample.main(AvroExample.java:73)
  When I step into the debugger I see that the GenericDatumReader.read()
  function has type D as GenericData.
  Presumably I'm calling something wrong but I can't figure out what.
  On Fri, Sep 2, 2011 at 3:02 PM, Philip Zeyliger phi...@cloudera.com
  wrote:
 
  EncoderFactory.get().binaryEncoder(...).
  I encourage you to file a JIRA and submit a patch to AVRO.  Having
  example code in the code base seems like a win to me.
  -- Philip
 
  On Fri, Sep 2, 2011 at 2:41 PM, W.P. McNeill bill...@gmail.com
 wrote:
 
  I'm new to Avro. Since I'm having trouble finding simple examples
 online
  I'm writing one of my own that I'm putting on github.
  https://github.com/wpm/AvroExample
  Hopefully, this will be of help to people like me who are also
 having
  trouble finding simple code examples.
  I want to get this compiling without of hitch in Maven. I had it
 running
  with a 1.4 version of Avro, but when I changed that to 1.5, some of
 the code
  no longer works. Specifically, BinaryEncoder can no longer be
 instantiated
  directly because it is now an abstract class (AvroExample.java: line
 33) and
  DecoderFactory.defaultFactory is deprecated (AvroExample.java: line
41).
  How should I modify this code so that it works with the latest and
  greatest version of Avro?  I looked through the Release Notes, but
the
  answers weren't obvious.
  Thanks.
 
 
 
 
 




Re: avro BinaryDecoder bug ?

2011-08-31 Thread Scott Carey
Looks like a bug to me.

Can you file a JIRA ticket?

Thanks!

On 8/29/11 1:24 PM, Yang tedd...@gmail.com wrote:

if I read on a empty file with BinaryDecoder, I get EOF, good,

but with the current code, if I read it again with the same decoder, I
get a IndexOutofBoundException, not EOF.

it seems that always giving EOF should be a more desirable behavior.

you can see from this test code:

import static org.junit.Assert.assertEquals;

import java.io.IOException;

import org.apache.avro.specific.SpecificRecord;
import org.junit.Test;

import myavro.Apple;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

class MyWriter {

    SpecificDatumWriter<SpecificRecord> wr;
    Encoder enc;
    OutputStream ostream;

    public MyWriter() throws FileNotFoundException {
        wr = new SpecificDatumWriter<SpecificRecord>(new Apple().getSchema());
        ostream = new FileOutputStream(new File("/tmp/testavro"));
        enc = EncoderFactory.get().binaryEncoder(ostream, null);
    }

    public synchronized void dump(SpecificRecord event) throws IOException {
        wr.write(event, enc);
        enc.flush();
    }

}

class MyReader {

    SpecificDatumReader<SpecificRecord> rd;
    Decoder dec;
    InputStream istream;

    public MyReader() throws FileNotFoundException {
        rd = new SpecificDatumReader<SpecificRecord>(new Apple().getSchema());
        istream = new FileInputStream(new File("/tmp/testavro"));
        dec = DecoderFactory.get().binaryDecoder(istream, null);
    }

    public synchronized SpecificRecord read() throws IOException {
        Object r = rd.read(null, dec);
        return (SpecificRecord) r;
    }

}

public class AvroWriteAndReadSameTime {
@Test
public void testWritingAndReadingAtSameTime() throws Exception {

MyWriter dumper = new MyWriter();
final Apple apple = new Apple();
apple.taste = "sweet";
dumper.dump(apple);

final MyReader rd = new MyReader();
rd.read();


try {
rd.read();
} catch (Exception e) {
e.printStackTrace();
}

// the second one somehow generates a NPE, we hope to get EOF...
try {
rd.read();
} catch (Exception e) {
e.printStackTrace();
}

}
}





the issue is in BinaryDecoder.readInt(), right now even when it hits
EOF, it still advances the pos pointer.
all the other APIs (readLong readFloat ...) do not do this. changing
to the following  makes it work:


  @Override
  public int readInt() throws IOException {
    ensureBounds(5); // won't throw index out of bounds
    int len = 1;
    int b = buf[pos] & 0xff;
    int n = b & 0x7f;
    if (b > 0x7f) {
      b = buf[pos + len++] & 0xff;
      n ^= (b & 0x7f) << 7;
      if (b > 0x7f) {
        b = buf[pos + len++] & 0xff;
        n ^= (b & 0x7f) << 14;
        if (b > 0x7f) {
          b = buf[pos + len++] & 0xff;
          n ^= (b & 0x7f) << 21;
          if (b > 0x7f) {
            b = buf[pos + len++] & 0xff;
            n ^= (b & 0x7f) << 28;
            if (b > 0x7f) {
              throw new IOException("Invalid int encoding");
            }
          }
        }
      }
    }
    if (pos + len > limit) {
      throw new EOFException();
    }
    pos += len;                  // <== CHANGE, used to be above the EOF throw

    return (n >>> 1) ^ -(n & 1); // back to two's-complement
  }




Re: Map output records/reducer input records mismatch

2011-08-16 Thread Scott Carey
We have had one other report of something similar happening.
https://issues.apache.org/jira/browse/AVRO-782


What Avro version is this happening with? What JVM version?

On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args if
it is Sun and JRE 6u21 or later? (some issues in loop predicates affect
Java 6 too, just not as many as the recent news on Java7).
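
(With Hadoop, the flag can be passed to the task JVMs via
mapred.child.java.opts, e.g. "-Xmx512m -XX:-UseLoopPredicate" -- the heap size
here is just an illustrative value.)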

Otherwise, it may likely be the same thing as AVRO-782.  Any extra
information related to that issue would be welcome.

Thanks!

-Scott



On 8/16/11 8:39 AM, Vyacheslav Zholudev vyacheslav.zholu...@gmail.com
wrote:

Hi,

I have multiple hadoop jobs that use the avro mapred API.
Only in one of the jobs do I see a visible mismatch between the number of map
output records and reducer input records.

Has anybody encountered such behavior? Can anybody think of possible
explanations for this phenomenon?

Any pointers/thoughts are highly appreciated!

Best,
Vyacheslav




Re: Compiling multiple input schemas

2011-08-16 Thread Scott Carey
What about leveraging shell expansion?
This would mean we would need inverse syntax, like tar or zip ( destination,
list of sources in reverse dependency order )

Then your examples are
avro-tools-1.6.0.jar compile schema tmp/ input/position.avsc
input/player.avsc
avro-tools-1.6.0.jar compile schema tmp/ input/*

That would be incompatible, since we reversed argument order.  But it would
be more like other unix command line tools that take lists of files and
output results somewhere else.

(Or as I see Doug has just replied -- the last argument can be the
destination)

On 8/16/11 1:38 PM, Bill Graham billgra...@gmail.com wrote:

 Hi,
 
 With Avro-874, multiple inter-dependent schema files can be parsed. I've
 written a patch to the SpecificCompilerTool to allow the same when producing
 java from multiple schemas that I'd like to contribute for consistency if
 there's interest. It allows you to pass multiple input files like this:
 
 $ java -cp avro-tools-1.6.0.jar org.apache.avro.tool.Main compile schema
 input/position.avsc,input/player.avsc tmp/
 
 While I was at it, it seemed useful to parse an entire directory of schema
 files as well so I implemented this:
 
 $ java -cp avro-tools-1.6.0.jar org.apache.avro.tool.Main compile schema
 input/ tmp/
 
 The latter approach will not work properly for files with dependencies, since
 the file order probably isn't in reverse dependency order. If that's the case,
 a combination of files and directories can be used to force an ordering. So if
 b depends on a and other files depend on either of them you could do this:
 
 $ java -cp avro-tools-1.6.0.jar org.apache.avro.tool.Main compile schema
 input/a.avsc,input/b.avsc,input/ tmp/
 
 Let me know if some or all of this seems useful to contribute. The first
 example is really the main one that I need. I've done the same for Protocol as
 well btw.
 
 thanks,
 Bill




Re: Map output records/reducer input records mismatch

2011-08-16 Thread Scott Carey
On 8/16/11 3:56 PM, Vyacheslav Zholudev vyacheslav.zholu...@gmail.com
wrote:

Hi, Scott,

thanks for your reply.

 What Avro version is this happening with? What JVM version?

We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have
to look up.

 
 On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
if
 it is Sun and JRE 6u21 or later? (some issues in loop predicates affect
 Java 6 too, just not as many as the recent news on Java7).
 
 Otherwise, it may likely be the same thing as AVRO-782.  Any extra
 information related to that issue would be welcome.

I will have to collect it. In the meanwhile, do you have any reasonable
explanations of the issue besides it being something like AVRO-782?

What is your key type (map output schema, first type argument of Pair)?
Is your key a Utf8 or String?  I don't have a reasonable explanation at
this point, I haven't looked into it in depth with a good reproducible
case.  I have my suspicions with how recycling of the key works since Utf8
is mutable and its backing byte[] can end up shared.




Thanks a lot,
Vyacheslav

 
 Thanks!
 
 -Scott
 
 
 
 On 8/16/11 8:39 AM, Vyacheslav Zholudev
vyacheslav.zholu...@gmail.com
 wrote:
 
 Hi,
 
 I'm having multiple hadoop jobs that use the avro mapred API.
 Only in one of the jobs I have a visible mismatch between a number of
map
 output records and reducer input records.
 
 Does anybody encountered such a behavior? Can anybody think of possible
 explanations of this phenomenon?
 
 Any pointers/thoughts are highly appreciated!
 
 Best,
 Vyacheslav
 
 

Best,
Vyacheslav







Re: why Utf8 (vs String)?

2011-08-11 Thread Scott Carey
Also, Utf8 caches the result of toString(), so that if you call toString()
many times, it only allocates the String once.
It also implements the CharSequence interface, and many libraries in the
JRE accept CharSequence.

Note that Utf8 is mutable and exposes its backing store (byte array).
String is immutable.  Be careful with how you use Utf8 objects if you hold
on to them for a long time or pass them to other code -- users should not
expect similar characteristics to String for general use.
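
A small illustration of both points (the field name is made up):

    Utf8 name = (Utf8) record.get("name");
    String s1 = name.toString();  // decodes and caches the String
    String s2 = name.toString();  // returns the cached String, no second decode

    // Utf8 is mutable and may be reused/overwritten by a subsequent read, so
    // keep the String (immutable) rather than the Utf8 if the value must
    // outlive the current record.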



On 8/11/11 5:08 PM, Yang tedd...@gmail.com wrote:

Thanks  a lot Doug

On Thu, Aug 11, 2011 at 5:02 PM, Doug Cutting cutt...@apache.org wrote:
 This is for performance.

 A Utf8 may be efficiently compared to other Utf8's, e.g., when sorting,
 without decoding the UTF-8 bytes into characters.  A Utf8 may also be
 reused, so when iterating through a large number of values (e.g., in a
 MapReduce job) only a single instance need be allocated, while String
 would require an allocation per iteration.

 Note that String may be used when writing data, but that data is
 generally read as Utf8.  The toString() method may be called whenever a
 String is required.  If only equality or ordering is needed, and not
 substring operations, then leaving values as Utf8 is generally faster
 than converting to String.

 Doug

 On 08/11/2011 04:36 PM, Yang wrote:
 if I declare a field to be string, the generated java implementation
 uses avro..Utf8 for that,

 I was wondering what is the thinking behind this, and what is the
 proper way to use the Utf8 value -
 oftentimes in my logic, I need to compare the value against other
 String's, or store them into other databases , which
 of course do not know about Utf8, so that I'd have to transform them
 into String's.  so it seems being Utf8 unnecessarily
 asks for a lot of transformations.

 or I guess I'm not getting the correct usage ?

 Thanks
 Yang





Re: Combining schemas

2011-08-09 Thread Scott Carey
On 8/9/11 11:15 AM, Bill Graham billgra...@gmail.com wrote:

 Hi,
 
 I'm trying to create a schema that references a type defined in another schema
 and I'm having some troubles. Is there an easy way to do this?
 
 My test schemas look like this:
 
 $ cat position.avsc
 {"type": "enum", "name": "Position", "namespace": "avro.examples.baseball",
  "symbols": ["P", "C", "B1", "B2", "B3", "SS", "LF", "CF", "RF", "DH"]
 }

 $ cat player.avsc
 {"type": "record", "name": "Player", "namespace": "avro.examples.baseball",
  "fields": [
   {"name": "number", "type": "int"},
   {"name": "first_name", "type": "string"},
   {"name": "last_name", "type": "string"},
   {"name": "position", "type": {"type": "array", "items": "avro.examples.baseball.Position"} }
  ]
 }
 
 I've read this thread
 (http://apache-avro.679487.n3.nabble.com/How-to-reference-previously-defined-e
 num-in-avsc-file-td2663512.html) and tried using IDL like so with no luck:
 
 $ cat baseball.avdl
 @namespace("avro.examples.baseball")
 protocol Baseball {
   import schema "position.avsc";
   import schema "player.avsc";
 }
 
 $ java -jar avro-tools-1.5.1.jar idl  baseball.avdl baseball.avpr
 Exception in thread main org.apache.avro.SchemaParseException: Undefined
 name: avro.examples.baseball.Position
 at org.apache.avro.Schema.parse(Schema.java:979)
 at org.apache.avro.Schema.parse(Schema.java:1052)
 at org.apache.avro.Schema.parse(Schema.java:1021)
 at org.apache.avro.Schema.parse(Schema.java:884)
 at org.apache.avro.compiler.idl.Idl.ImportSchema(Idl.java:388)
 at org.apache.avro.compiler.idl.Idl.ProtocolBody(Idl.java:320)
 at org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:206)
 at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:84)
 ...

I agree that the documentation indicates that this should work.  I suspect
that it may not be able to resolve dependencies among imports.  That is if
Baseball depends on position, and on player, it works.  But since player
depends on position, it does not.  The import statement pulls in each item
individually for use in composite things in the AvroIDL, but does not allow
for interdependencies in the imports.
This seems worthy of a JIRA enhancement request.  I'm sure the project will
accept a patch that adds this.

 
 
 I also saw this blog post
 (http://www.infoq.com/articles/ApacheAvro#_ftnref6_7758) where the author had
 to write some nasty String.replace(..) code to combine schemas, but there's
 got to be a better way that this.

We need to improve the ability to import multiple files when parsing.  Using
the lower level Avro API you can parse the files yourself in an order that
will work.  
I have simply put all my types in one file.  If you made one avsc file with
both Position and Player in a JSON array it will complie.  It would look
like:
[
   position schema here,
   player schema here
]
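
Concretely, with the schemas from your message, a single baseball.avsc like
this compiles both types:

[
  {"type": "enum", "name": "Position", "namespace": "avro.examples.baseball",
   "symbols": ["P", "C", "B1", "B2", "B3", "SS", "LF", "CF", "RF", "DH"]
  },
  {"type": "record", "name": "Player", "namespace": "avro.examples.baseball",
   "fields": [
     {"name": "number", "type": "int"},
     {"name": "first_name", "type": "string"},
     {"name": "last_name", "type": "string"},
     {"name": "position", "type": {"type": "array", "items": "Position"}}
   ]
  }
]

Since Position is defined earlier in the array, Player can refer to it by name.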

 
 Also FYI, it seems enum values can't start with numbers (i.e. '1B'). Is this a
 know issue or a feature? I haven't seen it documented anywhere. You get an
 error like this if the value starts with a number:
 
 org.apache.avro.SchemaParseException: Illegal initial character


Enums are a named type.  The enum names must start with [A-Za-z_]  and
subsequently contain only [A-Za-z0-9_].
http://avro.apache.org/docs/1.5.1/spec.html#Names

However, the spec does not say that the values must have such restrictions.
This may be a bug, can you file a JIRA ticket?

Thanks!

-Scott

 
 thanks,
 Bill
 




Re: Hadoop and org.apache.avro.file.DataFileReader sez Not an Avro data file

2011-07-20 Thread Scott Carey
An avro data file is not created with a FileOutputStream.  That will write
avro binary data to a file, but not in the avro file format (which is
splittable and contains header metadata).

The API for Avro Data Files is here:
http://avro.apache.org/docs/current/api/java/org/apache/avro/file/package-summary.html




On 7/20/11 2:35 PM, Peter Wolf opus...@gmail.com wrote:



  


  
  
Hello, anyone out there know about AVRO file formats and/or Hadoop
support?

My Hadoop AvroJob code does not recognize the AVRO files created by
my other code.  It seems that the MAGIC number is wrong.

What is going on?  How many different ways of encoding AVRO files
are there, and how do I make sure they match.

I am creating the input files like this...

static public void write(String file, GenericRecord record,
Schema schema) throws IOException {
OutputStream o = new FileOutputStream(file);
GenericDatumWriter w = new GenericDatumWriter(schema);
Encoder e = EncoderFactory.get().binaryEncoder(o, null);
w.write(record, e);
e.flush();
}

Hadoop is reading them using org.apache.avro.file.DataFileReader

Here is where it breaks.  I checked, and it really is trying to read
the right file...

  /** Open a reader for a file. */
  public static <D> FileReader<D> openReader(SeekableInput in,
                                             DatumReader<D> reader)
      throws IOException {
    if (in.length() < MAGIC.length)
      throw new IOException("Not an Avro data file");

    // read magic header
    byte[] magic = new byte[MAGIC.length];
    in.seek(0);
    for (int c = 0; c < magic.length; c = in.read(magic, c, magic.length-c)) {}
    in.seek(0);

    if (Arrays.equals(MAGIC, magic))                  // current format
      return new DataFileReader<D>(in, reader);
    if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
      return new DataFileReader12<D>(in, reader);

    throw new IOException("Not an Avro data file");
  }



Some background...

I am trying to write my first AVRO Hadoop application.  I am using
Hadoop Cloudera 20.2-737 and AVRO 1.5.1

I followed the instructions here...

   
http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/package-summary.html#package_description


The sample code here...

   
http://svn.apache.org/viewvc/avro/tags/release-1.5.1/lang/java/mapred/src/test/java/org/apache/avro/mapred/TestWordCount.java?view=markup

Here is my code which breaks with a Not an Avro data file error.


    public static class MapImpl extends AvroMapper<Account, Pair<Utf8, Long>> {
        @Override
        public void map(Account account,
                        AvroCollector<Pair<Utf8, Long>> collector,
                        Reporter reporter) throws IOException {
            StringTokenizer tokens = new StringTokenizer(account.timestamp.toString());
            while (tokens.hasMoreTokens())
                collector.collect(new Pair<Utf8, Long>(new Utf8(tokens.nextToken()), 1L));
        }
    }

    public static class ReduceImpl
            extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
        @Override
        public void reduce(Utf8 word, Iterable<Long> counts,
                           AvroCollector<Pair<Utf8, Long>> collector,
                           Reporter reporter) throws IOException {
            long sum = 0;
            for (long count : counts)
                sum += count;
            collector.collect(new Pair<Utf8, Long>(word, sum));
        }
    }

public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println("Usage: " + getClass().getName() + " input output");
            System.exit(2);
        }

JobConf job = new JobConf(this.getClass());
Path outputPath = new Path(args[1]);

outputPath.getFileSystem(job).delete(outputPath);
//WordCountUtil.writeLinesFile();

job.setJobName(this.getClass().getName());

AvroJob.setInputSchema(job, Account.schema);
//Schema.create(Schema.Type.STRING));
        AvroJob.setOutputSchema(job,
            new Pair<Utf8, Long>(new Utf8(), 0L).getSchema());

AvroJob.setMapperClass(job, MapImpl.class);
AvroJob.setCombinerClass(job, ReduceImpl.class);
AvroJob.setReducerClass(job, ReduceImpl.class);

FileInputFormat.setInputPaths(job, new Path(args[0]));

Re: Hadoop and org.apache.avro.file.DataFileReader sez Not an Avro data file

2011-07-20 Thread Scott Carey
Let me try that again, without the odd formatting:

An avro data file is not created with a FileOutputStream.  That will write
avro binary data to a file, but not in the avro file format (which is
splittable and contains header metadata).


The API for Avro Data Files is here:
http://avro.apache.org/docs/current/api/java/org/apache/avro/file/package-summary.html
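
For comparison, here is the write method from the original message, rewritten
against the data file API (a sketch):

    static public void write(String file, GenericRecord record, Schema schema)
        throws IOException {
      DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
          new GenericDatumWriter<GenericRecord>(schema));
      writer.create(schema, new File(file));  // writes the magic header and metadata
      writer.append(record);
      writer.close();
    }

This produces a file with the header and sync markers that DataFileReader (and
the avro mapred input formats) expect.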




Re: Schema with multiple Record types Java API

2011-07-15 Thread Scott Carey
Try out the Reflect API.  It may not be flexible enough yet, but the intended 
use case is to serialize pre-existing classes.   If more annotations are 
required for your use case,  create a JIRA ticket.

http://avro.apache.org/docs/current/api/java/org/apache/avro/reflect/package-summary.html
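
For example, a rough sketch (Foo stands for one of your existing classes):

    Schema schema = ReflectData.get().getSchema(Foo.class);
    DataFileWriter<Foo> writer = new DataFileWriter<Foo>(
        new ReflectDatumWriter<Foo>(Foo.class));
    writer.create(schema, new File("foo.avro"));
    writer.append(someFoo);
    writer.close();

Classes referenced by Foo's fields get schemas generated for them as well.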

Thanks!

-Scott



On 7/15/11 4:54 AM, Peter Wolf opus...@gmail.com
wrote:

Thanks again Scott,

Yes, I am using AVRO to serialize existing Java classes, so tools to generate 
code will not help me.

Are there tools that go the other way, such as JAXB for XML?  I really want to 
point to a root Java object, and say serialize this, and everything it points 
to, as AVRO.

BTW AVRO Rocks!  My objects contain large amounts of data, and I am *very* 
impressed with the speed of serialization/deserialization.

Cheers
P





On 7/14/11 10:10 PM, Scott Carey wrote:
AvroIDL can handle imports, but it generates classes.  The Avro API's for this 
can be used to generate Schemas without making objects if you wish.

The Avro schema compiler (*.avsc, *.avpr) does not support imports, it is a 
feature requested by many but not contributed by anyone.

You may be interested in the code-gen capabilities of Avro, which has a 
Velocity templating engine to create Java classes based on schemas.  This can 
be customized to generate classes in custom ways.

However, if you are using Avro to serialize objects that have pre-existing 
classes, the Reflect API or an enhancement of it may be more suitable.

More information on your use case may help to point you in the right direction.

-Scott


On 7/14/11 6:43 PM, Peter Wolf opus...@gmail.com
wrote:

Many thanks Scott,

I am looking for the equivalent of #include or import.  I want to make a 
complicated schema with many record types, but manage it in separate strings.

In my application, I am using AVRO to serialize a tree of connected Java 
objects.  The record types mirror Java classes.  The schema descriptions live 
in the different Java classes, and reference each other.

My current code looks like this...

public class Foo {

    static String schemaDescription =
        "{ " +
        "  \"namespace\": \"foo\", " +
        "  \"name\": \"Foo\", " +
        "  \"type\": \"record\", " +
        "  \"fields\": [ " +
        "    {\"name\": \"notes\", \"type\": \"string\" }, " +
        "    {\"name\": \"timestamp\", \"type\": \"string\" }, " +
        "    {\"name\": \"bah\", \"type\": " + Bah.schemaDescription + " }, " +
        "    {\"name\": \"zot\", \"type\": " + Zot.schemaDescription + " } " +
        "  ] " +
        "}";

    static Schema schema = Schema.parse(schemaDescription);


So, I am referencing by copying the schemaDescriptions.  The top level 
schemaDescription strings therefore get really big.

Is there already a clean coding Pattern for doing this-- I can't be the first.  
Is there a document describing best practices?

Thanks
P





On 7/14/11 7:02 PM, Scott Carey wrote:
The name and namespace is part of any named schema (Type.RECORD, Type.FIXED, 
Type.ENUM).

We don't currently have an API to search a schema for subschemas that match 
names.  It would be useful, you might want to create a JIRA ticket explaining 
your use case.

So it would be a little more complex.

Schema schema = Schema.parse(schemaDescription);
Schema.Type type = schema.getType();
switch (type) {
case RECORD:
  String name = schema.getName();
  String namespace = schema.getNamespace();
  List<Field> fields = schema.getFields();
}

etc.

In general, I have created SpecificRecord objects from schemas using the 
specific compiler (and the ant task or maven plugin) and then within those 
generated classes there is a static SCHEMA variable to reference.

Avro IDL is also an easier way to define related schemas.  Currently there are 
only build tools that generate code from these, though there are APIs to 
extract schemas.

-Scott

On 7/13/11 10:43 AM, Peter Wolf opus...@gmail.com
wrote:

Hello, this is a dumb question, but I can not find the answer in the docs

I want to have a complicated schema with lots of Records referencing other 
Records.

Like this...

{
  "namespace": "com.foobah",
  "name": "Bah",
  "type": "record",
  "fields": [
    {"name": "value", "type": "int"}
  ]
}

{
  "namespace": "com.foobah",
  "name": "Foo",
  "type": "record",
  "fields": [
    {"name": "bah", "type": "Bah"}
  ]
}
Using the Java API, how do I reference types within a schema?  Let's say I want 
to make a Foo object, I want to do something like this...

Schema schema = Schema.parse(schemaDescription);
 Schema foo = schema.getSchema("com.foobah.Foo"); 
GenericData o = new GenericData( foo );

Many thanks in advance
Peter







Re: Schema with multiple Record types Java API

2011-07-14 Thread Scott Carey
The name and namespace is part of any named schema (Type.RECORD, Type.FIXED, 
Type.ENUM).

We don't currently have an API to search a schema for subschemas that match 
names.  It would be useful, you might want to create a JIRA ticket explaining 
your use case.

So it would be a little more complex.

Schema schema = Schema.parse(schemaDescription);
Schema.Type type = schema.getType();
switch (type) {
case RECORD:
  String name = schema.getName();
  String namespace = schema.getNamespace();
  List<Field> fields = schema.getFields();
}

etc.

In general, I have created SpecificRecord objects from schemas using the 
specific compiler (and the ant task or maven plugin) and then within those 
generated classes there is a static SCHEMA variable to reference.

Avro IDL is also an easier way to define related schemas.  Currently there are 
only build tools that generate code from these, though there are APIs to 
extract schemas.

-Scott

On 7/13/11 10:43 AM, Peter Wolf opus...@gmail.com
wrote:

Hello, this is a dumb question, but I can not find the answer in the docs

I want to have a complicated schema with lots of Records referencing other 
Records.

Like this...

{
  namespace: com.foobah,
  name: Bah,
  type: record,
  fields: [
  {name: value, type: int}
  ]
}

{
  namespace: com.foobah,
  name: Foo,
  type: record,
  fields: [
  {name: bah, type: Bah}
  ]
}
Using the Java API, how do I reference types within a schema?  Let's say I want 
to make a Foo object, I want to do something like this...

Schema schema = Schema.parse(schemaDescription);
 Schema foo = schema.getSchema(com.foobah.Foo); 
GenericData o = new GenericData( foo );

Many thanks in advance
Peter





Re: Schema with multiple Record types Java API

2011-07-14 Thread Scott Carey
AvroIDL can handle imports, but it generates classes.  The Avro API's for this 
can be used to generate Schemas without making objects if you wish.

The Avro schema compiler (*.avsc, *.avpr) does not support imports, it is a 
feature requested by many but not contributed by anyone.

You may be interested in the code-gen capabilities of Avro, which has a 
Velocity templating engine to create Java classes based on schemas.  This can 
be customized to generate classes in custom ways.

However, if you are using Avro to serialize objects that have pre-existing 
classes, the Reflect API or an enhancement of it may be more suitable.

More information on your use case may help to point you in the right direction.

-Scott


On 7/14/11 6:43 PM, Peter Wolf opus...@gmail.com
wrote:

Many thanks Scott,

I am looking for the equivalent of #include or import.  I want to make a 
complicated schema with many record types, but manage it in separate strings.

In my application, I am using AVRO to serialize a tree of connected Java 
objects.  The record types mirror Java classes.  The schema descriptions live 
in the different Java classes, and reference each other.

My current code looks like this...

public class Foo {

static String schemaDescription =
{ +
  \namespace\: \foo\,  +
  \name\: \Foo\,  +
  \type\: \record\,  +
  \fields\: [  +
  {\name\: \notes\, \type\: \string\ },  +
  {\name\: \timestamp\, \type\: \string\ },  +
  {\name\: \bah\, \type\:  + 
Bah.schemaDescription +  }, +
  {\name\: \zot\, \type\:  + 
Zot.schemaDescription +  } +
] +
};

static Schema schema = Schema.parse(schemaDescription);


So, I am referencing by copying the schemaDescriptions.  The top level 
schemaDescription strings therefore get really big.

Is there already a clean coding Pattern for doing this-- I can't be the first.  
Is there a document describing best practices?

Thanks
P





On 7/14/11 7:02 PM, Scott Carey wrote:
The name and namespace is part of any named schema (Type.RECORD, Type.FIXED, 
Type.ENUM).

We don't currently have an API to search a schema for subschemas that match 
names.  It would be useful, you might want to create a JIRA ticket explaining 
your use case.

So it would be a little more complex.

Schema schema = Schema.parse(schemaDescription);
Schema.Type type = schema.getType();
switch (type) {
case RECORD:
  String name = schema.getName();
  String namespace = schema.getNamespace();
  List<Field> fields = schema.getFields();
}

etc.

In general, I have created SpecificRecord objects from schemas using the 
specific compiler (and the ant task or maven plugin) and then within those 
generated classes there is a static SCHEMA variable to reference.

Avro IDL is also an easier way to define related schemas.  Currently there are 
only build tools that generate code from these, though there are APIs to 
extract schemas.

-Scott

On 7/13/11 10:43 AM, Peter Wolf opus...@gmail.com
wrote:

Hello, this is a dumb question, but I can not find the answer in the docs

I want to have a complicated schema with lots of Records referencing other 
Records.

Like this...

{
  namespace: com.foobah,
  name: Bah,
  type: record,
  fields: [
  {name: value, type: int}
  ]
}

{
  namespace: com.foobah,
  name: Foo,
  type: record,
  fields: [
  {name: bah, type: Bah}
  ]
}
Using the Java API, how do I reference types within a schema?  Let's say I want 
to make a Foo object, I want to do something like this...

Schema schema = Schema.parse(schemaDescription);
 Schema foo = schema.getSchema(com.foobah.Foo); 
GenericData o = new GenericData( foo );

Many thanks in advance
Peter






Re: Classpath for java

2011-06-26 Thread Scott Carey
I suspect that you will need to go into the module with the Pair class.
When executing a maven plugin directly from the command line (exec:exec)
the maven 'scope' is very restricted, and when you do this on the top
level project it executes on that project only by default.

The surefire test plugin occurs in the test phase, after it has finished
all of the prior phases including compiling and constructing all the paths
required for testing.

On 6/26/11 9:53 AM, Jeremy Lewi jer...@lewi.us wrote:

Hi,

I'm having trouble understanding how the class path is being set by
maven for java. 

When I run a unit test using the maven surefire plugin
cd lang/java
mvn -Dtest=org.apache.avro.mapred.TestWordCount test -X

The output shows the following directories are on the classpath.
lang/java/mapred/target/test-classes
lang/java/mapred/target/classes
lang/java/ipc/target/classes
lang/java/avro/target/classes

But when I try to execute a class (I put a main method in
lang/java/.../Pair.java for testing)
 mvn exec:exec -Dexec.mainClass=Pair -X

Only 
lang/java/target/classes
is on the path.

So I'm trying to determine how to configure the exec plugin to properly
set the class path so that I can execute programs.

If anyone has any pointers I would greatly appreciate it.

Thanks

J





Re: Avro and Hadoop streaming

2011-06-15 Thread Scott Carey
Hadoop has an old version of Avro in it.  You must place the 1.6.0 jar
(and relevant dependencies, or the avro-tools.jar with all dependencies
bundled) in a location that gets picked up first in the task classpath.

Packaging it in the job jar works. I'm not sure if putting it in the
distributed cache and loading it as a library that way would.

On 6/15/11 9:30 AM, Matt Pouttu-Clarke
matt.pouttu-cla...@icrossing.com wrote:

You have to package it in the job jar file under a /lib directory.


On 6/15/11 9:26 AM, Miki Tebeka miki.teb...@gmail.com wrote:

 Still didn't work.
 
 I'm pretty new to hadoop world, I probably need to place the avro jar
 somewhere on the classpath of the nodes,
 however I have no idea how to do that.
 
 On Wed, Jun 15, 2011 at 3:33 AM, Harsh J ha...@cloudera.com wrote:
 Miki,
 
 You'll need to provide the entire canonical class name
 (org.apache.avro.mapredS).
 
 On Wed, Jun 15, 2011 at 5:31 AM, Miki Tebeka miki.teb...@gmail.com
wrote:
 Greetings,
 
 I've tried to run a job with the following command:
 
 hadoop jar ./hadoop-streaming-0.20.2-cdh3u0.jar \
-input /in/avro \
-output $out \
-mapper avro-mapper.py \
-reducer avro-reducer.py \
-file avro-mapper.py \
-file avro-reducer.py \
-cacheArchive /cache/avro-mapred-1.6.0-SNAPSHOT.jar \
-inputformat AvroAsTextInputFormat
 
 However I get
 -inputformat : class not found : AvroAsTextInputFormat
 
 I'm probably missing something obvious to do.
 
 Any ideas?
 
 Thanks!
 --
 Miki
 
 On Fri, Jun 3, 2011 at 1:43 AM, Doug Cutting cutt...@apache.org
wrote:
 Miki,
 
 Have you looked at AvroAsTextInputFormat?
 
 
http://avro.apache.org/docs/current/api/java/org/apache/avro/mapred/Av
roAsT
 extInputFormat.html
 
 Also, release 1.5.2 will include AvroTextOutputFormat:
 
 https://issues.apache.org/jira/browse/AVRO-830
 
 Are these perhaps what you're looking for?
 
 Doug
 
 On 06/02/2011 11:30 PM, Miki Tebeka wrote:
 Greetings,
 
 I'd like to use hadoop streaming with Avro files.
 My plan is to write an inputformat class that emits json records,
one
 per line. This way the streaming application can read one record per
 line.
 
(http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#Specifyi
ng+Ot
 her+Plugins+for+Jobs)
 
 I couldn't find any documentation/help about writing inputformat
 classes. Can someone point me to the right direction?
 
 Thanks,
 --
 Miki
 
 
 
 
 
 --
 Harsh J
 







Re: avro object reuse

2011-06-10 Thread Scott Carey
Corruption can occur in I/O busses and RAM.  Does this tend to fail on the same 
nodes, or any node randomly?  Since it does not fail consistently, this makes 
me suspect some sort of corruption even more.

I suggest turning on stack traces for fatal throwables.  This shouldn't hurt 
production performance since they don't happen regularly and break the task 
anyway.

Of the heap dumps seen so far, the primary consumption is byte[] and no more 
than 300MB.  How large are your java heaps?

On 6/10/11 10:53 AM, ey-chih chow eyc...@hotmail.com wrote:

Since this was in production, we did not turn on stack traces.  Also, it was 
highly unlikely that any data was corrupted because, if one mapper failed 
due to out of memory, the system started another one and went through all the 
data.


From: sc...@richrelevance.com
To: user@avro.apache.org
Date: Thu, 9 Jun 2011 17:43:02 -0700
Subject: Re: avro object reuse

If the exception is happening while decoding, it could be due to corrupt data. 
Avro allocates a List preallocated to the size encoded, and I've seen corrupted 
data cause attempted allocations of arrays too large for the heap.

On 6/9/11 4:58 PM, Scott Carey sc...@richrelevance.com wrote:

What is the stack trace on the out of memory exception?


On 6/9/11 4:45 PM, ey-chih chow eyc...@hotmail.com wrote:

We configure more than 100MB for MapReduce to do sorting.  The memory we allocate 
for doing other things in the mapper is actually larger, but, for this job, we 
always get out-of-memory exceptions and the job cannot complete.  We are trying to 
find out if there is a way to avoid this problem.

Ey-Chih Chow


From: sc...@richrelevance.com
To: user@avro.apache.org
Date: Thu, 9 Jun 2011 15:42:10 -0700
Subject: Re: avro object reuse

The most likely candidate for creating many instances of BufferAccessor and 
ByteArrayByteSource is BinaryData.compare() and BinaryData.hashCode().  Each 
call will create one of each (hash) or two of each (compare).  These are only 
32 bytes per instance and quickly become garbage that is easily cleaned up by 
the GC.

The below have only 32 bytes each and 8MB total.
On the other hand,  the byte[]'s appear to be about 24K each on average and are 
using 100MB.  Is this the size of your configured MapReduce sort MB?

On 6/9/11 3:08 PM, ey-chih chow eyc...@hotmail.com wrote:

We did more monitoring.  At one point, we got the following histogram via 
jmap.  The question is why there are so many instances of 
BinaryDecoder$BufferAccessor and BinaryDecoder$ByteArrayByteSource.  How do we 
avoid this?  Thanks.

Object Histogram:

num     #instances   #bytes       Class description
----------------------------------------------------------------
1:      4199         100241168    byte[]
2:      272948       8734336      org.apache.avro.io.BinaryDecoder$BufferAccessor
3:      272945       8734240      org.apache.avro.io.BinaryDecoder$ByteArrayByteSource
4:      2093         5387976      int[]
5:      23762        2822864      * ConstMethodKlass
6:      23762        1904760      * MethodKlass
7:      39295        1688992      * SymbolKlass
8:      2127         1216976      * ConstantPoolKlass
9:      2127         882760       * InstanceKlassKlass
10:     1847         742936       * ConstantPoolCacheKlass
11:     9602         715608       char[]
12:     1072         299584       * MethodDataKlass
13:     9698         232752       java.lang.String
14:     2317         222432       java.lang.Class
15:     3288         204440       short[]
16:     3167         156664       * System ObjArray
17:     2401         57624        java.util.HashMap$Entry
18:     666          53280        java.lang.reflect.Method
19:     161          52808        * ObjArrayKlassKlass
20:     1808         43392        java.util.Hashtable$Entry



From: eyc...@hotmail.com
To: user@avro.apache.org
Subject: RE: avro object reuse
Date: Wed, 1 Jun 2011 15:14:03 -0700

We make a lot of toString() calls on the Avro Utf8 object.  Will this cause 
Jackson calls?  Thanks.

Ey-Chih


From: sc...@richrelevance.com
To: user@avro.apache.org
Date: Wed, 1 Jun 2011 13:38:39 -0700
Subject: Re: avro object reuse

This is great info.

Jackson should only be used once when the file is opened, so this is confusing 
from that point of view.
Is something else using Jackson or initializing an Avro JsonDecoder frequently? 
 There are over 100,000 Jackson DeserializationConfig objects.

Another place that parses the schema is in AvroSerialization.java.  Does the 
Hadoop getDeserializer() API method get called once per job

Re: avro object reuse

2011-06-09 Thread Scott Carey
The most likely candidate for creating many instances of BufferAccessor and 
ByteArrayByteSource is BinaryData.compare() and BinaryData.hashCode().  Each 
call will create one of each (hash) or two of each (compare).  These are only 
32 bytes per instance and quickly become garbage that is easily cleaned up by 
the GC.

The below have only 32 bytes each and 8MB total.
On the other hand,  the byte[]'s appear to be about 24K each on average and are 
using 100MB.  Is this the size of your configured MapReduce sort MB?
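
For reference, a minimal sketch (not from the thread) of the kind of call that 
allocates those helpers: when Avro map output is sorted, keys are compared in 
their serialized form with BinaryData.compare(), and each call builds small, 
short-lived decoder state internally.  The key schema and values below are made 
up for illustration and use the newer Schema.Parser/EncoderFactory APIs.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryData;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class BinaryCompareSketch {
  static final Schema KEY_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Key\",\"fields\":"
      + "[{\"name\":\"id\",\"type\":\"long\"}]}");

  // Encode one key record to Avro binary.
  static byte[] encode(long id) throws Exception {
    GenericRecord key = new GenericData.Record(KEY_SCHEMA);
    key.put("id", id);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(KEY_SCHEMA).write(key, enc);
    enc.flush();
    return out.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    byte[] a = encode(1L);
    byte[] b = encode(2L);
    // Compares the encoded keys without fully deserializing them; the
    // temporary decoder objects it creates are the ones seen in the histogram.
    System.out.println(BinaryData.compare(a, 0, b, 0, KEY_SCHEMA));
  }
}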

On 6/9/11 3:08 PM, ey-chih chow eyc...@hotmail.com wrote:

We did more monitoring.  At one point, we got the following histogram via 
jmap.  The question is why there are so many instances of 
BinaryDecoder$BufferAccessor and BinaryDecoder$ByteArrayByteSource.  How do we 
avoid this?  Thanks.

Object Histogram:

num     #instances   #bytes       Class description
----------------------------------------------------------------
1:      4199         100241168    byte[]
2:      272948       8734336      org.apache.avro.io.BinaryDecoder$BufferAccessor
3:      272945       8734240      org.apache.avro.io.BinaryDecoder$ByteArrayByteSource
4:      2093         5387976      int[]
5:      23762        2822864      * ConstMethodKlass
6:      23762        1904760      * MethodKlass
7:      39295        1688992      * SymbolKlass
8:      2127         1216976      * ConstantPoolKlass
9:      2127         882760       * InstanceKlassKlass
10:     1847         742936       * ConstantPoolCacheKlass
11:     9602         715608       char[]
12:     1072         299584       * MethodDataKlass
13:     9698         232752       java.lang.String
14:     2317         222432       java.lang.Class
15:     3288         204440       short[]
16:     3167         156664       * System ObjArray
17:     2401         57624        java.util.HashMap$Entry
18:     666          53280        java.lang.reflect.Method
19:     161          52808        * ObjArrayKlassKlass
20:     1808         43392        java.util.Hashtable$Entry



From: eyc...@hotmail.com
To: user@avro.apache.org
Subject: RE: avro object reuse
Date: Wed, 1 Jun 2011 15:14:03 -0700

We make a lot of toString() calls on the Avro Utf8 object.  Will this cause 
Jackson calls?  Thanks.

Ey-Chih


From: sc...@richrelevance.com
To: user@avro.apache.org
Date: Wed, 1 Jun 2011 13:38:39 -0700
Subject: Re: avro object reuse

This is great info.

Jackson should only be used once when the file is opened, so this is confusing 
from that point of view.
Is something else using Jackson or initializing an Avro JsonDecoder frequently? 
 There are over 100,000 Jackson DeserializationConfig objects.

Another place that parses the schema is in AvroSerialization.java.  Does the 
Hadoop getDeserializer() API method get called once per job, or per record?  If 
this is called more than once per map job, it might explain this.

In principle, Jackson is only used by a mapper during initialization.  The 
below indicates that this may not be the case or that something outside of Avro 
is causing a lot of Jackson JSON parsing.

Are you using something that is converting the Avro data to Json form?  
toString() on most Avro datum objects will do a lot of work with Jackson, for 
example — but the below are deserializer objects not serializer objects so that 
is not likely the issue.
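
As a point of comparison, a minimal sketch of the normal read path (the file 
name is illustrative): the schema embedded in an Avro data file is parsed once, 
when the reader is opened, and the same record instance can be handed back in 
for reuse, so steady-state reading should not be creating Jackson objects at all.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReuseReadSketch {
  public static void main(String[] args) throws Exception {
    // Opening the reader parses the embedded schema; this is the only Jackson work.
    DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
        new File("events.avro"), new GenericDatumReader<GenericRecord>());
    GenericRecord reuse = null;
    while (reader.hasNext()) {
      reuse = reader.next(reuse);  // pass the previous record back in for reuse
      // ... use 'reuse' here before the next iteration overwrites it ...
    }
    reader.close();
  }
}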

On 6/1/11 11:34 AM, ey-chih chow eyc...@hotmail.com wrote:

We ran jmap on one of our mapper and found the top usage as follows:

num  #instances #bytes Class description
--
1: 24405 291733256 byte[]
2: 6056 40228984 int[]
3: 388799 19966776 char[]
4: 101779 16284640 org.codehaus.jackson.impl.ReaderBasedParser
5: 369623 11827936 java.lang.String
6: 111059 8769424 java.util.HashMap$Entry[]
7: 204083 8163320 org.codehaus.jackson.impl.JsonReadContext
8: 211374 6763968 java.util.HashMap$Entry
9: 102551 5742856 org.codehaus.jackson.util.TextBuffer
10: 105854 5080992 java.nio.HeapByteBuffer
11: 105821 5079408 java.nio.HeapCharBuffer
12: 104578 5019744 java.util.HashMap
13: 102551 4922448 org.codehaus.jackson.io.IOContext
14: 101782 4885536 org.codehaus.jackson.map.DeserializationConfig
15: 101783 4071320 org.codehaus.jackson.sym.CharsToNameCanonicalizer
16: 101779 4071160 org.codehaus.jackson.map.deser.StdDeserializationContext
17: 101779 4071160 java.io.StringReader
18: 101754 4070160 java.util.HashMap$KeyIterator

It looks like Jackson eats up a lot of memory.  Our mapper reads in files in 
the Avro format.  Does Avro use Jackson a lot when reading Avro files?  Is 
there any way to improve this?  Thanks.

Ey-Chih Chow


From: sc...@richrelevance.com
To: 

Re: avro object reuse

2011-06-02 Thread Scott Carey
No, that should not trigger Jackson parsing.   Schema.parse() and 
Protocol.parse() do.
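
A minimal sketch of the pattern this implies, with a made-up record name: parse 
the schema once per JVM and reuse that Schema object everywhere; new 
GenericData.Record(schema) itself does no JSON work.  (On older releases without 
Schema.Parser, the static Schema.parse(String) plays the same role.)

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class SchemaCacheSketch {
  // Parsed once; this is the only line that invokes Jackson.
  private static final Schema EVENT_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
      + "[{\"name\":\"name\",\"type\":\"string\"}]}");

  // Creating records against the cached schema allocates no Jackson objects.
  public static GenericRecord newEvent(String name) {
    GenericRecord record = new GenericData.Record(EVENT_SCHEMA);
    record.put("name", name);
    return record;
  }
}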



On 6/2/11 10:23 AM, ey-chih chow eyc...@hotmail.com wrote:

We create GenericData.Record a lot in our code via new 
GenericData.Record(schema).  Will this generate Jackson calls?  Thanks.

Ey-Chih Chow

 From: sc...@richrelevance.com
 To: user@avro.apache.org
 Date: Wed, 1 Jun 2011 18:48:15 -0700
 Subject: Re: avro object reuse

 One thing we do right now that might be related is the following:

 We keep Avro default Schema values as JsonNode objects. While traversing
 the JSON Avro schema representation using ObjectMapper.readTree() we
 remember JsonNodes that are default properties on fields and keep them
 on the Schema object.
 If these keep references to the parent (and the whole JSON tree, or worse,
 the ObjectMapper and input stream) it would be poor use of Jackson by us;
 although we'd need a way to keep a detached JsonNode or equivalent.

 However, even if that is the case (which it does not seem to be -- the
 jmap output has no JsonNode instances), it doesn't explain why we would be
 calling ObjectMapper frequently. We only call
 ObjectMapper.readTree(JsonParser) when creating a Schema from JSON. We
 call JsonNode methods from extracted fragments for everything else.


 This brings me to the following suspicion based on the data:
 Somewhere, Schema objects are being created frequently via one of the
 Schema.parse() or Protocol.parse() static methods.

 On 6/1/11 5:48 PM, Tatu Saloranta tsalora...@gmail.com wrote:

 On Wed, Jun 1, 2011 at 5:45 PM, Scott Carey sc...@richrelevance.com wrote:
  It would be useful to get a 'jmap -histo:live' report as well, which
 will
  only have items that remain after a full GC.
 
  However, a high churn of short lived Jackson objects is not expected
 here
  unless the user is reading Json serialized files and not Avro binary.
  Avro Data Files only contain binary encoded Avro content.
 
  It would be surprising to see many Jackson objects here if reading Avro
  Data Files, because we expect to use Jackson to parse an Avro schema
 from
  json only once or twice per file. After the schema is parsed, Jackson
  shouldn't be used. A hundred thousand DeserializationConfig instances
  means that isn't the case.
 
 Right -- it indicates that something (else) is using Jackson; and
 there will typically be one instance of DeserializationConfig for each
 data-binding call (ObjectMapper.readValue()), as a read-only copy is
 made for operation.
 ... or if something is reading schema that many times, that sounds
 like a problem in itself.
 
 -+ Tatu +-



Re: mixed schema avro data file?

2011-06-01 Thread Scott Carey
Two options:

* Different files per schema
* One schema that is a union of all schemas you want in the file

Which is best depends on your use case.
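
A minimal sketch of the second option; the two event schemas here are invented 
for illustration.  A union of the schemas becomes the file's schema, and records 
of either branch can be appended to the same DataFileWriter.

import java.io.File;
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class UnionFileSketch {
  public static void main(String[] args) throws Exception {
    Schema click = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Click\",\"fields\":"
        + "[{\"name\":\"url\",\"type\":\"string\"}]}");
    Schema view = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"View\",\"fields\":"
        + "[{\"name\":\"page\",\"type\":\"string\"}]}");
    Schema union = Schema.createUnion(Arrays.asList(click, view));

    // One writer over the union schema accepts records of either branch.
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(union));
    writer.create(union, new File("events.avro"));

    GenericRecord c = new GenericData.Record(click);
    c.put("url", "http://example.com");
    writer.append(c);

    GenericRecord v = new GenericData.Record(view);
    v.put("page", "home");
    writer.append(v);
    writer.close();
  }
}

On the read side, DataFileReader hands back generic records whose getSchema() 
identifies which branch each one came from.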

On 6/1/11 4:02 PM, Yang tedd...@gmail.com wrote:

Our use case is that we have many different types of events, with different 
schemas.

I was thinking of dumping them all into one file, for easier maintenance of the 
files, but then I found that the DataFileWriter and
JsonEncoder/Decoder all require a schema to be present, so each file can really 
have only one schema.  Of course I could create a
separate encoder/writer for each record I write, but then there would be no way 
to parse the file back out later; such a mixed-schema file would be
useful only to humans at best.

So generally, what is your experience in dealing with serializing objects of 
different types?  Do you put them in different files?

Thanks
Yang


Re: avro object reuse

2011-06-01 Thread Scott Carey
One thing we do right now that might be related is the following:

We keep Avro default Schema values as JsonNode objects. While traversing
the JSON Avro schema representation using ObjectMapper.readTree() we
remember JsonNodes that are default properties on fields and keep them
on the Schema object.
If these keep references to the parent (and the whole JSON tree, or worse,
the ObjectMapper and input stream) it would be poor use of Jackson by us;
although we'd need a way to keep a detached JsonNode or equivalent.

However, even if that is the case (which it does not seem to be -- the
jmap output has no JsonNode instances), it doesn't explain why we would be
calling ObjectMapper frequently.  We only call
ObjectMapper.readTree(JsonParser) when creating a Schema from JSON.  We
call JsonNode methods from extracted fragments for everything else.


This brings me to the following suspicion based on the data:
Somewhere, Schema objects are being created frequently via one of the
Schema.parse() or Protocol.parse() static methods.

On 6/1/11 5:48 PM, Tatu Saloranta tsalora...@gmail.com wrote:

On Wed, Jun 1, 2011 at 5:45 PM, Scott Carey sc...@richrelevance.com
wrote:
 It would be useful to get a 'jmap -histo:live' report as well, which
will
 only have items that remain after a full GC.

 However, a high churn of short lived Jackson objects is not expected
here
 unless the user is reading Json serialized files and not Avro binary.
 Avro Data Files only contain binary encoded Avro content.

 It would be surprising to see many Jackson objects here if reading Avro
 Data Files, because we expect to use Jackson to parse an Avro schema
from
 json only once or twice per file.  After the schema is parsed, Jackson
 shouldn't be used.   A hundred thousand DeserializationConfig instances
 means that isn't the case.

Right -- it indicates that something (else) is using Jackson; and
there will typically be one instance of DeserializationConfig for each
data-binding call (ObjectMapper.readValue()), as a read-only copy is
made for operation.
... or if something is reading schema that many times, that sounds
like a problem in itself.

-+ Tatu +-



Re: I have written a layout for log4j using avro

2011-05-31 Thread Scott Carey
To read and write an Avro Data File use the classes in org.apache.avro.file :
http://avro.apache.org/docs/current/api/java/index.html

The classes in tools are command line tools that wrap Avro Java APIs.  The 
source code of these can be used as examples for using these APIs.
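
A minimal sketch of writing a deflate-compressed data file directly with that 
API instead of going through DataFileWriteTool; the schema and field names below 
are made up for illustration.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class CompressedLogFileSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":"
        + "[{\"name\":\"level\",\"type\":\"string\"},"
        + "{\"name\":\"message\",\"type\":\"string\"}]}");

    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.deflateCodec(6));   // deflate level 1-9
    writer.create(schema, new File("logs.avro"));

    GenericRecord event = new GenericData.Record(schema);
    event.put("level", "INFO");
    event.put("message", "application started");
    writer.append(event);
    writer.close();
  }
}

Note that setCodec() must be called before create().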

On 5/30/11 8:01 AM, harisgx . hari...@gmail.com wrote:

Hi,

I have written a layout for log4j using avro.

http://bytescrolls.blogspot.com/2011/05/using-avro-to-serialize-logs-in-log4j.html

https://github.com/harisgx/avro-log4j

But if I want to convert the records to an Avro data file in a compressed form, 
the docs mention using DataFileWriteTool to read newline-delimited 
JSON records.
In its method,
- run(InputStream stdin, PrintStream out, PrintStream err, List<String> args)

args must be non-null.  How do we populate the args values?

thanks
-haris



Re: inheritance implementation?

2011-05-31 Thread Scott Carey
You can do this a few ways.  The composition you list will work, the member 
variable should be of type Fruit.

Or you can put the type object inside the fruit:

record Fruit {
int size;
string color;
int weight;
union { Apple, Orange } type;
}

record Orange {
string skin_thickness;
}

record Apple {
string skin_pattern;
}

However, Avro's IDL language and the Specific compiler in Java will not compile 
this into a class hierarchy.  You can use a wrapper class in Java to do that.  
A factory method to create a specific Fruit subclass by inspecting a Fruit 
would use instanceof to determine the union type and create the corresponding 
object.

One way or the other, you do need to do some instanceof / casting depending on 
what you are accessing.  I have used the pattern above, with the 'type' inside 
the outer general object.
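
A minimal sketch of working with that layout through the generic API, so it runs 
without generated classes; with Specific-generated classes the same check becomes 
instanceof on the union member, as described above.  Field and record names 
follow the IDL sketch.

import java.util.List;
import org.apache.avro.generic.GenericRecord;

public class FruitSketch {
  // The common fields live on Fruit itself, so no casting is needed to sum them.
  static int findTotalWeight(List<GenericRecord> fruits) {
    int total = 0;
    for (GenericRecord fruit : fruits) {
      total += (Integer) fruit.get("weight");
    }
    return total;
  }

  // Only code that cares about the concrete kind looks at the union member.
  static String kindOf(GenericRecord fruit) {
    GenericRecord type = (GenericRecord) fruit.get("type");
    return type.getSchema().getName();  // "Apple" or "Orange"
  }
}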

On 5/31/11 11:33 AM, Yang tedd...@gmail.com wrote:

I understand that avro does not have inheritance now, so I am wondering what is 
the best way to achieve the following goal:

I define Apple, Orange, and Fruit. Apple and Orange should ideally derive from 
Fruit, but since there is no built-in mechanism,
we create an internal Fruit member in both Apple and Orange, encapsulating the 
common Fruit contents.

Apple :{
Fruit: fruit_member

string: pattern_on_skin
}

Orange : {

Fruit: fruit_member

string: skin_thickness
}


Fruit: {
int : size,
string: color
int: weight
}



Say I want to pass objects of both Apple and Orange to some scale to measure 
the total weight; I can pass them just as Objects:


int findTotalWeight(List<Object> l) {
    int result = 0;
    for (Object o : l) {
        result += ???;   // somehow get access to the fruit_member var ??
    }
    return result;
}


So what is the best way to fill in the line above?  Doing a lot of 
instanceof is kind of cumbersome.


Thanks
Yang


