[jira] Updated: (AVRO-517) Resolving Decoder fails in some cases

2010-04-16 Thread Thiruvalluvan M. G. (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thiruvalluvan M. G. updated AVRO-517:
-

Status: Patch Available  (was: Open)

Thanks Scott for catching this intricate bug.

 Resolving Decoder fails in some cases
 -

 Key: AVRO-517
 URL: https://issues.apache.org/jira/browse/AVRO-517
 Project: Avro
  Issue Type: Bug
  Components: java
Affects Versions: 1.3.2
Reporter: Scott Carey
Assignee: Thiruvalluvan M. G.
Priority: Critical
 Attachments: AVRO-517.patch


 User reports that reading an 'actual' schema of 
  string, string, int
 fails when using an expected schema of:
  string, string
 Sample code and details in the comments.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (AVRO-517) Resolving Decoder fails in some cases

2010-04-16 Thread Thiruvalluvan M. G. (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thiruvalluvan M. G. updated AVRO-517:
-

Attachment: AVRO-517.patch

The trouble is that the ResolvingDecoder does not consume the trailing 
field from the underlying BinaryDecoder, so part of the data belonging to the 
current object is left behind. GenericDatumReader constructs a new 
ResolvingDecoder for the next object, so the leftover integer field is read 
as if it were a string of the next object.
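A rough, self-contained illustration of the failure mode (the schemas and values 
are made up to match the shape of the report, and the encoder/decoder factory 
calls are the ones from later Avro releases rather than the 1.3.2 API; on a 
patched build the second read succeeds):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.util.Utf8;

public class Avro517Repro {
  // Writer has three fields (string, string, int); reader expects only the two strings.
  private static final Schema WRITER = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
      + "{\"name\":\"f1\",\"type\":\"string\"},"
      + "{\"name\":\"f2\",\"type\":\"string\"},"
      + "{\"name\":\"f3\",\"type\":\"int\"}]}");
  private static final Schema READER = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
      + "{\"name\":\"f1\",\"type\":\"string\"},"
      + "{\"name\":\"f2\",\"type\":\"string\"}]}");

  public static void main(String[] args) throws Exception {
    // Write two consecutive records with the three-field writer schema.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(WRITER);
    for (int i = 0; i < 2; i++) {
      GenericRecord r = new GenericData.Record(WRITER);
      r.put("f1", new Utf8("a" + i));
      r.put("f2", new Utf8("b" + i));
      r.put("f3", i);
      writer.write(r, enc);
    }
    enc.flush();

    // Read both back with the two-field reader schema.
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(WRITER, READER);
    System.out.println(reader.read(null, dec));  // first record reads fine
    System.out.println(reader.read(null, dec));  // on an unpatched 1.3.2, the leftover
                                                 // int from record 1 corrupts this read
  }
}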

This problem would not occur if the same ResolvingDecoder were used for all the 
objects, but that approach would require quite a few changes to 
GenericDatumReader. So I added a new method drain() to ResolvingDecoder, 
which, if called after reading the entire record as per the reader's schema, 
drains the remaining unused portion of the writer's data.
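For code that drives a ResolvingDecoder directly, the intended calling pattern 
would look roughly like the sketch below; the resolvingDecoder() factory method 
shown is how later Avro releases expose the class, and the two-string reader 
schema matches the report above.

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.ResolvingDecoder;
import org.apache.avro.util.Utf8;

public class DrainSketch {
  // Reads one record whose reader schema has two string fields, while the
  // writer's schema may carry extra trailing fields (the int in this report).
  public static void readOne(Schema writer, Schema reader, Decoder in) throws IOException {
    ResolvingDecoder rd = DecoderFactory.get().resolvingDecoder(writer, reader, in);
    rd.readFieldOrder();                 // resolve the field order for the reader
    Utf8 f1 = rd.readString(null);
    Utf8 f2 = rd.readString(null);
    System.out.println(f1 + ", " + f2);
    // The reader's schema stops here, but the writer also wrote an int. Without
    // drain() that int stays in the underlying BinaryDecoder and is misread as
    // part of the next object; drain() skips such trailing, writer-only data.
    rd.drain();
  }
}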

 Resolving Decoder fails in some cases
 -

 Key: AVRO-517
 URL: https://issues.apache.org/jira/browse/AVRO-517
 Project: Avro
  Issue Type: Bug
  Components: java
Affects Versions: 1.3.2
Reporter: Scott Carey
Assignee: Thiruvalluvan M. G.
Priority: Critical
 Attachments: AVRO-517.patch


 User reports that reading an 'actual' schema of 
  string, string, int
 fails when using an expected schema of:
  string, string
 Sample code and details in the comments.





Re: Questions re integrating Avro into Cascading process

2010-04-16 Thread Scott Carey

On Apr 15, 2010, at 10:33 AM, Ken Krugler wrote:

 Hi all,
 
 We're looking at creating a Cascading Scheme for Avro, and have got a  
 few questions below. These are very general, as this is more of a  
 scoping phase (as in, are we crazy to try this) so apologies in  
 advance for lack of detail.
 
 For context, Cascading is an open source project that provides a  
 workflow API on top of Hadoop. The key unit of data is a tuple, which  
 corresponds to a record - you have fields (names) and values.  
 Cascading uses a generalized tap concept for reading & writing  
 tuples, where a tap uses a scheme to handle the low-level mapping from  
 Cascading-land to/from the storage format.

I am somewhat familiar with Cascading as a user.  I am not familiar with how it 
is implemented or how to customize things like a Tap or Sink.

Correct me if I'm wrong, but its notion of a record is very simple -- there are 
no arrays or maps -- just a list of fields.
This maps to Avro easily.

 
 So the goal here is to define a Cascading Scheme that will run on  
 0.18.3 and later versions of Hadoop, and provide general support for  
 reading/writing tuples from/to an Avro-format Hadoop part-x file.
 
 We grabbed the recently committed AvroXXX code from  
 org.apache.avro.mapred (thanks Doug & Scott), and began building the  
 Cascading scheme to bridge between AvroWrapper<T> keys and Cascading  
 tuples.

You might be fine without the org.apache.avro.mapred stuff -- specifically if 
you only need the sinks and taps to use Avro and not the stuff in between a map 
and reduce.  For example, I have a custom LoadFunc in Pig that can read/write 
Avro data files, working off Avro 1.3.0 -- but it works only for a static schema.

 
 1. What's the best approach if we want to dynamically define the Avro  
 schema, based on a list of field names and types (classes)?
 

Creating an Avro schema programmatically is fairly straightforward -- 
especially without arrays, maps, or unions.  If the code has access to the 
Cascading record definition, transforming that into an Avro schema dynamically 
should be straightforward. Schema has various constructors and static methods 
from which you can get the JSON schema representation or just pass around 
Schema objects.
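
A rough sketch of that kind of translation, assuming the 1.3-era Schema.Field 
constructor (its default-value argument changed in later releases) and a made-up 
helper name:

import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.Schema.Type;

// Hypothetical helper: build a flat Avro record schema from Cascading-style
// field names and value classes. Only a few primitive mappings are shown.
public class CascadingSchemas {
  public static Schema toAvroSchema(String recordName, String[] names, Class<?>[] types) {
    List<Field> fields = new ArrayList<Field>();
    for (int i = 0; i < names.length; i++) {
      Type t;
      if (types[i] == String.class)       t = Type.STRING;
      else if (types[i] == Integer.class) t = Type.INT;
      else if (types[i] == Long.class)    t = Type.LONG;
      else if (types[i] == Double.class)  t = Type.DOUBLE;
      else throw new IllegalArgumentException("unmapped type: " + types[i]);
      fields.add(new Field(names[i], Schema.create(t), null, null));  // no doc, no default
    }
    Schema record = Schema.createRecord(recordName, null, "cascading.avro", false);
    record.setFields(fields);
    return record;
  }
}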


 This assumes it's possible to dynamically define & use a schema, of  
 course.
 
 2. How much has the new Hadoop map-reduce support code been tested?
 

I can't speak for all of what Doug has done here, but there are unit tests for 
basic stuff -- word count, etc.


 3. Will there be issues with running in 0.18.3, 0.19.2, etc?
 
 I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,  
 and that creating problems. Anything else?

I'm using Avro 1.3.0 with 0.19.2 and 0.20.1 CDH2 in production and the only 
problem was the above library conflict.  This is without the new 
o.a.avro.mapred stuff however.

 
 4. The key integration point, besides the fields+classes to schema  
 issue above, is mapping between Cascading tuples and AvroWrapper<T>.
 
 If we're using (I assume) the generic format, any input on how we'd do  
 this two-way conversion?
 

I'd suggest thinking about using Avro container files for input and output, 
which may not require the above depending on how Cascading is built internally. 
 In Pig, for example, the LoadFunc defines a Pig schema on input for reading, 
and everything else from there requires no change. Although this means the 
default Pig types and serialization are used for all the intermediate work, 
reading and writing the job's inputs and outputs can still be done with Avro 
with minimal effort.
Cascading already defines the M/R jobs, the keys, the values, etc., so you may 
only have to modify the Tap to translate from an Avro schema to the Cascading 
record to get it to read or write an Avro file.

One can go farther and use AvroWrapper and o.a.avro.mapred to define the M/R 
jobs, enabling a lot of other possibilities.  I can't confidently state what all 
the requirements are here beyond doing the Cascading record <-> Avro schema 
translation and changing all the touch points that Cascading has on the K/V 
types.
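
If the Scheme does lean on o.a.avro.mapred, the job-conf wiring itself is small. 
A minimal sketch (the helper class is hypothetical; AvroJob.setInputSchema and 
setOutputSchema are the calls I have in mind):

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.hadoop.mapred.JobConf;

public class AvroConfSetup {
  // Register the input/output schemas so the Avro input/output formats are used.
  public static void configure(JobConf conf, Schema inputSchema, Schema outputSchema) {
    AvroJob.setInputSchema(conf, inputSchema);
    AvroJob.setOutputSchema(conf, outputSchema);
  }
}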


 Thanks!
 
 -- Ken
 
 
 Ken Krugler
 +1 530-210-6378
 http://bixolabs.com
 e l a s t i c   w e b   m i n i n g
 
 
 
 



Re: Questions re integrating Avro into Cascading process

2010-04-16 Thread Ken Krugler

Hi Scott,

Thanks for the response. See below for my comments...


We're looking at creating a Cascading Scheme for Avro, and have got a
few questions below. These are very general, as this is more of a
scoping phase (as in, are we crazy to try this) so apologies in
advance for lack of detail.

For context, Cascading is an open source project that provides a
workflow API on top of Hadoop. The key unit of data is a tuple, which
corresponds to a record - you have fields (names) and values.
Cascading uses a generalized tap concept for reading & writing
tuples, where a tap uses a scheme to handle the low-level mapping  
from

Cascading-land to/from the storage format.


I am somewhat familiar with Cascading as a user.  I am not familiar  
with how it is implemented or how to customize things like a Tap or  
Sink.


Correct me if I'm wrong, but its notion of a record is very simple  
-- there are no arrays or maps -- just a list of fields.

This maps to Avro easily.


Correct - currently Cascading doesn't have built-in support for  
arrays, maps or unions - though I believe arrays & maps are on the list.



So the goal here is to define a Cascading Scheme that will run on
0.18.3 and later versions of Hadoop, and provide general support for
reading/writing tuples from/to an Avro-format Hadoop part-x file.

We grabbed the recently committed AvroXXX code from
org.apache.avro.mapred (thanks Doug & Scott), and began building the
Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
tuples.


You might be fine without the org.apache.avro.mapred stuff --  
specifically if you only need the sinks and taps to use Avro and not  
the stuff in between a map and reduce.  For example, I have a custom  
LoadFunc in Pig that can read/write avro data files working off Avro  
1.3.0 -- but it works for a static schema.




1. What's the best approach if we want to dynamically define the Avro
schema, based on a list of field names and types (classes)?



Creating an Avro schema programmatically is fairly straightforward  
-- especially without arrays, maps, or unions.  If the code has  
access to the Cascading record definition, transforming that into an  
Avro schema dynamically should be straightforward. Schema has  
various constructors and static methods from which you can get the  
JSON schema representation or just pass around Schema objects.


We're currently using the string rep, since a Schema isn't  
serializable, and Cascading needs that to save the defined workflow in  
the job conf.
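
For reference, a minimal sketch of that round trip (the conf key is made up; 
Schema.parse(String) is the 1.3-era parsing call, later releases use 
Schema.Parser):

import org.apache.avro.Schema;
import org.apache.hadoop.mapred.JobConf;

public class SchemaConfUtil {
  private static final String KEY = "cascading.avro.schema";  // hypothetical property name

  public static void store(JobConf conf, Schema schema) {
    conf.set(KEY, schema.toString());        // canonical JSON form of the schema
  }

  public static Schema load(JobConf conf) {
    return Schema.parse(conf.get(KEY));      // rebuild the Schema on the task side
  }
}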


[snip]


3. Will there be issues with running in 0.18.3, 0.19.2, etc?

I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,
and that then creating problems. Anything else?


I'm using Avro 1.3.0 with 0.19.2 and 0.20.1 CDH2 in production and  
the only problem was the above library conflict.  This is without  
the new o.a.avro.mapred stuff however.


Great, good to know.


4. The key integration point, besides the fields+classes to schema
issue above, is mapping between Cascading tuples and AvroWrapper<T>

If we're using (I assume) the generic format, any input on how we'd  
do

this two-way conversion?



I'd suggest thinking about using Avro container files for input and  
output, which may not require the above depending on how Cascading  
is built internally.  In Pig for example, the LoadFunc defines a pig  
schema on input for reading, and everything else from there requires  
no change -- although this means that it is using the default pig  
types and serialization for all the intermediate work, reading and  
writing inputs and outputs can be done with Avro with minimal effort.
Cascading is already defining the M/R jobs, the keys, values, etc...  
so you may only have to modify the Tap to translate from an Avro  
schema to the Cascading record to get it to read or write an Avro  
file.


So far one issue is that we need to translate between Cascading  
Strings and Avro Utf8 types, but most everything else works just fine.
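
A minimal sketch of that translation, with made-up helper names:

import org.apache.avro.util.Utf8;

public class Utf8Convert {
  public static String toJavaString(Object avroValue) {
    // Generic data hands back Utf8 for Avro strings; convert when filling a Tuple.
    return avroValue == null ? null : avroValue.toString();
  }

  public static Utf8 toAvroString(String s) {
    // Wrap a Java String when building a generic record for the sink.
    return s == null ? null : new Utf8(s);
  }
}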


One can go farther and use AvroWrapper and o.a.avro.mapred to define  
the M/R jobs, enabling a lot of other possibilities.  I can't  
confidently state what all the requirements are here beyond  
doing the Cascading record <-> Avro schema translation and changing  
all the touch points that Cascading has on the K/V types.


It's pretty much four routines in the scheme:

- sinkInit (setting up the conf properly, for which we're using the  
AvroJob support)

- sourceInit (same thing)

- sink (mapping from Tuple to o.a.avro.generic.GenericData)
- source (mapping from o.a.avro.generic.GenericData to Tuple)

The above is all based on the Avro mapred support, so we just have to  
do the translation work for Fields <-> Schema and Tuple <-> GenericData.
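
As a rough sketch of the Tuple <-> GenericData half -- a plain List<Object> 
stands in for the Cascading Tuple so as not to guess at Cascading's API, helper 
names are made up, and Schema.getFields() is assumed to return the fields in 
order, as it does in recent releases:

import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

public class TupleRecordMapper {

  // Build a generic record from tuple values, matching fields by position.
  public static GenericData.Record toRecord(Schema schema, List<Object> tupleValues) {
    GenericData.Record record = new GenericData.Record(schema);
    List<Schema.Field> fields = schema.getFields();
    for (int i = 0; i < fields.size(); i++) {
      record.put(fields.get(i).name(), tupleValues.get(i));
    }
    return record;
  }

  // Pull values back out of a generic record in schema order.
  public static void fillTuple(GenericData.Record record, List<Object> tupleValues) {
    for (Schema.Field f : record.getSchema().getFields()) {
      tupleValues.add(record.get(f.name()));   // values come back as Avro types (e.g. Utf8)
    }
  }
}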


It looks pretty doable, thanks for the help!

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






[jira] Commented: (AVRO-517) Resolving Decoder fails in some cases

2010-04-16 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857920#action_12857920
 ] 

Scott Carey commented on AVRO-517:
--

+1

This patch looks good to me. 



 Resolving Decoder fails in some cases
 -

 Key: AVRO-517
 URL: https://issues.apache.org/jira/browse/AVRO-517
 Project: Avro
  Issue Type: Bug
  Components: java
Affects Versions: 1.3.2
Reporter: Scott Carey
Assignee: Thiruvalluvan M. G.
Priority: Critical
 Attachments: AVRO-517.patch


 User reports that reading an 'actual' schema of 
  string, string, int
 fails when using an expected schema of:
  string, string
 Sample code and details in the comments.





Re: Questions re integrating Avro into Cascading process

2010-04-16 Thread Scott Carey

On Apr 16, 2010, at 11:20 AM, Ken Krugler wrote:

 Hi Scott,
 
 Thanks for the response. See below for my comments...
 
 
 Correct me if I'm wrong, but its notion of a record is very simple  
 -- there are no arrays or maps -- just a list of fields.
 This maps to avro easily.
 
 Correct - currently Cascading doesn't have built-in support for  
 arrays, maps or unions - though I believe arrays & maps are on the list.
 

It would be great if Cascading, Pig, and Hive (along with Avro) could get to 
some good common ground on all of these data types.


 Creating an Avro schema programmatically is fairly straightforward  
 -- especially without arrays, maps, or unions.  If the code has  
 access to the Cascading record definition, transforming that into an  
 Avro schema dynamically should be straightforward. Schema has  
 various constructors and static methods from which you can get the  
 JSON schema representation or just pass around Schema objects.
 
 We're currently using the string rep, since a Schema isn't  
 serializable, and Cascading needs that to save the defined workflow in  
 the job conf.
 

That should work well.  The JSON string representation is the canonical, 
cross-language serialization of an Avro schema.

 
 So far one issue is that we need to translate between Cascading  
 Strings and Avro Utf8 types, but most everything else works just fine.
 

Let us know about the difficulties here and any suggestions or requests for 
enhancement.  
I am interested in making the String <-> Utf8 situation more efficient and 
easier to use.


 One can go farther and use AvroWrapper and o.a.avro.mapred to define  
 the M/R jobs, enabling a lot of other possibilities.  I can't  
 confidently state what all the requirements are here beyond  
 doing the Cascading record <-> Avro schema translation and changing  
 all the touch points that Cascading has on the K/V types.
 
 It's pretty much four routines in the scheme:
 
 - sinkInit (setting up the conf properly, for which we're using the  
 AvroJob support)
 - sourceInit (same thing)
 
 - sink (mapping from Tuple to o.a.avro.generic.GenericData)
 - source (mapping from o.a.avro.generic.GenericData to Tuple)
 
 The above is all based on the Avro mapred support, so we just have to  
 do the translation work for Fields <-> Schema and Tuple <-> GenericData.
 
 It looks pretty doable, thanks for the help!
 
 -- Ken
 
 
 Ken Krugler
 +1 530-210-6378
 http://bixolabs.com
 e l a s t i c   w e b   m i n i n g