[jira] Commented: (AVRO-532) Java: optimize Utf8#toString() and 'new Utf8(String)'

2010-04-28 Thread Ken Krugler (JIRA)

[ https://issues.apache.org/jira/browse/AVRO-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862019#action_12862019 ]

Ken Krugler commented on AVRO-532:
-----------------------------------

I added an example for Bryan as a comment on the [THRIFT-765] issue.

> Java: optimize Utf8#toString() and 'new Utf8(String)'
> -----------------------------------------------------
>
> Key: AVRO-532
> URL: https://issues.apache.org/jira/browse/AVRO-532
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Reporter: Doug Cutting
>
> Avro's Specific and Generic Java APIs frequently convert Java Strings to Avro 
> Utf8.  This conversion might be optimized.




Re: HUG talk on PTD/Avro

2010-04-26 Thread Ken Krugler

Hi Doug,

On Apr 23, 2010, at 1:31pm, Doug Cutting wrote:


Ken Krugler wrote:
3. It would be great to get feedback on both the Avro Cascading  
scheme (http://github.com/bixolabs/cascading.avro) and the content  
we're currently saving in the Avro file.


Overall it looks fine to me.

What do you think of https://issues.apache.org/jira/browse/AVRO-513?  
Would that make your life much easier?


I read through it, but don't understand why "...explicitly detect
sequences of matching data" is an issue.


What's the definition of "matching data"? Is there a common use case  
for Avro where you need to detect duplicates?


It might be more efficient, instead of reading Avro generic data and
converting it to your desired representation, to subclass
GenericDatumReader and override #readString(), #readBytes(), #readMap(),
and #readArray().  Similarly for DatumWriter.  But we'd then also need to
permit one to configure AvroRecordReader to use a different DatumReader
implementation.  We might, e.g., add a DataRepresentation interface:


interface DataRepresentation {
 DatumReader createDatumReader();
 DatumWriter createDatumWriter();
}


Then we could replace AvroJob#setInputSpecific() and  
#setInputGeneric() with  
#setInputRepresentation(Class rep, Schema s).  
You could subclass GenericDatumReader & Writer and implement a  
DataRepresentation that returns these.
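
For concreteness, a rough sketch of the reader side (the readString
signature assumes Avro 1.3's GenericDatumReader, and the class name is
just illustrative):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.Decoder;
import org.apache.avro.util.Utf8;

// Sketch: a DatumReader that hands back java.lang.String instead of Utf8,
// so downstream code never has to convert.
public class StringDatumReader extends GenericDatumReader<Object> {
  public StringDatumReader(Schema schema) {
    super(schema);
  }

  @Override
  protected Object readString(Object old, Decoder in) throws IOException {
    return in.readString((Utf8) null).toString();  // decode once, convert once
  }
}

A matching DatumWriter subclass could do the inverse conversion, and a
DataRepresentation implementation would just return instances of the two.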


Worth it?


I assume the performance win comes because there's only one conversion  
to/from the serialized & stored data, versus two.


If so, then it would definitely be faster, but I don't know by how  
much. It seems like the most likely bottleneck would be with strings,  
as these need conversion and can be long/common.


I'd either need to hook up a profiler to a typical read or write flow,  
or disable the string conversion and measure the speedup.


So no recommendation for now, until I get time to try that out.

Thanks,

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Questions re integrating Avro into Cascading process

2010-04-26 Thread Ken Krugler


On Apr 23, 2010, at 12:33pm, Doug Cutting wrote:


Ken Krugler wrote:
1. I'm assuming there's no compelling reason to read the file  
headers - in fact, not sure how you'd even get at the data, much  
less how you'd deal with potentially partial/missing data from a  
set of Avro files being read as part files.


I'm not sure what you're asking here.


Sorry, I should have been clearer.

I was thinking about the read side of things, when using the Cascading  
Scheme to pull data from Avro files. If these files have metadata,  
there's no good way to get at it via the Cascading interface, and  
given that a directory will typically contain a set of part-x  
files, it didn't seem like you could do much with the results in any  
case. So just checking to make sure I wasn't overlooking something.


2. We'd like to not include Avro source in the Cascading scheme  
project, but rather just have a dependency on the Avro jar.
We have a similar relationship between Bixo and Tika, and what's  
worked well is for the Bixo master branch to have a dependency on  
the Tika snapshot builds, so we can quickly iterate on both projects.
So are there plans to start pushing Avro snapshot builds to the  
Apache snapshots repository? I see occasional Avro releases to the  
Maven central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.


I'm okay if someone wants to, e.g., configure a nightly Hudson build  
that pushes out an Avro snapshot jar.  Apache releases should not  
depend on snapshots, but snapshots are useful for development.


Avro's build.xml already includes a task to post a snapshot jar.  I
tested it once, which accounts for the single Avro snapshot that
exists.  So it should be simple to configure Hudson to do this.
Philip was going to set up Hudson builds for Avro.  Philip?


That would be great, thanks!

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






HUG talk on PTD/Avro

2010-04-23 Thread Ken Krugler

Hi all,

Just wrote a blog post about the talk I gave on Wed night at the  
Hadoop Bay Area user group meetup:


http://bixolabs.com/2010/04/22/hadoop-user-group-meetup-talk/

Key points about Avro:

1. The Avro scheme for Cascading worked well for writing out fetch  
results, and we are using it in the example analysis code to read the  
same files for processing.


2. A sample Avro file (one of 613, from the first loop) is available on
S3 (/bixolabs-ptd-demo/ptd-sample.avro), and we're working with Amazon
to get this initial set into the Amazon public dataset.


3. It would be great to get feedback on both the Avro Cascading scheme  
(http://github.com/bixolabs/cascading.avro) and the content we're  
currently saving in the Avro file.


Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Questions re integrating Avro into Cascading process

2010-04-22 Thread Ken Krugler


On Apr 21, 2010, at 3:22pm, Doug Cutting wrote:


Ken Krugler wrote:
One open issue - it would be great to be able to set metadata in  
the headers of the resulting Avro files. But it wasn't obvious how  
to do that, given our (intentionally) arms-length approach via the  
use of the Avro mapred code.
One idea would be to have job conf values using keys prefixed with  
avro.metadata.xxx, and the Avro mapred support could automagically  
use that when creating the file. But this would break our goal of  
using unmodified Avro source, so I'm curious whether support for  
setting the file metadata would also be useful for the standard  
(Hadoop) use of Avro for an output format, and if so, whether there  
was a better approach.


Embedding the metadata in the configuration seems like a good  
approach.  Please file a Jira issue for this and attach a patch.


AvroOutputFormat can add properties named  
avro.mapred.output.metadata.*.  We'll have to enumerate all  
properties in the job and test for this prefix, since Configuration  
is a HashMap, but the alternative of encoding the metadata map in a  
single configuration value seems no more attractive.
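
A minimal sketch of that enumeration (the helper name and its wiring into
AvroOutputFormat are illustrative, and it assumes a Hadoop version whose
Configuration is Iterable over its entries):

import java.util.Map;
import org.apache.avro.file.DataFileWriter;
import org.apache.hadoop.mapred.JobConf;

public class MetadataCopier {
  static final String PREFIX = "avro.mapred.output.metadata.";

  // Copy every job property carrying the prefix into the Avro file header.
  static void copyMetadata(JobConf job, DataFileWriter<?> writer) {
    for (Map.Entry<String, String> entry : job) {
      if (entry.getKey().startsWith(PREFIX)) {
        writer.setMeta(entry.getKey().substring(PREFIX.length()),
                       entry.getValue());
      }
    }
  }
}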


Note that https://issues.apache.org/jira/browse/HADOOP-6420 added  
support for adding maps to configuration, but the extracted map  
cannot be enumerated, so could not be added to the DataFileWriter's  
metadata. Also, this feature is perhaps slated for removal as a part  
of https://issues.apache.org/jira/browse/HADOOP-6698, but its code  
might prove useful as a starting point.


Thanks for the info, we'll work up a patch & file the issue when it's  
ready.


Two related questions:

1. I'm assuming there's no compelling reason to read the file headers  
- in fact, not sure how you'd even get at the data, much less how  
you'd deal with potentially partial/missing data from a set of Avro  
files being read as part files.


2. We'd like to not include Avro source in the Cascading scheme  
project, but rather just have a dependency on the Avro jar.


We have a similar relationship between Bixo and Tika, and what's  
worked well is for the Bixo master branch to have a dependency on the  
Tika snapshot builds, so we can quickly iterate on both projects.


So are there plans to start pushing Avro snapshot builds to the Apache  
snapshots repository? I see occasional Avro releases to the Maven  
central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.


Thanks,

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Questions re integrating Avro into Cascading process

2010-04-18 Thread Ken Krugler

Hi all,

We're looking at creating a Cascading Scheme for Avro, and have got  
a few questions below. These are very general, as this is more of a  
scoping phase (as in, are we crazy to try this) so apologies in  
advance for lack of detail.


For context, Cascading is an open source project that provides a  
workflow API on top of Hadoop. The key unit of data is a tuple,  
which corresponds to a record - you have fields (names) and values.  
Cascading uses a generalized "tap" concept for reading & writing  
tuples, where a tap uses a scheme to handle the low-level mapping  
from Cascading-land to/from the storage format.


So the goal here is to define a Cascading Scheme that will run on  
0.18.3 and later versions of Hadoop, and provide general support for  
reading/writing tuples from/to an Avro-format Hadoop part-x file.


We grabbed the recently committed AvroXXX code from  
org.apache.avro.mapred (thanks Doug & Scott), and began building the  
Cascading scheme to bridge between AvroWrapper keys and Cascading  
tuples.


An update on status - there's a working Cascading tap at
http://github.com/bixolabs/cascading.avro. See the README
(http://github.com/bixolabs/cascading.avro/blob/master/README) for more details.


One open issue - it would be great to be able to set metadata in the  
headers of the resulting Avro files. But it wasn't obvious how to do  
that, given our (intentionally) arms-length approach via the use of  
the Avro mapred code.


One idea would be to have job conf values using keys prefixed with  
avro.metadata.xxx, and the Avro mapred support could automagically use  
that when creating the file. But this would break our goal of using  
unmodified Avro source, so I'm curious whether support for setting the  
file metadata would also be useful for the standard (Hadoop) use of  
Avro for an output format, and if so, whether there was a better  
approach.


Thanks!

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Questions re integrating Avro into Cascading process

2010-04-16 Thread Ken Krugler

Hi Scott,

Thanks for the response. See below for my comments...


We're looking at creating a Cascading Scheme for Avro, and have got a
few questions below. These are very general, as this is more of a
scoping phase (as in, are we crazy to try this) so apologies in
advance for lack of detail.

For context, Cascading is an open source project that provides a
workflow API on top of Hadoop. The key unit of data is a tuple, which
corresponds to a record - you have fields (names) and values.
Cascading uses a generalized "tap" concept for reading & writing
tuples, where a tap uses a scheme to handle the low-level mapping from
Cascading-land to/from the storage format.


I am somewhat familiar with Cascading as a user.  I am not familiar  
with how it is implemented or how to customize things like a Tap or  
Sink.


Correct me if I'm wrong, but its notion of a record is very simple  
-- there are no arrays or maps -- just a list of fields.

This maps to Avro easily.


Correct - currently Cascading doesn't have built-in support for  
arrays, maps or unions - though I believe arrays & maps are on the list.



So the goal here is to define a Cascading Scheme that will run on
0.18.3 and later versions of Hadoop, and provide general support for
reading/writing tuples from/to an Avro-format Hadoop part-x file.

We grabbed the recently committed AvroXXX code from
org.apache.avro.mapred (thanks Doug & Scott), and began building the
Cascading scheme to bridge between AvroWrapper keys and Cascading
tuples.


You might be fine without the org.apache.avro.mapred stuff --  
specifically if you only need the sinks and taps to use Avro and not  
the stuff in between a map and reduce.  For example, I have a custom  
LoadFunc in Pig that can read/write avro data files working off Avro  
1.3.0 -- but it works for a static schema.




1. What's the best approach if we want to dynamically define the Avro
schema, based on a list of field names and types (classes)?



Creating an Avro schema programmatically is fairly straightforward  
-- especially without arrays, maps, or unions.  If the code has  
access to the Cascading record definition, transforming that into an  
Avro schema dynamically should be straightforward. Schema has  
various constructors and static methods from which you can get the  
JSON schema representation or just pass around Schema objects.


We're currently using the string rep, since a Schema isn't  
serializable, and Cascading needs that to save the defined workflow in  
the job conf.
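
For reference, a rough sketch of both pieces -- building the schema from
Cascading-style field names and classes, and round-tripping it through its
JSON string -- using Avro 1.3's Schema API (the type mapping and class name
here are just illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.Schema.Type;

public class TupleSchemas {
  // Build a record schema from parallel arrays of field names and classes.
  static Schema fromFields(String[] names, Class<?>[] types) {
    List<Field> fields = new ArrayList<Field>();
    for (int i = 0; i < names.length; i++) {
      Type type;
      if (types[i] == String.class)       type = Type.STRING;
      else if (types[i] == Integer.class) type = Type.INT;
      else if (types[i] == Long.class)    type = Type.LONG;
      else if (types[i] == Float.class)   type = Type.FLOAT;
      else if (types[i] == Double.class)  type = Type.DOUBLE;
      else if (types[i] == Boolean.class) type = Type.BOOLEAN;
      else                                type = Type.BYTES;  // fallback
      fields.add(new Field(names[i], Schema.create(type), null, null));
    }
    Schema record = Schema.createRecord("CascadingTuple", null,
                                        "cascading.avro", false);
    record.setFields(fields);
    return record;
  }

  // The JSON string can ride along in the job conf and be re-parsed later.
  static String toJson(Schema schema) { return schema.toString(); }
  static Schema fromJson(String json) { return Schema.parse(json); }
}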


[snip]


3. Will there be issues with running in 0.18.3, 0.19.2, etc?

I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,
and that then creating problems. Anything else?


I'm using Avro 1.3.0 with 0.19.2 and 0.20.1 CDH2 in production and  
the only problem was the above library conflict.  This is without  
the new o.a.avro.mapred stuff however.


Great, good to know.


4. The key integration point, besides the fields+classes to schema
issue above, is mapping between Cascading tuples and AvroWrapper.

If we're using (I assume) the generic format, any input on how we'd do
this two-way conversion?



I'd suggest thinking about using Avro container files for input and
output, which may not require the above, depending on how Cascading is
built internally.  In Pig, for example, the LoadFunc defines a Pig schema
on input for reading, and everything else from there requires no change --
although this means it uses the default Pig types and serialization for
all the intermediate work, reading and writing inputs and outputs can be
done with Avro with minimal effort.  Cascading is already defining the M/R
jobs, the keys, values, etc., so you may only have to modify the Tap to
translate from an Avro schema to the Cascading record to get it to read
or write an Avro file.


So far one issue is that we need to translate between Cascading  
Strings and Avro Utf8 types, but most everything else works just fine.


One can go farther and use AvroWrapper and o.a.avro.mapred to define the
M/R jobs, enabling a lot of other possibilities.  I can't confidently
state what all the requirements are here outside of doing the Cascading
record <-> Avro schema translation and changing all the touch points that
Cascading has on the K/V types.


It's pretty much four routines in the scheme:

- sinkInit (setting up the conf properly, for which we're using the  
AvroJob support)

- sourceInit (same thing)

- sink (mapping from Tuple to o.a.avro.generic.GenericData)
- source (mapping from o.a.avro.generic.GenericData to Tuple)

The above is all based on the Avro mapred support, so we just have to  
do the translation work for Fields <-> Schema and Tuple <-> GenericData.
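
The conversion itself is little more than a loop with the Utf8/String
fix-up. A minimal sketch of the Avro side (a Cascading Tuple's values
would be iterated the same way as these arrays; the class and method names
are illustrative):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.util.Utf8;

public class TupleRecordMapping {
  // sink: field values -> GenericData.Record, converting String to Utf8
  static GenericData.Record toRecord(Schema schema, Object[] values) {
    GenericData.Record record = new GenericData.Record(schema);
    for (int i = 0; i < values.length; i++) {
      Object v = values[i];
      record.put(i, v instanceof String ? new Utf8((String) v) : v);
    }
    return record;
  }

  // source: GenericData.Record -> field values, converting Utf8 to String
  static Object[] fromRecord(GenericData.Record record, int numFields) {
    Object[] values = new Object[numFields];
    for (int i = 0; i < numFields; i++) {
      Object v = record.get(i);
      values[i] = (v instanceof Utf8) ? v.toString() : v;
    }
    return values;
  }
}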


It looks pretty doable, thanks for the help!

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Questions re integrating Avro into Cascading process

2010-04-15 Thread Ken Krugler

Hi all,

We're looking at creating a Cascading Scheme for Avro, and have got a  
few questions below. These are very general, as this is more of a  
scoping phase (as in, are we crazy to try this) so apologies in  
advance for lack of detail.


For context, Cascading is an open source project that provides a  
workflow API on top of Hadoop. The key unit of data is a tuple, which  
corresponds to a record - you have fields (names) and values.  
Cascading uses a generalized "tap" concept for reading & writing  
tuples, where a tap uses a scheme to handle the low-level mapping from  
Cascading-land to/from the storage format.


So the goal here is to define a Cascading Scheme that will run on  
0.18.3 and later versions of Hadoop, and provide general support for  
reading/writing tuples from/to an Avro-format Hadoop part-x file.


We grabbed the recently committed AvroXXX code from  
org.apache.avro.mapred (thanks Doug & Scott), and began building the  
Cascading scheme to bridge between AvroWrapper keys and Cascading  
tuples.


1. What's the best approach if we want to dynamically define the Avro  
schema, based on a list of field names and types (classes)?


This assumes it's possible to dynamically define & use a schema, of  
course.


2. How much has the new Hadoop map-reduce support code been tested?

3. Will there be issues with running in 0.18.3, 0.19.2, etc?

I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,  
and that then creating problems. Anything else?


4. The key integration point, besides the fields+classes to schema  
issue above, is mapping between Cascading tuples and AvroWrapper.


If we're using (I assume) the generic format, any input on how we'd do  
this two-way conversion?


Thanks!

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g