Re: HUG talk on PTD/Avro

2010-04-26 Thread Ken Krugler

Hi Doug,

On Apr 23, 2010, at 1:31pm, Doug Cutting wrote:


Ken Krugler wrote:
3. It would be great to get feedback on both the Avro Cascading  
scheme (http://github.com/bixolabs/cascading.avro) and the content  
we're currently saving in the Avro file.


Overall it looks fine to me.

What do you think of https://issues.apache.org/jira/browse/AVRO-513?  
Would that make your life much easier?


I read through it, but don't understand why "...explicitly detect  
sequences of matching data" is an issue.


What's the definition of matching data? Is there a common use case  
for Avro where you need to detect duplicates?


It might be more efficient, instead of reading Avro generic data and  
converting it to your desired representation, to subclass  
GenericDatumReader and override #readString(), #readBytes(),  
#readMap(), and #readArray().  Similarly for DatumWriter.  But we'd  
then also need to permit one to configure AvroRecordReader to use a  
different DatumReader implementation.  We might, e.g., add a  
DataRepresentationFactory interface:


interface DataRepresentation<T> {
  DatumReader<T> createDatumReader();
  DatumWriter<T> createDatumWriter();
}


Then we could replace AvroJob#setInputSpecific() and  
#setInputGeneric() with  
#setInputRepresentation(Class<DataRepresentation> rep, Schema s).  
You could subclass GenericDatumReader and GenericDatumWriter and  
implement a DataRepresentation that returns these.
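
A rough sketch of how the proposed interface might fit together, assuming nothing about the real Avro API: DatumReader and DatumWriter below are one-method stand-ins for Avro's interfaces, and StringRepresentation, instantiate() and roundTrip() are hypothetical names used only for illustration. The point is that a job configured with a Class could build the reader and writer itself, and each would convert straight to the desired Java type.

```java
import java.nio.charset.StandardCharsets;

class RepresentationSketch {
    // Stand-ins for Avro's DatumReader/DatumWriter, reduced to one method each.
    // The real interfaces work against Decoder/Encoder and a Schema.
    interface DatumReader<T> { T read(byte[] encoded); }
    interface DatumWriter<T> { byte[] write(T datum); }

    // The interface proposed in the thread.
    interface DataRepresentation<T> {
        DatumReader<T> createDatumReader();
        DatumWriter<T> createDatumWriter();
    }

    // Hypothetical representation: serialized bytes map directly to
    // java.lang.String, with no intermediate generic object.
    static class StringRepresentation implements DataRepresentation<String> {
        public DatumReader<String> createDatumReader() {
            return encoded -> new String(encoded, StandardCharsets.UTF_8);
        }
        public DatumWriter<String> createDatumWriter() {
            return datum -> datum.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Roughly what a setInputRepresentation(Class, Schema) style hook would
    // have to do: instantiate the configured class reflectively.
    static <T> DataRepresentation<T> instantiate(
            Class<? extends DataRepresentation<T>> cls) {
        try {
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create " + cls.getName(), e);
        }
    }

    // Write a value and read it back through the configured representation.
    static String roundTrip(String value) {
        DataRepresentation<String> rep = instantiate(StringRepresentation.class);
        byte[] stored = rep.createDatumWriter().write(value);
        return rep.createDatumReader().read(stored);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello avro"));
    }
}
```

In a real implementation the reader and writer would presumably be subclasses of GenericDatumReader and GenericDatumWriter, as suggested above, rather than lambdas.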


Worth it?


I assume the performance win comes because there's only one conversion  
to/from the serialized/stored data, versus two.


If so, then it would definitely be faster, but I don't know by how  
much. It seems like the most likely bottleneck would be with strings,  
as these need conversion and can be long/common.


I'd either need to hook up a profiler to a typical read or write flow,  
or disable the string conversion and measure the speedup.
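
The second option could be approximated outside Avro with something like the sketch below, which walks the same byte records with and without UTF-8 decoding and compares wall-clock time. All names here (StringConversionTiming, sampleRecords, consume, timeNanos) are made up for illustration, the records are synthetic ASCII data, and a single timed pass ignores JIT warm-up, so a harness such as JMH would give more trustworthy numbers.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StringConversionTiming {
    // Build `count` synthetic records of `length` ASCII bytes each.
    static byte[][] sampleRecords(int count, int length) {
        byte[][] records = new byte[count][];
        for (int i = 0; i < count; i++) {
            records[i] = new byte[length];
            Arrays.fill(records[i], (byte) 'a');
        }
        return records;
    }

    // Visit every record; decode to a String only when `convert` is true.
    // Returns the summed length so the work cannot be optimized away.
    static long consume(byte[][] records, boolean convert) {
        long total = 0;
        for (byte[] record : records) {
            total += convert
                    ? new String(record, StandardCharsets.UTF_8).length()
                    : record.length;
        }
        return total;
    }

    // Crude wall-clock timing of one pass over the records.
    static long timeNanos(byte[][] records, boolean convert) {
        long start = System.nanoTime();
        consume(records, convert);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        byte[][] records = sampleRecords(100_000, 200);
        System.out.println("with conversion:    " + timeNanos(records, true) + " ns");
        System.out.println("without conversion: " + timeNanos(records, false) + " ns");
    }
}
```

This only isolates the decoding cost itself; it says nothing about how large that cost is relative to the rest of a typical read or write flow, which is what a profiler run would show.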


So no recommendation for now, until I get time to try that out.

Thanks,

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: HUG talk on PTD/Avro

2010-04-23 Thread Doug Cutting

Ken Krugler wrote:
3. It would be great to get feedback on both the Avro Cascading scheme 
(http://github.com/bixolabs/cascading.avro) and the content we're 
currently saving in the Avro file.


Overall it looks fine to me.

What do you think of https://issues.apache.org/jira/browse/AVRO-513? 
Would that make your life much easier?


It might be more efficient, instead of reading Avro generic data and 
converting it to your desired representation, to subclass 
GenericDatumReader and override #readString(), #readBytes(), #readMap(), 
and #readArray().  Similarly for DatumWriter.  But we'd then also need 
to permit one to configure AvroRecordReader to use a different 
DatumReader implementation.  We might, e.g., add a 
DataRepresentationFactory interface:


interface DataRepresentation<T> {
  DatumReader<T> createDatumReader();
  DatumWriter<T> createDatumWriter();
}

Then we could replace AvroJob#setInputSpecific() and #setInputGeneric() 
with #setInputRepresentation(Class<DataRepresentation> rep, Schema s). 
You could subclass GenericDatumReader and GenericDatumWriter and 
implement a DataRepresentation that returns these.


Worth it?

Doug