[ 
https://issues.apache.org/jira/browse/AVRO-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833283#action_12833283
 ] 

Scott Carey commented on AVRO-392:
----------------------------------

bq. AvroFactory is not a good name. If the intent is to eventually support 
encoders to, then might better be named CodecFactory, were not the term codec 
already used for compression codecs. We could perhaps rename those Compressors 
or somesuch. Or we could just name this DecoderFactory, since that's all it 
does at present.

I think it should work with encoders too.  Configuration might overlap, and one 
central place to deal with these is probably better than two.
The compression codec name collision is an issue, but that is in the file spec. 
 There already is a class named CodecFactory.  Maybe renaming them should go in 
a different ticket?  Also, CodecFactory is not the best name either, since 
Codec means to do both Decoding and Encoding, and these do one or the other. 
Hmm.  Maybe two is better just so the naming is clean.  Opinions?

bq. Is there a reason we can't add #isEnd() to Decoder? That would make it 
easier to write Decoder-agnostic programs.
A DirectBinaryDecoder could never implement it othter than by always returning 
the same value or throwing an exception.  It must actually attempt a read in 
order to figure out if it is at the end, which means it might 'overshoot' its 
input, the whole point of its existence is not to do that.  
If we want isEnd() to be universally useable without side-effects for all 
decoders (or binary decoders), we can't have DirectBinaryDecoder.  It will be 
significantly easier to maintain a nice common interface if the semantics don't 
differ too much.   What semantics for read-ahead and 'isEnd' from Jacson on the 
Json side?

bq. BinaryDecoder might become an abstract class, like a tag interface, with 
concrete subclasses BufferedBinaryDecoder and DirectBinaryDecoder. That would 
enable applications to accept a BinaryDecoder that's either buffered or not.
That is how I started out.  The factory also created the abstract BinaryDecoder 
type and the implementations were hidden, and the 'best' type given the factory 
configuration was returned (even if 'direct' is requested, the buffered one 
doesn't buffer when accessing a byte array so it is preferred).  But the 
DataFileReader needed isEnd(), and couldn't operate on this type, which broke 
that factory paradigm.  In short, the parent type became no more useful than 
Decoder. 
I thought about it more, and realized that the semantic differences between one 
implementation that _must_ never read ahead and _must_ work with InputStreams 
and another that may read ahead and can work with a variety of data sources is 
going to continue to cause very different client API needs.

A user who has the need to use the 'direct' form and also for whatever reason 
can't use the .inputStream() proxy is going to have peculiar special needs that 
I don't think belong on a supertype.
A user who has simpler needs and will primarily be streaming data from various 
sources where all data is Avro will have their own unique needs.

So I'm guessing that isEnd(), is only the beginning -- of things we can do 
easily on one but not on the other that are useful for some and not for others.

I think that the first use case is extremely rare.

bq. the javadoc for DirectBinaryDecoder should be updated to describe that it 
is a non-buffering variant.
If we decide to keep that class, I'll clean it up.


So in summary:
* Change the factory name to DecoderFactory unless someone comes up with a good 
name for a factory that is both a DecoderFactory and an EncoderFactory
* Decide if we really want a DirectBinaryDecoder and how much of an API it 
shares with a read-ahead one.  Clean up javadoc for DirectBinaryDecoder if it 
is kept.
* Consider a new ticket to rename file block Codec's to something much more 
distinct from Decoder and Encoder. 

> Binary Decoder Performance and flexibility overhaul
> ---------------------------------------------------
>
>                 Key: AVRO-392
>                 URL: https://issues.apache.org/jira/browse/AVRO-392
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>            Priority: Critical
>             Fix For: 1.3.0
>
>         Attachments: AVRO-392-preview.patch, 
> AVRO-392-with_DirectBinaryDecoder.patch, AVRO-392.patch, AVRO-392.patch, 
> AVRO-392.patch, AVRO-392.patch
>
>
> BinaryDecoder has room for significant performance improvement.  
> [AVRO-327|https://issues.apache.org/jira/browse/AVRO-327] has some 
> preliminary work here, but in order to satisfy some use cases there is much 
> more work to do.
> I am opening a new ticket because the scope of the changes needed to do this 
> the right way are larger.
> I have done a large bulk of a new implementation that abstracts a 
> 'ByteSource' from the BinaryDecoder.  Currently BinaryDecoder is tightly 
> coupled to InputStream.  The ByteSource can wrap an InputStream, FileChannel, 
> or byte[] in this version, but could be extended to support other channel 
> types, sockets, etc.  This abstraction allows the BinaryDecoder to buffer 
> data from various sources while supporting interleaved access to the 
> underlying data and greater flexibility going forward.
> The performance of this abstraction has been heavily tuned so that maximum 
> performance can be achieved even for slower ByteSource implementations.
> For readers that must interleave reads on a stream with the decoder, this 
> includes a
> {code}
> public InputStream inputStream();
> {code}
> method on the decoder that can serve interleaved reads.  
> Additionally it will be necessary to have a constructor on BinaryDecoder that 
> allows two BinaryDecoders to share a stream (and buffer).
> Performance results on this new version is better than previous prototypes:
> *current trunk BinaryDecoder*
> {noformat}
> ReadInt: 983 ms, 30.497877855999185 million entries/sec
> ReadLongSmall: 1058 ms, 28.336666040111496 million entries/sec
> ReadLong: 1518 ms, 19.75179889508437 million entries/sec
> ReadFloat: 657 ms, 45.61031157924184 million entries/sec
> ReadDouble: 761 ms, 39.387756709704355 million entries/sec
> ReadBoolean: 331 ms, 90.4268145647456 million entries/sec
> RepeaterTest: 7718 ms, 3.886725782038378 million entries/sec
> NestedRecordTest: 1884 ms, 15.91964611687992 million entries/sec
> ResolverTest: 8296 ms, 3.616055866616717 million entries/sec
> MigrationTest: 21216 ms, 1.4139999570144013 million entries/sec
> {noformat}
> *buffering BinaryDecoder*
> {noformat}
> ReadInt: 187 ms, 160.22131904871262 million entries/sec
> ReadLongSmall: 372 ms, 80.4863521975457 million entries/sec
> ReadLong: 613 ms, 48.882385721129246 million entries/sec
> ReadFloat: 253 ms, 118.16606270679061 million entries/sec
> ReadDouble: 275 ms, 108.94314257389068 million entries/sec
> ReadBoolean: 222 ms, 134.85327963176064 million entries/sec
> RepeaterTest: 3335 ms, 8.993007936329503 million entries/sec
> NestedRecordTest: 1152 ms, 26.0256943004597 million entries/sec
> ResolverTest: 4213 ms, 7.120659335077578 million entries/sec
> MigrationTest: 15310 ms, 1.9594884898992941 million entries/sec
> {noformat}
> Performance is 2x to 5x the throughput of trunk on most tests.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to