[ https://issues.apache.org/jira/browse/AVRO-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833490#action_12833490 ]
Scott Carey commented on AVRO-392:
----------------------------------

{quote}
I support adding isEnd() to Decoder. DirectBinaryDecoder would always return false. If someone does not want buffering, he cannot rely on isEnd(). isEnd() will be called by "high-level" functions and not by functions that work with a single Avro object corresponding to a schema. If such a high-level function needs to mix Avro and non-Avro data in the stream, it should know when the Avro data ends, and so it would not need to call isEnd(). The high-level functions that rely on isEnd() to recognize the end of Avro data cannot mix Avro data and non-Avro data.
{quote}

That isn't quite enough. For example, if we did that, then the file format would break if implementations switched, since it throws an exception if isEnd() doesn't return true at the end of a data block.

I strongly prefer APIs that aren't wishy-washy, so a client needs to know whether isEnd() is guaranteed to work. We could add an isEnd() if we also added a supportsIsEnd(), so that a client could enforce the use of the feature. This is acceptable, and jibes with the configurable factory: "give me an instance that supports isEnd, please" --> (buffered impl); "give me one that supports isEnd and is also direct" --> sorry!

We could generalize this and have one method on Decoder:

{code}
boolean isFeatureSupported(Feature f);
{code}

where Feature is an enum in Decoder, and would currently have IS_END, INPUTSTREAM_PROXY, and NO_READAHEAD, or something like that. Putting inputStream() on Decoder has similar issues; JsonDecoder, for example, can't always support that.

{quote}
One option is that we can add a boolean to the factory methods that specifies if one wants buffering or not. Another option is to have additional factory methods. I prefer the former, but am okay with the other, too.
{quote}

The way I'll build this would be like Jackson's Factory API.
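To make the capability-check idea concrete, here is a minimal sketch of what the Feature enum and isFeatureSupported() could look like. This is illustrative only, not the patch's actual code; the class names and which features each implementation claims are assumptions for the sake of the example.

```java
import java.util.EnumSet;

// Hypothetical sketch: Decoder exposes an enum of optional capabilities and a
// single check method, instead of one supportsXyz() method per feature.
abstract class Decoder {
    enum Feature { IS_END, INPUTSTREAM_PROXY, NO_READAHEAD }

    abstract boolean isFeatureSupported(Feature f);
}

class BufferedBinaryDecoder extends Decoder {
    // A buffering implementation can detect end-of-input and can proxy its
    // underlying stream for interleaved reads.
    private static final EnumSet<Feature> SUPPORTED =
        EnumSet.of(Feature.IS_END, Feature.INPUTSTREAM_PROXY);

    @Override
    boolean isFeatureSupported(Feature f) {
        return SUPPORTED.contains(f);
    }
}

class DirectBinaryDecoder extends Decoder {
    // A direct implementation never reads ahead of what is consumed, but for
    // that very reason it cannot implement isEnd() reliably.
    @Override
    boolean isFeatureSupported(Feature f) {
        return f == Feature.NO_READAHEAD;
    }
}
```

A client like DataFileStream could then check isFeatureSupported(Feature.IS_END) once, up front, and fail fast rather than discovering mid-read that the decoder it was handed cannot detect the end of a data block.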
Configure the factory to make instances with certain configuration settings, and then there would be even fewer (smarter) factory methods. Otherwise, we'll find a few more configuration parameters and end up with 5 booleans and two ints on each method signature.

bq. BinaryData class seems to know the internals of BinaryDecoder. Is there a way to avoid it?

Probably not, but I'll have another look. They are intimately related classes.

bq. There is significant code that is common between . . .

Yeah, my original version had a parent BinaryDecoder with the readahead and direct ones as children. But I got frustrated with isEnd() -- the DataFileStream class would have to know for sure that it had an instance with a functioning isEnd(). If we do want to have both decoders and maintain them, this is an option to merge that shared code. However, it could be worth it to wait. Merging them implies standardizing, forever, on an API that has to have optionally supported parts, like isEnd() + supportsIsEnd(). I suspect that these optional parts will grow over time, and all client code will have to check for them, specify them at factory configuration, etc. All that adds up to a non-trivial set of baggage to manage the decoders with different semantics. I still think that it would be easier to add features to the Decoder/Encoder APIs for interleaving non-Avro data.

bq. Can we not make ByteArrayByteSource.compactAndFill(), by simply setting the minPos to pos, instead of creating a new array? Am I missing something?

It used to do an arraycopy() to the front of the array, but that mutated the data, which made it impossible to, for example, read the same array twice. Since the purpose of compactAndFill() is for use by ensureBounds(), it has to make a new (usually very small) array to deal with the leftovers so that readLong() won't overflow bounds. Well, I suppose we could try catching IndexOutOfBounds instead, or something else.
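The "configure the factory, not the method signatures" idea can be sketched as follows. All names here (DecoderFactory, configureDirect, configureBufferSize) are illustrative stand-ins in the style of Jackson's JsonFactory, not the actual API from the patch.

```java
import java.io.InputStream;

// Minimal stand-in for a decoder: it only records how it was configured.
class BinaryDecoder {
    final boolean buffered;
    final int bufferSize;

    BinaryDecoder(boolean buffered, int bufferSize) {
        this.buffered = buffered;
        this.bufferSize = bufferSize;
    }
}

// Hypothetical sketch: configuration lives on the factory (fluent setters),
// so there is one smart create method instead of overloads accumulating
// "5 booleans and two ints" each.
class DecoderFactory {
    private boolean direct = false;   // direct = no read-ahead past what is consumed
    private int bufferSize = 8192;    // only meaningful for buffering decoders

    DecoderFactory configureDirect(boolean direct) {
        this.direct = direct;
        return this;
    }

    DecoderFactory configureBufferSize(int size) {
        this.bufferSize = size;
        return this;
    }

    BinaryDecoder createBinaryDecoder(InputStream in) {
        // Behavior comes from factory state, not from extra method parameters.
        return new BinaryDecoder(!direct, direct ? 0 : bufferSize);
    }
}
```

The payoff is that when a new knob appears later, it is one new setter on the factory rather than a new parameter threaded through every create method and every caller.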
Since this wasn't clear, I should add code comments or organize the code so that it is more clear what is going on there.

bq. I think ByteArrayByteSource.close() should set the eof flag and ba.setPos(ba.limit()) so that all subsequent reads will fail.

Good idea!

bq. The indentation isn't consistent. The most common problem is that the close brace is not indented properly (sometimes too much, sometimes too little), but there are other lines too.

I was worried that would happen :( I used the 'ignore whitespace' option of svn diff. Less consistent whitespace for recipients of the patch is a side effect; the benefit is for those who want to make the patch easier to read by minimizing line changes that only affect whitespace. Another solution to that is to use a diff viewing tool that makes it clear which changes are whitespace-only. For a patch this big, I'm going to go for including whitespace changes and producing consistently indented final output for files where half or more of the lines are already changing. I get that one shouldn't change whitespace or formatting in a small patch, but this is an overhaul of a couple of classes, and the final result should be self-consistent.

{quote}
The fields position and max in ByteArrayByteSource are used in attach(). After that they merely track the corresponding fields in ba, with no apparent reason. Since a single ByteSource gets attached only once, and that immediately after construction, we can get rid of attach() and move the logic to the constructors. This way, (1) we can avoid passing the unused bufferSize parameter to ByteArrayByteSource, and (2) get rid of the fields position and max in ByteArrayByteSource.
{quote}

Sounds like a good simplification. position and max were there for the detach() use case; the buffer accessor might be enough. I'll look more deeply at this.
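The agreed close() behavior above can be sketched in a few lines. This ByteArrayByteSource is a simplified stand-in for the patch's class (assuming a byte[]-backed source with pos/limit bookkeeping), just to show the "set eof and push pos to limit so every later read fails" idea.

```java
// Hypothetical simplified sketch of a byte[]-backed source whose close()
// marks eof and exhausts the position, so all subsequent reads fail.
class ByteArrayByteSource {
    private final byte[] data;
    private int pos;
    private final int limit;
    private boolean eof;

    ByteArrayByteSource(byte[] data) {
        this.data = data;
        this.pos = 0;
        this.limit = data.length;
    }

    int read() {
        if (eof || pos >= limit) {
            throw new IllegalStateException("read past end of closed or exhausted source");
        }
        return data[pos++] & 0xff;
    }

    void close() {
        eof = true;
        pos = limit; // equivalent of ba.setPos(ba.limit()): nothing left to read
    }
}
```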
> Binary Decoder Performance and flexibility overhaul
> ---------------------------------------------------
>
>                 Key: AVRO-392
>                 URL: https://issues.apache.org/jira/browse/AVRO-392
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>            Priority: Critical
>             Fix For: 1.3.0
>
>         Attachments: AVRO-392-preview.patch, AVRO-392-with_DirectBinaryDecoder.patch, AVRO-392.patch, AVRO-392.patch, AVRO-392.patch, AVRO-392.patch
>
>
> BinaryDecoder has room for significant performance improvement. [AVRO-327|https://issues.apache.org/jira/browse/AVRO-327] has some preliminary work here, but in order to satisfy some use cases there is much more work to do.
> I am opening a new ticket because the scope of the changes needed to do this the right way is larger.
> I have done a large bulk of a new implementation that abstracts a 'ByteSource' from the BinaryDecoder. Currently BinaryDecoder is tightly coupled to InputStream. The ByteSource can wrap an InputStream, FileChannel, or byte[] in this version, but could be extended to support other channel types, sockets, etc. This abstraction allows the BinaryDecoder to buffer data from various sources while supporting interleaved access to the underlying data and greater flexibility going forward.
> The performance of this abstraction has been heavily tuned so that maximum performance can be achieved even for slower ByteSource implementations.
> For readers that must interleave reads on a stream with the decoder, this includes a
> {code}
> public InputStream inputStream();
> {code}
> method on the decoder that can serve interleaved reads.
> Additionally, it will be necessary to have a constructor on BinaryDecoder that allows two BinaryDecoders to share a stream (and buffer).
> Performance results on this new version are better than previous prototypes:
> *current trunk BinaryDecoder*
> {noformat}
> ReadInt: 983 ms, 30.497877855999185 million entries/sec
> ReadLongSmall: 1058 ms, 28.336666040111496 million entries/sec
> ReadLong: 1518 ms, 19.75179889508437 million entries/sec
> ReadFloat: 657 ms, 45.61031157924184 million entries/sec
> ReadDouble: 761 ms, 39.387756709704355 million entries/sec
> ReadBoolean: 331 ms, 90.4268145647456 million entries/sec
> RepeaterTest: 7718 ms, 3.886725782038378 million entries/sec
> NestedRecordTest: 1884 ms, 15.91964611687992 million entries/sec
> ResolverTest: 8296 ms, 3.616055866616717 million entries/sec
> MigrationTest: 21216 ms, 1.4139999570144013 million entries/sec
> {noformat}
> *buffering BinaryDecoder*
> {noformat}
> ReadInt: 187 ms, 160.22131904871262 million entries/sec
> ReadLongSmall: 372 ms, 80.4863521975457 million entries/sec
> ReadLong: 613 ms, 48.882385721129246 million entries/sec
> ReadFloat: 253 ms, 118.16606270679061 million entries/sec
> ReadDouble: 275 ms, 108.94314257389068 million entries/sec
> ReadBoolean: 222 ms, 134.85327963176064 million entries/sec
> RepeaterTest: 3335 ms, 8.993007936329503 million entries/sec
> NestedRecordTest: 1152 ms, 26.0256943004597 million entries/sec
> ResolverTest: 4213 ms, 7.120659335077578 million entries/sec
> MigrationTest: 15310 ms, 1.9594884898992941 million entries/sec
> {noformat}
> Performance is 2x to 5x the throughput of trunk on most tests.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.