Re: [xz-devel] Re: java LZDecoder small improvement

Brett Okken Mon, 01 Mar 2021 09:05:51 -0800

> With a quick try I got a feeling that my worry about short repeats was
> wrong. It doesn't matter because decoding each LZMA symbol is much more
> expensive. What matters is avoiding multiple tiny arraycopy calls
> within a single run of the repeat method, and that problem was already
> solved.


My observation is basically the same. Individual tiny array copy calls
do not really matter, but tiny array copy calls in a tight loop is
quite bad.
My /guess/ is that the optimizer is not able to unroll these loops
effectively, possibly because of the arraycopy intrinsic.

>
> > > I came up with the following. I haven't decided yet if I like it.
> >
> > On the 3 files I have been testing with, this change is a mixed bag.
> > Compared to trunk 1 regresses by ~8%. While the other 2 do improve,
> > neither are better than my last patch.
>
> OK, thanks. So it isn't great. I wonder which details make the
> difference.

I think some of the problem is too many branches, making
prediction/speculative execution less useful.
Another issue is the absence of the single byte optimization, which I
will address more below.

> One thing that confuses me in your version is the special handling of
> the first byte:
>
>         buf[pos++] = buf[back++];
>         --left;
>
> If there are two bytes to copy, then one will be copied above and the
> other with arraycopy later. If there are more bytes to copy and
> distance is very small, incrementing "back" above can mean that an
> extra arraycopy call might be needed in the loop because the first copy
> will be one byte smaller.
>
> I understand that it might help when there is just one byte to repeat
> because then the while-loop will be skipped. In all other situations it
> sounds like that the special handling of the first byte would in theory
> be harmful. Note that I don't doubt your test results; I already saw
> with the CRC64 code that some changes in the code can affect
> performance in weird ways.

The image1.dcm is the most impacted by this optimization. Again, this
file is basically a large greyscale bmp. This results in a significant
number of single byte repeats. Optimizing for the single byte improves
performance in that file by 3-5%, while having smaller effects on the
other 2 files (ihe_ovly_pr.dcm slightly slower, large.xml slightly
faster)

> Your code needs
>
>             if (back == bufSize)
>                 back = 0;
>
> in the beginning of the while-loop and later checking for tmp > 0. My
> version avoids these branches by handling those cases under "if (back <
> 0)" (which is equivalent to "if (dist >= pos)"). On the other hand,
> under "if (back < 0)" all copies, including tiny copies, are done with
> arraycopy.
>
> Another tiny difference is that your code uses left shift to double the
> copy size in the loop while I used Math.min(pos - back, left).
>
> > I was able to improve this a bit by pulling the handling of small
> > copies outside of the while loop. This eliminates the regressions
> > compared to trunk, but still does not feel like an improvement over my
> > last patch.
>
> Yeah, the switch isn't worth it. If I understand it correctly now,
> trying to avoid arraycopy for the tiny copies wasn't a useful idea in
> the first place. So the code can be simplified ("version 3"):
>
>         int back = pos - dist - 1;
>         if (back < 0) {
>             // The distance wraps around to the end of the cyclic dictionary
>             // buffer. We cannot get here if the dictionary isn't full.
>             assert full == bufSize;
>             back += bufSize;
>
>             // Here we will never copy more than dist + 1 bytes and
>             // so the copying won't repeat from its own output.
>             // Thus, we can always use arraycopy safely.
>             int copySize = Math.min(bufSize - back, left);
>             assert copySize <= dist + 1;
>
>             System.arraycopy(buf, back, buf, pos, copySize);
>             pos += copySize;
>             back = 0;
>             left -= copySize;
>
>             if (left == 0)
>                 return;
>         }
>
>         assert back < pos;
>         assert left > 0;
>
>         do {
>             // Determine the number of bytes to copy on this loop iteration:
>             // copySize is set so that the source and destination ranges
>             // don't overlap. If "left" is large enough, the destination
>             // range will start right after the last byte of the source
>             // range. This way we don't need to advance "back" which
>             // allows the next iteration of this loop to copy (up to)
>             // twice the number of bytes.
>             int copySize = Math.min(left, pos - back);
>             System.arraycopy(buf, back, buf, pos, copySize);
>             pos += copySize;
>             left -= copySize;
>         } while (left > 0);
>
> I know I may sound stubborn for not accepting your version as is but
> the special handling of the first byte and the readability of the
> while-loop (how easy it is to understand on the first reading) make me
> hesitate.

I agree your approach is more readable. From your version of it, I was
expecting that simplicity in reading to translate into better
performance.
This latest version actually does appear to do that. The image1.dcm
performance matches my version and the other 2 are a bit faster.
Adding the single byte optimization still speeds up image1.dcm (~8ms,
~2%) and large.xml (~3ms, 2%), while slowing ihe_ovly_pr.dcm (~.008ms,
~1%).

> For example, for dist = 0, len = 23 and assuming that copying
> doesn't wrap:
>
>   1. One byte is copied before the while-loop.
>   2. The inner do-while-loop copies 1, 2, 4, and 8 bytes.
>   3. The outer while-loop starts from the beginning and
>      the latter arraycopy is used to copy the remaining 7 bytes.
>
> In contrast my new simplified version above has just one loop in it
> where it copies 1, 2, 4, 8, and 8 bytes.

I think getting rid of the inner loop entirely is the key to the
improvement in this version. This seems quite obvious a change in
hindsight.

> Just to play around, here's yet another method which has only one
> arraycopy. The size difference is very minimal though after you ignore
> the comments and assertions. For readability I prefer "version 3" over
> the version below ("version 4").
>
>         int back = pos - dist - 1;
>         int backNext = back;
>
>         do {
>             int copySize;
>             if (back < 0) {
>                 back += bufSize;
>                 copySize = bufSize - back;
>                 backNext = 0;
>             } else {
>                 copySize = pos - back;
>             }
>
>             if (copySize > left)
>                 copySize = left;
>
>             System.arraycopy(buf, back, buf, pos, copySize);
>             pos += copySize;
>             left -= copySize;
>             back = backNext;
>         } while (left > 0);
>
> It would be nice if you could compare the versions again. Thanks!

Version 3 is better for all 3 files.

Re: [xz-devel] Re: java LZDecoder small improvement

Reply via email to