On Thu, Feb 11, 2021 at 12:51 PM Lasse Collin <lasse.col...@tukaani.org> wrote:
>
> On 2021-02-05 Brett Okken wrote:
> > I worked this out last night. We need to double how much we copy each
> > time by not advancing "back". This actually works even better than
> > Arrays.fill for the single byte case also.
>
> This clearly is a good idea in a Java implementation. :-)
>
> I still worry about short copies. If the file is full of tiny
> matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such files
> aren't typical at all but I don't want to add a corner case where the
> performance drops too much.

Do you have examples of such files, or code on how to generate one?
The case of a 1 byte match/repeat is optimized for in my latest patch,
as providing that optimization did provide noticeable improvement in
the (real) files I have been testing with. While I did observe some 2
and 3 byte repeats, it appears to not be common enough to negatively
impact overall throughput. It would be quite helpful to have some
examples so that we can at least quantify the impact.

> I came up with the following. I haven't decided yet if I like it.

On the 3 files I have been testing with, this change is a mixed bag.
Compared to trunk 1 regresses by ~8%. While the other 2 do improve,
neither are better than my last patch.

jdk 11 64 bit
TRUNK
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.662 ±  0.012  ms/op
     image1.dcm  avgt    3  391.644 ± 13.871  ms/op
      large.xml  avgt    3  225.456 ± 16.265  ms/op

(okken last)
ihe_ovly_pr.dcm  avgt    3    0.607 ±  0.187  ms/op
     image1.dcm  avgt    3  369.419 ± 32.400  ms/op
      large.xml  avgt    3  190.580 ±  7.856  ms/op

(lasse new)
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.628 ±  0.066  ms/op
     image1.dcm  avgt    3  424.159 ± 14.823  ms/op
       large.xml  avgt    3  192.763 ±  6.831  ms/op


I was able to improve this a bit by pulling the handling of small
copies outside of the while loop. This eliminates the regressions
compared to trunk, but still does not feel like an improvement over my
last patch.

(lasse + outer switch)
         (file)  Mode  Cnt    Score    Error  Units
ihe_ovly_pr.dcm  avgt    3    0.633 ±  0.032  ms/op
     image1.dcm  avgt    3  390.868 ± 40.598  ms/op
      large.xml  avgt    3  190.030 ±  2.619  ms/op


        int back = pos - dist - 1;
        if (back < 0) {
            // We won't get here if the dictionary isn't full.
            assert full == bufSize;

            // The distance wraps around to the end of the cyclic dictionary
            // buffer. Here we will never copy more than dist + 1 bytes
            // and so the copying won't repeat from its own output. Thus,
            // we can always use arraycopy safely.
            back += bufSize;
            int copySize = Math.min(bufSize - back, left);
            assert copySize <= dist + 1;

            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            back = 0;
            left -= copySize;

            if (left == 0)
                return;
        }

        assert back < pos;
        assert left > 0;

        int copySize = Math.min(left, pos - back);

        switch(copySize) {
            case 3:
                buf[pos + 2] = buf[back + 2];
            case 2:
                buf[pos + 1] = buf[back + 1];
            case 1:
                buf[pos] = buf[back];
                break;
            default:
                System.arraycopy(buf, back, buf, pos, copySize);
        }
        pos += copySize;
        left -= copySize;

        while (left > 0) {

            copySize = Math.min(left, copySize << 1);
            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            left -= copySize;
        }

Reply via email to