> With a quick try I got a feeling that my worry about short repeats was > wrong. It doesn't matter because decoding each LZMA symbol is much more > expensive. What matters is avoiding multiple tiny arraycopy calls > within a single run of the repeat method, and that problem was already > solved.
My observation is basically the same. Individual tiny array copy calls do not really matter, but tiny array copy calls in a tight loop is quite bad. My /guess/ is that the optimizer is not able to unroll these loops effectively, possibly because of the arraycopy intrinsic. > > > > I came up with the following. I haven't decided yet if I like it. > > > > On the 3 files I have been testing with, this change is a mixed bag. > > Compared to trunk 1 regresses by ~8%. While the other 2 do improve, > > neither are better than my last patch. > > OK, thanks. So it isn't great. I wonder which details make the > difference. I think some of the problem is too many branches, making prediction/speculative execution less useful. Another issue is the absence of the single byte optimization, which I will address more below. > One thing that confuses me in your version is the special handling of > the first byte: > > buf[pos++] = buf[back++]; > --left; > > If there are two bytes to copy, then one will be copied above and the > other with arraycopy later. If there are more bytes to copy and > distance is very small, incrementing "back" above can mean that an > extra arraycopy call might be needed in the loop because the first copy > will be one byte smaller. > > I understand that it might help when there is just one byte to repeat > because then the while-loop will be skipped. In all other situations it > sounds like that the special handling of the first byte would in theory > be harmful. Note that I don't doubt your test results; I already saw > with the CRC64 code that some changes in the code can affect > performance in weird ways. The image1.dcm is the most impacted by this optimization. Again, this file is basically a large greyscale bmp. This results in a significant number of single byte repeats. Optimizing for the single byte improves performance in that file by 3-5%, while having smaller effects on the other 2 files (ihe_ovly_pr.dcm slightly slower, large.xml slightly faster) > Your code needs > > if (back == bufSize) > back = 0; > > in the beginning of the while-loop and later checking for tmp > 0. My > version avoids these branches by handling those cases under "if (back < > 0)" (which is equivalent to "if (dist >= pos)"). On the other hand, > under "if (back < 0)" all copies, including tiny copies, are done with > arraycopy. > > Another tiny difference is that your code uses left shift to double the > copy size in the loop while I used Math.min(pos - back, left). > > > I was able to improve this a bit by pulling the handling of small > > copies outside of the while loop. This eliminates the regressions > > compared to trunk, but still does not feel like an improvement over my > > last patch. > > Yeah, the switch isn't worth it. If I understand it correctly now, > trying to avoid arraycopy for the tiny copies wasn't a useful idea in > the first place. So the code can be simplified ("version 3"): > > int back = pos - dist - 1; > if (back < 0) { > // The distance wraps around to the end of the cyclic dictionary > // buffer. We cannot get here if the dictionary isn't full. > assert full == bufSize; > back += bufSize; > > // Here we will never copy more than dist + 1 bytes and > // so the copying won't repeat from its own output. > // Thus, we can always use arraycopy safely. > int copySize = Math.min(bufSize - back, left); > assert copySize <= dist + 1; > > System.arraycopy(buf, back, buf, pos, copySize); > pos += copySize; > back = 0; > left -= copySize; > > if (left == 0) > return; > } > > assert back < pos; > assert left > 0; > > do { > // Determine the number of bytes to copy on this loop iteration: > // copySize is set so that the source and destination ranges > // don't overlap. If "left" is large enough, the destination > // range will start right after the last byte of the source > // range. This way we don't need to advance "back" which > // allows the next iteration of this loop to copy (up to) > // twice the number of bytes. > int copySize = Math.min(left, pos - back); > System.arraycopy(buf, back, buf, pos, copySize); > pos += copySize; > left -= copySize; > } while (left > 0); > > I know I may sound stubborn for not accepting your version as is but > the special handling of the first byte and the readability of the > while-loop (how easy it is to understand on the first reading) make me > hesitate. I agree your approach is more readable. From your version of it, I was expecting that simplicity in reading to translate into better performance. This latest version actually does appear to do that. The image1.dcm performance matches my version and the other 2 are a bit faster. Adding the single byte optimization still speeds up image1.dcm (~8ms, ~2%) and large.xml (~3ms, 2%), while slowing ihe_ovly_pr.dcm (~.008ms, ~1%). > For example, for dist = 0, len = 23 and assuming that copying > doesn't wrap: > > 1. One byte is copied before the while-loop. > 2. The inner do-while-loop copies 1, 2, 4, and 8 bytes. > 3. The outer while-loop starts from the beginning and > the latter arraycopy is used to copy the remaining 7 bytes. > > In contrast my new simplified version above has just one loop in it > where it copies 1, 2, 4, 8, and 8 bytes. I think getting rid of the inner loop entirely is the key to the improvement in this version. This seems quite obvious a change in hindsight. > Just to play around, here's yet another method which has only one > arraycopy. The size difference is very minimal though after you ignore > the comments and assertions. For readability I prefer "version 3" over > the version below ("version 4"). > > int back = pos - dist - 1; > int backNext = back; > > do { > int copySize; > if (back < 0) { > back += bufSize; > copySize = bufSize - back; > backNext = 0; > } else { > copySize = pos - back; > } > > if (copySize > left) > copySize = left; > > System.arraycopy(buf, back, buf, pos, copySize); > pos += copySize; > left -= copySize; > back = backNext; > } while (left > 0); > > It would be nice if you could compare the versions again. Thanks! Version 3 is better for all 3 files.