On 9/15/22 07:30, Yi-yo Chiang via Toybox wrote:
> grep is slow when the number of patterns is large.
...
> xxd -p -c 0 -l 40 /dev/urandom
Huh, WHY does that produce two lines of output?

$ xxd -p -c 0 -l 40 /dev/urandom
1fcf13e1b4844ba209fb9958bde26a13577c577744f1b1290240d03f4f8e
644fd0687c39b1aa8a68

Behavior is consistent between toybox xxd and debian's, but Elliott sent in
the xxd implementation and I don't use it much, so... Feeding in -c 40 and
-c 80 makes no difference? Huh. Oh well, tangent. I'll assume the test you
sent me is the test you meant to send me.

> A simple optimization would be to concat all patterns into one huge regex
> pattern with '|'.

Which you could do in your pattern creation just as easily as grep could do
so internally? Except when I tried it:

$ time toybox grep -Ef <(tr '\n' '|' < test_large_10000 | head -c -1) \
    test_large_10000

The result was even _slower_ on glibc (and gave me "out of memory" on musl).

> I've hacked something together locally to do this, and the
> improvement is evident (runtime reduced to ~2s).

That was my first implementation, but it had portability issues. Off the
top of my head: musl's non-extended regex syntax didn't support |, the \1
backreferences don't reset at | either, plus I think there were
circumstances in which ^ and $ could get confused? Plus the empty regex
line match thing... So I replaced it in commit ca3528d7bf97. (Alas, libc
hasn't got a lex primitive the way it's got regexec.)

> But this also had some potential problems:
> 1. Does regexec always return the left-most longest match?

Posix says it does:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#:~:text=longest%20of%20the

> 2. One large pattern might be problematic. `man 7 regex` threatens that
> patterns larger than 256 bytes might not be supported by all regcomp
> implementations.

Yeah, musl fails.

I don't actually know _what_ is slowing this down, but I'd guess "lack of
cache locality" would be a significant part? Potentially a loop over the
string with each pattern having "start of line" prepended to it might get
some of the speed back?
(Or again, it might be slower.) I could also sort the patterns by first
character and loop through calling the appropriate regexes based on the
character at the current string position... which is harder than you think
because there could be a | later in the pattern, so that optimization only
applies to a _subset_ of patterns. (A first character of . or * or ( kinda
throws that off too, and that optimization would also have to be aware of
case insensitivity...)

There are things that can be done to speed this up, but we'd have to try
and see what actually helped...

> Thoughts on how risky this change could be? I could start sending you the
> patches if you think the pros outweigh the potential cons?

Gnu grep has its own bespoke regex implementation internally, because gnu.
While I _have_ historically written my own regex implementation from
scratch (using glob syntax rather than regex, but same general idea), I
chose not to do so here because project philosophy. This has downsides.

There are things that can be done to speed it up, but I'd prefer a more
real-world test case to optimize against if we go down that path? (This is
why I usually wait until somebody complains before optimizing or
elaborating too much: they bring test cases. :)

(Also, gnu's bespoke regex is working on blocks of data that haven't been
delineated into lines, again for speed reasons. That's less of an issue
here, but I did read a paper on it when I was writing my own grep...)

> Yi-yo

Rob
_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net