RE: grep is horribly slow in UTF-8 locales

2004-06-09 Thread Markus Kuhn
Forwarded from someone who has encountered problems posting on
linux-utf8 (who maintains the list server?):

--- Forwarded Message
I wasn't able to post to linux-utf8. Did you guys
receive my message? If I implemented a fix would
you be interested? I haven't yet...

Please let me know if you are interested in getting
this situation fixed.

Brad 

From: Chen, Brad [EMAIL PROTECTED]
Sent: Saturday, June 05, 2004 4:04 PM
To: '[EMAIL PROTECTED]'
Subject: RE: grep is horribly slow in UTF-8 locales

From the proposed patch:

-  if (MB_CUR_MAX > 1 && mb_properties[beg - buf] == 0)
-    continue;
+  if (MB_CUR_MAX > 1)
+    {
+      memset(&cur_state, 0, sizeof(mbstate_t));
+      if (mbrlen(beg + offset, buf + size - beg, &cur_state) < 0)
+        continue; /* It is a part of multibyte character.  */
+    }

This code does not appear to be functionally equivalent to what
was there before. In the old version, mb_properties[i] would be
0 only if the byte in question was part of a multi-byte character
and not the first byte. For the cases where the new code reaches
continue, the original code would have had mb_properties[i] == 1
and would not have behaved the same way.

Am I misreading this code?

Another thing you might want to tidy up here: mbrlen returns
a size_t, which is unsigned, so a < 0 comparison will get you
into trouble on some systems.
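
A minimal sketch of the point about the return type, assuming only the
ISO C behaviour of mbrlen: failures are reported as (size_t) -1 (invalid
sequence) and (size_t) -2 (incomplete character), so they have to be
tested by equality. The helper names mbrlen_failed and starts_valid_char
are hypothetical and do not come from grep or from the patch.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of grep: classify the result of mbrlen()
   without ever comparing a size_t against a negative number. */
static int mbrlen_failed(size_t n)
{
    /* (size_t) -1: invalid sequence; (size_t) -2: incomplete character. */
    return n == (size_t) -1 || n == (size_t) -2;
}

/* Hypothetical helper: returns 1 if the bytes at s (at most len of them)
   begin with a character that converts cleanly from the initial shift
   state in the current locale, 0 otherwise. */
static int starts_valid_char(const char *s, size_t len)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    return !mbrlen_failed(mbrlen(s, len, &st));
}
```

Because size_t is unsigned, an expression such as mbrlen(...) < 0 can
never be true, so the continue in the patch would never be taken through
that test.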

I confess I haven't tested either version of the code for
correctness yet. I was just looking at performance. If you have a
favorite correctness test case, please send it.

Best Wishes,
Brad Chen
Intel Corporation
SSG/Performance Tools Lab

--- End of Forwarded Message


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: grep is horribly slow in UTF-8 locales

2003-11-16 Thread Bruno Haible
Markus Kuhn wrote:
>   b) relying entirely on ISO C's generic multi-byte functions, to make
>      sure that even stateful monsters like the ISO 2022 encodings
>      are supported equally.

Use of mbrlen is not done because of ISO 2022 encodings (which are not
usable as locale encodings!), but because of the non-UTF-8 multibyte
encodings: EUC-JP, Big5, GB18030 etc.

Bruno




Re: grep is horribly slow in UTF-8 locales

2003-11-10 Thread jmaiorana


> I recall that we had about two years ago heated discussions here on
> whether UTF-8 support should be implemented by
>  a) hardwired mechanisms fully optimized to make good use of UTF-8's
>     neat properties
>  b) relying entirely on ISO C's generic multi-byte functions, to make
>     sure that even stateful monsters like the ISO 2022 encodings
>     are supported equally.
> Unfortunately, it seems that grep has become an excellent teaching
> example of how option b) can backfire with a ridiculous performance loss
> in a basic text-processing tool.
 

It's not uncommon for code to be written in assembly to gain performance
on certain platforms, usually for much smaller gains than a factor of
100. Since UTF-8 is set to become the most common encoding, writing
special-case code for UTF-8 is even better than that, because all
platforms benefit from it equally.

UTF-8 is specifically designed to be as efficient as possible; sticking
to the C library's multibyte API does it a disservice. (I am biased
though, because I hardcode everything I write to it and specifically
avoid generic multibyte support.)





Re: grep is horribly slow in UTF-8 locales

2003-11-09 Thread Markus Kuhn
Mika Fischer wrote on 2003-11-08 21:47 UTC:
> So it seems the slowdown occurs in the function mbrlen from libc.
>
> The real problem is of course that this function is called once for
> every character of the input because grep makes a map of the input
> file containing the number of bytes of each character.
>
> Obviously this is quite time consuming :)
[...]
> At least for UTF-8 it's easy to skip over any additional bytes a
> character might have, so that might be a workable solution.

I recall that we had about two years ago heated discussions here on
whether UTF-8 support should be implemented by

  a) hardwired mechanisms fully optimized to make good use of UTF-8's
 neat properties

  b) relying entirely on ISO C's generic multi-byte functions, to make
 sure that even stateful monsters like the ISO 2022 encodings
 are supported equally.

Unfortunately, it seems that grep has become an excellent teaching
example of how option b) can backfire with a ridiculous performance loss
in a basic text-processing tool.

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__








Re: grep is horribly slow in UTF-8 locales

2003-11-08 Thread Glenn Maynard
On Fri, Nov 07, 2003 at 12:52:44PM +, Markus Kuhn wrote:
> $ grep --version
> grep (GNU grep) 2.5.1
> $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
> Command exited with non-zero status 1
> 6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (157major+34minor)pagefaults 0swaps
> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (125major+24minor)pagefaults 0swaps

FYI:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=206470
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=181378

I've noticed this, too.  I often use LANG=C for grepping due to this.

Someone mentioned --with-included-regex, but that's not good enough
(a 10% gain for me).

-- 
Glenn Maynard



Re: grep is horribly slow in UTF-8 locales

2003-11-08 Thread Danilo Segan
Markus Kuhn [EMAIL PROTECTED] writes:

> $ grep --version
> grep (GNU grep) 2.5.1

This doesn't happen with:

$ grep --version
grep (GNU grep) 2.4.2
$ LC_ALL=POSIX time grep XYZ test.txt 
Command exited with non-zero status 1
0.03user 0.07system 0:00.36elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (118major+25minor)pagefaults 0swaps
$ LC_ALL=sr_CS.UTF-8 time grep XYZ test.txt 
Command exited with non-zero status 1
0.06user 0.05system 0:00.10elapsed 105%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (143major+50minor)pagefaults 0swaps
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt 
Command exited with non-zero status 1
0.06user 0.04system 0:00.15elapsed 64%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (128major+48minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt 
Command exited with non-zero status 1
0.04user 0.06system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (118major+25minor)pagefaults 0swaps

The last example shows that CPU usage is not really any kind of rule to
base conclusions on (sr_CS.UTF-8 is my everyday locale, and I would
really notice if grep had any problems with it).

test.txt was produced with:
 for i in 1 2 3 4 5 6 7 8 9 0; do cat UnicodeData.txt >> test.txt; done

I can get a newer grep today, if you think I may experience different
results with it.

Cheers,
Danilo



Re: grep is horribly slow in UTF-8 locales

2003-11-08 Thread Markus Kuhn
Rob Park wrote on 2003-11-08 00:49 UTC:
> grep is slower on my system, but it doesn't appear to be as bad as on
> your system.

Your results show that grep in UTF-8 mode is equally 100x slower than in
single-byte mode, just like on my system (300 MHz P3). You just have
used a faster CPU.

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__




Re: grep is horribly slow in UTF-8 locales

2003-11-08 Thread Rob Park
Markus Kuhn wrote:
> Your results show that grep in UTF-8 mode is equally 100x slower than in
> single-byte mode, just like on my system (300 MHz P3). You just have
> used a faster CPU.

D'oh :)



Re: grep is horribly slow in UTF-8 locales

2003-11-08 Thread Danilo Segan
Hi Markus,

Markus Kuhn [EMAIL PROTECTED] writes:
> Rob Park wrote on 2003-11-08 00:49 UTC:
> > grep is slower on my system, but it doesn't appear to be as bad as on
> > your system.
>
> Your results show that grep in UTF-8 mode is equally 100x slower than in
> single-byte mode, just like on my system (300 MHz P3). You just have
> used a faster CPU.

I tried this with grep 2.5 (the latest available from
ftp.gnu.org/gnu/grep/, because of the crack) and it still shows
decent results on my home Slackware GNU/Linux 8.0 (GNU libc 2.2.3 I think):

$ LC_ALL=POSIX time grep2.5 XYZ test.txt 
Command exited with non-zero status 1
0.03user 0.06system 0:00.14elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (130major+22minor)pagefaults 0swaps
$ LC_ALL=en_GB.UTF-8 time grep2.5 XYZ test.txt 
Command exited with non-zero status 1
0.05user 0.07system 0:00.12elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (140major+45minor)pagefaults 0swaps

I'm using Celeron 700MHz.

I cannot build grep from CVS right now, but I still suspect this is
not a grep problem.

Cheers,
Danilo



Re: grep is horribly slow in UTF-8 locales

2003-11-08 Thread Glenn Maynard
On Fri, Nov 07, 2003 at 04:49:58PM +0100, Danilo Segan wrote:
> This doesn't happen with:
>
> $ grep --version
> grep (GNU grep) 2.4.2

This was probably before full multibyte support was added to grep; the
issue here specifically only happens in multibyte encodings.  (My grep
is slow in en_US.UTF-8, and fast in en_US.ISO-8859-1.) Try:

# echo tést | grep 't.st'
tést
# echo tést | grep 't[aé]st'
tést

> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.04user 0.06system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (118major+25minor)pagefaults 0swaps
>
> Last example shows that CPU usage is not really any kind of rule to
> base conclusions on (sr_CS.UTF-8 is my everyday locale, and I would
> really notice if grep had any problems with it).

The field you should be reading is user.  CPU is roughly
(user+system)/elapsed, and isn't very relevant here.
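
As a worked check of that formula against the figures quoted earlier in
the thread, here is a small sketch. The function names cpu_percent and
user_slowdown are hypothetical, and the numbers in the note below are
the ones from the Red Hat 9 run.

```c
/* Hypothetical helper: the CPU share in percent, computed from the
   user, system and elapsed times that time(1) prints. */
static double cpu_percent(double user, double sys, double elapsed)
{
    return 100.0 * (user + sys) / elapsed;
}

/* Hypothetical helper: ratio of two user times, which is the figure
   that shows the actual slowdown between two runs. */
static double user_slowdown(double slow_user, double fast_user)
{
    return slow_user / fast_user;
}
```

For the UTF-8 run, cpu_percent(6.83, 0.07, 6.93) is about 99.6, which is
consistent with the 99%CPU shown, and for the POSIX run
cpu_percent(0.07, 0.09, 0.16) is 100. The two runs are indistinguishable
by that field, while user_slowdown(6.83, 0.07) is roughly 98, i.e. the
"about 100x" from the original report.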

-- 
Glenn Maynard



Re: grep is horribly slow in UTF-8 locales

2003-11-08 Thread Mika Fischer
Hi!

* Markus Kuhn [EMAIL PROTECTED] [2003-11-07 16:33]:
> It seems grep performs about 100x worse in a UTF-8 locale than in an
> ASCII locale, even where the search string contains no regex
> metacharacters.

Same here on Debian with grep 2.5.1 and libc 2.3.2.

> There is technically no reason why grep should have to be any slower in
> a UTF-8 locale than in a single-byte locale if the string does not even
> contain any regex meta characters at all. In that case, UTF-8 can be
> processed just like ASCII.
[...]
> Any suggestions? It would be nice not to be penalized like this by grep
> for using a UTF-8 locale by default.

Diagnosis:
I profiled grep and got the following:
LC_ALL=POSIX
snip
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
100.00      0.52     0.52      274     1.90     1.90  bmexec
...
snip

LC_ALL=de_DE.UTF-8
snip
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 76.80      1.39     1.39      274     0.01     0.01  check_multibyte_string
 22.65      1.80     0.41      274     0.00     0.00  bmexec
...
snip

The check_multibyte_string function:
snip
static char*
check_multibyte_string(char const *buf, size_t size)
{
  char *mb_properties = malloc(size);
  mbstate_t cur_state;
  int i;
  memset(&cur_state, 0, sizeof(mbstate_t));
  memset(mb_properties, 0, sizeof(char)*size);
  for (i = 0; i < size ;)
    {
      size_t mbclen;
      mbclen = mbrlen(buf + i, size - i, &cur_state);

      if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
        {
          /* An invalid sequence, or a truncated multibyte character.
             We treat it as a singlebyte character.  */
          mbclen = 1;
        }
      mb_properties[i] = mbclen;
      i += mbclen;
    }

  return mb_properties;
}
snip

So it seems the slowdown occurs in the function mbrlen from libc.

The real problem is of course that this function is called once for
every character of the input because grep makes a map of the input
file containing the number of bytes of each character.

Obviously this is quite time consuming :)

A special case for non-UTF8 regexps has problems with regexps that
contain . and similar things.

A more general approach would be better IMO. Perhaps it's faster to
match a regexp by skipping over any additional bytes of a MB-character
in case of a . or similar things. Then one could just take the byte
representation of the regexp and try to match it.

At least for UTF-8 it's easy to skip over any additional bytes a
character might have, so that might be a workable solution.
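
A minimal sketch of what "skipping over any additional bytes" could look
like for UTF-8 specifically. It relies only on the fact that UTF-8
continuation bytes have the bit pattern 10xxxxxx. The function names are
hypothetical, this is not how grep does it, and the input is assumed to
be valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: a UTF-8 continuation byte has the form 10xxxxxx. */
static int utf8_is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Hypothetical helper: given the index i (i < size) of the first byte of
   a character, return the index of the first byte of the next one,
   or size if there is none. */
static size_t utf8_next(const char *buf, size_t size, size_t i)
{
    i++;
    while (i < size && utf8_is_continuation((unsigned char) buf[i]))
        i++;
    return i;
}

/* Count the characters in a buffer by stepping from one character
   start to the next, with no call into libc. */
static size_t utf8_count(const char *buf, size_t size)
{
    size_t n = 0;
    for (size_t i = 0; i < size; i = utf8_next(buf, size, i))
        n++;
    return n;
}
```

A matcher could use something like utf8_next when it has to consume one
character for a . in the pattern, instead of building a per-byte map of
the whole buffer with mbrlen up front.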

Cheers,
 Mika



Re: grep is horribly slow in UTF-8 locales

2003-11-08 Thread Danilo Segan
[EMAIL PROTECTED] (Danilo Segan) writes:
> $ LC_ALL=en_GB.UTF-8 time grep2.5 XYZ test.txt
> Command exited with non-zero status 1
> 0.05user 0.07system 0:00.12elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (140major+45minor)pagefaults 0swaps

Whoops, the above is total crap. I didn't have the en_GB.UTF-8 locale
installed at all, so when I created it using:
# localedef -f UTF-8 -i en_GB en_GB.UTF-8

I got the following results, which match your experience:

$ LC_ALL=en_GB.UTF-8 time grep2.5 XYZ test.txt 
Command exited with non-zero status 1
2.85user 0.13system 0:03.18elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (161major+55minor)pagefaults 0swaps

Yet grep 2.4 doesn't exhibit these problems at all (guess I won't be
fast at upgrading :).

Cheers,
Danilo



grep is horribly slow in UTF-8 locales

2003-11-07 Thread Markus Kuhn
On Red Hat 9:

$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (157major+34minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+24minor)pagefaults 0swaps

where test.txt is just http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
repeated 10 times.

It seems grep performs about 100x worse in a UTF-8 locale than in an
ASCII locale, even where the search string contains no regex
metacharacters.

And fgrep is no better.

There is technically no reason why grep should have to be any slower in
a UTF-8 locale than in a single-byte locale if the string does not even
contain any regex metacharacters at all. In that case, UTF-8 can be
processed just like ASCII.
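
A minimal sketch of why that holds, assuming both the line and the
search string are valid, NUL-terminated UTF-8. In UTF-8 the first byte
of a character and its continuation bytes come from disjoint ranges, so
a plain byte search for a valid needle cannot report a match that begins
or ends in the middle of a character. The function name line_contains is
hypothetical; strstr is the ordinary ISO C byte search.

```c
#include <string.h>

/* Hypothetical helper: does the line contain the literal needle?
   Both arguments are assumed to be valid, NUL-terminated UTF-8, and the
   needle is assumed to contain no regex metacharacters. */
static int line_contains(const char *line, const char *needle)
{
    /* Byte-wise search; no locale or multibyte function is involved. */
    return strstr(line, needle) != NULL;
}
```

For example, searching "t\xc3\xa9st" (tést) for "\xc4\xa9" (a different
character that happens to share its last byte with é) finds nothing,
while searching it for "\xc3\xa9" succeeds.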

In UTF-8 mode, grep is also much slower than the equivalent Perl:

$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (339major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (322major+45minor)pagefaults 0swaps

Any suggestions? It would be nice not to be penalized like this by grep
for using a UTF-8 locale by default.

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain




Re: grep is horribly slow in UTF-8 locales

2003-11-07 Thread Rob Park
Markus Kuhn wrote:
> On Red Hat 9:
>
> $ grep --version
> grep (GNU grep) 2.5.1
> $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
> Command exited with non-zero status 1
> 6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (157major+34minor)pagefaults 0swaps
> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (125major+24minor)pagefaults 0swaps
>
> where test.txt is just http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> repeated 10 times.

Wow, I dunno what's going on here. Here are the results on my system
(also RedHat 9):

$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
1.14user 0.04system 0:01.19elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (156major+32minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.01user 0.03system 0:00.03elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+25minor)pagefaults 0swaps

> It seems grep performs about 100x worse in a UTF-8 locale than in an
> ASCII locale, even where the search string contains no regex
> metacharacters.

grep is slower on my system, but it doesn't appear to be as bad as on
your system.

> In UTF-8 mode, grep is also much slower than the equivalent Perl:
>
> $ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
> 1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (339major+45minor)pagefaults 0swaps
> $ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
> 1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (322major+45minor)pagefaults 0swaps

$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
0.30user 0.01system 0:00.33elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (341major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
0.19user 0.06system 0:00.24elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (325major+44minor)pagefaults 0swaps

> Any suggestions? It would be nice not to be penalized like this by grep
> for using a UTF-8 locale by default.

Sorry buddy, I have no idea :(
