On Nov 8 11:30, Corinna Vinschen wrote: > On Nov 7 15:26, aputerguy wrote: > > > > Changing LC_ALL also solved the problem for me. > > But it begs the question of how many other basic and take-for-granted > > functions might be affected by this apparent UTF-8 slowdown. And again we, > > are not talking about some minor overhead, we are talking about a slowdown > > of 1500X or 150,000%!!!! > > Yeah, that's really still strange to me. In my testing, the multibyte > to widechar conversion performed by grep in case of UTF-8 took only > 1.5 up to 4 seconds for 10 times the number of input lines as in your > case. It still puzzles me where the time is wasted in grep.
Got it. The problem is this. Grep reads the file in chunks > pagesize. Pagesize is 64K on Cygwin. For each buffer read into memory, the grepbuf() function calls the execute() function as long as it returns a match. The execute() function calls check_multibyte_string() for the entire buffer(!), then calls kwsexec() to find a match. If a match has been found, it free's the memory allocated by check_multibyte_string() and returns to grepbuf. Then grepbuf() will call execute again with the pointers into the buffer moved to the next line. Let's make an example. Assume the buffer is 100K, which is not unusual when running under Cygwin. Assume further that the file consists of 100,000 lines with the text "The quick brown fox jumped over the lazy dog". Each line is 45 bytes, so the buffer contains somwhat more than 2200 lines. Now let's search for the expression "dog". The first call to execute will call check_multibyte_string() for the entire buffer of 100000 bytes. Then it finds a match in the first line, free's the check_multibyte_string() memory and returns to grepbuf. grepbuf calls execute with the start pointer moved to the second line in the buffer, so execute() calls check_multibyte_string() for the remainder of the buffer, which is 99955 bytes. It find a match in the first line, free's the check_multibyte_string buffer, returns to grepbuf, which calls execute, which calls check_multibyte_string() with a buffer of 99910 bytes, and so on... Every invocation of check_multibyte_string() calls mbrtowc() in a loop for the entire buffer given as argument. In our example, that means that mbrtowc() is called (hold your breath) 111,161,115 times for each 100K of input file! No wonder grep takes 3 or 4 minutes to grep this very example on Cygwin. I really think there's some room for optimization left in this algorithm. Btw., the check for mmap in grep's configure file is broken. It tries to mmap to a fixed address formerly allocated via malloc(). This doesn't work on Windows. An autoconf run with a newer version of autoconf would be nice. Corinna -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple