Hi Michael,

> Hilko Bengen <ben...@debian.org> writes:
>> The first patch simply removes this check. It has worked well for me for
>> several months (before the codesearch package appeared in Debian). I
> The issue I have with that is that it will break the assumption that
> everything which is in the index is findable. Given that codesearch is
> used with UTF-8 search terms, you would not be able to find “Grüße” when
> searching for it.

Right. To be honest, it never occurred to me to search for non-ASCII
content in source code. :-)

Codesearch has been very useful in answering questions such as "Are we
sure that none of our code uses function X any more ... can we just get
rid of X in the next release?". So to me, having an index that contains
every source text file has been more important than worrying about not
being able to find Latin1-encoded strings.

BTW, I just tried passing 'äöü' as a Latin1-encoded string (bytes e4 f6
fc) to csearch. This led to regexp/syntax failing with an "invalid
UTF-8" error, so this does not work, even if the character encoding of
the search term matches that of the index.

A "proper" solution would probably involve guessing the character set of
a text file and converting it if necessary before indexing. Meh.

How are you dealing with this in codesearch.debian.net?

>> Please consider also adding the second patch to enable logging of files
>> that are not added to the index for whatever reason.
> I agree that the second patch makes more sense, even though I am not
> entirely sure whether a separate flag would be more appropriate
> (nitpick).
>
> Can you please send the second patch upstream?

Will do.

Cheers,
-Hilko