Bug#719858: codesearch: Indexer only accepts only valid UTF-8
Package: codesearch Version: 0.0~hg20120502-2 Severity: normal Tags: patch After noticing that cindex silently skipped some files in my codebase, I found that this happened to files that contained non-ASCII characters that were encoded as Latin-1. Here's a quick reproducer: $ pwd /home/bengen/tmp/cs-test $ ls -l total 12 -rw-r--r--. 1 bengen bengen 8 Aug 16 09:33 ascii.txt -rw-r--r--. 1 bengen bengen 8 Aug 16 09:33 latin1.txt -rw-r--r--. 1 bengen bengen 11 Aug 16 09:33 utf-8.txt $ rm ~/.csearchindex $ hd ascii.txt 61 6f 75 20 66 6f 6f 0a |aou foo.| 0008 $ hd latin1.txt e4 f6 fc 20 66 6f 6f 0a |... foo.| 0008 $ hd utf-8.txt c3 a4 c3 b6 c3 bc 20 66 6f 6f 0a |.. foo.| 000b $ cindex `pwd` 2013/08/16 09:37:33 index /home/bengen/tmp/cs-test 2013/08/16 09:37:33 flush index 2013/08/16 09:37:33 merge 0 files + mem 2013/08/16 09:37:33 19 data bytes, 371 index bytes 2013/08/16 09:37:33 done $ csearch foo /home/bengen/tmp/cs-test/ascii.txt:aou foo /home/bengen/tmp/cs-test/utf-8.txt:äöü foo There's a check for invalid UTF-8 sequences in the indexWriter.Add() function to weed out binary data. If a file contains an invalid UTF-8 sequence, it is skipped entirely. This behavior is surprising. It only makes sense in a controlled environment where you can guarantee that everything you want to index is valid UTF-8 (or even ASCII). This was certainly not a valid assumption in the case where I wanted to index a rather large internal collection of source code that contained the occasional German message or comment. The first patch simply removes this check. It has worked well for me for several months (before the codesearch package appeared in Debian). I haven't really checked index sizes, but I can't imagine this patch doing much harm because there are other checks that are there to keep the index from being filled up with arbitrary trigrams. Please consider also adding the second patch to enable logging of files that are not added to the index for whatever reason. Cheers, -Hilko diff --git a/index/write.go b/index/write.go index c48e981..adcc547 100644 --- a/index/write.go +++ b/index/write.go @@ -149,12 +149,6 @@ func (ix *IndexWriter) Add(name string, f io.Reader) { if n++; n = 3 { ix.trigram.Add(tv) } - if !validUTF8((tv8)0xFF, tv0xFF) { - if ix.LogSkip { -log.Printf(%s: invalid UTF-8, ignoring\n, name) - } - return - } if n maxFileLen { if ix.LogSkip { log.Printf(%s: too long, ignoring\n, name) diff --git a/cmd/cindex/cindex.go b/cmd/cindex/cindex.go index 040101e..edb4772 100644 --- a/cmd/cindex/cindex.go +++ b/cmd/cindex/cindex.go @@ -123,6 +123,7 @@ func main() { ix := index.Create(file) ix.Verbose = *verboseFlag + ix.LogSkip = *verboseFlag ix.AddPaths(args) for _, arg := range args { log.Printf(index %s, arg)
Bug#719858: codesearch: Indexer only accepts only valid UTF-8
Hi Hilko, Hilko Bengen ben...@debian.org writes: The first patch simply removes this check. It has worked well for me for several months (before the codesearch package appeared in Debian). I The issue I have with that is that it will break the assumption that everything which is in the index is findable. Given that codesearch is used with UTF-8 search terms, you would not be able to find “Grüße” when searching for it. Please consider also adding the second patch to enable logging of files that are not added to the index for whatever reason. I agree that the second patch makes more sense, even though I am not entirely sure whether a separate flag would be more appropriate (nitpick). Can you please send the second patch upstream? -- Best regards, Michael -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#719858: codesearch: Indexer only accepts only valid UTF-8
Hi Michael, Hilko Bengen ben...@debian.org writes: The first patch simply removes this check. It has worked well for me for several months (before the codesearch package appeared in Debian). I The issue I have with that is that it will break the assumption that everything which is in the index is findable. Given that codesearch is used with UTF-8 search terms, you would not be able to find “Grüße” when searching for it. Right. To be honest, it never occured to me to search for non-ASCII content in source code. :-) Codesearch has been very useful in answering questions such as Are we sure that none of our code uses function X any more ... can we just get rid of X in the next release?. So to me, having an index that contains every source text file been more important than worrying about not being able to find Latin1-encoded strings. BTW, I just tried passing 'äöü' as a Latin1-encoded string (bytes e4 f6 fc) to csearch. This led to regexp/syntax failing with an invalid UTF-8 error, so this does not work, even if the character encoding of the search term matches that of the index. A proper solution would probably involve guessing the character set of a text file and convert it if necessary before indexing. Meh. How are you dealing with this in codesearch.debian.net? Please consider also adding the second patch to enable logging of files that are not added to the index for whatever reason. I agree that the second patch makes more sense, even though I am not entirely sure whether a separate flag would be more appropriate (nitpick). Can you please send the second patch upstream? Will do. Cheers, -Hilko -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#719858: codesearch: Indexer only accepts only valid UTF-8
Hi Hilko, Hilko Bengen ben...@debian.org writes: BTW, I just tried passing 'äöü' as a Latin1-encoded string (bytes e4 f6 fc) to csearch. This led to regexp/syntax failing with an invalid UTF-8 error, so this does not work, even if the character encoding of the search term matches that of the index. Yep, that is what I suspected. Only UTF-8 is supported. A proper solution would probably involve guessing the character set of a text file and convert it if necessary before indexing. Meh. How are you dealing with this in codesearch.debian.net? I just assume everything is UTF-8. If it is not, and actually contains non-ASCII characters, it needs to be converted to UTF-8. I mean, common. This is 2013. Just convert it the files already! :-) -- Best regards, Michael -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org