Package: codesearch
Version: 0.0~hg20120502-2
Severity: normal
Tags: patch

After noticing that cindex silently skipped some files in my codebase, I found that this happened to files that contained non-ASCII characters encoded as Latin-1.

Here's a quick reproducer:

$ pwd
/home/bengen/tmp/cs-test
$ ls -l
total 12
-rw-r--r--. 1 bengen bengen  8 Aug 16 09:33 ascii.txt
-rw-r--r--. 1 bengen bengen  8 Aug 16 09:33 latin1.txt
-rw-r--r--. 1 bengen bengen 11 Aug 16 09:33 utf-8.txt
$ rm ~/.csearchindex
$ hd ascii.txt
00000000  61 6f 75 20 66 6f 6f 0a                           |aou foo.|
00000008
$ hd latin1.txt
00000000  e4 f6 fc 20 66 6f 6f 0a                           |... foo.|
00000008
$ hd utf-8.txt
00000000  c3 a4 c3 b6 c3 bc 20 66  6f 6f 0a                 |...... foo.|
0000000b
$ cindex `pwd`
2013/08/16 09:37:33 index /home/bengen/tmp/cs-test
2013/08/16 09:37:33 flush index
2013/08/16 09:37:33 merge 0 files + mem
2013/08/16 09:37:33 19 data bytes, 371 index bytes
2013/08/16 09:37:33 done
$ csearch foo
/home/bengen/tmp/cs-test/ascii.txt:aou foo
/home/bengen/tmp/cs-test/utf-8.txt:äöü foo

latin1.txt also contains "foo", but it does not show up in the results. The "19 data bytes" reported by cindex match the 8 + 11 bytes of the other two files.

There's a check for invalid UTF-8 sequences in the IndexWriter.Add() function to weed out binary data. If a file contains an invalid UTF-8 sequence, it is skipped entirely.

This behavior is surprising. It only makes sense in a controlled environment where you can guarantee that everything you want to index is valid UTF-8 (or even ASCII). This was certainly not a valid assumption in my case: I wanted to index a rather large internal collection of source code that contained the occasional German message or comment.

The first patch simply removes this check. It has worked well for me for several months (since before the codesearch package appeared in Debian). I haven't really checked index sizes, but I can't imagine this patch doing much harm, because other checks remain in place to keep the index from being filled up with arbitrary trigrams.

Please consider also adding the second patch, which enables logging of files that are not added to the index for whatever reason.

Cheers,
-Hilko

diff --git a/index/write.go b/index/write.go
index c48e981..adcc547 100644
--- a/index/write.go
+++ b/index/write.go
@@ -149,12 +149,6 @@ func (ix *IndexWriter) Add(name string, f io.Reader) {
 		if n++; n >= 3 {
 			ix.trigram.Add(tv)
 		}
-		if !validUTF8((tv>>8)&0xFF, tv&0xFF) {
-			if ix.LogSkip {
-				log.Printf("%s: invalid UTF-8, ignoring\n", name)
-			}
-			return
-		}
 		if n > maxFileLen {
 			if ix.LogSkip {
 				log.Printf("%s: too long, ignoring\n", name)

diff --git a/cmd/cindex/cindex.go b/cmd/cindex/cindex.go
index 040101e..edb4772 100644
--- a/cmd/cindex/cindex.go
+++ b/cmd/cindex/cindex.go
@@ -123,6 +123,7 @@ func main() {
 
 	ix := index.Create(file)
 	ix.Verbose = *verboseFlag
+	ix.LogSkip = *verboseFlag
 	ix.AddPaths(args)
 	for _, arg := range args {
 		log.Printf("index %s", arg)