Package: codesearch
Version: 0.0~hg20120502-2
Severity: normal
Tags: patch

After noticing that cindex silently skipped some files in my codebase, I
found that this happened to files that contained non-ASCII characters
that were encoded as Latin-1.

Here's a quick reproducer:

$ pwd
/home/bengen/tmp/cs-test
$ ls -l
total 12
-rw-r--r--. 1 bengen bengen  8 Aug 16 09:33 ascii.txt
-rw-r--r--. 1 bengen bengen  8 Aug 16 09:33 latin1.txt
-rw-r--r--. 1 bengen bengen 11 Aug 16 09:33 utf-8.txt
$ rm ~/.csearchindex 
$ hd ascii.txt 
00000000  61 6f 75 20 66 6f 6f 0a                           |aou foo.|
00000008
$ hd latin1.txt 
00000000  e4 f6 fc 20 66 6f 6f 0a                           |... foo.|
00000008
$ hd utf-8.txt 
00000000  c3 a4 c3 b6 c3 bc 20 66  6f 6f 0a                 |...... foo.|
0000000b
$ cindex `pwd`
2013/08/16 09:37:33 index /home/bengen/tmp/cs-test
2013/08/16 09:37:33 flush index
2013/08/16 09:37:33 merge 0 files + mem
2013/08/16 09:37:33 19 data bytes, 371 index bytes
2013/08/16 09:37:33 done
$ csearch foo
/home/bengen/tmp/cs-test/ascii.txt:aou foo
/home/bengen/tmp/cs-test/utf-8.txt:äöü foo


There's a check for invalid UTF-8 sequences in the IndexWriter.Add()
function to weed out binary data. If a file contains an invalid UTF-8
sequence, it is skipped entirely, which is why latin1.txt does not show
up in the csearch output above.
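
To illustrate why only latin1.txt is affected, here is a minimal sketch
that runs the three byte sequences from the hexdumps above through Go's
standard utf8.Valid(). This is not the check codesearch itself uses (it
calls its own validUTF8() helper on adjacent byte pairs); the function
name looksLikeUTF8 is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// looksLikeUTF8 is a hypothetical helper for this illustration only.
// It uses the standard library validator as a stand-in for the
// pairwise validUTF8() check in index/write.go.
func looksLikeUTF8(data []byte) bool {
	return utf8.Valid(data)
}

func main() {
	// Byte contents copied from the hexdumps in the reproducer.
	samples := []struct {
		name string
		data []byte
	}{
		{"ascii.txt", []byte{0x61, 0x6f, 0x75, 0x20, 0x66, 0x6f, 0x6f, 0x0a}},
		{"latin1.txt", []byte{0xe4, 0xf6, 0xfc, 0x20, 0x66, 0x6f, 0x6f, 0x0a}},
		{"utf-8.txt", []byte{0xc3, 0xa4, 0xc3, 0xb6, 0xc3, 0xbc, 0x20, 0x66, 0x6f, 0x6f, 0x0a}},
	}
	for _, s := range samples {
		fmt.Printf("%-10s valid UTF-8: %v\n", s.name, looksLikeUTF8(s.data))
	}
}
```

In Latin-1, 0xe4 is "ä" on its own, but in UTF-8 it is a lead byte that
must be followed by continuation bytes in the range 0x80-0xbf. The next
byte, 0xf6, is not one, so the sequence is rejected.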

This behavior is surprising. It only makes sense in a controlled
environment where you can guarantee that everything you want to index is
valid UTF-8 (or even ASCII). This was certainly not a valid assumption
in the case where I wanted to index a rather large internal collection
of source code that contained the occasional German message or comment.

The first patch simply removes this check. It has worked well for me for
several months, since before the codesearch package appeared in Debian.
I haven't measured index sizes, but I don't expect this patch to do much
harm: other checks remain in place to keep the index from being filled
up with arbitrary trigrams.

Please also consider applying the second patch. It sets ix.LogSkip from
the verbose flag, so that cindex logs every file it does not add to the
index, along with the reason.

Cheers,
-Hilko

diff --git a/index/write.go b/index/write.go
index c48e981..adcc547 100644
--- a/index/write.go
+++ b/index/write.go
@@ -149,12 +149,6 @@ func (ix *IndexWriter) Add(name string, f io.Reader) {
 		if n++; n >= 3 {
 			ix.trigram.Add(tv)
 		}
-		if !validUTF8((tv>>8)&0xFF, tv&0xFF) {
-			if ix.LogSkip {
-				log.Printf("%s: invalid UTF-8, ignoring\n", name)
-			}
-			return
-		}
 		if n > maxFileLen {
 			if ix.LogSkip {
 				log.Printf("%s: too long, ignoring\n", name)
diff --git a/cmd/cindex/cindex.go b/cmd/cindex/cindex.go
index 040101e..edb4772 100644
--- a/cmd/cindex/cindex.go
+++ b/cmd/cindex/cindex.go
@@ -123,6 +123,7 @@ func main() {
 
 	ix := index.Create(file)
 	ix.Verbose = *verboseFlag
+	ix.LogSkip = *verboseFlag
 	ix.AddPaths(args)
 	for _, arg := range args {
 		log.Printf("index %s", arg)
