Bug#719858: codesearch: Indexer only accepts only valid UTF-8

2013-08-16 Thread Hilko Bengen
Package: codesearch
Version: 0.0~hg20120502-2
Severity: normal
Tags: patch

After noticing that cindex silently skipped some files in my codebase, I
found that this happened to files that contained non-ASCII characters
that were encoded as Latin-1.

Here's a quick reproducer:

$ pwd
/home/bengen/tmp/cs-test
$ ls -l
total 12
-rw-r--r--. 1 bengen bengen  8 Aug 16 09:33 ascii.txt
-rw-r--r--. 1 bengen bengen  8 Aug 16 09:33 latin1.txt
-rw-r--r--. 1 bengen bengen 11 Aug 16 09:33 utf-8.txt
$ rm ~/.csearchindex 
$ hd ascii.txt 
  61 6f 75 20 66 6f 6f 0a   |aou foo.|
0008
$ hd latin1.txt 
  e4 f6 fc 20 66 6f 6f 0a   |... foo.|
0008
$ hd utf-8.txt 
  c3 a4 c3 b6 c3 bc 20 66  6f 6f 0a |.. foo.|
000b
$ cindex `pwd`
2013/08/16 09:37:33 index /home/bengen/tmp/cs-test
2013/08/16 09:37:33 flush index
2013/08/16 09:37:33 merge 0 files + mem
2013/08/16 09:37:33 19 data bytes, 371 index bytes
2013/08/16 09:37:33 done
$ csearch foo
/home/bengen/tmp/cs-test/ascii.txt:aou foo
/home/bengen/tmp/cs-test/utf-8.txt:äöü foo


There's a check for invalid UTF-8 sequences in the indexWriter.Add()
function to weed out binary data. If a file contains an invalid UTF-8
sequence, it is skipped entirely.

This behavior is surprising. It only makes sense in a controlled
environment where you can guarantee that everything you want to index is
valid UTF-8 (or even ASCII). This was certainly not a valid assumption
in the case where I wanted to index a rather large internal collection
of source code that contained the occasional German message or comment.

The first patch simply removes this check. It has worked well for me for
several months (before the codesearch package appeared in Debian). I
haven't really checked index sizes, but I can't imagine this patch doing
much harm because there are other checks that are there to keep the
index from being filled up with arbitrary trigrams.

Please consider also adding the second patch to enable logging of files
that are not added to the index for whatever reason.

Cheers,
-Hilko

diff --git a/index/write.go b/index/write.go
index c48e981..adcc547 100644
--- a/index/write.go
+++ b/index/write.go
@@ -149,12 +149,6 @@ func (ix *IndexWriter) Add(name string, f io.Reader) {
 		if n++; n = 3 {
 			ix.trigram.Add(tv)
 		}
-		if !validUTF8((tv8)0xFF, tv0xFF) {
-			if ix.LogSkip {
-log.Printf(%s: invalid UTF-8, ignoring\n, name)
-			}
-			return
-		}
 		if n  maxFileLen {
 			if ix.LogSkip {
 log.Printf(%s: too long, ignoring\n, name)
diff --git a/cmd/cindex/cindex.go b/cmd/cindex/cindex.go
index 040101e..edb4772 100644
--- a/cmd/cindex/cindex.go
+++ b/cmd/cindex/cindex.go
@@ -123,6 +123,7 @@ func main() {
 
 	ix := index.Create(file)
 	ix.Verbose = *verboseFlag
+	ix.LogSkip = *verboseFlag
 	ix.AddPaths(args)
 	for _, arg := range args {
 		log.Printf(index %s, arg)


Bug#719858: codesearch: Indexer only accepts only valid UTF-8

2013-08-16 Thread Michael Stapelberg
Hi Hilko,

Hilko Bengen ben...@debian.org writes:
 The first patch simply removes this check. It has worked well for me for
 several months (before the codesearch package appeared in Debian). I
The issue I have with that is that it will break the assumption that
everything which is in the index is findable. Given that codesearch is
used with UTF-8 search terms, you would not be able to find “Grüße” when
searching for it.

 Please consider also adding the second patch to enable logging of files
 that are not added to the index for whatever reason.
I agree that the second patch makes more sense, even though I am not
entirely sure whether a separate flag would be more appropriate
(nitpick).

Can you please send the second patch upstream?

-- 
Best regards,
Michael


--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#719858: codesearch: Indexer only accepts only valid UTF-8

2013-08-16 Thread Hilko Bengen
Hi Michael,

 Hilko Bengen ben...@debian.org writes:
 The first patch simply removes this check. It has worked well for me for
 several months (before the codesearch package appeared in Debian). I
 The issue I have with that is that it will break the assumption that
 everything which is in the index is findable. Given that codesearch is
 used with UTF-8 search terms, you would not be able to find “Grüße” when
 searching for it.

Right. To be honest, it never occured to me to search for non-ASCII
content in source code. :-)

Codesearch has been very useful in answering questions such as Are we
sure that none of our code uses function X any more ... can we just get
rid of X in the next release?. So to me, having an index that contains
every source text file been more important than worrying about not being
able to find Latin1-encoded strings.

BTW, I just tried passing 'äöü' as a Latin1-encoded string (bytes e4 f6
fc) to csearch. This led to regexp/syntax failing with an invalid
UTF-8 error, so this does not work, even if the character encoding of
the search term matches that of the index.

A proper solution would probably involve guessing the character set of
a text file and convert it if necessary before indexing. Meh.

How are you dealing with this in codesearch.debian.net?

 Please consider also adding the second patch to enable logging of files
 that are not added to the index for whatever reason.
 I agree that the second patch makes more sense, even though I am not
 entirely sure whether a separate flag would be more appropriate
 (nitpick).

 Can you please send the second patch upstream?

Will do.

Cheers,
-Hilko


--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#719858: codesearch: Indexer only accepts only valid UTF-8

2013-08-16 Thread Michael Stapelberg
Hi Hilko,

Hilko Bengen ben...@debian.org writes:
 BTW, I just tried passing 'äöü' as a Latin1-encoded string (bytes e4 f6
 fc) to csearch. This led to regexp/syntax failing with an invalid
 UTF-8 error, so this does not work, even if the character encoding of
 the search term matches that of the index.
Yep, that is what I suspected. Only UTF-8 is supported.

 A proper solution would probably involve guessing the character set of
 a text file and convert it if necessary before indexing. Meh.
 How are you dealing with this in codesearch.debian.net?
I just assume everything is UTF-8. If it is not, and actually contains
non-ASCII characters, it needs to be converted to UTF-8. I mean,
common. This is 2013. Just convert it the files already! :-)

-- 
Best regards,
Michael


--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org