Bug#719858: codesearch: Indexer only accepts only valid UTF-8

Hilko Bengen Fri, 16 Aug 2013 03:54:37 -0700

Hi Michael,

> Hilko Bengen <ben...@debian.org> writes:
>> The first patch simply removes this check. It has worked well for me for
>> several months (before the codesearch package appeared in Debian). I
> The issue I have with that is that it will break the assumption that
> everything which is in the index is findable. Given that codesearch is
> used with UTF-8 search terms, you would not be able to find “Grüße” when
> searching for it.


Right. To be honest, it never occurred to me to search for non-ASCII
content in source code. :-)

Codesearch has been very useful in answering questions such as "Are we
sure that none of our code uses function X any more ... can we just get
rid of X in the next release?". So to me, having an index that contains
every source text file has been more important than worrying about not
being able to find Latin1-encoded strings.

BTW, I just tried passing 'äöü' as a Latin1-encoded string (bytes e4 f6
fc) to csearch. This led to regexp/syntax failing with an "invalid
UTF-8" error, so this does not work, even if the character encoding of
the search term matches that of the index.
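
To illustrate why that search term is rejected, here is a minimal Go
sketch using the standard unicode/utf8 package. It only demonstrates
that the bytes e4 f6 fc are not well-formed UTF-8; it is not the code
path csearch itself takes, and the helper name isValidUTF8 is made up
for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidUTF8 is a hypothetical helper that reports whether b is
// well-formed UTF-8.
func isValidUTF8(b []byte) bool {
	return utf8.Valid(b)
}

func main() {
	// 'äöü' encoded as Latin1: one byte per character.
	latin1Term := []byte{0xe4, 0xf6, 0xfc}
	// The same characters encoded as UTF-8: two bytes per character.
	utf8Term := []byte{0xc3, 0xa4, 0xc3, 0xb6, 0xc3, 0xbc}

	// 0xe4 announces a three-byte sequence, but 0xf6 is not a
	// continuation byte, so the Latin1 form fails validation.
	fmt.Println("Latin1 bytes valid UTF-8:", isValidUTF8(latin1Term))
	fmt.Println("UTF-8 bytes valid UTF-8: ", isValidUTF8(utf8Term))
}
```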

A "proper" solution would probably involve guessing the character set of
a text file and converting it to UTF-8 if necessary before indexing. Meh.
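
A rough sketch of what such a conversion step could look like in Go is
shown here. It does no real character set guessing: it simply assumes
that anything which is not valid UTF-8 is Latin1, which is a strong
assumption and will be wrong for other encodings. The function name
toUTF8 is hypothetical and not part of codesearch.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toUTF8 returns data unchanged if it is already valid UTF-8.
// Otherwise it ASSUMES the input is Latin1 (ISO-8859-1), where each
// byte value equals the Unicode code point of the character, and
// re-encodes it as UTF-8.
func toUTF8(data []byte) []byte {
	if utf8.Valid(data) {
		return data
	}
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	// Converting a []rune to string encodes each rune as UTF-8.
	return []byte(string(runes))
}

func main() {
	latin1 := []byte{0xe4, 0xf6, 0xfc} // 'äöü' in Latin1
	converted := toUTF8(latin1)
	fmt.Printf("% x\n", converted)
	fmt.Println(utf8.Valid(converted))
}
```

An indexer could run file contents through a step like this before the
UTF-8 check, so that Latin1 files are indexed in a form that UTF-8
search terms can match.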

How are you dealing with this in codesearch.debian.net?

>> Please consider also adding the second patch to enable logging of files
>> that are not added to the index for whatever reason.
> I agree that the second patch makes more sense, even though I am not
> entirely sure whether a separate flag would be more appropriate
> (nitpick).
>
> Can you please send the second patch upstream?

Will do.

Cheers,
-Hilko

