Re: Indexing Text Files and Text Encoding

David Narvaez Tue, 20 Jan 2015 10:19:51 -0800

On Tue, Jan 20, 2015 at 12:10 PM, Vishesh Handa <m...@vhanda.in> wrote:
> Hey guys
>
> We have a plain text indexing plugin in KFileMetaData. It gives the plain 
> text of any file whose mimetype begins with 'text/'. We used to use 
> QString::fromUtf8 to convert this into a string. However, this may not be 
> ideal as a different encoding can exist.
>
> I've just written a patch to use the system codec and to abort if the 
> conversion fails. Does anyone have any opinions on this? I'm slightly 
> conflicted.
>
> Reasons for doing this: If we cannot correctly convert it to text, we're just 
> indexing garbage. This often happens with a binary file getting detected as 
> text [1].
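
To make sure I understand the patch, here is a rough sketch of "decode with the system codec, give up on failure". It is written in Python only for brevity and is not the KFileMetaData code; the function name extract_text and the encoding override parameter are made up for illustration.

```python
import locale
from typing import Optional


def extract_text(raw: bytes, encoding: Optional[str] = None) -> Optional[str]:
    """Decode raw bytes strictly, returning None when the conversion fails.

    Hypothetical helper: `encoding` defaults to the locale's preferred
    codec, standing in for "the system codec" from the patch description.
    """
    codec = encoding or locale.getpreferredencoding(False)
    try:
        # errors="strict" raises instead of silently substituting
        # replacement characters, so garbage input is rejected.
        return raw.decode(codec, errors="strict")
    except (UnicodeDecodeError, LookupError):
        # Invalid byte sequence or unknown codec name: skip indexing.
        return None
```

One caveat with this approach: if the system codec is a single-byte one such as ISO-8859-1, every byte sequence decodes successfully, so the strict check will not catch a binary file there.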


What about guessing the encoding from some heuristic[0]?
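
Something along these lines is what I have in mind, as a very crude sketch. It is much simpler than the statistical detection ICU does: it only looks for a byte-order mark, treats NUL bytes as a sign of binary data, and then tries a short list of candidate encodings in order. The name guess_encoding and the default fallback list are assumptions for illustration, not anything from ICU or KFileMetaData.

```python
import codecs
from typing import Optional, Sequence

# UTF-32 marks are checked before UTF-16 because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(
    raw: bytes, fallbacks: Sequence[str] = ("utf-8", "cp1252")
) -> Optional[str]:
    """Return a plausible codec name for raw, or None if it looks binary."""
    # 1. A byte-order mark is the strongest hint available.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. Without a BOM, NUL bytes almost always mean binary data.
    if b"\x00" in raw:
        return None
    # 3. Try each candidate strictly; the first one that decodes wins.
    for name in fallbacks:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

The order of the fallback list matters: UTF-8 is strict enough that text in another encoding rarely passes by accident, so it goes first, and the more permissive single-byte codec goes last.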

David E. Narvaez

[0] http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html

