Hi Joe,

> So I am definitely not an expert in these matters.  But my
> understanding is that Mono internally uses UTF-16 as its Unicode
> representation.

Well, yeah, kind of.  I'm no expert with C#, but "uses UTF-16" seems
to mean "here's a 16-bit type, have fun".  I'm hesitant to call that
"internal".  :-)  I think it's only slightly more true than saying "C
uses UTF-8 internally" (here's an 8-bit type, have fun).

> the wire, in displaying the results, etc.  Also, I have no idea how
> Python handles Unicode data (the last time I used it heavily -- in
> 2004 or so -- it didn't handle it very well).

That's about the time I started using Python, and it works great for
me.  It has a 'unicode' (string only -- no chars in Python) type,
which can hold any Unicode string; you don't deal with encodings until
you want to do I/O.  So I'm confident that I'll be able to get Unicode
data from Beagle into Python, without too much trouble.
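
A minimal sketch of what I mean by only dealing with encodings at the
I/O boundary.  This assumes a current Python 3 (3.3 or later), where
the 'unicode' type is simply called str; the sample bytes and the
choice of UTF-8 are made up for illustration, not taken from Beagle.

```python
# Bytes as they might arrive from a file or a pipe (UTF-8 by assumption).
# They spell: GEORGIAN LETTER AN, a space, LINEAR B SYLLABLE B008 A.
raw = b"\xe1\x83\x90 \xf0\x90\x80\x80"

# Decode once at the input boundary; after this it is just text.
text = raw.decode("utf-8")

# The string is indexed by code point, so the non-BMP character is one item.
length = len(text)
code_points = [hex(ord(ch)) for ch in text]

# Encode again only when writing the text back out.
round_trip = text.encode("utf-8")
```

Between the decode and the encode, nothing in the program needs to
know which encoding the data arrived in.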

> If you search using the command-line program beagle-query, do you find
> the files?

I get the same result as using "Desktop Search"/beagle-search: works
for the Latin and Georgian, but no hits for the Linear B.

> As far as Beagle is concerned, by itself it doesn't deal with
> character encodings at all.  As far as underlying libs: GTK requires
> UTF-8; underneath it GLib deals with different Unicode versions.

Since C# doesn't really provide a "unicode character" type (only a
16-bit type for stuffing with UTF-16), a program that wants to fully
support Unicode might need to deal a little bit with one encoding
(UTF-16) itself.  But I'm new to Mono, and I'm not sure my previous
sentence is true.  :-)
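
To make the UTF-16 point concrete, here is a small Python sketch (not
C#, and the helper names are my own) showing that a BMP character fits
in one 16-bit unit while a Linear B character needs two, a surrogate
pair.  A program working with a bare 16-bit type has to keep those two
units together itself.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text's UTF-16 encoding."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def is_surrogate(unit):
    """True if a 16-bit unit is half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF

georgian = "\u10d0"      # GEORGIAN LETTER AN, inside the BMP
linear_b = "\U00010000"  # LINEAR B SYLLABLE B008 A, outside the BMP

bmp_units = utf16_units(georgian)      # one unit
non_bmp_units = utf16_units(linear_b)  # two units, both surrogates
```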

> It's definitely possible that Lucene doesn't have any special handling
> of these characters.  You might want to try running
> beagle-extract-content on the file to see if the data is extracted
> reasonably.

This extracts (as "Content:") the entire text of the file, in all 3
languages -- great!  So I would say the plain text filter (at least)
passes my characters correctly.

Also, in the "Desktop Search"/beagle-search window, at the bottom it
shows a preview of the text from the file; here it shows non-BMP
characters as "  " (2 spaces), just as I saw from a Python program.

I've never done any debugging of Beagle itself, but when I get home
tonight I'll try to narrow down how far my characters are getting
before getting converted to spaces.


- Ken
_______________________________________________
Dashboard-hackers mailing list
Dashboard-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/dashboard-hackers
