Hi Joe,

> So I am definitely not an expert in these matters. But my
> understanding is that Mono internally uses UTF-16 as its Unicode
> representation.

Well, yeah, kind of. I'm no expert with C#, but it seems to mean "here's a 16-bit type, have fun". I'm hesitant to call that "internal". :-) I think it's only slightly more true than saying "C uses UTF-8 internally" (here's an 8-bit type, have fun).

> the wire, in displaying the results, etc. Also, I have no idea how
> Python handles Unicode data (the last time I used it heavily -- in
> 2004 or so -- it didn't handle it very well).

That's about the time I started using Python, and it works great for me. It has a 'unicode' (string only -- no chars in Python) type, which can hold any Unicode string; you don't deal with encodings until you want to do I/O. So I'm confident that I'll be able to get Unicode data from Beagle into Python, without too much trouble.

> If you search using the command-line program beagle-query, do you find
> the files?

I get the same result as using "Desktop Search"/beagle-search: works for the Latin and Georgian, but no hits for the Linear B.

> As far as Beagle is concerned, by itself it doesn't deal with
> character encodings at all. As far as underlying libs: GTK requires
> UTF-8; underneath it GLib deals with different Unicode versions.

Since C# doesn't really provide a "unicode character" type (only a 16-bit type for stuffing with UTF-16), a program that wants to fully support Unicode might need to deal a little bit with one encoding (UTF-16) itself. But I'm new to Mono, and I'm not sure my previous sentence is true. :-)

> It's definitely possible that Lucene doesn't have any special handling
> of these characters. You might want to try running
> beagle-extract-content on the file to see if the data is extracted
> reasonably.

This extracts (as "Content:") the entire text of the file, in all 3 languages -- great! So I would say the plain text filter (at least) passes my characters correctly.

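To make the 16-bit point from earlier concrete, here is a minimal Python sketch. It assumes Python 3, where 'str' plays the role of the 'unicode' type I mentioned, and 'utf16_code_units' is just a helper name I made up for illustration (it is not part of Beagle, Mono, or Lucene). It only shows how many UTF-16 code units a character needs; it says nothing about what Beagle actually does with them.

```python
def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


georgian = "\u10d0"       # GEORGIAN LETTER AN, inside the BMP
linear_b = "\U00010000"   # LINEAR B SYLLABLE B008 A, outside the BMP

# One code unit for the BMP character.
print([hex(u) for u in utf16_code_units(georgian)])

# Two code units (a surrogate pair) for the non-BMP character.
print([hex(u) for u in utf16_code_units(linear_b)])
```

So code that treats each 16-bit unit as a whole character would see two "characters" for every Linear B sign.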
Also, in the "Desktop Search"/beagle-search window, at the bottom it shows a preview of the text from the file; here it shows non-BMP characters as "  " (2 spaces), just as I saw from a Python program. I've never done any debugging of Beagle itself, but when I get home tonight I'll try to narrow down how far my characters are getting before getting converted to spaces.

- Ken
_______________________________________________
Dashboard-hackers mailing list
Dashboard-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/dashboard-hackers