Robert Muir wrote:
Paul, thanks for the examples. In my opinion, only one of these is a
tokenizer problem :)
None of these will be affected by a Unicode upgrade.

Things like:

http://bugs.musicbrainz.org/ticket/1006


Another approach is to use the IBM ICU library for this case, as the
built-in Katakana-Hiragana transform works well.
You don't need to write the rules, as they are built in, but if you are
curious they are defined here:
http://unicode.org/repos/cldr/trunk/common/transforms/Hiragana-Katakana.xml?rev=1.7&content-type=text/vnd.viewcvs-markup
If the CharFilter/static mappings I described do not meet your
requirements, and you want a filter that does this via the rules
above, I can give you some code.
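
For reference, a rough sketch of what such an ICU-based filter might look
like. It is untested, assumes ICU4J is on the classpath and a Lucene
version that has CharTermAttribute, and the class name is just illustrative:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.ibm.icu.text.Transliterator;

/**
 * Folds Hiragana to Katakana using ICU's built-in "Hiragana-Katakana"
 * transliterator, so both spellings of a word end up as the same term.
 */
public final class HiraganaKatakanaFoldingFilter extends TokenFilter {

  private final Transliterator translit =
      Transliterator.getInstance("Hiragana-Katakana");
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public HiraganaKatakanaFoldingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Rewrite the current term with its transliterated form.
    String folded = translit.transliterate(termAtt.toString());
    termAtt.setEmpty().append(folded);
    return true;
  }
}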

I think we would like to implement the complete Unicode rules, so if you could provide us with some code that would be great.
http://bugs.musicbrainz.org/ticket/5311

In this case, it appears you want to do fullwidth-halfwidth conversion
(hard to tell from the ticket, but it claims that solves the issue).

You could use a CharFilter approach similar to the one I described above for this one.
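
A minimal sketch of that CharFilter approach, assuming a Lucene version
where NormalizeCharMap has a Builder (older releases construct the map
directly). Only a few example entries are shown and the class name is
hypothetical; untested:

import java.io.Reader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

/**
 * Wraps a Reader so halfwidth katakana are folded to their fullwidth forms
 * before the tokenizer sees the text. Only a few example entries are shown.
 */
public class HalfwidthKatakanaFoldingExample {

  public static Reader wrap(Reader input) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\uFF71", "\u30A2");        // halfwidth "a" to fullwidth katakana A
    builder.add("\uFF76", "\u30AB");        // halfwidth "ka" to fullwidth katakana KA
    // Halfwidth voiced forms are two code points; the char filter can map
    // the pair onto one precomposed fullwidth character.
    builder.add("\uFF76\uFF9E", "\u30AC");  // halfwidth "ka" + voiced mark to fullwidth GA
    return new MappingCharFilter(builder.build(), input);
  }
}

The table would need to cover the whole halfwidth katakana block
(U+FF61..U+FF9F), or be reversed if you decide to normalize towards
halfwidth instead.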

If there were a mapping from halfwidth to fullwidth, that would work: everything would be converted to fullwidth for indexing and searching. But having read the details, it seems that to convert a halfwidth character you would have to know you were looking at Chinese (or Korean/Japanese etc.). Since the MusicBrainz system supports any language, and the user doesn't specify the language being used when searching, I cannot safely convert these characters because they may just be Latin etc. However, when the entity is added to the database the language is specified, so I could do a conversion like this to ensure all Chinese albums are always indexed as fullwidth, and then educate users to use fullwidth characters.
Alternatively, you could write Java code. This kind of mapping is done
within the CJKTokenizer in Lucene's contrib, and you could steal some
code from there.
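
For illustration, the per-character arithmetic for the fullwidth ASCII
variants looks roughly like this; the actual CJKTokenizer code differs in
detail, so treat this as a sketch with a made-up class name:

/**
 * Plain-Java illustration of fullwidth-ASCII folding: the fullwidth
 * variants U+FF01..U+FF5E sit at a fixed offset of 0xFEE0 above the
 * ordinary ASCII characters, so a simple subtraction folds them.
 */
public final class FullwidthAsciiFolder {

  public static String fold(String s) {
    char[] chars = s.toCharArray();
    for (int i = 0; i < chars.length; i++) {
      char c = chars[i];
      if (c >= '\uFF01' && c <= '\uFF5E') {
        chars[i] = (char) (c - 0xFEE0);   // e.g. U+FF11 ('１') becomes U+0031 ('1')
      } else if (c == '\u3000') {
        chars[i] = ' ';                   // ideographic space becomes an ASCII space
      }
    }
    return new String(chars);
  }
}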
That's not really going to work for me because I need to handle all scripts; if I add extra Chinese handling to the tokenizer, I expect I'll break the handling for other languages.
But a different way to look at this is that it is just one example of
Unicode normalization (compatibility decomposition), so you could
implement a TokenFilter that normalizes your text to NFKC and solve this
problem, as well as a bunch of other issues in a bunch of other languages.
If you want code to do this, there are several open JIRA tickets in
Lucene with different implementations.
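
A minimal sketch of such a filter, using the JDK's java.text.Normalizer
(Java 6+) instead of the ICU-based implementations in those JIRA tickets;
the class name is made up and the code is untested:

import java.io.IOException;
import java.text.Normalizer;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Normalizes every token to NFKC, which folds fullwidth/halfwidth variants
 * (and many other compatibility forms).
 */
public final class NFKCFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public NFKCFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String normalized = Normalizer.normalize(termAtt.toString(), Normalizer.Form.NFKC);
    termAtt.setEmpty().append(normalized);
    return true;
  }
}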
I assume that once again you have to know the script being used in order to do this.
http://bugs.musicbrainz.org/ticket/4827

This is a tokenization issue. It is also not standard Unicode usage (really,
geresh/gershayim etc. should be used).
In the Unicode standard (UAX #29, text segmentation), this issue is
specifically mentioned:

For Hebrew, a tailoring may include a double quotation mark between
letters, because legacy data may contain that in place of U+05F4 (״)
gershayim. This can be done by adding double quotation mark to
MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
in a tailoring.

So the easiest way for you to get this would be to modify the JFlex rules
for these characters to behave differently, perhaps only when
surrounded by Hebrew context.
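
As an alternative to tailoring the JFlex grammar, here is a regex-based
pre-processing sketch (hypothetical class name, untested) that maps an
ASCII double quote between two Hebrew letters to U+05F4 gershayim, and an
apostrophe following a Hebrew letter to U+05F3 geresh:

import java.util.regex.Pattern;

/**
 * Maps an ASCII double quote between two Hebrew letters to gershayim, and
 * an apostrophe following a Hebrew letter to geresh, so legacy data is
 * indexed consistently.
 */
public final class HebrewQuoteNormalizer {

  private static final Pattern GERSHAYIM =
      Pattern.compile("(?<=\\p{InHebrew})\"(?=\\p{InHebrew})");
  private static final Pattern GERESH =
      Pattern.compile("(?<=\\p{InHebrew})'");

  public static String normalize(String text) {
    String result = GERSHAYIM.matcher(text).replaceAll("\u05F4");
    return GERESH.matcher(result).replaceAll("\u05F3");
  }
}

This only normalizes what gets indexed; the same mapping would have to be
applied to query strings before they reach the query parser.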
I think there are two issues. Firstly, the data needs to be indexed to always use gershayim; is this what you are suggesting? I couldn't follow how to change the JFlex rules. Secondly, it is an issue for the query parser that the user uses a " for searching but doesn't escape it, and I cannot automatically escape it because it may not be Hebrew.


Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
