Robert Muir wrote:
Paul, thanks for the examples. In my opinion, only one of these is a
tokenizer problem :)
None of these will be affected by a Unicode upgrade.
Things like:
http://bugs.musicbrainz.org/ticket/1006
Another approach is to use IBM's ICU library for this case, as the
built-in Katakana-Hiragana transform works well.
You don't need to write the rules, as it's built in, but if you are
curious they are defined here:
http://unicode.org/repos/cldr/trunk/common/transforms/Hiragana-Katakana.xml?rev=1.7&content-type=text/vnd.viewcvs-markup
If the CharFilter / static mappings I described do not meet your
requirements, and you want a filter that does this via the rules
above, I can give you some code.
I think we would like to implement the complete Unicode rules, so if you
could provide us with some code, that would be great.
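A minimal sketch of the ICU-based approach (not Robert's actual code, and
assuming the ICU4J jar, com.ibm.icu, is on the classpath) would be something
like this:

    import com.ibm.icu.text.Transliterator;

    public class KanaFoldExample {
        public static void main(String[] args) {
            // ICU ships a built-in Hiragana-Katakana transform, so the CLDR
            // rules linked above don't need to be written by hand.
            Transliterator toKatakana = Transliterator.getInstance("Hiragana-Katakana");
            System.out.println(toKatakana.transliterate("ひらがな"));  // -> ヒラガナ
        }
    }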
http://bugs.musicbrainz.org/ticket/5311
In this case, it appears you want to do fullwidth-halfwidth conversion
(hard to tell from the ticket, but it claims that solves the issue).
You could use a similar CharFilter approach to the one I described above.
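A rough sketch of that approach, assuming ICU4J's built-in
"Halfwidth-Fullwidth" transform is acceptable instead of a hand-written
mapping, could be:

    import com.ibm.icu.text.Transliterator;

    public class WidthFoldExample {
        public static void main(String[] args) {
            // Fold halfwidth katakana and halfwidth ASCII variants to fullwidth,
            // so indexed and queried text agree on one form.
            Transliterator toFullwidth = Transliterator.getInstance("Halfwidth-Fullwidth");
            System.out.println(toFullwidth.transliterate("ﾃｽﾄ abc"));
        }
    }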
If there is a mapping from halfwidth to fullwidth, that would work:
everything would be converted to fullwidth for indexing and searching. But
having read the details, it seems that to convert a halfwidth character you
would have to know you were looking at Chinese (or Korean, Japanese, etc.),
and as the MusicBrainz system supports any language and the user doesn't
specify the language being used when searching, I cannot safely
convert these characters because they may just be Latin, etc. However,
when the entity is added to the database the language is specified, so I
could do a conversion like this to ensure all Chinese albums were always
indexed as fullwidth, and then educate users to use fullwidth characters.
Alternatively, you could write Java code. This kind of mapping is done
within the CJKTokenizer in Lucene's contrib, and you could steal some
code from there.
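For illustration, the kind of mapping being referred to amounts to shifting
the fullwidth ASCII block down onto Basic Latin; a hand-rolled sketch (not
the actual contrib code) might be:

    public class FullwidthFold {
        // Fullwidth forms U+FF01..U+FF5E sit exactly 0xFEE0 above their
        // Basic Latin counterparts, e.g. 'Ａ' (U+FF21) -> 'A' (U+0041).
        public static char fold(char c) {
            if (c >= '\uFF01' && c <= '\uFF5E') {
                return (char) (c - 0xFEE0);
            }
            return c;
        }
    }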
Not really going to work for me because I need to handle all scripts; if I
add extra Chinese handling to the tokenizer I expect I'll break handling for
other languages.
But a different way to look at this is that it's just one example of
Unicode normalization (compatibility decomposition),
so you could, say, implement a TokenFilter that normalizes your text to
NFKC and solve this problem, as well as a bunch of other issues in a
bunch of other languages.
If you want code to do this, there are several open JIRA tickets in
Lucene with different implementations.
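A minimal sketch of such a filter, assuming a Lucene 2.9/3.0-style attribute
API and JDK 6's java.text.Normalizer (the JIRA implementations mentioned
above will differ in detail):

    import java.io.IOException;
    import java.text.Normalizer;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public final class NFKCFilter extends TokenFilter {
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);

        public NFKCFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Apply compatibility decomposition + canonical composition to each
            // term, e.g. fullwidth 'Ａ' becomes 'A', '①' becomes '1'.
            String term = termAtt.term();
            String normalized = Normalizer.normalize(term, Normalizer.Form.NFKC);
            if (!normalized.equals(term)) {
                termAtt.setTermBuffer(normalized);
            }
            return true;
        }
    }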
I assume once again you have to know the script being used in order to
do this.
http://bugs.musicbrainz.org/ticket/4827
This is a tokenization issue. It's also not standard Unicode usage (really,
geresh/gershayim etc. should be used).
In the Unicode standard (UAX #29, text segmentation), this issue is
specifically mentioned:
For Hebrew, a tailoring may include a double quotation mark between
letters, because legacy data may contain that in place of U+05F4 (״)
gershayim. This can be done by adding double quotation mark to
MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
in a tailoring.
So the easiest way for you to get this would be to modify the JFlex rules
so that these characters behave differently, perhaps only when
surrounded by Hebrew context.
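If editing the JFlex grammar is too invasive, a cruder pre-tokenization
sketch (only one possible reading of the suggestion, treating any
Hebrew-block character as context) would be to map the ASCII quote to
gershayim before the tokenizer sees the text:

    public class HebrewQuoteMapper {
        // Replace '"' with U+05F4 (HEBREW PUNCTUATION GERSHAYIM) when it sits
        // between two characters from the Hebrew block, so a word-break
        // implementation that treats U+05F4 as MidLetter keeps the token together.
        public static String mapGershayim(String text) {
            StringBuilder sb = new StringBuilder(text);
            for (int i = 1; i < sb.length() - 1; i++) {
                if (sb.charAt(i) == '"'
                        && Character.UnicodeBlock.of(sb.charAt(i - 1)) == Character.UnicodeBlock.HEBREW
                        && Character.UnicodeBlock.of(sb.charAt(i + 1)) == Character.UnicodeBlock.HEBREW) {
                    sb.setCharAt(i, '\u05F4');
                }
            }
            return sb.toString();
        }
    }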
I think there are two issues. Firstly, the data needs to be indexed to
always use gershayim - is this what you are suggesting? I couldn't follow
how to change the JFlex rules.
Secondly, it's an issue for the query parser that the user uses a " for
searching but doesn't escape it, and I cannot automatically escape it
because it may not be Hebrew.
Paul