Robert Muir wrote:
Paul, thanks for the examples. In my opinion, only one of these is a
tokenizer problem :)
None of these will be affected by a Unicode upgrade.
Things like:
http://bugs.musicbrainz.org/ticket/1006
Another approach is to use IBM's ICU library for this case, as the
built-in Katakana-Hiragana transform works well.
You don't need to write the rules, as it's built in, but if you are
curious they are defined here:
http://unicode.org/repos/cldr/trunk/common/transforms/Hiragana-Katakana.xml?rev=1.7&content-type=text/vnd.viewcvs-markup
If the CharFilter / static mappings I described do not meet your
requirements, and you want a filter that does this via the rules
above, I can give you some code.
I think we would like to implement the complete Unicode rules, so if you
could provide us with some code, that would be great.
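A minimal sketch of the ICU-based approach (not Robert's actual code, and
assuming the ICU4J jar, com.ibm.icu, is on the classpath) would be something
like this:

    import com.ibm.icu.text.Transliterator;

    public class KanaFoldExample {
        public static void main(String[] args) {
            // ICU ships a built-in Hiragana-Katakana transform, so the CLDR
            // rules linked above don't need to be written by hand.
            Transliterator toKatakana = Transliterator.getInstance("Hiragana-Katakana");
            System.out.println(toKatakana.transliterate("ひらがな"));  // -> ヒラガナ
        }
    }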
http://bugs.musicbrainz.org/ticket/5311
In this case, it appears you want to do fullwidth-halfwidth conversion
(hard to tell from the ticket, but it claims that solves the issue).
You could use a similar CharFilter approach to the one I described above.
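A rough sketch of that approach, assuming ICU4J's built-in
"Halfwidth-Fullwidth" transform is acceptable instead of a hand-written
mapping, could be:

    import com.ibm.icu.text.Transliterator;

    public class WidthFoldExample {
        public static void main(String[] args) {
            // Fold halfwidth katakana and halfwidth ASCII variants to fullwidth,
            // so indexed and queried text agree on one form.
            Transliterator toFullwidth = Transliterator.getInstance("Halfwidth-Fullwidth");
            System.out.println(toFullwidth.transliterate("ﾃｽﾄ abc"));
        }
    }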
If there is a mapping from halfwidth to fullwidth, that would work:
everything would be converted to fullwidth for indexing and searching. But
having read the details, it seems that to convert a halfwidth character you
would have to know you were looking at Chinese (or Korean, Japanese, etc.),
and as the MusicBrainz system supports any language and the user doesn't
specify the language being used when searching, I cannot safely
convert these characters because they may just be Latin, etc. However,
when the entity is added to the database the language is specified, so I
could do a conversion like this to ensure all Chinese albums were always
indexed as fullwidth, and then educate users to use fullwidth characters.
Alternatively, you could write Java code. This kind of mapping is done
within the CJKTokenizer in Lucene's contrib, and you could steal some
code from there.
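For illustration, the kind of mapping being referred to amounts to shifting
the fullwidth ASCII block down onto Basic Latin; a hand-rolled sketch (not
the actual contrib code) might be:

    public class FullwidthFold {
        // Fullwidth forms U+FF01..U+FF5E sit exactly 0xFEE0 above their
        // Basic Latin counterparts, e.g. 'Ａ' (U+FF21) -> 'A' (U+0041).
        public static char fold(char c) {
            if (c >= '\uFF01' && c <= '\uFF5E') {
                return (char) (c - 0xFEE0);
            }
            return c;
        }
    }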
Not really going to work for me because I need to handle all scripts; if I
add extra Chinese handling to the tokenizer I expect I'll break handling for
other languages.
But a different way to look at this is that it's just one example of
Unicode normalization (compatibility decomposition),
so you could, say, implement a TokenFilter that normalizes your text to
NFKC and solve this problem, as well as a bunch of other issues in a
bunch of other languages.
If you want code to do this, there are several open JIRA tickets in
Lucene with different implementations.
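A minimal sketch of such a filter, assuming a Lucene 2.9/3.0-style attribute
API and JDK 6's java.text.Normalizer (the JIRA implementations mentioned
above will differ in detail):

    import java.io.IOException;
    import java.text.Normalizer;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public final class NFKCFilter extends TokenFilter {
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);

        public NFKCFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Apply compatibility decomposition + canonical composition to each
            // term, e.g. fullwidth 'Ａ' becomes 'A', '①' becomes '1'.
            String term = termAtt.term();
            String normalized = Normalizer.normalize(term, Normalizer.Form.NFKC);
            if (!normalized.equals(term)) {
                termAtt.setTermBuffer(normalized);
            }
            return true;
        }
    }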
I assume once again you have to know the script being used in order to
do this.
http://bugs.musicbrainz.org/ticket/4827
This is a tokenization issue. It's also not standard Unicode usage (really,
geresh/gershayim etc. should be used).
In the Unicode standard (UAX #29, text segmentation), this issue is
specifically mentioned:
For Hebrew, a tailoring may include a double quotation mark between
letters, because legacy data may contain that in place of U+05F4 (״)
gershayim. This can be done by adding double quotation mark to
MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
in a tailoring.
So the easiest way for you to get this would be to modify the JFlex rules
so that these characters behave differently, perhaps only when
surrounded by Hebrew context.
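If editing the JFlex grammar is too invasive, a cruder pre-tokenization
sketch (only one possible reading of the suggestion, treating any
Hebrew-block character as context) would be to map the ASCII quote to
gershayim before the tokenizer sees the text:

    public class HebrewQuoteMapper {
        // Replace '"' with U+05F4 (HEBREW PUNCTUATION GERSHAYIM) when it sits
        // between two characters from the Hebrew block, so a word-break
        // implementation that treats U+05F4 as MidLetter keeps the token together.
        public static String mapGershayim(String text) {
            StringBuilder sb = new StringBuilder(text);
            for (int i = 1; i < sb.length() - 1; i++) {
                if (sb.charAt(i) == '"'
                        && Character.UnicodeBlock.of(sb.charAt(i - 1)) == Character.UnicodeBlock.HEBREW
                        && Character.UnicodeBlock.of(sb.charAt(i + 1)) == Character.UnicodeBlock.HEBREW) {
                    sb.setCharAt(i, '\u05F4');
                }
            }
            return sb.toString();
        }
    }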
I think there are two issues. Firstly, the data needs to be indexed to
always use gershayim - is this what you are suggesting? I couldn't follow
how to change the JFlex rules.
Secondly, it's an issue for the query parser that the user uses a " for
searching but doesn't escape it, and I cannot automatically escape it
because it may not be Hebrew.
Paul