[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877846#action_12877846 ]

Robert Muir commented on LUCENE-2167:
-------------------------------------

bq. NewStandardTokenizer is not quite finished; I plan on stealing Robert's Southeast Asian (Lao, Myanmar, Khmer) syllabification routine

Curious, what is your plan here? Do you plan to somehow "jflex-#include" these 
into the grammar so that they are longest-matched instead of being caught by 
the Complex_Context rule?

How would you handle the cases where the grammar cannot do forward-only 
deterministic matching? (At least I don't see how it could, but maybe.) E.g. 
the Lao cases where some backtracking is needed, and the combining-class 
reordering needed for real-world text?

Curious what you would plan to index for Thai: words? A grammar for TCCs 
(Thai character clusters)?

Also, some of these syllable techniques are probably not very good for search 
without doing a "shingle" later. In some cases they may perform OK, the way 
single ideographs or Tibetan syllables do with the grammar you have. For 
others (Khmer, etc.) I think the shingling is likely mandatory, since those 
syllables are really only a bit better than indexing grapheme clusters.
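
Roughly the shape of what I mean, as an untested sketch (the SyllableBigrams 
class here is made up, and ShingleFilter constructor details vary a bit by 
version):

{code:java}
// Untested sketch: wrap a syllable-producing stream in a ShingleFilter so
// that adjacent syllables also get indexed as bigrams. SyllableBigrams is a
// hypothetical helper, not anything in the codebase.
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public final class SyllableBigrams {
  public static TokenStream wrap(TokenStream syllables) {
    ShingleFilter bigrams = new ShingleFilter(syllables, 2); // maxShingleSize = 2
    bigrams.setOutputUnigrams(true); // emit the single syllables too
    return bigrams;
  }
}
{code}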

As far as needing punctuation for shingling goes, a similar problem already 
exists: after tokenizing, some information (the punctuation) has already been 
discarded and it's too late to do a nice shingle. Practical 
cheating/workarounds exist for CJK (you could look at the offsets or 
something and cheat, to figure out that the tokens were adjacent), but for 
something like Tibetan the type of punctuation itself is important: the tsheg 
is an unambiguous syllable separator but an ambiguous word separator, while 
the shad or whitespace is unambiguously both.
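
The offset cheat would look something like this (untested sketch; 
AdjacencyFilter is a made-up name, nothing like it exists today):

{code:java}
// Untested illustration of the offset "cheat": decide that two tokens were
// adjacent in the original text by comparing their offsets. Note this can't
// tell a tsheg from a shad, which is exactly the limitation described above.
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class AdjacencyFilter extends TokenFilter {
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int lastEnd = -1;
  private boolean adjacentToPrevious; // true if no gap before this token

  public AdjacencyFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Adjacent means the previous token ended exactly where this one starts,
    // but we still don't know what character (if any) separated them.
    adjacentToPrevious = offsetAtt.startOffset() == lastEnd;
    lastEnd = offsetAtt.endOffset();
    return true;
  }
}
{code}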

Here is the paper I brought up at ehatcher's house recently when we were 
discussing Tibetan, which recommends this syllable-bigram technique, where 
the shingling depends on the original punctuation: 
http://terpconnect.umd.edu/~oard/pdf/iral00b.pdf

One alternative for the short term would be to make a TokenFilter that hooks 
into the ICUTokenizer logic but looks for Complex_Context, or similar. I 
definitely agree it would be best if StandardTokenizer worked well out of the 
box without doing something like this.
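
The per-codepoint test such a filter would hinge on is something like this 
(untested; the ComplexContext helper is hypothetical, but the ICU4J property 
constants are real):

{code:java}
// Untested sketch: "is this codepoint in a Southeast Asian complex-context
// script?" Complex_Context (SA) is a Line_Break property value in Unicode,
// which is how ICU exposes it. ComplexContext is a made-up helper class.
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public final class ComplexContext {
  public static boolean isComplexContext(int codepoint) {
    return UCharacter.getIntPropertyValue(codepoint, UProperty.LINE_BREAK)
        == UCharacter.LineBreak.COMPLEX_CONTEXT;
  }
}
{code}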

Finally, I think it's worth considering a lot of this as a special case of a 
larger problem that affects even English. For a lot of users, punctuation 
such as the hyphen in English might have some special meaning, and they might 
want to shingle or do something else in that case too. It's a general problem 
with TokenStreams that the tokenizer often discards this information and the 
filters are left with only a partial picture. Some ideas to improve it would 
be to make use of properties like [:Terminal_Punctuation=Yes:] somehow, or to 
try to integrate sentence segmentation.
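
For illustration, [:Terminal_Punctuation=Yes:] boils down to a binary-property 
test in ICU4J, which a tokenizer could consult before throwing punctuation 
away (untested sketch; TerminalPunct is a made-up helper):

{code:java}
// Untested sketch: the per-codepoint test behind [:Terminal_Punctuation=Yes:].
// TerminalPunct is hypothetical; the UProperty constant is real ICU4J.
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public final class TerminalPunct {
  public static boolean isTerminalPunctuation(int codepoint) {
    return UCharacter.hasBinaryProperty(codepoint, UProperty.TERMINAL_PUNCTUATION);
  }
}
{code}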


> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere to the standard as 
> closely as we can with JFlex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe 
> handling can stay with that EuropeanTokenizer, and it could be used by the 
> European analyzers.
