[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

Robert Muir (JIRA) Sat, 12 Jun 2010 09:26:40 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878283#action_12878283
 ]


Robert Muir commented on LUCENE-2167:
-------------------------------------

bq. Why not just insert U+2029 PARAGRAPH SEPARATOR (PS)?

I would argue because it's a sentence boundary, not a paragraph boundary :)

But I thought it would be best to just allow the user to specify the 
replacement string (which could be just U+2029 if you want).
They could also use "<boundary/>" or something entirely different.
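
A rough sketch of the idea, in plain Java rather than as a real Lucene CharFilter: the class and method names below are made up for illustration, and the JDK's java.text.BreakIterator stands in for a proper UAX#29 sentence-break implementation (its rules are not guaranteed to match the standard). A real CharFilter would also have to correct offsets for the inserted characters, which this sketch ignores.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: inserts a caller-chosen marker
// string at every sentence boundary that falls strictly inside the text.
final class SentenceMarkerSketch {

    static String insertMarkers(String text, String marker) {
        // Assumption: the JDK sentence iterator is only an approximation
        // of the UAX#29 sentence-break rules.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);

        StringBuilder out = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (start > 0) {
                // A new sentence begins here, so emit the marker first.
                out.append(marker);
            }
            out.append(text, start, end);
        }
        return out.toString();
    }
}
```

Calling insertMarkers(text, "\u2029") gives the PARAGRAPH SEPARATOR behaviour, and insertMarkers(text, "<boundary/>") gives the custom-string behaviour; nothing else changes.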

bq. and tokenizers that care about appropriately responding to it can 
specialize for just this one, instead of having to also be aware of whatever it 
was that the user specified in the ctor to the charfilter.

Well, by default these filters could just work with position increments 
appropriately: you add whatever string you use to a stopword filter, and 
removing it as a stopword creates these position increments.
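
To make the position-increment idea concrete, here is a minimal sketch in plain Java. It does not use Lucene classes; Token and dropMarker are hypothetical names, and the logic only imitates what a stop filter does when it preserves position increments: the marker token is dropped and the next surviving token gets a larger increment, leaving a gap at the boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token stream with position increments.
final class BoundaryGapSketch {

    /** A term plus its position increment relative to the previous emitted token. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Drops every occurrence of marker and folds its position into the next token. */
    static List<Token> dropMarker(List<String> terms, String marker) {
        List<Token> out = new ArrayList<>();
        int skipped = 0; // removed tokens since the last emitted one
        for (String term : terms) {
            if (term.equals(marker)) {
                skipped++;
                continue;
            }
            out.add(new Token(term, 1 + skipped));
            skipped = 0;
        }
        return out;
    }
}
```

For the terms first, sentence, <boundary/>, second, sentence the surviving tokens carry increments 1, 1, 2, 1, so an exact phrase match on "sentence second" no longer lines up across the boundary.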

bq. I like where this is going - toward a solid general solution.

Good. If we get some sort of plan, we should open a new JIRA issue, I think.

bq. Email. Source code. TREC collections (I think - don't have any right here 
with me). And yes, manually generated and wrapped text. Isn't most text 
manually generated?

Right, but Unicode encodes characters :) So things like text wrapping, in my 
opinion, belong in the display component and not in a character encoding 
model... most modern text in web pages etc. isn't manually wrapped like this.

I think our default implementation should be for Unicode text. For the 
non-Unicode text you speak of, you can just tailor the default rules.





> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the European 
> analyzers.

