[jira] Updated: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

Robert Muir (JIRA) Wed, 30 Jun 2010 07:22:17 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir updated LUCENE-2167:
--------------------------------

    Attachment: LUCENE-2167.patch

OK, here is a patch file. Before applying it, you have to run these commands:

{noformat}
# original grammar -> ClassicTokenizerImpl
svn move \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImplOrig.java \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicTokenizerImpl.java
svn move \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImplOrig.jflex \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicTokenizerImpl.jflex
# this one is not needed, this patch becomes the new grammar
svn delete \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl31.java
svn delete \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl31.jflex
# expose the old tokenizer, not just via Version, but also as
# ClassicAnalyzer/Tokenizer/Filter
svn copy \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicAnalyzer.java
svn copy \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicTokenizer.java
svn copy \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardFilter.java \
  modules/analysis/common/src/java/org/apache/lucene/analysis/standard/ClassicFilter.java
svn copy \
  modules/analysis/common/src/test/org/apache/lucene/analysis/core/TestStandardAnalyzer.java \
  modules/analysis/common/src/test/org/apache/lucene/analysis/core/TestClassicAnalyzer.java
# temporarily edit solr/src/java/org/apache/solr/analysis/StandardFilterFactory.java
# (change the $Id hossman.... to just $Id$)
# apply the patch.
{noformat}

If you want to iterate on the patch, make your changes and generate a patch 
with 'svn diff --no-diff-deleted'.

Some notes:
* The patch is against 4.0, but I think we can do this in 3.1. All the back 
compat is preserved, etc. We just have to figure a few things out. All the 
tests pass, though.
* The patch is large mainly because of the DFA size. I have some concerns about 
this... the email/URL stuff seems to be the culprit, as the UAX#29 generated 
class is only 12KB, about the same size as our existing StandardTokenizer.
* I kept backwards compatibility via Version (you get the old behavior), but 
also set up ClassicAnalyzer/Tokenizer/Filter for those who want the... not so 
international-friendly old version, for its company identification, etc.
* I modified the token types for ICU to be more consistent with this.
* StandardFilter is currently a no-op for the new grammar. In my opinion this 
is the place to implement the 'more sophisticated' logic that the standard 
mentions for certain scripts. We can use token types (IDEOGRAPHIC, 
SOUTHEAST_ASIAN) to drive this. This way the StandardAnalyzer is a reasonable 
tokenizer for most languages.
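
A minimal sketch of how token types could drive such a filter, written without any Lucene classes so it stands alone. Everything in it is an assumption for illustration: the Token class, the filter method, the ALPHANUM type, and the choice of overlapping bigrams for runs of IDEOGRAPHIC tokens are not taken from the patch. It also assumes that consecutive tokens in the list were adjacent in the original text, which a real filter would check using offsets.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: a token-type-driven post-processing step, independent of Lucene. */
public class TokenTypeFilterSketch {

    /** Hypothetical type labels; only IDEOGRAPHIC and SOUTHEAST_ASIAN are named in the notes. */
    public enum TokenType { ALPHANUM, IDEOGRAPHIC, SOUTHEAST_ASIAN }

    /** Hypothetical token: its text plus the type assigned by the tokenizer. */
    public static final class Token {
        public final String text;
        public final TokenType type;

        public Token(String text, TokenType type) {
            this.text = text;
            this.type = type;
        }
    }

    /**
     * Passes non-ideographic tokens through unchanged and replaces each run of
     * two or more IDEOGRAPHIC tokens with overlapping bigrams. A lone
     * IDEOGRAPHIC token is kept as-is.
     */
    public static List<Token> filter(List<Token> input) {
        List<Token> output = new ArrayList<>();
        int i = 0;
        while (i < input.size()) {
            Token current = input.get(i);
            if (current.type != TokenType.IDEOGRAPHIC) {
                // SOUTHEAST_ASIAN would need dictionary-based segmentation;
                // this sketch simply passes such tokens through.
                output.add(current);
                i++;
                continue;
            }
            // Find the end (exclusive) of the run of IDEOGRAPHIC tokens.
            int runEnd = i;
            while (runEnd < input.size()
                    && input.get(runEnd).type == TokenType.IDEOGRAPHIC) {
                runEnd++;
            }
            if (runEnd - i == 1) {
                output.add(current);
            } else {
                for (int k = i; k + 1 < runEnd; k++) {
                    String bigram = input.get(k).text + input.get(k + 1).text;
                    output.add(new Token(bigram, TokenType.IDEOGRAPHIC));
                }
            }
            i = runEnd;
        }
        return output;
    }
}
```

Dispatching on the type label keeps the grammar itself free of script-specific rules: the tokenizer only has to classify, and any per-script handling can live in the filter.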

So, not completely sure this is the best approach, but it is one... the patch 
is still rough around the edges but at least now we can iterate more easily on 
it.


> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> standard.zip
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the european 
> analyzers.


