[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929737#action_12929737
 ] 

Uwe Schindler commented on LUCENE-2747:
---

CharFilter must at least also implement read() to read one char. This is not 
needed by most tokenizers, but without it the class is incomplete. It does *not* 
need to implement read(char[]) without offset and count.

Otherwise it looks fine for now in reusable.

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929738#action_12929738
 ] 

Robert Muir commented on LUCENE-2747:
-

bq. CharFilter must at least also implement read() to read one char. 

That's incorrect:
only read(char[] cbuf, int off, int len) is abstract in Reader.
CharStream extends Reader, but only adds correctOffset.
CharFilter extends CharStream, but only delegates read(char[] cbuf, int off, 
int len).

So implementing read() here would only add useless code duplication.
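
For reference, a condensed paraphrase of java.io.Reader from the JDK source, showing that the single-char read() is already defined in terms of the one abstract bulk method:

{code}
// Condensed paraphrase of java.io.Reader: subclasses only have to supply
// read(char[], int, int); the no-arg read() is built on top of it.
public abstract class Reader implements Readable, Closeable {
  public int read() throws IOException {
    char[] cb = new char[1];
    return (read(cb, 0, 1) == -1) ? -1 : cb[0];
  }

  public abstract int read(char[] cbuf, int off, int len) throws IOException;

  // skip(), ready(), close(), etc. omitted
}
{code}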




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929826#action_12929826
 ] 

Uwe Schindler commented on LUCENE-2747:
---

Ay, ay "Code Dup Policeman".

From a performance standpoint that would be a no-go for real FilterReaders in 
java.io, but here it's fine, as Tokenizers always buffer. Also, java.io's 
FilterReaders are different and do delegate this method, but CharFilter does not.
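
For comparison, a condensed paraphrase of java.io.FilterReader from the JDK source, which does delegate both read() variants:

{code}
// Condensed paraphrase of java.io.FilterReader: both read() variants
// delegate straight to the wrapped Reader, avoiding Reader's default
// one-char-buffer fallback. CharFilter only delegates the bulk variant,
// which costs nothing here because Tokenizers read through a buffer.
public abstract class FilterReader extends Reader {
  protected Reader in;

  protected FilterReader(Reader in) {
    super(in);
    this.in = in;
  }

  public int read() throws IOException {
    return in.read();
  }

  public int read(char[] cbuf, int off, int len) throws IOException {
    return in.read(cbuf, off, len);
  }
}
{code}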




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929834#action_12929834
 ] 

Robert Muir commented on LUCENE-2747:
-

Wait, that's an interesting point: is there any advantage to actually using 
"real FilterReaders" for this API?





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-08 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929934#action_12929934
 ] 

DM Smith commented on LUCENE-2747:
--

I'm not too keen on this. For classics and ancient texts the standard analyzer 
is not as good as the simple analyzer. I think it is important to have a 
tokenizer that does not try to be too smart. I think it'd be good to have a 
SimpleAnalyzer based upon UAX#29, too.

Then I'd be happy.




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-08 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929936#action_12929936
 ] 

Steven Rowe commented on LUCENE-2747:
-

Robert, your patch looks good - I have a couple of questions:

* You removed {{TestHindiFilters.testTokenizer()}}, 
{{TestIndicTokenizer.testBasics()}} and {{TestIndicTokenizer.testFormat()}}, 
but these would be useful in {{TestStandardAnalyzer}} and 
{{TestUAX29Tokenizer}}, wouldn't they?
* You did not remove {{ArabicLetterTokenizer}} and {{IndicTokenizer}}, 
presumably so that they can be used with Lucene 4.0+ when the supplied 
{{Version}} is less than 3.1 -- good catch, I had forgotten this requirement -- 
but when can we actually get rid of these?  Since they will be staying, 
shouldn't their tests remain too, but using {{Version.LUCENE_30}} instead of 
{{TEST_VERSION_CURRENT}}?




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-08 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929945#action_12929945
 ] 

Steven Rowe commented on LUCENE-2747:
-

bq. I'm not too keen on this. For classics and ancient texts the standard 
analyzer is not as good as the simple analyzer. I think it is important to have 
a tokenizer that does not try to be too smart. I think it'd be good to have a 
SimpleAnalyzer based upon UAX#29, too.

{{UAX29Tokenizer}} could be combined with {{LowerCaseFilter}} to provide that, 
no?
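
A minimal sketch of that combination, assuming the trunk/3.1-era analysis API (the analyzer class name and the UAX29Tokenizer package/constructor are assumptions for illustration):

{code}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.UAX29Tokenizer;
import org.apache.lucene.util.Version;

// Hypothetical "simple" analyzer: UAX#29 word segmentation plus
// lowercasing, with no stopwords or other filtering.
public final class UAX29SimpleAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(Version.LUCENE_31, new UAX29Tokenizer(reader));
  }
}
{code}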

Robert is arguing in the reopened LUCENE-2167 for {{StandardTokenizer}} to be 
stripped down so that it only implements UAX#29 rules (i.e., dropping URL+Email 
recognition), so if that comes to pass, {{StandardAnalyzer}} would just be 
UAX#29+lowercase+stopword (with English stopwords by default, but those can be 
overridden in the ctor) -- would that make you happy?




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930024#action_12930024
 ] 

Robert Muir commented on LUCENE-2747:
-

bq. You removed TestHindiFilters.testTokenizer(), 
TestIndicTokenizer.testBasics() and TestIndicTokenizer.testFormat(), but these 
would be useful in TestStandardAnalyzer and TestUAX29Tokenizer, wouldn't they?

Oh, I just deleted everything associated with that tokenizer...

bq. You did not remove ArabicLetterTokenizer and IndicTokenizer, presumably so 
that they can be used with Lucene 4.0+ when the supplied Version is less than 
3.1 - good catch, I had forgotten this requirement - but when can we actually 
get rid of these? Since they will be staying, shouldn't their tests remain too, 
but using Version.LUCENE_30 instead of TEST_VERSION_CURRENT?

I removed the IndicTokenizer (unreleased) and deleted its tests, but I kept 
and deprecated the Arabic one, since we have released it.





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930072#action_12930072
 ] 

Simon Willnauer commented on LUCENE-2747:
-

I looked at the patch briefly and the charStream(Reader) extension looks good 
to me, though I would make it protected and have it throw an IOException. Since 
this API is public and folks will use it in the wild, we need to make sure we 
don't have to add the exception later, and that people creating Readers don't 
have to play tricks just because the interface has no IOException. About making 
it protected: do we need to call it in a non-protected context? Maybe I'm 
missing something...
{code}
public Reader charStream(Reader reader) {
  return reader;
}

// should be

protected Reader charStream(Reader reader) throws IOException {
  return reader;
}
{code}
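
For context, a hedged sketch of how an analyzer might override that hook, using the PersianAnalyzer case from the issue description: wrap the incoming Reader in a CharFilter that maps ZWNJ to a space before StandardTokenizer sees the text. MappingCharFilter, NormalizeCharMap, and CharReader are existing analysis classes; the override itself is the API under discussion:

{code}
@Override
protected Reader charStream(Reader reader) throws IOException {
  // Map zero-width non-joiner (U+200C) to a space so the UAX#29-based
  // StandardTokenizer breaks tokens there, as ArabicLetterTokenizer did.
  NormalizeCharMap map = new NormalizeCharMap();
  map.add("\u200C", " ");
  return new MappingCharFilter(map, CharReader.get(reader));
}
{code}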





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930073#action_12930073
 ] 

Robert Muir commented on LUCENE-2747:
-

Simon: I agree with both those points... we should change the method signature.

Also, I called it charStream (this is what Solr's analyzer calls it), but this 
is slightly confusing since the API is all Reader-based.
Alternatively we could give this a different name, wrapReader or something... 
not sure, I didn't have any better ideas than charStream.






[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930075#action_12930075
 ] 

Simon Willnauer commented on LUCENE-2747:
-

bq. ...Alternatively we could give this a different name, wrapReader or 
something... not sure, I didn't have any better ideas than charStream.

wrapReader seems too specific; what about initReader?




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930090#action_12930090
 ] 

Robert Muir commented on LUCENE-2747:
-

bq. I'm not too keen on this. For classics and ancient texts the standard 
analyzer is not as good as the simple analyzer.

DM, can you elaborate here?

Are you speaking of the existing StandardAnalyzer in previous releases, which 
doesn't properly deal with tokenizing diacritics, etc.?
That is the reason these "special" tokenizers exist: to work around those bugs.
But StandardTokenizer now handles this stuff fine, and they are obsolete.

I'm confused, though, about how SimpleAnalyzer would ever be any better in 
previous releases, since it would barf on these diacritics too: it only emits 
tokens that are runs of Character.isLetter.

Or is there something else I'm missing here?
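
A quick illustration of that last point: combining diacritics fail Character.isLetter, so a letter-run tokenizer splits tokens on them.

{code}
// U+064E is ARABIC FATHA, a common combining diacritic. It is a
// non-spacing mark, not a letter, so Character.isLetter-based
// tokenization breaks the surrounding word apart.
System.out.println(Character.isLetter('\u064E'));                              // false
System.out.println(Character.getType('\u064E') == Character.NON_SPACING_MARK); // true
{code}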





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930119#action_12930119
 ] 

DM Smith commented on LUCENE-2747:
--

bq. DM, can you elaborate here?

I was a bit trigger-happy with the comment. I should have looked at the code 
rather than the JIRA comments alone. The old StandardAnalyzer had a 
kitchen-sink approach to tokenization, trying to do too much with *modern* 
constructs, e.g. URLs, email addresses, acronyms, etc. It and SimpleAnalyzer 
would produce about the same stream on "old" English and some other texts, but 
the StandardAnalyzer was much slower. (I don't remember how slow, but it was 
obvious.)

Both of these were weak when it came to non-English/non-Western texts. Thus I 
could take the language-specific tokenizers, lists of stop words, and stemmers 
and create variations of the SimpleAnalyzer that properly handled a particular 
language. (I created my own analyzers because I wanted to make stop words and 
stemming optional.)

In looking at the code in trunk (should have done that before making my 
comment), I see that UAX29Tokenizer is duplicated in the StandardAnalyzer's 
jflex and that ClassicAnalyzer is the old jflex. Also, the new StandardAnalyzer 
does a lot less.

If I understand the suggestion of this and the other 2 issues, StandardAnalyzer 
will no longer handle modern constructs. As I see it, this is what 
SimpleAnalyzer should be: based on UAX#29 and doing little else. Thus my 
confusion. Is there a point to having SimpleAnalyzer? Shouldn't UAX29Tokenizer 
be moved to core? (What is core anyway?)

And if I understand where this is going: would there be a way to plug in 
ICUTokenizer as a replacement for UAX29Tokenizer in StandardTokenizer, such 
that all Analyzers using StandardTokenizer would get the alternate 
implementation?




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930137#action_12930137
 ] 

Robert Muir commented on LUCENE-2747:
-

DM, thanks, I see exactly where you are coming from.

I see your point: previously it was much easier to take something like 
SimpleAnalyzer and 'adapt' it to a given language based on things like Unicode 
properties.
In fact, that's exactly what we did in the cases here (Arabic, Persian, Hindi, 
etc.).

But now we can actually tokenize "correctly" for more languages with jflex, 
thanks to its improved Unicode support, and it's superior to these previous 
hacks :)

To try to answer some of your questions (all my opinion):

bq. Is there a point to having SimpleAnalyzer

I guess so; a lot of people can use it if they have English-only content and 
are probably happy with discarding numbers, etc. It's not a big loss to me if 
it goes, though.

bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (the 4.x codeline) there is no core, contrib, or solr for analyzer 
components any more; they are all combined into modules/analysis.
In branch_3x (the 3.x codeline) we did not make this rather disruptive 
refactor: there UAX29Tokenizer is in fact in Lucene core.

bq. Would there be a way to plug in ICUTokenizer as a replacement for 
UAX29Tokenizer in StandardTokenizer, such that all Analyzers using 
StandardTokenizer would get the alternate implementation?

Personally, I would prefer that we move towards a factory model where things 
like these supplied "language analyzers" are actually xml/json/properties 
snippets.
In other words, they are just example configurations that build your analyzer, 
like Solr does.
This is nice, because then you don't have to write code to easily customize 
how your analyzer works.

I think we have been making slow steps towards this, just doing basic things 
like moving stopword lists to .txt files.
But I think the next step would be LUCENE-2510, where we have factories/config 
attribute parsers for all these analysis components already written.

Then we could have support for declarative analyzer specification via 
xml/json/.properties/whatever, and move all these Analyzers to that.
I still think you should be able to code up your own analyzer, but in my 
opinion this is much easier and preferred for the ones we supply.
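
To make that concrete, this is roughly what such a declarative snippet already looks like in a Solr schema.xml of this era (shown only as an example of the factory style, not a proposed format):

{code}
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>
</fieldType>
{code}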

Also, I think this would solve a lot of analyzer-backwards-compat problems, 
because then our supplied analyzers are really just configuration-file 
examples, and we can change our examples however we want... someone can use 
their old config file (and hopefully their old analysis module jar file!) to 
guarantee the exact same behavior if they want.

Finally, most of the benefits of ICUTokenizer are actually in the UAX#29 
support... the tokenizers are pretty close, with some minor differences:
* the jflex-based implementation is faster, and better in my opinion.
* the ICU-based implementation allows tailoring, and supplies tailored 
tokenization for several complex scripts (jflex doesn't have this... yet).
* the ICU-based implementation works with all of Unicode; at the moment jflex 
is limited to the Basic Multilingual Plane.

In my opinion the last 2 points will probably be resolved eventually... I 
could see our ICUTokenizer possibly becoming obsolete down the road through 
better jflex support, though that would probably need hooks into ICU for the 
complex script support (so we get it for free from ICU).


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930448#action_12930448
 ] 

DM Smith commented on LUCENE-2747:
--

Robert, I think we are on the same wavelength. Thanks.

I like the idea of declarative analyzers, too.

Regarding the "last 2 points": has anyone given input to the JFlex team on 
these needs?




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930470#action_12930470
 ] 

Steven Rowe commented on LUCENE-2747:
-

DM, I'm a committer on the JFlex team.  About the second-to-last point: when 
Robert said "(jflex doesn't have this... yet)" he meant the jflex-based 
implementation, not JFlex itself.

About the last point, JFlex is shooting for Level 1 compliance with [UTS#18 
Unicode Regular Expressions|http://unicode.org/reports/tr18/], which requires 
conforming implementations to "handle the full range of Unicode code points, 
including values from U+0000 to U+10FFFF."




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930580#action_12930580
 ] 

Robert Muir commented on LUCENE-2747:
-

bq. I meant o.a.l.analysis.core. I'd expect the premier analyzers to be in core.

I guess the package doesn't make a big difference to me; all the analyzers are 
in one place and on the same footing.
It's true we mixed the "core" analysis stuff with contrib and solr, but if 
there are warts with the contrib stuff, we should be able to clean it up 
(I think this has been happening).

bq. I guess I meant: Shouldn't the SimpleAnalyzer just be constructed the same 
as StandardAnalyzer with the addition of a Filter that pitches tokens that are 
not needed?

I don't think so; this seems like a trap. Lots of people use StandardTokenizer 
without any filter... I don't think it should emit trash (punctuation). If you 
want to do the other stuff, you can use ClassicTokenizer, or even 
WhitespaceTokenizer, and filter to your heart's content.





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930585#action_12930585
 ] 

DM Smith commented on LUCENE-2747:
--

Robert, let me ask another way: how about implementing StandardTokenizer using 
jflex to be UAX29Tokenizer minus NUM and ALPHANUM?




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930578#action_12930578
 ] 

DM Smith commented on LUCENE-2747:
--

{quote}
bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (the 4.x codeline) there is no core, contrib, or solr for analyzer 
components any more; they are all combined into modules/analysis.
In branch_3x (the 3.x codeline) we did not make this rather disruptive 
refactor: there UAX29Tokenizer is in fact in Lucene core.
{quote}

I meant o.a.l.analysis.core. I'd expect the *premier* analyzers to be in core.

{quote}
bq. Is there a point to having SimpleAnalyzer

I guess so; a lot of people can use it if they have English-only content and 
are probably happy with discarding numbers, etc. It's not a big loss to me if 
it goes, though.
{quote}

I guess I meant: shouldn't the SimpleAnalyzer just be constructed the same as 
StandardAnalyzer, with the addition of a Filter that pitches tokens that are 
not needed?
With the suggestion in LUCENE-2167 to use UAX29Tokenizer for StandardAnalyzer, 
effectively deprecating EMAIL and URL and possibly adding some kind of 
PUNCTUATION type (so that URLs/emails/acronyms... can be reconstructed, if 
someone desires), the StandardAnalyzer is about as simple as one could get 
while properly handling non-English/non-Western languages. It just creates the 
ALPHANUM, NUM, and PUNCTUATION (if added) tokens that SimpleAnalyzer does not 
care about.
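
A hedged sketch of such a filter against the 3.x TokenFilter API (the class name is hypothetical; it keeps a given set of token types and pitches the rest):

{code}
import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical filter: keep only tokens whose type is in keepTypes
// (e.g. <ALPHANUM> and <NUM>), pitching everything else.
public final class KeepTypesFilter extends TokenFilter {
  private final Set<String> keepTypes;
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  public KeepTypesFilter(TokenStream in, Set<String> keepTypes) {
    super(in);
    this.keepTypes = keepTypes;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (keepTypes.contains(typeAtt.type())) {
        return true;
      }
    }
    return false;
  }
}
{code}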





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930598#action_12930598
 ] 

Steven Rowe commented on LUCENE-2747:
-

bq. How about implementing StandardTokenizer using jflex to be UAX29Tokenizer 
minus NUM and ALPHANUM?

DM, ALPHANUM is the only "word" token type - there is no ALPHA or WORD type - 
from the UAX29Tokenizer javadocs:

{quote}
Tokens produced are of the following types:

* <ALPHANUM>: A sequence of alphabetic and numeric characters
* <NUM>: A number
* <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian 
languages, including Thai, Lao, Myanmar, and Khmer
* <IDEOGRAPHIC>: A single CJKV ideographic character
* <HIRAGANA>: A single hiragana character
{quote}

So I guess you think that StandardAnalyzer should exclude tokens that have 
numeric characters in them?
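
For reference, a minimal sketch (assuming the 3.1-era attribute API) of how these types show up at runtime via TypeAttribute:

{code}
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

// Print each token and its type, e.g. "test -> <ALPHANUM>", "1234 -> <NUM>".
StandardTokenizer tok =
    new StandardTokenizer(Version.LUCENE_31, new StringReader("test 1234"));
CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
TypeAttribute type = tok.addAttribute(TypeAttribute.class);
tok.reset();
while (tok.incrementToken()) {
  System.out.println(term.toString() + " -> " + type.type());
}
tok.end();
tok.close();
{code}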




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930601#action_12930601
 ] 

Robert Muir commented on LUCENE-2747:
-

bq. Robert, Let me ask another way. How about implementing StandardTokenizer 
using jflex to be UAX29Tokenizer minus NUM and ALPHANUM?

I think, as I already commented, that it would be best if StandardTokenizer 
just implemented the UAX#29 algorithm. It's a great default, and I don't think 
we should try to alter it.





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930670#action_12930670
 ] 

DM Smith commented on LUCENE-2747:
--

Robert/Steven, I'm sorry. I fat-fingered the last post. I really need to take 
more care.
s/Standard/Simple/;

That is, SimpleAnalyzer is not appropriate for many languages. If it were based 
upon a variation of UAX29Tokenizer that produced a WORD type instead of NUM or 
ALPHANUM, it would be the same type of token stream, just alpha words.

Regarding compatibility: I think the results for English would be nearly, if 
not exactly, identical. Western European would be only slightly off from 
identical. But for other languages it would be an improvement.

At this point, I'm content with what you guys are doing with non-English texts. 
Great job.





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread Steven Rowe (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930674#action_12930674 ]

Steven Rowe commented on LUCENE-2747:
-

bq. That is, SimpleAnalyzer is not appropriate for many languages. If it were 
based on a variation of UAX29Tokenizer that emitted WORD tokens instead of 
handling NUM or ALPHANUM, it would produce the same kind of token stream, just 
alphabetic words.

DM, if you want, I could help you convert UAX29Tokenizer in this way - I hang 
out on IRC #lucene and #lucene-dev a lot of the time.  I think you'd want to 
call it something other than SimpleAnalyzer, though - maybe 
UAX29AlphabeticTokenizer?




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930681#action_12930681 ]

Robert Muir commented on LUCENE-2747:
-

bq. That is, SimpleAnalyzer is not appropriate for many languages. If it were 
based on a variation of UAX29Tokenizer that emitted WORD tokens instead of 
handling NUM or ALPHANUM, it would produce the same kind of token stream, just 
alphabetic words.

OK, now I understand you, and yes, I agree. My question is, should we even 
bother fixing it? Would anyone who actually cares about Unicode really want 
only some hacked subset of UAX#29?

These simple ones like SimpleAnalyzer, WhitespaceAnalyzer, and StopAnalyzer are 
all really bad for Unicode text in different ways, though Simple/Stop are the 
bigger offenders, I think, because they will separate a base character from its 
combining characters (in my opinion, this should always be avoided) and, worse, 
they will break tokens on them.
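
To illustrate with a stand-alone sketch (plain java, not Lucene code): 
LetterTokenizer-style splitting tests Character.isLetter, and a combining mark 
like U+0301 is not a letter, so on NFD-normalized text the mark gets chopped 
off its base character:

{code:java}
import java.text.Normalizer;

/**
 * Shows why splitting on Character.isLetter() mishandles combining
 * characters: U+0301 (COMBINING ACUTE ACCENT) is not a letter, so a
 * LetterTokenizer-style loop breaks the token right after the base char.
 */
public class CombiningMarkDemo {
  public static void main(String[] args) {
    // "café" in NFD form: 'c', 'a', 'f', 'e', then U+0301
    String nfd = Normalizer.normalize("café", Normalizer.Form.NFD);

    StringBuilder token = new StringBuilder();
    for (int i = 0; i < nfd.length(); i++) {
      char c = nfd.charAt(i);
      if (Character.isLetter(c)) {
        token.append(c);
      } else if (token.length() > 0) {
        System.out.println("token: " + token); // prints "cafe": accent lost
        token.setLength(0);
      }
    }
    if (token.length() > 0) {
      System.out.println("token: " + token);
    }
  }
}
{code}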

But people using them are probably happy? E.g., you can do as Solr does: use 
WhitespaceAnalyzer and follow through with something like WordDelimiterFilter, 
and it's mostly OK, depending upon the options, except for cases like CJK, 
where it's a death trap.

Personally, I just don't use these things, since I know the problems, but we 
could document "this is simplistic and won't work well for many languages" and 
keep them around for people who don't care?

And yeah, I suppose it's confusing that these really "simple" ones are in the 
.core package, but to me the package name is meaningless; I was just trying to 
keep the analyzers arranged in some kind of order (e.g., pattern-based analysis 
in the .pattern package, etc.).

We could just as well call the package .basic or .simple or something else; 
it's just a name.





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread DM Smith (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930690#action_12930690 ]

DM Smith commented on LUCENE-2747:
--

Robert, I think:
* "core" is a bad name that needs to be changed. It is misleading.
* Documentation should be improved along the lines you suggest.

You mention a few broken Analyzers (and, by implication, related tokenizers and 
filters). I have a question about LowerCaseFilter: isn't it bad as well?
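
I'm thinking of the locale-sensitive case pairs, e.g. Turkish dotted/dotless i, 
where per-character, locale-blind lowercasing gives the wrong result. A quick 
plain-Java illustration of what I mean (the class name is made up):

{code:java}
import java.util.Locale;

/**
 * Locale-blind lowercasing maps 'I' to 'i', but in Turkish uppercase 'I'
 * should map to dotless 'ı' (U+0131).
 */
public class TurkishCaseDemo {
  public static void main(String[] args) {
    String city = "ISPARTA";
    System.out.println(city.toLowerCase(Locale.ENGLISH));   // isparta
    System.out.println(city.toLowerCase(new Locale("tr"))); // ısparta
    System.out.println(Character.toLowerCase('I'));         // i (locale-blind)
  }
}
{code}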





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-29 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965066#action_12965066 ]

Robert Muir commented on LUCENE-2747:
-

OK, I'd like to move forward with this if there are no objections. It improves 
these analyzers' multilingual capabilities, allowing them to support numbers 
etc., and deprecates/removes useless custom code (only unreleased functionality 
is removed).

We also add CharFilter support to ReusableAnalyzerBase, which was missing 
before.
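
For anyone following along, the kind of pre-tokenization mapping a CharFilter 
performs can be sketched in plain java.io terms. This is just an illustration 
of the concept, not the patch's code (a real CharFilter must also implement 
correctOffset so that token offsets still point into the original text); it 
uses the ZWNJ-to-space mapping from this issue's description as the example:

{code:java}
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/**
 * Maps ZWNJ (U+200C) to a space before tokenization. A plain-java
 * stand-in for a Lucene CharFilter; offset correction is omitted.
 */
public class ZwnjToSpaceReader extends FilterReader {

  public ZwnjToSpaceReader(Reader in) {
    super(in);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    int n = in.read(cbuf, off, len);
    for (int i = off; i < off + n; i++) {
      if (cbuf[i] == '\u200C') {
        cbuf[i] = ' ';
      }
    }
    return n;
  }

  @Override
  public int read() throws IOException {
    // FilterReader.read() bypasses the array method, so map here too.
    int c = in.read();
    return c == '\u200C' ? ' ' : c;
  }
}
{code}

A hypothetical usage would be handing new ZwnjToSpaceReader(reader) to 
StandardTokenizer so that U+200C behaves like a word boundary.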

DM: we can open a follow-up issue to improve the docs for the simplistic 
tokenizers, as discussed here.

