[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name

2011-09-25 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114304#comment-13114304
 ] 

DM Smith commented on LUCENE-3454:
--

When I started with Lucene, I read the docs and was drawn to call optimize 
because of its "cool name." However, it was the documentation at the time that 
convinced me it was appropriate for my use case: creating an index that, once 
created, would never be modified. Search needed to be as fast as possible on 
low-performance computing devices (old laptops, ancient computers, netbooks, 
phones, ...).

Maybe I misunderstood, but wasn't it and isn't it still appropriate for that?

And I have no idea what "NooSegments" means.

If you want a really uncool name, how about dumbDown()?

But either way, please document the appropriate use cases for it.

> rename optimize to a less cool-sounding name
> 
>
> Key: LUCENE-3454
> URL: https://issues.apache.org/jira/browse/LUCENE-3454
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 4.0
>Reporter: Robert Muir
>
> I think users see the name optimize and feel they must do this, because who 
> wants a suboptimal system? But this probably just results in wasted time and 
> resources.
> Maybe rename to collapseSegments or something?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3239) drop java 5 "support"

2011-06-24 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054735#comment-13054735
 ] 

DM Smith commented on LUCENE-3239:
--

Same page.


> drop java 5 "support"
> -
>
> Key: LUCENE-3239
> URL: https://issues.apache.org/jira/browse/LUCENE-3239
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Robert Muir
>
> It's been discussed here and there, but I think we need to drop Java 5 
> "support", for these reasons:
> * It's totally untested by any continual build process. Testing Java 5 only 
> when there is a release candidate ready is not enough. If we are to claim 
> "support", then we need a Hudson actually running the tests with Java 5.
> * It's now unmaintained, so bugs have to be hacked around, tests 
> disabled, or warnings placed, but some things simply cannot be fixed... We 
> cannot actually "support" something that is no longer maintained: we do find 
> JRE bugs (http://wiki.apache.org/lucene-java/SunJavaBugs) and it's important 
> that bugs actually get fixed: we cannot do everything with hacks.
> * Because of its limitations, we do things like allow 20% slower grouping 
> speed. I find it hard to believe we are sacrificing performance for this.
> So, in summary: because we don't test it at all, because it's buggy and 
> unmaintained, and because we are sacrificing performance, I think we need to 
> cut the build system over to require Java 6 for the next release.




[jira] [Commented] (LUCENE-3239) drop java 5 "support"

2011-06-24 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054621#comment-13054621
 ] 

DM Smith commented on LUCENE-3239:
--

Hey, it's me, old stick-in-the-mud, w.r.t. upgrading Java :) For the most part, I 
think the same arguments as last time (Java 1.4 -> Java 5) still apply.

However, Oracle is much more aggressive about obsoleting their software. They 
haven't patched Java 5 in quite some time; when Lucene went to Java 5, Java 1.4 
was still being patched.

I think most will be running Lucene under Java 6 (excepting some versions of 
Mac OS X and hardware; e.g., Core Duo Macs can't run Java 6).

I'd like to see API compatibility with Java 5 (i.e., the code can compile 
against Java 5), but certification against Java 6. This would allow Lucene to run 
under Java 5, with the appropriate caveat that doing so is neither supported nor 
tested.

If you do adopt Java 6 features, then I think it has to be a 4.0 release, and 
the planned 4.0 might need to be bumped to a 5.0 designation.





[jira] Commented: (LUCENE-1799) Unicode compression

2011-02-06 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991170#comment-12991170
 ] 

DM Smith commented on LUCENE-1799:
--

Any idea as to when this will be released?

> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: DM Smith
>Priority: Minor
> Attachments: Benchmark.java, Benchmark.java, Benchmark.java, 
> LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch
>
>
> In LUCENE-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed, an encoding such as ISO-8859-1 (or whatever covers the 
> input) could be used.
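A quick illustration of the motivation, using only the JDK (the string and the sizes here are illustrative): for Cyrillic text, UTF-8 spends two bytes per letter, where a dedicated single-byte encoding, or a scheme like SCSU/BOCU-1, would spend roughly one.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    // Returns the number of bytes UTF-8 needs for the given text.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String russian = "пример"; // 6 Cyrillic letters
        // UTF-8 encodes each Cyrillic letter in 2 bytes: 12 bytes total,
        // while a single-byte Cyrillic code page would need only 6.
        System.out.println(utf8Bytes(russian)); // 12
    }
}
```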




[jira] Commented: (LUCENE-2906) Filter to process output of ICUTokenizer and create overlapping bigrams for CJK

2011-02-06 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991169#comment-12991169
 ] 

DM Smith commented on LUCENE-2906:
--

Two questions:
1) How will this differ from the SmartChineseAnalyzer?
2) I doubt it, but can this be in 3.1?

> Filter to process output of ICUTokenizer and create overlapping bigrams for 
> CJK 
> 
>
> Key: LUCENE-2906
> URL: https://issues.apache.org/jira/browse/LUCENE-2906
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Tom Burton-West
>Priority: Minor
> Attachments: LUCENE-2906.patch
>
>
> The ICUTokenizer produces unigrams for CJK. We would like to use the 
> ICUTokenizer but have overlapping bigrams created for CJK as in the CJK 
> Analyzer.  This filter would take the output of the ICUtokenizer, read the 
> ScriptAttribute and for selected scripts (Han, Kana), would produce 
> overlapping bigrams.
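The bigramming described above can be sketched independently of Lucene's TokenStream API; the real filter would read the ScriptAttribute and apply this only to Han/Kana runs, and the class and method names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigrams {
    // Produce overlapping bigrams from a run of CJK characters,
    // e.g. a three-character Han run ABC -> AB, BC.
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("一二三")); // [一二, 二三]
    }
}
```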




[jira] Commented: (LUCENE-2791) WindowsDirectory

2010-12-03 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966540#action_12966540
 ] 

DM Smith commented on LUCENE-2791:
--

I've just back-ported all the code to Java 1.1. This port also deletes 
everything but 7-bit ASCII.

(Just couldn't resist)



> WindowsDirectory
> 
>
> Key: LUCENE-2791
> URL: https://issues.apache.org/jira/browse/LUCENE-2791
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Robert Muir
> Attachments: LUCENE-2791.patch, LUCENE-2791.patch, 
> WindowsDirectory.dll, WindowsDirectory_amd64.dll
>
>
> We can use Windows' overlapped IO to do pread() and avoid the performance 
> problems of SimpleFS/NIOFSDir.




[jira] Commented: (LUCENE-2786) no need for LowerCaseFilter from ArabicAnalyzer

2010-11-30 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965575#action_12965575
 ] 

DM Smith commented on LUCENE-2786:
--

I bet it is there for mixed language texts.
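A plausible reason: lowercasing only affects scripts that have case, so it is a no-op for Arabic but normalizes any embedded Latin-script terms. A minimal illustration with the plain JDK (not the actual Lucene filter):

```java
public class MixedCase {
    // Lowercase a token the way a LowerCaseFilter would: caseless scripts
    // such as Arabic pass through unchanged, Latin letters are folded.
    static String lower(String token) {
        return token.toLowerCase(java.util.Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lower("XML"));   // xml
        System.out.println(lower("كتاب"));  // unchanged: Arabic has no case
    }
}
```

So in a mixed Arabic/English document, removing the filter would stop "XML" and "xml" from matching, while costing nothing for the Arabic tokens.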

> no need for LowerCaseFilter from ArabicAnalyzer
> ---
>
> Key: LUCENE-2786
> URL: https://issues.apache.org/jira/browse/LUCENE-2786
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 3.0.2
> Environment: All
>Reporter: Ibrahim
>Priority: Trivial
>
> There is no need for line 171 in ArabicAnalyzer:
> result = new LowerCaseFilter(result);
> simply because there is no lower case or upper case in the Arabic language; 
> case is totally unrelated to it.




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930690#action_12930690
 ] 

DM Smith commented on LUCENE-2747:
--

Robert, I think
* "core" is a bad name that needs to be changed. It is misleading.
* Documentation should be improved along the lines you suggest.

You mention a few broken Analyzers (and, by implication, related tokenizers and 
filters). I have a question about LowerCaseFilter: isn't it bad as well?


> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---
>
> Key: LUCENE-2747
> URL: https://issues.apache.org/jira/browse/LUCENE-2747
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.1, 4.0
>Reporter: Steven Rowe
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to 
> provide language-neutral tokenization.  Lucene contains several 
> language-specific tokenizers that should be replaced by UAX#29-based 
> StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The 
> language-specific *analyzers*, by contrast, should remain, because they 
> contain language-specific post-tokenization filters.  The language-specific 
> analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond 
> just replacing the tokenizer in the language-specific analyzer.  
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and 
> depends on the fact that this tokenizer breaks tokens on the ZWNJ character 
> (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ 
> is not a word boundary.  Robert Muir has suggested using a char filter 
> converting ZWNJ to spaces prior to StandardTokenizer in the converted 
> PersianAnalyzer.
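Robert Muir's suggested workaround amounts to a one-character mapping applied before tokenization. Sketched here as a plain string transform rather than Lucene's CharFilter API (class and method names are illustrative):

```java
public class ZwnjToSpace {
    // Map zero-width non-joiner (U+200C) to a space so a UAX#29-based
    // tokenizer breaks Persian tokens where ArabicLetterTokenizer did.
    static String mapZwnj(String text) {
        return text.replace('\u200C', ' ');
    }

    public static void main(String[] args) {
        String persian = "می\u200Cرود"; // two morphemes joined by ZWNJ
        System.out.println(mapZwnj(persian).indexOf(' ')); // 2
    }
}
```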




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930670#action_12930670
 ] 

DM Smith commented on LUCENE-2747:
--

Robert/Steven, I'm sorry. I fat-fingered the last post. I really need to take 
more care.
s/Standard/Simple/;

That is, SimpleAnalyzer is not appropriate for many languages. If it were based 
upon a variation of UAX29Tokenizer that produced WORD tokens instead of NUM and 
ALPHANUM, it would emit the same type of token stream: just alpha words.

Regarding compatibility: I think the results for English would be nearly, if not 
exactly, identical. Western European would be only slightly off from identical. 
But for other languages it would be an improvement.
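The "just alpha words" variation described above can be sketched without jflex, since the JDK's BreakIterator also implements UAX#29-style word boundaries (illustrative only; Lucene's real tokenizers are jflex-generated):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AlphaWords {
    // UAX#29 word segmentation, keeping only segments that contain a letter
    // -- roughly a WORD-only stream, with no separate NUM/ALPHANUM types.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String seg = text.substring(start, end);
            if (seg.codePoints().anyMatch(Character::isLetter)) {
                out.add(seg);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Punctuation, spaces, and the number run "42" are all dropped.
        System.out.println(words("Hello, 42 worlds!"));
    }
}
```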

At this point, I'm content with what you guys are doing with non-English texts. 
Great job.






[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930578#action_12930578
 ] 

DM Smith commented on LUCENE-2747:
--

{quote}
bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (4.x codeline) there is no core, contrib, or solr for analyzer 
components any more. they are all combined into modules/analysis.
In branch_3x (3.x codeline) we did not make this rather disruptive refactor: 
there UAX29Tokenizer is in fact in lucene core.
{quote}

I meant o.a.l.analysis.core. I'd expect the *premier* analyzers to be in core.

{quote}
bq. Is there a point to having SimpleAnalyzer

I guess so, a lot of people can use this if they have english-only content and 
are probably happy with discard numbers etc... its not a big loss to me if it 
goes though.
{quote}

I guess I meant: shouldn't SimpleAnalyzer just be constructed the same as 
StandardAnalyzer, with the addition of a filter that pitches tokens that are not 
needed?
With the suggestion in LUCENE-2167 to use UAX29Tokenizer for StandardAnalyzer, 
effectively deprecating EMAIL and URL and possibly adding some kind of 
PUNCTUATION type (so that URLs/emails/acronyms... can be reconstructed, if 
someone desires), StandardAnalyzer is about as simple as one can get while 
properly handling non-English/non-Western languages. It just creates the 
ALPHANUM, NUM, and PUNCTUATION (if added) types that SimpleAnalyzer does not 
care about.






[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-10 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930585#action_12930585
 ] 

DM Smith commented on LUCENE-2747:
--

Robert, let me ask another way: how about implementing StandardTokenizer, using 
jflex, to be UAX29Tokenizer minus NUM and ALPHANUM?





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930448#action_12930448
 ] 

DM Smith commented on LUCENE-2747:
--

Robert, I think we are on the same wavelength. Thanks.

I like the idea of declarative analyzers, too.

Regarding the "last 2 points": has anyone given input to the JFlex team on these 
needs?





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-09 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930119#action_12930119
 ] 

DM Smith commented on LUCENE-2747:
--

bq. DM, can you elaborate here?

I was a bit trigger-happy with the comment. I should have looked at the code 
rather than the JIRA comments alone. The old StandardAnalyzer had a kitchen-sink 
approach to tokenization, trying to do too much with *modern* constructs, 
e.g. URLs, email addresses, acronyms... It and SimpleAnalyzer would produce 
about the same stream on "old" English and some other texts, but 
StandardAnalyzer was much slower. (I don't remember how much slower, but it was 
obvious.)

Both of these were weak when it came to non-English/non-Western texts. Thus I 
could take the language-specific tokenizers, lists of stop words, and stemmers 
and create variations of SimpleAnalyzer that properly handled a particular 
language. (I created my own analyzers because I wanted to make stop words and 
stemming optional.)

In looking at the code in trunk (I should have done that before making my 
comment), I see that UAX29Tokenizer is duplicated in StandardAnalyzer's jflex 
and that ClassicAnalyzer is the old jflex. Also, the new StandardAnalyzer does a 
lot less.

If I understand the suggestion of this and the other two issues, 
StandardAnalyzer will no longer handle modern constructs. As I see it, this is 
what SimpleAnalyzer should be: based on UAX#29, doing little else. Thus my 
confusion. Is there a point to having SimpleAnalyzer? Shouldn't UAX29Tokenizer 
be moved to core? (What is core, anyway?)

And if I understand where this is going: would there be a way to plug in 
ICUTokenizer as a replacement for UAX29Tokenizer in StandardTokenizer, such 
that all Analyzers using StandardTokenizer would get the alternate 
implementation?





[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

2010-11-08 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929934#action_12929934
 ] 

DM Smith commented on LUCENE-2747:
--

I'm not too keen on this. For classics and ancient texts the standard analyzer 
is not as good as the simple analyzer. I think it is important to have a 
tokenizer that does not try to be too smart. I think it'd be good to have a 
SimpleAnalyzer based upon UAX#29, too.

Then I'd be happy.





[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

2010-05-17 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868230#action_12868230
 ] 

DM Smith commented on LUCENE-2167:
--

{quote}
bq.Naming will require some thought, though - I don't like EnglishTokenizer or 
EuropeanTokenizer - both seem to exclude valid constituencies.
What valid constituencies do you refer to?
{quote}
{quote}
Well, we can't call it English/EuropeanTokenizer (maybe 
EnglishAndEuropeanAnalyzer? seems too long), and calling it either only English 
or only European seems to leave the other out. Americans, e.g., don't consider 
themselves European, maybe not even linguistically (however incorrect that 
might be).
{quote}

Tongue in cheek: by and large, these are Romance languages (i.e. Latin 
derivatives), and the constructs being considered for special processing are, 
for the most part, fairly recent additions to those languages. So how about 
*ModernRomanceAnalyzer*?

> Implement StandardTokenizer with the UAX#29 Standard
> 
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 3.1
>Reporter: Shyamal Prasad
>Assignee: Steven Rowe
>Priority: Minor
> Attachments: LUCENE-2167.benchmark.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere to the standard as 
> closely as we can with JFlex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the European 
> analyzers.
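The word-boundary behavior the proposed StandardTokenizer would adopt can be previewed with the JDK's own java.text.BreakIterator, whose word instance follows UAX#29 segmentation rules. This is a minimal sketch with no Lucene dependency; the class name and the filtering of non-alphanumeric segments are illustrative choices, not Lucene's actual implementation:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Uax29Sketch {
    // Segment text on UAX#29 word boundaries and keep only segments
    // that contain at least one letter or digit (dropping punctuation
    // and whitespace segments that BreakIterator also reports).
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // UAX#29 keeps "don't" together (apostrophe is MidNumLet),
        // while "XY&Z" splits around the ampersand.
        System.out.println(words("don't stop XY&Z"));
    }
}
```

Running this illustrates the difference from ad-hoc European tokenization rules: the boundaries come from the Unicode standard's word-break properties rather than hand-written acronym/company/apostrophe heuristics.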




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866954#action_12866954
 ] 

DM Smith commented on LUCENE-2458:
--

As I see it there are two issues:
1) Backward compatibility. 
2) Correctness according to the syntax definition of a query.

Let me preface the following by saying I have not studied the query parser in 
Lucene. Over 20 years ago I got an MS in compiler writing. I've been away from 
it for quite a while.

So, IMHO as a former compiler writer:

Maybe I'm just not "getting it" but it should be trivial to define the grammar 
(w/ precedence for any ambiguity, if necessary) and implement it. The tokenizer 
for the parser should have the responsibility to break the input into sequences 
of meta and non-meta. This tokenizer should not be anything more than what the 
parser requires.

The non-meta text is reasonably subject to further tokenization/analysis. This 
further analysis should be entirely under the user's control. It should not be 
part of the parser.

Regarding the issue, I think it would be best if quotation marks were the sole 
criterion for determining what is a phrase, not some heuristic analysis of the 
token stream.


> queryparser shouldn't generate phrasequeries based on term count
> 
>
> Key: LUCENE-2458
> URL: https://issues.apache.org/jira/browse/LUCENE-2458
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Reporter: Robert Muir
>Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if its a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also
> makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as a heuristic to "second guess" the tokenizer and piece 
> back things it shouldn't have split, but for large collections, doing things 
> like generating phrasequeries because StandardTokenizer split a compound on a 
> dash can cause serious performance problems. Instead people should analyze 
> their text with the appropriate methods, and QueryParser should only generate 
> phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but its pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.
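To make the criticized heuristic concrete, here is a toy sketch (not Lucene's code): a stand-in bigram "analyzer" - a proxy for CJK or n-gram analysis - emits multiple tokens per whitespace-separated chunk, so the token-count heuristic silently produces a phrase query even though the user typed no quotes. The class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class HeuristicSketch {
    // Toy n-gram analyzer: split a chunk into overlapping bigrams,
    // as a CJK or n-gram analyzer would for whitespace-free text.
    static List<String> bigrams(String chunk) {
        if (chunk.length() < 2) return List.of(chunk);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 2 <= chunk.length(); i++) {
            out.add(chunk.substring(i, i + 2));
        }
        return out;
    }

    // The undocumented heuristic: after whitespace splitting, if the
    // analyzer emits more than one token for a chunk, upgrade it to a
    // phrase query - with no quotes anywhere in the input.
    static String parse(String chunk) {
        List<String> terms = bigrams(chunk);
        return terms.size() > 1 ? "PhraseQuery" + terms : "TermQuery" + terms;
    }

    public static void main(String[] args) {
        System.out.println(parse("test"));  // becomes a phrase query, unasked
        System.out.println(parse("ab"));    // single token stays a term query
    }
}
```

Under this heuristic every multi-character word an n-gram analyzer touches becomes an exact-position phrase query, which is precisely the relevance and performance trap the issue describes.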




[jira] Commented: (LUCENE-2413) Consolidate all (Solr's & Lucene's) analyzers into contrib/analyzers

2010-04-22 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859995#action_12859995
 ] 

DM Smith commented on LUCENE-2413:
--

Robert: +1

> Consolidate all (Solr's & Lucene's) analyzers into contrib/analyzers
> 
>
> Key: LUCENE-2413
> URL: https://issues.apache.org/jira/browse/LUCENE-2413
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Michael McCandless
> Fix For: 3.1
>
>
> We've been wanting to do this for quite some time now...  I think, now that 
> Solr/Lucene are merged, and we're looking at opening an unstable line of 
> development for Solr/Lucene, now is the right time to do it.
> A standalone module for all analyzers also empowers apps to separately 
> version the analyzers from which version of Solr/Lucene they use, possibly 
> enabling us to remove Version entirely from the analyzers.
> We should also do LUCENE-2309 (decouple, as much as possible, indexer from 
> the analysis API), but I don't think that issue needs to block this 
> consolidation.
> Once we do this, there is one place where our users can find all the 
> analyzers that Solr/Lucene provide.
