[jira] Updated: (LUCENE-1406) new Arabic Analyzer (Apache license)

2008-09-26 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1406:


Attachment: arabic.zip

Attached is my implementation

> new Arabic Analyzer (Apache license)
> 
>
> Key: LUCENE-1406
> URL: https://issues.apache.org/jira/browse/LUCENE-1406
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Robert Muir
>Priority: Minor
> Attachments: arabic.zip
>
>
> I've noticed there is no Arabic analyzer for Lucene, most likely because Tim 
> Buckwalter's morphological dictionary is GPL.
> However, it is not necessary  to have full morphological analysis engine for 
> a quality arabic search. 
> This implementation implements the light-8s algorithm present in the 
> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> As you can see from the paper, improvement via this method over searching 
> surface forms (as lucene currently does) is significant, with almost 100% 
> improvement in average precision.
> While I personally don't think all the choices were the best, and some easily 
> improvements are still possible, the major motivation for implementing it 
> exactly the way it is presented in the paper is that the algorithm is 
> TREC-tested, so the precision/recall improvements to lucene are already 
> documented.
> For a stopword list, I used a list present at 
> http://members.unine.ch/jacques.savoy/clef/index.html simply because the 
> creator of this list documents the data as BSD-licensed.
> This implementation (Analyzer) consists of above mentioned stopword list plus 
> two filters:
>  ArabicNormalizationFilter: performs orthographic normalization (such as 
> hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, 
> etc)
>  ArabicStemFilter: performs arabic light stemming
> Both filters operate directly on termbuffer for maximum performance. There is 
> no object creation in this Analyzer.
> There are no external dependencies. I've indexed about half a billion words 
> of arabic text and tested against that.
> If there are any issues with this implementation I am willing to fix them. I 
> use lucene on a daily basis and would like to give something back. Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1406) new Arabic Analyzer (Apache license)

2008-09-26 Thread Robert Muir (JIRA)
new Arabic Analyzer (Apache license)


 Key: LUCENE-1406
 URL: https://issues.apache.org/jira/browse/LUCENE-1406
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Robert Muir
Priority: Minor
 Attachments: arabic.zip

I've noticed there is no Arabic analyzer for Lucene, most likely because Tim 
Buckwalter's morphological dictionary is GPL.

However, a full morphological analysis engine is not necessary for quality Arabic
search. This implementation uses the light-8s algorithm presented in the following
paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf

As the paper shows, the improvement this method gives over searching surface forms
(as Lucene currently does) is significant, with almost 100% improvement in average
precision.

While I personally don't think all the choices were the best, and some easy
improvements are still possible, the major motivation for implementing it exactly
as presented in the paper is that the algorithm is TREC-tested, so the
precision/recall improvements to Lucene are already documented.

For a stopword list, I used a list available at
http://members.unine.ch/jacques.savoy/clef/index.html simply because the creator of
that list documents the data as BSD-licensed.

This implementation (an Analyzer) consists of the above-mentioned stopword list plus
two filters:
 ArabicNormalizationFilter: performs orthographic normalization (hamza seated on
alif, alif maksura, teh marbuta, removal of harakat and tatweel, etc.)
 ArabicStemFilter: performs Arabic light stemming

Both filters operate directly on the term buffer for maximum performance; there is
no object creation in this Analyzer.
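
For illustration, a minimal sketch of how such an analyzer chain might be wired
together. The filter order and the placeholder stopwords are assumptions;
ArabicNormalizationFilter and ArabicStemFilter are the classes from the attachment:

  import java.io.Reader;
  import java.util.Set;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.StopFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;

  // Hypothetical wiring of the pieces described above.
  public class ArabicAnalyzerSketch extends Analyzer {
    // Placeholder stopwords; the real set comes from the BSD-licensed list above.
    private final Set stopSet =
        StopFilter.makeStopSet(new String[] { "stopword1", "stopword2" });

    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = new WhitespaceTokenizer(reader);
      result = new ArabicNormalizationFilter(result); // orthographic normalization first
      result = new StopFilter(result, stopSet);       // stop removal on normalized forms (assumed order)
      result = new ArabicStemFilter(result);          // light stemming last
      return result;
    }
  }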

There are no external dependencies. I've indexed about half a billion words of
Arabic text and tested against that.

If there are any issues with this implementation, I am willing to fix them. I use
Lucene on a daily basis and would like to give something back. Thanks.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1406) new Arabic Analyzer (Apache license)

2008-09-26 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1406:


Attachment: (was: arabic.zip)

> new Arabic Analyzer (Apache license)
> 
>
> Key: LUCENE-1406
> URL: https://issues.apache.org/jira/browse/LUCENE-1406
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Robert Muir
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1406.patch
>
>
> I've noticed there is no Arabic analyzer for Lucene, most likely because Tim 
> Buckwalter's morphological dictionary is GPL.
> However, it is not necessary  to have full morphological analysis engine for 
> a quality arabic search. 
> This implementation implements the light-8s algorithm present in the 
> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> As you can see from the paper, improvement via this method over searching 
> surface forms (as lucene currently does) is significant, with almost 100% 
> improvement in average precision.
> While I personally don't think all the choices were the best, and some easily 
> improvements are still possible, the major motivation for implementing it 
> exactly the way it is presented in the paper is that the algorithm is 
> TREC-tested, so the precision/recall improvements to lucene are already 
> documented.
> For a stopword list, I used a list present at 
> http://members.unine.ch/jacques.savoy/clef/index.html simply because the 
> creator of this list documents the data as BSD-licensed.
> This implementation (Analyzer) consists of above mentioned stopword list plus 
> two filters:
>  ArabicNormalizationFilter: performs orthographic normalization (such as 
> hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, 
> etc)
>  ArabicStemFilter: performs arabic light stemming
> Both filters operate directly on termbuffer for maximum performance. There is 
> no object creation in this Analyzer.
> There are no external dependencies. I've indexed about half a billion words 
> of arabic text and tested against that.
> If there are any issues with this implementation I am willing to fix them. I 
> use lucene on a daily basis and would like to give something back. Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1406) new Arabic Analyzer (Apache license)

2008-09-26 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1406:


Attachment: LUCENE-1406.patch

attached is patch

> new Arabic Analyzer (Apache license)
> 
>
> Key: LUCENE-1406
> URL: https://issues.apache.org/jira/browse/LUCENE-1406
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Robert Muir
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1406.patch
>
>
> I've noticed there is no Arabic analyzer for Lucene, most likely because Tim 
> Buckwalter's morphological dictionary is GPL.
> However, it is not necessary  to have full morphological analysis engine for 
> a quality arabic search. 
> This implementation implements the light-8s algorithm present in the 
> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> As you can see from the paper, improvement via this method over searching 
> surface forms (as lucene currently does) is significant, with almost 100% 
> improvement in average precision.
> While I personally don't think all the choices were the best, and some easily 
> improvements are still possible, the major motivation for implementing it 
> exactly the way it is presented in the paper is that the algorithm is 
> TREC-tested, so the precision/recall improvements to lucene are already 
> documented.
> For a stopword list, I used a list present at 
> http://members.unine.ch/jacques.savoy/clef/index.html simply because the 
> creator of this list documents the data as BSD-licensed.
> This implementation (Analyzer) consists of above mentioned stopword list plus 
> two filters:
>  ArabicNormalizationFilter: performs orthographic normalization (such as 
> hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, 
> etc)
>  ArabicStemFilter: performs arabic light stemming
> Both filters operate directly on termbuffer for maximum performance. There is 
> no object creation in this Analyzer.
> There are no external dependencies. I've indexed about half a billion words 
> of arabic text and tested against that.
> If there are any issues with this implementation I am willing to fix them. I 
> use lucene on a daily basis and would like to give something back. Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

2008-09-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634944#action_12634944
 ] 

Robert Muir commented on LUCENE-1406:
-

Thought I would add the following comments.

I tried to stick to the basics to start. For the record, some things that kept
bugging me:

1) In many places the stemming rules only require the stemmed token to have 2
characters. This seems incorrect (triliteral root, anyone?) and too aggressive. Yet
at the same time, many common prefix/suffix combinations are not stemmed by the
light8 algorithm. Still, it's TREC-tested.

2) There is no decomposition of Unicode presentation forms. These characters show
up in practice, typically when text is extracted from PDF. The easiest way to deal
with this is Unicode normalization, but that requires Java 6 or ICU.

3) There is no enhanced parsing. Academics typically index high-quality news text,
but in less perfect text you often see long runs written without spaces between
words wherever the characters do not join (to a human reader there still appears to
be a space). Really solving this needs a lot of special machinery, including
morphological data, but you can partially handle the common cases by splitting
words at 100% certain boundaries such as a medial teh marbuta, a medial alef
maksura, or a doubled alef. I didn't do this because I wanted to keep it simple,
but it's important; see here:
http://papers.ldc.upenn.edu/COLING2004/Buckwalter_Arabic-orthography-morphology.pdf

4) This is simply a stemmer, but I read in the Lucene docs that it is possible to
inject synonym-like information (multiple tokens for one word) and boost the score
for certain ones. That seems better than simply stemming, at least indexing and
boosting the normalized surface form for better precision. I'd want to set up TREC
tests to actually measure this, though.
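
As a side note on point 2, a minimal sketch of that normalization step, assuming
Java 6's java.text.Normalizer (NFKC folds the Arabic presentation forms back to
their base letters). This is an illustration, not part of the attached patch:

  import java.text.Normalizer;

  // Hypothetical helper: decompose presentation forms (e.g. U+FEF7) before analysis.
  public final class PresentationFormFolder {
    public static String fold(String text) {
      return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }
  }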


> new Arabic Analyzer (Apache license)
> 
>
> Key: LUCENE-1406
> URL: https://issues.apache.org/jira/browse/LUCENE-1406
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Robert Muir
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1406.patch
>
>
> I've noticed there is no Arabic analyzer for Lucene, most likely because Tim 
> Buckwalter's morphological dictionary is GPL.
> However, it is not necessary  to have full morphological analysis engine for 
> a quality arabic search. 
> This implementation implements the light-8s algorithm present in the 
> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> As you can see from the paper, improvement via this method over searching 
> surface forms (as lucene currently does) is significant, with almost 100% 
> improvement in average precision.
> While I personally don't think all the choices were the best, and some easily 
> improvements are still possible, the major motivation for implementing it 
> exactly the way it is presented in the paper is that the algorithm is 
> TREC-tested, so the precision/recall improvements to lucene are already 
> documented.
> For a stopword list, I used a list present at 
> http://members.unine.ch/jacques.savoy/clef/index.html simply because the 
> creator of this list documents the data as BSD-licensed.
> This implementation (Analyzer) consists of above mentioned stopword list plus 
> two filters:
>  ArabicNormalizationFilter: performs orthographic normalization (such as 
> hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, 
> etc)
>  ArabicStemFilter: performs arabic light stemming
> Both filters operate directly on termbuffer for maximum performance. There is 
> no object creation in this Analyzer.
> There are no external dependencies. I've indexed about half a billion words 
> of arabic text and tested against that.
> If there are any issues with this implementation I am willing to fix them. I 
> use lucene on a daily basis and would like to give something back. Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-10-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637186#action_12637186
 ] 

Robert Muir commented on LUCENE-1316:
-

We use MatchAllDocsQuery also. In addition to what is described here, we got very
nice gains by overriding the Scorer.score() methods that take a HitCollector.

It seems like this shouldn't matter, but since the scorer has to score every
document, the per-document method call overhead apparently adds up.
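
Roughly the kind of override meant, as a sketch only; maxDoc, reader, and
constantScore are assumed fields of a MatchAllScorer-like Scorer subclass, and the
scoring expression is simplified:

  // Collect every non-deleted doc in one tight loop instead of going through
  // next()/doc()/score() once per document.
  public void score(HitCollector hc) throws IOException {
    boolean hasDeletions = reader.hasDeletions();
    for (int doc = 0; doc < maxDoc; doc++) {
      if (!hasDeletions || !reader.isDeleted(doc)) {
        hc.collect(doc, constantScore);
      }
    }
  }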

> Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
> 
>
> Key: LUCENE-1316
> URL: https://issues.apache.org/jira/browse/LUCENE-1316
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.3
> Environment: All
>Reporter: Todd Feak
>Priority: Minor
> Attachments: LUCENE_1316.patch, LUCENE_1316.patch, LUCENE_1316.patch, 
> MatchAllDocsQuery.java
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The isDeleted() method on IndexReader has been mentioned a number of times as 
> a potential synchronization bottleneck. However, the reason this  bottleneck 
> occurs is actually at a higher level that wasn't focused on (at least in the 
> threads I read).
> In every case I saw where a stack trace was provided to show the lock/block, 
> higher in the stack you see the MatchAllScorer.next() method. In Solr 
> paricularly, this scorer is used for "NOT" queries. We saw incredibly poor 
> performance (order of magnitude) on our load tests for NOT queries, due to 
> this bottleneck. The problem is that every single document is run through 
> this isDeleted() method, which is synchronized. Having an optimized index 
> exacerbates this issues, as there is only a single SegmentReader to 
> synchronize on, causing a major thread pileup waiting for the lock.
> By simply having the MatchAllScorer see if there have been any deletions in 
> the reader, much of this can be avoided. Especially in a read-only 
> environment for production where you have slaves doing all the high load 
> searching.
> I modified line 67 in the MatchAllDocsQuery
> FROM:
>   if (!reader.isDeleted(id)) {
> TO:
>   if (!reader.hasDeletions() || !reader.isDeleted(id)) {
> In our micro load test for NOT queries only, this was a major performance 
> improvement.  We also got the same query results. I don't believe this will 
> improve the situation for indexes that have deletions. 
> Please consider making this adjustment for a future bug fix release.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2008-11-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644541#action_12644541
 ] 

Robert Muir commented on LUCENE-1435:
-

At least in ICU, it's not completely safe. If the JVM instances run different
versions (after an upgrade, etc.), it would be a shame to find your sorts all
broken.

When comparing keys, it is important to know that both keys were generated by 
the same algorithms and weightings. Otherwise, identical strings with keys 
generated on two different dates, for example, might compare as unequal. Sort 
keys can be affected by new versions of ICU or its data tables, new sort key 
formats, or changes to the Collator.

http://www.icu-project.org/userguide/Collate_ServiceArchitecture.html

> CollationKeyFilter: convert tokens into CollationKeys encoded using 
> IndexableBinaryStringTools
> --
>
> Key: LUCENE-1435
> URL: https://issues.apache.org/jira/browse/LUCENE-1435
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 2.4
>Reporter: Steven Rowe
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and 
> then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
> be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need 
> collation for proper ordering.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2008-11-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644604#action_12644604
 ] 

Robert Muir commented on LUCENE-1435:
-

One alternative: the ICU implementation has versioning specifically for this
purpose.

The version information of Collator is a 32-bit integer. If a new version of ICU
has changes affecting the content of collation elements, the version information
will be changed. In that case, using the new version of the ICU collator will
require regenerating any saved or stored sort keys. However, since ICU 1.8.1 it is
possible to build your program so that it uses more than one version of ICU.
Therefore, you could use the current version for the features you need and use the
older version for collation.

> CollationKeyFilter: convert tokens into CollationKeys encoded using 
> IndexableBinaryStringTools
> --
>
> Key: LUCENE-1435
> URL: https://issues.apache.org/jira/browse/LUCENE-1435
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 2.4
>Reporter: Steven Rowe
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and 
> then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
> be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need 
> collation for proper ordering.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

2008-12-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652691#action_12652691
 ] 

Robert Muir commented on LUCENE-1390:
-

I am using this patch and it's working well.

Nitpick: I wonder if you could change the mapping of Ə and ə from E to A. This
character is only used in Azeri, and not too long ago (<20 years) it was written as
A with umlaut, so there is some precedent.

Thanks,
Robert

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> 
>
> Key: LUCENE-1390
> URL: https://issues.apache.org/jira/browse/LUCENE-1390
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
> Environment: any
>Reporter: Andi Vajda
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, 
> ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the 
> ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this 
> code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 
> and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede 
> ISOLatin1AccentFilter which should get deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

2008-12-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652831#action_12652831
 ] 

Robert Muir commented on LUCENE-1390:
-

Sean: from your link, "On 16th May 1992 the Latin alphabet for Azerbaijani was
slightly revised - the letter ä was replaced with ə and the order of letters was
changed as well."

I've never seen 'ae' used in its place, certainly not in the Azeri text that I am
indexing.

Andi: I'm referring to the schwa character in Azeri, U+018F (uppercase) and U+0259
(lowercase).

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> 
>
> Key: LUCENE-1390
> URL: https://issues.apache.org/jira/browse/LUCENE-1390
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
> Environment: any
>Reporter: Andi Vajda
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, 
> ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the 
> ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this 
> code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 
> and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede 
> ISOLatin1AccentFilter which should get deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

2008-12-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652834#action_12652834
 ] 

Robert Muir commented on LUCENE-1390:
-

With regard to transliteration, the BGN/PCGN standard states:

The special letter Ə, ə, known as schwa, should be reproduced in that form whenever
encountered. Use Ә (U+04D8) and ә (U+04D9) for schwa when writing in the Cyrillic
script, but use Ə (U+018F) and ə (U+0259) for schwa when writing in the Roman
alphabet.

In those instances when it cannot be reproduced, however, the letter Ä ä may be
substituted for it.

http://earth-info.nga.mil/gns/html/Romanization/Romanization_Azerbaijani.pdf


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> 
>
> Key: LUCENE-1390
> URL: https://issues.apache.org/jira/browse/LUCENE-1390
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
> Environment: any
>Reporter: Andi Vajda
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, 
> ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the 
> ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this 
> code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 
> and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede 
> ISOLatin1AccentFilter which should get deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

2008-12-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653010#action_12653010
 ] 

Robert Muir commented on LUCENE-1390:
-

Thanks guys. Just as a comment to whoever is listening, I think this is very useful
functionality.

I am indexing a lot of docs, and doing this with ICU works well, but that method
(Unicode decomposition, etc.) is very expensive and still doesn't handle many
common cases. In profiling, it was slowing down the entire indexing process.

The existing ISO filter doesn't handle many cases that actually occur in my text,
but this filter works well, appears to cover most of the common cases such as
fullwidth forms, and at the same time it is fast.

Thanks,
Robert


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> 
>
> Key: LUCENE-1390
> URL: https://issues.apache.org/jira/browse/LUCENE-1390
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
> Environment: any
>Reporter: Andi Vajda
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, 
> ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the 
> ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this 
> code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 
> and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede 
> ISOLatin1AccentFilter which should get deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

2008-12-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653064#action_12653064
 ] 

Robert Muir commented on LUCENE-1390:
-

Does ISOLatin1AccentFilter really need to be deprecated? I don't think it's
misleading; its docs could just reiterate that it only covers Latin-1 and reference
this one.

This one likewise documents which blocks it covers, and I don't expect it to
normalize U+338E to 'mg'.


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> 
>
> Key: LUCENE-1390
> URL: https://issues.apache.org/jira/browse/LUCENE-1390
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
> Environment: any
>Reporter: Andi Vajda
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, 
> ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the 
> ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this 
> code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 
> and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede 
> ISOLatin1AccentFilter which should get deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

2008-12-04 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653500#action_12653500
 ] 

Robert Muir commented on LUCENE-1390:
-

It's a bit slower, but the difference is minor. I just ran some tests with some
CPU-bound indexes that I build (these filters are right at the top of hprof.txt).

I ran them a couple of times and it looks like this; not very scientific, but it
gives an idea.

ASCIIFoldingFilter index time (ms): 143365
ISOLatin1AccentFilter index time (ms): 134649


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> 
>
> Key: LUCENE-1390
> URL: https://issues.apache.org/jira/browse/LUCENE-1390
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
> Environment: any
>Reporter: Andi Vajda
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, 
> ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the 
> ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this 
> code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 
> and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede 
> ISOLatin1AccentFilter which should get deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter

2008-12-04 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653520#action_12653520
 ] 

Robert Muir commented on LUCENE-1390:
-

Sorry, that wasn't a fair test case: a good chunk of those docs contain accents
outside of Latin-1, so ASCIIFoldingFilter was doing more work.

I reran on some heavily accented (but Latin-1-only) data and the difference was
negligible, 1% or so.

It appears ASCIIFoldingFilter only slows you down versus ISOLatin1AccentFilter in
exactly the case where it should: when you have accents outside of Latin-1 but are
using the Latin-1 filter.


> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> 
>
> Key: LUCENE-1390
> URL: https://issues.apache.org/jira/browse/LUCENE-1390
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
> Environment: any
>Reporter: Andi Vajda
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, 
> ASCIIFoldingFilter.patch
>
>
> The ISOLatin1AccentFilter is removing accents from accented characters in the 
> ISO Latin 1 character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there was a more comprehensive version of this 
> code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 
> and Latin Extended A unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter is attached. It is intended to supercede 
> ISOLatin1AccentFilter which should get deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1488) issues with standardanalyzer on multilingual text

2008-12-11 Thread Robert Muir (JIRA)
issues with standardanalyzer on multilingual text
-

 Key: LUCENE-1488
 URL: https://issues.apache.org/jira/browse/LUCENE-1488
 Project: Lucene - Java
  Issue Type: Wish
  Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor


The standard analyzer in Lucene is not exactly Unicode-friendly with regard to
breaking text into words, especially for non-alphabetic scripts. This is because it
is unaware of the Unicode word-boundary properties.

I actually couldn't figure out how the Thai analyzer could possibly be working
until I looked at the JFlex rules and saw that the codepoint range for most of the
Thai block had been added to the alphanum specification. Defining exact codepoint
ranges like this for every language could help with the problem, but you'd
basically be reimplementing the boundary properties already stated in the Unicode
standard.

In general this kind of behavior is bad in Lucene even for Latin text; for
instance, the analyzer will break words around accent marks in decomposed form.
While most Latin letter + accent combinations have composed forms in Unicode, some
do not. (This is also an issue for ASCIIFoldingFilter, I suppose.)

I've got a partially tested standard analyzer that uses an ICU rule-based
BreakIterator instead of JFlex. Using this method you can define word boundaries
according to the Unicode boundary properties. After getting it into good shape I'd
be happy to contribute it to contrib, but I wonder if there's a better solution so
that out of the box Lucene will be more friendly to non-ASCII text. Unfortunately
it seems JFlex does not support these properties, such as [\p{Word_Break = Extend}],
so that is probably the major barrier.

Thanks,
Robert





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2008-12-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655840#action_12655840
 ] 

Robert Muir commented on LUCENE-1488:
-

That's a good idea. Currently, trying to get it to pass all the standard analyzer
unit tests causes some problems, since Lucene has some rather obscure definitions
of 'number' (I think IP addresses, etc. are included) which differ dramatically
from the basic Unicode definition.

Other things of note:

Instantiating the analyzer takes a long time (a couple of seconds) because ICU must
"compile" the rules. I'm not sure of the specifics, but by compile I think that
means building a massive FSM or similar based on all the Unicode data. It's
possible to precompile the rules into a binary format, but I think this is not
currently exposed in ICU.

The Lucene tokenization pipeline makes the implementation a little hairy. I hack
around it by tokenizing on whitespace first and then acting as a token filter (just
like the Thai analyzer does, which also uses RBBI); a sketch of that splitting step
follows below. I don't think this is really that bad from a linguistic standpoint,
because the rare cases where a 'token' can contain whitespace (Persian, etc.) need
serious muscle somewhere else and should be handled by a language-specific analyzer.

I'll try to get this thing into reasonable shape, at least to document the
approach.
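
A minimal sketch of that splitting step, assuming ICU4J's word BreakIterator with
its default rules rather than the custom rule set in the patch. It splits one
whitespace-delimited chunk into words on Unicode boundaries; in the patch this
would run inside a token filter:

  import java.util.ArrayList;
  import java.util.List;
  import com.ibm.icu.text.BreakIterator;
  import com.ibm.icu.util.ULocale;

  // Split one whitespace-delimited chunk into words using Unicode boundary rules.
  public final class UnicodeWordSplitter {
    private final BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);

    public List<String> split(String chunk) {
      List<String> result = new ArrayList<String>();
      words.setText(chunk);
      int start = words.first();
      for (int end = words.next(); end != BreakIterator.DONE;
           start = end, end = words.next()) {
        String candidate = chunk.substring(start, end).trim();
        if (candidate.length() > 0) {
          result.add(candidate); // still includes punctuation-only "words"; filter further as needed
        }
      }
      return result;
    }
  }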


> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

2008-12-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1488:


Attachment: ICUAnalyzer.patch

I've attached a patch for 'ICUAnalyzer'. I see that some things involving Token
have changed, but I created the patch before that point.

I borrowed the unit tests from the standard analyzer, added comments explaining why
certain ones aren't appropriate, and disabled those.

I added some unit tests that demonstrate some of the value: correct analysis of
Arabic numerals, Hindi text, decomposed Latin diacritics, Hebrew punctuation, and
Cantonese and Linear B text outside the BMP, etc.

One issue is that setMaxTokenLength() doesn't work correctly for values > 255,
because CharTokenizer has a hardcoded private limit of 255 that I can't override.
This is a problem since I use WhitespaceTokenizer first and then break down those
tokens with the RBBI.


> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2008-12-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656040#action_12656040
 ] 

Robert Muir commented on LUCENE-1488:
-

As soon as I figure out how to invoke the ICU RBBI rule compiler, I'll see if I can
update the patch with precompiled rules so that instantiating the analyzer is cheap.

> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir (JIRA)
fastss fuzzyquery
-

 Key: LUCENE-1513
 URL: https://issues.apache.org/jira/browse/LUCENE-1513
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor


Code for doing fuzzy queries with the FastSSwC algorithm.

FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an
auxiliary offline index for fuzzy queries.
FastFuzzyQuery: similar to FuzzyQuery, except it queries the auxiliary index to
retrieve a candidate list; this list is then verified with the Levenshtein
algorithm.

Sorry, but the code is a bit messy. What I'm actually using is very different from
this, so it's pretty much untested, but at least you can see what's going on or fix
it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1513:


Attachment: fastSSfuzzy.zip

> fastss fuzzyquery
> -
>
> Key: LUCENE-1513
> URL: https://issues.apache.org/jira/browse/LUCENE-1513
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Attachments: fastSSfuzzy.zip
>
>
> code for doing fuzzyqueries with fastssWC algorithm.
> FuzzyIndexer: given a lucene field, it enumerates all terms and creates an 
> auxiliary offline index for fuzzy queries.
> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index 
> to retrieve a candidate list. this list is then verified with levenstein 
> algorithm.
> sorry but the code is a bit messy... what I'm actually using is very 
> different from this so its pretty much untested. but at least you can see 
> whats going on or fix it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661314#action_12661314
 ] 

Robert Muir commented on LUCENE-1513:
-

Otis, the discussion was on java-user.

Again, I apologize for the messy code. As mentioned there, my setup is very
specific to exactly what I am doing, and this code is in no way ready. But since
I'm currently pretty busy with other things at work, I just wanted to put something
up here for anyone else interested.

There are the issues you mentioned, and also some I mentioned on java-user: for
example, how to handle updates to the index that introduce new terms (they must be
added to the auxiliary index), or even whether an auxiliary index is the best
approach at all.

The general idea is that instead of enumerating terms to find matches, the deletion
neighborhood described in the paper is used, so that search time is not linear in
the number of terms; see the sketch below. Yes, you are correct that it can only
guarantee edit distances up to k, which is determined at index time. Perhaps this
should be configurable, but I hardcoded k=1 for simplicity; I think that covers
something like 80% of typos.

As I mentioned on the list, another idea is that you could implement FastSS (not
the wC variant) with deletion positions, maybe by using payloads. This would
require more space but would eliminate the candidate verification step. It might
also be nice to have some of their other algorithms, such as the block-based one,
available.
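
To illustrate the deletion-neighborhood idea (a sketch under the k=1 assumption,
not the attached code):

  import java.util.HashSet;
  import java.util.Set;

  // k=1 deletion neighborhood from FastSS/FastSSwC: the term itself plus every
  // string obtained by deleting a single character.
  public final class DeletionNeighborhood {
    public static Set<String> k1(String term) {
      Set<String> variants = new HashSet<String>();
      variants.add(term);
      for (int i = 0; i < term.length(); i++) {
        variants.add(term.substring(0, i) + term.substring(i + 1));
      }
      return variants;
    }
  }

Indexing "cat" stores {cat, at, ct, ca}; the typo "cta" expands to {cta, ta, ca,
ct}, which overlaps on "ca" and "ct", so "cat" comes back as a candidate and is
then verified with a real Levenshtein computation.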



> fastss fuzzyquery
> -
>
> Key: LUCENE-1513
> URL: https://issues.apache.org/jira/browse/LUCENE-1513
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Attachments: fastSSfuzzy.zip
>
>
> code for doing fuzzyqueries with fastssWC algorithm.
> FuzzyIndexer: given a lucene field, it enumerates all terms and creates an 
> auxiliary offline index for fuzzy queries.
> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index 
> to retrieve a candidate list. this list is then verified with levenstein 
> algorithm.
> sorry but the code is a bit messy... what I'm actually using is very 
> different from this so its pretty much untested. but at least you can see 
> whats going on or fix it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669576#action_12669576
 ] 

Robert Muir commented on LUCENE-1532:
-

Just a suggestion: I got better results by refining the edit-distance costs
according to keyboard layout (substituting 'd' for 'f' costs less than 'd' for
'j'), and I also penalize transpositions less.

If you have lots of terms, it helps for the edit-distance function to discriminate
between candidates better.
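
A minimal sketch of that kind of keyboard-weighted distance; the adjacency table
and the cost values here are illustrative assumptions, not taken from any patch:

  // Weighted Levenshtein where substituting physically adjacent keys is cheaper.
  public final class KeyboardEditDistance {
    private static final String[] QWERTY = { "qwertyuiop", "asdfghjkl", "zxcvbnm" };

    static double substitutionCost(char a, char b) {
      if (a == b) return 0.0;
      return adjacent(a, b) ? 0.5 : 1.0; // half cost for neighbouring keys (assumed weighting)
    }

    static boolean adjacent(char a, char b) {
      for (String row : QWERTY) {
        int ia = row.indexOf(a), ib = row.indexOf(b);
        if (ia >= 0 && ib >= 0 && Math.abs(ia - ib) == 1) return true; // same-row neighbours only, for brevity
      }
      return false;
    }

    static double distance(String s, String t) {
      double[][] d = new double[s.length() + 1][t.length() + 1];
      for (int i = 0; i <= s.length(); i++) d[i][0] = i;
      for (int j = 0; j <= t.length(); j++) d[0][j] = j;
      for (int i = 1; i <= s.length(); i++) {
        for (int j = 1; j <= t.length(); j++) {
          double sub = d[i - 1][j - 1] + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
          d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1.0, d[i][j - 1] + 1.0));
        }
      }
      return d[s.length()][t.length()];
    }
  }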

> File based spellcheck with doc frequencies supplied
> ---
>
> Key: LUCENE-1532
> URL: https://issues.apache.org/jira/browse/LUCENE-1532
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/spellchecker
>Reporter: David Bowen
>Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally 
> valid, so it can suggest a very obscure word rather than a more common word 
> which is equally close to the misspelled word that was entered.  It would be 
> very useful to have the option of supplying an integer with each word which 
> indicates its commonness.  I.e. the integer could be the document frequency 
> in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by 
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
> word, and a class which implements the interface by looking up the frequency 
> in an index.  So Lucene users can provide alternative implementations of 
> DocFrequencyInfo.  I could submit this as a patch if there is interest.  
> Alternatively, it might be better to just extend the spellcheck API to have a 
> way to supply the frequencies when you create a PlainTextDictionary, but that 
> would mean storing the frequencies somewhere when building the spellcheck 
> index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669605#action_12669605
 ] 

Robert Muir commented on LUCENE-1532:
-

When you talk about hardcoding normalization, I really don't see how it is unfair,
or even 'hardcoding', to assume a Zipfian distribution in any corpus of text when
incorporating the frequency weight.

I agree the specific corpus determines some of these properties, but at the end of
the day they all tend to have the same general distribution curve, even if the
specifics differ.

> File based spellcheck with doc frequencies supplied
> ---
>
> Key: LUCENE-1532
> URL: https://issues.apache.org/jira/browse/LUCENE-1532
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/spellchecker
>Reporter: David Bowen
>Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally 
> valid, so it can suggest a very obscure word rather than a more common word 
> which is equally close to the misspelled word that was entered.  It would be 
> very useful to have the option of supplying an integer with each word which 
> indicates its commonness.  I.e. the integer could be the document frequency 
> in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by 
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
> word, and a class which implements the interface by looking up the frequency 
> in an index.  So Lucene users can provide alternative implementations of 
> DocFrequencyInfo.  I could submit this as a patch if there is interest.  
> Alternatively, it might be better to just extend the spellcheck API to have a 
> way to supply the frequencies when you create a PlainTextDictionary, but that 
> would mean storing the frequencies somewhere when building the spellcheck 
> index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669582#action_12669582
 ] 

Robert Muir commented on LUCENE-1532:
-

I agree the frequency information is very useful, but I'm not sure the exact 
frequency number at just the word level is really that useful for spelling 
correction, assuming a normal Zipfian distribution.

Using the frequency as a basic guide ('typo or non-typo', 'common or uncommon', 
etc.) might be the best use for it.
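
Below is a minimal sketch of that idea, assuming a plain Levenshtein distance and made-up frequency buckets; none of this comes from the actual patch, it just illustrates using df as a coarse "common vs. uncommon" signal rather than as an exact number.

{code}
import java.util.Comparator;
import java.util.Map;

// Illustrative only: rank suggestions by edit distance first, then prefer
// "common" words (coarse df buckets) among candidates that are equally close.
public class CoarseFrequencyRanking {

    // Standard dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Coarse bucket instead of the raw document frequency (thresholds are made up).
    static int commonness(int docFreq) {
        if (docFreq > 1000) return 2;  // common
        if (docFreq > 10) return 1;    // uncommon but plausible
        return 0;                      // likely a typo in the corpus itself
    }

    // Order candidates: closest first, ties broken by commonness.
    // Assumes every candidate has an entry in the df map.
    static Comparator<String> suggestionOrder(final String query, final Map<String, Integer> df) {
        return new Comparator<String>() {
            public int compare(String x, String y) {
                int byDistance = levenshtein(query, x) - levenshtein(query, y);
                if (byDistance != 0) return byDistance;
                return commonness(df.get(y)) - commonness(df.get(x));
            }
        };
    }
}
{code}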


> File based spellcheck with doc frequencies supplied
> ---
>
> Key: LUCENE-1532
> URL: https://issues.apache.org/jira/browse/LUCENE-1532
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/spellchecker
>Reporter: David Bowen
>Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally 
> valid, so it can suggest a very obscure word rather than a more common word 
> which is equally close to the misspelled word that was entered.  It would be 
> very useful to have the option of supplying an integer with each word which 
> indicates its commonness.  I.e. the integer could be the document frequency 
> in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by 
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
> word, and a class which implements the interface by looking up the frequency 
> in an index.  So Lucene users can provide alternative implementations of 
> DocFrequencyInfo.  I could submit this as a patch if there is interest.  
> Alternatively, it might be better to just extend the spellcheck API to have a 
> way to supply the frequencies when you create a PlainTextDictionary, but that 
> would mean storing the frequencies somewhere when building the spellcheck 
> index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669601#action_12669601
 ] 

Robert Muir commented on LUCENE-1532:
-

I think we are on the same page here; I'm just suggesting that if the broad 
goal is to improve spellcheck, smarter distance metrics are also worth 
looking at.

In my tests I got significantly better results by tuning the ED function as 
mentioned. I also use freetts/cmudict to incorporate phonetic edit distance and 
average the two (the idea being to help with true typos but also with 
genuinely bad spellers). The downside to these tricks is that they are 
language-dependent.

For reference, the other thing I will mention is that aspell has some test data 
here: http://aspell.net/test/orig/ ; maybe it is useful in some way?


> File based spellcheck with doc frequencies supplied
> ---
>
> Key: LUCENE-1532
> URL: https://issues.apache.org/jira/browse/LUCENE-1532
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/spellchecker
>Reporter: David Bowen
>Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally 
> valid, so it can suggest a very obscure word rather than a more common word 
> which is equally close to the misspelled word that was entered.  It would be 
> very useful to have the option of supplying an integer with each word which 
> indicates its commonness.  I.e. the integer could be the document frequency 
> in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by 
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
> word, and a class which implements the interface by looking up the frequency 
> in an index.  So Lucene users can provide alternative implementations of 
> DocFrequencyInfo.  I could submit this as a patch if there is interest.  
> Alternatively, it might be better to just extend the spellcheck API to have a 
> way to supply the frequencies when you create a PlainTextDictionary, but that 
> would mean storing the frequencies somewhere when building the spellcheck 
> index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTER E

2009-02-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675381#action_12675381
 ] 

Robert Muir commented on LUCENE-1545:
-

This is an example of why I started messing with LUCENE-1488.
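
For context, here is a tiny self-contained check (my own sketch, using the JDK's BreakIterator as a stand-in for a Unicode-aware tokenizer) showing that a UAX#29-style word breaker keeps U+0364 attached to the preceding letter instead of splitting the word.

{code}
import java.text.BreakIterator;
import java.util.Locale;

// Sketch: a Unicode-aware word break should not split "moͤchte" at U+0364.
public class CombiningMarkBreakDemo {
    public static void main(String[] args) {
        String word = "mo\u0364chte"; // 'o' followed by COMBINING LATIN SMALL LETTER E
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(word);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            System.out.println("token: [" + word.substring(start, end) + "]");
        }
        // Expected output (a single token): token: [moͤchte]
    }
}
{code}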

> Standard analyzer does not correctly tokenize combining character U+0364 
> COMBINING LATIN SMALL LETTER E
> ---
>
> Key: LUCENE-1545
> URL: https://issues.apache.org/jira/browse/LUCENE-1545
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.4
> Environment: Linux x86_64, Sun Java 1.6
>Reporter: Andreas Hauser
> Fix For: 2.9
>
> Attachments: AnalyzerTest.java
>
>
> Standard analyzer does not correctly tokenize combining character U+0364 
> COMBINING LATIN SMALL LETTER E.
> The word "moͤchte" is incorrectly tokenized into "mo" "chte"; the combining 
> character is lost.
> Expected result is only one token, "moͤchte".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-29 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693591#action_12693591
 ] 

Robert Muir commented on LUCENE-1581:
-

Some comments I have on this topic:

The problems I have with the default internationalization support in Lucene 
revolve around the following:

1. Breaking text into words (parsing) is not Unicode-sensitive, 
i.e. if I have a word containing s + macron (s̄) it will not be tokenized 
correctly.

2. Various filters, like lowercase as mentioned here but also accent removal, 
are not Unicode-sensitive, 
i.e. if I have s + macron (s̄) the macron will not be removed.
This is not a normalization problem, though it's true the filters also don't work 
correctly on decomposed NF(K)D text for similar reasons. In this example, there 
is no composed form for s + macron available in Unicode, so I cannot 'hack' 
around the problem by running NFC on the text before I feed it to Lucene.

3. Unicode text must be normalized so that both queries and text are in a 
consistent representation.

One option I see is to have at least a basic analyzer that uses ICU to do the 
following (a rough sketch of parts of this follows below):
1. Break text into words correctly.
2. Provide common filters that do things like lowercasing and accent removal correctly.
3. Use a filter to normalize text to one Unicode normal form (say, NFKC by 
default).

In my opinion, having this available would solve a majority of the current 
problems.

I kinda started trying to implement some of this with LUCENE-1488... (at least 
it does step 1!)
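
The following is a rough, plain-JDK sketch of steps 2 and 3 only (accent removal that works on decomposed text, plus normalization to NFKC). It is not the ICU-based filter chain being proposed here, just an illustration under the assumption that java.text.Normalizer is available.

{code}
import java.text.Normalizer;
import java.util.Locale;

// Illustration only: decompose (NFD), strip nonspacing marks, lowercase, then NFKC.
// This handles s + U+0304 (macron) even though no precomposed form exists.
public class UnicodeFoldSketch {
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String noMarks = decomposed.replaceAll("\\p{Mn}+", ""); // drop nonspacing combining marks
        return Normalizer.normalize(noMarks.toLowerCase(Locale.ROOT), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(fold("S\u0304"));      // prints "s"
        System.out.println(fold("mo\u0364chte")); // prints "mochte"
    }
}
{code}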



> LowerCaseFilter should be able to be configured to use a specific locale.
> -
>
> Key: LUCENE-1581
> URL: https://issues.apache.org/jira/browse/LUCENE-1581
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Digy
>
> //Since I am a .Net programmer, Sample codes will be in c# but I don't think 
> that it would be a problem to understand them.
> //
> Assume an input text like "İ" and an analyzer like the one below
> {code}
>   public class SomeAnalyzer : Analyzer
>   {
>   public override TokenStream TokenStream(string fieldName, 
> System.IO.TextReader reader)
>   {
>   TokenStream t = new SomeTokenizer(reader);
>   t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>   t = new LowerCaseFilter(t);
>   return t;
>   }
> 
>   }
> {code}
>   
> ASCIIFoldingFilter will return "I", and afterwards LowerCaseFilter will return
>   "i" (if the locale is "en-US") 
>   or 
>   "ı" (if the locale is "tr-TR"), which means this token should be fed to 
> another instance of ASCIIFoldingFilter.
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, 
> but a better approach would be to add
> a new constructor to LowerCaseFilter, forcing it to use a specific locale.
> {code}
> public sealed class LowerCaseFilter : TokenFilter
> {
> /* +++ */System.Globalization.CultureInfo CultureInfo = 
> System.Globalization.CultureInfo.CurrentCulture;
> public LowerCaseFilter(TokenStream in) : base(in)
> {
> }
> /* +++ */  public LowerCaseFilter(TokenStream in, 
> System.Globalization.CultureInfo CultureInfo) : base(in)
> /* +++ */  {
> /* +++ */  this.CultureInfo = CultureInfo;
> /* +++ */  }
>   
> public override Token Next(Token result)
> {
> result = Input.Next(result);
> if (result != null)
> {
> char[] buffer = result.TermBuffer();
> int length = result.termLength;
> for (int i = 0; i < length; i++)
> /* +++ */ buffer[i] = 
> System.Char.ToLower(buffer[i],CultureInfo);
> return result;
> }
> else
> return null;
> }
> }
> {code}
> DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: automaton.patch

patch

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Attachments: automaton.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)
Automaton Query/Filter (scalable regex)
---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Attachments: automaton.patch

Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
suitable).

Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
Additionally all of the existing RegexQuery implementations in Lucene are 
really slow if there is no constant prefix. This implementation does not depend 
upon constant prefix, and runs the same query in 640ms.

Some use cases I envision:
 1. lexicography/etc on large text corpora
 2. looking for things such as urls where the prefix is not constant (http:// 
or ftp://)

The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
regular expressions into a DFA. Then, the filter "enumerates" terms in a 
special way, by using the underlying state machine. Here is my short 
description from the comments:

 The algorithm here is pretty basic. Enumerate terms but instead of a 
binary accept/reject do:
  
 1. Look at the portion that is OK (did not enter a reject state in the DFA)
 2. Generate the next possible String and seek to that.

the Query simply wraps the filter with ConstantScoreQuery.

I did not include the automaton.jar inside the patch but it can be downloaded 
from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: automatonWithWildCard.patch

Here is an updated patch with AutomatonWildCardQuery.

This implements standard Lucene Wildcard query with AutomatonFilter.

This accelerates quite a few wildcard situations, such as ??(a|b)?cd*ef
Sorry, provides no help for leading *, but definitely for leading ?.

All wildcard tests pass.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Attachments: automaton.patch, automatonWithWildCard.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699657#action_12699657
 ] 

Robert Muir commented on LUCENE-1606:
-

Mark, yeah, the enumeration helps a lot; it means far fewer comparisons, plus 
brics is *FAST*.

Inside the AutomatonFilter I describe how it could possibly be done better, but 
I was afraid I would mess it up.
It's affected somewhat by the size of the alphabet, so if you were using it 
against lots of CJK text, it might be worth it to instead use the 
State/Transition objects in the package. Transitions are described by min and 
max character intervals, and you can access the intervals in sorted order...

It's all so nice, but I figure this is a start.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Attachments: automaton.patch, automatonWithWildCard.patch, 
> automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: automatonWithWildCard2.patch

Oops, I did say in the javadocs that the score is constant / boost only, so when the 
wildcard pattern has no wildcards and rewrites to a TermQuery, wrap it with 
ConstantScoreQuery(QueryWrapperFilter(...)) to ensure this.
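
For clarity, a minimal sketch of what that rewrite looks like with the stock Lucene 2.x API. The classes below are standard Lucene; the helper itself is only illustrative, not the attached patch.

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

// When a wildcard pattern degenerates to a single term, keep the constant-score
// contract by wrapping the TermQuery in a filter-backed constant score query.
class ConstantScoreTermRewrite {
    static Query rewriteAsConstantScore(Term term, float boost) {
        Query q = new ConstantScoreQuery(new QueryWrapperFilter(new TermQuery(term)));
        q.setBoost(boost); // carry over the original query's boost
        return q;
    }
}
{code}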



> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Attachments: automaton.patch, automatonWithWildCard.patch, 
> automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699662#action_12699662
 ] 

Robert Muir commented on LUCENE-1606:
-

Mike, the thing it can't do is anything that cannot be expressed as a DFA. However, I 
think you only need an NFA for capturing-group-related things:

http://oreilly.com/catalog/regex/chapter/ch04.html

One thing is that the brics syntax is a bit different, i.e. ^ and $ are implied, 
and I think some things need to be escaped. 
So I think it can do everything RegexQuery does, but maybe a different syntax is 
required.
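
A tiny sketch of that syntax difference, using the BRICS RegExp/Automaton classes as documented on the project page (nothing here is from the patch):

{code}
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

// brics patterns are implicitly anchored: the automaton must match the whole term.
public class BricsSyntaxDemo {
    public static void main(String[] args) {
        Automaton a = new RegExp("(http|ftp)://.*").toAutomaton();
        a.determinize(); // make sure we have a DFA before enumerating terms against it
        System.out.println(a.run("http://example.com/"));     // true
        System.out.println(a.run("see http://example.com/")); // false: no implicit .* prefix
    }
}
{code}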


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonWithWildCard.patch, 
> automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699673#action_12699673
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, I agree with you, with one caveat: for this functionality to work, the enum 
must be ordered according to Term.compareTo().

Otherwise it will not work correctly...

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonWithWildCard.patch, 
> automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699685#action_12699685
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, I'll look and see how you do it for TrieRange.

If it can make the code for this simpler, that will be fantastic. Maybe by then 
I will have also figured out some way to cleanly and non-recursively use the 
min/max character intervals in the state machine to decrease the number of 
seeks and optimize a little bit.

 

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonWithWildCard.patch, 
> automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699693#action_12699693
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, thanks. I'll think on this and on other improvements. 
I'm not really confident in my ability to make the code much cleaner at the end 
of the day, but it can be made more efficient and get some things for free, as you say.
For now it is working much better than a linear scan, and the improvements won't 
change the order of growth, but might help a bit.

Do you think I should try to fix this in this issue or create a separate issue?


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonWithWildCard.patch, 
> automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-17 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: automatonMultiQuery.patch

OK, I refactored this to use FilteredTermEnum/MultiTermQuery as Uwe suggested.

On my big index it's actually faster without setting the constant score rewrite 
(maybe creating the huge bitset is expensive?).

I also changed the term enumeration to be a bit smarter, so it will now work well 
on a large alphabet like CJK.
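
As background, here is a bare skeleton of the FilteredTermEnum shape being referred to (the standard Lucene 2.x extension points). The predicate and seek logic are placeholders, not the automaton code from the patch.

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.FilteredTermEnum;

// Skeleton of a FilteredTermEnum subclass: seek to a starting term, then accept
// or reject each enumerated term until we leave the field.
class AcceptingTermEnum extends FilteredTermEnum {
    private final String field;
    private boolean end = false;

    AcceptingTermEnum(IndexReader reader, String field, String startTerm) throws IOException {
        this.field = field;
        TermEnum terms = reader.terms(new Term(field, startTerm)); // seek to the first candidate
        setEnum(terms);
    }

    protected boolean termCompare(Term term) {
        if (term == null || !field.equals(term.field())) {
            end = true;
            return false;
        }
        return accepts(term.text());
    }

    public float difference() {
        return 1.0f; // constant: this enum is used for constant-score style rewrites
    }

    public boolean endEnum() {
        return end;
    }

    private boolean accepts(String text) {
        return true; // placeholder; a real implementation would consult the DFA here
    }
}
{code}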


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700397#action_12700397
 ] 

Robert Muir commented on LUCENE-1606:
-

It's ~700ms if I call setConstantScoreRewrite(true);
it's ~150ms otherwise...
 

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700477#action_12700477
 ] 

Robert Muir commented on LUCENE-1606:
-

~116,000,000 terms.

I've seen the same behavior with other Lucene queries on this index, where I do 
not care about score and thought a filter would be best, but queries still have 
the edge.


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700478#action_12700478
 ] 

Robert Muir commented on LUCENE-1606:
-

My test queries are ones that match maybe 50-100 terms out of those 116,000,000... so 
maybe this helps paint the picture.

I can profile each one if you are curious.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700480#action_12700480
 ] 

Robert Muir commented on LUCENE-1606:
-

Well, here it is, just for the record:

In the query case (fast), time is dominated by AutomatonTermEnum.next(). This 
is what I expect.
In the filter case (slower), time is instead dominated by 
OpenBitSetIterator.next().

I've seen this with simpler (non-MultiTermQuery) queries before as well.

For this functionality I still like the constant score rewrite option, because 
there is no risk of hitting the BooleanQuery clause limit.



> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700496#action_12700496
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe: yes, I tried to think of some heuristics for this query to guess which 
method would be best.

For example, if the language of the automaton is infinite (e.g. built 
from a regular expression/wildcard with a * operator), it seems best to set 
constant score rewrite=true.

I didn't do any of this because I wasn't sure whether this constant score rewrite 
option is something that should be entirely left to the user.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700497#action_12700497
 ] 

Robert Muir commented on LUCENE-1606:
-

Yes, I just verified that I can easily and quickly detect whether the FSM can accept 
more than BooleanQuery.getMaxClauseCount() strings:

 !automaton.isFinite() || 
automaton.getFiniteStrings(BooleanQuery.getMaxClauseCount()) == null

If you think it's OK, I could set constant score rewrite=true in this case.
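
Spelled out as a small helper, this is just the expression above wrapped in a method, assuming the BRICS Automaton instance methods behave as documented:

{code}
import dk.brics.automaton.Automaton;
import org.apache.lucene.search.BooleanQuery;

// Heuristic: fall back to constant-score rewrite whenever the automaton could
// expand into more terms than BooleanQuery allows as clauses.
class RewriteHeuristic {
    static boolean needsConstantScoreRewrite(Automaton automaton) {
        int max = BooleanQuery.getMaxClauseCount();
        return !automaton.isFinite() || automaton.getFiniteStrings(max) == null;
    }
}
{code}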

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700503#action_12700503
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, OK, based on your tests I tried some of my own... On my index, when the 
query matches less than roughly 10-20% of the docs, the Query method is faster.

When it matches something over 20%, the Filter method starts to win.


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-18 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: automatonMultiQuerySmart.patch

Updated with smarter enumeration. I think this is mathematically the best you 
can get with a DFA.

For example, if the regexp is (a|b)cdefg it knows to position at acdefg, then 
bcdefg, etc.; 
if the regexp is (a|b)cd*efg it can only position at acd, etc.

nextString() is now CPU-friendly: it walks the state transition 
character intervals in sorted order instead of brute-forcing characters.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonMultiQuerySmart.patch, automatonWithWildCard.patch, 
> automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: automatonmultiqueryfuzzy.patch

This includes an alternative for another slow linear query, fuzzy query.

automatonfuzzyquery creates a DFA that accepts all strings within an edit 
distance of 1.

On my 100M-term index this works pretty well:
fuzzy: 251,219 ms
automatonfuzzy: 172 ms

While it's true it's limited to an edit distance of one, on the other hand it 
supports transposition and is fast.
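
For readers who want to see the shape of such an acceptor, here is a rough sketch that builds an edit-distance-1 automaton (including transposition) for a fixed word out of the basic BRICS building blocks. It is only an illustration of the idea, not the construction used in the attached patch; the dk.brics.automaton classes are used as documented on the project page.

{code}
import java.util.ArrayList;
import java.util.List;

import dk.brics.automaton.Automaton;
import dk.brics.automaton.BasicAutomata;
import dk.brics.automaton.BasicOperations;

// Sketch: union of the exact word plus every single substitution, deletion,
// insertion and adjacent transposition, then determinized into one DFA.
public class Ed1AutomatonSketch {

    static Automaton ed1(String w) {
        List<Automaton> alts = new ArrayList<Automaton>();
        alts.add(BasicAutomata.makeString(w)); // distance 0
        for (int i = 0; i <= w.length(); i++) {
            Automaton prefix = BasicAutomata.makeString(w.substring(0, i));
            // insertion of any single character at position i
            alts.add(concat(prefix, BasicAutomata.makeAnyChar(),
                    BasicAutomata.makeString(w.substring(i))));
            if (i < w.length()) {
                // substitution at position i
                alts.add(concat(prefix, BasicAutomata.makeAnyChar(),
                        BasicAutomata.makeString(w.substring(i + 1))));
                // deletion at position i
                alts.add(concat(prefix, BasicAutomata.makeString(w.substring(i + 1))));
            }
            if (i + 1 < w.length()) {
                // transposition of the characters at positions i and i+1
                String swapped = w.substring(0, i) + w.charAt(i + 1) + w.charAt(i)
                        + w.substring(i + 2);
                alts.add(BasicAutomata.makeString(swapped));
            }
        }
        Automaton a = BasicOperations.union(alts);
        a.determinize();
        return a;
    }

    private static Automaton concat(Automaton... parts) {
        Automaton result = parts[0];
        for (int i = 1; i < parts.length; i++) {
            result = BasicOperations.concatenate(result, parts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        Automaton a = ed1("lucene");
        System.out.println(a.run("lucene")); // true
        System.out.println(a.run("lucnee")); // true  (transposition)
        System.out.println(a.run("lucee"));  // true  (deletion)
        System.out.println(a.run("lacuna")); // false (distance > 1)
    }
}
{code}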


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701261#action_12701261
 ] 

Robert Muir commented on LUCENE-1606:
-

I found this interesting article, applicable to this query: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

"We show how to compute, for any fixed bound n and any input word W, a 
deterministic Levenshtein-automaton of degree n for W in time linear in the 
length of W."


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701285#action_12701285
 ] 

Robert Muir commented on LUCENE-1606:
-

eks:

the AutomatonTermEnumerator in this patch does walk the term dictionary 
according to the transitions present in the DFA. That's what this JIRA issue is 
all about to me, not iterating all the terms! So you do not need the complete 
dictionary as a DFA.

for example: a regexp query of (a|b)cdefg with this patch seeks to 'acdefg', 
then 'bcdefg', as opposed to the current regex support which exhaustively 
enumerates all terms.

a slightly more complex example: a query of (a|b)cd*efg first seeks to 'acd' 
(because of the Kleene star operator). suppose it then encounters the term 
'acda'; it will next seek to 'acdd', etc. if it encounters 'acdf', then next it 
seeks to 'bcd'.

this patch implements regex, wildcard, and fuzzy with n=1 in terms of this 
enumeration. what it doesn't do is fuzzy with arbitrary n.

I used the simplistic quadratic method to compute a DFA for fuzzy with n=1 for 
the FuzzyAutomatonQuery present in this patch; the paper has a more complicated 
but linear method to compute the DFA.
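
To make the skipping concrete, here is a small self-contained sketch (again 
assuming the BRICS API, with an in-memory TreeSet standing in for the sorted 
term dictionary); the "next possible string" computation is deliberately 
simplified to "keep the accepted prefix and bump the rejected character", 
whereas the patch uses the DFA's transitions to jump further ahead:

import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class DfaTermSkippingSketch {
    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<String>(Arrays.asList(
                "acda", "acdaa", "acdab", "acdefg", "apple", "banana", "bcdefg", "zebra"));
        RunAutomaton dfa = new RunAutomaton(new RegExp("(a|b)cdefg").toAutomaton());

        String seekTo = "";                        // like a TermEnum positioned at the start
        int examined = 0;
        while (!terms.tailSet(seekTo).isEmpty()) {
            String term = terms.tailSet(seekTo).first();   // "seek" to first term >= seekTo
            examined++;
            if (dfa.run(term)) {
                System.out.println("match: " + term);
                seekTo = term + '\0';              // continue with the very next term
                continue;
            }
            // walk the DFA to find how much of the term is still a valid prefix
            int state = dfa.getInitialState();
            int good = 0;
            while (good < term.length()) {
                int next = dfa.step(state, term.charAt(good));
                if (next == -1) break;             // entered a reject state
                state = next;
                good++;
            }
            if (good == term.length() || term.charAt(good) == Character.MAX_VALUE) {
                seekTo = term + '\0';              // cannot bump; just advance one term
            } else {
                // every term sharing the rejected prefix is also rejected, so skip past it
                seekTo = term.substring(0, good) + (char) (term.charAt(good) + 1);
            }
        }
        System.out.println("terms examined: " + examined + " of " + terms.size());
    }
}

On this toy dictionary the loop never looks at 'acdaa' or 'acdab'; the gap only 
grows with a real index.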


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701303#action_12701303
 ] 

Robert Muir commented on LUCENE-1606:
-

eks, well it does work well for fuzzy n=1 (I have tested against my huge
index).

for your simple dictionary it will do 3 comparisons instead of 4.
this is because your simple dictionary is sorted in the index as such:
four
one
three
two

when it encounters 'three' it will next ask for a TermEnum("una") which will
return null.

give it a try on a big dictionary, you might be surprised :)





-- 
Robert Muir
rcm...@gmail.com


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701304#action_12701304
 ] 

Robert Muir commented on LUCENE-1606:
-

eks, in your example it does three comparisons instead of four (not much of a 
gain for this example, but a big gain on a real index).

this is because it doesn't need to compare 'two': after encountering 'three' it 
requests TermEnum("uana"), which returns null.

i hope you can see how this helps for a large index... (or i can try to 
construct a more realistic example)



> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701310#action_12701310
 ] 

Robert Muir commented on LUCENE-1606:
-

eks, in case this makes the explanation a little clearer for your example, 
assume a huge term dictionary where words start with a-zA-Z, for simplicity.

for each character in that alphabet it will look for 'Xana' and 'Xna' in the 
worst case.
that's 110 comparisons to check all the words that don't start with 'a'.
(the enumeration through all the words that start with 'a' is a little more 
complex).

if you have, say, 1M unique terms you can see how doing something like 100-200 
comparisons is a lot better than 1M.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-04-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703645#action_12703645
 ] 

Robert Muir commented on LUCENE-1488:
-

what version of ICU4J are you using? it needs to be >= 4.0

> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-28 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: LUCENE-1606.patch

removed use of MultiTermQuery's getTerm().

equals/hashCode are defined based upon the field and the language accepted by 
the FSM, i.e. a regex query of AB.*C equals() a wildcard query of AB*C because 
they accept the same language.
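
For illustration only (not the patch code), a wildcard pattern can be 
translated into BRICS regexp syntax, after which both queries are backed by 
machines accepting the same language; the helper below handles only plain 
letters plus * and ?, to keep the escaping trivial:

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class WildcardVsRegexpSketch {

    /** Translate a simple wildcard pattern (* and ?) into BRICS regexp syntax.
     *  Assumes the rest of the pattern is plain letters, so no escaping is needed. */
    static Automaton wildcardToAutomaton(String wildcard) {
        StringBuilder regexp = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*')      regexp.append(".*");   // any string
            else if (c == '?') regexp.append('.');    // any single character
            else               regexp.append(c);
        }
        return new RegExp(regexp.toString()).toAutomaton();
    }

    public static void main(String[] args) {
        RunAutomaton fromRegexp   = new RunAutomaton(new RegExp("AB.*C").toAutomaton());
        RunAutomaton fromWildcard = new RunAutomaton(wildcardToAutomaton("AB*C"));
        // the two machines accept exactly the same language, which is why the
        // two queries should compare equal(); spot-check a few strings:
        for (String s : new String[] { "ABC", "ABxyzC", "AC", "ABCD" }) {
            System.out.println(s + " -> " + fromRegexp.run(s) + " / " + fromWildcard.run(s));
        }
    }
}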


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1628) Persian Analyzer

2009-05-03 Thread Robert Muir (JIRA)
Persian Analyzer


 Key: LUCENE-1628
 URL: https://issues.apache.org/jira/browse/LUCENE-1628
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor


A simple Persian analyzer.

I measured TREC scores with the benchmark package against 
http://ece.ut.ac.ir/DBRG/Hamshahri/ (results below):

SimpleAnalyzer:
SUMMARY
  Search Seconds: 0.012
  DocName Seconds:0.020
  Num Points:   981.015
  Num Good Points:   33.738
  Max Good Points:   36.185
  Average Precision:  0.374
  MRR:0.667
  Recall: 0.905
  Precision At 1: 0.585
  Precision At 2: 0.531
  Precision At 3: 0.513
  Precision At 4: 0.496
  Precision At 5: 0.486
  Precision At 6: 0.487
  Precision At 7: 0.479
  Precision At 8: 0.465
  Precision At 9: 0.458
  Precision At 10:0.460
  Precision At 11:0.453
  Precision At 12:0.453
  Precision At 13:0.445
  Precision At 14:0.438
  Precision At 15:0.438
  Precision At 16:0.438
  Precision At 17:0.429
  Precision At 18:0.429
  Precision At 19:0.419
  Precision At 20:0.415

PersianAnalyzer:
SUMMARY
  Search Seconds: 0.004
  DocName Seconds:0.011
  Num Points:   987.692
  Num Good Points:   36.123
  Max Good Points:   36.185
  Average Precision:  0.481
  MRR:0.833
  Recall: 0.998
  Precision At 1: 0.754
  Precision At 2: 0.715
  Precision At 3: 0.646
  Precision At 4: 0.646
  Precision At 5: 0.631
  Precision At 6: 0.621
  Precision At 7: 0.593
  Precision At 8: 0.577
  Precision At 9: 0.573
  Precision At 10:0.566
  Precision At 11:0.572
  Precision At 12:0.562
  Precision At 13:0.554
  Precision At 14:0.549
  Precision At 15:0.542
  Precision At 16:0.538
  Precision At 17:0.533
  Precision At 18:0.527
  Precision At 19:0.525
  Precision At 20:0.518



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1628) Persian Analyzer

2009-05-03 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1628:


Attachment: LUCENE-1628.patch

patch file

> Persian Analyzer
> 
>
> Key: LUCENE-1628
> URL: https://issues.apache.org/jira/browse/LUCENE-1628
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-1628.patch
>
>
> A simple persian analyzer.
> i measured trec scores with the benchmark package below against 
> http://ece.ut.ac.ir/DBRG/Hamshahri/ :
> SimpleAnalyzer:
> SUMMARY
>   Search Seconds: 0.012
>   DocName Seconds:0.020
>   Num Points:   981.015
>   Num Good Points:   33.738
>   Max Good Points:   36.185
>   Average Precision:  0.374
>   MRR:0.667
>   Recall: 0.905
>   Precision At 1: 0.585
>   Precision At 2: 0.531
>   Precision At 3: 0.513
>   Precision At 4: 0.496
>   Precision At 5: 0.486
>   Precision At 6: 0.487
>   Precision At 7: 0.479
>   Precision At 8: 0.465
>   Precision At 9: 0.458
>   Precision At 10:0.460
>   Precision At 11:0.453
>   Precision At 12:0.453
>   Precision At 13:0.445
>   Precision At 14:0.438
>   Precision At 15:0.438
>   Precision At 16:0.438
>   Precision At 17:0.429
>   Precision At 18:0.429
>   Precision At 19:0.419
>   Precision At 20:0.415
> PersianAnalyzer:
> SUMMARY
>   Search Seconds: 0.004
>   DocName Seconds:0.011
>   Num Points:   987.692
>   Num Good Points:   36.123
>   Max Good Points:   36.185
>   Average Precision:  0.481
>   MRR:0.833
>   Recall: 0.998
>   Precision At 1: 0.754
>   Precision At 2: 0.715
>   Precision At 3: 0.646
>   Precision At 4: 0.646
>   Precision At 5: 0.631
>   Precision At 6: 0.621
>   Precision At 7: 0.593
>   Precision At 8: 0.577
>   Precision At 9: 0.573
>   Precision At 10:0.566
>   Precision At 11:0.572
>   Precision At 12:0.562
>   Precision At 13:0.554
>   Precision At 14:0.549
>   Precision At 15:0.542
>   Precision At 16:0.538
>   Precision At 17:0.533
>   Precision At 18:0.527
>   Precision At 19:0.525
>   Precision At 20:0.518

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706948#action_12706948
 ] 

Robert Muir commented on LUCENE-1629:
-

Hi,

I see in the paper that lexical resources were also developed for Big5 
(Traditional Chinese). Are you able to acquire these resources under a BSD 
license as well?

> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
> Attachments: analysis-data.zip, LUCENE-1629.patch
>
>
> I wrote a Analyzer for apache lucene for analyzing sentences in Chinese 
> language. it's called "imdict-chinese-analyzer", the project on google code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I)   "是"(am)   
> "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be mis-understandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in apache repository which can 
> handle Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word, this is obviously not true 
> in reality, also this strategy will increase the index size and hurt the 
> performance baddly.
> The algorithm of imdict-chinese-analyzer is based on Hidden Markov Model 
> (HMM), so it can tokenize chinese sentence in a really intelligent way. 
> Tokenizaion accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical analyzer ICTCLAL" while other analyzer's is about 
> 60%.
> As imdict-chinese-analyzer is a really fast and intelligent. I want to 
> contribute it to the apache lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707703#action_12707703
 ] 

Robert Muir commented on LUCENE-1629:
-

Xiaoping, thanks. I see they didn't get great performance with the Big5 tests; 
I was just curious.

Maybe mention somewhere in the javadocs that this analyzer is for Simplified 
Chinese text, just so it's clear?


> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, LUCENE-1629-java1.4.patch, 
> LUCENE-1629.patch
>
>
> I wrote a Analyzer for apache lucene for analyzing sentences in Chinese 
> language. it's called "imdict-chinese-analyzer", the project on google code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I)   "是"(am)   
> "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be mis-understandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in apache repository which can 
> handle Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word, this is obviously not true 
> in reality, also this strategy will increase the index size and hurt the 
> performance baddly.
> The algorithm of imdict-chinese-analyzer is based on Hidden Markov Model 
> (HMM), so it can tokenize chinese sentence in a really intelligent way. 
> Tokenizaion accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical analyzer ICTCLAL" while other analyzer's is about 
> 60%.
> As imdict-chinese-analyzer is a really fast and intelligent. I want to 
> contribute it to the apache lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709880#action_12709880
 ] 

Robert Muir commented on LUCENE-1629:
-

if you acquire the Big5 resources, do you think it would be possible to create 
a single dictionary that works with both Simplified & Traditional?

(i.e. merge the Big5 resources with the GB resources)

The reason I say this is that the existing Chinese analyzers, although they 
tokenize in a less intelligent way, are agnostic to Simplified/Traditional 
issues...


> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote a Analyzer for apache lucene for analyzing sentences in Chinese 
> language. it's called "imdict-chinese-analyzer", the project on google code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I)   "是"(am)   
> "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be mis-understandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in apache repository which can 
> handle Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word, this is obviously not true 
> in reality, also this strategy will increase the index size and hurt the 
> performance baddly.
> The algorithm of imdict-chinese-analyzer is based on Hidden Markov Model 
> (HMM), so it can tokenize chinese sentence in a really intelligent way. 
> Tokenizaion accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical analyzer ICTCLAL" while other analyzer's is about 
> 60%.
> As imdict-chinese-analyzer is a really fast and intelligent. I want to 
> contribute it to the apache lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709885#action_12709885
 ] 

Robert Muir commented on LUCENE-1629:
-

another potential issue with Big5 I want to point out is that many of the Big5 
character sets, such as HKSCS, have characters that are mapped into regions of 
Unicode outside of the BMP.

just glancing at the code, some things will need to be modified for this to 
work correctly with surrogate pairs; various functions that take a char will 
need to take a code point (int) instead, etc.
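
A small illustration of the kind of change needed (plain Java, not from the 
patch): iterate by code point rather than by char, so a surrogate pair is 
processed as one unit:

public class CodePointWalk {
    public static void main(String[] args) {
        // U+2B740 (a supplementary CJK ideograph, stored as a surrogate pair) followed by U+4E2D
        String text = "\uD86D\uDF40\u4E2D";
        System.out.println("chars: " + text.length()
                + ", code points: " + text.codePointCount(0, text.length()));
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // the full code point, even outside the BMP
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);      // advance by 1 or 2 chars
        }
    }
}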


> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote a Analyzer for apache lucene for analyzing sentences in Chinese 
> language. it's called "imdict-chinese-analyzer", the project on google code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I)   "是"(am)   
> "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be mis-understandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in apache repository which can 
> handle Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word, this is obviously not true 
> in reality, also this strategy will increase the index size and hurt the 
> performance baddly.
> The algorithm of imdict-chinese-analyzer is based on Hidden Markov Model 
> (HMM), so it can tokenize chinese sentence in a really intelligent way. 
> Tokenizaion accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical analyzer ICTCLAL" while other analyzer's is about 
> 60%.
> As imdict-chinese-analyzer is a really fast and intelligent. I want to 
> contribute it to the apache lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710070#action_12710070
 ] 

Robert Muir commented on LUCENE-1629:
-

koji, have you considered using ICU transforms for this behavior?
Not only is the rule-based language very nice (you can define variables, use 
context, etc.), but many transformations, such as "Traditional-Simplified", are 
already defined.

http://userguide.icu-project.org/transforms/general
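
For example, with ICU4J the built-in transform is a single call (a minimal 
sketch; it assumes ICU4J is on the classpath):

import com.ibm.icu.text.Transliterator;

public class TraditionalToSimplified {
    public static void main(String[] args) {
        // "Traditional-Simplified" is one of the transforms that ship with ICU
        Transliterator t = Transliterator.getInstance("Traditional-Simplified");
        System.out.println(t.transliterate("中華人民共和國"));   // prints 中华人民共和国
    }
}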


> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote a Analyzer for apache lucene for analyzing sentences in Chinese 
> language. it's called "imdict-chinese-analyzer", the project on google code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I)   "是"(am)   
> "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be mis-understandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in apache repository which can 
> handle Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word, this is obviously not true 
> in reality, also this strategy will increase the index size and hurt the 
> performance baddly.
> The algorithm of imdict-chinese-analyzer is based on Hidden Markov Model 
> (HMM), so it can tokenize chinese sentence in a really intelligent way. 
> Tokenizaion accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical analyzer ICTCLAL" while other analyzer's is about 
> 60%.
> As imdict-chinese-analyzer is a really fast and intelligent. I want to 
> contribute it to the apache lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1628) Persian Analyzer

2009-05-18 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1628:


Attachment: LUCENE-1628.patch

farsi stopwords file moved to resources folder and test to ensure it loads.


> Persian Analyzer
> 
>
> Key: LUCENE-1628
> URL: https://issues.apache.org/jira/browse/LUCENE-1628
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1628.patch, LUCENE-1628.patch
>
>
> A simple persian analyzer.
> i measured trec scores with the benchmark package below against 
> http://ece.ut.ac.ir/DBRG/Hamshahri/ :
> SimpleAnalyzer:
> SUMMARY
>   Search Seconds: 0.012
>   DocName Seconds:0.020
>   Num Points:   981.015
>   Num Good Points:   33.738
>   Max Good Points:   36.185
>   Average Precision:  0.374
>   MRR:0.667
>   Recall: 0.905
>   Precision At 1: 0.585
>   Precision At 2: 0.531
>   Precision At 3: 0.513
>   Precision At 4: 0.496
>   Precision At 5: 0.486
>   Precision At 6: 0.487
>   Precision At 7: 0.479
>   Precision At 8: 0.465
>   Precision At 9: 0.458
>   Precision At 10:0.460
>   Precision At 11:0.453
>   Precision At 12:0.453
>   Precision At 13:0.445
>   Precision At 14:0.438
>   Precision At 15:0.438
>   Precision At 16:0.438
>   Precision At 17:0.429
>   Precision At 18:0.429
>   Precision At 19:0.419
>   Precision At 20:0.415
> PersianAnalyzer:
> SUMMARY
>   Search Seconds: 0.004
>   DocName Seconds:0.011
>   Num Points:   987.692
>   Num Good Points:   36.123
>   Max Good Points:   36.185
>   Average Precision:  0.481
>   MRR:0.833
>   Recall: 0.998
>   Precision At 1: 0.754
>   Precision At 2: 0.715
>   Precision At 3: 0.646
>   Precision At 4: 0.646
>   Precision At 5: 0.631
>   Precision At 6: 0.621
>   Precision At 7: 0.593
>   Precision At 8: 0.577
>   Precision At 9: 0.573
>   Precision At 10:0.566
>   Precision At 11:0.572
>   Precision At 12:0.562
>   Precision At 13:0.554
>   Precision At 14:0.549
>   Precision At 15:0.542
>   Precision At 16:0.538
>   Precision At 17:0.533
>   Precision At 18:0.527
>   Precision At 19:0.525
>   Precision At 20:0.518

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1643) use reusable collation keys in ICUCollationFilter

2009-05-18 Thread Robert Muir (JIRA)
use reusable collation keys in ICUCollationFilter
-

 Key: LUCENE-1643
 URL: https://issues.apache.org/jira/browse/LUCENE-1643
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Robert Muir
Priority: Trivial
 Attachments: LUCENE-1643.patch

ICUCollationFilter need not create a new CollationKey object for each token.
In ICU there is a mechanism to use a reusable key.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1643) use reusable collation keys in ICUCollationFilter

2009-05-18 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1643:


Attachment: LUCENE-1643.patch

patch

> use reusable collation keys in ICUCollationFilter
> -
>
> Key: LUCENE-1643
> URL: https://issues.apache.org/jira/browse/LUCENE-1643
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Trivial
> Attachments: LUCENE-1643.patch
>
>
> ICUCollationFilter need not create a new CollationKey object for each token.
> In ICU there is a mechanism to use a reusable key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-05-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712295#action_12712295
 ] 

Robert Muir commented on LUCENE-1460:
-

is anyone working on this? I have some functionality that needs some of these 
to use the new API, so I already have at least half of them done.
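
For reference, the general shape of a filter on the new attribute-based API 
looks like the sketch below (a generic illustration, not code from this issue; 
it assumes the 2.9 TermAttribute/incrementToken API):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/** Trivial example of the new API: lowercases tokens in place on the term buffer. */
public final class ExampleNewApiFilter extends TokenFilter {
    private final TermAttribute termAtt;

    public ExampleNewApiFilter(TokenStream input) {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;                      // no more tokens
        }
        char[] buffer = termAtt.termBuffer();  // work directly on the term buffer
        int length = termAtt.termLength();
        for (int i = 0; i < length; i++) {
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
        return true;
    }
}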


> Change all contrib TokenStreams/Filters to use the new TokenStream API
> --
>
> Key: LUCENE-1460
> URL: https://issues.apache.org/jira/browse/LUCENE-1460
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
>
> Now that we have the new TokenStream API (LUCENE-1422) we should change all 
> contrib modules to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-29 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714592#action_12714592
 ] 

Robert Muir commented on LUCENE-1629:
-

otis if you are interested in japanese/korean you might find this link 
interesting:

http://bugs.icu-project.org/trac/ticket/2229

similar to the thai approach (in contrib) but with log probabilities.


> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, 
> build-resources-with-folder.patch, build-resources.patch, 
> build-resources.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, 
> LUCENE-1629-java1.4.patch
>
>
> I wrote a Analyzer for apache lucene for analyzing sentences in Chinese 
> language. it's called "imdict-chinese-analyzer", the project on google code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I)   "是"(am)   
> "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be mis-understandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in apache repository which can 
> handle Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word, this is obviously not true 
> in reality, also this strategy will increase the index size and hurt the 
> performance baddly.
> The algorithm of imdict-chinese-analyzer is based on Hidden Markov Model 
> (HMM), so it can tokenize chinese sentence in a really intelligent way. 
> Tokenizaion accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical analyzer ICTCLAL" while other analyzer's is about 
> 60%.
> As imdict-chinese-analyzer is a really fast and intelligent. I want to 
> contribute it to the apache lucene repository.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR

2009-06-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715595#action_12715595
 ] 

Robert Muir commented on LUCENE-1377:
-

the WordDelimiterFilter in Solr trunk as it stands today would be a significant 
benefit to Lucene.

Also, I think there's a very valuable use for an analyzer like the following:
WhitespaceTokenizer
WordDelimiterFilter (default settings)
LowerCaseFilter

This simple configuration would provide some much-needed functionality to 
Lucene, specifically the ability to index things like Hindi text. It's not 
perfect and will add some additional nonsense terms for some languages, but in 
the short term it is much more friendly to a variety of languages for which no 
viable option in Lucene exists at all right now.
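
A sketch of that chain as an Analyzer (WordDelimiterFilter still lives in Solr 
at this point, so its construction, which takes several flags for generating 
and catenating parts, is left as a commented placeholder rather than guessing 
at the exact constructor):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class WhitespaceDelimiterLowercaseAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);
        // stream = new WordDelimiterFilter(stream, ... default settings from Solr ...);
        return new LowerCaseFilter(stream);
    }
}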


> Add HTMLStripReader and WordDelimiterFilter from SOLR
> -
>
> Key: LUCENE-1377
> URL: https://issues.apache.org/jira/browse/LUCENE-1377
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3.2
>Reporter: Jason Rutherglen
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> SOLR has two classes HTMLStripReader and WordDelimiterFilter which are very 
> useful for a wide variety of use cases.  It would be good to place them into 
> core Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR

2009-06-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715786#action_12715786
 ] 

Robert Muir commented on LUCENE-1377:
-

i don't really know enough to say that, but this one is especially important 
(imho).

without getting into details, the changes Yonik made to WordDelimiterFilter, in 
combination with WhitespaceTokenizer, treat the various Unicode categories 
correctly, unlike any of the other analyzers in Lucene.

No, it doesn't actually use the Word_Break properties, which are really the key, 
but if you toss some Hindi, Bangla, Tibetan, Arabic, Tamil, ... [list goes on 
very long] text at it, in general it's going to work pretty well. This is a big 
improvement over any of the other default options!


> Add HTMLStripReader and WordDelimiterFilter from SOLR
> -
>
> Key: LUCENE-1377
> URL: https://issues.apache.org/jira/browse/LUCENE-1377
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3.2
>Reporter: Jason Rutherglen
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> SOLR has two classes HTMLStripReader and WordDelimiterFilter which are very 
> useful for a wide variety of use cases.  It would be good to place them into 
> core Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-06-04 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1488:


Attachment: LUCENE-1488.patch

updated patch, not ready yet, but you can see where I am going.

ICUTokenizer: Breaks text into words according to UAX #29: Unicode Text 
Segmentation. Text is divided across script boundaries so that the 
segmentation can be tailored for different writing systems; for example, Thai 
text is segmented with a different method. The default and script-specific 
rules can be tailored. In the resources folder I have some examples for 
Southeast Asian scripts, etc. Since I need script boundaries for tailoring, I 
stuff the ISO 15924 script code constant into the flags; this could be useful 
for downstream consumers.

ICUCaseFoldingFilter: Folds case according to Unicode Default Caseless Matching 
(full case folding). This may change the length of the token; for example, the 
German sharp s is folded to 'ss'. This filter interacts with the downstream 
normalization filter in a special way, so you can provide a hint as to what the 
desired normalization form will be. In the NFKC or NFKD case it will apply the 
NFKC_Closure set so you do not have to Normalize(Fold(Normalize(Fold(x)))).

ICUDigitFoldingFilter: Standardizes digits from different scripts to the Latin 
values, 0-9.

ICUFormatFilter: Removes identifier-ignorable codepoints, specifically those 
from the Format category.

ICUNormalizationFilter: Applies Unicode normalization to text. This is 
accelerated with a quick check.

ICUAnalyzer ties all this together. All of these components should also work 
correctly with surrogate-pair data.

Needs more documentation and tests. Any comments are appreciated.
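
To illustrate the fold/normalize interaction with a quick check, here is a 
rough sketch using ICU4J's UCharacter and Normalizer APIs directly (the patch's 
filters operate on the term buffer instead, so this is only the idea, not the 
implementation):

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.text.Normalizer;

public class FoldThenNormalizeSketch {
    public static String foldAndNormalize(String token) {
        // full Unicode case folding; may change the token length (e.g. sharp s -> "ss")
        String folded = UCharacter.foldCase(token, true);
        // only pay for normalization when the quick check cannot prove NFKC already holds
        if (Normalizer.quickCheck(folded, Normalizer.NFKC) != Normalizer.YES) {
            folded = Normalizer.normalize(folded, Normalizer.NFKC);
        }
        return folded;
    }

    public static void main(String[] args) {
        System.out.println(foldAndNormalize("Straße"));          // strasse
        System.out.println(foldAndNormalize("ｆｕｌｌｗｉｄｔｈ"));  // fullwidth (NFKC folds fullwidth forms)
    }
}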


> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch, LUCENE-1488.patch
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-06-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716686#action_12716686
 ] 

Robert Muir commented on LUCENE-1488:
-

here's a simple description of what the current functionality buys you; it's 
this:

all Indic languages (Hindi, Bengali, Tamil, ...) and Middle Eastern languages 
(Arabic, Hebrew, etc.) will work pretty well here (by that I mean tokenized, 
normalized, etc.). Most of these Lucene cannot parse correctly with any of the 
built-in analyzers.

obviously Lucene handles European languages quite well already, but Unicode 
still brings some improvements here, e.g. better case folding.

And finally, of course, the situation where you have data in a bunch of these 
different languages!

in general, the Unicode defaults work quite well for almost all languages, with 
the exception of CJK and Southeast Asian languages.
it's not my intent to really solve those harder cases, only to provide a 
mechanism for someone else to deal with them if they don't like the defaults.

a great example is the Arabic tokenizer: it should not exist. the Unicode 
defaults work great for that language, and it would be silly to think about a 
HindiTokenizer, BengaliTokenizer, etc., when the Unicode defaults will tokenize 
those correctly as well.

there's still some annoying complexity here, and any comments are appreciated. 
Especially tricky is the complexity-performance-maintenance balance; i.e. the 
case-folding filter could be a lot faster, but then it would have to be updated 
when a new Unicode version is released... Another thing is that I didn't 
optimize the BMP case anywhere [i.e. working at 32-bit codepoints to ensure 
surrogate data works], and I think that's worth considering... like 99.9% of 
data is in the BMP :)

Thanks,
Robert

> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch, LUCENE-1488.patch
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

2009-06-10 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718291#action_12718291
 ] 

Robert Muir commented on LUCENE-1466:
-

Just as an alternative, I have a different mechanism as part of the LUCENE-1488 patch I am working on. But maybe it's good to have options, since this one depends on the ICU library.

Below is an excerpt from the javadoc.

A TokenFilter that transforms text with ICU.

ICU provides text-transformation functionality via its Transliteration API.
Although script conversion is its most common use, a transliterator can 
actually perform a more general class of tasks. 
...
Some useful transformations for search are built-in:
* Conversion from Traditional to Simplified Chinese characters
* Conversion from Hiragana to Katakana
* Conversion from Fullwidth to Halfwidth forms.
...
Example usage:
  stream = new ICUTransformFilter(stream, Transliterator.getInstance("Traditional-Simplified"));
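For reference, here is a minimal standalone sketch of the underlying ICU4J Transliterator API that the filter above wraps (plain ICU4J, independent of the attached patch):

{code}
import com.ibm.icu.text.Transliterator;

public class TransformDemo {
    public static void main(String[] args) {
        // Traditional -> Simplified Chinese, one of the built-in ICU transforms.
        Transliterator zh = Transliterator.getInstance("Traditional-Simplified");
        System.out.println(zh.transliterate("漢語"));      // expected to print the simplified form "汉语"

        // Fullwidth -> Halfwidth forms, another transform useful for search normalization.
        Transliterator fw = Transliterator.getInstance("Fullwidth-Halfwidth");
        System.out.println(fw.transliterate("Ｌｕｃｅｎｅ")); // expected to print "Lucene"
    }
}
{code}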


> CharFilter - normalize characters before tokenizer
> --
>
> Key: LUCENE-1466
> URL: https://issues.apache.org/jira/browse/LUCENE-1466
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Affects Versions: 2.4
>Reporter: Koji Sekiguchi
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1466.patch, LUCENE-1466.patch
>
>
> This proposes to import CharFilter that has been introduced in Solr 1.4.
> Please see for the details:
> - SOLR-822
> - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1628) Persian Analyzer

2009-06-10 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718292#action_12718292
 ] 

Robert Muir commented on LUCENE-1628:
-

Mark, on the same topic: if possible, at some point it would be great to know which licenses are OK and which ones are not.


> Persian Analyzer
> 
>
> Key: LUCENE-1628
> URL: https://issues.apache.org/jira/browse/LUCENE-1628
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1628.patch, LUCENE-1628.patch
>
>
> A simple persian analyzer.
> i measured trec scores with the benchmark package below against 
> http://ece.ut.ac.ir/DBRG/Hamshahri/ :
> SimpleAnalyzer:
> SUMMARY
>   Search Seconds: 0.012
>   DocName Seconds:0.020
>   Num Points:   981.015
>   Num Good Points:   33.738
>   Max Good Points:   36.185
>   Average Precision:  0.374
>   MRR:0.667
>   Recall: 0.905
>   Precision At 1: 0.585
>   Precision At 2: 0.531
>   Precision At 3: 0.513
>   Precision At 4: 0.496
>   Precision At 5: 0.486
>   Precision At 6: 0.487
>   Precision At 7: 0.479
>   Precision At 8: 0.465
>   Precision At 9: 0.458
>   Precision At 10:0.460
>   Precision At 11:0.453
>   Precision At 12:0.453
>   Precision At 13:0.445
>   Precision At 14:0.438
>   Precision At 15:0.438
>   Precision At 16:0.438
>   Precision At 17:0.429
>   Precision At 18:0.429
>   Precision At 19:0.419
>   Precision At 20:0.415
> PersianAnalyzer:
> SUMMARY
>   Search Seconds: 0.004
>   DocName Seconds:0.011
>   Num Points:   987.692
>   Num Good Points:   36.123
>   Max Good Points:   36.185
>   Average Precision:  0.481
>   MRR:0.833
>   Recall: 0.998
>   Precision At 1: 0.754
>   Precision At 2: 0.715
>   Precision At 3: 0.646
>   Precision At 4: 0.646
>   Precision At 5: 0.631
>   Precision At 6: 0.621
>   Precision At 7: 0.593
>   Precision At 8: 0.577
>   Precision At 9: 0.573
>   Precision At 10:0.566
>   Precision At 11:0.572
>   Precision At 12:0.562
>   Precision At 13:0.554
>   Precision At 14:0.549
>   Precision At 15:0.542
>   Precision At 16:0.538
>   Precision At 17:0.533
>   Precision At 18:0.527
>   Precision At 19:0.525
>   Precision At 20:0.518

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-06-10 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1460:


Attachment: LUCENE-1460_partial.txt

Only a partial solution...
Some of the analyzers don't have any tests, so I think that's a bit more work!
The ASCIIFoldingFilter fix is in here too; I know it's not in contrib, but it doesn't support the new API either.



> Change all contrib TokenStreams/Filters to use the new TokenStream API
> --
>
> Key: LUCENE-1460
> URL: https://issues.apache.org/jira/browse/LUCENE-1460
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1460_partial.txt
>
>
> Now that we have the new TokenStream API (LUCENE-1422) we should change all 
> contrib modules to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

2009-06-10 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718300#action_12718300
 ] 

Robert Muir commented on LUCENE-1545:
-

If you are looking for a more short-term solution (since I think LUCENE-1488 will take quite a bit more time), it would be possible to make StandardAnalyzer more 'unicode-friendly'.

It's not possible to make it 'correct', and adding more Unicode friendliness would make backwards compatibility a much more complex issue (different Unicode versions across JVM versions, etc.).

But if you want, I'm willing to come up with some minor grammar changes for StandardAnalyzer that could help with things like this.


> Standard analyzer does not correctly tokenize combining character U+0364 
> COMBINING LATIN SMALL LETTRE E
> ---
>
> Key: LUCENE-1545
> URL: https://issues.apache.org/jira/browse/LUCENE-1545
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.4
> Environment: Linux x86_64, Sun Java 1.6
>Reporter: Andreas Hauser
>Priority: Minor
> Fix For: 3.0
>
> Attachments: AnalyzerTest.java
>
>
> Standard analyzer does not correctly tokenize combining character U+0364 
> COMBINING LATIN SMALL LETTRE E.
> The word "moͤchte" is incorrectly tokenized into "mo" "chte"; the combining 
> character is lost.
> The expected result is a single token, "moͤchte".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-06-11 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1460:


Attachment: LUCENE-1460_contrib_partial.txt
LUCENE-1460_core.txt

split patch: core/contrib

> Change all contrib TokenStreams/Filters to use the new TokenStream API
> --
>
> Key: LUCENE-1460
> URL: https://issues.apache.org/jira/browse/LUCENE-1460
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1460_contrib_partial.txt, LUCENE-1460_core.txt, 
> LUCENE-1460_partial.txt
>
>
> Now that we have the new TokenStream API (LUCENE-1422) we should change all 
> contrib modules to use it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

2009-06-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718825#action_12718825
 ] 

Robert Muir commented on LUCENE-1545:
-

Michael, I don't see a way in the manual to do it.

It's not just the rules, but also the JRE used to compile the rules (and its underlying Unicode definitions), so you might need separate StandardTokenizerImpls to really control the thing...

> Standard analyzer does not correctly tokenize combining character U+0364 
> COMBINING LATIN SMALL LETTRE E
> ---
>
> Key: LUCENE-1545
> URL: https://issues.apache.org/jira/browse/LUCENE-1545
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.4
> Environment: Linux x86_64, Sun Java 1.6
>Reporter: Andreas Hauser
>Priority: Minor
> Fix For: 3.1
>
> Attachments: AnalyzerTest.java
>
>
> Standard analyzer does not correctly tokenize combining character U+0364 
> COMBINING LATIN SMALL LETTRE E.
> The word "moͤchte" is incorrectly tokenized into "mo" "chte"; the combining 
> character is lost.
> The expected result is a single token, "moͤchte".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1689) supplementary character handling

2009-06-12 Thread Robert Muir (JIRA)
supplementary character handling


 Key: LUCENE-1689
 URL: https://issues.apache.org/jira/browse/LUCENE-1689
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
Priority: Minor


This is for Java 5. Java 5 is based on Unicode 4, which means a variable-width encoding (supplementary characters are represented as surrogate pairs in UTF-16).

Supplementary character support should be fixed for code that works with char/char[].

For example:
StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they don't actually remove supplementary characters, or be modified to look for surrogates and behave correctly.
LowerCaseFilter should be modified to lowercase supplementary characters correctly.
CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.

In all of these cases the code should remain optimized for the BMP case; supplementary characters should be the exception, but they should still work.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1689) supplementary character handling

2009-06-12 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1689:


Attachment: LUCENE-1689_lowercase_example.txt

An example of how LowerCaseFilter might be fixed.
I only changed the new-API method, for demonstration purposes.
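For reference, here is a rough sketch of the general approach (this is not the attached patch; it assumes, as discussed later in this issue, that lowercasing never moves a supplementary codepoint into the BMP, so the number of UTF-16 code units, and therefore the term length, does not change):

{code}
public class LowerCaseSketch {
    // Lowercase a term buffer codepoint-by-codepoint so surrogate pairs are handled.
    public static void lowerCaseInPlace(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            int cp = Character.codePointAt(buffer, i, length); // reads one char or a surrogate pair
            int lower = Character.toLowerCase(cp);             // codepoint-based lowercase
            i += Character.toChars(lower, buffer, i);          // writes 1 or 2 chars, returns the count
        }
    }

    public static void main(String[] args) {
        char[] term = "ABC\uD801\uDC00".toCharArray();         // ends with U+10400 DESERET CAPITAL LETTER LONG I
        lowerCaseInPlace(term, term.length);
        System.out.println(new String(term));                  // expected: "abc" followed by U+10428 (its lowercase form)
    }
}
{code}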

> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1689) supplementary character handling

2009-06-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718938#action_12718938
 ] 

Robert Muir commented on LUCENE-1689:
-

My example has some "issues" (i.e. it depends on the knowledge that no surrogate pairs lowercase to BMP codepoints).
It also doesn't correctly handle invalid Unicode data (unpaired surrogates).

All of that complexity can be added to the slower-path supplementary case; the intent is just to show that the fix wouldn't be a terribly invasive change.
Thanks!

> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1689) supplementary character handling

2009-06-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718942#action_12718942
 ] 

Robert Muir commented on LUCENE-1689:
-

ICU uses the following idiom to check whether a char ch is a surrogate; it might be the fastest way to ensure the performance impact is minimal:

(ch & 0xF800) == 0xD800
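Wrapped up as a helper (the class and method names are illustrative), with the range it covers spelled out:

{code}
public class SurrogateCheck {
    // Surrogates occupy U+D800..U+DFFF; masking off the low 11 bits and comparing
    // against 0xD800 tests that whole range with a single branch.
    static boolean isSurrogate(char ch) {
        return (ch & 0xF800) == 0xD800;
    }

    public static void main(String[] args) {
        System.out.println(isSurrogate('\uD835')); // true: a high surrogate
        System.out.println(isSurrogate('A'));      // false
    }
}
{code}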



> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-06-12 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1581:


Attachment: TestTurkishCollation.java

I've attached a test case showing how the collation filters in contrib solve your problem.

I think it's the best way to get locale-specific matching behavior when you know the locale: case differences, normalization, accents, the whole shebang.

Just set the strength and locale appropriately.
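A minimal JDK-only sketch of that idea (the attached test case uses the contrib collation filters instead; this just shows the locale and strength settings):

{code}
import java.text.Collator;
import java.util.Locale;

public class TurkishCollationDemo {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("tr", "TR"));
        collator.setStrength(Collator.PRIMARY);   // erase case (and accent) differences
        // Under the Turkish tailoring, dotted İ and i are the same letter,
        // so this comparison is expected to print true.
        System.out.println(collator.compare("İSTANBUL", "istanbul") == 0);
    }
}
{code}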


> LowerCaseFilter should be able to be configured to use a specific locale.
> -
>
> Key: LUCENE-1581
> URL: https://issues.apache.org/jira/browse/LUCENE-1581
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Digy
> Attachments: TestTurkishCollation.java
>
>
> //Since I am a .Net programmer, Sample codes will be in c# but I don't think 
> that it would be a problem to understand them.
> //
> Assume an input text like "İ" and and analyzer like below
> {code}
>   public class SomeAnalyzer : Analyzer
>   {
>   public override TokenStream TokenStream(string fieldName, 
> System.IO.TextReader reader)
>   {
>   TokenStream t = new SomeTokenizer(reader);
>   t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>   t = new LowerCaseFilter(t);
>   return t;
>   }
> 
>   }
> {code}
>   
> ASCIIFoldingFilter will return "I" and after, LowerCaseFilter will return
>   "i" (if locale is "en-US") 
>   or 
>   "ı' if(locale is "tr-TR") (that means,this token should be input to 
> another instance of ASCIIFoldingFilter)
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, 
> but a better approach can be adding
> a new constructor to LowerCaseFilter and forcing it to use a specific locale.
> {code}
> public sealed class LowerCaseFilter : TokenFilter
> {
> /* +++ */System.Globalization.CultureInfo CultureInfo = 
> System.Globalization.CultureInfo.CurrentCulture;
> public LowerCaseFilter(TokenStream in) : base(in)
> {
> }
> /* +++ */  public LowerCaseFilter(TokenStream in, 
> System.Globalization.CultureInfo CultureInfo) : base(in)
> /* +++ */  {
> /* +++ */  this.CultureInfo = CultureInfo;
> /* +++ */  }
>   
> public override Token Next(Token result)
> {
> result = Input.Next(result);
> if (result != null)
> {
> char[] buffer = result.TermBuffer();
> int length = result.termLength;
> for (int i = 0; i < length; i++)
> /* +++ */ buffer[i] = 
> System.Char.ToLower(buffer[i],CultureInfo);
> return result;
> }
> else
> return null;
> }
> }
> {code}
> DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-06-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719069#action_12719069
 ] 

Robert Muir commented on LUCENE-1581:
-

For reference, I think the concept of LowerCaseFilter, either with or without a Locale, is incorrect for Lucene when the intent is really to erase case differences.

There is an important distinction between converting to lowercase (for presentation) and erasing case differences (for matching and searching).

Here is an example from the Unicode standard: "Characters may also have different case mappings, depending on the context. For example, U+03A3 "Σ" GREEK CAPITAL LETTER SIGMA lowercases to U+03C3 "σ" GREEK SMALL LETTER SIGMA if it is followed by another letter, but lowercases to U+03C2 "ς" GREEK SMALL LETTER FINAL SIGMA if it is not."
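To make the distinction concrete, here is a small sketch (plain JDK plus ICU4J's UCharacter.foldCase; not part of the attached test case) contrasting contextual lowercasing with case folding:

{code}
import java.util.Locale;
import com.ibm.icu.lang.UCharacter;

public class SigmaDemo {
    public static void main(String[] args) {
        // Lowercasing is context-sensitive: the JDK applies the final-sigma rule,
        // so the trailing sigma becomes ς (good for presentation).
        System.out.println("ΟΔΟΣ".toLowerCase(Locale.ROOT));   // expected: "οδος" (ends in ς)

        // Case folding maps both σ and ς to the same character, giving one stable
        // form for matching regardless of position.
        System.out.println(UCharacter.foldCase("ΟΔΟΣ", true)); // expected: "οδοσ"
        System.out.println(UCharacter.foldCase("οδος", true)); // expected: "οδοσ"
    }
}
{code}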

The only correct methods to erase case differences are:
1) Localized (for a specific language): use a collator, as recommended here.
2) Multilingual (for a mix of languages): use either the UCA (a collator with the ROOT locale) or Unicode case folding, either of which is only an approximation of the language-specific rules involved.

Thanks!


> LowerCaseFilter should be able to be configured to use a specific locale.
> -
>
> Key: LUCENE-1581
> URL: https://issues.apache.org/jira/browse/LUCENE-1581
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Digy
> Attachments: TestTurkishCollation.java
>
>
> //Since I am a .Net programmer, Sample codes will be in c# but I don't think 
> that it would be a problem to understand them.
> //
> Assume an input text like "İ" and and analyzer like below
> {code}
>   public class SomeAnalyzer : Analyzer
>   {
>   public override TokenStream TokenStream(string fieldName, 
> System.IO.TextReader reader)
>   {
>   TokenStream t = new SomeTokenizer(reader);
>   t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>   t = new LowerCaseFilter(t);
>   return t;
>   }
> 
>   }
> {code}
>   
> ASCIIFoldingFilter will return "I" and after, LowerCaseFilter will return
>   "i" (if locale is "en-US") 
>   or 
>   "ı' if(locale is "tr-TR") (that means,this token should be input to 
> another instance of ASCIIFoldingFilter)
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, 
> but a better approach can be adding
> a new constructor to LowerCaseFilter and forcing it to use a specific locale.
> {code}
> public sealed class LowerCaseFilter : TokenFilter
> {
> /* +++ */System.Globalization.CultureInfo CultureInfo = 
> System.Globalization.CultureInfo.CurrentCulture;
> public LowerCaseFilter(TokenStream in) : base(in)
> {
> }
> /* +++ */  public LowerCaseFilter(TokenStream in, 
> System.Globalization.CultureInfo CultureInfo) : base(in)
> /* +++ */  {
> /* +++ */  this.CultureInfo = CultureInfo;
> /* +++ */  }
>   
> public override Token Next(Token result)
> {
> result = Input.Next(result);
> if (result != null)
> {
> char[] buffer = result.TermBuffer();
> int length = result.termLength;
> for (int i = 0; i < length; i++)
> /* +++ */ buffer[i] = 
> System.Char.ToLower(buffer[i],CultureInfo);
> return result;
> }
> else
> return null;
> }
> }
> {code}
> DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-06-12 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1488:


Attachment: LUCENE-1488.txt

Just an update; there is still more work to be done.

Some of the components are javadoc'ed and have pretty good tests (case folding and normalization). These might be useful to someone in the meantime.

I also added some tests to TestICUAnalyzer for various JIRA issues (LUCENE-1032, LUCENE-1215, LUCENE-1343, LUCENE-1545, etc.) that are solved here.
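For the normalization piece, here is a minimal sketch using the plain ICU4J Normalizer API (the filter and class names from the patch are not assumed here):

{code}
import com.ibm.icu.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
        String decomposed = "cafe\u0301";
        // NFC composes it into the single precomposed character U+00E9.
        String composed = Normalizer.normalize(decomposed, Normalizer.NFC);
        System.out.println(composed.equals("caf\u00E9"));   // expected: true
    }
}
{code}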


> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1689) supplementary character handling

2009-06-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719123#action_12719123
 ] 

Robert Muir commented on LUCENE-1689:
-

Michael: LowerCaseFilter doesn't incorrectly break surrogates; it just won't lowercase supplementary codepoints.

But I can get it in shape, sure.

> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-06-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719126#action_12719126
 ] 

Robert Muir commented on LUCENE-1488:
-

Michael, I don't think it will be ready for 2.9; here are some answers to your questions.

Going with your Arabic example:
The only thing this absorbs is language-specific tokenization (like ArabicLetterTokenizer), because, as mentioned, I think that's generally the wrong approach.
But it can't replace ArabicAnalyzer completely, because ArabicAnalyzer stems Arabic text in a language-specific way, which has a huge effect on retrieval quality for Arabic-language text.

On the other hand, some of what it does, the language-specific analyzers don't do.

In this specific example, it would be nice if ArabicAnalyzer really used the functionality here and then did its Arabic-specific stuff!
That is because this functionality will do things like normalize 'Arabic Presentation Forms' and handle Arabic digits, which aren't covered by ArabicAnalyzer. It will also treat any non-Arabic text in your corpus very nicely!

Yes, you are correct about the difference from StandardAnalyzer, and I would argue there are tokenization bugs in how StandardAnalyzer works with European languages too; just see LUCENE-1545!

I know StandardAnalyzer does these things. This tokenizer has some built-in types already, such as number. If you want to add more types, it's easy: just make a .txt file with your grammar, create a RuleBasedBreakIterator with it, and pass it along to the tokenizer constructor. You will have to subclass the tokenizer's getType() for any new types, though, because RBBI 'types' are really just integer codes in the rule file, and you have to map them to some text such as "WORD".

Yes, case folding will work better than lowercasing for a few European languages.


> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1689) supplementary character handling

2009-06-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719127#action_12719127
 ] 

Robert Muir commented on LUCENE-1689:
-

Simon, I want to address your comment on performance.
I think that surrogate detection is cheap when done right, and I don't think there are a ton of places that need changes.
But I don't think any indicator is really appropriate; for example, my TokenFilter might want to convert one Chinese character in the BMP to another one outside of the BMP. It is all Unicode.

There is also more than just analysis involved here; for example, I have not tested WildcardQuery's ? operator.
I'm not trying to go berserk and be 'ultra-correct', but basic things like that should work.
For situations where it's not worth it, e.g. FuzzyQuery's scoring, we should just document that the calculation is based on 'code units' and leave it alone.


> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1689) supplementary character handling

2009-06-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719139#action_12719139
 ] 

Robert Muir commented on LUCENE-1689:
-

I am curious how you plan to approach backwards compatibility.

#1 The patch is needed for correct Java 1.5 behavior; it's a 1.5 migration issue.
#2 The patch won't work on Java 1.4, because that JDK does not supply the needed functionality.
#3 It's my understanding that you like to deprecate things (say, CharTokenizer) and give people a transition period. How will this work?

I intend to supply a patch that fixes the problems, but I wanted to bring up those issues...


> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1689) supplementary character handling

2009-06-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719146#action_12719146
 ] 

Robert Muir commented on LUCENE-1689:
-

Simon, a response to your other comment, 'I think we need to act on this issue asap and release it together with 3.0. -> full support for unicode 4.0 in lucene 3.0':

Full Unicode support would be great!

But this involves a whole lot more; I'm trying to create a contrib under LUCENE-1488 for "full unicode support".
A lot of what is necessary isn't available in the Java 1.5 JDK, such as Unicode normalization.

Maybe we can settle for 'full support for the UCS (Universal Character Set)' as a start!


> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1689) supplementary character handling

2009-06-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719158#action_12719158
 ] 

Robert Muir commented on LUCENE-1689:
-

I forgot to answer your question, Michael. I had written that "it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints", and you asked: "Is it invalid to make this assumption? I.e., does the unicode standard not guarantee it?"

I do not think it guarantees this for all future Unicode versions. In my opinion, we should exploit things like this if I can show a test case that proves it's true for every codepoint in the current version of Unicode :)
It should also be documented that this could possibly change in some future version.
In this example, it's a nice simplification because it guarantees that the length (in code units) will not change!
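Such a check could be as simple as the following exhaustive scan (a plain-JDK sketch against the running JDK's Unicode tables, not a committed test):

{code}
public class SuppLowercaseCheck {
    public static void main(String[] args) {
        // Scan every supplementary codepoint and report any that lowercases into the BMP,
        // which would shrink the UTF-16 length of a term during lowercasing.
        for (int cp = 0x10000; cp <= 0x10FFFF; cp++) {
            if (Character.isDefined(cp) && Character.toLowerCase(cp) < 0x10000) {
                System.out.println("counterexample: U+" + Integer.toHexString(cp).toUpperCase());
            }
        }
        System.out.println("scan complete");
    }
}
{code}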

As a next step on this issue, I think I will create and upload a test case showing the issues and detailing some possible solutions.
For some of them, maybe a javadoc update is the most appropriate fix; for others, maybe an API change is the right way to go.
Then we can figure out what should be done.

> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1689) supplementary character handling

2009-06-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719174#action_12719174
 ] 

Robert Muir commented on LUCENE-1689:
-

Yonik, I think the problem is where method signatures must change, such as in CharTokenizer, which is required to fix LetterTokenizer and friends.

These are probably the worst offenders, as a lot of these tokenizers just remove non-BMP characters, which is really nasty behavior for CJK.

I agree it's a collection of issues, but maybe there can be an overall strategy?



> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2009-06-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719342#action_12719342
 ] 

Robert Muir commented on LUCENE-1488:
-

Earwin, I don't understand your question... 
There is no morphological processing or any other language-specific 
functionality in this patch...


> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1689) supplementary character handling

2009-06-14 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1689:


Attachment: testCurrentBehavior.txt

This is just a patch with test cases showing the existing behavior.

Perhaps these should be fixed:
Simple/Standard/StopAnalyzer/etc.: delete all supplementary characters completely.
LowerCaseFilter: doesn't lowercase supplementary characters correctly.
WildcardQuery: the ? operator does not work correctly.

Perhaps these just need some javadocs:
FuzzyQuery: scoring is strange because it is based upon surrogates; leave it alone and javadoc it.
LengthFilter: length is calculated based on UTF-16 code units (see the small illustration below); leave it alone and javadoc it.

... and there's always the option of not changing any code, but just javadoc'ing all the behavior as a "fix", providing stuff in contrib or elsewhere that works correctly.
Let me know what you think.
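For the code unit vs. codepoint distinction mentioned in the LengthFilter item above, a small plain-JDK illustration:

{code}
public class CodeUnitDemo {
    public static void main(String[] args) {
        // Two CJK Extension B ideographs (U+20000 and U+20001), each a surrogate pair in UTF-16.
        String s = "\uD840\uDC00\uD840\uDC01";
        System.out.println(s.length());                      // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 codepoints
    }
}
{code}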


> supplementary character handling
> 
>
> Key: LUCENE-1689
> URL: https://issues.apache.org/jira/browse/LUCENE-1689
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-1689_lowercase_example.txt, 
> testCurrentBehavior.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with 
> char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be 
> changed so they don't actually remove suppl characters, or modified to look 
> for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() 
> and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and 
> suppl characters should be the exception, but still work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719605#action_12719605
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, you are correct. I just took a glance at the automaton source code and saw StringBuilder, so I think it is safe to say it only works with 1.5...

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719612#action_12719612
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, sorry about this.

I did just verify that automaton.jar can be compiled for Java 5 (at least it does not have Java 1.6 dependencies), so perhaps this can be integrated in a later release.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 2.9
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719623#action_12719623
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, OK.

Not to complicate things, but related to LUCENE-1689 and Java 1.5: I could easily modify the wildcard functionality here to work correctly with supplementary characters.

This could be an alternative to fixing the WildcardQuery ? operator in core.


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1692) Contrib analyzers need tests

2009-06-15 Thread Robert Muir (JIRA)
Contrib analyzers need tests


 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir


The analyzers in contrib need tests, preferably ones that test the behavior of 
all the Token 'attributes' involved (offsets, type, etc) and not just what they 
do with token text.

This way, they can be converted to the new api without breakage.
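
As one possible shape for such a test, here is a hedged sketch using the 
attribute-based TokenStream API; the analyzer, input, and expected 
offsets/types are illustrative (shown for StandardAnalyzer), not for any 
particular contrib analyzer:

    import java.io.StringReader;
    import junit.framework.TestCase;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public class TestTokenAttributes extends TestCase {
      public void testOffsetsAndTypes() throws Exception {
        TokenStream ts = new StandardAnalyzer()
            .tokenStream("field", new StringReader("quick brown"));
        // Check term text, offsets, and type for every token, not just text
        TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
        OffsetAttribute offset = (OffsetAttribute) ts.addAttribute(OffsetAttribute.class);
        TypeAttribute type = (TypeAttribute) ts.addAttribute(TypeAttribute.class);

        assertTrue(ts.incrementToken());
        assertEquals("quick", term.term());
        assertEquals(0, offset.startOffset());
        assertEquals(5, offset.endOffset());
        assertEquals("<ALPHANUM>", type.type());

        assertTrue(ts.incrementToken());
        assertEquals("brown", term.term());
        assertEquals(6, offset.startOffset());
        assertEquals(11, offset.endOffset());

        assertFalse(ts.incrementToken());
      }
    }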

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-973) Token of "" returns in CJKTokenizer + new TestCJKTokenizer

2009-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719781#action_12719781
 ] 

Robert Muir commented on LUCENE-973:


Very nice. Although it might be a tad trickier to convert to the new API, 
anything with tests is easier!

In other words, I have the existing CJKTokenizer converted, but who's to say I 
did it right :)


> Token of  "" returns in CJKTokenizer + new TestCJKTokenizer
> ---
>
> Key: LUCENE-973
> URL: https://issues.apache.org/jira/browse/LUCENE-973
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Toru Matsuzawa
>Priority: Minor
> Fix For: 2.9
>
> Attachments: CJKTokenizer20070807.patch, LUCENE-973.patch, 
> LUCENE-973.patch, with-patch.jpg, without-patch.jpg
>
>
> The "" string returns as Token in the boundary of two byte character and one 
> byte character. 
> There is no problem in CJKAnalyzer. 
> When CJKTokenizer is used with the unit, it becomes a problem. (Use it with 
> Solr etc.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1692) Contrib analyzers need tests

2009-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719799#action_12719799
 ] 

Robert Muir commented on LUCENE-1692:
-

First I looked at BrazilianAnalyzer... out of curiosity, can someone explain to 
me how the behavior of BrazilianStemmer differs from the Portuguese Snowball 
analyzer? Because it looks like the same algorithm to me!
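
One quick way to compare them would be something like the following throwaway 
sketch (not part of the attached tests; the sample words are arbitrary and the 
output may or may not differ for any given word):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.br.BrazilianAnalyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class CompareBrazilianStemming {
      // Return the first stemmed token an analyzer produces for a single word
      static String stem(Analyzer analyzer, String word) throws Exception {
        TokenStream ts = analyzer.tokenStream("f", new StringReader(word));
        TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
        return ts.incrementToken() ? term.term() : "(no token)";
      }

      public static void main(String[] args) throws Exception {
        String[] words = { "quilometricas", "bobagem", "multiplicacao" };
        Analyzer brazilian = new BrazilianAnalyzer();
        Analyzer snowball = new SnowballAnalyzer("Portuguese");
        for (int i = 0; i < words.length; i++) {
          System.out.println(words[i]
              + "  BrazilianAnalyzer=" + stem(brazilian, words[i])
              + "  Snowball(Portuguese)=" + stem(snowball, words[i]));
        }
      }
    }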


> Contrib analyzers need tests
> 
>
> Key: LUCENE-1692
> URL: https://issues.apache.org/jira/browse/LUCENE-1692
> Project: Lucene - Java
>  Issue Type: Test
>  Components: contrib/analyzers
>Reporter: Robert Muir
>
> The analyzers in contrib need tests, preferably ones that test the behavior 
> of all the Token 'attributes' involved (offsets, type, etc) and not just what 
> they do with token text.
> This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1692) Contrib analyzers need tests

2009-06-15 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1692:


Attachment: LUCENE-1692.txt

Answered my own question; here are tests for Brazilian as a start.


> Contrib analyzers need tests
> 
>
> Key: LUCENE-1692
> URL: https://issues.apache.org/jira/browse/LUCENE-1692
> Project: Lucene - Java
>  Issue Type: Test
>  Components: contrib/analyzers
>Reporter: Robert Muir
> Attachments: LUCENE-1692.txt
>
>
> The analyzers in contrib need tests, preferably ones that test the behavior 
> of all the Token 'attributes' involved (offsets, type, etc) and not just what 
> they do with token text.
> This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1692) Contrib analyzers need tests

2009-06-15 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1692:


Attachment: LUCENE-1692.txt

Add tests for DutchAnalyzer.

This analyzer claims to implement Snowball, although the tests reveal some 
differences. It also contains about 1MB of text files that don't appear to be 
in use at all...

> Contrib analyzers need tests
> 
>
> Key: LUCENE-1692
> URL: https://issues.apache.org/jira/browse/LUCENE-1692
> Project: Lucene - Java
>  Issue Type: Test
>  Components: contrib/analyzers
>Reporter: Robert Muir
> Attachments: LUCENE-1692.txt, LUCENE-1692.txt
>
>
> The analyzers in contrib need tests, preferably ones that test the behavior 
> of all the Token 'attributes' involved (offsets, type, etc) and not just what 
> they do with token text.
> This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


