[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-21 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1536:
-

Attachment: LUCENE-1536.patch

* Filter has a parameter for whether deletes are included; IndexSearcher
uses it when setting the RandomAccessDocIdSet on the Scorers

* MatchAllDocsScorer, in addition to TermScorer, now properly supports
setting a RandomAccessDocIdSet 

* Added more test cases

* AndRandomAccessDocIdSet.iterator() and BitVector.iterator()
still need to be implemented 

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, e.g. 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, i.e. 1 AND 2
 AND 3 AND 4.  'u s' means 'united states' (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method 'high' means I use the random-access filter API in
 IndexSearcher's main loop.  Method 'low' means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an
 iterator up high (i.e. in IndexSearcher's search loop).
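The two methods above can be contrasted with a toy sketch, using java.util.BitSet as a stand-in for the filter's doc-id set (class and method names here are invented for illustration; this is not code from the patch):

```java
import java.util.*;

public class FilterDemo {
    // Random-access style: for each doc the scorer produces, ask the
    // filter "is doc N allowed?" -- like deleted-docs checks today.
    static List<Integer> randomAccess(int[] scorerDocs, BitSet filter) {
        List<Integer> out = new ArrayList<>();
        for (int doc : scorerDocs)
            if (filter.get(doc)) out.add(doc);
        return out;
    }

    // Iterator style: leapfrog between the scorer's docs and the
    // filter's set bits, as trunk's iterator-based search loop does.
    static List<Integer> iteratorStyle(int[] scorerDocs, BitSet filter) {
        List<Integer> out = new ArrayList<>();
        int bit = filter.nextSetBit(0);
        for (int doc : scorerDocs) {
            while (bit >= 0 && bit < doc) bit = filter.nextSetBit(bit + 1);
            if (bit == doc) out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet();
        filter.set(2); filter.set(5); filter.set(7);
        int[] docs = {1, 2, 3, 5, 8};
        System.out.println(randomAccess(docs, filter));  // [2, 5]
        System.out.println(iteratorStyle(docs, filter)); // [2, 5]
    }
}
```

Both produce the same matches; the benchmark question is which style is cheaper at each filter density.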

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701261#action_12701261
 ] 

Robert Muir commented on LUCENE-1606:
-

found this interesting article applicable to this query: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

"We show how to compute, for any fixed bound n and any input word W, a 
deterministic Levenshtein-automaton of degree n for W in time linear in the 
length of W."


 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally, all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon a constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc. on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject, do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch, but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701279#action_12701279
 ] 

Eks Dev commented on LUCENE-1606:
-

Robert, 
in order for Levenshtein Automata to work, you need to have the complete 
dictionary as a DFA. Once you have the dictionary as a DFA (or any sort of 
trie), computing simple regexes or simple fixed or weighted Levenshtein 
distances becomes a snap. Levenshtein-Automata is particularly fast at it; a 
much simpler and only slightly slower method (one page of code) is 
K. Oflazer's: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.3862

As said, you cannot really walk the current term dictionary as an 
automaton/trie (or do you have an idea on how to do that?). I guess there are 
enough applications where storing the complete Term dictionary into a RAM DFA 
is not a problem. Even making some smart (heavily cached) persistent trie/DFA 
should not be all that complex.

Or did you intend just to iterate all terms, and compute the distance faster by 
breaking off the LD matrix computation as soon as you see you have hit the 
boundary? But this requires iteration over all terms?


I have done something similar, in memory, but unfortunately someone else paid 
me for this and is not willing to share... 





[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701285#action_12701285
 ] 

Robert Muir commented on LUCENE-1606:
-

eks:

the AutomatonTermEnumerator in this patch does walk the term dictionary 
according to the transitions present in the DFA. That's what this JIRA issue is 
all about to me, not iterating all the terms! So you do not need the complete 
dictionary as a DFA.

for example: a regexp query of (a|b)cdefg with this patch seeks to 'acdefg', 
then 'bcdefg', as opposed to the current regex support, which exhaustively 
enumerates all terms.

slightly more complex example: a query of (a|b)cd*efg first seeks to 'acd' 
(because of the Kleene star operator). suppose it then encounters term 'acda'; 
it will next seek to 'acdd', etc. if it encounters 'acdf', then next it seeks 
to 'bcd'.

this patch implements regex, wildcard, and fuzzy with n=1 in terms of this 
enumeration. what it doesn't do is fuzzy with arbitrary n. 

I used the simplistic quadratic method to compute a DFA for fuzzy with n=1 for 
the FuzzyAutomatonQuery present in this patch; the paper has a more complicated 
but linear method to compute the DFA.
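For the finite-language case in the first example, the seek-driven enumeration can be sketched as follows (TERMS and matchBySeeking are invented for illustration; the patch's AutomatonTermEnumerator walks the DFA's transitions rather than pre-expanding the query language):

```java
import java.util.*;

public class SeekDemo {
    // Sorted "term dictionary", standing in for Lucene's term index.
    static final TreeSet<String> TERMS = new TreeSet<>(Arrays.asList(
        "aardvark", "acdefg", "banana", "bcdefg", "zebra"));

    // For a finite query language such as (a|b)cdefg, the matcher can
    // seek directly to each candidate instead of scanning every term.
    static List<String> matchBySeeking(List<String> language) {
        List<String> hits = new ArrayList<>();
        for (String candidate : language) {
            // ceiling() plays the role of TermEnum(candidate): jump to
            // the first term >= candidate.
            String term = TERMS.ceiling(candidate);
            if (candidate.equals(term)) hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        // (a|b)cdefg expands to two strings; two seeks replace a scan
        // over all five terms.
        System.out.println(matchBySeeking(Arrays.asList("acdefg", "bcdefg")));
        // [acdefg, bcdefg]
    }
}
```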





Re: [jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Goddard, Michael J.
Q




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701298#action_12701298
 ] 

Eks Dev commented on LUCENE-1606:
-

hmmm, sounds like a good idea, but I am still not convinced it would work for 
Fuzzy.

take a simple dictionary:
one
two
three
four 

the query Term is, e.g., ana, right? and n=1 means your DFA would be: {.na, 
a.a, an., an, na, ana, .ana, ana., a.na, an.a, ana.} where dot represents any 
character in your alphabet.

For the first element in the DFA (in expanded form) you need to visit all 
terms, no matter how you walk the DFA... or am I missing something?

Where you could save time is the actual calculation of the LD matrix for terms 
that do not pass the automaton.
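The saving mentioned in the last line, skipping the full LD matrix for most terms, can be sketched with an early-exit check for distance at most 1 (a generic sketch of the n=1 case, not code from the patch):

```java
public class Lev1 {
    // True iff editDistance(a, b) <= 1. Bails out as soon as a second
    // edit is needed, instead of filling a full LD matrix.
    static boolean within1(String a, String b) {
        if (Math.abs(a.length() - b.length()) > 1) return false;
        int i = 0, j = 0, edits = 0;
        while (i < a.length() && j < b.length()) {
            if (a.charAt(i) == b.charAt(j)) { i++; j++; continue; }
            if (++edits > 1) return false;             // early exit
            if (a.length() > b.length()) i++;          // delete from a
            else if (a.length() < b.length()) j++;     // insert into a
            else { i++; j++; }                         // substitute
        }
        // Account for any trailing characters left in either string.
        return edits + (a.length() - i) + (b.length() - j) <= 1;
    }

    public static void main(String[] args) {
        System.out.println(within1("ana", "aba")); // substitution -> true
        System.out.println(within1("ana", "an"));  // deletion -> true
        System.out.println(within1("ana", "one")); // two edits -> false
    }
}
```

For equal lengths only a substitution is possible and for lengths differing by one only a single insertion/deletion is, so the greedy repair at the first mismatch is exact for the n=1 case.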






[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701303#action_12701303
 ] 

Robert Muir commented on LUCENE-1606:
-

eks, well it does work well for fuzzy n=1 (I have tested against my huge
index).

for your simple dictionary it will do 3 comparisons instead of 4.
this is because your simple dictionary is sorted in the index as such:
four
one
three
two

when it encounters 'three', it will next ask for TermEnum(una), which will
return null.

give it a try on a big dictionary, you might be surprised :)





-- 
Robert Muir
rcm...@gmail.com





[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701304#action_12701304
 ] 

Robert Muir commented on LUCENE-1606:
-

eks, in your example it does three comparisons instead of four (not much of a 
gain for this example, but a big gain on a real index).

this is because it doesn't need to compare 'two': after encountering 'three' it 
requests TermEnum(uana), which returns null.

i hope you can see how this helps for a large index... (or I can try to 
construct a more realistic example)






[jira] Created: (LUCENE-1608) CustomScoreQuery should support arbitrary Queries

2009-04-21 Thread Steven Bethard (JIRA)
CustomScoreQuery should support arbitrary Queries
-

 Key: LUCENE-1608
 URL: https://issues.apache.org/jira/browse/LUCENE-1608
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Query/Scoring
Reporter: Steven Bethard
Priority: Minor


CustomScoreQuery only allows the secondary queries to be of type 
ValueSourceQuery instead of allowing them to be any type of Query. As a result, 
what you can do with CustomScoreQuery is pretty limited.

It would be nice to extend CustomScoreQuery to allow arbitrary Query objects. 
Most of the code should stay about the same, though a little more care would 
need to be taken in CustomScorer.score() to use 0.0 when the sub-scorer does 
not produce a score for the current document.




[jira] Updated: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

2009-04-21 Thread Thomas Morton (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Morton updated LUCENE-1550:
--

Attachment: LUCENE-1550.patch

Fixes the empty string case.
Adds some additional unit tests.

 Add N-Gram String Matching for Spell Checking
 -

 Key: LUCENE-1550
 URL: https://issues.apache.org/jira/browse/LUCENE-1550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/spellchecker
Affects Versions: 2.9
Reporter: Thomas Morton
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1550.patch, LUCENE-1550.patch, LUCENE-1550.patch


 N-Gram version of edit distance based on paper by Grzegorz Kondrak, N-gram 
 similarity and distance. Proceedings of the Twelfth International Conference 
 on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  
 Buenos Aires, Argentina, November 2005. 
 http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf




[jira] Commented: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

2009-04-21 Thread Thomas Morton (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701362#action_12701362
 ] 

Thomas Morton commented on LUCENE-1550:
---

The implementation returns a normalized edit distance (normalized by string 
length): specifically, 1 if the strings are the same and 0 if they are 
maximally different.  0 in that case makes sense, as the number of edits is 
equal to the number of characters in the longest string, so:

1 - (2 edits / 2 length) = 0
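The arithmetic above can be sketched as follows, using plain Levenshtein distance as a stand-in for the patch's n-gram variant (class and method names are invented for illustration):

```java
public class NormDist {
    // Classic dynamic-programming edit distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // 1 - edits/longestLength: 1.0 for identical strings, 0.0 when
    // every character of the longer string must be edited.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(similarity("ab", "cd")); // 1 - 2/2 = 0.0
        System.out.println(similarity("ab", "ab")); // 1.0
    }
}
```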

