[jira] Commented: (LUCENE-1227) NGramTokenizer to handle more than 1024 chars
[ https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661125#action_12661125 ] Grant Ingersoll commented on LUCENE-1227: - Yes, please do have a look and let us know what you think. NGramTokenizer to handle more than 1024 chars - Key: LUCENE-1227 URL: https://issues.apache.org/jira/browse/LUCENE-1227 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Hiroaki Kawai Assignee: Grant Ingersoll Priority: Minor Attachments: LUCENE-1227.patch, NGramTokenizer.patch, NGramTokenizer.patch The current NGramTokenizer can't handle a character stream longer than 1024 characters. This is too short for non-whitespace-separated languages. I created a patch for this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1509) IndexCommit.getFileNames() should not return dups
[ https://issues.apache.org/jira/browse/LUCENE-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661143#action_12661143 ] Shalin Shekhar Mangar commented on LUCENE-1509: --- Thanks Michael! IndexCommit.getFileNames() should not return dups - Key: LUCENE-1509 URL: https://issues.apache.org/jira/browse/LUCENE-1509 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4, 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1509.patch If the index was created with autoCommit false, and more than 1 segment was flushed during the IndexWriter session, then the shared doc-store files are incorrectly duplicated in IndexCommit.getFileNames(). This is because that method is walking through each SegmentInfo, appending its files to a list. Since multiple SegmentInfo's may share the doc store files, this causes dups. To fix this, I've added a SegmentInfos.files(...) method, and refactored all places that were computing their files one SegmentInfo at a time to use this new method instead.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661145#action_12661145 ] Michael McCandless commented on LUCENE-1483: Mark, I see 3 testcase failures in TestSort if I pretend that SortField.STRING means STRING_ORD -- do you see that? I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods? Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661148#action_12661148 ] Michael McCandless commented on LUCENE-1483: I prototyped a rough change to the FieldComparator API, whereby TopFieldCollector calls setBottom to notify the comparator which slot is the bottom of the queue (whenever it changes), and then calls compareBottom (which replaces compare(int slot, int doc, float score)). This seems to offer decent perf. gains so I think we should make this change for real? I think it gives good gains because 1) compare to bottom is very frequent for a search that has many hits, and where the queue fairly quickly converges to the top N, 2) it allows the on-demand comparator to pre-cache the bottom's ord, and 3) it saves one array deref. TopFieldCollector would guarantee that compareBottom is not called unless setBottom was called; during the startup transient, setBottom is not called until the queue becomes full.
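The setBottom/compareBottom protocol described in the comment above can be sketched roughly as follows. This is a hypothetical standalone comparator over int values, not the actual Lucene FieldComparator class; the class name and method signatures here are invented for illustration.

```java
// Hypothetical sketch of the setBottom/compareBottom idea: the collector
// tells the comparator which queue slot is currently the bottom, so the
// hot compare-against-bottom path works off a pre-fetched value and
// avoids one slot-array dereference per candidate doc.
public class IntSlotComparator {
    private final int[] slotValues; // value cached per queue slot
    private int bottomValue;        // pre-fetched value of the bottom slot

    public IntSlotComparator(int numHits) {
        slotValues = new int[numHits];
    }

    // Called by the collector whenever the bottom of the queue changes.
    // Guaranteed to be called before compareBottom (once the queue is full).
    public void setBottom(int slot) {
        bottomValue = slotValues[slot];
    }

    // Hot path: compare a candidate doc's value against the cached bottom.
    // Positive means the candidate beats the current bottom.
    public int compareBottom(int docValue) {
        return bottomValue - docValue;
    }

    // Called when a competitive doc enters the queue at the given slot.
    public void copy(int slot, int docValue) {
        slotValues[slot] = docValue;
    }
}
```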
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661149#action_12661149 ] Michael McCandless commented on LUCENE-1483: On what ComparatorPolicy to use by default... I think we should start with ORD, but gather counters of number of compares vs number of copies, and based on those counters (and comparing to numDocs()) decide how aggressively to switch comparators? That determination should also take into account the queue size. An optimized index would always use ORD (w/o gathering counters), which is fastest. In the future... we could imagine allowing the query to dictate the order that segments are visited. EG if the query can roughly estimate how many hits it'll get on a given segment, we could order by that instead of simply numDocs(). The query could also choose an appropriate ComparatorPolicy, eg, if it estimates it'll get very few hits, VAL is best right from the start, else start with ORD. Another future fix would be to implement ORDSUB with a single pass through the queue, using a reused secondary pqueue to do the full sort of the queue. This would let us assign subords much faster, I think. But I don't think we should pursue these optimizations as part of this issue... we need to bring closure here; we already have some solid gains to capture. I think we should wrapup now... 
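The counter-gathering idea floated in the comment above (track compares vs. copies, then decide how aggressively to switch comparators) might look like this sketch. ComparatorPolicy and the ORD/VAL names follow the discussion; the class shape and the thresholds are invented for illustration and are not from any patch.

```java
// Hypothetical counter-driven policy choice: count compares vs. copies
// (queue insertions) and suggest falling back from the ord-based
// comparator once insertions become rare relative to compares.
public class PolicyCounters {
    public enum Policy { ORD, VAL }

    private long compares;
    private long copies;

    public void onCompare() { compares++; }
    public void onCopy() { copies++; }

    // Invented heuristic: after enough compares, if fewer than 1 in 100
    // led to a queue insertion, the per-segment ord conversion cost of
    // ORD no longer pays off -- suggest switching to VAL.
    public Policy suggest(Policy current) {
        if (current == Policy.ORD && compares >= 1000 && copies * 100 < compares) {
            return Policy.VAL;
        }
        return current;
    }
}
```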
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661160#action_12661160 ] Mark Miller commented on LUCENE-1483: - bq. Mark, I see 3 testcase failures in TestSort if I pretend that SortField.STRING means STRING_ORD - do you see that? Yeah, sorry. That STRING_ORD custom comparator is just a joke really, so I only really tested it on the StringSort test. It's just not initing the ords along with the values on switching. Making ords package private so that it can be changed (and changing it) fixes things. Not sure about new constructors or package private for that part of the switch... bq. I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods? Yeah, this makes sense in any case. I just keep switching them by hand as I work on them.
[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661160#action_12661160 ] markrmil...@gmail.com edited comment on LUCENE-1483 at 1/6/09 6:57 AM: - bq. Mark, I see 3 testcase failures in TestSort if I pretend that SortField.STRING means STRING_ORD - do you see that? Yeah, sorry. That STRING_ORD custom comparator policy is just a joke really, so I only really tested it on the StringSort test. It's just not initing the ords along with the values on switching. Making ords package private so that it can be changed (and changing it) fixes things. Not sure about new constructors or package private for that part of the switch... bq. I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods? Yeah, this makes sense in any case. I just keep switching them by hand as I work on them.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661165#action_12661165 ] Mark Miller commented on LUCENE-1483: - There are other little conversion steps that have to be considered as well, I think. Like when you switch to the on-demand ord comparator, you won't have the readerIndex array filled in properly, etc. (probably an issue with that example policy in there beyond the ords copy). Depending on what you come from and what you go to, a couple little hoops have to be jumped through.
[jira] Commented: (LUCENE-1304) Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene
[ https://issues.apache.org/jira/browse/LUCENE-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661199#action_12661199 ] patrick o'leary commented on LUCENE-1304: - How will LUCENE-1483 impact this immediately? I'd really like to get this patch in first and refactor if and when 1483 goes in; the benefit of bypassing the static comparator cache is really needed. Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene Key: LUCENE-1304 URL: https://issues.apache.org/jira/browse/LUCENE-1304 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.3 Environment: Windows/JDK 1.6 Reporter: Ethan Tao Attachments: LUCENE-1304.patch We had a memory leak issue when using DistanceSortSource of LocalLucene for repeated query/search. After about 450 queries, we hit an out-of-memory error. After digging into the code, we found the source of the problem in the Lucene package, in the way it handles custom type comparators. Lucene internally caches all created comparators. When querying with LocalLucene, we create a new comparator for every search due to different lon/lat and query terms. This causes a major memory leak, as the cached comparators also hold memory for other large objects (e.g., bit sets). The solution we came up with (the proposed changes to Lucene are 1 and 3 below):

1. In the Lucene package, create a new file SortComparatorSourceUncacheable.java:

package org.apache.lucene.search;

import org.apache.lucene.index.IndexReader;
import java.io.IOException;
import java.io.Serializable;

public interface SortComparatorSourceUncacheable extends Serializable {
}

2. Have your custom sort class implement the interface:

public class LocalSortSource extends DistanceSortSource implements SortComparatorSourceUncacheable {
...
}

3. Modify Lucene's FieldSortedHitQueue.java to bypass caching for custom sort comparators:

Index: FieldSortedHitQueue.java
===
--- FieldSortedHitQueue.java (revision 654583)
+++ FieldSortedHitQueue.java (working copy)
@@ -53,7 +53,12 @@
     this.fields = new SortField[n];
     for (int i=0; i<n; ++i) {
       String fieldname = fields[i].getField();
-      comparators[i] = getCachedComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+
+      if (fields[i].getFactory() instanceof SortComparatorSourceUncacheable) { // no caching to avoid memory leak
+        comparators[i] = getComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+      } else {
+        comparators[i] = getCachedComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+      }
       if (comparators[i].sortType() == SortField.STRING) {
         this.fields[i] = new SortField (fieldname, fields[i].getLocale(), fields[i].getReverse());
@@ -157,7 +162,18 @@
   SortField[] getFields() { return fields; }

+  static ScoreDocComparator getComparator (IndexReader reader, String field, int type, Locale locale, SortComparatorSource factory)
+    throws IOException {
+    if (type == SortField.DOC) return ScoreDocComparator.INDEXORDER;
+    if (type == SortField.SCORE) return ScoreDocComparator.RELEVANCE;
+    FieldCacheImpl.Entry entry = (factory != null)
+      ? new FieldCacheImpl.Entry (field, factory)
+      : new FieldCacheImpl.Entry (field, type, locale);
+    return (ScoreDocComparator) Comparators.createValue(reader, entry);
+  }

Otis suggests that I put this in Jira. I'll attach a patch shortly for review. -Ethan
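The cache bypass in the patch above boils down to a marker-interface check before the comparator cache is consulted. Here is a minimal standalone illustration of that pattern; SortSource, UncacheableSortSource, and ComparatorCache are invented stand-ins, not the actual Lucene classes.

```java
import java.util.HashMap;
import java.util.Map;

// Invented stand-in for a comparator factory (not Lucene's SortComparatorSource).
interface SortSource {
    Object createComparator();
}

// Marker interface: implementors opt out of comparator caching,
// mirroring SortComparatorSourceUncacheable in the patch.
interface UncacheableSortSource extends SortSource {
}

class ComparatorCache {
    private final Map<SortSource, Object> cache = new HashMap<>();

    Object getComparator(SortSource source) {
        if (source instanceof UncacheableSortSource) {
            // Per-query sources (e.g. a distance sort with fresh lat/lon)
            // must not be cached, or the cache grows without bound.
            return source.createComparator();
        }
        return cache.computeIfAbsent(source, s -> s.createComparator());
    }

    int size() {
        return cache.size();
    }
}
```

Cacheable sources get one comparator per source instance; uncacheable ones get a fresh comparator every call and never touch the map.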
[jira] Created: (LUCENE-1512) Incorporate GeoHash in contrib/spatial
Incorporate GeoHash in contrib/spatial -- Key: LUCENE-1512 URL: https://issues.apache.org/jira/browse/LUCENE-1512 Project: Lucene - Java Issue Type: New Feature Components: contrib/spatial Reporter: patrick o'leary Priority: Minor Based on comments from Yonik and Ryan in SOLR-773. GeoHash provides the ability to store latitude / longitude values in a single consistent-hash field, which eliminates the need to maintain 2 field caches for the latitude / longitude fields, reducing the size of an index and the amount of memory needed for a spatial search.
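For reference, geohash encoding itself (per the Wikipedia algorithm the attached patch is based on) interleaves longitude and latitude range-bisection bits and emits base-32 characters. A minimal sketch, independent of the patch's GeoHashUtils:

```java
// Minimal geohash encoder: alternately bisect the longitude and latitude
// ranges, emit 1 if the point is in the upper half, and pack each 5 bits
// into one character of the geohash base-32 alphabet.
public class GeoHashSketch {
    private static final char[] BASE32 =
        "0123456789bcdefghjkmnpqrstuvwxyz".toCharArray();

    public static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // geohash starts with a longitude bit
        int bit = 0, ch = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; }
                else            { ch = ch << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; }
                else            { ch = ch << 1;       latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { // 5 bits per base-32 character
                hash.append(BASE32[ch]);
                bit = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }
}
```

Because the hash is a single string, one stored field (and one field cache) covers both coordinates, which is exactly the saving the issue description points at.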
[jira] Updated: (LUCENE-1512) Incorporate GeoHash in contrib/spatial
[ https://issues.apache.org/jira/browse/LUCENE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] patrick o'leary updated LUCENE-1512: Attachment: LUCENE-1512.patch spatial-lucene GeoHash implementation based on http://en.wikipedia.org/wiki/Geohash removable dependency on refactoring in LUCENE-1504
[jira] Commented: (LUCENE-1314) IndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661214#action_12661214 ] Michael McCandless commented on LUCENE-1314: {quote} The problem is the user may get into trouble by updating the stale reader, which was debated before. I got the impression ensuring the reader being updated was the latest was important. {quote} But: when one attempts to change a stale reader, that's caught when trying to acquire the write lock? (Ie during clone I think you don't need to also check for this). {quote} The cost of cloning them meaning the creating a new byte array {quote} Yeah, I was thinking of the CPU cost of copying the deleted docs / norms; I was just curious (I don't think we have to measure this before committing). {quote} I need to reread Marvin's tombstones which at first glance seemed to be an iterative approach to saving deletions that seems like a transaction log. Correct? {quote} Similar to a transaction log in that the size of what's written is in proportion to how many changes (deletions) you made. But different in that there is no other data structure (ie the tombstones *are* the representation of the deletes) and so the tombstones are used live (whereas a transaction log is typically played back on next startup after a failure). If we had tombstones to represent deletes in Lucene then any new deletions would not require any cloning of prior deletions. Ie there would be no copy-on-write. {quote} M.M.: SegmentReader.Norm now has two refCounts, and I think both are necessary. One tracks refs to the Norm instance itself and the other tracks refs to the byte[]. Can you add some comments explaining the difference (because it's confusing at first blush)? Byte[] referencing is used because a new norm object needs to be created for each clone, and the byte array is all that is needed for sharing between cloned readers.
The current norm referencing is for sharing between readers, whereas the byte[] referencing is for copy-on-write, which is independent of reader references. {quote} Got it. Can you put this into the javadocs in the Norm class? {quote} M.M.: In SegmentReader.doClose() you are failing to call deletedDocsCopyOnWriteRef.decRef(), so you have a refCount leak. Can you create a unit test that 1) opens reader 1, 2) does deletes on reader 1, 3) clones reader 1 -> reader 2, 4) closes reader 1, 5) deletes more docs with reader 1, and 6) asserts that the deletedDocs BitVector did not get cloned? First verify the test fails, then fix the bug... In regards to #5, the test cannot delete from reader 1 once it's closed. A method called TestIndexReaderClone.testSegmentReaderCloseReferencing was added to test this closing use case. {quote} Woops -- I meant 5) deletes more docs with reader 2. Test case looks good! Thanks. A few more comments: * Can you update javadocs of IndexReader.reopen to remove the warning about not doing modification operations? With copy-on-write you are now free to do deletes against the reopened reader with no impact to the reader you had reopened/cloned. * What is SegmentReader.doDecRef for? It seems dead? * SegmentReader.doUndeleteAll has 4 space indent (should be 2) * We have this in SegmentReader.reopenSegment:

{code}
if (deletedDocsRef == null)
  deletedDocsRef = new Ref();
else
  deletedDocsRef.incRef();
{code}

But I think if I clone a reader with no deletes, the clone then [incorrectly] has a deletedDocsRef set? Can you fix that code to keep the invariant that if deleteDocs is null, so is deletedDocsRef, and v/v? Can you sprinkle asserts to make sure that invariant always holds? * In SegmentReader.decRef we have if (deletedDocsRef != null && deletedDocsRef.refCount() > 1) deletedDocsRef.decRef(); -- but, you should not have to check if deletedDocsRef.refCount() > 1? Does something break when you remove that? (In which case I think we have a refCount bug lurking...)
* The norm cloning logic in SegmentReader.reopenSegment needs to be cleaned up... eg we first sweep through each Norm, incRef'ing it, and then make a 2nd pass to do the full clone. Really we should have if (doClone) up front and do a single pass? Also: I think we need that same logic to re-open the singleNormStream for the clone case as well. Hmm, in the non-single-norm stream case I think we also must re-open the norm file, rather than clone it, in Norm.clone(). I think if you 1) open reader 1 (do no searching w/ it), 2) clone it -> reader 2, 3) close reader 1, 4) try to do a search against a field that then needs to load norms, you'll hit an AlreadyClosedException, because you had a cloned IndexInput vs a newly reopened one? Can you add that test case? * Why was this needed: {code} if (doClone && normsDirty) {
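The copy-on-write refcounting under discussion in the message above can be illustrated in isolation. Ref and CowDeletedDocs here are simplified stand-ins (the actual SegmentReader uses a BitVector and tracks more state); the point is the invariant: clones share the bit set and bump a shared refcount, and a writer copies only when someone else still holds a reference.

```java
import java.util.BitSet;

// Simplified refcount holder, modeled on the Ref discussed above.
class Ref {
    private int refCount = 1;
    synchronized int refCount() { return refCount; }
    synchronized void incRef() { refCount++; }
    synchronized void decRef() { refCount--; }
}

// Hypothetical stand-in for a reader's deleted-docs state.
class CowDeletedDocs {
    BitSet deletedDocs;  // shared between clones until one of them writes
    Ref deletedDocsRef;

    CowDeletedDocs(BitSet bits, Ref ref) {
        deletedDocs = bits;
        deletedDocsRef = ref;
    }

    // Cloning shares the BitSet and increments the shared refcount.
    CowDeletedDocs cloneReader() {
        deletedDocsRef.incRef();
        return new CowDeletedDocs(deletedDocs, deletedDocsRef);
    }

    // Before mutating, copy only if another reader still shares the bits.
    void deleteDoc(int docNum) {
        if (deletedDocsRef.refCount() > 1) {
            deletedDocsRef.decRef();            // release our share of the old copy
            deletedDocs = (BitSet) deletedDocs.clone();
            deletedDocsRef = new Ref();         // private copy, refCount == 1
        }
        deletedDocs.set(docNum);
    }
}
```

With tombstones (as in the Marvin discussion above) this machinery would disappear, since new deletions would never mutate shared state.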
[jira] Created: (LUCENE-1513) fastss fuzzyquery
fastss fuzzyquery - Key: LUCENE-1513 URL: https://issues.apache.org/jira/browse/LUCENE-1513 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Code for doing fuzzy queries with the fastssWC algorithm. FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries. FastFuzzyQuery: similar to FuzzyQuery, except it queries the auxiliary index to retrieve a candidate list; this list is then verified with the Levenshtein algorithm. Sorry, but the code is a bit messy... what I'm actually using is very different from this, so it's pretty much untested, but at least you can see what's going on or fix it up.
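The Levenshtein verification step mentioned above is the standard dynamic-programming edit distance. A self-contained sketch (this is not the attached FastSS code, which only uses this exact check after pruning candidates via the auxiliary index):

```java
// Standard two-row dynamic-programming Levenshtein distance:
// the minimum number of insertions, deletions, and substitutions
// needed to turn string a into string b.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // empty prefix of a
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance from a-prefix to empty b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp; // roll the rows
        }
        return prev[b.length()];
    }
}
```

A candidate term from the auxiliary index would be accepted when its distance to the query term falls within the fuzzy query's edit-distance budget.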
[jira] Updated: (LUCENE-1513) fastss fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1513: Attachment: fastSSfuzzy.zip
[jira] Commented: (LUCENE-1512) Incorporate GeoHash in contrib/spatial
[ https://issues.apache.org/jira/browse/LUCENE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661223#action_12661223 ] Ryan McKinley commented on LUCENE-1512: --- This is awesome. thanks patrick!
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661238#action_12661238 ] Mark Miller commented on LUCENE-1483: - Here is essentially what that example policy has to be. We just have to create a good way to do the right conversion, I guess. I'll work on whatever you don't put up when you share your latest optimizations.

{code}
case SortField.STRING_ORD:
  return new ComparatorPolicy() {
    private FieldComparator comparator = new FieldComparator.StringOrdComparator(numHits, field);
    private boolean first = true;
    private boolean second = true;

    public FieldComparator nextComparator(FieldComparator oldComparator, IndexReader reader, int numHits, int numSlotsFull) throws IOException {
      if (first) {
        first = false;
        return comparator;
      } else if (second) {
        StringOrdValOnDemComparator comp = new FieldComparator.StringOrdValOnDemComparator(numHits, field);
        comp.values = ((FieldComparator.StringOrdComparator) comparator).values;
        comp.ords = ((FieldComparator.StringOrdComparator) comparator).ords;
        comp.currentReaderIndex = 1;
        comparator = comp;
        second = false;
        return comp;
      }
      return comparator;
    }
  };
{code}
[jira] Commented: (LUCENE-1512) Incorporate GeoHash in contrib/spatial
[ https://issues.apache.org/jira/browse/LUCENE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661241#action_12661241 ] Ryan McKinley commented on LUCENE-1512: --- Any chance you could make a new patch without SerialChainFilter moved to search? Should we make a new package for geohash-based things? org.apache.lucene.spatial.geohash - GeoHashUtils - GeoHashDistanceFilter Also, the spacing for GeoHashUtils should be 2 spaces rather than 4. Incorporate GeoHash in contrib/spatial -- Key: LUCENE-1512 URL: https://issues.apache.org/jira/browse/LUCENE-1512 Project: Lucene - Java Issue Type: New Feature Components: contrib/spatial Reporter: patrick o'leary Priority: Minor Attachments: LUCENE-1512.patch Based on comments from Yonik and Ryan in SOLR-773. GeoHash provides the ability to store latitude / longitude values in a single consistent-hash field, which eliminates the need to maintain two field caches for the latitude / longitude fields, reducing the size of the index and the amount of memory needed for a spatial search.
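To make the proposal concrete, here is a minimal sketch of geohash encoding: longitude and latitude bits are interleaved by repeated range bisection and packed 5 bits per base-32 character. This is a self-contained illustration, not the patch's GeoHashUtils API.

```java
public class GeoHashSketch {
    private static final char[] BASE32 =
        "0123456789bcdefghjkmnpqrstuvwxyz".toCharArray();

    /**
     * Interleave longitude and latitude range-bisection bits (longitude
     * first), emitting one base-32 character per 5 bits.
     */
    public static String encode(double lat, double lon, int precision) {
        double latLo = -90.0, latHi = 90.0, lonLo = -180.0, lonHi = 180.0;
        StringBuilder hash = new StringBuilder(precision);
        boolean evenBit = true; // even bit positions encode longitude
        int bits = 0, ch = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonLo + lonHi) / 2;
                ch <<= 1;
                if (lon >= mid) { ch |= 1; lonLo = mid; } else { lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                ch <<= 1;
                if (lat >= mid) { ch |= 1; latLo = mid; } else { latHi = mid; }
            }
            evenBit = !evenBit;
            if (++bits == 5) {
                hash.append(BASE32[ch]);
                bits = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Known reference point: 57.64911, 10.40744 encodes to "u4pruydqqvj"
        System.out.println(encode(57.64911, 10.40744, 11));
    }
}
```

Because the hash is a single prefix-comparable string, one indexed field can stand in for the two numeric field caches.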
[jira] Commented: (LUCENE-1504) SerialChainFilter should use DocSet API rather than deprecated BitSet API
[ https://issues.apache.org/jira/browse/LUCENE-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661249#action_12661249 ] Mark Miller commented on LUCENE-1504: - I think there are contrib dependency examples in the xml query parser and in the highlighter (which depends on MemoryIndex). SerialChainFilter should use DocSet API rather than deprecated BitSet API - Key: LUCENE-1504 URL: https://issues.apache.org/jira/browse/LUCENE-1504 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Ryan McKinley Fix For: 2.9 Attachments: LUCENE-1504.patch, LUCENE-1504.patch From Erik's comments in LUCENE-1387: * Maybe the Filters should be using the DocIdSet API rather than the deprecated BitSet stuff? We can refactor that after it is committed, I suppose, but it is not something we want to leave like that. We should also look at moving SerialChainFilter out of the spatial contrib, since it is more generally useful than just spatial search.
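The migration Erik suggests is from materializing a java.util.BitSet to exposing a lazy iterator over matching doc ids. The following is a self-contained model of that shape (hypothetical names, not the real Lucene DocIdSet classes), showing how chained filters can be combined by walking iterators without allocating a bit set:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Self-contained model (hypothetical names, NOT the real Lucene classes) of
 * the iterator-based DocIdSet shape: a filter hands back a lazy iterator
 * over ascending doc ids instead of materializing a java.util.BitSet.
 */
public class DocIdSetSketch {

    /** Iterator over ascending doc ids; -1 signals exhaustion. */
    interface DocIdIterator {
        int nextDoc();
    }

    interface DocIdSet {
        DocIdIterator iterator();
    }

    /** A sparse doc-id set backed by a sorted int[] rather than a dense bit set. */
    static DocIdSet sortedArraySet(int... docs) {
        final int[] sorted = docs.clone();
        Arrays.sort(sorted);
        return () -> new DocIdIterator() {
            private int pos = 0;
            public int nextDoc() {
                return pos < sorted.length ? sorted[pos++] : -1;
            }
        };
    }

    /** Serial AND of two filters by walking both iterators in lockstep. */
    static List<Integer> intersect(DocIdSet a, DocIdSet b) {
        List<Integer> out = new ArrayList<>();
        DocIdIterator ia = a.iterator(), ib = b.iterator();
        int da = ia.nextDoc(), db = ib.nextDoc();
        while (da != -1 && db != -1) {
            if (da == db) {
                out.add(da);
                da = ia.nextDoc();
                db = ib.nextDoc();
            } else if (da < db) {
                da = ia.nextDoc();
            } else {
                db = ib.nextDoc();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Docs matching both chained filters; prints [3, 5].
        System.out.println(intersect(sortedArraySet(1, 3, 5, 7), sortedArraySet(3, 4, 5, 9)));
    }
}
```

The iterator contract is what lets sparse filters avoid the O(maxDoc) memory cost that a BitSet per filter implies.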
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661260#action_12661260 ] Ryan McKinley commented on LUCENE-1483: --- Any estimates on how far along this is? Is it close enough that the reasonably simple patch in LUCENE-1304 should wait? Or do you think it is worth waiting for this? I'm trying to get local lucene and solr to play nice (SOLR-773). The hoops you have to jump through to avoid memory leaks make the final code too strange and not reusable. Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661264#action_12661264 ] Mark Miller commented on LUCENE-1483: - I think we are wrapping up, but it may make sense to do 1304 anyway. That code will be deprecated, but if you use a custom comparator, it will use the deprecated code. The custom comparator will be removed in 3.0 I think, and you'd have to make a new comparator or comparator policy. So its probably best to do 1304 if we want it, just for the 2.9 release. - Mark Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1304) Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene
[ https://issues.apache.org/jira/browse/LUCENE-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661269#action_12661269 ] Mark Miller commented on LUCENE-1304: - The main impact is that most of that code will be deprecated. It will still be used for old custom comparators until 3.0 though, so it might be wise to consider this for 2.9 in the interim. Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene Key: LUCENE-1304 URL: https://issues.apache.org/jira/browse/LUCENE-1304 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.3 Environment: Windows/JDK 1.6 Reporter: Ethan Tao Attachments: LUCENE-1304.patch We hit a memory leak when using DistanceSortSource of LocalLucene for repeated query/search; after about 450 queries we got an out-of-memory error. After digging into the code, we found that the source of the problem is in the Lucene package, in the way it handles custom-type comparators. Lucene internally caches all created comparators. When querying with LocalLucene, we create a new comparator for every search due to the different lon/lat and query terms. This causes a major memory leak, as the cached comparators also hold memory for other large objects (e.g., bit sets). The solution we came up with (the proposed changes to Lucene are 1 and 3 below):

1. In the Lucene package, create a new file SortComparatorSourceUncacheable.java:
{code}
package org.apache.lucene.search;

import org.apache.lucene.index.IndexReader;

import java.io.IOException;
import java.io.Serializable;

public interface SortComparatorSourceUncacheable extends Serializable {
}
{code}
2. Have your custom sort class implement the interface:
{code}
public class LocalSortSource extends DistanceSortSource implements SortComparatorSourceUncacheable {
  ...
}
{code}
3. Modify Lucene's FieldSortedHitQueue.java to bypass caching for custom sort comparators:
{code}
Index: FieldSortedHitQueue.java
===================================================================
--- FieldSortedHitQueue.java	(revision 654583)
+++ FieldSortedHitQueue.java	(working copy)
@@ -53,7 +53,12 @@
     this.fields = new SortField[n];
     for (int i=0; i<n; ++i) {
       String fieldname = fields[i].getField();
-      comparators[i] = getCachedComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+
+      if (fields[i].getFactory() instanceof SortComparatorSourceUncacheable) { // no caching to avoid memory leak
+        comparators[i] = getComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+      } else {
+        comparators[i] = getCachedComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+      }
       if (comparators[i].sortType() == SortField.STRING) {
         this.fields[i] = new SortField (fieldname, fields[i].getLocale(), fields[i].getReverse());
@@ -157,7 +162,18 @@
   SortField[] getFields() { return fields; }
-
+
+  static ScoreDocComparator getComparator (IndexReader reader, String field, int type, Locale locale, SortComparatorSource factory)
+    throws IOException {
+    if (type == SortField.DOC) return ScoreDocComparator.INDEXORDER;
+    if (type == SortField.SCORE) return ScoreDocComparator.RELEVANCE;
+    FieldCacheImpl.Entry entry = (factory != null)
+      ? new FieldCacheImpl.Entry (field, factory)
+      : new FieldCacheImpl.Entry (field, type, locale);
+    return (ScoreDocComparator) Comparators.createValue(reader, entry);
+  }
{code}
Otis suggests that I put this in Jira. I'll attach a patch shortly for review. -Ethan
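The heart of the proposed fix is a marker interface checked with instanceof before the cache is consulted, so per-query factories never accumulate entries. A self-contained sketch of that pattern (hypothetical names, not the actual FieldSortedHitQueue code):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Self-contained sketch (hypothetical names) of the marker-interface fix:
 * sources implementing Uncacheable are built fresh on every lookup and
 * never stored, so per-query factories cannot accumulate in the cache.
 */
public class ComparatorCacheSketch {
    public interface ComparatorSource { String newComparator(); }

    /** Marker interface: implementors opt out of caching. */
    public interface Uncacheable {}

    private final Map<ComparatorSource, String> cache = new HashMap<>();

    public String getComparator(ComparatorSource src) {
        if (src instanceof Uncacheable) {
            return src.newComparator(); // bypass: no cache entry retained
        }
        return cache.computeIfAbsent(src, ComparatorSource::newComparator);
    }

    public int cacheSize() { return cache.size(); }

    public static class CachedSource implements ComparatorSource {
        public String newComparator() { return "cached"; }
    }

    /** Stands in for a per-search source like a new DistanceSortSource each query. */
    public static class PerQuerySource implements ComparatorSource, Uncacheable {
        public String newComparator() { return "fresh"; }
    }

    public static void main(String[] args) {
        ComparatorCacheSketch c = new ComparatorCacheSketch();
        for (int i = 0; i < 100; i++) c.getComparator(new PerQuerySource());
        System.out.println(c.cacheSize()); // stays 0: nothing leaked
        ComparatorSource shared = new CachedSource();
        c.getComparator(shared);
        c.getComparator(shared);
        System.out.println(c.cacheSize()); // 1: one shared entry reused
    }
}
```

The same instanceof test is exactly what the diff above adds before calling getCachedComparator.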
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1483: --- Attachment: LUCENE-1483-partial.patch Attached prototype changes to switch to setBottom and compareBottom API for FieldComparator, but, I only included the few files I modified over the last patch, and it does not pass TestSort when I switch to it (fails the same tests ORD fails on). Mark can you switch the comparators to this new API (and remove the compare(int, int, float) method) and fix the test failures? Once that passes tests, I'll re-run perf test and we can tune the default policy. I think we are close! Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661295#action_12661295 ] Michael McCandless commented on LUCENE-1483: {quote} Not sure about new constructors or package private for that part of the switch... {quote} Could we just make ctors on each comparator that take the other comparator and copy over what they need? This way we can make attrs private final again, in case that helps the JRE optimize. Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1513) fastss fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661302#action_12661302 ] Otis Gospodnetic commented on LUCENE-1513: -- I feel like I missed some FastSS discussion on the list; was there one? I took a quick look at the paper and the code. Is the following the general idea?
# Index fuzzy/misspelled terms in addition to the normal terms (= larger index, slower indexing). How much fuzziness one wants to allow or handle is decided at index time.
# Rewrite the query to include variations/misspellings of each term and use that to search (= more clauses, slower than a normal search, but faster than the normal fuzzy query, whose speed depends on the number of indexed terms).
Quick code comments:
* Need to add the ASL
* Need to replace tabs with 2 spaces and fix formatting in FuzzyHitCollector
* No @author
* Unit test if possible
* Should FastSSwC not be able to take a variable K?
* Should variables named after types (e.g. set in public static String getNeighborhoodString(Set<String> set) {) be renamed so they describe what's in them instead? (easier-to-understand API?)
fastss fuzzyquery - Key: LUCENE-1513 URL: https://issues.apache.org/jira/browse/LUCENE-1513 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Attachments: fastSSfuzzy.zip Code for doing fuzzy queries with the FastSSwC algorithm. FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries. FastFuzzyQuery: similar to FuzzyQuery, except it queries the auxiliary index to retrieve a candidate list; this list is then verified with the Levenshtein algorithm. Sorry, but the code is a bit messy... what I'm actually using is very different from this, so it's pretty much untested. But at least you can see what's going on or fix it up.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661304#action_12661304 ] Michael McCandless commented on LUCENE-1483: {quote} I'm trying to get local lucene and solr to play nice (SOLR-773). The hoops you have to jump through to avoid memory leaks make the final code too strange and not reusable. {quote} With this patch we are changing how custom sorting works. Previously, Lucene would iterate the terms for you, asking you to produce a Comparable for each one. With this patch, we are asking you to implement FieldComparator, which compares docs/slots directly and must be aware of switching sub-readers during searching. Ryan, can you have a look at FieldComparator to see if it works for local lucene (and any other feedback on it)? I think the best outcome here would be to get this issue done, and then get local lucene switched over to this new API (so local lucene sees the benefits of the new API, and sidesteps the memory leak in LUCENE-1304). We may still need to do LUCENE-1304 in case others hit the memory leak of the old custom sort API. 
Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
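The comparator contract described here, slot-based compares plus awareness of sub-reader switches (and the setBottom/compareBottom methods mentioned in the partial patch above), can be modeled in miniature. This is a hypothetical sketch, not the real FieldComparator signatures:

```java
/**
 * Minimal model of a slot-based, segment-aware comparator. The real
 * FieldComparator has more methods; this keeps just enough to show
 * per-segment docBase handling and the setBottom/compareBottom idea.
 */
public class SegmentComparatorSketch {
    private final int[] slots;   // values copied out of the current segment
    private int[] segmentValues; // this segment's per-doc values (stands in for a FieldCache array)
    private int docBase;         // global id of this segment's first doc
    private int bottom;          // value of the weakest queue entry

    public SegmentComparatorSketch(int numHits) { slots = new int[numHits]; }

    /** Called when the search advances to the next sub-reader. */
    public void setNextReader(int[] values, int docBase) {
        this.segmentValues = values;
        this.docBase = docBase;
    }

    /** Copy the (segment-relative) doc's value into a queue slot. */
    public void copy(int slot, int segmentDoc) { slots[slot] = segmentValues[segmentDoc]; }

    /** Compare two queue slots; never two raw docs from different segments. */
    public int compare(int slot1, int slot2) { return Integer.compare(slots[slot1], slots[slot2]); }

    public void setBottom(int slot) { bottom = slots[slot]; }

    /** Compare the queue bottom against a doc in the current segment. */
    public int compareBottom(int segmentDoc) { return Integer.compare(bottom, segmentValues[segmentDoc]); }

    public int globalDoc(int segmentDoc) { return docBase + segmentDoc; }

    public static void main(String[] args) {
        SegmentComparatorSketch cmp = new SegmentComparatorSketch(2);
        cmp.setNextReader(new int[] {42, 7}, 0);   // segment 1: docs 0..1
        cmp.copy(0, 1);                            // keep doc 1 (value 7) in slot 0
        cmp.setBottom(0);
        cmp.setNextReader(new int[] {3}, 2);       // segment 2: its doc 0 is global doc 2
        cmp.copy(1, 0);                            // candidate from the new segment
        System.out.println(cmp.compare(1, 0) < 0); // true: 3 sorts before 7
        System.out.println(cmp.globalDoc(0));      // 2
    }
}
```

The key shift from the old API: values are compared by slot rather than by producing a Comparable per term, and docBase is what maps segment-local hits back to global doc ids.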
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661306#action_12661306 ] Mark Miller commented on LUCENE-1483: - bq. Could we just make ctors on each comparator that take the other comparator and copy over what they need? This way we can make attrs private final again, in case that helps the JRE optimize. Right, good idea. I'll get everything together and put up a patch. Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1513) fastss fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661314#action_12661314 ] Robert Muir commented on LUCENE-1513: - Otis, discussion was on java-user. Again, I apologize for the messy code. As mentioned there, my setup is very specific to exactly what I am doing, and in no way is this code ready. But since I'm currently pretty busy with other things at work, I just wanted to put something up here for anyone else interested. There are the issues you mentioned, and also some I mentioned on java-user: for example, how to handle updates to indexes that introduce new terms (they must be added to the auxiliary index), or even whether an auxiliary index is the best approach. The general idea is that instead of enumerating terms to find matches, the deletion neighborhood as described in the paper is used instead; this way search time is not linear in the number of terms. Yes, you are correct that it can only guarantee edit distances up to K, which is determined at index time. Perhaps this should be configurable, but I hardcoded k=1 for simplicity; I think that covers something like 80% of typos... As I mentioned on the list, another idea is that you could implement FastSS (not the wC variant) with deletion positions, maybe by using payloads. This would require more space but eliminate the candidate-verification step. Maybe it would also be nice to have some of their other algorithms, such as block-based, etc., available. fastss fuzzyquery - Key: LUCENE-1513 URL: https://issues.apache.org/jira/browse/LUCENE-1513 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Attachments: fastSSfuzzy.zip Code for doing fuzzy queries with the FastSSwC algorithm. FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries. FastFuzzyQuery: similar to FuzzyQuery, except it queries the auxiliary index to retrieve a candidate list; this list is then verified with the Levenshtein algorithm. Sorry, but the code is a bit messy... what I'm actually using is very different from this, so it's pretty much untested. But at least you can see what's going on or fix it up.
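The candidate-verification step discussed in this issue is a plain Levenshtein edit-distance check over the candidate list. A minimal sketch of that check (standard two-row dynamic programming, not the attached code):

```java
/**
 * Classic dynamic-programming Levenshtein distance, as would be used to
 * verify candidates retrieved from the auxiliary index. Two rolling rows
 * keep memory at O(|b|) instead of O(|a|*|b|).
 */
public class LevenshteinSketch {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // distance from empty prefix
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] t = prev; prev = cur; cur = t; // roll the rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("best", "jest"));    // 1 (one substitution)
        System.out.println(distance("robert", "obert")); // 1 (one deletion)
    }
}
```

A candidate survives verification when distance(query, candidate) <= k, the same k used when the deletion neighborhood was indexed.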
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
Why not just create a new field for this? That is, if you have FieldA, create field FieldAFuzzy and put the various permutations there. The fuzzy scorer/parser can be changed to automatically use the Fuzzy field when required. You could also store positions, and allow that the first term is the closest, the next is the second closest, etc., to add support for a slop factor. This is similar to the way fast phonetic searches can be implemented. If you do it this way, you don't have any of the synchronization issues between the index and the external fuzzy index. On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote: ...
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
A deletion neighborhood can be pretty large (for example, robert expands to something like robert, obert, rbert, robrt, robet, ...), so if you have 100 million docs with 1 billion words but only 100k unique terms, it would definitely be wasteful to store 1 billion deletion neighborhoods when you only need 100k. On Tue, Jan 6, 2009 at 4:02 PM, robert engels reng...@ix.netcom.com wrote: ... -- Robert Muir rcm...@gmail.com
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
I don't think that is the case. You will have a single deletion neighborhood. The number of unique terms in the field is going to be the union of the deletion dictionaries of each source term. For example, take document A, which has field 'X' with value best, and document B with value jest (and k == 1). A will generate best, est, bst, bet, bes; B will generate jest, est, jst, jet, jes. So field FieldXFuzzy contains (est:AB, best:A, bst:A, bet:A, bes:A, jest:B, jst:B, jet:B, jes:B). I don't think the storage requirement is any greater doing it this way. From the paper: 3.2.1 Indexing. For all words in a dictionary, and a given number of edit operations k, FastSS generates all variant spellings recursively and saves them as tuples of type v′ ∈ Ud(v, k) → (v, x), where v is a dictionary word and x a list of deletion positions. Theorem 5. The index uses O(nm^(k+1)) space, as it stores all the variants for n dictionary words of length m with k mismatches. 3.2.2 Retrieval. For a query p and edit distance k, first generate the neighborhood Ud(p, k). Then compare the words in the neighborhood with the index, and find matching candidates. Compare deletion positions for each candidate with the deletion positions in U(p, k), using Theorem 4.
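The k=1 deletion-neighborhood construction under discussion, the word itself plus every single-character deletion, is easy to sketch (a toy example, not the FastSS implementation):

```java
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * k=1 deletion neighborhood per the FastSS construction: the word itself
 * plus every string obtained by deleting one character.
 */
public class DeletionNeighborhoodSketch {
    public static Set<String> neighborhood(String word) {
        Set<String> out = new LinkedHashSet<>();
        out.add(word);
        for (int i = 0; i < word.length(); i++) {
            // delete the character at position i
            out.add(word.substring(0, i) + word.substring(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> best = neighborhood("best"); // [best, est, bst, bet, bes]
        Set<String> jest = neighborhood("jest"); // [jest, est, jst, jet, jes]
        best.retainAll(jest);
        System.out.println(best); // [est] -- the shared variant that links the two terms
    }
}
```

Two terms are fuzzy-match candidates exactly when their neighborhoods intersect, which is why est:AB appears once in the merged field above.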
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
i see, your idea would definitely simplify some things. What about the index size difference between this approach and using a separate index? Would this separate field increase index size? I guess my line of thinking is: if you have 10 docs with robert, with a separate index you just have robert and its deletion neighborhood one time. with this approach you have the same thing, but you must also have document numbers and the other inverted index stuff with each neighborhood term. would this be a significant change to size and/or performance? and since the documents have multiple terms there is additional positional information for slop factor for each neighborhood term... i think it's worth investigating, maybe performance would actually be better, just curious. i think i boxed myself in to an auxiliary index because of some other irrelevant things i am doing. On Tue, Jan 6, 2009 at 4:42 PM, robert engels reng...@ix.netcom.com wrote: I don't think that is the case. You will have a single deletion neighborhood. The number of unique terms in the field is going to be the union of the deletion dictionaries of each source term. For example, given the following documents: A, which has field 'X' with value best, and document B with value jest (and k == 1). A will generate est, bst, bet, bes; B will generate est, jest, jst, jes; so field FieldXFuzzy contains (est:AB, bst:A, bet:A, bes:A, jest:B, jst:B, jes:B). I don't think the storage requirement is any greater doing it this way. 3.2.1 Indexing: For all words in a dictionary, and a given number of edit operations k, FastSS generates all variant spellings recursively and saves them as tuples of type v' ∈ Ud(v, k) → (v, x), where v is a dictionary word and x a list of deletion positions. Theorem 5. The index uses O(nm^(k+1)) space, as it stores all the variants for n dictionary words of length m with k mismatches. 3.2.2 Retrieval: For a query p and edit distance k, first generate the neighborhood Ud(p, k). Then compare the words in the neighborhood with the index, and find matching candidates. Compare the deletion positions for each candidate with the deletion positions in Ud(p, k), using Theorem 4. -- Robert Muir rcm...@gmail.com
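The deletion-neighborhood generation quoted from the FastSS paper can be sketched in a few lines. The class and method names below are invented for illustration and are not taken from any patch on this issue; the neighborhood here includes the word itself plus every single-character deletion (k=1).

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of k=1 deletion-neighborhood generation as described in the
// thread above. Names are illustrative only.
public class DeletionNeighborhood {
    // Returns the word itself plus every string obtained by deleting
    // exactly one character (the Ud(v, 1) neighborhood).
    public static Set<String> neighborhood(String word) {
        Set<String> result = new HashSet<>();
        result.add(word);
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // best and jest share the variant "est", so both documents end up
        // behind that one indexed term, as in the FieldXFuzzy example.
        System.out.println(neighborhood("best"));
        System.out.println(neighborhood("jest"));
    }
}
```

Two words whose neighborhoods intersect (for example LUCENE and LUBENE through LUENE) are candidate matches within edit distance 1.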
[jira] Resolved: (LUCENE-1502) CharArraySet behaves inconsistently in add(Object) and contains(Object)
[ https://issues.apache.org/jira/browse/LUCENE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1502. Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed revision 732141. Thanks Shai! CharArraySet behaves inconsistently in add(Object) and contains(Object) --- Key: LUCENE-1502 URL: https://issues.apache.org/jira/browse/LUCENE-1502 Project: Lucene - Java Issue Type: Bug Components: Analysis Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.4.1, 2.9 Attachments: LUCENE-1502.patch CharArraySet's add(Object) method looks like this: if (o instanceof char[]) { return add((char[])o); } else if (o instanceof String) { return add((String)o); } else if (o instanceof CharSequence) { return add((CharSequence)o); } else { return add(o.toString()); } You'll notice that in the case of an Object (for example, Integer), the o.toString() is added. However, its contains(Object) method looks like this: if (o instanceof char[]) { char[] text = (char[])o; return contains(text, 0, text.length); } else if (o instanceof CharSequence) { return contains((CharSequence)o); } return false; In case of contains(Integer), it always returns false. 
I've added a simple test to TestCharArraySet which reproduces the problem: public void testObjectContains() { CharArraySet set = new CharArraySet(10, true); Integer val = new Integer(1); set.add(val); assertTrue(set.contains(val)); assertTrue(set.contains(new Integer(1))); } Changing contains(Object) to the following solves the problem: if (o instanceof char[]) { char[] text = (char[])o; return contains(text, 0, text.length); } return contains(o.toString()); The patch also includes a few minor improvements (which were discussed on the mailing list), such as the removal of the following dead code from getHashCode(CharSequence): if (false && text instanceof String) { code = text.hashCode(); and simplifying add(Object): if (o instanceof char[]) { return add((char[])o); } return add(o.toString()); (which also aligns it with the equivalent contains() method). One thing that's still left open is whether we can avoid the Character.toLowerCase calls in all the char[] methods by first converting the char[] to lowercase, and then passing it through the equals() and getHashCode() methods. It works for add(), but fails for contains(char[]) since it modifies the input array. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
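The asymmetry can be reproduced with a stripped-down stand-in that keeps a HashSet&lt;String&gt; in place of CharArraySet's char[] storage. This is only an illustration of the control flow described in the issue, not the CharArraySet code itself:

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in showing the add(Object)/contains(Object) asymmetry:
// add() falls back to o.toString(), but the broken contains() returns
// false for anything that is not a char[] or CharSequence.
public class AsymmetricSet {
    private final Set<String> backing = new HashSet<>();

    public boolean add(Object o) {
        if (o instanceof char[]) return backing.add(new String((char[]) o));
        return backing.add(o.toString());          // Integer is stored as "1"
    }

    public boolean containsBroken(Object o) {
        if (o instanceof char[]) return backing.contains(new String((char[]) o));
        if (o instanceof CharSequence) return backing.contains(o.toString());
        return false;                              // Integer always misses
    }

    public boolean containsFixed(Object o) {
        if (o instanceof char[]) return backing.contains(new String((char[]) o));
        return backing.contains(o.toString());     // mirrors add(Object)
    }

    public static void main(String[] args) {
        AsymmetricSet set = new AsymmetricSet();
        Integer val = 1;
        set.add(val);
        System.out.println(set.containsBroken(val)); // false
        System.out.println(set.containsFixed(val));  // true
    }
}
```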
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
It is definitely going to increase the index size, but not any more than the external one would (if my understanding is correct). The nice thing is that you don't have to try to keep document numbers in sync - it will be automatic. Maybe I don't understand what your external index is storing. Given that the document contains 'robert' but the user enters 'obert', what is the process to find the matching documents? Is the external index essentially a constant list, so that given obert, the source words COULD BE robert, tobert, reobert etc., and it contains no document information? That is: given the source word X and an edit distance k, you ask the external dictionary for possible indexed words, it returns the list, and then you search lucene using each of those words? If the above is the case, it certainly seems you could generate this list in real-time rather efficiently with no IO (unless the external index only stores words which HAVE BEEN indexed). I think the confusion may be because I understand Otis's comments, but they don't seem to match what you are stating. Essentially performing any term match requires efficient searching/matching of the term index. If this is efficient enough, I don't think either process is needed - just an improved real-time fuzzy possibilities word generator.
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
On Tue, Jan 6, 2009 at 5:15 PM, robert engels reng...@ix.netcom.com wrote: Maybe I don't understand what your external index is storing. Given that the document contains 'robert' but the user enters 'obert', what is the process to find the matching documents? here's a simple example. the neighborhood stored for robert is 'robert obert rbert roert ...' and this is indexed in a tokenized field. at query time the user typos robert and enters 'tobert'. again a neighborhood is generated: 'tobert obert tbert ...' the system does a query on tobert OR obert OR tbert ... and robert is returned because 'obert' is present in both neighborhoods. in this example, by storing k=1 deletions you guarantee to satisfy all edit distance matches <= 1 without a linear scan. you get some false positives too with this approach, that's why what comes back is only a CANDIDATE and true edit distance must be used to verify. this might be tricky to do with your method, i don't know. Is the external index essentially a constant list, that given obert, the source words COULD BE robert, tobert, reobert etc., and it contains no document information? no. see above: you generate all possible 1-character deletions of the index term and store them, then at query time you generate all possible 1-character deletions of the query term. basically, LUCENE and LUBENE are 1 character different, but they are the same (LUENE) if you delete 1 character from both of them. so you don't need to store LUCENE LUBENE LUDENE, you just store LUENE. -- Robert Muir rcm...@gmail.com
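The candidate-then-verify flow described above can be sketched as follows. A candidate term shares at least one k=1 deletion variant with the query, and a plain Levenshtein computation stands in here for whatever verification an actual implementation would use; all names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of FastSS-style candidate retrieval with edit-distance
// verification, per the thread above. Illustrative names only.
public class FuzzyCandidates {
    // Word plus all single-character deletions (the k=1 neighborhood).
    static Set<String> neighborhood(String w) {
        Set<String> s = new HashSet<>();
        s.add(w);
        for (int i = 0; i < w.length(); i++)
            s.add(w.substring(0, i) + w.substring(i + 1));
        return s;
    }

    // A candidate shares at least one k=1 deletion variant with the query.
    static boolean isCandidate(String indexed, String query) {
        Set<String> shared = neighborhood(indexed);
        shared.retainAll(neighborhood(query));
        return !shared.isEmpty();
    }

    // True edit distance, used to filter false positives among candidates.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1]
                                + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // 'robert' and 'tobert' share the deletion variant 'obert',
        // so 'robert' comes back as a candidate; verification confirms it.
        System.out.println(isCandidate("robert", "tobert")); // true
        System.out.println(editDistance("robert", "tobert")); // 1
    }
}
```

Candidates that pass the neighborhood test but fail the editDistance check are exactly the false positives discussed above.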
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
To clarify a statement in the last email: generating the 'possible source words' in real-time is not as difficult as it first seems, if you assume some sort of first-character prefix (which is what it appears google does). For example, assume the user typed 'robrt' instead of 'robert'. You see that this word has very low frequency (or none), so you want to find possible misspellings, so you do a fuzzy search starting with r. But this search can be optimized, because as the edit/delete position moves to the right, the prefix remains the same, so these possibilities can be quickly skipped. If you don't find any words with high enough frequency as possible edit distances, try [a-z]robrt, assuming the user may have dropped the first character (possibly trying these in known combination order rather than alphabetical, i.e. try sr before nr). For example, searching google for 'robrt engels' works. So does 'obert engels', and so does 'robt engels' - all ask me if I meant 'robert engels' - but searching for 'obrt engels' does not.
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
I understand now. The index in my case would definitely be MUCH larger, but I think it would perform better, as you only need to do a single search - for obert (if you assume it was a misspelling). In your case you would eventually do an OR search in the lucene index for all possible matches (robert, roberta, roberto, ...), which could be much larger with some commonly prefixed/postfixed words. Classic performance vs. size trade-off. In your case, where it is not for misspellings, the performance difference might be worthwhile. Still, in your case, I am not sure using a Lucene index as the external index is appropriate. Maybe a simple BTREE (Derby?) index of (word, edit permutation), with a key on both, would allow easy search and update. If implemented as a service, some intelligent caching of common misspellings could really improve the performance.
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
robert, there's only one problem i see: i don't see how you can do a single search, since fastssWC returns some false positives (with k=1 it will still return some things with an ED of 2). maybe if you store the deletion position information as a payload (thus using original fastss, where there are no false positives) it would work though. i looked at storing position information but it appeared like it might be complex, and the api was (is) still marked experimental, so i didn't go that route. i also agree a lucene index might not be the best possible data structure... just convenient, that's all. i used it because i store other things related to the term besides deletion neighborhoods for my fuzzy matching. i guess i'll also mention that i do think storage size should be a big consideration. you really don't need this kind of stuff unless you are searching pretty big indexes in the first place (for <= a few million docs the default fuzzy is probably just fine for a lot of people). for me, the whole thing was about turning 30-second queries into 1-second queries by removing a linear algorithm; i didn't really optimize much beyond that because i was just very happy to have reasonable performance.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661390#action_12661390 ] Mark Miller commented on LUCENE-1483: - Can't seem to use the partial patch, but I'll try to put in by hand. Just gotta remember to make sure I don't miss anything. Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661394#action_12661394 ] Mark Miller commented on LUCENE-1483: - bq. I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods? bq. Yeah, this makes sense in any case. I just keep switching them by hand as I work on them. In thinking about this, we are going to drop those other sort types though right? I figured we would still just have String, and the comparator policy for String would pick the right comparators rather than the sort type?
[jira] Updated: (LUCENE-1314) IndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1314: - Attachment: LUCENE-1314.patch Everything in the previous post should be working and completed. TestIndexReaderReopen.testThreadSafety is hitting a bug in the deletedDocs referencing, which is related to
{code}
if (!success) {
  // An exception occurred during reopen; we have to decRef the norms
  // that we incRef'ed already and close singleNormsStream and FieldsReader
  clone.decRef();
}
{code}
at the bottom of SegmentReader.reopenSegment. I am finished for the day and have posted what is completed otherwise. Similar to a transaction log in that the size of what's written is proportional to how many changes (deletions) you made. But different in that there is no other data structure (ie the tombstones are the representation of the deletes), and so the tombstones are used live (whereas a transaction log is typically played back on the next startup after a failure). If we had tombstones to represent deletes in Lucene, then any new deletions would not require any cloning of prior deletions. Ie there would be no copy-on-write. Definitely interesting - how do tombstones work with BitVector? I changed Norm.clone to Norm.cloneNorm because it needs to throw an IOException; the clone interface does not allow exceptions, and it's hidden inside of SegmentReader so the naming conventions should not matter.
IndexReader.clone - Key: LUCENE-1314 URL: https://issues.apache.org/jira/browse/LUCENE-1314 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.3.1 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch Based on discussion http://www.nabble.com/IndexReader.reopen-issue-td18070256.html. The problem is reopen returns the same reader if there are no changes, so if docs are deleted from the new reader, they are also reflected in the previous reader which is not always desired behavior. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
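The tombstone idea from the comment above can be illustrated with a toy sketch: deletions are appended to a list, a reader snapshot is just the list length at open time, and later deletions never force a copy of earlier ones. All names are invented for this illustration; this is not how Lucene's BitVector-based deletes work.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of tombstone-style deletes: append-only, so a reader
// snapshot requires no copy-on-write of prior deletions.
public class TombstoneDeletes {
    private final List<Integer> tombstones = new ArrayList<>();

    // Recording a delete is an O(1) append; nothing is cloned.
    void delete(int docId) { tombstones.add(docId); }

    // A snapshot is just the current length; deletes appended later are
    // invisible to readers holding an older snapshot.
    int snapshot() { return tombstones.size(); }

    // Linear scan for clarity; a real structure would need fast lookup.
    boolean isDeleted(int docId, int snapshot) {
        for (int i = 0; i < snapshot; i++)
            if (tombstones.get(i) == docId) return true;
        return false;
    }

    public static void main(String[] args) {
        TombstoneDeletes d = new TombstoneDeletes();
        d.delete(3);
        int snap = d.snapshot();   // reader "opened" here
        d.delete(7);               // later delete, not visible to snap
        System.out.println(d.isDeleted(3, snap)); // true
        System.out.println(d.isDeleted(7, snap)); // false
    }
}
```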
Re: TestIndexInput test failures on jdk 1.6/linux after r641303
Michael McCandless wrote: I'll remove those 2 test cases. The build now works perfectly. Thanks Mike! -- Sami Siren - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
DisjunctionScorer performance
Hi guys: We have been building a suite of boolean-operator DocIdSets (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator, NotDocIdSet/Iterator). We compared our OrDocIdSetIterator implementation (based on the DisjunctionMaxScorer code, with some tuning) against the existing code, and we saw performance double in our testing. (We haven't yet compared ConjunctionScorer vs. AndDocIdSetIterator; we will post numbers when we do.) We'd be happy to contribute this back to the community, but what is the best way of going about it? Option 1: merge our changes into DisjunctionMax/SumScorers. Option 2: contribute the boolean-operator sets, and have the DisjunctionScorers derive from OrDocIdSetIterator, ConjunctionScorer derive from AndDocIdSetIterator, etc. Option 2 seems cleaner. Thoughts? Thanks -John
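For readers following along, here is a minimal sketch of what a heap-based disjunction iterator does (class and method names are made up for illustration; this is not the contributed code): sorted per-clause doc-id lists are merged through a priority queue keyed on the current doc id, with duplicate doc ids collapsed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative heap-based disjunction ("OR") over sorted doc-id lists.
// Names are hypothetical; this sketches the approach, not the actual
// OrDocIdSetIterator being discussed.
class OrMerge {
    static int[] or(int[][] lists) {
        // Each heap entry is {currentDoc, listIndex, positionInList},
        // ordered by currentDoc so the smallest candidate is on top.
        PriorityQueue<int[]> pq =
            new PriorityQueue<int[]>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int i = 0; i < lists.length; i++) {
            if (lists[i].length > 0) pq.add(new int[] {lists[i][0], i, 0});
        }
        List<Integer> out = new ArrayList<Integer>();
        int last = -1;
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            if (top[0] != last) {          // collapse duplicate doc ids
                out.add(top[0]);
                last = top[0];
            }
            int next = top[2] + 1;         // advance the source list
            if (next < lists[top[1]].length) {
                pq.add(new int[] {lists[top[1]][next], top[1], next});
            }
        }
        int[] result = new int[out.size()];
        for (int i = 0; i < result.length; i++) result[i] = out.get(i);
        return result;
    }
}
```

Each next() costs O(log n) in the number of clauses, which is why the heap operations tend to dominate disjunction performance.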
Re: DisjunctionScorer performance
On Wednesday 07 January 2009 07:36:06 John Wang wrote: Hi guys: We have been building a suite of boolean operators DocIdSets (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator, NotDocIdSet/Iterator). We compared our implementation on the OrDocIdSetIterator (based on DisjunctionMaxScorer code) with some code tuning, and we see the performance doubled in our testing. That's good news. What data structure did you use for sorting by doc id? Currently a priority queue is used for that, and normally that is the bottleneck for performance. (we haven't done comparisons with ConjunctionScorer vs. AndDocIdSetIterator, will post numbers when we do) We'd be happy to contribute this back to the community. But what is the best way of going about it? option 1: merge our change into DisjunctionMax/SumScorers. option 2: contribute boolean operator sets, and have DisjunctionScorers derive from OrDocIdSetIterator, ConjunctionScorer derive from AndDocIdSetIterator etc. Option 2 seems to be cleaner. Thoughts? Some theoretical performance improvement is possible when the minimum number of required scorers/iterators is higher than 1, by using skipTo() (as much as possible) instead of next() in such cases. For the moment that's theoretical because there is no working implementation of this yet, but have a look at LUCENE-1345. I'm currently working on a DisjunctionDISI, probably the same function as the OrDocIdSetIterator you mentioned above. In case you have something faster than that, could you post it at LUCENE-1345 or at a new issue? An AndDocIdSetIterator could also be useful for the PhraseScorers and for the SpanNear queries, but that is of later concern. So I'd prefer option 2. Regards, Paul Elschot
Re: DisjunctionScorer performance
Paul: Our very simple/naive testing methodology for OrDocIdSetIterator: five sub-iterators, each iterating from 0 to 1,000,000. The test iterates the OrDocIdSetIterator until next() returns false. Do you want me to run the same test against DisjunctDisi? -John
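The benchmark John describes can be sketched as a self-contained harness like the one below. This is a stand-in, not the actual test code from the contribution: five identical sub-iterators over 0 to 1,000,000 are drained through a simple cursor-based disjunction until exhausted, and the wall-clock time is reported.

```java
// Rough, self-contained sketch of the described benchmark. All names are
// hypothetical; drain() is a deliberately simple "OR" merge used only to
// give the harness something to iterate.
class OrBenchSketch {
    // Naive disjunction drain: repeatedly emit the smallest current doc
    // and advance every cursor sitting on it; returns docs emitted.
    static int drain(int[][] lists) {
        int[] pos = new int[lists.length];
        int emitted = 0;
        while (true) {
            int min = Integer.MAX_VALUE;
            for (int i = 0; i < lists.length; i++) {
                if (pos[i] < lists[i].length && lists[i][pos[i]] < min) {
                    min = lists[i][pos[i]];
                }
            }
            if (min == Integer.MAX_VALUE) break;   // all cursors exhausted
            for (int i = 0; i < lists.length; i++) {
                if (pos[i] < lists[i].length && lists[i][pos[i]] == min) pos[i]++;
            }
            emitted++;
        }
        return emitted;
    }

    // Build the five identical 0..maxDoc sub-iterators and time the drain.
    static int runBench(int maxDoc, int subIterators) {
        int[] docs = new int[maxDoc + 1];
        for (int d = 0; d <= maxDoc; d++) docs[d] = d;
        int[][] subs = new int[subIterators][];
        for (int i = 0; i < subIterators; i++) subs[i] = docs;
        long t0 = System.nanoTime();
        int emitted = drain(subs);
        System.out.println("drained in " + (System.nanoTime() - t0) / 1_000_000 + " ms");
        return emitted;
    }
}
```

With fully overlapping inputs like these, the merge emits 1,000,001 unique docs; note that identical sub-iterators are a best case for duplicate collapsing and may flatter any implementation being measured.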
Re: DisjunctionScorer performance
One more thing I missed. I don't quite get your point about skip() vs. next(): with OR queries, skipping does not help as much as it does with AND queries. -John
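Paul's skipTo() point can be illustrated with a conjunction: when every clause is required, each list can jump directly to the other list's current candidate instead of calling next() one doc at a time. The sketch below uses a binary search as a stand-in for skipTo(); the names and structure are illustrative only, not Lucene's ConjunctionScorer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative "leapfrog" intersection ("AND") over two sorted doc-id
// lists, using skipTo-style jumps rather than one-doc-at-a-time next().
class AndMerge {
    static int[] and(int[] a, int[] b) {
        List<Integer> out = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {            // both lists agree: a match
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = skipTo(a, i, b[j]);    // leap a forward to b's candidate
            } else {
                j = skipTo(b, j, a[i]);    // leap b forward to a's candidate
            }
        }
        int[] result = new int[out.size()];
        for (int k = 0; k < result.length; k++) result[k] = out.get(k);
        return result;
    }

    // skipTo stand-in: first position at or beyond target, via binary
    // search (real skip lists or galloping would be used in practice).
    static int skipTo(int[] docs, int from, int target) {
        int lo = from, hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

This is why skipping pays off for AND but much less for OR, as noted above: a disjunction must still visit every doc in every clause, while a conjunction can leap over long runs of non-matching docs.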