[jira] Commented: (LUCENE-1227) NGramTokenizer to handle more than 1024 chars
[ https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661125#action_12661125 ] Grant Ingersoll commented on LUCENE-1227: - Yes, please do have a look and let us know what you think. NGramTokenizer to handle more than 1024 chars - Key: LUCENE-1227 URL: https://issues.apache.org/jira/browse/LUCENE-1227 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Hiroaki Kawai Assignee: Grant Ingersoll Priority: Minor Attachments: LUCENE-1227.patch, NGramTokenizer.patch, NGramTokenizer.patch The current NGramTokenizer can't handle a character stream longer than 1024 characters. This is too short for non-whitespace-separated languages. I created a patch for this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1509) IndexCommit.getFileNames() should not return dups
[ https://issues.apache.org/jira/browse/LUCENE-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661143#action_12661143 ] Shalin Shekhar Mangar commented on LUCENE-1509: --- Thanks Michael! IndexCommit.getFileNames() should not return dups - Key: LUCENE-1509 URL: https://issues.apache.org/jira/browse/LUCENE-1509 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4, 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1509.patch If the index was created with autoCommit false, and more than 1 segment was flushed during the IndexWriter session, then the shared doc-store files are incorrectly duplicated in IndexCommit.getFileNames(). This is because that method is walking through each SegmentInfo, appending its files to a list. Since multiple SegmentInfo's may share the doc store files, this causes dups. To fix this, I've added a SegmentInfos.files(...) method, and refactored all places that were computing their files one SegmentInfo at a time to use this new method instead.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661145#action_12661145 ] Michael McCandless commented on LUCENE-1483: Mark, I see 3 testcase failures in TestSort if I pretend that SortField.STRING means STRING_ORD -- do you see that? I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods? Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661148#action_12661148 ] Michael McCandless commented on LUCENE-1483: I prototyped a rough change to the FieldComparator API, whereby TopFieldCollector calls setBottom to notify the comparator which slot is the bottom of the queue (whenever it changes), and then calls compareBottom (which replaces compare(int slot, int doc, float score)). This seems to offer decent perf. gains so I think we should make this change for real? I think it gives good gains because 1) compare to bottom is very frequent for a search that has many hits, and where the queue fairly quickly converges to the top N, 2) it allows the on-demand comparator to pre-cache the bottom's ord, and 3) it saves one array deref. TopFieldCollector would guarantee that compareBottom is not called unless setBottom was called; during the startup transient, setBottom is not called until the queue becomes full.
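The setBottom/compareBottom protocol described in the comment above can be sketched roughly as follows. This is a hypothetical standalone comparator over int values, not the actual Lucene FieldComparator class; the class name and method signatures here are invented for illustration.

```java
// Hypothetical sketch of the setBottom/compareBottom idea: the collector
// tells the comparator which queue slot is currently the bottom, so the
// hot compare-against-bottom path works off a pre-fetched value and
// avoids one slot-array dereference per candidate doc.
public class IntSlotComparator {
    private final int[] slotValues; // value cached per queue slot
    private int bottomValue;        // pre-fetched value of the bottom slot

    public IntSlotComparator(int numHits) {
        slotValues = new int[numHits];
    }

    // Called by the collector whenever the bottom of the queue changes.
    // Guaranteed to be called before compareBottom (once the queue is full).
    public void setBottom(int slot) {
        bottomValue = slotValues[slot];
    }

    // Hot path: compare a candidate doc's value against the cached bottom.
    // Positive means the candidate beats the current bottom.
    public int compareBottom(int docValue) {
        return bottomValue - docValue;
    }

    // Called when a competitive doc enters the queue at the given slot.
    public void copy(int slot, int docValue) {
        slotValues[slot] = docValue;
    }
}
```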
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661149#action_12661149 ] Michael McCandless commented on LUCENE-1483: On what ComparatorPolicy to use by default... I think we should start with ORD, but gather counters of number of compares vs number of copies, and based on those counters (and comparing to numDocs()) decide how aggressively to switch comparators? That determination should also take into account the queue size. An optimized index would always use ORD (w/o gathering counters), which is fastest. In the future... we could imagine allowing the query to dictate the order that segments are visited. EG if the query can roughly estimate how many hits it'll get on a given segment, we could order by that instead of simply numDocs(). The query could also choose an appropriate ComparatorPolicy, eg, if it estimates it'll get very few hits, VAL is best right from the start, else start with ORD. Another future fix would be to implement ORDSUB with a single pass through the queue, using a reused secondary pqueue to do the full sort of the queue. This would let us assign subords much faster, I think. But I don't think we should pursue these optimizations as part of this issue... we need to bring closure here; we already have some solid gains to capture. I think we should wrapup now... 
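The counter-gathering idea floated in the comment above (track compares vs. copies, then decide how aggressively to switch comparators) might look like this sketch. ComparatorPolicy and the ORD/VAL names follow the discussion; the class shape and the thresholds are invented for illustration and are not from any patch.

```java
// Hypothetical counter-driven policy choice: count compares vs. copies
// (queue insertions) and suggest falling back from the ord-based
// comparator once insertions become rare relative to compares.
public class PolicyCounters {
    public enum Policy { ORD, VAL }

    private long compares;
    private long copies;

    public void onCompare() { compares++; }
    public void onCopy() { copies++; }

    // Invented heuristic: after enough compares, if fewer than 1 in 100
    // led to a queue insertion, the per-segment ord conversion cost of
    // ORD no longer pays off -- suggest switching to VAL.
    public Policy suggest(Policy current) {
        if (current == Policy.ORD && compares >= 1000 && copies * 100 < compares) {
            return Policy.VAL;
        }
        return current;
    }
}
```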
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661160#action_12661160 ] Mark Miller commented on LUCENE-1483: - bq. Mark, I see 3 testcase failures in TestSort if I pretend that SortField.STRING means STRING_ORD - do you see that? Yeah, sorry. That STRING_ORD custom comparator is just a joke really, so I only really tested it on the StringSort test. It's just not initing the ords along with the values on switching. Making ords package private so that it can be changed (and changing it) fixes things. Not sure about new constructors or package private for that part of the switch... bq. I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods? Yeah, this makes sense in any case. I just keep switching them by hand as I work on them.
[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661160#action_12661160 ] markrmil...@gmail.com edited comment on LUCENE-1483 at 1/6/09 6:57 AM: - bq. Mark, I see 3 testcase failures in TestSort if I pretend that SortField.STRING means STRING_ORD - do you see that? Yeah, sorry. That STRING_ORD custom comparator policy is just a joke really, so I only really tested it on the StringSort test. It's just not initing the ords along with the values on switching. Making ords package private so that it can be changed (and changing it) fixes things. Not sure about new constructors or package private for that part of the switch... bq. I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods? Yeah, this makes sense in any case. I just keep switching them by hand as I work on them.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661165#action_12661165 ] Mark Miller commented on LUCENE-1483: - There are other little conversion steps that have to be considered as well, I think. Like when you switch to the on-demand ord comparator, you won't have the readerIndex array filled in properly, etc. (probably an issue with that example policy in there beyond the ords copy). Depending on what you come from and what you go to, a couple little hoops have to be jumped through.
[jira] Commented: (LUCENE-1304) Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene
[ https://issues.apache.org/jira/browse/LUCENE-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661199#action_12661199 ] patrick o'leary commented on LUCENE-1304: - How will LUCENE-1483 impact this immediately? I'd really like to get this patch in first and refactor if and when 1483 goes in; the benefit of bypassing the static comparator cache is really needed. Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene Key: LUCENE-1304 URL: https://issues.apache.org/jira/browse/LUCENE-1304 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.3 Environment: Windows/JDK 1.6 Reporter: Ethan Tao Attachments: LUCENE-1304.patch We had a memory leak issue when using DistanceSortSource of LocalLucene for repeated query/search. After about 450 queries, we hit an out-of-memory error. After digging into the code, we found the source of the problem in the Lucene package, in the way it handles custom type comparators. Lucene internally caches all created comparators. When querying with LocalLucene, we create a new comparator for every search due to different lon/lat and query terms. This causes a major memory leak, as the cached comparators also hold memory for other large objects (e.g., bit sets). The solution we came up with (the proposed changes to Lucene are 1 and 3 below):

1. In the Lucene package, create a new file SortComparatorSourceUncacheable.java:

package org.apache.lucene.search;

import org.apache.lucene.index.IndexReader;
import java.io.IOException;
import java.io.Serializable;

public interface SortComparatorSourceUncacheable extends Serializable {
}

2. Have your custom sort class implement the interface:

public class LocalSortSource extends DistanceSortSource implements SortComparatorSourceUncacheable {
...
}

3. Modify Lucene's FieldSortedHitQueue.java to bypass caching for custom sort comparators:

Index: FieldSortedHitQueue.java
===
--- FieldSortedHitQueue.java (revision 654583)
+++ FieldSortedHitQueue.java (working copy)
@@ -53,7 +53,12 @@
     this.fields = new SortField[n];
     for (int i=0; i<n; ++i) {
       String fieldname = fields[i].getField();
-      comparators[i] = getCachedComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+
+      if (fields[i].getFactory() instanceof SortComparatorSourceUncacheable) { // no caching to avoid memory leak
+        comparators[i] = getComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+      } else {
+        comparators[i] = getCachedComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+      }
       if (comparators[i].sortType() == SortField.STRING) {
         this.fields[i] = new SortField (fieldname, fields[i].getLocale(), fields[i].getReverse());
@@ -157,7 +162,18 @@
   SortField[] getFields() { return fields; }

+  static ScoreDocComparator getComparator (IndexReader reader, String field, int type, Locale locale, SortComparatorSource factory)
+    throws IOException {
+    if (type == SortField.DOC) return ScoreDocComparator.INDEXORDER;
+    if (type == SortField.SCORE) return ScoreDocComparator.RELEVANCE;
+    FieldCacheImpl.Entry entry = (factory != null)
+      ? new FieldCacheImpl.Entry (field, factory)
+      : new FieldCacheImpl.Entry (field, type, locale);
+    return (ScoreDocComparator) Comparators.createValue(reader, entry);
+  }

Otis suggests that I put this in Jira. I'll attach a patch shortly for review. -Ethan
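The cache bypass in the patch above boils down to a marker-interface check before the comparator cache is consulted. Here is a minimal standalone illustration of that pattern; SortSource, UncacheableSortSource, and ComparatorCache are invented stand-ins, not the actual Lucene classes.

```java
import java.util.HashMap;
import java.util.Map;

// Invented stand-in for a comparator factory (not Lucene's SortComparatorSource).
interface SortSource {
    Object createComparator();
}

// Marker interface: implementors opt out of comparator caching,
// mirroring SortComparatorSourceUncacheable in the patch.
interface UncacheableSortSource extends SortSource {
}

class ComparatorCache {
    private final Map<SortSource, Object> cache = new HashMap<>();

    Object getComparator(SortSource source) {
        if (source instanceof UncacheableSortSource) {
            // Per-query sources (e.g. a distance sort with fresh lat/lon)
            // must not be cached, or the cache grows without bound.
            return source.createComparator();
        }
        return cache.computeIfAbsent(source, s -> s.createComparator());
    }

    int size() {
        return cache.size();
    }
}
```

Cacheable sources get one comparator per source instance; uncacheable ones get a fresh comparator every call and never touch the map.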
[jira] Created: (LUCENE-1512) Incorporate GeoHash in contrib/spatial
Incorporate GeoHash in contrib/spatial -- Key: LUCENE-1512 URL: https://issues.apache.org/jira/browse/LUCENE-1512 Project: Lucene - Java Issue Type: New Feature Components: contrib/spatial Reporter: patrick o'leary Priority: Minor Based on comments from Yonik and Ryan in SOLR-773. GeoHash provides the ability to store latitude / longitude values in a single consistent-hash field, which eliminates the need to maintain 2 field caches for the latitude / longitude fields, reducing the size of an index and the amount of memory needed for a spatial search.
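For reference, geohash encoding itself (per the Wikipedia algorithm the attached patch is based on) interleaves longitude and latitude range-bisection bits and emits base-32 characters. A minimal sketch, independent of the patch's GeoHashUtils:

```java
// Minimal geohash encoder: alternately bisect the longitude and latitude
// ranges, emit 1 if the point is in the upper half, and pack each 5 bits
// into one character of the geohash base-32 alphabet.
public class GeoHashSketch {
    private static final char[] BASE32 =
        "0123456789bcdefghjkmnpqrstuvwxyz".toCharArray();

    public static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // geohash starts with a longitude bit
        int bit = 0, ch = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; }
                else            { ch = ch << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; }
                else            { ch = ch << 1;       latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { // 5 bits per base-32 character
                hash.append(BASE32[ch]);
                bit = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }
}
```

Because the hash is a single string, one stored field (and one field cache) covers both coordinates, which is exactly the saving the issue description points at.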
[jira] Updated: (LUCENE-1512) Incorporate GeoHash in contrib/spatial
[ https://issues.apache.org/jira/browse/LUCENE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] patrick o'leary updated LUCENE-1512: Attachment: LUCENE-1512.patch spatial-lucene GeoHash implementation based on http://en.wikipedia.org/wiki/Geohash removable dependency on refactoring in LUCENE-1504
[jira] Commented: (LUCENE-1314) IndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661214#action_12661214 ] Michael McCandless commented on LUCENE-1314: {quote} The problem is the user may get into trouble by updating the stale reader, which was debated before. I got the impression ensuring the reader being updated was the latest was important. {quote} But: when one attempts to change a stale reader, that's caught when trying to acquire the write lock? (Ie during clone I think you don't need to also check for this). {quote} The cost of cloning them meaning the creating a new byte array {quote} Yeah, I was thinking of the CPU cost of copying the deleted docs / norms; I was just curious (I don't think we have to measure this before committing). {quote} I need to reread Marvin's tombstones which at first glance seemed to be an iterative approach to saving deletions that seems like a transaction log. Correct? {quote} Similar to a transaction log in that the size of what's written is in proportion to how many changes (deletions) you made. But different in that there is no other data structure (ie the tombstones *are* the representation of the deletes) and so the tombstones are used live (whereas a transaction log is typically played back on next startup after a failure). If we had tombstones to represent deletes in Lucene then any new deletions would not require any cloning of prior deletions. Ie there would be no copy-on-write. {quote} M.M.: SegmentReader.Norm now has two refCounts, and I think both are necessary. One tracks refs to the Norm instance itself and the other tracks refs to the byte[]. Can you add some comments explaining the difference (because it's confusing at first blush)? Byte[] referencing is used because a new norm object needs to be created for each clone, and the byte array is all that is needed for sharing between cloned readers.
The current norm referencing is for sharing between readers, whereas the byte[] referencing is for copy-on-write, which is independent of reader references. {quote} Got it. Can you put this into the javadocs in the Norm class? {quote} M.M.: In SegmentReader.doClose() you are failing to call deletedDocsCopyOnWriteRef.decRef(), so you have a refCount leak. Can you create a unit test that 1) opens reader 1, 2) does deletes on reader 1, 3) clones reader 1 -> reader 2, 4) closes reader 1, 5) deletes more docs with reader 1, and 6) asserts that the deletedDocs BitVector did not get cloned? First verify the test fails, then fix the bug... In regards to #5, the test cannot delete from reader 1 once it's closed. A method called TestIndexReaderClone.testSegmentReaderCloseReferencing was added to test this closing use case. {quote} Woops -- I meant 5) deletes more docs with reader 2. Test case looks good! Thanks. A few more comments: * Can you update javadocs of IndexReader.reopen to remove the warning about not doing modification operations? With copy-on-write you are now free to do deletes against the reopened reader with no impact to the reader you had reopened/cloned. * What is SegmentReader.doDecRef for? It seems dead? * SegmentReader.doUndeleteAll has 4 space indent (should be 2) * We have this in SegmentReader.reopenSegment:

{code}
if (deletedDocsRef == null)
  deletedDocsRef = new Ref();
else
  deletedDocsRef.incRef();
{code}

But I think if I clone a reader with no deletes, the clone then [incorrectly] has a deletedDocsRef set? Can you fix that code to keep the invariant that if deleteDocs is null, so is deletedDocsRef, and v/v? Can you sprinkle asserts to make sure that invariant always holds? * In SegmentReader.decRef we have if (deletedDocsRef != null && deletedDocsRef.refCount() > 1) deletedDocsRef.decRef(); -- but, you should not have to check if deletedDocsRef.refCount() > 1? Does something break when you remove that? (In which case I think we have a refCount bug lurking...)
* The norm cloning logic in SegmentReader.reopenSegment needs to be cleaned up... eg we first sweep through each Norm, incRef'ing it, and then make a 2nd pass to do the full clone. Really we should have if (doClone) up front and do a single pass? Also: I think we need that same logic to re-open the singleNormStream for the clone case as well. Hmm, in the non-single-norm stream case I think we also must re-open the norm file, rather than clone it, in Norm.clone(). I think if you 1) open reader 1 (do no searching w/ it), 2) clone it -> reader 2, 3) close reader 1, 4) try to do a search against a field that then needs to load norms, you'll hit an AlreadyClosedException, because you had a cloned IndexInput vs a newly reopened one? Can you add that test case? * Why was this needed: {code} if (doClone && normsDirty) {
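The copy-on-write refcounting under discussion in the message above can be illustrated in isolation. Ref and CowDeletedDocs here are simplified stand-ins (the actual SegmentReader uses a BitVector and tracks more state); the point is the invariant: clones share the bit set and bump a shared refcount, and a writer copies only when someone else still holds a reference.

```java
import java.util.BitSet;

// Simplified refcount holder, modeled on the Ref discussed above.
class Ref {
    private int refCount = 1;
    synchronized int refCount() { return refCount; }
    synchronized void incRef() { refCount++; }
    synchronized void decRef() { refCount--; }
}

// Hypothetical stand-in for a reader's deleted-docs state.
class CowDeletedDocs {
    BitSet deletedDocs;  // shared between clones until one of them writes
    Ref deletedDocsRef;

    CowDeletedDocs(BitSet bits, Ref ref) {
        deletedDocs = bits;
        deletedDocsRef = ref;
    }

    // Cloning shares the BitSet and increments the shared refcount.
    CowDeletedDocs cloneReader() {
        deletedDocsRef.incRef();
        return new CowDeletedDocs(deletedDocs, deletedDocsRef);
    }

    // Before mutating, copy only if another reader still shares the bits.
    void deleteDoc(int docNum) {
        if (deletedDocsRef.refCount() > 1) {
            deletedDocsRef.decRef();            // release our share of the old copy
            deletedDocs = (BitSet) deletedDocs.clone();
            deletedDocsRef = new Ref();         // private copy, refCount == 1
        }
        deletedDocs.set(docNum);
    }
}
```

With tombstones (as in the Marvin discussion above) this machinery would disappear, since new deletions would never mutate shared state.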
[jira] Created: (LUCENE-1513) fastss fuzzyquery
fastss fuzzyquery - Key: LUCENE-1513 URL: https://issues.apache.org/jira/browse/LUCENE-1513 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Code for doing fuzzy queries with the fastssWC algorithm. FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries. FastFuzzyQuery: similar to FuzzyQuery, except it queries the auxiliary index to retrieve a candidate list; this list is then verified with the Levenshtein algorithm. Sorry, but the code is a bit messy... what I'm actually using is very different from this, so it's pretty much untested, but at least you can see what's going on or fix it up.
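The Levenshtein verification step mentioned above is the standard dynamic-programming edit distance. A self-contained sketch (this is not the attached FastSS code, which only uses this exact check after pruning candidates via the auxiliary index):

```java
// Standard two-row dynamic-programming Levenshtein distance:
// the minimum number of insertions, deletions, and substitutions
// needed to turn string a into string b.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // empty prefix of a
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance from a-prefix to empty b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp; // roll the rows
        }
        return prev[b.length()];
    }
}
```

A candidate term from the auxiliary index would be accepted when its distance to the query term falls within the fuzzy query's edit-distance budget.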
[jira] Updated: (LUCENE-1513) fastss fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1513: Attachment: fastSSfuzzy.zip
[jira] Commented: (LUCENE-1512) Incorporate GeoHash in contrib/spatial
[ https://issues.apache.org/jira/browse/LUCENE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661223#action_12661223 ] Ryan McKinley commented on LUCENE-1512: --- This is awesome. thanks patrick!
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661238#action_12661238 ] Mark Miller commented on LUCENE-1483: - Here is essentially what that example policy has to be. We just have to create a good way to do the right conversion, I guess. I'll work on whatever you don't put up when you share your latest optimizations.

{code}
case SortField.STRING_ORD:
  return new ComparatorPolicy() {
    private FieldComparator comparator = new FieldComparator.StringOrdComparator(numHits, field);
    private boolean first = true;
    private boolean second = true;

    public FieldComparator nextComparator(FieldComparator oldComparator, IndexReader reader, int numHits, int numSlotsFull) throws IOException {
      if (first) {
        first = false;
        return comparator;
      } else if (second) {
        StringOrdValOnDemComparator comp = new FieldComparator.StringOrdValOnDemComparator(numHits, field);
        comp.values = ((FieldComparator.StringOrdComparator) comparator).values;
        comp.ords = ((FieldComparator.StringOrdComparator) comparator).ords;
        comp.currentReaderIndex = 1;
        comparator = comp;
        second = false;
        return comp;
      }
      return comparator;
    }
  };
{code}
[jira] Commented: (LUCENE-1512) Incorporate GeoHash in contrib/spatial
[ https://issues.apache.org/jira/browse/LUCENE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661241#action_12661241 ] Ryan McKinley commented on LUCENE-1512: --- Any chance you could make a new patch without SerialChainFilter moved to search? Should we make a new package for geohash-based things? org.apache.lucene.spatial.geohash - GeoHashUtils - GeoHashDistanceFilter Also, the spacing for GeoHashUtils should be 2 spaces rather than 4. Incorporate GeoHash in contrib/spatial -- Key: LUCENE-1512 URL: https://issues.apache.org/jira/browse/LUCENE-1512 Project: Lucene - Java Issue Type: New Feature Components: contrib/spatial Reporter: patrick o'leary Priority: Minor Attachments: LUCENE-1512.patch Based on comments from Yonik and Ryan in SOLR-773. GeoHash provides the ability to store latitude / longitude values in a single consistent-hash field, which eliminates the need to maintain two field caches for the latitude / longitude fields, reducing the size of the index and the amount of memory needed for a spatial search.
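To make the proposal concrete, here is a minimal sketch of geohash encoding: longitude and latitude bits are interleaved by repeated range bisection and packed 5 bits per base-32 character. This is a self-contained illustration, not the patch's GeoHashUtils API.

```java
public class GeoHashSketch {
    private static final char[] BASE32 =
        "0123456789bcdefghjkmnpqrstuvwxyz".toCharArray();

    /**
     * Interleave longitude and latitude range-bisection bits (longitude
     * first), emitting one base-32 character per 5 bits.
     */
    public static String encode(double lat, double lon, int precision) {
        double latLo = -90.0, latHi = 90.0, lonLo = -180.0, lonHi = 180.0;
        StringBuilder hash = new StringBuilder(precision);
        boolean evenBit = true; // even bit positions encode longitude
        int bits = 0, ch = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonLo + lonHi) / 2;
                ch <<= 1;
                if (lon >= mid) { ch |= 1; lonLo = mid; } else { lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                ch <<= 1;
                if (lat >= mid) { ch |= 1; latLo = mid; } else { latHi = mid; }
            }
            evenBit = !evenBit;
            if (++bits == 5) {
                hash.append(BASE32[ch]);
                bits = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Known reference point: 57.64911, 10.40744 encodes to "u4pruydqqvj"
        System.out.println(encode(57.64911, 10.40744, 11));
    }
}
```

Because the hash is a single prefix-comparable string, one indexed field can stand in for the two numeric field caches.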
[jira] Commented: (LUCENE-1504) SerialChainFilter should use DocSet API rather than deprecated BitSet API
[ https://issues.apache.org/jira/browse/LUCENE-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661249#action_12661249 ] Mark Miller commented on LUCENE-1504: - I think there are contrib dependency examples in the xml query parser and in the highlighter (which depends on MemoryIndex). SerialChainFilter should use DocSet API rather than deprecated BitSet API - Key: LUCENE-1504 URL: https://issues.apache.org/jira/browse/LUCENE-1504 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Reporter: Ryan McKinley Fix For: 2.9 Attachments: LUCENE-1504.patch, LUCENE-1504.patch From Erik's comments in LUCENE-1387: * Maybe the Filters should be using the DocIdSet API rather than the deprecated BitSet stuff? We can refactor that after it is committed, I suppose, but it is not something we want to leave like that. We should also look at moving SerialChainFilter out of the spatial contrib, since it is more generally useful than just spatial search.
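The migration Erik suggests is from materializing a java.util.BitSet to exposing a lazy iterator over matching doc ids. The following is a self-contained model of that shape (hypothetical names, not the real Lucene DocIdSet classes), showing how chained filters can be combined by walking iterators without allocating a bit set:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Self-contained model (hypothetical names, NOT the real Lucene classes) of
 * the iterator-based DocIdSet shape: a filter hands back a lazy iterator
 * over ascending doc ids instead of materializing a java.util.BitSet.
 */
public class DocIdSetSketch {

    /** Iterator over ascending doc ids; -1 signals exhaustion. */
    interface DocIdIterator {
        int nextDoc();
    }

    interface DocIdSet {
        DocIdIterator iterator();
    }

    /** A sparse doc-id set backed by a sorted int[] rather than a dense bit set. */
    static DocIdSet sortedArraySet(int... docs) {
        final int[] sorted = docs.clone();
        Arrays.sort(sorted);
        return () -> new DocIdIterator() {
            private int pos = 0;
            public int nextDoc() {
                return pos < sorted.length ? sorted[pos++] : -1;
            }
        };
    }

    /** Serial AND of two filters by walking both iterators in lockstep. */
    static List<Integer> intersect(DocIdSet a, DocIdSet b) {
        List<Integer> out = new ArrayList<>();
        DocIdIterator ia = a.iterator(), ib = b.iterator();
        int da = ia.nextDoc(), db = ib.nextDoc();
        while (da != -1 && db != -1) {
            if (da == db) {
                out.add(da);
                da = ia.nextDoc();
                db = ib.nextDoc();
            } else if (da < db) {
                da = ia.nextDoc();
            } else {
                db = ib.nextDoc();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Docs matching both chained filters; prints [3, 5].
        System.out.println(intersect(sortedArraySet(1, 3, 5, 7), sortedArraySet(3, 4, 5, 9)));
    }
}
```

The iterator contract is what lets sparse filters avoid the O(maxDoc) memory cost that a BitSet per filter implies.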
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661260#action_12661260 ] Ryan McKinley commented on LUCENE-1483: --- Any estimates on how far along this is? Is it close enough that the reasonably simple patch in LUCENE-1304 should wait? Or do you think it is worth waiting for this? I'm trying to get local lucene and solr to play nice (SOLR-773). The hoops you have to jump through to avoid memory leaks make the final code too strange and not reusable. Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661264#action_12661264 ] Mark Miller commented on LUCENE-1483: - I think we are wrapping up, but it may make sense to do 1304 anyway. That code will be deprecated, but if you use a custom comparator, it will use the deprecated code. The custom comparator will be removed in 3.0 I think, and you'd have to make a new comparator or comparator policy. So its probably best to do 1304 if we want it, just for the 2.9 release. - Mark Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1304) Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene
[ https://issues.apache.org/jira/browse/LUCENE-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661269#action_12661269 ] Mark Miller commented on LUCENE-1304: - The main impact is that most of that code will be deprecated. It will still be used for old custom comparators until 3.0 though, so it might be wise to consider this for 2.9 in the interim. Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene Key: LUCENE-1304 URL: https://issues.apache.org/jira/browse/LUCENE-1304 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.3 Environment: Windows/JDK 1.6 Reporter: Ethan Tao Attachments: LUCENE-1304.patch We hit a memory leak when using DistanceSortSource of LocalLucene for repeated query/search; after about 450 queries we got an out-of-memory error. After digging into the code, we found that the source of the problem is in the Lucene package, in the way it handles custom-type comparators. Lucene internally caches all created comparators. When querying with LocalLucene, we create a new comparator for every search due to the different lon/lat and query terms. This causes a major memory leak, as the cached comparators also hold memory for other large objects (e.g., bit sets). The solution we came up with (the proposed changes to Lucene are 1 and 3 below):

1. In the Lucene package, create a new file SortComparatorSourceUncacheable.java:
{code}
package org.apache.lucene.search;

import org.apache.lucene.index.IndexReader;

import java.io.IOException;
import java.io.Serializable;

public interface SortComparatorSourceUncacheable extends Serializable {
}
{code}
2. Have your custom sort class implement the interface:
{code}
public class LocalSortSource extends DistanceSortSource implements SortComparatorSourceUncacheable {
  ...
}
{code}
3. Modify Lucene's FieldSortedHitQueue.java to bypass caching for custom sort comparators:
{code}
Index: FieldSortedHitQueue.java
===================================================================
--- FieldSortedHitQueue.java	(revision 654583)
+++ FieldSortedHitQueue.java	(working copy)
@@ -53,7 +53,12 @@
     this.fields = new SortField[n];
     for (int i=0; i<n; ++i) {
       String fieldname = fields[i].getField();
-      comparators[i] = getCachedComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+
+      if (fields[i].getFactory() instanceof SortComparatorSourceUncacheable) { // no caching to avoid memory leak
+        comparators[i] = getComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+      } else {
+        comparators[i] = getCachedComparator (reader, fieldname, fields[i].getType(), fields[i].getLocale(), fields[i].getFactory());
+      }
       if (comparators[i].sortType() == SortField.STRING) {
         this.fields[i] = new SortField (fieldname, fields[i].getLocale(), fields[i].getReverse());
@@ -157,7 +162,18 @@
   SortField[] getFields() { return fields; }
-
+
+  static ScoreDocComparator getComparator (IndexReader reader, String field, int type, Locale locale, SortComparatorSource factory)
+    throws IOException {
+    if (type == SortField.DOC) return ScoreDocComparator.INDEXORDER;
+    if (type == SortField.SCORE) return ScoreDocComparator.RELEVANCE;
+    FieldCacheImpl.Entry entry = (factory != null)
+      ? new FieldCacheImpl.Entry (field, factory)
+      : new FieldCacheImpl.Entry (field, type, locale);
+    return (ScoreDocComparator) Comparators.createValue(reader, entry);
+  }
{code}
Otis suggests that I put this in Jira. I'll attach a patch shortly for review. -Ethan
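The heart of the proposed fix is a marker interface checked with instanceof before the cache is consulted, so per-query factories never accumulate entries. A self-contained sketch of that pattern (hypothetical names, not the actual FieldSortedHitQueue code):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Self-contained sketch (hypothetical names) of the marker-interface fix:
 * sources implementing Uncacheable are built fresh on every lookup and
 * never stored, so per-query factories cannot accumulate in the cache.
 */
public class ComparatorCacheSketch {
    public interface ComparatorSource { String newComparator(); }

    /** Marker interface: implementors opt out of caching. */
    public interface Uncacheable {}

    private final Map<ComparatorSource, String> cache = new HashMap<>();

    public String getComparator(ComparatorSource src) {
        if (src instanceof Uncacheable) {
            return src.newComparator(); // bypass: no cache entry retained
        }
        return cache.computeIfAbsent(src, ComparatorSource::newComparator);
    }

    public int cacheSize() { return cache.size(); }

    public static class CachedSource implements ComparatorSource {
        public String newComparator() { return "cached"; }
    }

    /** Stands in for a per-search source like a new DistanceSortSource each query. */
    public static class PerQuerySource implements ComparatorSource, Uncacheable {
        public String newComparator() { return "fresh"; }
    }

    public static void main(String[] args) {
        ComparatorCacheSketch c = new ComparatorCacheSketch();
        for (int i = 0; i < 100; i++) c.getComparator(new PerQuerySource());
        System.out.println(c.cacheSize()); // stays 0: nothing leaked
        ComparatorSource shared = new CachedSource();
        c.getComparator(shared);
        c.getComparator(shared);
        System.out.println(c.cacheSize()); // 1: one shared entry reused
    }
}
```

The same instanceof test is exactly what the diff above adds before calling getCachedComparator.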
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1483: --- Attachment: LUCENE-1483-partial.patch Attached prototype changes to switch to setBottom and compareBottom API for FieldComparator, but, I only included the few files I modified over the last patch, and it does not pass TestSort when I switch to it (fails the same tests ORD fails on). Mark can you switch the comparators to this new API (and remove the compare(int, int, float) method) and fix the test failures? Once that passes tests, I'll re-run perf test and we can tune the default policy. I think we are close! Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661295#action_12661295 ] Michael McCandless commented on LUCENE-1483: {quote} Not sure about new constructors or package private for that part of the switch... {quote} Could we just make ctors on each comparator that take the other comparator and copy over what they need? This way we can make attrs private final again, in case that helps the JRE optimize. Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1513) fastss fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661302#action_12661302 ] Otis Gospodnetic commented on LUCENE-1513: -- I feel like I missed some FastSS discussion on the list; was there one? I took a quick look at the paper and the code. Is the following the general idea?
# Index fuzzy/misspelled terms in addition to the normal terms (= larger index, slower indexing). How much fuzziness one wants to allow or handle is decided at index time.
# Rewrite the query to include variations/misspellings of each term and use that to search (= more clauses, slower than a normal search, but faster than the normal fuzzy query, whose speed depends on the number of indexed terms).
Quick code comments:
* Need to add the ASL
* Need to replace tabs with 2 spaces and fix formatting in FuzzyHitCollector
* No @author
* Unit test if possible
* Should FastSSwC not be able to take a variable K?
* Should variables named after types (e.g. set in public static String getNeighborhoodString(Set<String> set) {) be renamed so they describe what's in them instead? (easier-to-understand API?)
fastss fuzzyquery - Key: LUCENE-1513 URL: https://issues.apache.org/jira/browse/LUCENE-1513 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Attachments: fastSSfuzzy.zip Code for doing fuzzy queries with the FastSSwC algorithm. FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries. FastFuzzyQuery: similar to FuzzyQuery, except it queries the auxiliary index to retrieve a candidate list; this list is then verified with the Levenshtein algorithm. Sorry, but the code is a bit messy... what I'm actually using is very different from this, so it's pretty much untested. But at least you can see what's going on or fix it up.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661304#action_12661304 ] Michael McCandless commented on LUCENE-1483: {quote} I'm trying to get local lucene and solr to play nice (SOLR-773). The hoops you have to jump through to avoid memory leaks make the final code too strange and not reusable. {quote} With this patch we are changing how custom sorting works. Previously, Lucene would iterate the terms for you, asking you to produce a Comparable for each one. With this patch, we are asking you to implement FieldComparator, which compares docs/slots directly and must be aware of switching sub-readers during searching. Ryan, can you have a look at FieldComparator to see if it works for local lucene (and any other feedback on it)? I think the best outcome here would be to get this issue done, and then get local lucene switched over to this new API (so local lucene sees the benefits of the new API, and sidesteps the memory leak in LUCENE-1304). We may still need to do LUCENE-1304 in case others hit the memory leak of the old custom sort API. 
Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
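The comparator contract described here, slot-based compares plus awareness of sub-reader switches (and the setBottom/compareBottom methods mentioned in the partial patch above), can be modeled in miniature. This is a hypothetical sketch, not the real FieldComparator signatures:

```java
/**
 * Minimal model of a slot-based, segment-aware comparator. The real
 * FieldComparator has more methods; this keeps just enough to show
 * per-segment docBase handling and the setBottom/compareBottom idea.
 */
public class SegmentComparatorSketch {
    private final int[] slots;   // values copied out of the current segment
    private int[] segmentValues; // this segment's per-doc values (stands in for a FieldCache array)
    private int docBase;         // global id of this segment's first doc
    private int bottom;          // value of the weakest queue entry

    public SegmentComparatorSketch(int numHits) { slots = new int[numHits]; }

    /** Called when the search advances to the next sub-reader. */
    public void setNextReader(int[] values, int docBase) {
        this.segmentValues = values;
        this.docBase = docBase;
    }

    /** Copy the (segment-relative) doc's value into a queue slot. */
    public void copy(int slot, int segmentDoc) { slots[slot] = segmentValues[segmentDoc]; }

    /** Compare two queue slots; never two raw docs from different segments. */
    public int compare(int slot1, int slot2) { return Integer.compare(slots[slot1], slots[slot2]); }

    public void setBottom(int slot) { bottom = slots[slot]; }

    /** Compare the queue bottom against a doc in the current segment. */
    public int compareBottom(int segmentDoc) { return Integer.compare(bottom, segmentValues[segmentDoc]); }

    public int globalDoc(int segmentDoc) { return docBase + segmentDoc; }

    public static void main(String[] args) {
        SegmentComparatorSketch cmp = new SegmentComparatorSketch(2);
        cmp.setNextReader(new int[] {42, 7}, 0);   // segment 1: docs 0..1
        cmp.copy(0, 1);                            // keep doc 1 (value 7) in slot 0
        cmp.setBottom(0);
        cmp.setNextReader(new int[] {3}, 2);       // segment 2: its doc 0 is global doc 2
        cmp.copy(1, 0);                            // candidate from the new segment
        System.out.println(cmp.compare(1, 0) < 0); // true: 3 sorts before 7
        System.out.println(cmp.globalDoc(0));      // 2
    }
}
```

The key shift from the old API: values are compared by slot rather than by producing a Comparable per term, and docBase is what maps segment-local hits back to global doc ids.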
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661306#action_12661306 ] Mark Miller commented on LUCENE-1483: - bq. Could we just make ctors on each comparator that take the other comparator and copy over what they need? This way we can make attrs private final again, in case that helps the JRE optimize. Right, good idea. I'll get everything together and put up a patch. Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1513) fastss fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661314#action_12661314 ] Robert Muir commented on LUCENE-1513: - Otis, discussion was on java-user. Again, I apologize for the messy code. As mentioned there, my setup is very specific to exactly what I am doing, and in no way is this code ready. But since I'm currently pretty busy with other things at work, I just wanted to put something up here for anyone else interested. There are the issues you mentioned, and also some I mentioned on java-user: for example, how to handle updates to indexes that introduce new terms (they must be added to the auxiliary index), or even whether an auxiliary index is the best approach. The general idea is that instead of enumerating terms to find matches, the deletion neighborhood as described in the paper is used instead; this way search time is not linear in the number of terms. Yes, you are correct that it can only guarantee edit distances up to K, which is determined at index time. Perhaps this should be configurable, but I hardcoded k=1 for simplicity; I think that covers something like 80% of typos... As I mentioned on the list, another idea is that you could implement FastSS (not the wC variant) with deletion positions, maybe by using payloads. This would require more space but eliminate the candidate-verification step. Maybe it would also be nice to have some of their other algorithms, such as block-based, etc., available. fastss fuzzyquery - Key: LUCENE-1513 URL: https://issues.apache.org/jira/browse/LUCENE-1513 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority: Minor Attachments: fastSSfuzzy.zip Code for doing fuzzy queries with the FastSSwC algorithm. FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries. FastFuzzyQuery: similar to FuzzyQuery, except it queries the auxiliary index to retrieve a candidate list; this list is then verified with the Levenshtein algorithm. Sorry, but the code is a bit messy... what I'm actually using is very different from this, so it's pretty much untested. But at least you can see what's going on or fix it up.
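The candidate-verification step discussed in this issue is a plain Levenshtein edit-distance check over the candidate list. A minimal sketch of that check (standard two-row dynamic programming, not the attached code):

```java
/**
 * Classic dynamic-programming Levenshtein distance, as would be used to
 * verify candidates retrieved from the auxiliary index. Two rolling rows
 * keep memory at O(|b|) instead of O(|a|*|b|).
 */
public class LevenshteinSketch {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // distance from empty prefix
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] t = prev; prev = cur; cur = t; // roll the rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("best", "jest"));    // 1 (one substitution)
        System.out.println(distance("robert", "obert")); // 1 (one deletion)
    }
}
```

A candidate survives verification when distance(query, candidate) <= k, the same k used when the deletion neighborhood was indexed.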
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
Why not just create a new field for this? That is, if you have FieldA, create field FieldAFuzzy and put the various permutations there. The fuzzy scorer/parser can be changed to automatically use the Fuzzy field when required. You could also store positions, and allow that the first term is the closest, the next is the second closest, etc., to add support for a slop factor. This is similar to the way fast phonetic searches can be implemented. If you do it this way, you don't have any of the synchronization issues between the index and the external fuzzy index. On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote: ...
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
A deletion neighborhood can be pretty large (for example, robert expands to something like robert, obert, rbert, robrt, robet, ...), so if you have 100 million docs with 1 billion words but only 100k unique terms, it would definitely be wasteful to store 1 billion deletion neighborhoods when you only need 100k. On Tue, Jan 6, 2009 at 4:02 PM, robert engels reng...@ix.netcom.com wrote: ... -- Robert Muir rcm...@gmail.com
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
I don't think that is the case. You will have a single deletion neighborhood. The number of unique terms in the field is going to be the union of the deletion dictionaries of each source term. For example, take document A, which has field 'X' with value best, and document B with value jest (and k == 1). A will generate best, est, bst, bet, bes; B will generate jest, est, jst, jet, jes. So field FieldXFuzzy contains (est:AB, best:A, bst:A, bet:A, bes:A, jest:B, jst:B, jet:B, jes:B). I don't think the storage requirement is any greater doing it this way. From the paper: 3.2.1 Indexing. For all words in a dictionary, and a given number of edit operations k, FastSS generates all variant spellings recursively and saves them as tuples of type v′ ∈ Ud(v, k) → (v, x), where v is a dictionary word and x a list of deletion positions. Theorem 5. The index uses O(nm^(k+1)) space, as it stores all the variants for n dictionary words of length m with k mismatches. 3.2.2 Retrieval. For a query p and edit distance k, first generate the neighborhood Ud(p, k). Then compare the words in the neighborhood with the index, and find matching candidates. Compare deletion positions for each candidate with the deletion positions in U(p, k), using Theorem 4.
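The k=1 deletion-neighborhood construction under discussion, the word itself plus every single-character deletion, is easy to sketch (a toy example, not the FastSS implementation):

```java
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * k=1 deletion neighborhood per the FastSS construction: the word itself
 * plus every string obtained by deleting one character.
 */
public class DeletionNeighborhoodSketch {
    public static Set<String> neighborhood(String word) {
        Set<String> out = new LinkedHashSet<>();
        out.add(word);
        for (int i = 0; i < word.length(); i++) {
            // delete the character at position i
            out.add(word.substring(0, i) + word.substring(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> best = neighborhood("best"); // [best, est, bst, bet, bes]
        Set<String> jest = neighborhood("jest"); // [jest, est, jst, jet, jes]
        best.retainAll(jest);
        System.out.println(best); // [est] -- the shared variant that links the two terms
    }
}
```

Two terms are fuzzy-match candidates exactly when their neighborhoods intersect, which is why est:AB appears once in the merged field above.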
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
i see, your idea would definitely simplify some things. What about the index size difference between this approach and using a separate index? Would this separate field increase index size? I guess my line of thinking is: if you have 10 docs with robert, with a separate index you just have robert and its deletion neighborhood one time. with this approach you have the same thing, but you must also have document numbers and the other inverted index stuff with each neighborhood term. would this be a significant change to size and/or performance? and since the documents have multiple terms there is additional positional information for slop factor for each neighborhood term... i think it's worth investigating, maybe performance would actually be better, just curious. i think i boxed myself in to an auxiliary index because of some other irrelevant things i am doing. On Tue, Jan 6, 2009 at 4:42 PM, robert engels reng...@ix.netcom.com wrote: I don't think that is the case. You will have a single deletion neighborhood. The number of unique terms in the field is going to be the union of the deletion dictionaries of each source term. For example, given the following documents: A, which has field 'X' with value best, and document B with value jest (and k == 1). A will generate est, bst, bet, bes; B will generate est, jest, jst, jes; so field FieldXFuzzy contains (est:AB, bst:A, bet:A, bes:A, jest:B, jst:B, jes:B). I don't think the storage requirement is any greater doing it this way. 3.2.1 Indexing: For all words in a dictionary, and a given number of edit operations k, FastSS generates all variant spellings recursively and saves them as tuples of type v' ∈ Ud(v, k) → (v, x), where v is a dictionary word and x a list of deletion positions. Theorem 5. The index uses O(nm^(k+1)) space, as it stores all the variants for n dictionary words of length m with k mismatches. 3.2.2 Retrieval: For a query p and edit distance k, first generate the neighborhood Ud(p, k). Then compare the words in the neighborhood with the index, and find matching candidates. Compare the deletion positions for each candidate with the deletion positions in Ud(p, k), using Theorem 4. -- Robert Muir rcm...@gmail.com
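The deletion-neighborhood generation quoted from the FastSS paper can be sketched in a few lines. The class and method names below are invented for illustration and are not taken from any patch on this issue; the neighborhood here includes the word itself plus every single-character deletion (k=1).

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of k=1 deletion-neighborhood generation as described in the
// thread above. Names are illustrative only.
public class DeletionNeighborhood {
    // Returns the word itself plus every string obtained by deleting
    // exactly one character (the Ud(v, 1) neighborhood).
    public static Set<String> neighborhood(String word) {
        Set<String> result = new HashSet<>();
        result.add(word);
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // best and jest share the variant "est", so both documents end up
        // behind that one indexed term, as in the FieldXFuzzy example.
        System.out.println(neighborhood("best"));
        System.out.println(neighborhood("jest"));
    }
}
```

Two words whose neighborhoods intersect (for example LUCENE and LUBENE through LUENE) are candidate matches within edit distance 1.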
[jira] Resolved: (LUCENE-1502) CharArraySet behaves inconsistently in add(Object) and contains(Object)
[ https://issues.apache.org/jira/browse/LUCENE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1502. Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed revision 732141. Thanks Shai! CharArraySet behaves inconsistently in add(Object) and contains(Object) --- Key: LUCENE-1502 URL: https://issues.apache.org/jira/browse/LUCENE-1502 Project: Lucene - Java Issue Type: Bug Components: Analysis Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.4.1, 2.9 Attachments: LUCENE-1502.patch CharArraySet's add(Object) method looks like this: if (o instanceof char[]) { return add((char[])o); } else if (o instanceof String) { return add((String)o); } else if (o instanceof CharSequence) { return add((CharSequence)o); } else { return add(o.toString()); } You'll notice that in the case of an Object (for example, Integer), the o.toString() is added. However, its contains(Object) method looks like this: if (o instanceof char[]) { char[] text = (char[])o; return contains(text, 0, text.length); } else if (o instanceof CharSequence) { return contains((CharSequence)o); } return false; In case of contains(Integer), it always returns false. 
I've added a simple test to TestCharArraySet which reproduces the problem: public void testObjectContains() { CharArraySet set = new CharArraySet(10, true); Integer val = new Integer(1); set.add(val); assertTrue(set.contains(val)); assertTrue(set.contains(new Integer(1))); } Changing contains(Object) to the following solves the problem: if (o instanceof char[]) { char[] text = (char[])o; return contains(text, 0, text.length); } return contains(o.toString()); The patch also includes a few minor improvements (which were discussed on the mailing list), such as the removal of the following dead code from getHashCode(CharSequence): if (false && text instanceof String) { code = text.hashCode(); and simplifying add(Object): if (o instanceof char[]) { return add((char[])o); } return add(o.toString()); (which also aligns it with the equivalent contains() method). One thing that's still left open is whether we can avoid the Character.toLowerCase calls in all the char[] methods by first converting the char[] to lowercase, and then passing it through the equals() and getHashCode() methods. It works for add(), but fails for contains(char[]) since it modifies the input array. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
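The asymmetry can be reproduced with a stripped-down stand-in that keeps a HashSet&lt;String&gt; in place of CharArraySet's char[] storage. This is only an illustration of the control flow described in the issue, not the CharArraySet code itself:

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in showing the add(Object)/contains(Object) asymmetry:
// add() falls back to o.toString(), but the broken contains() returns
// false for anything that is not a char[] or CharSequence.
public class AsymmetricSet {
    private final Set<String> backing = new HashSet<>();

    public boolean add(Object o) {
        if (o instanceof char[]) return backing.add(new String((char[]) o));
        return backing.add(o.toString());          // Integer is stored as "1"
    }

    public boolean containsBroken(Object o) {
        if (o instanceof char[]) return backing.contains(new String((char[]) o));
        if (o instanceof CharSequence) return backing.contains(o.toString());
        return false;                              // Integer always misses
    }

    public boolean containsFixed(Object o) {
        if (o instanceof char[]) return backing.contains(new String((char[]) o));
        return backing.contains(o.toString());     // mirrors add(Object)
    }

    public static void main(String[] args) {
        AsymmetricSet set = new AsymmetricSet();
        Integer val = 1;
        set.add(val);
        System.out.println(set.containsBroken(val)); // false
        System.out.println(set.containsFixed(val));  // true
    }
}
```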
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
It is definitely going to increase the index size, but not any more than the external one would (if my understanding is correct). The nice thing is that you don't have to try to keep document numbers in sync - it will be automatic. Maybe I don't understand what your external index is storing. Given that the document contains 'robert' but the user enters 'obert', what is the process to find the matching documents? Is the external index essentially a constant list, so that given obert, the source words COULD BE robert, tobert, reobert etc., and it contains no document information? That is: given the source word X and an edit distance k, you ask the external dictionary for possible indexed words, it returns the list, and then you search lucene using each of those words? If the above is the case, it certainly seems you could generate this list in real-time rather efficiently with no IO (unless the external index only stores words which HAVE BEEN indexed). I think the confusion may be because I understand Otis's comments, but they don't seem to match what you are stating. Essentially performing any term match requires efficient searching/matching of the term index. If this is efficient enough, I don't think either process is needed - just an improved real-time fuzzy possibilities word generator.
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
On Tue, Jan 6, 2009 at 5:15 PM, robert engels reng...@ix.netcom.com wrote: Maybe I don't understand what your external index is storing. Given that the document contains 'robert' but the user enters 'obert', what is the process to find the matching documents? here's a simple example. the neighborhood stored for robert is 'robert obert rbert roert ...' and this is indexed in a tokenized field. at query time the user typos robert and enters 'tobert'. again a neighborhood is generated: 'tobert obert tbert ...' the system does a query on tobert OR obert OR tbert ... and robert is returned because 'obert' is present in both neighborhoods. in this example, by storing k=1 deletions you guarantee to satisfy all edit distance matches <= 1 without a linear scan. you get some false positives too with this approach, that's why what comes back is only a CANDIDATE and true edit distance must be used to verify. this might be tricky to do with your method, i don't know. Is the external index essentially a constant list, that given obert, the source words COULD BE robert, tobert, reobert etc., and it contains no document information? no. see above: you generate all possible 1-character deletions of the index term and store them, then at query time you generate all possible 1-character deletions of the query term. basically, LUCENE and LUBENE are 1 character different, but they are the same (LUENE) if you delete 1 character from both of them. so you don't need to store LUCENE LUBENE LUDENE, you just store LUENE. -- Robert Muir rcm...@gmail.com
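The candidate-then-verify flow described above can be sketched as follows. A candidate term shares at least one k=1 deletion variant with the query, and a plain Levenshtein computation stands in here for whatever verification an actual implementation would use; all names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of FastSS-style candidate retrieval with edit-distance
// verification, per the thread above. Illustrative names only.
public class FuzzyCandidates {
    // Word plus all single-character deletions (the k=1 neighborhood).
    static Set<String> neighborhood(String w) {
        Set<String> s = new HashSet<>();
        s.add(w);
        for (int i = 0; i < w.length(); i++)
            s.add(w.substring(0, i) + w.substring(i + 1));
        return s;
    }

    // A candidate shares at least one k=1 deletion variant with the query.
    static boolean isCandidate(String indexed, String query) {
        Set<String> shared = neighborhood(indexed);
        shared.retainAll(neighborhood(query));
        return !shared.isEmpty();
    }

    // True edit distance, used to filter false positives among candidates.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1]
                                + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // 'robert' and 'tobert' share the deletion variant 'obert',
        // so 'robert' comes back as a candidate; verification confirms it.
        System.out.println(isCandidate("robert", "tobert")); // true
        System.out.println(editDistance("robert", "tobert")); // 1
    }
}
```

Candidates that pass the neighborhood test but fail the editDistance check are exactly the false positives discussed above.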
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
To clarify a statement in the last email: generating the 'possible source words' in real-time is not as difficult as it first seems, if you assume some sort of first-character prefix (which is what it appears google does). For example, assume the user typed 'robrt' instead of 'robert'. You see that this word has very low frequency (or none), so you want to find possible misspellings, so you do a fuzzy search starting with r. But this search can be optimized, because as the edit/delete position moves to the right, the prefix remains the same, so these possibilities can be quickly skipped. If you don't find any words with high enough frequency as possible edit distances, try [a-z]robrt, assuming the user may have dropped the first character (possibly trying these in known combination order rather than alphabetical, i.e. try sr before nr). For example, searching google for 'robrt engels' works. So does 'obert engels', and so does 'robt engels' - all ask me if I meant 'robert engels' - but searching for 'obrt engels' does not.
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
I understand now. The index in my case would definitely be MUCH larger, but I think it would perform better, as you only need to do a single search - for obert (if you assume it was a misspelling). In your case you would eventually do an OR search in the lucene index for all possible matches (robert, roberta, roberto, ...), which could be much larger with some commonly prefixed/postfixed words. Classic performance vs. size trade-off. In your case, where it is not for misspellings, the performance difference might be worthwhile. Still, in your case, I am not sure using a Lucene index as the external index is appropriate. Maybe a simple BTREE (Derby?) index of (word, edit permutation), with a key on both, would allow easy search and update. If implemented as a service, some intelligent caching of common misspellings could really improve the performance.
Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery
robert, there's only one problem i see: i don't see how you can do a single search, since fastssWC returns some false positives (with k=1 it will still return some things with an ED of 2). maybe if you store the deletion position information as a payload (thus using original fastss, where there are no false positives) it would work though. i looked at storing position information but it appeared like it might be complex, and the api was (is) still marked experimental, so i didn't go that route. i also agree a lucene index might not be the best possible data structure... just convenient, that's all. i used it because i store other things related to the term besides deletion neighborhoods for my fuzzy matching. i guess i'll also mention that i do think storage size should be a big consideration. you really don't need this kind of stuff unless you are searching pretty big indexes in the first place (for <= a few million docs the default fuzzy is probably just fine for a lot of people). for me, the whole thing was about turning 30-second queries into 1-second queries by removing a linear algorithm; i didn't really optimize much beyond that because i was just very happy to have reasonable performance.
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661390#action_12661390 ] Mark Miller commented on LUCENE-1483: - Can't seem to use the partial patch, but I'll try to put in by hand. Just gotta remember to make sure I don't miss anything. Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Priority: Minor Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661394#action_12661394 ] Mark Miller commented on LUCENE-1483: - bq. I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods? bq. Yeah, this makes sense in any case. I just keep switching them by hand as I work on them. In thinking about this, we are going to drop those other sort types though right? I figured we would still just have String, and the comparator policy for String would pick the right comparators rather than the sort type?
[jira] Updated: (LUCENE-1314) IndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1314: - Attachment: LUCENE-1314.patch Everything in the previous post should be working and completed. TestIndexReaderReopen.testThreadSafety is hitting a bug in the deletedDocs referencing, which is related to
{code}
if (!success) {
  // An exception occurred during reopen; we have to decRef the norms
  // that we incRef'ed already and close singleNormsStream and FieldsReader
  clone.decRef();
}
{code}
at the bottom of SegmentReader.reopenSegment. I am finished for the day and have posted what is completed otherwise. Similar to a transaction log in that the size of what's written is proportional to how many changes (deletions) you made. But different in that there is no other data structure (ie the tombstones are the representation of the deletes), and so the tombstones are used live (whereas a transaction log is typically played back on the next startup after a failure). If we had tombstones to represent deletes in Lucene, then any new deletions would not require any cloning of prior deletions. Ie there would be no copy-on-write. Definitely interesting - how do tombstones work with BitVector? I changed Norm.clone to Norm.cloneNorm because it needs to throw an IOException; the clone interface does not allow exceptions, and it's hidden inside of SegmentReader so the naming conventions should not matter.
IndexReader.clone - Key: LUCENE-1314 URL: https://issues.apache.org/jira/browse/LUCENE-1314 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.3.1 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch Based on discussion http://www.nabble.com/IndexReader.reopen-issue-td18070256.html. The problem is reopen returns the same reader if there are no changes, so if docs are deleted from the new reader, they are also reflected in the previous reader which is not always desired behavior. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
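The tombstone idea from the comment above can be illustrated with a toy sketch: deletions are appended to a list, a reader snapshot is just the list length at open time, and later deletions never force a copy of earlier ones. All names are invented for this illustration; this is not how Lucene's BitVector-based deletes work.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of tombstone-style deletes: append-only, so a reader
// snapshot requires no copy-on-write of prior deletions.
public class TombstoneDeletes {
    private final List<Integer> tombstones = new ArrayList<>();

    // Recording a delete is an O(1) append; nothing is cloned.
    void delete(int docId) { tombstones.add(docId); }

    // A snapshot is just the current length; deletes appended later are
    // invisible to readers holding an older snapshot.
    int snapshot() { return tombstones.size(); }

    // Linear scan for clarity; a real structure would need fast lookup.
    boolean isDeleted(int docId, int snapshot) {
        for (int i = 0; i < snapshot; i++)
            if (tombstones.get(i) == docId) return true;
        return false;
    }

    public static void main(String[] args) {
        TombstoneDeletes d = new TombstoneDeletes();
        d.delete(3);
        int snap = d.snapshot();   // reader "opened" here
        d.delete(7);               // later delete, not visible to snap
        System.out.println(d.isDeleted(3, snap)); // true
        System.out.println(d.isDeleted(7, snap)); // false
    }
}
```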
Re: TestIndexInput test failures on jdk 1.6/linux after r641303
Michael McCandless wrote: I'll remove those 2 test cases. The build now works perfectly. Thanks Mike! -- Sami Siren - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
DisjunctionScorer performance
Hi guys: We have been building a suite of boolean-operator DocIdSets (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator, NotDocIdSet/Iterator). We compared our OrDocIdSetIterator implementation (based on the DisjunctionMaxScorer code, with some tuning) against the existing code, and we saw performance double in our testing. (We haven't yet compared ConjunctionScorer vs. AndDocIdSetIterator; we will post numbers when we do.) We'd be happy to contribute this back to the community, but what is the best way of going about it? Option 1: merge our changes into DisjunctionMax/SumScorers. Option 2: contribute the boolean-operator sets, and have the DisjunctionScorers derive from OrDocIdSetIterator, ConjunctionScorer derive from AndDocIdSetIterator, etc. Option 2 seems cleaner. Thoughts? Thanks -John
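For readers following along, here is a minimal sketch of what a heap-based disjunction iterator does (class and method names are made up for illustration; this is not the contributed code): sorted per-clause doc-id lists are merged through a priority queue keyed on the current doc id, with duplicate doc ids collapsed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative heap-based disjunction ("OR") over sorted doc-id lists.
// Names are hypothetical; this sketches the approach, not the actual
// OrDocIdSetIterator being discussed.
class OrMerge {
    static int[] or(int[][] lists) {
        // Each heap entry is {currentDoc, listIndex, positionInList},
        // ordered by currentDoc so the smallest candidate is on top.
        PriorityQueue<int[]> pq =
            new PriorityQueue<int[]>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int i = 0; i < lists.length; i++) {
            if (lists[i].length > 0) pq.add(new int[] {lists[i][0], i, 0});
        }
        List<Integer> out = new ArrayList<Integer>();
        int last = -1;
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            if (top[0] != last) {          // collapse duplicate doc ids
                out.add(top[0]);
                last = top[0];
            }
            int next = top[2] + 1;         // advance the source list
            if (next < lists[top[1]].length) {
                pq.add(new int[] {lists[top[1]][next], top[1], next});
            }
        }
        int[] result = new int[out.size()];
        for (int i = 0; i < result.length; i++) result[i] = out.get(i);
        return result;
    }
}
```

Each next() costs O(log n) in the number of clauses, which is why the heap operations tend to dominate disjunction performance.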
Re: DisjunctionScorer performance
On Wednesday 07 January 2009 07:36:06 John Wang wrote: Hi guys: We have been building a suite of boolean operators DocIdSets (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator, NotDocIdSet/Iterator). We compared our implementation on the OrDocIdSetIterator (based on DisjunctionMaxScorer code) with some code tuning, and we see the performance doubled in our testing. That's good news. What data structure did you use for sorting by doc id? Currently a priority queue is used for that, and normally that is the bottleneck for performance. (we haven't done comparisons with ConjunctionScorer vs. AndDocIdSetIterator, will post numbers when we do) We'd be happy to contribute this back to the community. But what is the best way of going about it? option 1: merge our change into DisjunctionMax/SumScorers. option 2: contribute boolean operator sets, and have DisjunctionScorers derive from OrDocIdSetIterator, ConjunctionScorer derive from AndDocIdSetIterator etc. Option 2 seems to be cleaner. Thoughts? Some theoretical performance improvement is possible when the minimum number of required scorers/iterators is higher than 1, by using skipTo() (as much as possible) instead of next() in such cases. For the moment that's theoretical because there is no working implementation of this yet, but have a look at LUCENE-1345. I'm currently working on a DisjunctionDISI, probably the same function as the OrDocIdSetIterator you mentioned above. In case you have something faster than that, could you post it at LUCENE-1345 or at a new issue? An AndDocIdSetIterator could also be useful for the PhraseScorers and for the SpanNear queries, but that is of later concern. So I'd prefer option 2. Regards, Paul Elschot
Re: DisjunctionScorer performance
Paul: Our very simple/naive testing methodology for OrDocIdSetIterator: five sub-iterators, each iterating from 0 to 1,000,000. The test iterates the OrDocIdSetIterator until next() returns false. Do you want me to run the same test against DisjunctDisi? -John
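The benchmark John describes can be sketched as a self-contained harness like the one below. This is a stand-in, not the actual test code from the contribution: five identical sub-iterators over 0 to 1,000,000 are drained through a simple cursor-based disjunction until exhausted, and the wall-clock time is reported.

```java
// Rough, self-contained sketch of the described benchmark. All names are
// hypothetical; drain() is a deliberately simple "OR" merge used only to
// give the harness something to iterate.
class OrBenchSketch {
    // Naive disjunction drain: repeatedly emit the smallest current doc
    // and advance every cursor sitting on it; returns docs emitted.
    static int drain(int[][] lists) {
        int[] pos = new int[lists.length];
        int emitted = 0;
        while (true) {
            int min = Integer.MAX_VALUE;
            for (int i = 0; i < lists.length; i++) {
                if (pos[i] < lists[i].length && lists[i][pos[i]] < min) {
                    min = lists[i][pos[i]];
                }
            }
            if (min == Integer.MAX_VALUE) break;   // all cursors exhausted
            for (int i = 0; i < lists.length; i++) {
                if (pos[i] < lists[i].length && lists[i][pos[i]] == min) pos[i]++;
            }
            emitted++;
        }
        return emitted;
    }

    // Build the five identical 0..maxDoc sub-iterators and time the drain.
    static int runBench(int maxDoc, int subIterators) {
        int[] docs = new int[maxDoc + 1];
        for (int d = 0; d <= maxDoc; d++) docs[d] = d;
        int[][] subs = new int[subIterators][];
        for (int i = 0; i < subIterators; i++) subs[i] = docs;
        long t0 = System.nanoTime();
        int emitted = drain(subs);
        System.out.println("drained in " + (System.nanoTime() - t0) / 1_000_000 + " ms");
        return emitted;
    }
}
```

With fully overlapping inputs like these, the merge emits 1,000,001 unique docs; note that identical sub-iterators are a best case for duplicate collapsing and may flatter any implementation being measured.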
Re: DisjunctionScorer performance
One more thing I missed. I don't quite get your point about skip() vs. next(): with OR queries, skipping does not help as much as it does with AND queries. -John
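Paul's skipTo() point can be illustrated with a conjunction: when every clause is required, each list can jump directly to the other list's current candidate instead of calling next() one doc at a time. The sketch below uses a binary search as a stand-in for skipTo(); the names and structure are illustrative only, not Lucene's ConjunctionScorer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative "leapfrog" intersection ("AND") over two sorted doc-id
// lists, using skipTo-style jumps rather than one-doc-at-a-time next().
class AndMerge {
    static int[] and(int[] a, int[] b) {
        List<Integer> out = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {            // both lists agree: a match
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = skipTo(a, i, b[j]);    // leap a forward to b's candidate
            } else {
                j = skipTo(b, j, a[i]);    // leap b forward to a's candidate
            }
        }
        int[] result = new int[out.size()];
        for (int k = 0; k < result.length; k++) result[k] = out.get(k);
        return result;
    }

    // skipTo stand-in: first position at or beyond target, via binary
    // search (real skip lists or galloping would be used in practice).
    static int skipTo(int[] docs, int from, int target) {
        int lo = from, hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

This is why skipping pays off for AND but much less for OR, as noted above: a disjunction must still visit every doc in every clause, while a conjunction can leap over long runs of non-matching docs.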