[jira] [Commented] (LUCENE-2482) Index sorter
[ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494187#comment-13494187 ]

Matthew Willson commented on LUCENE-2482:

Hi all -- a few quick questions, if anyone is still watching this.

* Could this be used to achieve an impact-ordered index, as in e.g. [1], where the documents in a given term's postings list are ordered by score contribution or term frequency?
* Are there any caveats one should be aware of when combining index sorting with different index merge strategies, or with some of the more advanced features in Solr for managing distributed indexes?
* Is anyone aware of other work along the lines of early stopping / dynamic pruning optimisations in Lucene? E.g. MaxScore from [1] (I think Xapian [2] calls it 'operator decay'), or the accumulator-pruning algorithms from [1] (perhaps in combination with impact ordering)? In particular, is there anything in Lucene 4's approach to scoring and indexing which would make these hard in principle?

Any pointers gratefully received.

[1] Buettcher, Clarke, Cormack, "Information Retrieval: Implementing and Evaluating Search Engines", ch. 5, pp. 143-153
[2] http://xapian.org/docs/matcherdesign.html

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Core
Issue Type: New Feature
Components: modules/other
Affects Versions: 3.1, 4.0-ALPHA
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 3.6
Attachments: indexSorter.patch, LUCENE-2482-4.0.patch

A tool to sort an index according to a float document weight. Documents with high weight are given low document numbers, which means that they will be evaluated first. When using a strategy of early termination of queries (see TimeLimitedCollector) such sorting significantly improves the quality of partial results.
(Originally this tool was created by Doug Cutting in Nutch, and used norms as document weights - thus the ordering was limited by the limited resolution of norms. This is a pure Lucene version of the tool, and it uses arbitrary floats from a specified stored field).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
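
The renumbering that the issue description talks about (high weight gets a low document number) can be sketched as a permutation over document IDs. The following is only an illustration under stated assumptions: the class and method names are hypothetical, the weights are assumed to be already loaded into a float array indexed by old document ID, and nothing here is taken from the attached patches.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the IndexSorter patch.
final class DocWeightOrder {

    /** newToOld[newId] = oldId; highest weight first, ties keep their old order. */
    static int[] newToOld(float[] weights) {
        Integer[] ids = new Integer[weights.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Arrays.sort on objects is stable, so equal weights stay in old docID order.
        Arrays.sort(ids, (a, b) -> Float.compare(weights[b], weights[a]));
        int[] result = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            result[i] = ids[i];
        }
        return result;
    }

    /** Inverts the permutation: oldToNew[oldId] = newId. */
    static int[] oldToNew(int[] newToOld) {
        int[] inverse = new int[newToOld.length];
        for (int newId = 0; newId < newToOld.length; newId++) {
            inverse[newToOld[newId]] = newId;
        }
        return inverse;
    }
}
```

With such a mapping, a collector that stops early has already seen the highest-weighted documents, which is the effect on partial results that the description claims.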
[jira] [Commented] (LUCENE-2482) Index sorter
[ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237931#comment-13237931 ]

Robert Muir commented on LUCENE-2482:

This issue is actually fixed in 3.x, but is still open for a 4.0 port. I'll open an issue (with fix version of 4.0) for the trunk port.

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: modules/other
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 3.6, 4.0
Attachments: LUCENE-2482-4.0.patch, indexSorter.patch
[jira] [Commented] (LUCENE-2482) Index sorter
[ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199201#comment-13199201 ]

Pablo Castellanos commented on LUCENE-2482:

Hi,

I wanted to implement some early termination strategies over my Lucene index, so I started playing with the 4.0 patch, as I need to reorder the index. I found that a lot of functions have changed in the past year, and I had to make some modifications, mainly:

{code}
/*@Override
public TermFreqVector[] getTermFreqVectors(int docNumber) throws IOException {
  return super.getTermFreqVectors(newToOld[docNumber]);
}*/

@Override
public Fields getTermVectors(int docID) throws IOException {
  return super.getTermVectors(newToOld[docID]);
}

/*@Override
public Document document(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException {
  return super.document(newToOld[n], fieldSelector);
}*/

@Override
public void document(int docID, StoredFieldVisitor visitor) throws CorruptIndexException, IOException {
  super.document(newToOld[docID], visitor);
}
{code}

There is also a getDeletedDocs function, and I haven't found any good replacement for it:

{code}
/*@Override
public Bits getDeletedDocs() {
  final Bits deletedDocs = super.getDeletedDocs();
  if (deletedDocs == null)
    return null;
  return new Bits() {
    @Override
    public boolean get(int index) {
      return deletedDocs.get(newToOld[index]);
    }

    @Override
    public int length() {
      return deletedDocs.length();
    }
  };
}*/
{code}

After applying these changes and running the code against my Lucene index I get some weird results. It seems that the new sorting has worked, but the postings lists that reference the documents still point to the old data.
Imagine that I have 2 documents in my index and that I want to sort them by price (so the most expensive item should have a lower docId).

Document 1
{panel}docId:1, name: iPod, price: 100${panel}

Document 2
{panel}docId:2, name: iPhone, price: 300${panel}

I run my modified version of IndexSorter over it, and after that I query the new index. If I query for _name:iPhone_ I get:

{panel}docId:2, name: iPod, price: 100${panel}

That leads me to believe that the documents have been sorted but the new index is using the old postings list.

So I have two questions. Are you planning on updating this code for newer versions of Lucene 4.0, or am I on my own to get it to work? And if I am on my own, where should I look for a solution to my problem?

Thanks in advance for your help.

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: modules/other
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 3.6, 4.0
Attachments: LUCENE-2482-4.0.patch, indexSorter.patch
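
The symptom reported in the comment above (a query for iPhone returning the iPod's stored fields) is what one would expect if stored fields are read through newToOld while postings still carry old document IDs. A minimal sketch of the missing step follows, assuming 0-based docIDs and plain int arrays in place of Lucene's postings enumerations; the class and method names are hypothetical and this is not the code from the patch.

```java
import java.util.Arrays;

// Hypothetical illustration of remapping postings; not the patch's code.
final class PostingsRemap {

    /** Inverts newToOld (newId -> oldId) into oldToNew (oldId -> newId). */
    static int[] invert(int[] newToOld) {
        int[] oldToNew = new int[newToOld.length];
        for (int newId = 0; newId < newToOld.length; newId++) {
            oldToNew[newToOld[newId]] = newId;
        }
        return oldToNew;
    }

    /** Rewrites one term's postings into new docIDs, restoring ascending order. */
    static int[] remapPostings(int[] oldPostings, int[] oldToNew) {
        int[] remapped = new int[oldPostings.length];
        for (int i = 0; i < oldPostings.length; i++) {
            remapped[i] = oldToNew[oldPostings[i]];
        }
        // Postings must be ascending by docID, and remapping can break that order.
        Arrays.sort(remapped);
        return remapped;
    }
}
```

In the two-document example, with the iPod as old doc 0 and the iPhone as old doc 1, sorting by descending price gives newToOld = {1, 0}. The postings for name:iPhone are {1} in old IDs and must become {0}. Left unmapped, docID 1 in the new index resolves to the stored fields of old doc 0, the iPod, which matches the reported result.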
[jira] Commented: (LUCENE-2482) Index sorter
[ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982411#action_12982411 ]

Robert Muir commented on LUCENE-2482:

bq. I'm not sure if I follow your use case though ... please remember that this re-sorting is applied exactly the same to all postings, so savings on one list may cause bloat on another list.

Hi Andrzej, I came across this the other day, and thought it would be really interesting in the context of some of our newer codecs under development in trunk and the bulkpostings branch.

I found the results presented there, based on index sorting for codecs like simple9, to be really compelling: a significant reduction in bits/posting, for docids especially, because it can pack a lot of small deltas efficiently.

{noformat}
The first method reorders the documents in a text collection based on the number of distinct terms contained in each document. The idea is that two documents that each contain a large number of distinct terms are more likely to share terms than are a document with many distinct terms and a document with few distinct terms. Therefore, by assigning docids so that documents with many terms are close together, we may expect a greater clustering effect than by assigning docids at random.

The second method assumes that the documents have been crawled from the Web (or maybe a corporate Intranet). It reassigns docids in lexicographical order of URL. The idea here is that two documents from the same Web server (or maybe even from the same directory on that server) are more likely to share common terms than two random documents from unrelated locations on the Internet.
{noformat}

http://www.ir.uwaterloo.ca/book/06-index-compression.pdf (see page 214: doc id reordering)

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 3.1, 4.0
Attachments: indexSorter.patch
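
The clustering effect described in the quoted passage can be made concrete by comparing the encoded size of docid gaps for a clustered and a scattered postings list. The sketch below counts bytes under a plain variable-byte scheme (7 payload bits per byte). This is a simpler stand-in chosen for brevity, not simple9 and not any Lucene codec, and all names in it are hypothetical.

```java
// Hypothetical illustration: byte cost of delta-encoded docIDs under vByte.
final class GapCost {

    /** Number of bytes a non-negative value takes with 7 payload bits per byte. */
    static int vByteLength(int value) {
        int bytes = 1;
        while (value >= 128) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Total bytes for an ascending docID list stored as gaps (first ID stored as-is). */
    static int encodedBytes(int[] ascendingDocIds) {
        int total = 0;
        int previous = 0;
        for (int docId : ascendingDocIds) {
            total += vByteLength(docId - previous);
            previous = docId;
        }
        return total;
    }
}
```

Four adjacent docIDs such as 1000, 1001, 1002, 1003 cost 5 bytes under this scheme, while four scattered ones such as 1000, 50000, 200000, 3000000 cost 12, since small gaps fit in a single byte.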
[jira] Commented: (LUCENE-2482) Index sorter
[ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915993#action_12915993 ]

Koji Sekiguchi commented on LUCENE-2482:

I think this is an interesting tool. I'm wondering if Solr can call it, as Solr does merge indexes. Are there any restrictions on this? I haven't looked into it deeply, but for example, I see that isPayloadAvailable() always returns false. Does that mean it doesn't support payloads? Can it support multiple sorts on indexed fields, other than the stored float field?

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 3.1
Attachments: indexSorter.patch
[jira] Commented: (LUCENE-2482) Index sorter
[ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897172#action_12897172 ]

Andrzej Bialecki commented on LUCENE-2482:

If there are no objections I'd like to commit this soon.

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki
Fix For: 3.1
Attachments: indexSorter.patch
[jira] Commented: (LUCENE-2482) Index sorter
[ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872366#action_12872366 ]

Andrzej Bialecki commented on LUCENE-2482:

Re: combination of fields + a comparator: sure, why not, take a look at the implementation of the DocScore inner class - you can stuff whatever you want there.

I'm not sure if I follow your use case though ... please remember that this re-sorting is applied exactly the same to all postings, so savings on one list may cause bloat on another list.

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki
Fix For: 3.1
Attachments: indexSorter.patch
[jira] Commented: (LUCENE-2482) Index sorter
[ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872386#action_12872386 ]

Eks Dev commented on LUCENE-2482:

Re: I'm not sure if I follow your use case though

Simple case: you have 100 million docs with 2 fields, CITY and TEXT. Sorting on CITY makes the postings look like:

Orlando: -
New York: -

which is perfectly compressible, without really affecting the distribution (compressibility) of terms from the TEXT field. If CITY remained in unsorted order (e.g. a uniform distribution), you would deal with very large postings for all terms coming from this field.

Sorting on many fields often helps, e.g. if you have hierarchical compositions like 1 CITY with many ZIP_CODES...

Philosophically, sorting always increases compressibility and improves locality of reference... but sure, you need to know what you want.

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki
Fix For: 3.1
Attachments: indexSorter.patch
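
The CITY and ZIP_CODE case from the comment above can be sketched with a two-key comparator: once documents are renumbered by CITY and then ZIP_CODE, every city's postings become one contiguous run of docIDs. The code below is a rough illustration only; it assumes the field values are available as in-memory string arrays indexed by old docID, and its class and method names are hypothetical rather than part of IndexSorter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of multi-field index sorting; not IndexSorter's API.
final class CitySortSketch {

    /** newToOld[newId] = oldId, ordered by city and then by zip code. */
    static int[] sortByCityThenZip(String[] city, String[] zip) {
        Integer[] ids = new Integer[city.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, (a, b) -> {
            int byCity = city[a].compareTo(city[b]);
            return byCity != 0 ? byCity : zip[a].compareTo(zip[b]);
        });
        int[] newToOld = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            newToOld[i] = ids[i];
        }
        return newToOld;
    }

    /** New docIDs, in ascending order, of documents whose field value equals term. */
    static int[] postingsFor(String term, String[] values, int[] newToOld) {
        List<Integer> hits = new ArrayList<>();
        for (int newId = 0; newId < newToOld.length; newId++) {
            if (values[newToOld[newId]].equals(term)) {
                hits.add(newId);
            }
        }
        int[] postings = new int[hits.size()];
        for (int i = 0; i < postings.length; i++) {
            postings[i] = hits.get(i);
        }
        return postings;
    }

    /** True when the ascending docIDs form one run with every gap equal to 1. */
    static boolean isContiguous(int[] ascendingDocIds) {
        for (int i = 1; i < ascendingDocIds.length; i++) {
            if (ascendingDocIds[i] != ascendingDocIds[i - 1] + 1) {
                return false;
            }
        }
        return true;
    }
}
```

A contiguous run has gaps of 1 throughout, which is the best case for delta-based postings compression. Terms from other fields, such as TEXT, are renumbered by the same permutation, which is the trade-off raised earlier in the thread.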