[jira] [Commented] (LUCENE-2482) Index sorter

2012-11-09 Thread Matthew Willson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494187#comment-13494187
 ] 

Matthew Willson commented on LUCENE-2482:
-

Hi all -- a few quick questions, if anyone is still watching this.

* Could this be used to achieve an impact ordered index, as in e.g. [1], where 
documents in a given term's postings list are ordered by score contribution or 
term frequency?

* Any caveats or things one should be aware of when it comes to index sorting 
in combination with different index merge strategies, and some of the more 
advanced stuff in Solr for managing distributed indexes?

* Is anyone aware of any other work along the lines of early stopping / dynamic 
pruning optimisations in Lucene? E.g. MaxScore from [1] (I think Xapian [2] 
calls it 'operator decay') or accumulator-pruning algorithms from [1] 
(perhaps in combination with impact ordering)? In particular, is there anything 
in Lucene 4's approach to scoring and indexing which would make these hard in 
principle?

Any pointers gratefully received.

[1] Buettcher, Clarke & Cormack, Information Retrieval: Implementing and 
Evaluating Search Engines, ch. 5, pp. 143-153
[2] http://xapian.org/docs/matcherdesign.html

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/other
Affects Versions: 3.1, 4.0-ALPHA
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 3.6

 Attachments: indexSorter.patch, LUCENE-2482-4.0.patch


 A tool to sort index according to a float document weight. Documents with 
 high weight are given low document numbers, which means that they will be 
 first evaluated. When using a strategy of early termination of queries (see 
 TimeLimitedCollector) such sorting significantly improves the quality of 
 partial results.
 (Originally this tool was created by Doug Cutting in Nutch, and used norms as 
 document weights - thus the ordering was limited by the limited resolution of 
 norms. This is a pure Lucene version of the tool, and it uses arbitrary 
 floats from a specified stored field).
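
 A minimal sketch of the renumbering described above, assuming the weights have 
 already been read into an array indexed by the old document number. The class 
 and method names are hypothetical and are not taken from the patch; the point 
 is only that sorting by descending weight yields a newToOld permutation, and 
 that its inverse (oldToNew) is what postings would be rewritten with.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the patch: builds docid permutations from weights. */
final class WeightSortSketch {

    /** Returns newToOld, where newToOld[newId] = oldId and the highest weight gets new docid 0. */
    static int[] newToOld(float[] weights) {
        Integer[] ids = new Integer[weights.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Arrays.sort on objects is stable, so equal weights keep their original order.
        Arrays.sort(ids, (a, b) -> Float.compare(weights[b], weights[a]));
        int[] result = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            result[i] = ids[i];
        }
        return result;
    }

    /** Inverts the permutation: oldToNew[oldId] = newId. */
    static int[] invert(int[] newToOld) {
        int[] oldToNew = new int[newToOld.length];
        for (int newId = 0; newId < newToOld.length; newId++) {
            oldToNew[newToOld[newId]] = newId;
        }
        return oldToNew;
    }
}
```

 For weights {0.1, 0.9, 0.5} this gives newToOld = {1, 2, 0} and 
 oldToNew = {2, 0, 1}.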

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2482) Index sorter

2012-03-25 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237931#comment-13237931
 ] 

Robert Muir commented on LUCENE-2482:
-

This issue is actually fixed in 3.x, but is still open for a 4.0 port.

I'll open an issue (with fix version of 4.0) for the trunk port.

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/other
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 3.6, 4.0

 Attachments: LUCENE-2482-4.0.patch, indexSorter.patch





[jira] [Commented] (LUCENE-2482) Index sorter

2012-02-02 Thread Pablo Castellanos (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199201#comment-13199201
 ] 

Pablo Castellanos commented on LUCENE-2482:
---

Hi, I wanted to implement some early termination strategies over my Lucene 
index, so I started playing with the 4.0 patch, as I need to reorder the index.

I found that a lot of methods have changed in the past year, so I had to make 
some modifications, mainly:

{code}
// Old API used by the patch, commented out:
/*@Override
public TermFreqVector[] getTermFreqVectors(int docNumber)
    throws IOException {
  return super.getTermFreqVectors(newToOld[docNumber]);
}*/

// Replacement: map the new docID back to the old one before delegating.
@Override
public Fields getTermVectors(int docID) throws IOException {
  return super.getTermVectors(newToOld[docID]);
}

// Old API used by the patch, commented out:
/*@Override
public Document document(int n, FieldSelector fieldSelector)
    throws CorruptIndexException, IOException {
  return super.document(newToOld[n], fieldSelector);
}*/

// Replacement: stored fields are now read through a visitor.
@Override
public void document(int docID, StoredFieldVisitor visitor)
    throws CorruptIndexException, IOException {
  super.document(newToOld[docID], visitor);
}
{code}

There is also a getDeletedDocs method, for which I haven't found a good 
replacement:

{code}
/*@Override
public Bits getDeletedDocs() {
  final Bits deletedDocs = super.getDeletedDocs();

  if (deletedDocs == null)
    return null;

  return new Bits() {
    @Override
    public boolean get(int index) {
      return deletedDocs.get(newToOld[index]);
    }

    @Override
    public int length() {
      return deletedDocs.length();
    }
  };
}*/
{code}

After applying these changes and running the code against my Lucene index I get 
some weird results. It seems that the new sorting has worked, but the postings 
lists used to reach the documents still point to the old document numbers.

Imagine that I have 2 documents in my index and that I want to sort them by 
price (So the most expensive item should have a lower docId)

Document 1
{panel}docId:1, name: iPod, price: 100${panel}

Document 2
{panel}docId:2, name: iPhone, price: 300${panel}

I run my modified version of IndexSorter over it and after that I try to query 
the new index, so if I query for _name:iPhone_ I get:
{panel}docId:2, name: iPod, price: 100${panel}

That leads me to believe that the documents have been sorted but the new index 
is using the old posting list. 
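
One way to picture the symptom, offered as an illustration rather than a 
diagnosis of the patch: a reordered index is only consistent if every postings 
list is rewritten through the same permutation as the stored fields. The sketch 
below uses a hypothetical helper (not a Lucene API) and 0-based docids to show 
the rewrite for a single term.

```java
import java.util.Arrays;

/** Hypothetical illustration; these names are not Lucene APIs. */
final class PostingsRemapSketch {

    /** Rewrites one term's postings from old docids to new docids. */
    static int[] remap(int[] oldPostings, int[] oldToNew) {
        int[] remapped = new int[oldPostings.length];
        for (int i = 0; i < oldPostings.length; i++) {
            remapped[i] = oldToNew[oldPostings[i]];
        }
        // Postings must remain in increasing docid order after the permutation.
        Arrays.sort(remapped);
        return remapped;
    }
}
```

With old docid 0 = iPod and old docid 1 = iPhone, sorting by descending price 
gives oldToNew = {1, 0}, so the postings for name:iPhone should change from {1} 
to {0}. If the stored fields are remapped but the postings are left at {1}, the 
query lands on the document that now holds the iPod, which is the behaviour 
described above.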

So I have two questions. Are you planning to update this code for newer 
versions of Lucene 4.0, or am I on my own to get it to work? And if the latter, 
where should I look for a solution to my problem?

Thanks in advance for your help.

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/other
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 3.6, 4.0

 Attachments: LUCENE-2482-4.0.patch, indexSorter.patch





[jira] Commented: (LUCENE-2482) Index sorter

2011-01-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982411#action_12982411
 ] 

Robert Muir commented on LUCENE-2482:
-

bq. I'm not sure if I follow your use case though ... please remember that this 
re-sorting is applied exactly the same to all postings, so savings on one list 
may cause bloat on another list.

Hi Andrzej, I came across this the other day, and thought it would be really 
interesting in the context of some of our newer codecs under development in 
trunk and the bulkpostings branch.

I found the results presented there based on index sorting for codecs like 
simple9 to be really compelling: a significant reduction in bits/posting for 
docids especially, because it can pack a lot of small deltas efficiently.

{noformat}
The first method reorders the documents in a text collection based on the
number of distinct terms contained in each document. The idea is that two
documents that each contain a large number of distinct terms are more likely
to share terms than are a document with many distinct terms and a document
with few distinct terms. Therefore, by assigning docids so that documents
with many terms are close together, we may expect a greater clustering effect
than by assigning docids at random.

The second method assumes that the documents have been crawled from the Web
(or maybe a corporate Intranet). It reassigns docids in lexicographical order
of URL. The idea here is that two documents from the same Web server (or
maybe even from the same directory on that server) are more likely to share
common terms than two random documents from unrelated locations on the
Internet.
{noformat}

http://www.ir.uwaterloo.ca/book/06-index-compression.pdf (see page 214: doc id 
reordering)
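
A rough sketch of why clustered docids help, assuming a crude cost model in 
which each gap between consecutive docids costs as many bits as its binary 
length. This is not simple9 or any actual Lucene codec, and the class name is 
hypothetical; it only shows that small deltas are cheap to encode.

```java
/** Hypothetical cost model, not a real codec: counts the bits needed for docid gaps. */
final class GapCostSketch {

    /** Delta-encodes a sorted postings list; the first gap is the first docid itself. */
    static int[] gaps(int[] sortedDocIds) {
        int[] result = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            result[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return result;
    }

    /** Sums the minimal binary length of each gap, charging at least one bit per gap. */
    static int totalBits(int[] sortedDocIds) {
        int bits = 0;
        for (int gap : gaps(sortedDocIds)) {
            bits += Math.max(1, 32 - Integer.numberOfLeadingZeros(gap));
        }
        return bits;
    }
}
```

Under this model the clustered list {100, 101, 102, 103} costs 10 bits, while 
the scattered list {100, 5000, 90000, 700000} costs 57 bits for the same number 
of postings.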


 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 3.1, 4.0

 Attachments: indexSorter.patch





[jira] Commented: (LUCENE-2482) Index sorter

2010-09-28 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915993#action_12915993
 ] 

Koji Sekiguchi commented on LUCENE-2482:


I think this is an interesting tool. I'm wondering if Solr can call it, as Solr 
does merge indexes. 

Are there any restrictions on this? I haven't looked into it deeply, but for 
example, I see that isPayloadAvailable() always returns false. Does that mean 
it doesn't support payloads?
Can it support sorting on multiple indexed fields, rather than only a stored 
float field?

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 3.1

 Attachments: indexSorter.patch





[jira] Commented: (LUCENE-2482) Index sorter

2010-08-11 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897172#action_12897172
 ] 

Andrzej Bialecki  commented on LUCENE-2482:
---

If there are no objections I'd like to commit this soon.

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki 
 Fix For: 3.1

 Attachments: indexSorter.patch





[jira] Commented: (LUCENE-2482) Index sorter

2010-05-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872366#action_12872366
 ] 

Andrzej Bialecki  commented on LUCENE-2482:
---

Re: combination of fields + a comparator: sure, why not, take a look at the 
implementation of the DocScore inner class - you can stuff whatever you want 
there.

I'm not sure if I follow your use case though ... please remember that this 
re-sorting is applied exactly the same to all postings, so savings on one list 
may cause bloat on another list.

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki 
 Fix For: 3.1

 Attachments: indexSorter.patch





[jira] Commented: (LUCENE-2482) Index sorter

2010-05-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872386#action_12872386
 ] 

Eks Dev commented on LUCENE-2482:
-

Re: I'm not sure if I follow your use case though

Simple case: you have 100 million docs with 2 fields, CITY and TEXT

sorting on CITY makes postings look like: 
Orlando:  -
 New York:   
-
perfectly compressible. 

without really affecting the distribution (compressibility) of terms from the 
TEXT field.

If CITY remained in unsorted order (e.g. uniformly distributed), you would have 
to deal with very large postings for all terms coming from this field.

Sorting on many fields often helps, e.g. if you have hierarchical compositions 
like 1 CITY with many ZIP_CODES... Philosophically, sorting always increases 
compressibility and improves locality of reference... but sure, you need to 
know what you want.
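
A small sketch of the CITY example, assuming one CITY and one ZIP_CODE value 
per document held in plain arrays. The class and method names are hypothetical 
and have nothing to do with the patch's DocScore class. After sorting by CITY 
and then ZIP_CODE, the postings of each city term form one contiguous run of 
docids.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Hypothetical illustration of sorting documents on CITY, then ZIP_CODE. */
final class CitySortSketch {

    /** Returns newToOld, ordering documents by city and breaking ties by zip code. */
    static int[] newToOld(String[] city, String[] zip) {
        Integer[] ids = new Integer[city.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, (a, b) -> {
            int byCity = city[a].compareTo(city[b]);
            return byCity != 0 ? byCity : zip[a].compareTo(zip[b]);
        });
        return Arrays.stream(ids).mapToInt(Integer::intValue).toArray();
    }

    /** Lists, in increasing order, the new docids whose CITY equals the given term. */
    static int[] postings(String term, String[] city, int[] newToOld) {
        return IntStream.range(0, newToOld.length)
                .filter(newId -> city[newToOld[newId]].equals(term))
                .toArray();
    }
}
```

With five documents whose cities are Orlando, New York, Orlando, New York, 
Orlando, the unsorted postings for Orlando are {0, 2, 4}; after sorting they 
become {2, 3, 4}, and New York occupies {0, 1}.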

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki 
 Fix For: 3.1

 Attachments: indexSorter.patch

