
[jira] [Commented] (LUCENE-2482) Index sorter

2012-11-09 Thread Matthew Willson (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494187#comment-13494187
]

Matthew Willson commented on LUCENE-2482:
-

Hi all -- a few quick questions, if anyone is still watching this.

* Could this be used to build an impact-ordered index, as in e.g. [1], where
documents in a given term's postings list are ordered by score contribution or
term frequency?

* Are there any caveats to be aware of when combining index sorting with
different index merge strategies, or with some of the more advanced Solr
features for managing distributed indexes?

* Is anyone aware of other work on early stopping / dynamic pruning
optimisations in Lucene, e.g. MaxScore from [1] (I think Xapian [2] calls it
'operator decay'), or the accumulator pruning algorithms from [1] (perhaps in
combination with impact ordering)? In particular, is there anything in
Lucene 4's approach to scoring and indexing that would make these hard in
principle?

Any pointers gratefully received.

[1] Büttcher, Clarke, Cormack, "Information Retrieval: Implementing and
Evaluating Search Engines", ch. 5, pp. 143-153
[2] http://xapian.org/docs/matcherdesign.html

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Core
Issue Type: New Feature
Components: modules/other
Affects Versions: 3.1, 4.0-ALPHA
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 3.6

Attachments: indexSorter.patch, LUCENE-2482-4.0.patch

A tool to sort an index according to a float document weight. Documents with
high weight are given low document numbers, which means that they are
evaluated first. When using a strategy of early termination of queries (see
TimeLimitedCollector), such sorting significantly improves the quality of
partial results.
(Originally this tool was created by Doug Cutting in Nutch, and used norms as
document weights, so the ordering was constrained by the limited resolution of
norms. This is a pure Lucene version of the tool, and it uses arbitrary
floats from a specified stored field.)
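
The renumbering described above can be sketched in isolation. This is a minimal illustration, not the code from the attached patch: the class and method names are hypothetical, and the weights are passed in as a plain array rather than read from a stored field. It only shows how a newToOld mapping (the name used in the code quoted later in this thread) and its inverse could be derived from per-document weights.

```java
import java.util.Arrays;

public class DocWeightSortSketch {
    /** Returns newToOld: newToOld[newId] is the original doc number placed at newId. */
    public static int[] newToOld(final float[] weights) {
        Integer[] ids = new Integer[weights.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Highest weight first. Arrays.sort on objects is stable,
        // so documents with equal weight keep their original relative order.
        Arrays.sort(ids, (a, b) -> Float.compare(weights[b], weights[a]));
        int[] result = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            result[i] = ids[i];
        }
        return result;
    }

    /** Inverts newToOld into oldToNew: oldToNew[oldId] is the new doc number. */
    public static int[] invert(int[] newToOld) {
        int[] oldToNew = new int[newToOld.length];
        for (int n = 0; n < newToOld.length; n++) {
            oldToNew[newToOld[n]] = n;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        float[] weights = {0.2f, 0.9f, 0.5f};
        int[] n2o = newToOld(weights);
        System.out.println(Arrays.toString(n2o));         // [1, 2, 0]
        System.out.println(Arrays.toString(invert(n2o))); // [2, 0, 1]
    }
}
```

Per-document data (stored fields, term vectors) is looked up through newToOld, while doc numbers that appear inside postings have to be translated through oldToNew.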

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-2482) Index sorter

2012-03-25 Thread Robert Muir (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237931#comment-13237931
 ] 

Robert Muir commented on LUCENE-2482:
-

This issue is actually fixed in 3.x, but is still open for a 4.0 port.

I'll open an issue (with fix version of 4.0) for the trunk port.

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/other
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 3.6, 4.0

 Attachments: LUCENE-2482-4.0.patch, indexSorter.patch



[jira] [Commented] (LUCENE-2482) Index sorter

2012-02-02 Thread Pablo Castellanos (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199201#comment-13199201
 ] 

Pablo Castellanos commented on LUCENE-2482:
---

Hi, I wanted to implement some early termination strategies over my Lucene
index, so I started playing with the 4.0 patch, as I need to reorder the index.

I found that a lot of functions have changed in the past year, and I had to
make some modifications, mainly:

{code}
// Old override, commented out:
/*@Override
public TermFreqVector[] getTermFreqVectors(int docNumber)
    throws IOException {
  return super.getTermFreqVectors(newToOld[docNumber]);
}*/

// Replacement:
@Override
public Fields getTermVectors(int docID) throws IOException {
  return super.getTermVectors(newToOld[docID]);
}

// Old override, commented out:
/*@Override
public Document document(int n, FieldSelector fieldSelector)
    throws CorruptIndexException, IOException {
  return super.document(newToOld[n], fieldSelector);
}*/

// Replacement:
@Override
public void document(int docID, StoredFieldVisitor visitor)
    throws CorruptIndexException, IOException {
  super.document(newToOld[docID], visitor);
}
{code}

There is also a getDeletedDocs function, for which I haven't found a good
replacement:

{code}
/*@Override
public Bits getDeletedDocs() {
  final Bits deletedDocs = super.getDeletedDocs();

  if (deletedDocs == null)
    return null;

  return new Bits() {
    @Override
    public boolean get(int index) {
      return deletedDocs.get(newToOld[index]);
    }

    @Override
    public int length() {
      return deletedDocs.length();
    }
  };
}*/
{code}
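
A possible direction for the missing replacement, offered as a sketch rather than a confirmed fix: to my knowledge the Lucene 4.0 reader API exposes getLiveDocs() in place of getDeletedDocs(), with the inverse meaning (a set bit marks a live document). If that is the method available in the trunk version being patched, the same remapping wrapper applies. The Bits interface below is a local stand-in for org.apache.lucene.util.Bits so that the example compiles without Lucene; the class name and the of(...) helper are hypothetical.

```java
public class LiveDocsRemapSketch {
    /** Local stand-in for org.apache.lucene.util.Bits (assumed shape: get + length). */
    public interface Bits {
        boolean get(int index);
        int length();
    }

    /** Test helper (hypothetical): wraps a boolean array as Bits. */
    public static Bits of(final boolean... values) {
        return new Bits() {
            @Override public boolean get(int index) { return values[index]; }
            @Override public int length() { return values.length; }
        };
    }

    /** Bit newId of the result reports the state of the original doc newToOld[newId]. */
    public static Bits remap(final Bits liveDocs, final int[] newToOld) {
        if (liveDocs == null) {
            return null; // null is kept as "no deletions", as in the snippet above
        }
        return new Bits() {
            @Override public boolean get(int index) { return liveDocs.get(newToOld[index]); }
            @Override public int length() { return liveDocs.length(); }
        };
    }

    public static void main(String[] args) {
        Bits live = of(true, false, true);       // old doc 1 is deleted
        Bits remapped = remap(live, new int[]{1, 2, 0});
        System.out.println(remapped.get(0));     // false: new doc 0 is old doc 1
    }
}
```

Inside the sorting reader, the override would then return remap(super.getLiveDocs(), newToOld), assuming that method exists with that meaning in the Lucene version in use.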

After applying these changes and running the code against my Lucene index, I
get some weird results. It seems that the new sorting has worked, but the
posting lists used to look up the documents are still pointing to the old data.

Imagine that I have 2 documents in my index and that I want to sort them by
price (so the most expensive item should have the lower docId):

Document 1
{panel}docId:1, name: iPod, price: 100${panel}

Document 2
{panel}docId:2, name: iPhone, price: 300${panel}

I run my modified version of IndexSorter over it and then query the new index.
If I query for _name:iPhone_ I get:
{panel}docId:2, name: iPod, price: 100${panel}

That leads me to believe that the documents have been sorted but the new index
is still using the old posting lists.
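
One possible reading of this symptom, which the thread does not confirm: stored fields are being served through newToOld, but the doc numbers inside the posting lists are not being translated through the inverse mapping. The sketch below shows what that translation involves for a single term. It is a hypothetical, Lucene-free illustration using 0-based doc numbers, not code from the patch.

```java
import java.util.Arrays;

public class PostingsRemapSketch {
    /** Maps each old doc number in a posting list to its new number, then restores ascending order. */
    public static int[] remapPostings(int[] oldDocIds, int[] oldToNew) {
        int[] remapped = new int[oldDocIds.length];
        for (int i = 0; i < oldDocIds.length; i++) {
            remapped[i] = oldToNew[oldDocIds[i]];
        }
        Arrays.sort(remapped); // posting lists are expected in ascending doc order
        return remapped;
    }

    public static void main(String[] args) {
        // Old doc 0 = iPod (price 100), old doc 1 = iPhone (price 300).
        // Sorting by descending price swaps them: oldToNew = [1, 0].
        int[] oldToNew = {1, 0};
        int[] iphonePostings = {1}; // name:iphone matched old doc 1
        System.out.println(Arrays.toString(remapPostings(iphonePostings, oldToNew))); // [0]
    }
}
```

If the postings for name:iPhone were left as the old doc number while stored fields were renumbered, the hit would resolve to the other document, which matches the behaviour described above.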

So I have two questions: are you planning to update this code for newer
versions of Lucene 4.0, or am I on my own to get it to work? And if the
latter, where should I look for a solution to my problem?

Thanks in advance for your help.

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/other
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 3.6, 4.0

 Attachments: LUCENE-2482-4.0.patch, indexSorter.patch



[jira] Commented: (LUCENE-2482) Index sorter

2011-01-16 Thread Robert Muir (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982411#action_12982411
]

Robert Muir commented on LUCENE-2482:
-

bq. I'm not sure if I follow your use case though ... please remember that this
re-sorting is applied exactly the same to all postings, so savings on one list
may cause bloat on another list.

Hi Andrzej, I came across this the other day, and thought it would be really
interesting in the context of some of the newer codecs under development in
trunk and the bulkpostings branch.

I found the results presented there for index sorting with codecs like Simple9
really compelling: a significant reduction in bits/posting, especially for
docids, because such codecs can pack a lot of small deltas efficiently.

{noformat}
The first method reorders the documents in a text collection based on the
number of distinct terms contained in each document. The idea is that two
documents that each contain a large number of distinct terms are more likely
to share terms than are a document with many distinct terms and a document
with few distinct terms. Therefore, by assigning docids so that documents with
many terms are close together, we may expect a greater clustering effect than
by assigning docids at random.

The second method assumes that the documents have been crawled from the Web
(or maybe a corporate Intranet). It reassigns docids in lexicographical order
of URL. The idea here is that two documents from the same Web server (or maybe
even from the same directory on that server) are more likely to share common
terms than two random documents from unrelated locations on the Internet.
{noformat}

http://www.ir.uwaterloo.ca/book/06-index-compression.pdf (see page 214: doc id
reordering)
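
The second method in the quoted passage (docids in lexicographical order of URL) amounts to sorting by a string key instead of a float weight. A minimal sketch, assuming the URLs are available as a plain array indexed by the old doc number; the class and method names are hypothetical and nothing here is taken from the patch:

```java
import java.util.Arrays;

public class UrlOrderSketch {
    /** Returns oldToNew doc numbers when docids are reassigned in lexicographic URL order. */
    public static int[] oldToNewByUrl(final String[] urls) {
        Integer[] ids = new Integer[urls.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // After sorting, ids[n] is the old doc number that receives new number n.
        Arrays.sort(ids, (a, b) -> urls[a].compareTo(urls[b]));
        int[] oldToNew = new int[urls.length];
        for (int n = 0; n < ids.length; n++) {
            oldToNew[ids[n]] = n;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://b.example/x",
            "http://a.example/1",
            "http://c.example/",
            "http://a.example/2"
        };
        // Both a.example pages end up adjacent: [2, 0, 3, 1]
        System.out.println(Arrays.toString(oldToNewByUrl(urls)));
    }
}
```

Pages from the same host receive neighbouring doc numbers, which is what produces the small docid deltas the compression argument relies on.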

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.1, 4.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 3.1, 4.0

Attachments: indexSorter.patch


[jira] Commented: (LUCENE-2482) Index sorter

2010-09-28 Thread Koji Sekiguchi (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915993#action_12915993
]

Koji Sekiguchi commented on LUCENE-2482:

I think this is an interesting tool. I'm wondering if Solr could call it, as
Solr does merge indexes.

Are there any restrictions on this? I haven't looked into it deeply, but for
example, I see that isPayloadAvailable() always returns false. Does that mean
it doesn't support payloads?
Can it support sorting on multiple indexed fields, rather than only a stored
float field?

Index sorter

Attachments: indexSorter.patch


[jira] Commented: (LUCENE-2482) Index sorter

2010-08-11 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897172#action_12897172
 ] 

Andrzej Bialecki  commented on LUCENE-2482:
---

If there are no objections I'd like to commit this soon.

 Index sorter
 

 Key: LUCENE-2482
 URL: https://issues.apache.org/jira/browse/LUCENE-2482
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki 
 Fix For: 3.1

 Attachments: indexSorter.patch



[jira] Commented: (LUCENE-2482) Index sorter

2010-05-27 Thread Andrzej Bialecki (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872366#action_12872366
]

Andrzej Bialecki commented on LUCENE-2482:
---

Re: combination of fields + a comparator: sure, why not, take a look at the
implementation of the DocScore inner class - you can stuff whatever you want
there.

I'm not sure if I follow your use case though ... please remember that this
re-sorting is applied exactly the same to all postings, so savings on one list
may cause bloat on another list.

Index sorter

Key: LUCENE-2482
URL: https://issues.apache.org/jira/browse/LUCENE-2482
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.1
Reporter: Andrzej Bialecki
Fix For: 3.1

Attachments: indexSorter.patch


[jira] Commented: (LUCENE-2482) Index sorter

2010-05-27 Thread Eks Dev (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872386#action_12872386
]

Eks Dev commented on LUCENE-2482:
-

Re: I'm not sure if I follow your use case though

Simple case: you have 100 million docs with 2 fields, CITY and TEXT.

Sorting on CITY makes the postings look like:
Orlando:  ----------
New York:           ----------
perfectly compressible,

without really affecting the distribution (compressibility) of terms from the
TEXT field.

If CITY remained in unsorted order (e.g. uniform distribution), you would deal
with very large postings for all terms coming from this field.

Sorting on many fields often helps, e.g. if you have hierarchical compositions
like 1 CITY with many ZIP_CODES... Philosophically, sorting always increases
compressibility and improves locality of reference... but sure, you need to
know what you want.
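
To put rough numbers on the compressibility argument, the sketch below compares the size of a posting list stored as docid gaps under a simple variable-byte code (7 payload bits per byte, similar in spirit to a VInt). The class name and the example doc numbers are made up for illustration; real codecs differ in detail.

```java
public class GapCostSketch {
    /** Bytes needed for one non-negative gap with a 7-bits-per-byte variable-length code. */
    public static int vByteLength(int gap) {
        int bytes = 1;
        while ((gap >>>= 7) != 0) {
            bytes++;
        }
        return bytes;
    }

    /** Total encoded size of an ascending posting list stored as gaps (first entry relative to 0). */
    public static int encodedSize(int[] sortedDocIds) {
        int total = 0;
        int prev = 0;
        for (int id : sortedDocIds) {
            total += vByteLength(id - prev);
            prev = id;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] sortedByCity = {1000, 1001, 1002, 1003};     // one contiguous run
        int[] unsorted = {1000, 50000, 300000, 9000000};   // spread over the index
        System.out.println(encodedSize(sortedByCity)); // 5
        System.out.println(encodedSize(unsorted));     // 12
    }
}
```

With the same number of postings, the contiguous run needs one byte per gap after the first entry, while the scattered list needs three or four bytes for most gaps.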

Index sorter

Attachments: indexSorter.patch

