Re: Filesystem based bitset
Hi Paul, not really an answer to your questions, I just thought you may find it useful as confirmation that this packing of integers into a (B or some other) tree is a good one. I have seen integer-set distributions that can profit hugely from the tree organization on top. Have a look at: http://www.iis.uni-stuttgart.de/intset/ (not meant for on-disk storage, but the idea is quite similar).

cheers, eks

From: Paul Elschot paul.elsc...@xs4all.nl
To: java-dev@lucene.apache.org
Sent: Sunday, 18 January, 2009 23:51:36
Subject: Re: Filesystem based bitset

On Friday 09 January 2009 22:30:14 Marvin Humphrey wrote:
> On Fri, Jan 09, 2009 at 08:11:31PM +0100, Karl Wettin wrote:
> > SSD is pretty close to RAM when it comes to seeking. Wouldn't that mean that a bitset stored on an SSD would be more or less as fast as a bitset in RAM?
> Provided that your index can fit in the system i/o cache and stay there, you get the speed of RAM regardless of the underlying permanent storage type. There's no reason to wait on SSDs before implementing such a feature.

Since this started by thinking out loud, I'd like to continue doing that.

I've been thinking about how to add a decent skipTo() to something that compresses better than an (Open)BitSet, and this turns out to be an integer set implemented as a B plus tree (all leaves on the same level) of only integers, with key/data compression by a frame of reference for every node (see LUCENE-1410).

I found a Java implementation of a B plus tree on SourceForge: BplusDotNet in the BplusJ package, see http://bplusdotnet.sourceforge.net/ . This has nice transaction semantics on a file system and it has a BSD licence, so it could be used as a starting point, but:
- it only has strings as index values, so it will need quite some simplification to work on integers as keys and data, and
- it has no built-in compression, as far as I could see on first inspection.

The questions:

Would someone know of a better starting point for a B plus tree of integers with node compression? For example, how close is the current Lucene code base to implementing a B plus tree for the doc ids of a single term?

How valuable are transaction semantics for such an integer set? It is tempting to try and implement such an integer set by starting from the ground up, but I don't have any practical programming experience with transaction semantics, so it may be better to start from something that has transactions right from the start.

Regards,
Paul Elschot
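To make the frame-of-reference idea concrete, here is a minimal sketch of one tree node of sorted doc ids encoded against a base value with a fixed bit width. This is illustrative only, not the LUCENE-1410 code; all names are invented for the sketch.

{code:java}
// Illustrative frame-of-reference (FOR) encoding for one tree node of
// sorted doc ids: store the smallest value once, then each offset from
// it in just enough bits. Invented names; not the LUCENE-1410 code.
class ForBlock {
  final int base;         // smallest doc id in the node
  final int bitsPerValue; // bits needed for the largest offset
  final long[] packed;    // offsets, bit-packed

  ForBlock(int[] sortedDocIds) {
    base = sortedDocIds[0];
    int maxOffset = sortedDocIds[sortedDocIds.length - 1] - base;
    bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(maxOffset));
    packed = new long[(sortedDocIds.length * bitsPerValue + 63) / 64];
    for (int i = 0; i < sortedDocIds.length; i++) {
      long offset = sortedDocIds[i] - base;
      int bitPos = i * bitsPerValue;
      packed[bitPos >> 6] |= offset << (bitPos & 63);
      if ((bitPos & 63) + bitsPerValue > 64) { // value spills into next word
        packed[(bitPos >> 6) + 1] |= offset >>> (64 - (bitPos & 63));
      }
    }
  }

  int get(int i) { // decode the i-th doc id
    int bitPos = i * bitsPerValue;
    long bits = packed[bitPos >> 6] >>> (bitPos & 63);
    if ((bitPos & 63) + bitsPerValue > 64) {
      bits |= packed[(bitPos >> 6) + 1] << (64 - (bitPos & 63));
    }
    return base + (int) (bits & ((1L << bitsPerValue) - 1));
  }
}
{code}

On top of a tree of such nodes, skipTo(target) descends by comparing target against each node's base, so only the blocks on that path need decoding.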
Re: [jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
I'm also seeing decent gains (~13%) for sort-by-relevance (ie the default sort) term queries w/ a large number (~97K and ~386K) of hits on 10 and 36 segment indices. So I agree, LUCENE-1483 is not just about speeding up sort-by-field queries. It seems to give good speedups all around, and of course warming time for sort-by-field searches goes way, way down. We just gotta wrap it up now!

Mike

Mark Miller wrote:

One more, just as a check with much fewer unique terms (20k). Didn't catch that I didn't clamp down enough on the uniques last time. Back up to 21 segments this time, same wildcard search, 7718 hits, and the new method is still approx 20% faster than the old. The last run was 16 segments though with way more uniques; this one is 21 segments and way fewer uniques.

7718

Segments file=segments_l numSegments=21 version=FORMAT_USER_DATA [Lucene 2.9]

  1 of 21: name=_bbxo docCount=29349 compound=true hasProx=true numFiles=2 size (MB)=11.92 docStoreOffset=0 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3875263 terms/docs pairs; 4516618 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  2 of 21: name=_bbxp docCount=29459 compound=true hasProx=true numFiles=2 size (MB)=11.982 docStoreOffset=29349 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3895590 terms/docs pairs; 4540859 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  3 of 21: name=_bbxq docCount=29300 compound=true hasProx=true numFiles=2 size (MB)=11.97 docStoreOffset=58808 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3890419 terms/docs pairs; 4536052 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  4 of 21: name=_bbxr docCount=29480 compound=true hasProx=true numFiles=2 size (MB)=11.971 docStoreOffset=88108 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3894211 terms/docs pairs; 4538397 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  5 of 21: name=_bbxs docCount=29470 compound=true hasProx=true numFiles=2 size (MB)=11.979 docStoreOffset=117588 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3895226 terms/docs pairs; 4540446 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  6 of 21: name=_bbxt docCount=29450 compound=true hasProx=true numFiles=2 size (MB)=11.98 docStoreOffset=147058 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3892708 terms/docs pairs; 4538338 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  7 of 21: name=_bbxu docCount=29509 compound=true hasProx=true numFiles=2 size (MB)=11.978 docStoreOffset=176508 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3894189 terms/docs pairs; 4538376 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

  8 of 21: name=_bbxv docCount=29401 compound=true hasProx=true numFiles=2 size (MB)=11.976 docStoreOffset=206017 docStoreSegment=_bbxo docStoreIsCompoundFile=true
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [1 fields]
    test: terms, freq, prox...OK [21840 terms; 3891986 terms/docs pairs; 4538746 tokens]
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
Re: Filesystem based bitset
Paul Elschot wrote:
> Since this started by thinking out loud, I'd like to continue doing that. I've been thinking about how to add a decent skipTo() to something that compresses better than an (Open)BitSet, and this turns out to be an integer set implemented as a B plus tree (all leaves on the same level) of only integers with key/data compression by a frame of reference for every node (see LUCENE-1410).

Sounds great! With flexible indexing (LUCENE-1458, which I'm needing to get back to finish...) you could experiment with these sorts of changes to the postings format by implementing your own codec.

> For example, how close is the current lucene code base to implementing a b plus tree for the doc ids of a single term?

I'm not sure this is a good fit -- B+ trees are great at insertion/deletion of entries, but we never do that with our postings (they are write once). Though if the set operations are substantially faster (??) than the doc-at-a-time iteration Lucene does today, then maybe it is compelling? But we'd have to change up how AND/OR queries work to translate into these set operations.

> How valuable are transaction semantics for such an integer set? It is tempting to try and implement such an integer set by starting from the ground up, but I don't have any practical programming experience with transaction semantics, so it may be better to start from something that has transactions right from the start.

If we use this to store/access deleted docs in RAM, then transactions are very important for realtime search. With LUCENE-1314 (IndexReader.clone) a cloned reader carries over the deletes from the original reader but must copy on write as soon as a new deletion is made. With BitVector for deleted docs, this operation is very costly. But if we used a B+ tree (or something similar) in RAM to hold the deleted docs, and that lets us incrementally copy-on-write only the nodes/blocks affected by the changes, that would be very useful.

It could also be useful for storing deleted docs in the index, ie, this is an alternative to tombstones, in which case its transactional support would be good, to avoid writing an entire BitVector when only a few additional docs became deleted, during commit. This would fit nicely with Lucene's already transactional index storage, ie rather than storing the deletion generation (an int) that we store today, we'd store some reference into the B+ tree indicating the commit point to use for deletions. But I think this usage (changing how deletions are stored on disk) is less compelling than changing how deletions are stored/used in RAM.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
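To picture the copy-on-write idea described above, here is a toy sketch. All names are invented and this is not Lucene code; a real structure would be tree-shaped (like the one Eks Dev referenced), but the sharing principle is the same.

{code:java}
// Toy copy-on-write deleted-docs set: doc ids live in fixed 1024-bit
// blocks, and mutation copies only the block it touches plus the block
// table, never the other blocks. A cloned reader can therefore add
// deletions without duplicating the whole bitset. Invented names.
final class CowDeletedDocs {
  private static final int BLOCK_BITS = 1024;
  private final long[][] blocks; // shared between versions, never mutated in place

  CowDeletedDocs(int maxDoc) {
    blocks = new long[(maxDoc + BLOCK_BITS - 1) / BLOCK_BITS][];
  }

  private CowDeletedDocs(long[][] blocks) { this.blocks = blocks; }

  boolean isDeleted(int docId) {
    long[] block = blocks[docId / BLOCK_BITS];
    return block != null
        && (block[(docId % BLOCK_BITS) >> 6] & (1L << (docId & 63))) != 0;
  }

  // Returns a new set with docId deleted; the old set is untouched, and
  // all blocks except the one containing docId are shared between them.
  CowDeletedDocs delete(int docId) {
    long[][] newBlocks = blocks.clone(); // copies block references only
    int b = docId / BLOCK_BITS;
    long[] block = newBlocks[b] == null
        ? new long[BLOCK_BITS / 64] : newBlocks[b].clone();
    block[(docId % BLOCK_BITS) >> 6] |= 1L << (docId & 63);
    newBlocks[b] = block;
    return new CowDeletedDocs(newBlocks);
  }
}
{code}

Cloning the flat block table is still O(number of blocks) per change; a B+ tree would copy only the O(log n) spine nodes, which is where the tree structure earns its keep.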
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665102#action_12665102 ]

Michael McCandless commented on LUCENE-1483:
--------------------------------------------

I'm working on another iteration of this patch, cleaning things up, adding javadocs, etc., in preparation for committing...

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
---------------------------------------------------------------------------------------------------------

                 Key: LUCENE-1483
                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 2.9
            Reporter: Mark Miller
            Priority: Minor
         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py

FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Using full norms (Was: Bubbling up newer records)
Hello,

Michael McCandless wrote:
> The upcoming Lucene in Action revision (now available online through Manning's MEAP) has a basic example of this (boosting by recency) in the Advanced Search chapter, using function queries.

I have never used function queries before, but it was very easy to boost more recent documents with the help of FieldScoreQuery. This may be quite a common usage.

The result is based on a computation at search time, but the same result could be accomplished using document boost at indexing time (and certainly faster, with less memory used). There is a difference, though: document boost is used to compute the document's norm value, which is stored with precision loss (a float encoded as a byte).

The question: Is it still really an issue to encode norms as bytes? Do we lose less than we gain? Can someone imagine any real disadvantage of storing norms as full 4-byte floats? Nowadays?

Best regards,
Jiri Kuhn.
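For context, a small sketch of the precision loss in question, assuming Lucene 2.4's static Similarity.encodeNorm/decodeNorm (the one-byte norm encoding); the demo scaffolding itself is invented:

{code:java}
import org.apache.lucene.search.Similarity;

// Round-trips a few boosts through the one-byte norm encoding to show
// the precision loss being discussed. Assumes Lucene 2.4's static
// Similarity.encodeNorm/decodeNorm; the main() wrapper is illustrative.
public class NormPrecisionDemo {
  public static void main(String[] args) {
    for (float boost : new float[] { 1.0f, 1.1f, 1.25f, 2.0f, 3.3f }) {
      byte b = Similarity.encodeNorm(boost);
      float back = Similarity.decodeNorm(b);
      System.out.println(boost + " -> byte " + b + " -> " + back);
    }
  }
}
{code}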
Re: Question on Lucene search
Please ask your question on java-u...@lucene.apache.org.

Thanks,
Grant

On Jan 19, 2009, at 1:20 AM, fell wrote:

Hi all, I am new to Lucene and I need to know the following. Say I have indexed some data using Lucene and it contains the fields Location, City, Country, with the following data in each of the above fields:
1) R G Heights
2) London
3) United Kingdom
If I try to search the index by putting the following in my query:
1) RG Heights (note: R and G do not have a space in the middle), or
2) RGHeights (no space at all), or
3) R  G Heights (extra space between tokens), or
4) Kingdom United.
Please tell me if Lucene would come up with a positive result or would tell me 'no hits'. Please let me know this for each of the queries above! Thanks!
--
View this message in context: http://www.nabble.com/Question-on-Lucene-search-tp21537509p21537509.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1522) another highlighter
another highlighter
-------------------

                 Key: LUCENE-1522
                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
            Reporter: Koji Sekiguchi
            Priority: Minor

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
// docId=0, fieldName="content", fragCharSize=100, numFragments=3
String[] fragments = h.getBestFragments( fq, reader, 0, "content", 100, 3 );
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

    Attachment: LUCENE-1522.patch

To apply this patch, LUCENE-1448 also needs to be applied:

{code}
$ svn co -r713975 http://svn.apache.org/repos/asf/lucene/java/trunk
$ cd trunk
$ patch -p0 < LUCENE-1448.patch
$ patch -p0 < LUCENE-1522.patch
{code}

another highlighter
-------------------

                 Key: LUCENE-1522
                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
            Reporter: Koji Sekiguchi
            Priority: Minor
         Attachments: LUCENE-1522.patch

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
// docId=0, fieldName="content", fragCharSize=100, numFragments=3
String[] fragments = h.getBestFragments( fq, reader, 0, "content", 100, 3 );
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Filesystem based bitset
On Monday 19 January 2009 11:32:17 Michael McCandless wrote:
> Paul Elschot wrote:
> > Since this started by thinking out loud, I'd like to continue doing that. I've been thinking about how to add a decent skipTo() to something that compresses better than an (Open)BitSet, and this turns out to be an integer set implemented as a B plus tree (all leaves on the same level) of only integers with key/data compression by a frame of reference for every node (see LUCENE-1410).
> Sounds great! With flexible indexing (LUCENE-1458, which I'm needing to get back to finish...) you could experiment with these sorts of changes to the postings format by implementing your own codec.

I'll take a look there.

> > For example, how close is the current lucene code base to implementing a b plus tree for the doc ids of a single term?
> I'm not sure this is a good fit -- B+ trees are great at insertion/deletion of entries, but we never do that with our postings (they are write once). Though if the set operations are substantially faster (??) than the doc-at-a-time iteration Lucene does today, then maybe it is compelling? But we'd have to change up how AND/OR queries work to translate into these set operations.

The idea is to implement a DocIdSetIterator on top of this, with the usual next() and skipTo(), so it should fit in the current Lucene framework.

> > How valuable are transaction semantics for such an integer set? It is tempting to try and implement such an integer set by starting from the ground up, but I don't have any practical programming experience with transaction semantics, so it may be better to start from something that has transactions right from the start.
> If we use this to store/access deleted docs in RAM, then transactions are very important for realtime search. With LUCENE-1314 (IndexReader.clone) a cloned reader carries over the deletes from the original reader but must copy on write as soon as a new deletion is made. With BitVector for deleted docs, this operation is very costly. But if we used a B+ tree (or something similar) in RAM to hold the deleted docs, and that lets us incrementally copy-on-write only the nodes/blocks affected by the changes, that would be very useful.

The one referenced by Eks Dev would be a good starting point for that; it's basically a binary tree of BitSets of at most 1024 bits at the leaves.

> It could also be useful for storing deleted docs in the index, ie, this is an alternative to tombstones, in which case its transactional support would be good, to avoid writing an entire BitVector when only a few additional docs became deleted, during commit. This would fit nicely with Lucene's already transactional index storage, ie rather than storing the deletion generation (an int) that we store today, we'd store some reference into the B+ tree indicating the commit point to use for deletions. But I think this usage (changing how deletions are stored on disk) is less compelling than changing how deletions are stored/used in RAM.

Thanks,
Paul Elschot
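As a sketch of what "fit in the current framework" means, here is a minimal DocIdSetIterator (the 2.4-era contract with next()/skipTo()) over a plain sorted int[] standing in for the compressed tree; the backing array is a placeholder assumption:

{code:java}
import java.util.Arrays;
import org.apache.lucene.search.DocIdSetIterator;

// Minimal sketch: exposes a sorted array of doc ids through the 2.4-era
// DocIdSetIterator contract. A tree-backed set would replace the array
// and make skipTo() a tree descent instead of a binary search.
public class SortedIntsIterator extends DocIdSetIterator {
  private final int[] docIds; // sorted, no duplicates
  private int pos = -1;

  public SortedIntsIterator(int[] sortedDocIds) { this.docIds = sortedDocIds; }

  public int doc() { return docIds[pos]; }

  public boolean next() { return ++pos < docIds.length; }

  public boolean skipTo(int target) {
    // Position on the first doc id >= target. (A real implementation
    // would only search forward of the current position.)
    int i = Arrays.binarySearch(docIds, target);
    pos = i >= 0 ? i : -i - 1;
    return pos < docIds.length;
  }
}
{code}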
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665252#action_12665252 ]

Michael Busch commented on LUCENE-1483:
---------------------------------------

Mark and Mike, this issue and the patch are amazingly long, and catching up here after vacation is pretty hard. Maybe you could update the description of this issue with a summary (maybe a bullet list?) that describes the main goals and changes here? That would be great...

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
---------------------------------------------------------------------------------------------------------

                 Key: LUCENE-1483
                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 2.9
            Reporter: Mark Miller
            Priority: Minor
         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py

FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

    Description:

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
TopDocs docs = searcher.search( query, 10 );
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
for( ScoreDoc scoreDoc : docs.scoreDocs ){
  // fieldName="content", fragCharSize=100, numFragments=3
  String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
  if( fragments != null ){
    for( String fragment : fragments )
      System.out.println( fragment );
  }
}
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

  was:

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
// docId=0, fieldName="content", fragCharSize=100, numFragments=3
String[] fragments = h.getBestFragments( fq, reader, 0, "content", 100, 3 );
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

another highlighter
-------------------

                 Key: LUCENE-1522
                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/highlighter
            Reporter: Koji Sekiguchi
            Priority: Minor
         Attachments: LUCENE-1522.patch

I've written this highlighter for my project to support a bi-gram token stream. The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.

usage:
{code:java}
TopDocs docs = searcher.search( query, 10 );
Highlighter h = new Highlighter();
FieldQuery fq = h.getFieldQuery( query );
for( ScoreDoc scoreDoc : docs.scoreDocs ){
  // fieldName="content", fragCharSize=100, numFragments=3
  String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
  if( fragments != null ){
    for( String fragment : fragments )
      System.out.println( fragment );
  }
}
{code}

features:
- fast for large docs
- supports fixed-size N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
- supports PhraseQuery, phrase-unit highlighting with slops
{noformat}
q="w1 w2"
<b>w1 w2</b>
---
q="w1 w2"~1
<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
{noformat}
- highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
- easy to apply the patch due to an independent package (contrib/highlighter2)
- uses Java 1.5
- looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
- pluggable FragListBuilder
- pluggable FragmentsBuilder

to do:
- term positions can be unnecessary when phraseHighlight==false
- collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1483:
--------------------------------

    Description:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

  was: FieldCache and Filters are forced down to a single segment reader, allowing for individual segment reloading on reopen.

Here is a start at a better summary. It could be improved.

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
---------------------------------------------------------------------------------------------------------

                 Key: LUCENE-1483
                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 2.9
            Reporter: Mark Miller
            Priority: Minor
         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.
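A rough illustration of the collection flow described above. This is a sketch under assumptions, not the patch's code: the description names MultiReaderHitCollector, but the exact signatures (setNextReader taking a docBase, etc.) are invented here for illustration, and real scoring is stood in for by collecting every live doc.

{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Sketch: one collector is shared across segments; before each segment is
// scored it learns the segment's doc id offset (docBase), so it can map
// segment-relative doc ids back into the MultiReader's doc id space.
abstract class MultiReaderHitCollector {
  public abstract void setNextReader(IndexReader reader, int docBase) throws IOException;
  public abstract void collect(int doc, float score); // doc is segment-relative
}

class CountingCollector extends MultiReaderHitCollector {
  private int docBase;
  int totalHits;
  int lastGlobalDoc;
  public void setNextReader(IndexReader reader, int docBase) { this.docBase = docBase; }
  public void collect(int doc, float score) {
    lastGlobalDoc = docBase + doc; // re-base into the global doc id space
    totalHits++;
  }
}

class PerSegmentSearch {
  // Drive the shared collector over each segment in turn.
  static void searchEachSegment(IndexReader[] subReaders, int[] starts,
                                MultiReaderHitCollector c) throws IOException {
    for (int i = 0; i < subReaders.length; i++) {
      c.setNextReader(subReaders[i], starts[i]);
      for (int doc = 0; doc < subReaders[i].maxDoc(); doc++) {
        if (!subReaders[i].isDeleted(doc)) c.collect(doc, 1.0f); // stand-in scoring
      }
    }
  }
}
{code}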
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1483:
--------------------------------

    Description:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
** FieldValueHitQueue - a priority queue that is part of the TopFieldCollector implementation.
** FieldComparator - a new comparator class that works across IndexReaders. Part of the TopFieldCollector implementation.
** FieldComparatorSource - a new class to allow for custom comparators.
* Alters
** IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this ;)
* Deprecates
** TopFieldDocCollector
** FieldSortedHitQueue

  was:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
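The lazy ordinal translation above can be pictured like this. This is a hedged sketch only, with all names invented; the patch's FieldComparator machinery is more involved, but the idea is that on a segment change each buffered slot's term value is re-looked-up in the new segment's sorted term table to get a comparable ordinal.

{code:java}
import java.util.Arrays;

// Sketch: ordinals are only meaningful within one segment, so on a segment
// change each queue slot's stored value is re-mapped to an ordinal in the
// new segment's sorted unique-term table (e.g. from the FieldCache).
class OrdinalSlots {
  final String[] values; // term value held in each queue slot
  final int[] ords;      // that value's ordinal in the current segment

  OrdinalSlots(int numSlots) {
    values = new String[numSlots];
    ords = new int[numSlots];
  }

  void setNextLookup(String[] sortedTerms) {
    for (int slot = 0; slot < values.length; slot++) {
      if (values[slot] == null) continue; // empty slot
      int i = Arrays.binarySearch(sortedTerms, values[slot]);
      // Absent values take the insertion point; slots that land on the
      // same ordinal would still be tie-broken by comparing values
      // directly (elided here).
      ords[slot] = i >= 0 ? i : -i - 1;
    }
  }
}
{code}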
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1483:
--------------------------------

    Description:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
** FieldValueHitQueue - a priority queue that is part of the TopFieldCollector implementation.
** FieldComparator - a new comparator class that works across IndexReaders. Part of the TopFieldCollector implementation.
** FieldComparatorSource - a new class to allow for custom comparators.
* Alters
** IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this ;)
* Deprecates
** TopFieldDocCollector
** FieldSortedHitQueue

  was:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReads, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1483:
--------------------------------

    Description:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
** FieldValueHitQueue - a priority queue that is part of the TopFieldCollector implementation.
** FieldComparator - a new comparator class that works across IndexReaders. Part of the TopFieldCollector implementation.
** FieldComparatorSource - a new class to allow for custom comparators.
* Alters
** IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this ;)
* Deprecates
** TopFieldDocCollector
** FieldSortedHitQueue

  was:

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper.

FieldCache loading over multiple segments can be much faster as well: with the old method, all unique terms for every segment are enumerated against each segment, and because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).

* Introduces
** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
[jira] Closed: (LUCENE-1519) Change Primitive Data Types from int to long in class SegmentMerger.java
[ https://issues.apache.org/jira/browse/LUCENE-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deepak closed LUCENE-1519.
--------------------------

No problem

Change Primitive Data Types from int to long in class SegmentMerger.java
-------------------------------------------------------------------------

                 Key: LUCENE-1519
                 URL: https://issues.apache.org/jira/browse/LUCENE-1519
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 2.4
         Environment: lucene 2.4.0, jdk1.6.0_03/07/11
            Reporter: Deepak
            Assignee: Michael McCandless
             Fix For: 2.9
   Original Estimate: 4h
  Remaining Estimate: 4h

Hi,
We are getting an exception while optimizing:

  mergeFields produced an invalid result: docCount is 385282378 but fdx file size is 3082259028; now aborting this merge to prevent index corruption

I have checked the code for class SegmentMerger.java and found this check:

{code:java}
if (4+docCount*8 != fdxFileLength) {
  // This is most likely a bug in Sun JRE 1.6.0_04/_05;
  // we detect that the bug has struck, here, and
  // throw an exception to prevent the corruption from
  // entering the index. See LUCENE-1282 for details.
  throw new RuntimeException("mergeFields produced an invalid result: docCount is " + docCount + " but fdx file size is " + fdxFileLength + "; now aborting this merge to prevent index corruption");
}
{code}

In our case docCount is 385282378 and fdxFileLength is 3082259028. Even though 4+385282378*8 is equal to 3082259028, the above check will not work, because the number 3082259028 is out of int range. So the type of the variable docCount needs to be changed to long.

I have written a small test for this:

{code:java}
public class SegmentMergerTest {
  public static void main(String[] args) {
    int docCount = 385282378;
    long fdxFileLength = 3082259028L;
    if (4+docCount*8 != fdxFileLength)
      System.out.println("No Match " + (4+docCount*8));
    else
      System.out.println("Match " + (4+docCount*8));
  }
}
{code}

The test above prints "No Match", but if you change the data type of docCount to long, it prints "Match".

Can you please advise us if this issue will be fixed in the next release?

Regards,
Deepak

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
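The fix the report implies is to force the arithmetic into 64 bits before the multiply overflows. A sketch of the obvious correction (the class name is invented; the committed change may differ):

{code:java}
// Same test with the arithmetic promoted to long before the multiply can
// overflow: 8L * 385282378 = 3082259024, plus 4 gives 3082259028, which
// now matches the fdx file length.
public class SegmentMergerFixTest {
  public static void main(String[] args) {
    int docCount = 385282378;
    long fdxFileLength = 3082259028L;
    if (4L + 8L * docCount != fdxFileLength)
      System.out.println("No Match " + (4L + 8L * docCount));
    else
      System.out.println("Match " + (4L + 8L * docCount)); // prints: Match 3082259028
  }
}
{code}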
Committed revision 735928.
Committed revision 735928.

Adding myself to contrib committers list / testing karma

Thanks
Patrick

scootie:site pjaol$ svn diff docs/*.html
Index: docs/whoweare.html
===================================================================
--- docs/whoweare.html (revision 735927)
+++ docs/whoweare.html (working copy)
@@ -285,6 +285,9 @@
 <b>Wolfgang Hoschek</b> (whosc...@...)</li>
 <li>
+<b>Patrick O'Leary</b> (pj...@...)</li>
+
+<li>
 <b>Uwe Schindler</b> (uschind...@...)</li>
 <li>
@@ -300,7 +303,7 @@
 </div>
-<a name="N10087"></a><a name="emeritus"></a>
+<a name="N1008C"></a><a name="emeritus"></a>
 <h2 class="boxed">Emeritus Committers</h2>
 <div class="section">
 <ul>

scootie:site pjaol$ svn diff src/documentation/content/xdocs/whoweare.xml
Index: src/documentation/content/xdocs/whoweare.xml
===================================================================
--- src/documentation/content/xdocs/whoweare.xml (revision 735927)
+++ src/documentation/content/xdocs/whoweare.xml (working copy)
@@ -31,6 +31,7 @@
 <section id="contrib"><title>Contrib Committers</title>
 <ul>
 <li><b>Wolfgang Hoschek</b> (whosc...@...)</li>
+<li><b>Patrick O'Leary</b> (pj...@...)</li>
 <li><b>Uwe Schindler</b> (uschind...@...)</li>
 <li><b>Andi Vajda</b> (va...@...)</li>
 <li><b>Karl Wettin</b> (ka...@...)</li>