The 2GB segment size limit

2008-06-25 Thread Nadav Har'El
Hi,

Recently an index I've been building passed the 2 GB mark, and after I
optimize()ed it into one segment over 2 GB, it stopped working.

Apparently, this is a known problem (on 32 bit JVMs), and mentioned in the FAQ,
http://wiki.apache.org/lucene-java/LuceneFAQ, under the question "Is there a
way to limit the size of an index?"

My first problem is that it looks to me like this FAQ entry is passing
outdated advice. My second problem is that we document a bug, instead
of fixing it.

The first thing the FAQ does is to recommend IndexWriter.setMaxMergeDocs().
This solution has two serious problems: First, normally one doesn't know how
many documents one can index before reaching 2 GB, and second, a call to
optimize() appears to ignore this setting and merge everything again - no good!

The second solution the FAQ recommends (using MultiSearcher) is unwieldy and
in my opinion, should be unnecessary (since we have the concept of segments,
why do we need separate indices in that case?).

The third option, labeled the "optimal solution", is to write a new
FSDirectory implementation that represents files over 2 GB as several
files, broken at the 2 GB mark. But has anyone ever implemented this?

Does anyone have any experience with the 2 GB problem? Is one of these
recommendations *really* the recommended solution? What about the new
LogByteSizeMergePolicy and its setMaxMergeMB argument - wouldn't it be better
to use that? Does anybody know if optimize() also obeys this flag? If not,
shouldn't it?

In short, I'd like to understand the best practices of solving the 2 GB
problem, and improve the FAQ in this regard.

Moreover, I wonder, instead of documenting around the problem, should we
perhaps make the default behavior more correct? In other words, imagine
that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023,
to be on the safe side?). Then, segments larger than 1 GB will never be
merged with anything else. Some users (with multi-gigabyte indices on a 64
bit CPU) may not like this default, but they can change it - at least with
this default Lucene's behavior will be correct on all CPUs and JVMs.

I have one last question that I wonder if anyone can answer before I start
digging into the code. We use merges not just for merging segments, but also
as an opportunity to clean up segments from deleted documents. If some segment
is bigger than the maximum and is never merged again, does this also mean
deleted documents will never ever get cleaned up from it? This can be a
serious problem on huge dynamic indices (e.g., imagine a crawl of the Web
or some large intranet).

Nowadays, 2 GB indices are less rare than they used to be, and 32 bit JVMs
are still quite common, so I think this is a problem we should solve properly.

Thanks,
Nadav.

-- 
Nadav Har'El|Wednesday, Jun 25 2008, 22 Sivan 5768
[EMAIL PROTECTED] |-
Phone +972-523-790466, ICQ 13349191 |Committee: A group of people that keeps
http://nadav.harel.org.il   |minutes and wastes hours.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Fwd: changing index format

2008-06-25 Thread Paul Elschot
On Wednesday 25 June 2008 07:03:59, John Wang wrote:
 Hi guys:
 Perhaps I should have posted this to this list in the first
 place.

 I am trying to work on a patch to for each term, expose minDoc
 and maxDoc. This value can be retrieve while constructing the
 TermInfo.

 Knowing these two values can be very helpful in caching DocIdSet
 for a given Term. This would help to determine what type of
 underlying implementation to use, e.g. BitSet, HashSet, or ArraySet,
 etc.

I suppose you know about
https://issues.apache.org/jira/browse/LUCENE-1296 ?

But how about using TermScorer? In the trunk it's a subclass of
DocIdSetIterator (via Scorer) and the caching is already done by
Lucene and the underlying OS file cache.
TermScorer does some extra work for its scoring, but I don't think
that would affect performance.

  The problem I am having is stated below, I don't know how to add
 the minDoc and maxDoc values to the index while keeping backward
 compatibility.

I doubt they would help very much. The most important info for this 
is maxDoc from the index reader and the document frequency of the term,
and these are easily determined.

Btw, I've just started to add encoding intervals of consecutive doc ids
to SortedVIntList. For very high document frequencies, that might 
actually be faster than TermScorer and more compact than the current 
index. Once I've got some working code I'll open an issue for it.
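
Paul's interval idea can be sketched roughly as follows. This is an illustrative stand-alone sketch, not the actual SortedVIntList patch; the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalEncoder {
    // Collapse a sorted doc-id array into [start, end] runs of consecutive
    // ids. For very high document frequencies (dense postings), a handful
    // of intervals can replace millions of individual deltas.
    public static List<int[]> toIntervals(int[] sortedDocIds) {
        List<int[]> intervals = new ArrayList<int[]>();
        if (sortedDocIds.length == 0) return intervals;
        int start = sortedDocIds[0], end = start;
        for (int i = 1; i < sortedDocIds.length; i++) {
            if (sortedDocIds[i] == end + 1) {
                end = sortedDocIds[i];             // extend the current run
            } else {
                intervals.add(new int[] { start, end });
                start = end = sortedDocIds[i];     // begin a new run
            }
        }
        intervals.add(new int[] { start, end });
        return intervals;
    }
}
```

For example, the ids {1, 2, 3, 7, 8, 10} collapse to three intervals: [1,3], [7,8], [10,10].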

Regards,
Paul Elschot




Re: The 2GB segment size limit

2008-06-25 Thread Michael McCandless


Nadav Har'El wrote:


Recently an index I've been building passed the 2 GB mark, and after I
optimize()ed it into one segment over 2 GB, it stopped working.


Nadav, which platform did you hit this on?  I think I've created a > 2
GB index on 32 bit WinXP just fine.  How many platforms are really
affected by this?


Apparently, this is a known problem (on 32 bit JVMs), and mentioned in the
FAQ, http://wiki.apache.org/lucene-java/LuceneFAQ, under the question "Is
there a way to limit the size of an index?"

My first problem is that it looks to me like this FAQ entry is passing
outdated advice. My second problem is that we document a bug, instead
of fixing it.

The first thing the FAQ does is to recommend IndexWriter.setMaxMergeDocs().
This solution has two serious problems: First, normally one doesn't know how
many documents one can index before reaching 2 GB, and second, a call to
optimize() appears to ignore this setting and merge everything again - no good!


And a 3rd problem is: that limit applies to the input segments (to the
merge), not the output segment.  So the example given of setting
maxMergeDocs to 7M is very likely too high, because if you merge 10
segments, each < 7M docs, you'll likely easily get a resulting segment
> 2 GB.
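
The arithmetic behind this input-vs-output distinction can be made concrete with a small sketch (plain Java, not Lucene code; the method name is made up):

```java
public class MergeMath {
    // maxMergeDocs limits the *input* segments to a merge, not the output:
    // merging mergeFactor segments, each just under the cap, can produce a
    // segment close to mergeFactor times the cap.
    static long worstCaseMergedDocs(int mergeFactor, long maxMergeDocs) {
        return mergeFactor * maxMergeDocs;
    }

    public static void main(String[] args) {
        // The FAQ's example cap of 7M docs, with Lucene's default mergeFactor of 10:
        long merged = worstCaseMergedDocs(10, 7000000L);
        System.out.println(merged);  // 70000000 docs in the merged segment
    }
}
```

At 70M docs, even a few dozen bytes per document in the segment files pushes the merged segment well past 2 GB.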


The second solution the FAQ recommends (using MultiSearcher) is unwieldy and
in my opinion, should be unnecessary (since we have the concept of segments,
why do we need separate indices in that case?).

The third option labeled the optimal solution is to write a new
FSDirectory implementation that represents files over 2 GB as several
files, broken on the 2 GB mark. But has anyone ever implemented this?


I agree these two workarounds sound quite challenging to do in  
practice...



Does anyone have any experience with the 2 GB problem? Is one of these
recommendations *really* the recommended solution? What about the new
LogByteSizeMergePolicy and its setMaxMergeMB argument - wouldn't it be better
to use that? Does anybody know if optimize() also obeys this flag? If not,
shouldn't it?


optimize() doesn't obey it, and the same problem (input vs output)  
applies to maxMergeMB as well.


To make optimize() obey these limits, one would have to make their own  
MergePolicy.


In short, I'd like to understand the best practices of solving the 2 GB
problem, and improve the FAQ in this regard.

Moreover, I wonder, instead of documenting around the problem, should we
perhaps make the default behavior more correct? In other words, imagine
that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023,
to be on the safe side?). Then, segments larger than 1 GB will never be
merged with anything else. Some users (with multi-gigabyte indices on a 64
bit CPU) may not like this default, but they can change it - at least with
this default Lucene's behavior will be correct on all CPUs and JVMs.


I think we should understand how widespread this really is in our  
userbase.  If it's a minority being affected by it, I think the  
current defaults are correct (and, it's this minority that should  
change Lucene to not produce too large a segment).


I have one last question that I wonder if anyone can answer before I start
digging into the code. We use merges not just for merging segments, but also
as an opportunity to clean up segments from deleted documents. If some segment
is bigger than the maximum and is never merged again, does this also mean
deleted documents will never ever get cleaned up from it? This can be a
serious problem on huge dynamic indices (e.g., imagine a crawl of the Web
or some large intranet).


Right, the deletes will not be cleaned up.  But you can use  
expungeDeletes()?  Or, make a MergePolicy that favors merges that  
would clean up deletes.


Mike




Re: ReaderCommit

2008-06-25 Thread Michael McCandless


Jason Rutherglen wrote:

For Ocean I created a workaround where the IndexCommits from  
IndexDeletionPolicy are saved in a map in order to achieve deleting  
based on the IndexReader.  It would be more straightforward to  
delete from the IndexCommit in IndexReader.


It seems like we are mixing up deleting a whole commit point, vs  
deleting individual documents?  Or does Ocean somehow decide to delete  
a whole commit point based on which documents have been deleted?


I realize people want to get away from IndexReader performing  
updates, however, for my use case, realtime search updating from  
IndexReader makes sense mainly for obtaining the doc ids of  
deletions.  With IndexWriter managing the merges it would seem  
difficult to expose doc numbers, but perhaps there is a way.


IndexWriter can now delete by query, but it sounds like that's not  
sufficient for Ocean?


Under the hood, IndexWriter has the infrastructure to hold pending
deleted docIDs and update these docIDs when a merge is committed.  Ie,
previously we forced a flush of all pending deletes on every
flush/merge, but now we buffer the docIDs across flushes/merges.  This
means IndexWriter *could* delete by docID, however, none of this is
exposed publicly.


Also, this doesn't solve the problem of how you would get the docIDs  
to delete in the first place (ie one must still use a separate  
IndexReader for that).


I'm not sure this helps you (Ocean) since you presumably need to flush  
deletes very quickly to have realtime search...


Mike




Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Michael McCandless


Jason Rutherglen wrote:

One of the bottlenecks I have noticed testing Ocean realtime search
is the delete process, which involves writing several files for each
possibly single delete of a document in SegmentReader.  The best way
to handle the deletes is to simply keep them in memory without
flushing them to disk, saving on writing out an entire BitVector per
delete.  The deletes are saved in the transaction log, which is
replayed on recovery.


I am not sure of the best way to approach this, perhaps it is  
creating a custom class that inherits from SegmentReader.  It could  
reuse the existing reopen and also provide a way to set the  
deletedDocs BitVector.  Also it would be able to reuse FieldsReader  
by providing locking around FieldsReader for all SegmentReaders of  
the segment to use.  Otherwise in the current architecture each new  
SegmentReader opens a new FieldsReader which is non-optimal.  The  
deletes would be saved to disk but instead of per delete,  
periodically like a checkpoint.


Or ... maybe you could do the deletes through IndexWriter (somehow, if  
we can get docIDs properly) and then SegmentReaders could somehow tap  
into the buffered deleted docIDs that IndexWriter already maintains.   
IndexWriter is already doing this buffering, flush/commit anyway.


We've also discussed at one point creating an IndexReader impl that  
searches the RAM buffer that DocumentsWriter writes to when adding  
documents.  I think it's easier than it sounds, on first glance,  
because DocumentsWriter is in fact writing the postings in nearly the  
same format as is used when the segment is flushed.


So if we had this IndexReader impl, plus extended SegmentReader so it  
could tap into pending deletes buffered in IndexWriter, you could get  
realtime search without having to use Directory as an intermediary.   
Though, it is clearly quite a bit more work :)


Mike




Re: changing index format

2008-06-25 Thread Michael McCandless


John Wang wrote:

 The problem I am having is stated below, I don't know how to  
add the minDoc and maxDoc values to the index while keeping backward  
compatibility.


Unfortunately, TermInfo file format just isn't extensible at the  
moment, so I think for now you'll have to break backward compatibility  
if you really want to store these new fields in the _X.tis/.tii files.


EG here is another recent example of wanting to alter what's stored in  
TermInfo:


   https://issues.apache.org/jira/browse/LUCENE-1278

For flexible indexing we clearly need to fix this, so that any  
plugin in the indexing chain could stuff whatever it wants into the  
TermInfo, and also override how TermInfo is read/written.  Even the  
things we now store in TermInfo should be optional.  EG say you choose  
not to store locations (prx) for a given field.  Then, you would not  
need the long proxPointer.


Mike




[jira] Commented: (LUCENE-1314) IndexReader.reopen(boolean force)

2008-06-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607954#action_12607954
 ] 

Michael McCandless commented on LUCENE-1314:


bq. In my SegmentReader subclass I am passing a lock and passing a reference to 
fieldsReader for global locking and a single fieldsReader across all instances. 
Otherwise there are too many instances of fieldsReader and file descriptors 
will be used up.

Maybe instead we should just fix access to FieldsReader to be thread safe, 
either by making FieldsReader itself thread safe, or by doing something similar 
to what's done for TermVectorsReader (where each thread makes a shallow clone 
of the original TermVectorsReader, held in a ThreadLocal instance).  If we do 
that, then in SegmentReader.doReopen()  we never have to clone FieldsReader.
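
The shallow-clone-per-thread pattern used by TermVectorsReader can be sketched like this. The classes below are hypothetical stand-ins, not the real Lucene FieldsReader:

```java
public class PerThreadReader {
    // A stateful reader: the seek position makes concurrent use unsafe,
    // so each thread works on its own shallow clone while sharing the
    // underlying resource. Hypothetical stand-in class.
    static class FileReaderLike implements Cloneable {
        int position;                   // per-clone mutable state
        final String sharedResource;    // shared across all clones
        FileReaderLike(String resource) { this.sharedResource = resource; }
        public FileReaderLike clone() {
            try {
                return (FileReaderLike) super.clone();  // shallow copy
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    private final FileReaderLike original = new FileReaderLike("fields-data");
    // Each thread lazily gets its own clone; no locking on the hot path.
    private final ThreadLocal<FileReaderLike> perThread =
        ThreadLocal.withInitial(original::clone);

    public FileReaderLike get() { return perThread.get(); }
}
```

In this sketch a clone shares the original's underlying resource, so per-thread state (the position) is isolated without taking a lock on every access.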

 IndexReader.reopen(boolean force)
 -

 Key: LUCENE-1314
 URL: https://issues.apache.org/jira/browse/LUCENE-1314
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Attachments: lucene-1314.patch, lucene-1314.patch, lucene-1314.patch


 Based on discussion 
 http://www.nabble.com/IndexReader.reopen-issue-td18070256.html.  The problem 
 is reopen returns the same reader if there are no changes, so if docs are 
 deleted from the new reader, they are also reflected in the previous reader 
 which is not always desired behavior.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Jason Rutherglen
I understand what you are saying.  I am not sure it is worth "clearly quite
a bit more work" given how easy it is to simply have more control over the
IndexReader deletedDocs BitVector, which seems like a feature that should be
in there anyways, perhaps even allowing SortedVIntList to be used.  The other
issue with going down the path of integrating too much with IndexWriter is
that I am not sure how to integrate the realtime document additions to
IndexWriter, which is handled best by InstantiatedIndex.  When merging needs
to happen in Ocean, IndexWriter.addIndexes(IndexReader[] readers) is used to
merge SegmentReaders and InstantiatedIndexReaders.

One of the things I do not understand about IndexWriter deletes is that it
does not reuse an already open TermInfosReader with the tii loaded.  Isn't
this slower than deleting using an already open IndexReader?

In any case the method of using deletedDocs in SegmentReader using the patch
given seems to work quite well in Ocean now.  I think long term there is
probably some way to integrate more with IndexWriter, but really that is
something more in line with removing the concept of IndexReader and
IndexWriter and creating an IndexReaderWriter class.

On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless 
[EMAIL PROTECTED] wrote:


 Jason Rutherglen wrote:

  One of the bottlenecks I have noticed testing Ocean realtime search is the
 delete process which involves writing several files for each possibly single
 delete of a document in SegmentReader.  The best way to handle the deletes
 is too simply keep them in memory without flushing them to disk, saving on
 writing out an entire BitVector per delete.  The deletes are saved in the
 transaction log which is be replayed on recovery.

 I am not sure of the best way to approach this, perhaps it is creating a
 custom class that inherits from SegmentReader.  It could reuse the existing
 reopen and also provide a way to set the deletedDocs BitVector.  Also it
 would be able to reuse FieldsReader by providing locking around FieldsReader
 for all SegmentReaders of the segment to use.  Otherwise in the current
 architecture each new SegmentReader opens a new FieldsReader which is
 non-optimal.  The deletes would be saved to disk but instead of per delete,
 periodically like a checkpoint.


 Or ... maybe you could do the deletes through IndexWriter (somehow, if we
 can get docIDs properly) and then SegmentReaders could somehow tap into the
 buffered deleted docIDs that IndexWriter already maintains.  IndexWriter is
 already doing this buffering, flush/commit anyway.

 We've also discussed at one point creating an IndexReader impl that
 searches the RAM buffer that DocumentsWriter writes to when adding
 documents.  I think it's easier than it sounds, on first glance, because
 DocumentsWriter is in fact writing the postings in nearly the same format as
 is used when the segment is flushed.

 So if we had this IndexReader impl, plus extended SegmentReader so it could
 tap into pending deletes buffered in IndexWriter, you could get realtime
 search without having to use Directory as an intermediary.  Though, it is
 clearly quite a bit more work :)

 Mike





Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless
[EMAIL PROTECTED] wrote:
 We've also discussed at one point creating an IndexReader impl that searches
 the RAM buffer that DocumentsWriter writes to when adding documents.  I
 think it's easier than it sounds, on first glance, because DocumentsWriter
 is in fact writing the postings in nearly the same format as is used when
 the segment is flushed.

That would be very nice, and should also make it much easier to
implement updateable documents (changing/adding/removing single
fields).

-Yonik




[jira] Commented: (LUCENE-1314) IndexReader.reopen(boolean force)

2008-06-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608039#action_12608039
 ] 

Jason Rutherglen commented on LUCENE-1314:
--

Here is the code of the SegmentReader subclass.  Using the clone terminology 
would work as well; inside of SegmentReader, the clone would most likely reuse 
SegmentReader.reopenSegment.  The subclass turns off locking by overriding 
acquireWriteLock and having it do nothing.  I do not know a general fix for the 
locking issue mentioned: one reader holds a lock, and then you can't do 
deletions in the second object.  Perhaps there is a way using lock-less 
commits.  It is possible to have SegmentReader fail a flush when deletes occur 
against an earlier IndexReader, rather than fail in a newer IndexReader like it 
would now.  This would require keeping track of later IndexReaders, which is 
something Ocean does outside of IndexReader.  

As far as the FieldsReader, given how many SegmentReaders Ocean creates (up to 
one per update), a shallow-clone ThreadLocal would still potentially create 
many file descriptors.  I would rather see a synchronized FieldsReader, or 
simply use the approach in the code below.  The external lock used seems ok 
because there is little competition for reading Documents, no more than a 
normal Lucene application using a single IndexReader loading documents for N 
results.  

{code}
public class OceanSegmentReader extends SegmentReader {
  protected ReentrantLock fieldsReaderLock;

  public OceanSegmentReader() {
    openNewFieldsReader = false;
  }

  protected void doInitialize() {
    fieldsReaderLock = new ReentrantLock();
  }

  // Locking is handled outside this reader, so the write lock is a no-op.
  protected void acquireWriteLock() throws IOException {
  }

  protected synchronized DirectoryIndexReader doReopen(SegmentInfos infos,
      boolean force) throws CorruptIndexException, IOException {
    OceanSegmentReader segmentReader =
        (OceanSegmentReader) super.doReopen(infos, force);
    // The reopened reader shares the same FieldsReader lock.
    segmentReader.fieldsReaderLock = fieldsReaderLock;
    return segmentReader;
  }

  /**
   * @throws CorruptIndexException
   *   if the index is corrupt
   * @throws IOException
   *   if there is a low-level IO error
   */
  public synchronized Document document(int n, FieldSelector fieldSelector)
      throws CorruptIndexException, IOException {
    ensureOpen();
    if (isDeleted(n))
      throw new IllegalArgumentException("attempt to access a deleted document");
    fieldsReaderLock.lock();
    try {
      return getFieldsReader().doc(n, fieldSelector);
    } finally {
      fieldsReaderLock.unlock();
    }
  }
}
{code}

 IndexReader.reopen(boolean force)
 -

 Key: LUCENE-1314
 URL: https://issues.apache.org/jira/browse/LUCENE-1314
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Attachments: lucene-1314.patch, lucene-1314.patch, lucene-1314.patch


 Based on discussion 
 http://www.nabble.com/IndexReader.reopen-issue-td18070256.html.  The problem 
 is reopen returns the same reader if there are no changes, so if docs are 
 deleted from the new reader, they are also reflected in the previous reader 
 which is not always desired behavior.






Re: per-field similarity

2008-06-25 Thread Karl Wettin

+1

On 24 Jun 2008, at 22:28, Yonik Seeley wrote:


Something to consider for Lucene 3 is to have something to retrieve
Similarity per-field rather than passing the field name into some
functions...

benefits:
- Would allow customizing most Similarity functions per-field
- Performance: Similarity for a field could be looked up once at the
beginning of a query and reused, eliminating hash lookups for every
Similarity function called that needs to be different depending on the
field name.

Might also consider passing in more optional context when retrieving
the similarity for a field (such as a Query, if searching).
Something like Similarity.getSimilarity(String field, Query q).
Multi-field queries (boolean query) could pass null for the field.
Perhaps it could even be back compatible...
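
The lookup-once idea can be sketched in plain Java. Everything here (the class, the `Sim` interface, the method names) is hypothetical, purely to illustrate the proposal:

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldSimilarity {
    // Minimal stand-in for a Similarity function (hypothetical interface).
    interface Sim {
        float lengthNorm(int numTerms);
    }

    private final Map<String, Sim> perField = new HashMap<String, Sim>();
    // Default mirrors Lucene's classic 1/sqrt(numTerms) length norm.
    private final Sim defaultSim = numTerms -> 1f / (float) Math.sqrt(numTerms);

    public void register(String field, Sim sim) { perField.put(field, sim); }

    // Resolved once at the start of a query and then reused for every
    // document, instead of a hash lookup inside each scoring call.
    public Sim getSimilarity(String field) {
        return perField.getOrDefault(field, defaultSim);
    }
}
```

A query over field "title" would call getSimilarity("title") once during setup and hold the returned Sim for the rest of scoring, which is where the per-call hash lookups are eliminated.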

-Yonik








Is there a reason MemoryIndex does not implement Serializable?

2008-06-25 Thread Jason Rutherglen
It seems like it could, it even has serialVersionUID defined.


Re: Fwd: changing index format

2008-06-25 Thread John Wang
Thanks Paul and Mike for the feedback.
Paul, for us, the sparsity of the docIds determines which data structure to
use. While cardinality gives some of that, min/max docId would also help; for
example:

say maxdoc=100, cardinality = 7, docids: {0,1,...6} or
{3,4...9}, using arrayDocIdSet would take 28 bytes and bitset
would take only 1.

Furthermore, knowing min/max docId would help predetermine the size needed
when constructing a given DocIdSet data structure, to avoid growth.

Thanks for pointing me to SortedVIntList; what is the underlying compression
algorithm? We have developed a DocIdSet implementation using a variation
of the P4Delta compression algorithm (
http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to contribute
sometime. From our benchmark, we get about 70% compression (30% of the
original size) of arrays, which also gives you iteration in compressed format
with performance similar to OpenBitSet. (Iterating over arrays is much
faster than over OpenBitSet.)
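
This is not the P4Delta variant described above, but for reference, the baseline it competes against (delta-encoded doc ids with Lucene-style variable-length ints) looks roughly like this as a self-contained sketch:

```java
import java.io.ByteArrayOutputStream;

public class DeltaVInt {
    // Encode a sorted doc-id array as gaps, each gap written as a VInt:
    // 7 data bits per byte, high bit set on all but the last byte.
    public static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : sortedDocIds) {
            int delta = id - prev;
            prev = id;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);  // continuation byte
                delta >>>= 7;
            }
            out.write(delta);                      // final byte, high bit clear
        }
        return out.toByteArray();
    }

    public static int[] decode(byte[] bytes, int count) {
        int[] ids = new int[count];
        int pos = 0, prev = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0, shift = 0, b;
            do {
                b = bytes[pos++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ids[i] = prev;
        }
        return ids;
    }
}
```

Small gaps (dense postings) compress to one byte each, which is why sparsity matters so much when choosing a representation.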

I am not sure TermScorer serves the purpose here. TermScorer reads a batch
of 32 docs at a time (I don't understand why 32 is picked, or whether it
should be customizable), and we can't rely on getting lucky with the
underlying OS cache. Many times, we want to move the construction of some
filters ahead while the IndexReader reads. Here is an example: say we have a
field called gender with only 2 terms: M, F. And our query is always of the
form content:"query text" AND gender:M/F; it is ideal to keep the DocIdSets
for M and F in memory for the life of the IndexReader. I can't imagine
constructing a TermScorer for each query is similar in performance.

Reading the trunk code for TermScorer, I don't see that the internal termDocs
is closed in skipTo. skipTo returns a boolean which tells the caller if the
end is reached; the caller may not/should not call next again to have it
closed. So wouldn't this scenario leak? Also, in explain(docid), what happens
if termDocs is already closed from the next() call?

Thanks

-John

On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot [EMAIL PROTECTED]
wrote:

 On Wednesday 25 June 2008 07:03:59, John Wang wrote:
  Hi guys:
  Perhaps I should have posted this to this list in the first
  place.
 
  I am trying to work on a patch to for each term, expose minDoc
  and maxDoc. This value can be retrieve while constructing the
  TermInfo.
 
  Knowing these two values can be very helpful in caching DocIdSet
  for a given Term. This would help to determine what type of
  underlying implementation to use, e.g. BitSet, HashSet, or ArraySet,
  etc.

 I suppose you know about
 https://issues.apache.org/jira/browse/LUCENE-1296 ?

 But how about using TermScorer? In the trunk it's a subclass of
 DocIdSetIterator (via Scorer) and the caching is already done by
 Lucene and the underlying OS file cache.
 TermScorer does some extra work for its scoring, but I don't think
 that would affect performance.

   The problem I am having is stated below, I don't know how to add
  the minDoc and maxDoc values to the index while keeping backward
  compatibility.

 I doubt they would help very much. The most important info for this
 is maxDoc from the index reader and the document frequency of the term,
 and these are easily determined.

 Btw, I've just started to add encoding intervals of consecutive doc ids
 to SortedVIntList. For very high document frequencies, that might
 actually be faster than TermScorer and more compact than the current
 index. Once I've got some working code I'll open an issue for it.

 Regards,
 Paul Elschot





Re: Is there a reason MemoryIndex does not implement Serializable?

2008-06-25 Thread Erik Hatcher

No reason ... done!

Erik

On Jun 25, 2008, at 11:05 AM, Jason Rutherglen wrote:


It seems like it could, it even has serialVersionUID defined.






Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Jason Rutherglen
I read other parts of the email but glanced over this part.  Would terms be
automatically sorted as they came in?  If implemented it would be nice to be
able to get an encoded representation (probably byte array) of the document
and postings which could be written to a log, and then reentered in another
IndexWriter recreating the document and postings.

On Wed, Jun 25, 2008 at 8:41 AM, Yonik Seeley [EMAIL PROTECTED] wrote:

 On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless
 [EMAIL PROTECTED] wrote:
  We've also discussed at one point creating an IndexReader impl that
 searches
  the RAM buffer that DocumentsWriter writes to when adding documents.  I
  think it's easier than it sounds, on first glance, because
 DocumentsWriter
  is in fact writing the postings in nearly the same format as is used when
  the segment is flushed.

 That would be very nice, and should also make it much easier to
 implement updateable documents (changing/adding/removing single
 fields).

 -Yonik





[jira] Created: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Todd Feak (JIRA)
Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer


 Key: LUCENE-1316
 URL: https://issues.apache.org/jira/browse/LUCENE-1316
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.3
 Environment: All
Reporter: Todd Feak
Priority: Minor


The isDeleted() method on IndexReader has been mentioned a number of times as a 
potential synchronization bottleneck. However, the reason this bottleneck 
occurs is actually at a higher level that wasn't focused on (at least in the 
threads I read).

In every case I saw where a stack trace was provided to show the lock/block, 
higher in the stack you see the MatchAllScorer.next() method. In Solr 
particularly, this scorer is used for NOT queries. We saw incredibly poor 
performance (an order of magnitude) on our load tests for NOT queries, due to 
this bottleneck. The problem is that every single document is run through this 
isDeleted() method, which is synchronized. Having an optimized index 
exacerbates this issue, as there is only a single SegmentReader to synchronize 
on, causing a major thread pileup waiting for the lock.

By simply having the MatchAllScorer see if there have been any deletions in the 
reader, much of this can be avoided. Especially in a read-only environment for 
production where you have slaves doing all the high load searching.

I modified line 67 in the MatchAllDocsQuery
FROM:
  if (!reader.isDeleted(id)) {
TO:
  if (!reader.hasDeletions() || !reader.isDeleted(id)) {

In our micro load test for NOT queries only, this was a major performance 
improvement.  We also got the same query results. I don't believe this will 
improve the situation for indexes that have deletions. 

Please consider making this adjustment for a future bug fix release.
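
The effect of the short-circuit can be demonstrated with a small mock (a hypothetical stand-in for IndexReader, not the real class):

```java
public class MatchAllSketch {
    // Mimics IndexReader's deletion API: hasDeletions() is a cheap
    // unsynchronized check, isDeleted() is synchronized and contended.
    static class MockReader {
        private final boolean hasDeletions;
        int isDeletedCalls = 0;
        MockReader(boolean hasDeletions) { this.hasDeletions = hasDeletions; }
        boolean hasDeletions() { return hasDeletions; }
        synchronized boolean isDeleted(int id) {
            isDeletedCalls++;
            return false;
        }
    }

    // The patched check: skip the synchronized call entirely when the
    // reader has no deletions at all.
    static boolean accept(MockReader reader, int id) {
        return !reader.hasDeletions() || !reader.isDeleted(id);
    }

    public static void main(String[] args) {
        MockReader clean = new MockReader(false);
        for (int id = 0; id < 1000; id++) accept(clean, id);
        System.out.println(clean.isDeletedCalls);  // 0: the lock is never taken
    }
}
```

On a reader that does have deletions the behavior (and the lock) is unchanged, which matches the report that the patch only helps deletion-free indexes such as optimized, read-only slaves.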











[jira] Updated: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Todd Feak (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Feak updated LUCENE-1316:
--


Further investigation indicates that ValueSourceQuery$ValueSourceScorer may 
suffer from the same issue and would benefit from a similar optimization.

 Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer
 

 Key: LUCENE-1316
 URL: https://issues.apache.org/jira/browse/LUCENE-1316
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.3
 Environment: All
Reporter: Todd Feak
Priority: Minor
 Attachments: MatchAllDocsQuery.java

   Original Estimate: 1h
  Remaining Estimate: 1h





Re: Fwd: changing index format

2008-06-25 Thread John Wang
Hi Paul:
Regarding your comment on adding required/prohibited clauses to BooleanQuery:

On top of the new DocIdSet and DocIdSetIterator abstractions, we also
developed decorators such as AndDocIdSet, OrDocIdSet and NotDocIdSet, as well
as a DocIdSetQuery class that honors the Query API contracts. Given these
tools, we are able to build a customized, scored BooleanQuery-like query
infrastructure. We'd be happy to contribute them.

Thanks

-John

On Wed, Jun 25, 2008 at 9:29 AM, Paul Elschot [EMAIL PROTECTED]
wrote:

 Op Wednesday 25 June 2008 17:05:17 schreef John Wang:
  Thanks Paul and Mike for the feedback.
  Paul, for us, the sparsity of the docIds determines which data structure
  to use. Cardinality gives some of that, but min/max docId would
  also help. For example:
 
  say maxDoc=100 and cardinality=7, with docIds {0,1,...,6} or
  {3,4,...,9}: an array-based DocIdSet would take 28 bytes, while a
  bitset over just the [minDoc, maxDoc] range would take only 1 byte.
 
  Furthermore, knowing min/maxDocId helps predetermine the size
  needed when constructing a given DocIdSet data structure, avoiding
  growth.
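The sizing tradeoff above can be sketched as a back-of-the-envelope calculation (the method names are illustrative, not an actual Lucene API), assuming 4 bytes per docid for an array versus one bit per doc over the known range for a bitset:

```java
/**
 * Rough byte-size estimates for choosing a DocIdSet representation:
 * a plain int array versus a bitset covering only [minDoc, maxDoc].
 */
public class DocIdSetSizing {
  static long arrayBytes(int cardinality) {
    return 4L * cardinality; // one 4-byte int per docid
  }

  static long rangeBitSetBytes(int minDoc, int maxDoc) {
    // a bitset only needs to cover the known docid range, rounded up to bytes
    return (maxDoc - minDoc + 1 + 7) / 8;
  }

  public static void main(String[] args) {
    // the example from this thread: 7 docids clustered in a 7-doc range
    System.out.println(arrayBytes(7));          // 28 bytes
    System.out.println(rangeBitSetBytes(0, 6)); // 1 byte
  }
}
```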
 
  Thanks for pointing me to SortedVIntList; what is the underlying
  compression algorithm?

 A SortedVIntList uses a byte array to store the docid differences as
 a series of VInts, where a VInt is a sequence of bytes in which the
 high bit is a continuation bit and the remaining seven bits carry data
 for an unsigned integer. The same VInt encoding is used in various
 places in a Lucene index.
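For illustration, here is a minimal sketch of that VInt coding and its use for sorted docid gaps. This is a simplification of what SortedVIntList does, not the actual Lucene code:

```java
import java.io.ByteArrayOutputStream;

/** Sketch of Lucene-style VInt coding: 7 data bits per byte, high bit = continuation. */
public class VIntSketch {
  static void writeVInt(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {      // more than 7 bits remain
      out.write((i & 0x7F) | 0x80); // low 7 bits plus continuation flag
      i >>>= 7;
    }
    out.write(i);                   // final byte, continuation bit clear
  }

  static int readVInt(byte[] buf, int[] pos) {
    byte b = buf[pos[0]++];
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = buf[pos[0]++];
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  public static void main(String[] args) {
    // a SortedVIntList stores sorted docids as VInt-coded gaps
    int[] docIds = {5, 6, 130, 20000};
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int id : docIds) { writeVInt(out, id - prev); prev = id; }
    byte[] buf = out.toByteArray();
    int[] pos = {0};
    prev = 0;
    for (int id : docIds) {
      prev += readVInt(buf, pos);
      if (prev != id) throw new AssertionError("roundtrip failed");
    }
    System.out.println(buf.length + " bytes for " + docIds.length + " docids");
  }
}
```

Small gaps between adjacent docids compress to single bytes, which is why delta coding pays off for sorted id lists.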

  We have developed a DocIdSet implementation using a variation of the
  P4Delta (PForDelta) compression algorithm
  (http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to
  contribute sometime. From our benchmarks, we get about 70% compression
  (30% of the original size) relative to arrays, while still supporting
  iteration in compressed form with performance similar to OpenBitSet.
  (Iterating over arrays is much faster than over OpenBitSet.)

 Andrzej recently pointed to a paper on PForDelta, and since then a Java
 implementation has been sitting rather low on my todo list.
 Needless to say, I'm interested to see it contributed.

  I am not sure TermScorer serves the purpose here. TermScorer reads a
  batch of 32 documents at a time (I don't understand why 32 was picked,
  or whether it should be customizable), and we can't rely on getting
  lucky with the underlying OS cache. Many times we want to move the
  construction of some filters ahead of the IndexReader reads. Here is
  an example: say we have a field called gender with only 2 terms,
  M and F, and our query is always of the form content:query text AND
  gender:M/F. It is ideal to keep the DocIdSets for M and F in memory
  for the life of the IndexReader. I can't imagine constructing a
  TermScorer for each query performing similarly.

 Well, you can give TermScorer a try before writing other code.
 Adding a DocIdSet as required or prohibited to a BooleanQuery
 would be nice, but that is not yet possible.

  Reading the trunk code for TermScorer, I don't see the internal
  termDocs being closed in skipTo(). skipTo() returns a boolean that
  tells the caller whether the end was reached; the caller may not, and
  should not, call next() again just to have it closed. So wouldn't this
  scenario leak?

 Closing of Scorers has been discussed before, the only conclusion
 I remember now is that there is no bug in the current code.

  Also, in explain(docid), what happens if termDoc is already closed
  from the next() call?

 When explain() is called on a Scorer, next() and skipTo() should
 not be called. A Scorer can either explain, or search, but not both.

 Regards,
 Paul Elschot


 
  Thanks
 
  -John
 
  On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot
  [EMAIL PROTECTED]
 
  wrote:
   Op Wednesday 25 June 2008 07:03:59 schreef John Wang:
Hi guys:
Perhaps I should have posted this to this list in the first
place.
   
I am trying to work on a patch to for each term, expose
minDoc and maxDoc. This value can be retrieve while constructing
the TermInfo.
   
Knowing these two values can be very helpful in caching
DocIdSet for a given Term. This would help to determine what type
of underlying implementation to use, e.g. BitSet, HashSet, or
ArraySet, etc.
  
   I suppose you know about
   https://issues.apache.org/jira/browse/LUCENE-1296 ?
  
   But how about using TermScorer? In the trunk it's a subclass of
   DocIdSetIterator (via Scorer) and the caching is already done by
   Lucene and the underlying OS file cache.
   TermScorer does some extra work for its scoring, but I don't think
   that would affect performance.
  
  The problem I am having is stated below: I don't know how to
 add the minDoc and maxDoc values to the index while keeping
 backward compatibility.
  
   I doubt they would help very much. The most important info for this
   is maxDoc from the index reader and the document frequency of the
   term, and these are easily determined.
  
   Btw, I've just started to add 

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 11:30 AM, Jason Rutherglen
[EMAIL PROTECTED] wrote:
 I read other parts of the email but glanced over this part.  Would terms be
 automatically sorted as they came in?  If implemented, it would be nice to be
 able to get an encoded representation (probably a byte array) of the document
 and postings, which could be written to a log and then re-entered into another
 IndexWriter, recreating the document and postings.

I was talking simpler...  If one could open an IndexReader on the
index (including uncommitted documents in the open IndexWriter), then
you could easily search for a document and retrieve its stored fields
in order to re-index it with changes (and still maintain decent
performance).

-Yonik


 On Wed, Jun 25, 2008 at 8:41 AM, Yonik Seeley [EMAIL PROTECTED] wrote:

 On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless
 [EMAIL PROTECTED] wrote:
  We've also discussed at one point creating an IndexReader impl that
  searches
  the RAM buffer that DocumentsWriter writes to when adding documents.  I
  think it's easier than it sounds, on first glance, because
  DocumentsWriter
  is in fact writing the postings in nearly the same format as is used
  when
  the segment is flushed.

 That would be very nice, and should also make it much easier to
 implement updateable documents (changing/adding/removing single
 fields).

 -Yonik








BooleanQuery and DocIdSet; Was: Fwd: changing index format

2008-06-25 Thread Paul Elschot
Op Wednesday 25 June 2008 18:45:16 schreef John Wang:
 Hi Paul:
 Regarding your comment on adding required/prohibited clauses to
 BooleanQuery:

 On top of the new DocIdSet and DocIdSetIterator abstractions, we also
 developed decorators such as AndDocIdSet, OrDocIdSet and NotDocIdSet, as
 well as a DocIdSetQuery class that honors the Query API contracts. Given
 these tools, we are able to build a customized, scored BooleanQuery-like
 query infrastructure. We'd be happy to contribute them.

Another thing to be removed from near the end of my todo list?
Perhaps I could even take a vacation :)

More seriously: would DocIdSetQuery be superfluous when
a DocIdSet could be added directly to a BooleanQuery?

Could you elaborate a bit on the customized scoring?

Regards,
Paul Elschot




[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608128#action_12608128
 ] 

Yonik Seeley commented on LUCENE-1316:
--

Although this doesn't solve the general problem, it probably still makes 
sense to do now for the no-deletes case.
Todd, can you produce a patch?  See 
http://wiki.apache.org/lucene-java/HowToContribute





[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608129#action_12608129
 ] 

Hoss Man commented on LUCENE-1316:
--

Rather than attempting localized optimizations of individual Query classes, a 
more general improvement would probably be to change SegmentReader.isDeleted 
so that, instead of the entire method being synchronized, it first checks 
whether the segment has any deletions, and only then enters a synchronized 
block to check deletedDocs.get(n).





[jira] Updated: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Todd Feak (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Feak updated LUCENE-1316:
--


I like Hoss's suggestion better. I'll try that fix locally, and if it provides 
the same improvement I will submit a patch.





[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608134#action_12608134
 ] 

Yonik Seeley commented on LUCENE-1316:
--

 a more general improvement would probably be to change 
 SegmentReader.isDeleted so that instead of the entire method being 
 synchronized

Right, but that's not fully back-compatible.  Code that depended on deletes 
being instantly visible across threads would no longer have that guarantee.





Re: per-field similarity

2008-06-25 Thread Chris Hostetter

: Might also consider passing in more optional context when retrieving
: the similarity for a field (such as a Query, if searching).
: Something like Similarity.getSimilarity(String field, Query q).

i assume you mean Searcher.getSimilarity(String fieldName, Query q) to 
replace the current Searcher.getSimilarity() right?  (where in both cases 
we are talking about an instance method and not a static method)

There have been some discussions about this in the past; I think at one point 
Doug suggested almost exactly the same thing in this thread... 

http://www.nabble.com/-jira--Created%3A-%28LUCENE-577%29-SweetSpotSimiliarity-to4533741.html#a4536312

...it could probably be done in a completely backwards-compatible way.


-Hoss





[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608137#action_12608137
 ] 

Hoss Man commented on LUCENE-1316:
--

bq. Code that depended on deletes being instantly visible across threads would 
no longer be guaranteed.

you lost me there ... why would deletes stop being instantly visible if we 
changed this...

{code}
  public synchronized boolean isDeleted(int n) {
    return (deletedDocs != null && deletedDocs.get(n));
  }
{code}

...to this...

{code}
  public boolean isDeleted(int n) {
if (null == deletedDocs) return false;
synchronized (this) { return (deletedDocs.get(n)); }
  }
{code}

?





Re: per-field similarity

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 2:19 PM, Chris Hostetter
[EMAIL PROTECTED] wrote:
 : Might also consider passing in more optional context when retrieving
 : the similarity for a field (such as a Query, if searching).
 : Something like Similarity.getSimilarity(String field, Query q).

 i assume you mean Searcher.getSimilarity(String fieldName, Query q) to
 replace the current Searcher.getSimilarity() right?

No, I meant Similarity (it's more like a factory method on the
Similarity class).
The Searcher.getSimilarity() could remain unchanged.
A Similarity is what is passed into the IndexWriter, and you would
want the same per-field flexibility there.

  (where in both cases
 we are talking about an instance method and not a static method)

Right.

-Yonik




[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread robert engels (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608146#action_12608146
 ] 

robert engels commented on LUCENE-1316:
---

According to the Java memory model, hasDeletions() would need to be 
synchronized as well, since if another thread performed a deletion, the 
result would need to be up to date.

This might work in later JVMs by declaring the deletedDocs variable volatile, 
but there are no guarantees.

It seems better to allow this behavior (a reader might not see up-to-date 
deletions made during a query) and do a single synchronized check for 
deletions at the start.
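For reference, a minimal sketch of that idea: writes stay synchronized, readers go through a single volatile read with no locking, and deletions are applied copy-on-write so readers never see a half-updated set. The class and method names are illustrative, not Lucene's actual code:

```java
import java.util.BitSet;

/** Sketch of the "cheap read-write lock" pattern under discussion. */
public class CheapRWLockSketch {
  private volatile BitSet deletedDocs; // volatile: safe publication to readers

  public boolean hasDeletions() {
    return deletedDocs != null;        // single volatile read, no lock
  }

  public boolean isDeleted(int n) {
    BitSet d = deletedDocs;            // read the snapshot once
    return d != null && d.get(n);      // no lock on the read path
  }

  public synchronized void delete(int n) {
    // copy-on-write: readers keep seeing the old immutable snapshot
    BitSet d = (deletedDocs == null) ? new BitSet() : (BitSet) deletedDocs.clone();
    d.set(n);
    deletedDocs = d;                   // volatile write publishes the update
  }

  public static void main(String[] args) {
    CheapRWLockSketch r = new CheapRWLockSketch();
    r.delete(42);
    System.out.println(r.isDeleted(42)); // prints true
  }
}
```

The volatile write in delete() establishes a happens-before edge with later volatile reads in isDeleted(), so readers eventually see new deletions without ever contending for the lock.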







[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608147#action_12608147
 ] 

Yonik Seeley commented on LUCENE-1316:
--

bq. why would deletes stop being instantly visible

It's minor, but before, if thread A deleted a document and then thread B 
checked whether it was deleted, thread B was guaranteed to see that it was in 
fact deleted.

If the check for deletedDocs == null were moved outside the synchronized 
block, there would be no guarantee of when (if ever) thread B sees that 
deletedDocs has been set (no memory barrier).

It's not a major issue, since client code shouldn't be written that way IMO, 
but it's worth factoring into the decision.






[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread robert engels (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608149#action_12608149
 ] 

robert engels commented on LUCENE-1316:
---

The Pattern #5 referenced (cheap read-write lock) is exactly what we are 
trying to accomplish here.




[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608160#action_12608160
 ] 

Yonik Seeley commented on LUCENE-1316:
--

bq. declaring the deletedDocs volatile should do the trick.

Right... that would be cheaper when no docs were deleted.  But would it be more 
expensive when there were deleted docs (a volatile read plus a synchronized 
block)?  I don't know if lock coarsening could do anything with this case...
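A minimal sketch of the volatile variant under discussion: hasDeletions() reads a volatile field without locking, while writes (and isDeleted(), as noted above) stay synchronized. Names are illustrative, not Lucene's actual SegmentReader code:

```java
import java.util.BitSet;

// Sketch of the "cheap read-write lock" idiom: a volatile field makes
// the null check safe without a lock, while mutation stays synchronized.
// Illustrative only; not the real SegmentReader.
class DeletionState {
    private volatile BitSet deletedDocs;   // volatile: unlocked reads see the store

    // No lock needed: the volatile load guarantees this thread sees a
    // reference published by delete() in another thread.
    boolean hasDeletions() { return deletedDocs != null; }

    // Writers still serialize so two threads can't both create the set.
    synchronized void delete(int id) {
        if (deletedDocs == null) deletedDocs = new BitSet();
        deletedDocs.set(id);
    }

    // Reading individual bits still synchronizes (the "volatile + a
    // synchronized" cost when deletions do exist).
    synchronized boolean isDeleted(int id) {
        return deletedDocs != null && deletedDocs.get(id);
    }
}
```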




[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608162#action_12608162
 ] 

Mark Miller commented on LUCENE-1316:
-

If I remember correctly, volatile did not work correctly until Java 1.5. At 
best, I think it was implementation-dependent under the old memory model.




[jira] Issue Comment Edited: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608162#action_12608162
 ] 

[EMAIL PROTECTED] edited comment on LUCENE-1316 at 6/25/08 12:40 PM:
---

If I remember correctly, volatile did not work correctly until Java 1.5. At 
best, I think it was implementation-dependent under the old memory model.

*edit*

maybe it's OK under certain circumstances:

http://www.ibm.com/developerworks/library/j-jtp02244.html

Problem #2: Reordering volatile and nonvolatile stores





[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608183#action_12608183
 ] 

Hoss Man commented on LUCENE-1316:
--

bq. if thread A deleted a document, and then thread B checked if it was 
deleted, thread B was guaranteed to see that it was in fact deleted.

Hmmm, I'll take your word for it, but I don't follow the rationale: the 
current synchronization just ensured that either the isDeleted() call would 
complete before the delete() call started, or vice versa -- but you have no 
guarantee that thread B would run after thread A and get true. Unless... 
is your point that without synchronization on the null check there's no 
guarantee that B will ever see the change to deletedDocs, even if it does 
execute after delete()?

Either way: Robert's point about hasDeletions() needing to be synchronized 
seems like a bigger issue -- isn't that a bug in the current implementation? 
Assuming we fix that, it seems like the original issue is back to square 
one: synchronization bottlenecks when there are no deletions.








[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread robert engels (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608187#action_12608187
 ] 

robert engels commented on LUCENE-1316:
---

Hoss, that is indeed the case: another thread could see deletedDocs as null, 
even though a different thread has set it.

hasDeletions() does not need to be synchronized if deletedDocs is volatile.




Re: BooleanQuery and DocIdSet; Was: Fwd: changing index format

2008-06-25 Thread John Wang
I am not sure; BooleanQuery takes something that can score, e.g. a
Clause or a Query, so the contract requires some sort of scoring functionality.
We use DocIdSetQuery for some of its scoring capabilities, such as constant
score (with boosting), age decay, and the new scoring API in 2.3.
Maybe I am misunderstanding the point of the question.

Thanks

-John

On Wed, Jun 25, 2008 at 10:32 AM, Paul Elschot [EMAIL PROTECTED]
wrote:

 Op Wednesday 25 June 2008 18:45:16 schreef John Wang:
  Hi Paul:
  Regarding to your comment on adding required/prohibited to
  BooleanQuery:
 
  Based on the new api on DocIdSet and DocIdSetIterator
  abstractions, we also developed decorators such as
  AndDocIdSet,OrDocIdSet and NotDocIdSet, furthermore a DocIdSetQuery
  class that honors the Query api contracts. Given these tools, we are
  able to build a customized scored BooleanQuery-like query
  infrastructure. We'd be happy to contribute them.

 Another thing to be removed from near the end of my todo list?
 Perhaps I could even take a vacation :)

 More seriously: would DocIdSetQuery be superfluous when
 a DocIdSet could be added directly to a BooleanQuery?

 Could you elaborate a bit on the customized scoring?

 Regards,
 Paul Elschot



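For a rough idea of what such a decorator does, here is a leapfrog AND over two sorted doc-id streams. Everything here (IntCursor, AndCursor, NO_MORE, the linear skipTo) is a hypothetical sketch, not the contributed AndDocIdSet code:

```java
// Minimal cursor over a sorted array of doc ids, mimicking the shape of
// a DocIdSetIterator. Illustrative only.
class IntCursor {
    static final int NO_MORE = Integer.MAX_VALUE;
    private final int[] docs;
    private int i = -1;

    IntCursor(int[] sortedDocs) { this.docs = sortedDocs; }

    int next() { return ++i < docs.length ? docs[i] : NO_MORE; }

    // Advance to the first doc >= target (linear here for clarity;
    // a real implementation might gallop or binary-search).
    int skipTo(int target) {
        int d;
        do { d = next(); } while (d < target);
        return d;
    }
}

// AND decorator: classic leapfrog intersection of two sorted streams.
class AndCursor {
    private final IntCursor a, b;

    AndCursor(IntCursor a, IntCursor b) { this.a = a; this.b = b; }

    int next() {
        int da = a.next(), db = b.skipTo(da);
        while (da != db) {
            if (da < db) da = a.skipTo(db);
            else db = b.skipTo(da);
        }
        return da;  // NO_MORE once either side is exhausted
    }
}
```

An OR or NOT decorator would follow the same shape, with a merge loop or a complement loop over the underlying cursors, respectively.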


[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12608189#action_12608189
 ] 

Yonik Seeley commented on LUCENE-1316:
--

bq. is your point that without synchronization on the null check there's no 
guarantee that B will ever see the change to deletedDocs even if it does execute 
after delete()

Right... it's about the memory barrier.

The reality is that there is normally a need for higher level synchronization 
anyway.  That's why it was always silly for things like StringBuffer to be 
synchronized.

bq. assuming we fix that then it seems like the original issue is back to 
square one: synchro bottlenecks when there are no deletions.

A scorer could just check once when initialized... there has never been any 
guarantee that in-flight queries immediately see deleted-doc changes, and 
requiring that now *really* wouldn't make sense.  TermScorer grabs the whole 
bit vector at the start and never checks again.
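Checking once at initialization might look like this sketch (class and field names are assumptions, not the actual MatchAllScorer or TermScorer code):

```java
import java.util.BitSet;

// Sketch of snapshot-at-init: sample the deletion state once when the
// scorer is built (as TermScorer does with the deletions bit vector),
// then iterate with no per-document locking. Illustrative names only.
class SnapshotScorer {
    private final BitSet deletedSnapshot;  // null: no deletions at init time
    private final int maxDoc;
    private int doc = -1;

    SnapshotScorer(BitSet deletedDocs, int maxDoc) {
        this.deletedSnapshot = deletedDocs;  // one read at construction
        this.maxDoc = maxDoc;
    }

    // Advance to the next live doc; lock-free inner loop.
    boolean next() {
        while (++doc < maxDoc) {
            if (deletedSnapshot == null || !deletedSnapshot.get(doc)) {
                return true;
            }
        }
        return false;
    }

    int doc() { return doc; }
}
```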




Re: How to do a query using less than or greater than

2008-06-25 Thread Chris Hostetter

: and how to use them?  For a concrete example I'm looking to do a query
: on a date field to find documents earlier than a specified date or
: later than a specified date.  Ex: date:(< 20070101) or date:
: (> 20070101).  I looked at the range query feature but it didn't appear
: to cover this case. Anyone have any suggestions?

RangeQuery (and ConstantScoreRangeQuery) can both cover this case by 
setting either the upper or lower term to null.
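For reference, the open-ended forms in the 2.x API might look like this sketch (the field name and yyyyMMdd value format are illustrative assumptions):

```java
// Sketch only: a null bound makes the range open-ended on that side.
Query before = new RangeQuery(null, new Term("date", "20070101"), false);            // date < 20070101
Query after  = new ConstantScoreRangeQuery("date", "20070101", null, false, false);  // date > 20070101
```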


Incidentally...
http://people.apache.org/~hossman/#java-dev
Please Use [EMAIL PROTECTED] Not [EMAIL PROTECTED]

Your question is better suited for the [EMAIL PROTECTED] mailing list ...
not the [EMAIL PROTECTED] list.  java-dev is for discussing development of
the internals of the Lucene Java library ... it is *not* the appropriate
place to ask questions about how to use the Lucene Java library when
developing your own applications.  Please resend your message to
the java-user mailing list, where you are likely to get more/better
responses since that list also has a larger number of subscribers.





-Hoss





Re: per-field similarity

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 5:06 PM, Chris Hostetter
[EMAIL PROTECTED] wrote:
 Hmmm... that seems like it would be confusing: particularly since in the
 IndexWriter case the Query param would never make sense.  changing
 IndexWriter.getSimilarity to take a String fieldName and changing
 Searcher.getSimilarity to take String fieldName, Query q seem like they
 would be more straight forward.

That would require a user to subclass both IndexWriter and Searcher.
Since Similarity is already passed around, adding a factory method
there seems like the easiest approach.  It's also a class, so we could
easily add a method.

An optional Query param or other context (or more than one factory
method) was just a quick idea... may or may not ultimately make sense.

 (There's also the potential ambiguity of how many times do i call
 Similarity.getSimilarity() before i stop? ... it may seem silly, but if
 you're working in a Query or Scorer or Weight you may not be sure if it's
 been done yet)

Once per level?  When creating the Weight I would think.  If you call
again, the default impl would return this.

It might be a little cleaner to pass around a SimilarityFactory, but
that ship has sailed IMO (along with many others :-)

-Yonik




Re: per-field similarity

2008-06-25 Thread Mike Klaas

On 24-Jun-08, at 1:28 PM, Yonik Seeley wrote:


Something to consider for Lucene 3 is to have something to retrieve
Similarity per-field rather than passing the field name into some
functions...


+1

I've felt that this was the proper (and more useful) way to do 
things for a long time:


(http://markmail.org/message/56bk6wrbwallyjvr)

-Mike




Re: How to do a query using less than or greater than

2008-06-25 Thread Kyle Miller
Chris,
   That's exactly what I was looking for.  Thanks for the info and the
clarification on where to post my questions.
Regards,
Kyle

On Wed, Jun 25, 2008 at 5:12 PM, Chris Hostetter [EMAIL PROTECTED]
wrote:


  RangeQuery (and ConstantScoreRangeQuery) can both cover this case by
  setting either the upper or lower term to null.

  -Hoss