[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

2009-01-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1479:
---

Attachment: LUCENE-1479.patch

Thanks Mike, you're right. The compilation error is the result of a refactoring I 
did to that line, using a single substring call instead of two. I forgot to 
use 'sb' in the second indexOf call, hence the compilation error.

Regarding dateStr - I fixed that. Thanks for noticing it.
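
For reference, the corrected line presumably reads along these lines (based on the 
compiler output quoted elsewhere in this thread):

{code}
String name = sb.substring(DOCNO.length(), sb.indexOf(TERM_DOCNO, DOCNO.length()));
{code}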

> TrecDocMaker skips over documents when "Date" is missing from documents
> ---
>
> Key: LUCENE-1479
> URL: https://issues.apache.org/jira/browse/LUCENE-1479
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/benchmark
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1479.patch
>
>
> TrecDocMaker skips over Trec documents if they do not have a "Date" line. 
> When such a document is encountered, the code may skip over several documents 
> until the next tag that is searched for is found.
> The result is, instead of reading ~25M documents from the GOV2 collection, 
> the code reads only ~23M (don't remember the actual numbers).
> The fix adds a terminatingTag to read() such that the code looks for prefix, 
> but only until terminatingTag is found. Appropriate changes were made in 
> getNextDocData().
> Patch to follow




[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

2009-01-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1479:
---

Attachment: (was: LUCENE-1479.patch)

> TrecDocMaker skips over documents when "Date" is missing from documents
> ---
>
> Key: LUCENE-1479
> URL: https://issues.apache.org/jira/browse/LUCENE-1479
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/benchmark
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 2.4.1, 2.9
>
>
> TrecDocMaker skips over Trec documents if they do not have a "Date" line. 
> When such a document is encountered, the code may skip over several documents 
> until the next tag that is searched for is found.
> The result is, instead of reading ~25M documents from the GOV2 collection, 
> the code reads only ~23M (don't remember the actual numbers).
> The fix adds a terminatingTag to read() such that the code looks for prefix, 
> but only until terminatingTag is found. Appropriate changes were made in 
> getNextDocData().
> Patch to follow




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662247#action_12662247
 ] 

Doug Cutting commented on LUCENE-1476:
--

bq. To really tighten this loop, you have to [ ... ] remove all function/method 
call overhead [and] operate directly on the memory mapped postings file.

That sounds familiar...

http://svn.apache.org/viewvc/lucene/java/trunk/src/gcj/org/apache/lucene/index/GCJTermDocs.cc?view=markup


> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch, quasi_iterator_deletions.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Commented: (LUCENE-1494) Additional features for searching for value across multiple fields (many-to-one style)

2009-01-08 Thread Paul Cowan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662244#action_12662244
 ] 

Paul Cowan commented on LUCENE-1494:


Hi Hoss,

I don't disagree that an inverted inheritance hierarchy would make more sense, 
but the problem is that getField() (which I _think_ is the only thing on 
SpanNearQuery that doesn't really make sense for a multi-field one) is mandated 
by the corresponding abstract method declaration in SpanQuery, which the 
inverted parent class would still extend. Looking at where getField() is used 
(primarily in SpanWeight.explain() and in the scorer() methods of SpanWeight 
and BoostingTermWeight), I'm not sure how I can meaningfully deal with those in 
the case of a multi-field span query.

If you (or anyone else) have any suggestions for that, I'm all ears; this would 
be really useful for us (and, I think, for a lot of other people - it's not an 
uncommon question on the lists).

Personally I'd be equally happy with just eliminating the same-field 
requirement (which, as you mentioned, I think Doug suggested), but those 
explain() and scorer() methods would still need to be changed. Any ideas?
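
For illustration, a hypothetical sketch of the subclass / protected non-checking 
constructor route from the original description quoted below (SpanNearQuery has 
no such overload today, and the getField() question remains open either way):

{code}
// Hypothetical only: relies on a proposed SpanNearQuery constructor that can
// skip the same-field check, so clauses may come from different fields.
public class MultiFieldSpanNearQuery extends SpanNearQuery {
  public MultiFieldSpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder) {
    super(clauses, slop, inOrder, false);  // proposed overload: false = don't check fields
  }
  // getField() is still mandated by SpanQuery and has no single sensible value here.
}
{code}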

Paul

> Additional features for searching for value across multiple fields 
> (many-to-one style)
> --
>
> Key: LUCENE-1494
> URL: https://issues.apache.org/jira/browse/LUCENE-1494
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.4
>Reporter: Paul Cowan
>Priority: Minor
> Attachments: LUCENE-1494-multifield.patch, 
> LUCENE-1494-positionincrement.patch
>
>
> This issue is to cover the changes required to do a search across multiple 
> fields with the same name in a fashion similar to a many-to-one database. 
> Below is my post on java-dev on the topic, which details the changes we need:
> ---
> We have an interesting situation where we are effectively indexing two 
> 'entities' in our system, which share a one-to-many relationship (imagine 
> 'User' and 'Delivery Address' for demonstration purposes). At the moment, we 
> index one Lucene Document per 'many' end, duplicating the 'one' end data, 
> like so:
> userid: 1
> userfirstname: fred
> addresscountry: au
> addressphone: 1234
> userid: 1
> userfirstname: fred
> addresscountry: nz
> addressphone: 5678
> userid: 2
> userfirstname: mary
> addresscountry: au
> addressphone: 5678
> (note: 2 Documents indexed for user 1). This is somewhat annoying for us, 
> because when we search in Lucene the results we want back (conceptually) are 
> at the 'user' level, so we have to collapse the results by distinct user id, 
> etc. etc (let alone that it blows out the size of our index enormously). So 
> why do we do it? It would make more sense to use multiple fields:
> userid: 1
> userfirstname: fred
> addresscountry: au
> addressphone: 1234
> addresscountry: nz
> addressphone: 5678
> userid: 2
> userfirstname: mary
> addresscountry: au
> addressphone: 5678
> But imagine the search "+addresscountry:au +addressphone:5678". We'd like 
> this to match ONLY Mary, but of course it matches Fred also because he 
> matches both those terms (just for different addresses).
> There are two aspects to the approach we've (more or less) got working but 
> I'd like to run them past the group and see if they're worth trying to get 
> them into Lucene proper (if so, I'll create a JIRA issue for them)
> 1) Use a modified SpanNearQuery. If we assume that country + phone will 
> always be one token, we can rely on the fact that the positions of 'au' and 
> '5678' in Fred's document will be different.
>SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
>SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
>SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
> the slop of 0 means that we'll only return those where the two terms are in 
> the same position in their respective fields. This works brilliantly, BUT 
> requires a change to SpanNearQuery's constructor (which checks that all the 
> clauses are against the same field). Are people amenable to perhaps adding 
> another constructor to SNQ which doesn't do the check, or subclassing it to 
> do the same (give it a protected non-checking constructor for the subclass to 
> call)?
> 2) It gets slightly more complicated in the case of variable-length terms. 
> For example, imagine if we had an 'address' field ('123 Smith St') which will 
> result in (1 to n) tokens; slop 0 in a SpanNearQuery won't work here, of 
> course. One thing we've toyed with is the idea of using 
> getPositionIncrementGap -- if we knew that 'address' would be, at most, 20 

Re: Realtime Search

2009-01-08 Thread John Wang
We have worked on this problem on the server level as well. We have also
open sourced it at:

http://code.google.com/p/zoie/

wiki on the realtime aspect:

http://code.google.com/p/zoie/wiki/ZoieSystem

-John

On Fri, Dec 26, 2008 at 12:34 PM, Robert Engels wrote:

> If you move to the "either embedded, or server model", the post reopen is
> trivial, as the structures can be created as the segment is written.
>
> It is the networked shared access model that causes a lot of these
> optimizations to be far more complex than needed.
>
> Would it maybe be simpler to move to the "embedded or server" model, and add a
> network shared file (e.g. nfs) access model as a layer?  The latter is going
> to perform far worse anyway.
>
> I guess I don't understand why Lucene continues to try and support this
> model. NO ONE does it any more.  This is the way MS Access worked, and
> everyone that wanted performance needed to move to SQL server for the server
> model.
>
>
> -Original Message-
> >From: Marvin Humphrey 
> >Sent: Dec 26, 2008 12:53 PM
> >To: java-dev@lucene.apache.org
> >Subject: Re: Realtime Search
> >
> >On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote:
> >> >  4) Allow 2 concurrent writers: one for small, fast updates, and one for
> >> > big background merges.
> >>
> >> Marvin can you describe more detail here?
> >
> >The goal is to improve worst-case write performance.
> >
> >Currently, writes are quick most of the time, but occasionally you'll trigger
> >a big merge and get stuck.  To solve this problem, we can assign a merge
> >policy to our primary writer which tells it to merge no more than
> >mergeThreshold documents.  The value of mergeThreshold will need tuning
> >depending on document size, change rate, and so on, but the idea is that we
> >want this writer to do as much merging as it can while still keeping
> >worst-case write performance down to an acceptable number.
> >
> >Doing only small merges just puts off the day of reckoning, of course.  By
> >avoiding big consolidations, we are slowly accumulating small-to-medium sized
> >segments and causing a gradual degradation of search-time performance.
> >
> >What we'd like is a separate write process, operating (mostly) in the
> >background, dedicated solely to merging segments which contain at least
> >mergeThreshold docs.
> >
> >If all we have to do is add documents to the index, adding that second write
> >process isn't a big deal.  We have to worry about competition for segment,
> >snapshot, and temp file names, but that's about it.
> >
> >Deletions make matters more complicated, but with a tombstone-based deletions
> >mechanism, the problems are solvable.
> >
> >When the background merge writer starts up, it will see a particular view of
> >the index in time, including deletions.  It will perform nearly all of its
> >operations based on this view of the index, mapping around documents which
> >were marked as deleted at init time.
> >
> >In between the time when the background merge writer starts up and the time it
> >finishes consolidating segment data, we assume that the primary writer will
> >have modified the index.
> >
> >  * New docs have been added in new segments.
> >  * Tombstones have been added which suppress documents in segments which
> >didn't even exist when the background merge writer started up.
> >  * Tombstones have been added which suppress documents in segments which
> >existed when the background merge writer started up, but were not merged.
> >  * Tombstones have been added which suppress documents in segments which have
> >just been merged.
> >
> >Only the last category of deletions matters.
> >
> >At this point, the background merge writer acquires an exclusive write lock on
> >the index.  It examines recently added tombstones, translates the document
> >numbers and writes a tombstone file against itself.  Then it writes the
> >snapshot file to commit its changes and releases the write lock.
> >
> >Worst-case update performance for the system is now the sum of the time it
> >takes the background merge writer to consolidate tombstones and the worst-case
> >performance of the primary writer.
> >
> >> It sounds like this is your solution for "decoupling" segments changes due
> >> to merges from changes from docs being indexed, from a reader's standpoint?
> >
> >It's true that we are decoupling the process of making logical changes to the
> >index from the process of internal consolidation.  I probably wouldn't
> >describe that as being done from the reader's standpoint, though.
> >
> >With mmap and data structures optimized for it, we basically solve the
> >read-time responsiveness cost problem.  From the client perspective, the delay
> >between firing off a change order and seeing that change made live is now
> >dominated by the time it takes to actually update the index.  The time between
> >the commit and having an IndexReader which can see that commit is
>
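
In Lucene terms, a rough (untested) analogue of the mergeThreshold idea quoted 
above is to cap the size of merges the primary writer will perform:

{code}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Keep the primary writer away from huge merges; segments above the cap are
// left for a separate consolidation pass (optimize, or a background writer).
IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/path/to/index"),
                                     new StandardAnalyzer(),
                                     IndexWriter.MaxFieldLength.UNLIMITED);
writer.setMaxMergeDocs(100000);   // "mergeThreshold": tune to document size and change rate
writer.setMergeFactor(10);
{code}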

[jira] Updated: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marvin Humphrey updated LUCENE-1476:


Attachment: quasi_iterator_deletions.diff

Here's a patch implementing BitVector.nextSetBit() and converting
SegmentTermDocs over to use the quasi-iterator style. Tested but 
not benchmarked.

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch, quasi_iterator_deletions.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




Re: Realtime Search

2009-01-08 Thread Jason Rutherglen
Based on our discussions, it seems best to get realtime search going in
small steps.  Below are some possible steps to take.

Patch #1: Expose an IndexWriter.getReader method that returns the current
reader and shares the write lock
Patch #2: Implement a realtime ram index class
Patch #3: Implement realtime transactions in IndexWriter or in a subclass of
IndexWriter by implementing a createTransaction method that generates a
realtime Transaction object.  When the transaction is flushed, the
transaction index modifications are available via the getReader method of
IndexWriter

The remaining question is how to synchronize the flushes to disk with
IndexWriter's other index update locking mechanisms.  The flushing could
simply use IW.addIndexes, which already has a locking mechanism in place.  After
flushing to disk, queued deletes would be applied to the newly copied disk
segments.  I think this entails opening the newly copied disk segments and
applying the deletes that occurred against the corresponding RAM segments, by
cloning the new disk segments, replacing the deleteddocs BitVector, and then
flushing the deleteddocs to disk.  This system would allow us to avoid using a
UID in documents.
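
A rough, untested sketch of that flush path using only the existing public API 
(the real implementation would work at the segment level; diskWriter, ramDir and 
queuedDeleteTerms are invented placeholders):

{code}
// Copy the in-RAM segments into the on-disk index, reusing addIndexes' locking,
// then replay the deletes that were queued while the copy was running.
diskWriter.addIndexes(new Directory[] { ramDir });
diskWriter.deleteDocuments(queuedDeleteTerms);   // Term[] of queued deletes (placeholder)
{code}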

The API needs to clearly separate realtime transactions from the existing
index update methods such as addDocument, deleteDocuments, and
updateDocument.  I don't think it's possible to transparently implement both,
because the underlying implementations behave differently.  It is expected
that multiple transactions may be created at once; however, the
Transaction.flush method would block.


Re: stored fields / unicode compression

2009-01-08 Thread Robert Muir
Thanks for the response, this sounds great. Some way to plug in arbitrary
schemes would be helpful.

I've experimented with a few for my case and unicode compression gave the
best bang for the buck, but I remember some of the other schemes such as
arithmetic coding seemed to provide wins for reasonably short fields where
gzip was still making them bigger...
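
For concreteness, a rough sketch of the store-everything-as-bytes approach 
mentioned in this thread, assuming ICU4J's charset module is available to 
provide an SCSU Charset (untested):

{code}
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import com.ibm.icu.charset.CharsetProviderICU;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Encode a short title with SCSU and store it as an opaque binary field.
Charset scsu = new CharsetProviderICU().charsetForName("SCSU");
ByteBuffer encoded = scsu.encode("Пример заголовка");    // non-Latin text compresses well
byte[] packed = new byte[encoded.remaining()];
encoded.get(packed);

Document doc = new Document();
doc.add(new Field("title_scsu", packed, Field.Store.YES)); // decode with scsu.decode() at read time
{code}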

On Thu, Jan 8, 2009 at 8:26 PM, Chris Hostetter wrote:

>
> Catching up on my holiday email, I don't think there were any replies to
> this question yet.
>
> The low level file formats used by Lucene are an area I don't have
> time/expertise to follow carefully, but if I remember correctly the
> consensus is/was to move more towards pure (byte[] data, int start, int
> end) based APIs for efficiency, with "String" based APIs provided as
> syntactic sugar via a facade, and deprecating the existing "internal" gzip
> compression in favor of similar "external" compression facades.  So
> something like you describe could be done as is using the byte[]
> interfaces *and* be generally useful to others.
>
> Taking a step back to look at the broader picture, this is the kind of
> thing that in Solr could be implemented as a new FieldType
>
> : Date: Fri, 26 Dec 2008 19:00:11 -0500
> : From: Robert Muir
> : Subject: stored fields / unicode compression
> :
> : Has there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for
> : stored fields?
> : Personally I don't put huge amounts of text in stored fields but these
> : encodings/compression work extremely well on short strings like titles,
> etc.
> : Removing the unicode penalty for non-latin text (i.e. cut in half) is
> : nothing to sneeze at since with lots of docs my stored fields still
> become
> : pretty huge, biggest part of the index.
> :
> : I know I could use one of these schemes right now and store everything as
> : bytes... but just thinking it might be something of more general use. The
> : GZIP compression that is supported isn't very useful as it typically
> makes
> : short snippets bigger...
> :
> : Performance compared to UTF-8 is here... seems like a general win to me
> (but
> : maybe I am missing something)
> : http://unicode.org/notes/tn6/#Performance
>
>
> -Hoss
>
>


-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662214#action_12662214
 ] 

Jason Rutherglen commented on LUCENE-1476:
--

M.M.:" I think the transactions layer would also sit on top of this
"realtime" layer? EG this "realtime" layer would expose a commit()
method, and the transaction layer above it would maintain the
transaction log, periodically calling commit() and truncating the
transaction log?"

One approach that may be optimal is to expose from IndexWriter a 
createTransaction method that accepts new documents and deletes.  All documents 
have an associated UID.  The new documents could feasibly be encoded into a 
single segment that represents the added documents for that transaction.  The 
deletes would be represented as document long UIDs rather than int doc ids.  
Then the commit method would be called on the transaction object, which returns a 
reader representing the latest version of the index, plus the changes created 
by the transaction.  This system would be a part of IndexWriter and would not 
rely on a transaction log.  IndexWriter.commit would flush the in ram realtime 
indexes to disk.  The realtime merge policy would flush based on the RAM usage 
or number of docs.

{code}
IndexWriter iw = new IndexWriter();
Transaction tr = iw.createTransaction();
tr.addDocument(new Document());
tr.addDocument(new Document());
tr.deleteDocument(1200l);
IndexReader ir = tr.flush(); // flushes transaction to the index (probably to a 
ram index)
IndexReader latestReader = iw.getReader(); // same as ir
iw.commit(boolean doWait); // commits the in ram realtime index to disk
{code}

When commit is called, the disk segment readers flush their deletes to disk, 
which is fast.  The in-RAM realtime index is merged to disk.  The process is 
described in more detail further down.

M.H.: "how about writing a single-file Directory implementation?"

I'm not sure we need this, because an appending rolling transaction log should 
work.  Segments don't change; only things like norms and deletes do, and those 
can be appended to a rolling transaction log file system.  If we had a generic 
transaction logging system, the future column stride fields, deletes, norms, 
and future realtime features could use it and be realtime.

M.H.: "How do you guarantee that you always see the "current" version of a 
given document, and only that version? 

Each transaction returns an IndexReader.  Each "row" or "object" could use a 
unique id in the transaction log model which would allow documents that were 
merged into other segments to be deleted during a transaction log replay.  

M.H.: "When do you expose new deletes in the RAMDirectory, when do you expose 
new deletes in the FSDirectory"

When do you expose new deletes in the RAMDir, when do you expose new deletes in 
the FSDirectory, how do you manage slow merges from the RAMDir to the 
FSDirectory, how do you manage new adds to the RAMDir that take place during 
slow merges..."

Queue deletes to the RAMDir while copying the RAMDir to the FSDir in the 
background, perform the deletes after the copy is completed, then instantiate a 
new reader over the newly merged FSDirectory and a new RAMDir.  Writes that 
were occurring during this process would be happening to another new RAMDir.

One way to think of the realtime problem is in terms of segments rather than 
FSDirs and RAMDirs.  Some segments are on disk, some in RAM.  Each transaction 
is an instance of some segments and their deletes (and we're not worried about 
the deletes being flushed or not so assume they exist as BitVectors).  The 
system should expose an API to checkpoint/flush at a given transaction level 
(usually the current) and should not stop new updates from happening.

When I wrote this type of system, I managed individual segments outside of 
IndexWriter's merge policy and performed the merging manually by placing each 
segment in its own FSDirectory (the segment size was 64MB), which minimized the 
number of directories.  I do not know the best approach for this when performed 
within IndexWriter.  

M.H.: "Two comments. First, if you don't sync, but rather leave it up to the OS 
when
it wants to actually perform the actual disk i/o, how expensive is flushing? Can
we make it cheap enough to meet Jason's absolute change rate requirements?"

When I tried out the transaction log a write usually mapped pretty quickly to a 
hard disk write.  I don't think it's safe to leave writes up to the OS.

M.M.: "maintain & updated deleted docs even though IndexWriter has the write 
lock"

In my previous realtime search implementation I got around this by having each 
segment in it's own directory.  Assuming this is non-optimal, we will need to 
expose an IndexReader that has the writelock of the IndexWriter.


> BitVector implement DocIdSet
> 
>
>

Re: stored fields / unicode compression

2009-01-08 Thread Chris Hostetter

Catching up on my holiday email, I don't think there were any replies to 
this question yet.  

The low level file formats used by Lucene are an area I don't have 
time/expertise to follow carefully, but if I remember correctly the 
consensus is/was to move more towards pure (byte[] data, int start, int 
end) based APIs for efficiency, with "String" based APIs provided as 
syntactic sugar via a facade, and deprecating the existing "internal" gzip 
compression in favor of similar "external" compression facades.  So 
something like you describe could be done as is using the byte[] 
interfaces *and* be generally useful to others.

Taking a step back to look at the broader picture, this is the kind of 
thing that in Solr could be implemented as a new FieldType

: Date: Fri, 26 Dec 2008 19:00:11 -0500
: From: Robert Muir
: Subject: stored fields / unicode compression
: 
: Has there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for
: stored fields?
: Personally I don't put huge amounts of text in stored fields but these
: encodings/compression work extremely well on short strings like titles, etc.
: Removing the unicode penalty for non-latin text (i.e. cut in half) is
: nothing to sneeze at since with lots of docs my stored fields still become
: pretty huge, biggest part of the index.
: 
: I know I could use one of these schemes right now and store everything as
: bytes... but just thinking it might be something of more general use. The
: GZIP compression that is supported isn't very useful as it typically makes
: short snippets bigger...
: 
: Performance compared to UTF-8 is here... seems like a general win to me (but
: maybe I am missing something)
: http://unicode.org/notes/tn6/#Performance


-Hoss





Re: [jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread robert engels
The way we've simplified this is that every document has an OID. It 
simplifies updates and delete tracking (in the transaction log).


On Jan 8, 2009, at 2:28 PM, Marvin Humphrey (JIRA) wrote:



[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662107#action_12662107
 ]

Marvin Humphrey commented on LUCENE-1476:
-

Mike McCandless:

> Commit is for crash recovery, and for knowing when it's OK to delete
> prior commits. Simply writing the files (and not syncing them), and
> perhaps giving IndexReader.open the SegmentInfos to use directly (and
> not writing a segments_N via the filesystem) would allow us to search
> added docs without paying the cost of sync'ing all the files.

Mmm.  I think I might have given IndexWriter.commit() slightly different
semantics.  Specifically, I might have given it a boolean "sync" argument
which defaults to false.

> Also: brand new, tiny segments should be written into a RAMDirectory
> and then merged over time into the real Directory.

Two comments.  First, if you don't sync, but rather leave it up to the OS when
it wants to actually perform the actual disk i/o, how expensive is flushing?  Can
we make it cheap enough to meet Jason's absolute change rate requirements?

Second, the multi-index model is very tricky when dealing with "updates".  How
do you guarantee that you always see the "current" version of a given
document, and only that version?  When do you expose new deletes in the
RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
new adds to the RAMDirectory that take place during slow merges...

Building a single-index, two-writer model that could handle fast updates while
performing background merging was one of the main drivers behind the tombstone
design.






Re: [jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey
> You can do that now by implementing BitVector.nextSetBit(int tick) and using
> that in TermDocs to set a nextDeletion member var instead of checking every
> doc num with BitVector.get().

This seems so easy, I should take a crack at it. :)
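
For reference, a minimal (untested) sketch of what that could look like, assuming 
BitVector's internal byte[] bits / int size layout; a real version would skip whole 
zero bytes rather than testing bit by bit:

{code}
/** Returns the index of the first set bit at or after index, or -1 if none. */
public final int nextSetBit(int index) {
  if (index < 0) index = 0;
  for (int i = index; i < size; i++) {
    if ((bits[i >> 3] & (1 << (i & 7))) != 0) {
      return i;
    }
  }
  return -1;
}
{code}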

Marvin Humphrey





[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662143#action_12662143
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

Mike McCandless:

> So, net/net it seems like "deletes-as-a-filter" approach is compelling?

In terms of CPU-cycles, maybe.  

My gut tells me that it's all but mandatory if we use merged-on-the-fly
tombstone streams, but if Lucene goes that route it should cache a BitVector
and use a shared pseudo-iterator -- in which case the costs will no longer be
significantly more than the current system.

Under the current system, I'm not certain that the deletions checks are that
excessive.  Consider this loop from TermDocs.read():

{code}
while (i < length && count < df) {
  // manually inlined call to next() for speed
  final int docCode = freqStream.readVInt();
  doc += docCode >>> 1;             // shift off low bit
  if ((docCode & 1) != 0)           // if low bit is set
    freq = 1;                       // freq is one
  else
    freq = freqStream.readVInt();   // else read freq
  count++;

  if (deletedDocs == null || !deletedDocs.get(doc)) {
    docs[i] = doc;
    freqs[i] = freq;
    ++i;
  }
}
{code}

The CPU probably does a good job of predicting the result of the null check on
deletedDocs.  The readVInt() method call is already a pipeline killer.

Here's how that loop looks after I patch the deletions check for 
pseudo-iteration.

{code}
while (i < length && count < df) {
  // manually inlined call to next() for speed
  final int docCode = freqStream.readVInt();
  doc += docCode >>> 1;             // shift off low bit
  if ((docCode & 1) != 0)           // if low bit is set
    freq = 1;                       // freq is one
  else
    freq = freqStream.readVInt();   // else read freq
  count++;

  // skip deleted docs via the shared pseudo-iterator instead of get()
  if (doc >= nextDeletion) {
    if (doc > nextDeletion) {
      nextDeletion = deletedDocs.nextSetBit(doc);
    }
    if (doc == nextDeletion) {
      continue;
    }
  }

  docs[i] = doc;
  freqs[i] = freq;
  ++i;
}
return i;
{code}

Again, the CPU is probably going to do a pretty good job of predicting the
results of the deletion check.  And even then, we're accessing the same shared
BitVector across all TermDocs, and its bits are hopefully a cache hit.

To really tighten this loop, you have to do what Nate and I want with Lucy/KS:

  * Remove all function/method call overhead.
  * Operate directly on the memory mapped postings file.

{code}
u32_t
SegPList_bulk_read(SegPostingList *self, i32_t *doc_nums, i32_t *freqs,
                   u32_t request)
{
    i32_t       doc_num   = self->doc_num;
    const u32_t remaining = self->doc_freq - self->count;
    const u32_t num_got   = request < remaining ? request : remaining;
    char       *buf       = InStream_Buf(instream, C32_MAX_BYTES * num_got);
    u32_t       i;

    for (i = 0; i < num_got; i++) {
        u32_t doc_code = Math_decode_c32(&buf); /* static inline function */
        u32_t freq     = (doc_code & 1) ? 1 : Math_decode_c32(&buf);
        doc_num       += doc_code >> 1;
        doc_nums[i]    = doc_num;
        freqs[i]       = freq;
    }

    InStream_Advance_Buf(instream, buf);
    self->doc_num = doc_num;
    self->count  += num_got;

    return num_got;
}
{code}

(That loop would be even better using PFOR instead of vbyte.)

In terms of public API, I don't think it's reasonable to change Lucene's
Scorer and TermDocs classes so that their iterators start returning deleted
docs.  

We could potentially make that choice with Lucy/KS, thus allowing us to remove
the deletions check in the PostingList iterator (as above) and getting a
potential speedup.  But even then I hesitate to push the deletions API upwards
into a space where users of raw Scorer and TermDocs classes have to deal with
it -- especially since iterator-style deletions aren't very user-friendly.


> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.


[jira] Updated: (LUCENE-1314) IndexReader.clone

2009-01-08 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1314:
-

Attachment: LUCENE-1314.patch

LUCENE-1314.patch

All tests pass.  

IndexReader.close was made non-final so it can be overridden in SegmentReader.  
This is because the method calls that reach SegmentReader.doClose previously 
passed through decRef, which could be called by either IndexReader.decRef or 
IndexReader.close.  In order to decref the copy-on-write refs, the close method 
needs to decrement the references itself, rather than only the decRef method 
doing so.  This caused the bug found in the previous comment, where if decRef 
was called the deletedDocsRef did not need to also be decrefed, which was the 
cause of the ref count assertion failing.

Occasionally TestIndexReaderReopen.testThreadSafety fails due to an 
AlreadyClosedException.  Trunk, however, also fails periodically.  Given that 
multi-threaded reopen/close is usually unlikely, I am not sure it is worth 
investigating further.

Fixed norm byte refs not decrefing on close.

Fixed cloneNorm() byteRef being created when there is no byte array; added an 
assertion check.




> IndexReader.clone
> -
>
> Key: LUCENE-1314
> URL: https://issues.apache.org/jira/browse/LUCENE-1314
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, 
> LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, 
> LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch
>
>
> Based on discussion 
> http://www.nabble.com/IndexReader.reopen-issue-td18070256.html.  The problem 
> is reopen returns the same reader if there are no changes, so if docs are 
> deleted from the new reader, they are also reflected in the previous reader 
> which is not always desired behavior.




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662110#action_12662110
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

Mike McCandless:

> if it's sparse, you need an iterator (state) to remember where you are.

We can hide the sparse representation and the internal state, having the
object lazily build a non-sparse representation.  That's what I had in
mind with the code for TombstoneDelEnum.nextDeletion().
TombstoneDelEnum.nextInternal() would be a private method used for building up
the internal BitVector.

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662107#action_12662107
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

Mike McCandless:

> Commit is for crash recovery, and for knowing when it's OK to delete
> prior commits. Simply writing the files (and not syncing them), and
> perhaps giving IndexReader.open the SegmentInfos to use directly (and
> not writing a segments_N via the filesystem) would allow us to search
> added docs without paying the cost of sync'ing all the files.

Mmm.  I think I might have given IndexWriter.commit() slightly different
semantics.  Specifically, I might have given it a boolean "sync" argument
which defaults to false.

> Also: brand new, tiny segments should be written into a RAMDirectory
> and then merged over time into the real Directory.

Two comments.  First, if you don't sync, but rather leave it up to the OS when
it wants to actually perform the actual disk i/o, how expensive is flushing?
Can we make it cheap enough to meet Jason's absolute change rate requirements?

Second, the multi-index model is very tricky when dealing with "updates".  How
do you guarantee that you always see the "current" version of a given
document, and only that version?  When do you expose new deletes in the
RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
new adds to the RAMDirectory that take place during slow merges...

Building a single-index, two-writer model that could handle fast updates while
performing background merging was one of the main drivers behind the tombstone
design.
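
As a purely hypothetical sketch of the commit() semantics described above (not 
the actual Lucene API; the helper methods are invented for illustration):

{code}
// Hypothetical: sync is opt-in, so a "soft" commit merely writes files and
// exposes them to readers without forcing them to stable storage.
public void commit() throws IOException {
  commit(false);                  // "sync" defaults to false
}

public void commit(boolean sync) throws IOException {
  flushPendingSegments();         // invented helper: write buffered docs as new segment files
  if (sync) {
    syncAllReferencedFiles();     // invented helper: fsync before publishing the commit point
  }
  publishSegmentInfos();          // invented helper: make the new segments visible to readers
}
{code}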


> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662102#action_12662102
 ] 

Michael McCandless commented on LUCENE-1476:


{quote}
> How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) 
> instead of a true iterator that keeps state? 
{quote}
That works if under-the-hood it's a non-sparse representation.  But if it's 
sparse, you need an iterator (state) to remember where you are.

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662101#action_12662101
 ] 

Michael McCandless commented on LUCENE-1476:


{quote}
> If we move the deletions filtering up, then we'd increase traffic through 
> that cache
{quote}

OK, right.  So we may have some added cost because of this.  I think
it's only TermScorer that uses the bulk API though.

{quote}
> If you were applying deletions filtering after Scorer.next(), then it seems
> likely that costs would go up because of extra hit processing. However, if
> you use Scorer.skipTo() to jump past deletions, as in the loop I provided
> above, then PhraseScorer etc. shouldn't incur any more costs themselves.
{quote}

Ahhh, now I got it!  Good, you're right.

{quote}
> Under the skipTo() loop, I think the filter effectively does get applied
> earlier in the chain. Does that make sense?
{quote}

Right.  This is how Lucene works today.  Excellent.

So, net/net it seems like "deletes-as-a-filter" approach is compelling?


> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662100#action_12662100
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

Mike McCandless:

> I'm also curious what cost you see of doing the merge sort for every
> search; I think it could be uncomfortably high since it's so
> hard-for-cpu-to-predict-branch-intensive. 

Probably true.  You're going to get accelerating degradation as the number of
deletions increases.  In a large index, you could end up merging 20, 30
streams.  Based on how the priority queue in ORScorer tends to take up space
in profiling data, that might not be good.

It'd be manageable if you can keep your index reasonably in good shape, but 
you'll be suckin' pondwater if it gets flabby.

> We could take the first search that doesn't use skipTo and save the result
> of the merge sort, essentially doing an in-RAM-only "merge" of those
> deletes, and let subsequent searches use that single merged stream. 

That was what I had in mind when proposing the pseudo-iterator model.

{code}
class TombStoneDelEnum extends DelEnum {
  int nextDeletion(int docNum) {
while (currentMax < docNum) { nextInternal(); }
return bits.nextSetBit(docNum);
  }
  // ...
}
{code}

> (This is not MMAP friendly, though).

Yeah.  Ironically, that use of tombstones is more compatible with the Lucene
model. :-)

I'd be reluctant to have Lucy/KS realize those large BitVectors in per-object 
process RAM.  That'd spoil the "cheap wrapper around system i/o cache" 
IndexReader plan.

I can't see an answer yet.  But the one thing I do know is that Lucy/KS needs
a pluggable deletions mechanism to make experimentation easier -- so that's
what I'm working on today.

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662097#action_12662097
 ] 

Michael McCandless commented on LUCENE-1476:



{quote}
> It would be exposed as a combination reader writer that manages the 
> transaction status of each update.
{quote} 

I think the transactions layer would also sit on top of this
"realtime" layer?  EG this "realtime" layer would expose a commit()
method, and the transaction layer above it would maintain the
transaction log, periodically calling commit() and truncating the
transaction log?

This "realtime" layer, then, would internally maintain a single
IndexWriter and the readers.  IndexWriter would flush (not commit) new
segments into a RAMDir and yield its in-RAM SegmentInfos to
IndexReader.reopen.  MergePolicy periodically gets those into the real
Directory.  When reopening a reader we have the freedom to use old
(already merged away) segments if the newly merged segment isn't warm
yet.

We "just" need to open some things up in IndexWriter:
 
  * IndexReader.reopen with the in-RAM SegmentInfos

  * Willingness to allow an IndexReader to maintain & update deleted
docs even though IndexWriter has the write lock

  * Access to segments that were already merged away (I think we could
make a DeletionPolicy that pays attention to when the newly merged
segment is not yet warmed and keeps the prior segments around).
I think this'd require allowing DeletionPolicy to see "flush
points" in addition to commit points (it doesn't today).

But I'm still hazy on the details on exactly how to open up
IndexWriter.
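
A purely hypothetical sketch of the flow described above (none of these hooks 
existed in IndexWriter/IndexReader at the time; the method names are invented 
for illustration):

{code}
// Hypothetical "realtime layer" flow, per the comment above.
IndexWriter writer = ...;                   // flushes small new segments into a RAMDirectory
writer.addDocument(doc);
writer.flush();                             // flush, not commit: no fsync, no segments_N written
SegmentInfos inRam = writer.getFlushedSegmentInfos();   // invented hook: the in-RAM SegmentInfos
IndexReader newReader = oldReader.reopen(inRam);        // invented overload of reopen
// MergePolicy later moves the RAM segments into the real Directory; a
// DeletionPolicy aware of "flush points" keeps old segments until the newly
// merged segment is warm.
{code}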


> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662092#action_12662092
 ] 

Michael McCandless commented on LUCENE-1476:


{quote}
> If Lucene crashed for some reason the transaction log would be replayed.
{quote}

I think the transaction log is useful for some applications, but could
(should) be built as a separate (optional) layer entirely on top of
Lucene's core.  Ie, neither IndexWriter nor IndexReader need to be
aware of the transaction log, which update belongs to which
transaction, etc?


> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662089#action_12662089
 ] 

Michael McCandless commented on LUCENE-1476:


{quote}
> There's going to be a change rate that overwhelms the multi-file
> commit system, and it seems that you've determined you're up against
> it.
{quote}

Well... IndexWriter need not "commit" in order to allow a reader to
see the files?

Commit is for crash recovery, and for knowing when it's OK to delete
prior commits.  Simply writing the files (and not syncing them), and
perhaps giving IndexReader.open the SegmentInfos to use directly (and
not writing a segments_N via the filesystem) would allow us to search
added docs without paying the cost of sync'ing all the files.

Also: brand new, tiny segments should be written into a RAMDirectory
and then merged over time into the real Directory.


> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Assigned: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

2009-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1479:
--

Assignee: Michael McCandless

> TrecDocMaker skips over documents when "Date" is missing from documents
> ---
>
> Key: LUCENE-1479
> URL: https://issues.apache.org/jira/browse/LUCENE-1479
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/benchmark
>Reporter: Shai Erera
>Assignee: Michael McCandless
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1479.patch
>
>
> TrecDocMaker skips over Trec documents if they do not have a "Date" line. 
> When such a document is encountered, the code may skip over several documents 
> until the next tag that is searched for is found.
> The result is, instead of reading ~25M documents from the GOV2 collection, 
> the code reads only ~23M (don't remember the actual numbers).
> The fix adds a terminatingTag to read() such that the code looks for prefix, 
> but only until terminatingTag is found. Appropriate changes were made in 
> getNextDocData().
> Patch to follow




[jira] Commented: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

2009-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662073#action_12662073
 ] 

Michael McCandless commented on LUCENE-1479:


Shai, it seems like a doc that has no "Date: XXX" would leave dateStr as null 
and would then cause an NPE when parseDate is later called?  Or am I missing 
something?

Also I'm getting a compilation error:

{code}
[javac] Compiling 1 source file to 
/tango/mike/src/lucene.trecdocmaker/build/contrib/benchmark/classes/java
[javac] 
/tango/mike/src/lucene.trecdocmaker/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocMaker.java:190:
 variable name might not have been initialized
[javac] String name = sb.substring(DOCNO.length(), name.indexOf(TERM_DOCNO, 
DOCNO.length()));
[javac]^
[javac] 1 error
{code}

> TrecDocMaker skips over documents when "Date" is missing from documents
> ---
>
> Key: LUCENE-1479
> URL: https://issues.apache.org/jira/browse/LUCENE-1479
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/benchmark
>Reporter: Shai Erera
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1479.patch
>
>
> TrecDocMaker skips over Trec documents if they do not have a "Date" line. 
> When such a document is encountered, the code may skip over several documents 
> until the next tag that is searched for is found.
> The result is, instead of reading ~25M documents from the GOV2 collection, 
> the code reads only ~23M (don't remember the actual numbers).
> The fix adds a terminatingTag to read() such that the code looks for prefix, 
> but only until terminatingTag is found. Appropriate changes were made in 
> getNextDocData().
> Patch to follow




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662065#action_12662065
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

Jason Rutherglen:

> I found in making the realtime search write speed fast enough that writing
> to individual files per segment can become too costly (they accumulate fast,
> appending to a single file is faster than creating new files, deleting the
> files becomes costly). 

I saw you mentioning i/o overhead on Windows in particular.  I can't see a way
to mod Lucene so that it doesn't generate a bunch of files for each commit,
and FWIW Lucy/KS is going to generate even more files than Lucene.

Half-seriously... how about writing a single-file Directory implementation?

> For example, writing to small individual files per commit, if the number of
> segments is large and the delete spans multiple segments will generate many
> files. 

There would be a maximum of two files per segment to hold the tombstones: one
to hold the tombstone rows, and one to map segment identifiers to tombstone
rows.  (In Lucy/KS, the mappings would probably be stored in the JSON-encoded
"segmeta" file, which stores human-readable metadata on behalf of multiple 
components.)

Segments containing tombstones would be merged according to whatever merge
policy was in place.  So there won't ever be an obscene number of tombstone
files unless you allow an obscene number of segments to accumulate.

> Many users may not want a transaction log as they may be storing the updates
> in a separate SQL database instance (this is the case where I work) and so a
> transaction log is redundant and should be optional. 

I can see how this would be quite useful at the application level.  However, I
think it might be challenging to generalize the transaction log concept at the
library level:

{code}
CustomAnalyzer analyzer = new CustomAnalyzer();   // hypothetical analyzer with mutable state
IndexWriter indexWriter = new IndexWriter("/path/to/index", analyzer, true);
indexWriter.addDocument(nextDoc());
analyzer.setFoo(2); // change of state not recorded by transaction log
indexWriter.addDocument(nextDoc());
{code}

MySQL is more of a closed system than Lucene, which I think makes options
available that aren't available to us.

> The reader stack is drained based on whether a reader is too old to be
> useful anymore (i.e. no references to it, or it has N readers ahead of it).

Right, this is the kind of thing that Lucene has to do because of the
single-reader model, and that we're trying to get away from in Lucy/KS by
exploiting mmap and making IndexReaders cheap wrappers around the system i/o
cache.

I don't think I can offer any alternative design suggestions that meet your
needs.   There's going to be a change rate that overwhelms the multi-file
commit system, and it seems that you've determined you're up against it.  

What's killing us is something different: not absolute change rate, but poor 
worst-case performance.

FWIW, we contemplated a multi-index system with an index on a RAM disk for
fast changes and a primary index on the main file system.  It would have
worked fine for pure adds, but it was very tricky to manage state for
documents which were being "updated", i.e.  deleted and re-added.  How are you
handling all these small adds with your combo reader/writer?  Do you not have
that problem?

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1482) Replace infoSteram by a logging framework (SLF4J)

2009-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662044#action_12662044
 ] 

Yonik Seeley commented on LUCENE-1482:
--

It seems we should take into consideration the performance of a real logger 
(not the NOP logger), because real applications that already use SLF4J can't 
use the NOP adapter.  Solr just switched to SLF4J, for example.

> Replace infoSteram by a logging framework (SLF4J)
> -
>
> Key: LUCENE-1482
> URL: https://issues.apache.org/jira/browse/LUCENE-1482
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, 
> slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar
>
>
> Lucene makes use of infoStream to output messages in its indexing code only. 
> For debugging purposes, when the search application is run on the customer 
> side, getting messages from other code flows, like search, query parsing, 
> analysis etc can be extremely useful.
> There are two main problems with infoStream today:
> 1. It is owned by IndexWriter, so if I want to add logging capabilities to 
> other classes I need to either expose an API or propagate infoStream to all 
> classes (see for example DocumentsWriter, which receives its infoStream 
> instance from IndexWriter).
> 2. I can either turn debugging on or off, for the entire code.
> Introducing a logging framework can allow each class to control its logging 
> independently, and more importantly, allows the application to turn on 
> logging for only specific areas in the code (i.e., org.apache.lucene.index.*).
> I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, 
> as its name states, a facade over different logging frameworks. As such, you 
> can include the slf4j.jar in your application, and it recognizes at deploy 
> time what is the actual logging framework you'd like to use. SLF4J comes with 
> several adapters for Java logging, Log4j and others. If you know your 
> application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in 
> your classpath, and your logging statements will use Java logging underneath 
> the covers.
> This makes the logging code very simple. For a class A the logger will be 
> instantiated like this:
> public class A {
>   private static final Logger logger = LoggerFactory.getLogger(A.class);
> }
> And will later be used like this:
> public class A {
>   private static final Logger logger = LoggerFactory.getLogger(A.class);
>   public void foo() {
> if (logger.isDebugEnabled()) {
>   logger.debug("message");
> }
>   }
> }
> That's all!
> Checking for isDebugEnabled is very quick, at least using the JDK14 adapter 
> (but I assume it's fast also over other logging frameworks).
> The important thing is, every class controls its own logger. Not all classes 
> have to output logging messages, and we can improve Lucene's logging 
> gradually, w/o changing the API, by adding more logging messages to 
> interesting classes.
> I will submit a patch shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1314) IndexReader.clone

2009-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662043#action_12662043
 ] 

Jason Rutherglen commented on LUCENE-1314:
--

I executed this in Eclipse on Mac OS X on a 4-core box (the cores are significant 
because of the threads).  I ran TestIndexReaderReopen.testThreadSafety twice in 
debug mode and it worked; I figured debug mode was keeping the bug from 
reproducing, so I tried just running the test and it passed again.  The 5th time 
it gave an error in debug mode.  The test case fails consistently when 
SegmentReader.reopenSegment ends with success == false and decRef is called 
afterwards in the finally clause.  It seems that calling this decRef on the newly 
cloned object causes the assertion error, which is possibly related to threading: 
the decRef on the failed clone probably decrements a deletedDocsRef used by 
another reader one time too many, causing the assertion error below.  I'm not 
sure if this is a real bug or an issue that the test case should ignore.

{code}
java.lang.AssertionError
at 
org.apache.lucene.index.SegmentReader$Ref.decRef(SegmentReader.java:104)
at org.apache.lucene.index.SegmentReader.decRef(SegmentReader.java:249)
at 
org.apache.lucene.index.MultiSegmentReader.doClose(MultiSegmentReader.java:413)
at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:157)
at org.apache.lucene.index.IndexReader.close(IndexReader.java:990)
at 
org.apache.lucene.index.TestIndexReaderReopen$9.run(TestIndexReaderReopen.java:703)
at 
org.apache.lucene.index.TestIndexReaderReopen$ReaderThread.run(TestIndexReaderReopen.java:818)
{code}

> IndexReader.clone
> -
>
> Key: LUCENE-1314
> URL: https://issues.apache.org/jira/browse/LUCENE-1314
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, 
> LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, 
> LUCENE-1314.patch, LUCENE-1314.patch, lucene-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch
>
>
> Based on discussion 
> http://www.nabble.com/IndexReader.reopen-issue-td18070256.html.  The problem 
> is reopen returns the same reader if there are no changes, so if docs are 
> deleted from the new reader, they are also reflected in the previous reader 
> which is not always desired behavior.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662038#action_12662038
 ] 

markrmil...@gmail.com edited comment on LUCENE-1483 at 1/8/09 9:15 AM:
-

It's the ORD_SUBORD again (which I don't think we will use) and the two Policies. 
Odd because it's the last hit of 10 that fails for all 3. I'll ferret it out 
tonight.

- Mark

*EDIT*

yup... always the last entry that's wrong no matter the queue size - for all 3, 
which is odd because ORD_SUBORD doesn't have too much of a relationship to the 
two policies. Will be a fun one.

  was (Author: markrmil...@gmail.com):
Its the ORDSUBORD again (which I don't think we will use) and the two 
Policies. Odd because its the last hit of  10 that fails for all 3. I'll ferret 
it out tonight.

- Mark
  
> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1482) Replace infoSteram by a logging framework (SLF4J)

2009-01-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662039#action_12662039
 ] 

Shai Erera commented on LUCENE-1482:


Grant, given what I wrote below about having Lucene use the NOP adapter, are you 
still worried about the performance implications?

If there is a general reluctance to add a dependency on SLF4J, can we review the 
other option I suggested - using infoStream as a class with static methods? That 
at least would allow adding more prints from other classes, w/o changing their 
API.
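
Something along these lines (just a sketch of that alternative; all names here
are made up and nothing is from a patch):

{code}
import java.io.PrintStream;

// Hedged sketch of "infoStream as a class with static methods".
public final class InfoStream {

  private static PrintStream stream;   // null means messages are dropped

  private InfoStream() {}

  public static void setInfoStream(PrintStream ps) {
    stream = ps;
  }

  public static boolean isEnabled() {
    return stream != null;
  }

  public static void message(String component, String message) {
    if (stream != null) {
      stream.println(component + ": " + message);
    }
  }
}
{code}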

I prefer SLF4J because IMO logging is important, but having infoStream as a 
service class is still better than what exists today (and I don't believe anyone 
can argue that calling a static method has any significant performance 
implications, if any at all).

If the committers want to drop this issue, please let me know and I'll close it. 
I don't like to nag :-)

> Replace infoSteram by a logging framework (SLF4J)
> -
>
> Key: LUCENE-1482
> URL: https://issues.apache.org/jira/browse/LUCENE-1482
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, 
> slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar
>
>
> Lucene makes use of infoStream to output messages in its indexing code only. 
> For debugging purposes, when the search application is run on the customer 
> side, getting messages from other code flows, like search, query parsing, 
> analysis etc can be extremely useful.
> There are two main problems with infoStream today:
> 1. It is owned by IndexWriter, so if I want to add logging capabilities to 
> other classes I need to either expose an API or propagate infoStream to all 
> classes (see for example DocumentsWriter, which receives its infoStream 
> instance from IndexWriter).
> 2. I can either turn debugging on or off, for the entire code.
> Introducing a logging framework can allow each class to control its logging 
> independently, and more importantly, allows the application to turn on 
> logging for only specific areas in the code (i.e., org.apache.lucene.index.*).
> I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, 
> as its name states, a facade over different logging frameworks. As such, you 
> can include the slf4j.jar in your application, and it recognizes at deploy 
> time what is the actual logging framework you'd like to use. SLF4J comes with 
> several adapters for Java logging, Log4j and others. If you know your 
> application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in 
> your classpath, and your logging statements will use Java logging underneath 
> the covers.
> This makes the logging code very simple. For a class A the logger will be 
> instantiated like this:
> public class A {
>   private static final Logger logger = LoggerFactory.getLogger(A.class);
> }
> And will later be used like this:
> public class A {
>   private static final Logger logger = LoggerFactory.getLogger(A.class);
>   public void foo() {
> if (logger.isDebugEnabled()) {
>   logger.debug("message");
> }
>   }
> }
> That's all!
> Checking for isDebugEnabled is very quick, at least using the JDK14 adapter 
> (but I assume it's fast also over other logging frameworks).
> The important thing is, every class controls its own logger. Not all classes 
> have to output logging messages, and we can improve Lucene's logging 
> gradually, w/o changing the API, by adding more logging messages to 
> interesting classes.
> I will submit a patch shortly

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662038#action_12662038
 ] 

Mark Miller commented on LUCENE-1483:
-

It's the ORD_SUBORD again (which I don't think we will use) and the two Policies. 
Odd because it's the last hit of 10 that fails for all 3. I'll ferret it out 
tonight.

- Mark

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662033#action_12662033
 ] 

Jason Rutherglen commented on LUCENE-1476:
--

Marvin: "The whole tombstone idea arose out of the need for (close to) realtime 
search! It's intended to improve write speed."

It does improve the write speed.  While making realtime search writes fast 
enough, I found that writing individual files per segment can become too costly 
(they accumulate fast, appending to a single file is faster than creating new 
files, and deleting the files becomes costly).  For example, writing small 
individual files per commit generates many files when the number of segments is 
large and a delete spans multiple segments.  How much this matters depends on 
how often updates are expected to occur.  I modeled this after an extreme case: 
the update frequency of a MySQL instance backing data for a web application.

The MySQL design, translated to Lucene, is a transaction log per index, where 
the updates (documents and deletes) are written to the transaction log file.  If 
Lucene crashed for some reason, the transaction log would be replayed.  The 
in-memory indexes and the newly deleted document bitvectors would be held in RAM 
(LUCENE-1314) until flushed, either manually or based on memory usage.  Many 
users may not want a transaction log because they already store the updates in a 
separate SQL database instance (this is the case where I work), so for them a 
transaction log is redundant and should be optional.  The first implementation 
of this will not have a transaction log.
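
For illustration, a minimal sketch of what such an optional per-index log might
look like (all names are made up; this is not part of any patch):

{code}
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Append-only log of adds and deletes, replayed after a crash.
class TransactionLog {

  private final DataOutputStream out;

  TransactionLog(String path) throws IOException {
    out = new DataOutputStream(new FileOutputStream(path, true)); // append mode
  }

  synchronized void logAdd(byte[] serializedDoc) throws IOException {
    out.writeByte(0);                   // 0 = add document
    out.writeInt(serializedDoc.length);
    out.write(serializedDoc);
    out.flush();
  }

  synchronized void logDeleteByTerm(String field, String text) throws IOException {
    out.writeByte(1);                   // 1 = delete by term
    out.writeUTF(field);
    out.writeUTF(text);
    out.flush();
  }

  synchronized void close() throws IOException {
    out.close();
  }
}
{code}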

Marvin: "I don't think I understand. Is this the "combination index 
reader/writer" model, where the writer prepares a data structure that then gets 
handed off to the reader?"

It would be exposed as a combination reader/writer that manages the transaction 
status of each update.  The internal architecture is such that after each update 
a new reader, representing the new documents and deletes for that transaction, 
is generated and put onto a stack.  The reader stack is drained based on whether 
a reader is too old to be useful anymore (i.e. no references to it, or it has N 
readers ahead of it).

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662028#action_12662028
 ] 

Mark Miller commented on LUCENE-1483:
-

bq. It runs legacy vs new sort and asserts that they are the same.

Clever. Very good idea.

I'll fix it up. Also, if you have any ideas about what Policies you want to 
start with, I'd be happy to push those around a bit too.

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1483:
---

Attachment: LUCENE-1483.patch

Attached full patch (though you'll get failed hunks because of the
annoying $Id$ expansion problem).

I fixed various small issues, and added a new TestStressSort test.  It
runs legacy vs new sort and asserts that they are the same.

It is currently failing... but I haven't spent any time digging into
why.

Mark could you dig and try to figure out why it's failing?  I think we
should resolve it before running (or, trusting) perf tests.

Also: I wonder if we can remove the null checking in the compare
methods for String*Comparator?  EG maybe we need new
FieldCache.getString{s,Index} methods that optionally take a
"fillNulls" param, and if true, nulls are replaced with the empty string?
However... that would unfortunately cause a difference whereby ""
would be equal to null (whereas now null sorts ahead of ""), which is
not back compatible.  I guess we could make a "non-null" comparator
and use it whenever it's known there are no nulls in the FieldCache
array.  It may not be worth the hassle.  If the value is never null,
the CPU will guess the right branch path every time, so the penalty should
be small (yet non-zero!).
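
For illustration, the kind of null handling at stake (a sketch only, not taken
from the patch):

{code}
// With nulls possible, every call pays for two null checks; a "non-null"
// variant of the comparator could go straight to compareTo().
int compareStrings(String val1, String val2) {
  if (val1 == null) {
    return val2 == null ? 0 : -1;   // null sorts ahead of ""
  } else if (val2 == null) {
    return 1;
  }
  return val1.compareTo(val2);
}
{code}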


> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> sortBench.py, sortCollate.py
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1497) Minor changes to SimpleHTMLFormatter

2009-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1497.


   Resolution: Fixed
Fix Version/s: (was: 2.4.1)
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed revision 732739.

Thanks Shai!

> Minor changes to SimpleHTMLFormatter
> 
>
> Key: LUCENE-1497
> URL: https://issues.apache.org/jira/browse/LUCENE-1497
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/highlighter
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1497.patch
>
>
> I'd like to make few minor changes to SimpleHTMLFormatter.
> 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default 
> constructor. This will not trigger String lookups by the JVM whenever the 
> highlighter is instantiated.
> 2. Create the StringBuffer in highlightTerm with the right number of 
> characters from the beginning. Even though StringBuffer's default constructor 
> allocates 16 chars, which will probably be enough for most highlighted terms 
> (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's 
> better to allocate SB with the right # of chars in advance, to avoid char[] 
> allocations in the middle.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1497) Minor changes to SimpleHTMLFormatter

2009-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662020#action_12662020
 ] 

Michael McCandless commented on LUCENE-1497:


Ahh, OK, then let's leave your approach (dedicated single StringBuffer).  I'll 
commit shortly.

> Minor changes to SimpleHTMLFormatter
> 
>
> Key: LUCENE-1497
> URL: https://issues.apache.org/jira/browse/LUCENE-1497
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/highlighter
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1497.patch
>
>
> I'd like to make few minor changes to SimpleHTMLFormatter.
> 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default 
> constructor. This will not trigger String lookups by the JVM whenever the 
> highlighter is instantiated.
> 2. Create the StringBuffer in highlightTerm with the right number of 
> characters from the beginning. Even though StringBuffer's default constructor 
> allocates 16 chars, which will probably be enough for most highlighted terms 
> (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's 
> better to allocate SB with the right # of chars in advance, to avoid char[] 
> allocations in the middle.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1497) Minor changes to SimpleHTMLFormatter

2009-01-08 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662004#action_12662004
 ] 

Shai Erera commented on LUCENE-1497:


If I understand you correctly, you propose to change the code to:
preTag + originalText + postTag.
That actually creates 2 (or 3) StringBuffers: Java implements + by allocating a 
StringBuffer and appending the Strings to it.
What I propose is to create the StringBuffer large enough from the beginning, so 
that there are no additional allocations.
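
For illustration, the shape of the proposed change (a sketch assuming preTag and
postTag instance fields; not copied from the patch):

{code}
public String highlightTerm(String originalText, TokenGroup tokenGroup) {
  if (tokenGroup.getTotalScore() <= 0) {
    return originalText;
  }
  // Allocate exactly the needed capacity so no intermediate char[] growth occurs.
  StringBuffer sb = new StringBuffer(
      preTag.length() + originalText.length() + postTag.length());
  sb.append(preTag).append(originalText).append(postTag);
  return sb.toString();
}
{code}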

> Minor changes to SimpleHTMLFormatter
> 
>
> Key: LUCENE-1497
> URL: https://issues.apache.org/jira/browse/LUCENE-1497
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/highlighter
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1497.patch
>
>
> I'd like to make few minor changes to SimpleHTMLFormatter.
> 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default 
> constructor. This will not trigger String lookups by the JVM whenever the 
> highlighter is instantiated.
> 2. Create the StringBuffer in highlightTerm with the right number of 
> characters from the beginning. Even though StringBuffer's default constructor 
> allocates 16 chars, which will probably be enough for most highlighted terms 
> (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's 
> better to allocate SB with the right # of chars in advance, to avoid char[] 
> allocations in the middle.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661998#action_12661998
 ] 

Mark Miller commented on LUCENE-1476:
-

bq. I noticed that in one version of the patch for segment-centric search 
(LUCENE-1483), each sorted search involved the creation of sub-searchers, which 
were then used to compile Scorers. It would make sense to cache those as 
individual SegmentSearcher objects, no?

That's a fairly old version I think (based on using MultiSearcher as a hack). 
Now we are using one queue and running it through each subreader of the 
MultiReader.

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661995#action_12661995
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

Mike McCandless:

> For a TermQuery (one term) the cost of the two approaches should be
> the same.

It'll be close, but I don't think that's quite true.  TermScorer pre-fetches
document numbers in batches from the TermDocs object.  At present, only
non-deleted doc nums get cached.  If we move the deletions filtering up, then
we'd increase traffic through that cache.  However, filling it would be
slightly cheaper, because we wouldn't be performing the deletions check.

In theory.  I'm not sure there's a way to streamline away that deletions check
in TermDocs and maintain backwards compatibility.  And while this is a fun
brainstorm, I'm still far from convinced that having TermDocs.next() and
Scorer.next() return deleted docs by default is a good idea.

> For AND (and other) queries I'm not sure. In theory, having to
> process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
> may see much higher net cost.

If you were applying deletions filtering after Scorer.next(), then it seems
likely that costs would go up because of extra hit processing.  However, if
you use Scorer.skipTo() to jump past deletions, as in the loop I provided
above, then PhraseScorer etc. shouldn't incur any more costs themselves.

> a costly per-docID search
> with a very restrictive filter could be far more efficient if you
> applied the Filter earlier in the chain.

Under the skipTo() loop, I think the filter effectively *does* get applied
earlier in the chain.  Does that make sense?

I think the potential performance downside comes down to prefetching in
TermScorer, unless there are other classes that do similar prefetching.
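
An illustrative version of the skipTo() idea (not the loop from earlier in the
thread; "deletions" and its method are made-up names for a sorted cursor over
deleted docIDs):

{code}
boolean more = scorer.next();
while (more) {
  int doc = scorer.doc();
  if (deletions.isDeletedAdvancing(doc)) {   // hypothetical: advances its cursor to >= doc
    more = scorer.skipTo(doc + 1);           // hop past the deleted doc
  } else {
    collector.collect(doc, scorer.score());
    more = scorer.next();
  }
}
{code}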




> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1497) Minor changes to SimpleHTMLFormatter

2009-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661992#action_12661992
 ] 

Michael McCandless commented on LUCENE-1497:


In fact I think it may be faster to not even use a StringBuffer in highlightTerm. 
Since we know we are concatenating 3 strings, can we just + them?  I suspect 
that'd give better net performance (pure speculation!).

> Minor changes to SimpleHTMLFormatter
> 
>
> Key: LUCENE-1497
> URL: https://issues.apache.org/jira/browse/LUCENE-1497
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/highlighter
>Reporter: Shai Erera
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1497.patch
>
>
> I'd like to make few minor changes to SimpleHTMLFormatter.
> 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default 
> constructor. This will not trigger String lookups by the JVM whenever the 
> highlighter is instantiated.
> 2. Create the StringBuffer in highlightTerm with the right number of 
> characters from the beginning. Even though StringBuffer's default constructor 
> allocates 16 chars, which will probably be enough for most highlighted terms 
> (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's 
> better to allocate SB with the right # of chars in advance, to avoid char[] 
> allocations in the middle.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1497) Minor changes to SimpleHTMLFormatter

2009-01-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1497:
--

Assignee: Michael McCandless

> Minor changes to SimpleHTMLFormatter
> 
>
> Key: LUCENE-1497
> URL: https://issues.apache.org/jira/browse/LUCENE-1497
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/highlighter
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1497.patch
>
>
> I'd like to make few minor changes to SimpleHTMLFormatter.
> 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default 
> constructor. This will not trigger String lookups by the JVM whenever the 
> highlighter is instantiated.
> 2. Create the StringBuffer in highlightTerm with the right number of 
> characters from the beginning. Even though StringBuffer's default constructor 
> allocates 16 chars, which will probably be enough for most highlighted terms 
> (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's 
> better to allocate SB with the right # of chars in advance, to avoid char[] 
> allocations in the middle.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661982#action_12661982
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) 
instead of a true iterator that keeps state? 

You can do that now by implementing BitVector.nextSetBit(int tick) and using 
that in TermDocs to set a nextDeletion member var instead of checking every doc 
num with BitVector.get().

That way, the object that provides deletions can still be shared.
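
A sketch of how that could look inside TermDocs (nextSetBit() is the method
being proposed here and does not exist yet; the rest is illustrative):

{code}
private int nextDeletion = -1;

private boolean isDeleted(int doc) {
  if (doc > nextDeletion) {
    nextDeletion = deletedDocs.nextSetBit(doc);     // proposed BitVector method
    if (nextDeletion == -1) {
      nextDeletion = Integer.MAX_VALUE;             // no more deletions
    }
  }
  return doc == nextDeletion;
}
{code}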

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661977#action_12661977
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

Paul Elschot:

> How about a SegmentSearcher?

I like the idea of a SegmentSearcher in general.  A little while back, I 
wondered whether exposing SegmentReaders was really the best way to handle 
segment-centric search.  Upon reflection, I think it is.  Segments are a good 
unit.  They're pure inverted indexes (notwithstanding doc stores and 
tombstones); the larger composite only masquerades as one.

I noticed that in one version of the patch for segment-centric search 
(LUCENE-1483), each sorted search involved the creation of sub-searchers, which 
were then used to compile Scorers. It would make sense to cache those as 
individual SegmentSearcher objects, no? 

And then, to respond to the original suggestion, the SegmentSearcher level 
seems like a good place to handle application of a deletions quasi-filter.  I 
think we could avoid having to deal with segment-start offsets that way. 

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader

2009-01-08 Thread Robert Newson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661956#action_12661956
 ] 

Robert Newson commented on LUCENE-1510:
---

Looks good to me. I wonder if you should add:

private static final byte[] EMPTY = new byte[0];

and refer to that, as your TODO suggests?
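
A sketch of what that would look like (illustrative only; the committed fix may
well differ):

{code}
private static final byte[] EMPTY = new byte[0];

public void norms(String field, byte[] bytes, int offset) throws IOException {
  byte[] norms = getIndex().getNormsByFieldNameAndDocumentNumber().get(field);
  if (norms == null) {
    norms = EMPTY;   // field has no norms here; copy nothing
  }
  System.arraycopy(norms, 0, bytes, offset, norms.length);
}
{code}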



> InstantiatedIndexReader throws NullPointerException in norms() when used with 
> a MultiReader
> ---
>
> Key: LUCENE-1510
> URL: https://issues.apache.org/jira/browse/LUCENE-1510
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Robert Newson
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: TestWithMultiReader.java
>
>
> When using InstantiatedIndexReader under a MultiReader where the other Reader 
> contains documents, a NullPointerException is thrown here;
>  public void norms(String field, byte[] bytes, int offset) throws IOException 
> {
> byte[] norms = 
> getIndex().getNormsByFieldNameAndDocumentNumber().get(field);
> System.arraycopy(norms, 0, bytes, offset, norms.length);
>   }
> the 'norms' variable is null. Performing the copy only when norms is not null 
> does work, though I'm sure it's not the right fix.
> java.lang.NullPointerException
>   at 
> org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297)
>   at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
>   at 
> org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:136)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:146)
>   at 
> org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at junit.framework.TestCase.runTest(TestCase.java:164)
>   at junit.framework.TestCase.runBare(TestCase.java:130)
>   at junit.framework.TestResult$1.protect(TestResult.java:106)
>   at junit.framework.TestResult.runProtected(TestResult.java:124)
>   at junit.framework.TestResult.run(TestResult.java:109)
>   at junit.framework.TestCase.run(TestCase.java:120)
>   at junit.framework.TestSuite.runTest(TestSuite.java:230)
>   at junit.framework.TestSuite.run(TestSuite.java:225)
>   at 
> org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
>   at 
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661944#action_12661944
 ] 

Paul Elschot commented on LUCENE-1476:
--

bq. To minimize CPU cycles, it would theoretically make more sense to handle 
deletions much higher up, at the top level Scorer, Searcher, or even the 
HitCollector level.

How about a SegmentSearcher?

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661934#action_12661934
 ] 

Michael McCandless commented on LUCENE-1476:



{quote}
> PostingList would be completely ignorant of deletions, as would classes like 
> NOTScorer and MatchAllScorer:
{quote}

This is a neat idea! Deletions are then applied just like a Filter.

For a TermQuery (one term) the cost of the two approaches should be
the same.

For OR'd Term queries, it actually seems like your proposed approach
may be lower cost?  Ie rather than each TermEnum doing the "AND NOT
deleted" intersection, you only do it once at the top.  There is added
cost in that each TermEnum is now returning more docIDs than before,
but the deleted ones are eliminated before scoring.

For AND (and other) queries I'm not sure.  In theory, having to
process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
may see much higher net cost.  We should test.

Conceivably, a future "search optimization phase" could pick & choose
the best point to inject the "AND NOT deleted" filter.  In fact, it
could also pick when to inject a Filter... a costly per-docID search
with a very restrictive filter could be far more efficient if you
applied the Filter earlier in the chain.

I'm also curious what cost you see in doing the merge sort for every
search; I think it could be uncomfortably high since it's so
branch-intensive and hard for the CPU to predict.  We could take the first
search that doesn't use skipTo and save the result of the merge sort,
essentially doing an in-RAM-only "merge" of those deletes, and let
subsequent searches use that single merged stream.  (This is not MMAP
friendly, though.)
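
For illustration, the kind of one-time merge meant here (sketch only; names are
made up):

{code}
// Merge several sorted deleted-docID streams into one sorted array that
// subsequent searches can iterate without re-doing the merge sort.
static int[] mergeDeletes(int[][] sortedStreams) {
  int total = 0;
  for (int i = 0; i < sortedStreams.length; i++) {
    total += sortedStreams[i].length;
  }
  int[] merged = new int[total];
  int[] pos = new int[sortedStreams.length];       // read position per stream
  for (int out = 0; out < total; out++) {
    int min = -1;
    for (int i = 0; i < sortedStreams.length; i++) {
      if (pos[i] < sortedStreams[i].length
          && (min == -1 || sortedStreams[i][pos[i]] < sortedStreams[min][pos[min]])) {
        min = i;
      }
    }
    merged[out] = sortedStreams[min][pos[min]++];
  }
  return merged;
}
{code}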

In my initial rough testing, I switched to an iterator API for
SegmentTermEnum and found that if the percentage of deletes is < 10% the search 
was a bit faster using an iterator vs random access, but above that it was 
slower.  This was with an already "merged" list of in-order docIDs.

Switching to an iterator API for accessing field values for many docs
(LUCENE-831 -- new FieldCache API, LUCENE-1231 -- column stride
fields) shouldn't have this same problem since it's the "top level"
that's accessing the values (ie, one iterator per field X query).



> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2009-01-08 Thread Michael McCandless


robert engels wrote:


Then why not always write segment.N.del, where N is incremented.


This is what Lucene does today.  It's "write once".

Each file may be compressed or uncompressed based on the number of  
deletions it contains.


Lucene also does this.

Still, as Marvin pointed out, the cost of committing a delete is in  
proportion to either the number of deletes already on the segment (if  
written sparse) or the number of documents in the segment (if written  
non-sparse).  It doesn't scale well... though the constant factor may  
be very small (ie may not matter that much in practice?).  With  
tombstones the commit cost would be in proportion to how many deletes  
you did (scales perfectly), at the expense of added per-search cost  
and search iterator state.


For realtime search this could be a good tradeoff to make (lower  
latency on add/delete -> refreshed searcher, at higher per-search  
cost), but... in the realtime search discussion we are now thinking  
that the deletes live with the reader and are carried in RAM over to  
the reopened reader (LUCENE-1314), bypassing having to commit to the  
filesystem at all.


One downside to this is that it's single-JRE only, ie to do distributed  
realtime search you'd have to also re-apply the deletes to the head  
IndexReader on each JRE.  (Whereas added docs would be written with a  
single IndexWriter and propagated via the filesystem.)


If we go forward with this model then indeed slowish commit times for  
new deletes are less important since it's for crash recovery and not  
for opening a new reader.


But we'd have many "control" issues to work through... eg how the  
reader can re-open against old segments right after a new merge is  
committed (because the newly merged segment isn't warmed yet), and,  
how IndexReader can open segments written by the writer but not truly  
committed (sync'd).


Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader

2009-01-08 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1510.
---

   Resolution: Fixed
Fix Version/s: 2.9

> InstantiatedIndexReader throws NullPointerException in norms() when used with 
> a MultiReader
> ---
>
> Key: LUCENE-1510
> URL: https://issues.apache.org/jira/browse/LUCENE-1510
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Robert Newson
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: TestWithMultiReader.java
>
>
> When using InstantiatedIndexReader under a MultiReader where the other Reader 
> contains documents, a NullPointerException is thrown here;
>  public void norms(String field, byte[] bytes, int offset) throws IOException 
> {
> byte[] norms = 
> getIndex().getNormsByFieldNameAndDocumentNumber().get(field);
> System.arraycopy(norms, 0, bytes, offset, norms.length);
>   }
> the 'norms' variable is null. Performing the copy only when norms is not null 
> does work, though I'm sure it's not the right fix.
> java.lang.NullPointerException
>   at 
> org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297)
>   at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
>   at 
> org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:136)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:146)
>   at 
> org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at junit.framework.TestCase.runTest(TestCase.java:164)
>   at junit.framework.TestCase.runBare(TestCase.java:130)
>   at junit.framework.TestResult$1.protect(TestResult.java:106)
>   at junit.framework.TestResult.runProtected(TestResult.java:124)
>   at junit.framework.TestResult.run(TestResult.java:109)
>   at junit.framework.TestCase.run(TestCase.java:120)
>   at junit.framework.TestSuite.runTest(TestSuite.java:230)
>   at junit.framework.TestSuite.run(TestSuite.java:225)
>   at 
> org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
>   at 
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader

2009-01-08 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661908#action_12661908
 ] 

Karl Wettin commented on LUCENE-1510:
-

Thanks for the report Robert!

I've committed a fix in revision 732661. Please check it out and let me know 
how it works for you. There was a bit of discrepancies between how the 
InstantiatedIndexReader handled null norms compared to a SegmentReader. I think 
these problems are fixed now.

 

> InstantiatedIndexReader throws NullPointerException in norms() when used with 
> a MultiReader
> ---
>
> Key: LUCENE-1510
> URL: https://issues.apache.org/jira/browse/LUCENE-1510
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Robert Newson
>Assignee: Karl Wettin
> Attachments: TestWithMultiReader.java
>
>
> When using InstantiatedIndexReader under a MultiReader where the other Reader 
> contains documents, a NullPointerException is thrown here;
>  public void norms(String field, byte[] bytes, int offset) throws IOException 
> {
> byte[] norms = 
> getIndex().getNormsByFieldNameAndDocumentNumber().get(field);
> System.arraycopy(norms, 0, bytes, offset, norms.length);
>   }
> the 'norms' variable is null. Performing the copy only when norms is not null 
> does work, though I'm sure it's not the right fix.
> java.lang.NullPointerException
>   at 
> org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297)
>   at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
>   at 
> org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:136)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:146)
>   at 
> org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at junit.framework.TestCase.runTest(TestCase.java:164)
>   at junit.framework.TestCase.runBare(TestCase.java:130)
>   at junit.framework.TestResult$1.protect(TestResult.java:106)
>   at junit.framework.TestResult.runProtected(TestResult.java:124)
>   at junit.framework.TestResult.run(TestResult.java:109)
>   at junit.framework.TestCase.run(TestCase.java:120)
>   at junit.framework.TestSuite.runTest(TestSuite.java:230)
>   at junit.framework.TestSuite.run(TestSuite.java:225)
>   at 
> org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
>   at 
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org