[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents
[ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1479: --- Attachment: LUCENE-1479.patch Thanks Mike, you're right. The compilation error is a result of a refactoring I did to that line, using a single substring call instead of two; I forgot to use 'sb' in the second indexOf call, hence the compilation error. Regarding dateStr - I fixed that. Thanks for noticing it. > TrecDocMaker skips over documents when "Date" is missing from documents > --- > > Key: LUCENE-1479 > URL: https://issues.apache.org/jira/browse/LUCENE-1479 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/benchmark >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1479.patch > > > TrecDocMaker skips over Trec documents if they do not have a "Date" line. > When such a document is encountered, the code may skip over several documents > until the next tag that is searched for is found. > The result is, instead of reading ~25M documents from the GOV2 collection, > the code reads only ~23M (don't remember the actual numbers). > The fix adds a terminatingTag to read() such that the code looks for prefix, > but only until terminatingTag is found. Appropriate changes were made in > getNextDocData(). > Patch to follow -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents
[ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1479: --- Attachment: (was: LUCENE-1479.patch) > TrecDocMaker skips over documents when "Date" is missing from documents > --- > > Key: LUCENE-1479 > URL: https://issues.apache.org/jira/browse/LUCENE-1479 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/benchmark >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.4.1, 2.9 > > > TrecDocMaker skips over Trec documents if they do not have a "Date" line. > When such a document is encountered, the code may skip over several documents > until the next tag that is searched for is found. > The result is, instead of reading ~25M documents from the GOV2 collection, > the code reads only ~23M (don't remember the actual numbers). > The fix adds a terminatingTag to read() such that the code looks for prefix, > but only until terminatingTag is found. Appropriate changes were made in > getNextDocData(). > Patch to follow -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662247#action_12662247 ] Doug Cutting commented on LUCENE-1476: -- bq. To really tighten this loop, you have to [ ... ] remove all function/method call overhead [and] operate directly on the memory mapped postings file. That sounds familiar... http://svn.apache.org/viewvc/lucene/java/trunk/src/gcj/org/apache/lucene/index/GCJTermDocs.cc?view=markup > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch, quasi_iterator_deletions.diff > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1494) Additional features for searching for value across multiple fields (many-to-one style)
[ https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662244#action_12662244 ] Paul Cowan commented on LUCENE-1494: Hi Hoss, I don't disagree that an inverted inheritance hierarchy would make more sense, but the problem with that is that getField (which I _think_ is the only thing on SpanNearQuery that doesn't really make sense for a MultiField one) is mandated by the abstract method declaration of same in SpanQuery, which the inverted parent class would still extend. Looking at where getField() is used (primarily in SpanWeight.explain() and SpanWeight and BoostingTermWeight's .scorer() methods) I'm not sure how I can meaningfully deal with those in the case of a multifield span query. If you (or anyone else) have any suggestions for that then I'm all ears, this would be really useful for us (and a lot of other people I think, it's not an uncommon query on the lists etc). Personally I'd be equally happy with just eliminating the same-field requirement (as you mentioned, I think, that Doug suggested) but those explain()s and scorer() methods would still need to be changed. Any ideas? Paul > Additional features for searching for value across multiple fields > (many-to-one style) > -- > > Key: LUCENE-1494 > URL: https://issues.apache.org/jira/browse/LUCENE-1494 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: 2.4 >Reporter: Paul Cowan >Priority: Minor > Attachments: LUCENE-1494-multifield.patch, > LUCENE-1494-positionincrement.patch > > > This issue is to cover the changes required to do a search across multiple > fields with the same name in a fashion similar to a many-to-one database. > Below is my post on java-dev on the topic, which details the changes we need: > --- > We have an interesting situation where we are effectively indexing two > 'entities' in our system, which share a one-to-many relationship (imagine > 'User' and 'Delivery Address' for demonstration purposes). At the moment, we > index one Lucene Document per 'many' end, duplicating the 'one' end data, > like so: > userid: 1 > userfirstname: fred > addresscountry: au > addressphone: 1234 > userid: 1 > userfirstname: fred > addresscountry: nz > addressphone: 5678 > userid: 2 > userfirstname: mary > addresscountry: au > addressphone: 5678 > (note: 2 Documents indexed for user 1). This is somewhat annoying for us, > because when we search in Lucene the results we want back (conceptually) are > at the 'user' level, so we have to collapse the results by distinct user id, > etc. etc (let alone that it blows out the size of our index enormously). So > why do we do it? It would make more sense to use multiple fields: > userid: 1 > userfirstname: fred > addresscountry: au > addressphone: 1234 > addresscountry: nz > addressphone: 5678 > userid: 2 > userfirstname: mary > addresscountry: au > addressphone: 5678 > But imagine the search "+addresscountry:au +addressphone:5678". We'd like > this to match ONLY Mary, but of course it matches Fred also because he > matches both those terms (just for different addresses). > There are two aspects to the approach we've (more or less) got working but > I'd like to run them past the group and see if they're worth trying to get > them into Lucene proper (if so, I'll create a JIRA issue for them) > 1) Use a modified SpanNearQuery. 
> If we assume that country + phone will always be one token, we can rely on
> the fact that the positions of 'au' and '5678' in Fred's document will be
> different.
>    SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
>    SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
>    SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
> the slop of 0 means that we'll only return those where the two terms are in
> the same position in their respective fields. This works brilliantly, BUT
> requires a change to SpanNearQuery's constructor (which checks that all the
> clauses are against the same field). Are people amenable to perhaps adding
> another constructor to SNQ which doesn't do the check, or subclassing it to
> do the same (give it a protected non-checking constructor for the subclass
> to call)?
> 2) It gets slightly more complicated in the case of variable-length terms.
> For example, imagine if we had an 'address' field ('123 Smith St') which
> will result in (1 to n) tokens; slop 0 in a SpanNearQuery won't work here,
> of course. One thing we've toyed with is the idea of using
> getPositionIncrementGap -- if we knew that 'address' would be, at most, 20
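A sketch of the getPositionIncrementGap idea from point (2), assuming (hypothetically; the message is truncated above) a gap of 100 positions between successive 'address' values:

{code}
// Give repeated "address" values a large position gap, so a SpanNearQuery
// with slop smaller than the gap can never match terms from two different
// address values.
Analyzer analyzer = new StandardAnalyzer() {
  public int getPositionIncrementGap(String fieldName) {
    // assumption: no single address is anywhere near 100 tokens long
    return "address".equals(fieldName) ? 100 : 0;
  }
};
// Terms within one address are at most ~20 positions apart; terms from
// different addresses are at least ~80 apart, so a slop below 80 keeps a
// match inside a single address value.
SpanQuery street = new SpanTermQuery(new Term("address", "smith"));
SpanQuery city   = new SpanTermQuery(new Term("address", "melbourne"));
SpanQuery sameAddress = new SpanNearQuery(new SpanQuery[]{street, city}, 30, false);
{code}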
Re: Realtime Search
We have worked on this problem on the server level as well. We have also open sourced it at: http://code.google.com/p/zoie/ wiki on the realtime aspect: http://code.google.com/p/zoie/wiki/ZoieSystem -John On Fri, Dec 26, 2008 at 12:34 PM, Robert Engels wrote: > If you move to the "either embedded, or server model", the post reopen is > trivial, as the structures can be created as the segment is written. > > It is the networked shared access model that causes a lot of these > optimizations to be far more complex than needed. > > Would it maybe be simpler to move to the "embedded or server" model, and add a > network shared file (e.g. nfs) access model as a layer? The latter is going > to perform far worse anyway. > > I guess I don't understand why Lucene continues to try and support this > model. NO ONE does it any more. This is the way MS Access worked, and > everyone that wanted performance needed to move to SQL server for the server > model. > > > -Original Message- > >From: Marvin Humphrey > >Sent: Dec 26, 2008 12:53 PM > >To: java-dev@lucene.apache.org > >Subject: Re: Realtime Search > > > >On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote: > >> > 4) Allow 2 concurrent writers: one for small, fast updates, and one > for > >> > big background merges. > >> > >> Marvin can you describe more detail here? > > > >The goal is to improve worst-case write performance. > > > >Currently, writes are quick most of the time, but occasionally you'll > trigger > >a big merge and get stuck. To solve this problem, we can assign a merge > >policy to our primary writer which tells it to merge no more than > >mergeThreshold documents. The value of mergeThreshold will need tuning > >depending on document size, change rate, and so on, but the idea is that > we > >want this writer to do as much merging as it can while still keeping > >worst-case write performance down to an acceptable number. > > > >Doing only small merges just puts off the day of reckoning, of course. By > >avoiding big consolidations, we are slowly accumulating small-to-medium > sized > >segments and causing a gradual degradation of search-time performance. > > > >What we'd like is a separate write process, operating (mostly) in the > >background, dedicated solely to merging segments which contain at least > >mergeThreshold docs. > > > >If all we have to do is add documents to the index, adding that second > write > >process isn't a big deal. We have to worry about competition for segment, > >snapshot, and temp file names, but that's about it. > > > >Deletions make matters more complicated, but with a tombstone-based > deletions > >mechanism, the problems are solvable. > > > >When the background merge writer starts up, it will see a particular view > of > >the index in time, including deletions. It will perform nearly all of its > >operations based on this view of the index, mapping around documents which > >were marked as deleted at init time. > > > >In between the time when the background merge writer starts up and the > time it > >finishes consolidating segment data, we assume that the primary writer > will > >have modified the index. > > > > * New docs have been added in new segments. > > * Tombstones have been added which suppress documents in segments which > >didn't even exist when the background merge writer started up. > > * Tombstones have been added which suppress documents in segments which > >existed when the background merge writer started up, but were not > merged.
> > * Tombstones have been added which suppress documents in segments which > have >just been merged. > > >Only the last category of deletions matters. > > >At this point, the background merge writer acquires an exclusive write lock > on >the index. It examines recently added tombstones, translates the document > >numbers and writes a tombstone file against itself. Then it writes the > >snapshot file to commit its changes and releases the write lock. > > >Worst case update performance for the system is now the sum of the time it > >takes the background merge writer to consolidate tombstones and the worst-case > >performance of the primary writer. > > >> It sounds like this is your solution for "decoupling" segments changes > due > >> to merges from changes from docs being indexed, from a reader's > standpoint? > > > >It's true that we are decoupling the process of making logical changes to > the > >index from the process of internal consolidation. I probably wouldn't > >describe that as being done from the reader's standpoint, though. > > > >With mmap and data structures optimized for it, we basically solve the > >read-time responsiveness cost problem. From the client perspective, the > delay > >between firing off a change order and seeing that change made live is now > >dominated by the time it takes to actually update the index. The time > between > >the commit and having an IndexReader which can see that commit is >
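In current Lucene terms, the primary writer's "merge no more than mergeThreshold documents" policy roughly corresponds to capping merge size on IndexWriter (a sketch; the threshold value is illustrative, and the dedicated background merge writer described above has no direct Lucene analog today):

{code}
// Primary writer: keep worst-case write latency low by refusing big merges.
IndexWriter primary = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
primary.setMaxMergeDocs(mergeThreshold); // e.g. 100000, tuned per application
{code}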
[jira] Updated: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marvin Humphrey updated LUCENE-1476: Attachment: quasi_iterator_deletions.diff Here's a patch implementing BitVector.nextSetBit() and converting SegmentTermDocs over to use the quasi-iterator style. Tested but not benchmarked. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch, quasi_iterator_deletions.diff > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Based on our discussions, it seems best to get realtime search going in small steps. Below are some possible steps to take.

Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock.

Patch #2: Implement a realtime ram index class.

Patch #3: Implement realtime transactions in IndexWriter, or in a subclass of IndexWriter, by implementing a createTransaction method that generates a realtime Transaction object. When the transaction is flushed, the transaction's index modifications are available via the getReader method of IndexWriter.

The remaining question is how to synchronize the flushes to disk with IndexWriter's other index update locking mechanisms. The flushing could simply use IW.addIndexes, which already has a locking mechanism in place (see the sketch below). After flushing to disk, queued deletes would be applied to the newly copied disk segments. I think this entails opening the newly copied disk segments and applying the deletes that occurred on the corresponding ram segments, by cloning the new disk segments, replacing the deleted-docs BitVector, and then flushing the deleted docs to disk. This system would allow us to avoid using a UID in documents. The API needs to clearly separate realtime transactions from the existing index update methods such as addDocument, deleteDocuments, and updateDocument. I don't think it's possible to transparently implement both, because the underlying implementations behave differently. It is expected that multiple transactions may be created at once; however, the Transaction.flush method would block.
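A sketch of how the flush/search plumbing in these steps could lean on existing APIs (the directory and writer names are illustrative, and the queued-delete handling described above is elided):

{code}
// Search the in-RAM segments and the on-disk index together.
IndexReader ramReader  = IndexReader.open(ramDir);
IndexReader diskReader = IndexReader.open(fsDir);
IndexSearcher searcher = new IndexSearcher(
    new MultiReader(new IndexReader[] { diskReader, ramReader }));

// Flush: fold the RAM segments into the disk index under IW's own locking.
diskWriter.addIndexes(new Directory[] { ramDir });
{code}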
Re: stored fields / unicode compression
thanks for the response, this sounds great. some way to plug in arbitrary schemes would be helpful. I've experimented with a few for my case and unicode compression gave the best bang for the buck, but I remember some of the other schemes such as arithmetic coding seemed to provide wins for reasonably short fields where gzip was still making them bigger... On Thu, Jan 8, 2009 at 8:26 PM, Chris Hostetter wrote: > > Catching up on my holiday email, I don't think there were any replies to > this question yet. > > The low level file formats used by Lucene are an area I don't have > time/expertise to follow carefully, but if I remember correctly the > consensus is/was to move more towards pure (byte[] data, int start, int > end) based APIs for efficiency, with "String" based APIs provided as > syntactic sugar via a facade, and deprecating the existing "internal" gzip > compression in favor of similar "external" compression facades. So > something like you describe could be done as is using the byte[] > interfaces *and* be generally useful to others. > > Taking a step back to look at the broader picture, this is the kind of > thing that in Solr could be implemented as a new FieldType > > : Date: Fri, 26 Dec 2008 19:00:11 -0500 > : From: Robert Muir > : Subject: stored fields / unicode compression > : > : Has there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for > : stored fields? > : Personally I don't put huge amounts of text in stored fields but these > : encodings/compression work extremely well on short strings like titles, > etc. > : Removing the unicode penalty for non-latin text (i.e. cut in half) is > : nothing to sneeze at since with lots of docs my stored fields still > become > : pretty huge, biggest part of the index. > : > : I know I could use one of these schemes right now and store everything as > : bytes... but just thinking it might be something of more general use. The > : GZIP compression that is supported isn't very useful as it typically > makes > : short snippets bigger... > : > : Performance compared to UTF-8 is here... seems like a general win to me > (but > : maybe I am missing something) > : http://unicode.org/notes/tn6/#Performance > > > -Hoss > > > - > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com
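For illustration, an application can already do this today by compressing to bytes itself and storing a binary field; a sketch assuming ICU4J's SCSU codec (com.ibm.icu.text.UnicodeCompressor / UnicodeDecompressor), with the field name arbitrary:

{code}
// Compress the title with SCSU and store it as opaque bytes.
byte[] scsu = UnicodeCompressor.compress(title);
doc.add(new Field("title", scsu, Field.Store.YES));

// At display time, decompress the stored bytes back to a String.
byte[] stored = searcher.doc(docId).getBinaryValue("title");
String restored = UnicodeDecompressor.decompress(stored);
{code}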
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662214#action_12662214 ] Jason Rutherglen commented on LUCENE-1476: --

M.M.: "I think the transactions layer would also sit on top of this "realtime" layer? EG this "realtime" layer would expose a commit() method, and the transaction layer above it would maintain the transaction log, periodically calling commit() and truncating the transaction log?"

One approach that may be optimal is to expose from IndexWriter a createTransaction method that accepts new documents and deletes. All documents have an associated UID. The new documents could feasibly be encoded into a single segment that represents the added documents for that transaction. The deletes would be represented as document long UIDs rather than int doc ids. Then the commit method would be called on the transaction object, which returns a reader representing the latest version of the index plus the changes created by the transaction. This system would be a part of IndexWriter and would not rely on a transaction log. IndexWriter.commit would flush the in-ram realtime indexes to disk. The realtime merge policy would flush based on the RAM usage or number of docs.

{code}
IndexWriter iw = new IndexWriter();
Transaction tr = iw.createTransaction();
tr.addDocument(new Document());
tr.addDocument(new Document());
tr.deleteDocument(1200L);
IndexReader ir = tr.flush();               // flushes transaction to the index (probably to a ram index)
IndexReader latestReader = iw.getReader(); // same as ir
iw.commit(doWait);                         // proposed commit(boolean doWait): commits the in-ram realtime index to disk
{code}

When commit is called, the disk segment readers flush their deletes to disk, which is fast. The in-ram realtime index is merged to disk. The process is described in more detail further down.

M.H.: "how about writing a single-file Directory implementation?"

I'm not sure we need this, because an appending rolling transaction log should work. Segments don't change, only things like norms and deletes, which can be appended to a rolling transaction log file system. If we had a generic transaction logging system, the future column stride fields, deletes, norms, and future realtime features could use it and be realtime.

M.H.: "How do you guarantee that you always see the "current" version of a given document, and only that version?"

Each transaction returns an IndexReader. Each "row" or "object" could use a unique id in the transaction log model, which would allow documents that were merged into other segments to be deleted during a transaction log replay.

M.H.: "When do you expose new deletes in the RAMDir, when do you expose new deletes in the FSDirectory, how do you manage slow merges from the RAMDir to the FSDirectory, how do you manage new adds to the RAMDir that take place during slow merges..."

Queue deletes to the RAMDir while copying the RAMDir to the FSDir in the background, perform the deletes after the copy is completed, then instantiate a new reader with the newly merged FSDirectory and a new RAMDir (sketched at the end of this comment). Writes that were occurring during this process would be happening to another new RAMDir. One way to think of the realtime problem is in terms of segments rather than FSDirs and RAMDirs. Some segments are on disk, some in RAM.
Each transaction is an instance of some segments and their deletes (and we're not worried about the deletes being flushed or not, so assume they exist as BitVectors). The system should expose an API to checkpoint/flush at a given transaction level (usually the current) and should not stop new updates from happening. When I wrote this type of system, I managed individual segments outside of IndexWriter's merge policy and performed the merging manually by placing each segment in its own FSDirectory (the segment size was 64MB), which minimized the number of directories. I do not know the best approach for this when performed within IndexWriter.

M.H.: "Two comments. First, if you don't sync, but rather leave it up to the OS when it wants to actually perform the disk i/o, how expensive is flushing? Can we make it cheap enough to meet Jason's absolute change rate requirements?"

When I tried out the transaction log, a write usually mapped pretty quickly to a hard disk write. I don't think it's safe to leave writes up to the OS.

M.M.: "maintain & update deleted docs even though IndexWriter has the write lock"

In my previous realtime search implementation I got around this by having each segment in its own directory. Assuming this is non-optimal, we will need to expose an IndexReader that has the write lock of the IndexWriter.

> BitVector implement DocIdSet > > >
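A sketch of the "queue deletes to the RAMDir while copying to the FSDir in the background" step described above (all names illustrative, error handling elided):

{code}
// Freeze the current RAM index and direct new writes to a fresh one.
RAMDirectory frozen = ramDir;
ramDir = new RAMDirectory();
List queuedDeletes = new ArrayList();   // delete Terms arriving during the copy

// Slow background merge of the frozen RAM segments into the FSDirectory.
fsWriter.addIndexes(new Directory[] { frozen });

// Apply the deletes that were queued mid-copy, then publish a fresh reader.
for (Iterator it = queuedDeletes.iterator(); it.hasNext();) {
  fsWriter.deleteDocuments((Term) it.next());
}
fsWriter.commit();
IndexReader newReader = IndexReader.open(fsDir);
{code}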
Re: stored fields / unicode compression
Catching up on my holiday email, I don't think there were any replies to this question yet. The low level file formats used by Lucene are an area I don't have time/expertise to follow carefully, but if I remember correctly the consensus is/was to move more towards pure (byte[] data, int start, int end) based APIs for efficiency, with "String" based APIs provided as syntactic sugar via a facade, and deprecating the existing "internal" gzip compression in favor of similar "external" compression facades. So something like you describe could be done as is using the byte[] interfaces *and* be generally useful to others. Taking a step back to look at the broader picture, this is the kind of thing that in Solr could be implemented as a new FieldType : Date: Fri, 26 Dec 2008 19:00:11 -0500 : From: Robert Muir : Subject: stored fields / unicode compression : : Has there been any thoughts of using SCSU or BOCU-1 instead of UTF-8 for : stored fields? : Personally I don't put huge amounts of text in stored fields but these : encodings/compression work extremely well on short strings like titles, etc. : Removing the unicode penalty for non-latin text (i.e. cut in half) is : nothing to sneeze at since with lots of docs my stored fields still become : pretty huge, biggest part of the index. : : I know I could use one of these schemes right now and store everything as : bytes... but just thinking it might be something of more general use. The : GZIP compression that is supported isn't very useful as it typically makes : short snippets bigger... : : Performance compared to UTF-8 is here... seems like a general win to me (but : maybe I am missing something) : http://unicode.org/notes/tn6/#Performance -Hoss - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
The way we've simplified this is that every document has an OID. It simplifies updates and delete tracking (in the transaction log). On Jan 8, 2009, at 2:28 PM, Marvin Humphrey (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662107#action_12662107 ] Marvin Humphrey commented on LUCENE-1476: - Mike McCandless: Commit is for crash recovery, and for knowing when it's OK to delete prior commits. Simply writing the files (and not syncing them), and perhaps giving IndexReader.open the SegmentInfos to use directly (and not writing a segments_N via the filesystem) would allow us to search added docs without paying the cost of sync'ing all the files. Mmm. I think I might have given IndexWriter.commit() slightly different semantics. Specifically, I might have given it a boolean "sync" argument which defaults to false. Also: brand new, tiny segments should be written into a RAMDirectory and then merged over time into the real Directory. Two comments. First, if you don't sync, but rather leave it up to the OS when it wants to actually perform the disk i/o, how expensive is flushing? Can we make it cheap enough to meet Jason's absolute change rate requirements? Second, the multi-index model is very tricky when dealing with "updates". How do you guarantee that you always see the "current" version of a given document, and only that version? When do you expose new deletes in the RAMDirectory, when do you expose new deletes in the FSDirectory, how do you manage slow merges from the RAMDirectory to the FSDirectory, how do you manage new adds to the RAMDirectory that take place during slow merges... Building a single-index, two-writer model that could handle fast updates while performing background merging was one of the main drivers behind the tombstone design. BitVector implement DocIdSet Key: LUCENE-1476 URL: https://issues.apache.org/jira/browse/LUCENE-1476 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4 Reporter: Jason Rutherglen Priority: Trivial Attachments: LUCENE-1476.patch Original Estimate: 12h Remaining Estimate: 12h BitVector can implement DocIdSet. This is for making SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
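A sketch of the OID-per-document scheme on the Lucene side (the field name "oid" is illustrative); updateDocument then gives delete-then-add semantics keyed on that term:

{code}
// Index the application-assigned OID as an untokenized field so it can
// serve as a delete/update key.
long oid = 1200L;
Document doc = new Document();
doc.add(new Field("oid", Long.toString(oid), Field.Store.YES, Field.Index.NOT_ANALYZED));

// Atomically replaces any previous document carrying this OID.
writer.updateDocument(new Term("oid", Long.toString(oid)), doc);

// A transaction log can record bare OIDs for deletes and replay them later:
writer.deleteDocuments(new Term("oid", Long.toString(oid)));
{code}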
Re: [jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
> You can do that now by implementing BitVector.nextSetBit(int tick) and using > that in TermDocs to set a nextDeletion member var instead of checking every > doc num with BitVector.get(). This seems so easy, I should take a crack at it. :) Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
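A minimal sketch of what BitVector.nextSetBit(int) could look like, assuming BitVector's internal byte[] bits and int size fields (a real implementation would skip whole zero bytes rather than probing bit by bit):

{code}
/** Returns the index of the first set bit at or after index, or -1 if none. */
public int nextSetBit(int index) {
  if (index < 0) index = 0;
  for (int i = index; i < size; i++) {
    // same addressing as BitVector.get(): byte i >> 3, bit i & 7
    if ((bits[i >> 3] & (1 << (i & 7))) != 0) {
      return i;
    }
  }
  return -1;
}
{code}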
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662143#action_12662143 ] Marvin Humphrey commented on LUCENE-1476: -

Mike McCandless:

> So, net/net it seems like "deletes-as-a-filter" approach is compelling?

In terms of CPU-cycles, maybe. My gut tells me that it's all but mandatory if we use merged-on-the-fly tombstone streams, but if Lucene goes that route it should cache a BitVector and use a shared pseudo-iterator -- in which case the costs will no longer be significantly more than the current system. Under the current system, I'm not certain that the deletions checks are that excessive. Consider this loop from TermDocs.read():

{code}
while (i < length && count < df) {
  // manually inlined call to next() for speed
  final int docCode = freqStream.readVInt();
  doc += docCode >>> 1;           // shift off low bit
  if ((docCode & 1) != 0)         // if low bit is set
    freq = 1;                     // freq is one
  else
    freq = freqStream.readVInt(); // else read freq
  count++;
  if (deletedDocs == null || !deletedDocs.get(doc)) {
    docs[i] = doc;
    freqs[i] = freq;
    ++i;
  }
}
{code}

The CPU probably does a good job of predicting the result of the null check on deletedDocs. The readVInt() method call is already a pipeline killer. Here's how that loop looks after I patch the deletions check for pseudo-iteration:

{code}
while (i < length && count < df) {
  // manually inlined call to next() for speed
  final int docCode = freqStream.readVInt();
  doc += docCode >>> 1;           // shift off low bit
  if ((docCode & 1) != 0)         // if low bit is set
    freq = 1;                     // freq is one
  else
    freq = freqStream.readVInt(); // else read freq
  count++;
  if (doc >= nextDeletion) {
    if (doc > nextDeletion) {
      nextDeletion = deletedDocs.nextSetBit(doc);
    }
    if (doc == nextDeletion) {
      continue;
    }
  }
  docs[i] = doc;
  freqs[i] = freq;
  ++i;
}
return i;
{code}

Again, the CPU is probably going to do a pretty good job of predicting the results of the deletion check. And even then, we're accessing the same shared BitVector across all TermDocs, and its bits are hopefully a cache hit. To really tighten this loop, you have to do what Nate and I want with Lucy/KS:

* Remove all function/method call overhead.
* Operate directly on the memory mapped postings file.

{code}
u32_t
SegPList_bulk_read(SegPostingList *self, i32_t *doc_nums, i32_t *freqs,
                   u32_t request)
{
    InStream *instream    = self->instream;   /* instream member assumed */
    i32_t doc_num         = self->doc_num;
    const u32_t remaining = self->doc_freq - self->count;
    const u32_t num_got   = request < remaining ? request : remaining;
    char *buf = InStream_Buf(instream, C32_MAX_BYTES * num_got);
    u32_t i;

    for (i = 0; i < num_got; i++) {
        u32_t doc_code = Math_decode_c32(&buf); /* static inline function */
        u32_t freq     = (doc_code & 1) ? 1 : Math_decode_c32(&buf);
        doc_num    += doc_code >> 1;
        doc_nums[i] = doc_num;
        freqs[i]    = freq;
    }
    InStream_Advance_Buf(instream, buf);
    self->doc_num = doc_num;
    self->count  += num_got;

    return num_got;
}
{code}

(That loop would be even better using PFOR instead of vbyte.)

In terms of public API, I don't think it's reasonable to change Lucene's Scorer and TermDocs classes so that their iterators start returning deleted docs. We could potentially make that choice with Lucy/KS, thus allowing us to remove the deletions check in the PostingList iterator (as above) and getting a potential speedup. But even then I hesitate to push the deletions API upwards into a space where users of raw Scorer and TermDocs classes have to deal with it -- especially since iterator-style deletions aren't very user-friendly.
> BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1314) IndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1314: - Attachment: LUCENE-1314.patch LUCENE-1314.patch All tests pass. IndexReader.close was made non-final so it can be overridden in SegmentReader. This is because the call that propagates to SegmentReader.doClose previously passed through decRef, which could be invoked by either IndexReader.decRef or IndexReader.close. In order to decref the copy-on-write refs, the close method needs to decrement the references itself, rather than leaving that solely to the decRef method. This caused the bug found in the previous comment: when decRef was called, the deletedDocsRef should not also have been decrefed, which is what made the ref count assertion fail. Occasionally TestIndexReaderReopen.testThreadSafety fails due to an already-closed exception. Trunk however also fails periodically. Given that multi-threaded reopen/close is unusual, I am not sure it is worth investigating further. Fixed norm byte refs not decrefing on close. Fixed cloneNorm() byteRef being created when there is no byte array; added an assertion check. > IndexReader.clone > - > > Key: LUCENE-1314 > URL: https://issues.apache.org/jira/browse/LUCENE-1314 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.3.1 >Reporter: Jason Rutherglen >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, > LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, > LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, lucene-1314.patch, > lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, > lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, > lucene-1314.patch, lucene-1314.patch, lucene-1314.patch > > > Based on discussion > http://www.nabble.com/IndexReader.reopen-issue-td18070256.html. The problem > is reopen returns the same reader if there are no changes, so if docs are > deleted from the new reader, they are also reflected in the previous reader > which is not always desired behavior. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662110#action_12662110 ] Marvin Humphrey commented on LUCENE-1476: - Mike McCandless: > if it's sparse, you need an iterator (state) to remember where you are. We can hide the sparse representation and the internal state, having the object lazily build a non-sparse representation. That's what I had in mind with the code for TombstoneDelEnum.nextDeletion(). TombstoneDelEnum.nextInternal() would be a private method used for building up the internal BitVector. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662107#action_12662107 ] Marvin Humphrey commented on LUCENE-1476: - Mike McCandless: > Commit is for crash recovery, and for knowing when it's OK to delete > prior commits. Simply writing the files (and not syncing them), and > perhaps giving IndexReader.open the SegmentInfos to use directly (and > not writing a segments_N via the filesystem) would allow us to search > added docs without paying the cost of sync'ing all the files. Mmm. I think I might have given IndexWriter.commit() slightly different semantics. Specifically, I might have given it a boolean "sync" argument which defaults to false. > Also: brand new, tiny segments should be written into a RAMDirectory > and then merged over time into the real Directory. Two comments. First, if you don't sync, but rather leave it up to the OS when it wants to actually perform the disk i/o, how expensive is flushing? Can we make it cheap enough to meet Jason's absolute change rate requirements? Second, the multi-index model is very tricky when dealing with "updates". How do you guarantee that you always see the "current" version of a given document, and only that version? When do you expose new deletes in the RAMDirectory, when do you expose new deletes in the FSDirectory, how do you manage slow merges from the RAMDirectory to the FSDirectory, how do you manage new adds to the RAMDirectory that take place during slow merges... Building a single-index, two-writer model that could handle fast updates while performing background merging was one of the main drivers behind the tombstone design. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662102#action_12662102 ] Michael McCandless commented on LUCENE-1476: {quote} > How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) > instead of a true iterator that keeps state? {quote} That works if under-the-hood it's a non-sparse representation. But if it's sparse, you need an iterator (state) to remember where you are. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662101#action_12662101 ] Michael McCandless commented on LUCENE-1476: {quote} > If we move the deletions filtering up, then we'd increase traffic through > that cache {quote} OK, right. So we may have some added cost because of this. I think it's only TermScorer that uses the bulk API though. {quote} > If you were applying deletions filtering after Scorer.next(), then it seems > likely that costs would go up because of extra hit processing. However, if > you use Scorer.skipTo() to jump past deletions, as in the loop I provided > above, then PhraseScorer etc. shouldn't incur any more costs themselves. {quote} Ahhh, now I got it! Good, you're right. {quote} > Under the skipTo() loop, I think the filter effectively does get applied > earlier in the chain. Does that make sense? {quote} Right. This is how Lucene works today. Excellent. So, net/net it seems like "deletes-as-a-filter" approach is compelling? > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
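For context, a sketch of the skipTo()-style loop being discussed (hedged: nextSetBit and nextClearBit on the deletions object are assumed helpers, and hc is a HitCollector; this is not code from the issue itself):

{code}
// Drive a Scorer with deletions applied as an external filter, skipping
// past whole runs of deleted docs instead of testing every candidate hit.
int nextDeletion = 0;
boolean more = scorer.next();
while (more) {
  int doc = scorer.doc();
  if (doc >= nextDeletion) {
    nextDeletion = deletedDocs.nextSetBit(doc);            // assumed helper
    if (nextDeletion == -1) {
      nextDeletion = Integer.MAX_VALUE;                    // no deletions remain
    } else if (doc == nextDeletion) {
      // Current doc is deleted: jump past the contiguous deleted run.
      more = scorer.skipTo(deletedDocs.nextClearBit(doc)); // assumed helper
      continue;
    }
  }
  hc.collect(doc, scorer.score());                         // live hit
  more = scorer.next();
}
{code}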
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662100#action_12662100 ] Marvin Humphrey commented on LUCENE-1476: -

Mike McCandless:

> I'm also curious what cost you see of doing the merge sort for every > search; I think it could be uncomfortably high since it's so > hard-for-cpu-to-predict-branch-intensive.

Probably true. You're going to get accelerating degradation as the number of deletions increases. In a large index, you could end up merging 20, 30 streams. Based on how the priority queue in ORScorer tends to take up space in profiling data, that might not be good. It'd be manageable if you can keep your index in reasonably good shape, but you'll be suckin' pondwater if it gets flabby.

> We could take the first search that doesn't use skipTo and save the result > of the merge sort, essentially doing an in-RAM-only "merge" of those > deletes, and let subsequent searches use that single merged stream.

That was what I had in mind when proposing the pseudo-iterator model.

{code}
class TombStoneDelEnum extends DelEnum {
  int nextDeletion(int docNum) {
    while (currentMax < docNum) {
      nextInternal();
    }
    return bits.nextSetBit(docNum);
  }
  // ...
}
{code}

> (This is not MMAP friendly, though).

Yeah. Ironically, that use of tombstones is more compatible with the Lucene model. :-) I'd be reluctant to have Lucy/KS realize those large BitVectors in per-object process RAM. That'd spoil the "cheap wrapper around system i/o cache" IndexReader plan. I can't see an answer yet. But the one thing I do know is that Lucy/KS needs a pluggable deletions mechanism to make experimentation easier -- so that's what I'm working on today.

> BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662097#action_12662097 ] Michael McCandless commented on LUCENE-1476: {quote} > It would be exposed as a combination reader writer that manages the > transaction status of each update. {quote} I think the transactions layer would also sit on top of this "realtime" layer? EG this "realtime" layer would expose a commit() method, and the transaction layer above it would maintain the transaction log, periodically calling commit() and truncating the transaction log? This "realtime" layer, then, would internally maintain a single IndexWriter and the readers. IndexWriter would flush (not commit) new segments into a RAMDir and yield its in-RAM SegmentInfos to IndexReader.reopen. MergePolicy periodically gets those into the real Directory. When reopening a reader we have the freedom to use old (already merged away) segments if the newly merged segment isn't warm yet. We "just" need to open some things up in IndexWriter:

* IndexReader.reopen with the in-RAM SegmentInfos
* Willingness to allow an IndexReader to maintain & update deleted docs even though IndexWriter has the write lock
* Access to segments that were already merged away (I think we could make a DeletionPolicy that pays attention to when the newly merged segment is not yet warmed and keeps the prior segments around). I think this'd require allowing DeletionPolicy to see "flush points" in addition to commit points (it doesn't today).

But I'm still hazy on the details on exactly how to open up IndexWriter. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662092#action_12662092 ] Michael McCandless commented on LUCENE-1476: {quote} > If Lucene crashed for some reason the transaction log would be replayed. {quote} I think the transaction log is useful for some applications, but could (should) be built as a separate (optional) layer entirely on top of Lucene's core. Ie, neither IndexWriter nor IndexReader need to be aware of the transaction log, which update belongs to which transaction, etc? > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662089#action_12662089 ] Michael McCandless commented on LUCENE-1476: {quote} > There's going to be a change rate that overwhelms the multi-file > commit system, and it seems that you've determined you're up against > it. {quote} Well... IndexWriter need not "commit" in order to allow a reader to see the files? Commit is for crash recovery, and for knowing when it's OK to delete prior commits. Simply writing the files (and not syncing them), and perhaps giving IndexReader.open the SegmentInfos to use directly (and not writing a segments_N via the filesystem) would allow us to search added docs without paying the cost of sync'ing all the files. Also: brand new, tiny segments should be written into a RAMDirectory and then merged over time into the real Directory. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents
[ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1479: -- Assignee: Michael McCandless > TrecDocMaker skips over documents when "Date" is missing from documents > --- > > Key: LUCENE-1479 > URL: https://issues.apache.org/jira/browse/LUCENE-1479 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/benchmark >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1479.patch > > > TrecDocMaker skips over Trec documents if they do not have a "Date" line. > When such a document is encountered, the code may skip over several documents > until the next tag that is searched for is found. > The result is, instead of reading ~25M documents from the GOV2 collection, > the code reads only ~23M (don't remember the actual numbers). > The fix adds a terminatingTag to read() such that the code looks for prefix, > but only until terminatingTag is found. Appropriate changes were made in > getNextDocData(). > Patch to follow -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents
[ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662073#action_12662073 ] Michael McCandless commented on LUCENE-1479: Shai, it seems like a doc that has no "Date: XXX" would leave dateStr as null and would then cause an NPE when parseDate is later called? Or am I missing something? Also I'm getting a compilation error:

{code}
[javac] Compiling 1 source file to /tango/mike/src/lucene.trecdocmaker/build/contrib/benchmark/classes/java
[javac] /tango/mike/src/lucene.trecdocmaker/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocMaker.java:190: variable name might not have been initialized
[javac]     String name = sb.substring(DOCNO.length(), name.indexOf(TERM_DOCNO, DOCNO.length()));
[javac]                                                ^
[javac] 1 error
{code}

> TrecDocMaker skips over documents when "Date" is missing from documents > --- > > Key: LUCENE-1479 > URL: https://issues.apache.org/jira/browse/LUCENE-1479 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/benchmark >Reporter: Shai Erera > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1479.patch > > > TrecDocMaker skips over Trec documents if they do not have a "Date" line. > When such a document is encountered, the code may skip over several documents > until the next tag that is searched for is found. > The result is, instead of reading ~25M documents from the GOV2 collection, > the code reads only ~23M (don't remember the actual numbers). > The fix adds a terminatingTag to read() such that the code looks for prefix, > but only until terminatingTag is found. Appropriate changes were made in > getNextDocData(). > Patch to follow -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
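Per Shai's follow-up elsewhere in this digest (the refactoring forgot to use 'sb' in the second indexOf call), the corrected line presumably reads:

{code}
String name = sb.substring(DOCNO.length(), sb.indexOf(TERM_DOCNO, DOCNO.length()));
{code}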
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662065#action_12662065 ] Marvin Humphrey commented on LUCENE-1476: -

Jason Rutherglen:

> I found in making the realtime search write speed fast enough that writing > to individual files per segment can become too costly (they accumulate fast, > appending to a single file is faster than creating new files, deleting the > files becomes costly).

I saw you mentioning i/o overhead on Windows in particular. I can't see a way to mod Lucene so that it doesn't generate a bunch of files for each commit, and FWIW Lucy/KS is going to generate even more files than Lucene. Half-seriously... how about writing a single-file Directory implementation?

> For example, writing to small individual files per commit, if the number of > segments is large and the delete spans multiple segments, will generate many > files.

There would be a maximum of two files per segment to hold the tombstones: one to hold the tombstone rows, and one to map segment identifiers to tombstone rows. (In Lucy/KS, the mappings would probably be stored in the JSON-encoded "segmeta" file, which stores human-readable metadata on behalf of multiple components.) Segments containing tombstones would be merged according to whatever merge policy was in place. So there won't ever be an obscene number of tombstone files unless you allow an obscene number of segments to accumulate.

> Many users may not want a transaction log as they may be storing the updates > in a separate SQL database instance (this is the case where I work) and so a > transaction log is redundant and should be optional.

I can see how this would be quite useful at the application level. However, I think it might be challenging to generalize the transaction log concept at the library level:

{code}
CustomAnalyzer analyzer = new CustomAnalyzer();
IndexWriter indexWriter = new IndexWriter(analyzer, "/path/to/index");
indexWriter.add(nextDoc());
analyzer.setFoo(2); // change of state not recorded by transaction log
indexWriter.add(nextDoc());
{code}

MySQL is more of a closed system than Lucene, which I think makes options available that aren't available to us.

> The reader stack is drained based on whether a reader is too old to be > useful anymore (i.e. no references to it, or it has N number of readers > ahead of it).

Right, this is the kind of thing that Lucene has to do because of the single-reader model, and that we're trying to get away from in Lucy/KS by exploiting mmap and making IndexReaders cheap wrappers around the system i/o cache. I don't think I can offer any alternative design suggestions that meet your needs. There's going to be a change rate that overwhelms the multi-file commit system, and it seems that you've determined you're up against it. What's killing us is something different: not absolute change rate, but poor worst-case performance. FWIW, we contemplated a multi-index system with an index on a RAM disk for fast changes and a primary index on the main file system. It would have worked fine for pure adds, but it was very tricky to manage state for documents which were being "updated", i.e. deleted and re-added. How are you handling all these small adds with your combo reader/writer? Do you not have that problem?
> BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1482) Replace infoSteram by a logging framework (SLF4J)
[ https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662044#action_12662044 ] Yonik Seeley commented on LUCENE-1482: -- It seems we should take into consideration the performance of a real logger (not the NOP logger), because real applications that already use SLF4J can't use the NOP adapter. Solr just switched to SLF4J, for example. > Replace infoSteram by a logging framework (SLF4J) > - > > Key: LUCENE-1482 > URL: https://issues.apache.org/jira/browse/LUCENE-1482 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, > slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar > > > Lucene makes use of infoStream to output messages in its indexing code only. > For debugging purposes, when the search application is run on the customer > side, getting messages from other code flows, like search, query parsing, > analysis etc can be extremely useful. > There are two main problems with infoStream today: > 1. It is owned by IndexWriter, so if I want to add logging capabilities to > other classes I need to either expose an API or propagate infoStream to all > classes (see for example DocumentsWriter, which receives its infoStream > instance from IndexWriter). > 2. I can either turn debugging on or off, for the entire code. > Introducing a logging framework can allow each class to control its logging > independently, and more importantly, allows the application to turn on > logging for only specific areas in the code (i.e., org.apache.lucene.index.*). > I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, > as its name states, a facade over different logging frameworks. As such, you > can include the slf4j.jar in your application, and it recognizes at deploy > time what is the actual logging framework you'd like to use. SLF4J comes with > several adapters for Java logging, Log4j and others. If you know your > application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in > your classpath, and your logging statements will use Java logging underneath > the covers. > This makes the logging code very simple. For a class A the logger will be > instantiated like this: > public class A { > private static final Logger logger = LoggerFactory.getLogger(A.class); > } > And will later be used like this: > public class A { > private static final Logger logger = LoggerFactory.getLogger(A.class); > public void foo() { > if (logger.isDebugEnabled()) { > logger.debug("message"); > } > } > } > That's all! > Checking for isDebugEnabled is very quick, at least using the JDK14 adapter > (but I assume it's fast also over other logging frameworks). > The important thing is, every class controls its own logger. Not all classes > have to output logging messages, and we can improve Lucene's logging > gradually, w/o changing the API, by adding more logging messages to > interesting classes. > I will submit a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1314) IndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662043#action_12662043 ] Jason Rutherglen commented on LUCENE-1314: -- I ran this in Eclipse on Mac OS X on a 4-core box (the cores are significant because of the threads). I ran TestIndexReaderReopen.testThreadSafety twice in debug mode and it worked; I thought debug mode was keeping the bug from reproducing, so I tried just running the test, and it passed again. The 5th run gave an error in debug mode. The test case fails consistently when SegmentReader.reopenSegment ends with success == false and decRef is called afterwards in the finally clause. It seems that calling this decRef on the newly cloned object causes the assertion error, which is possibly related to threading - probably because the decRef on the failed clone decrements a deletedDocsRef used by another reader one too many times, causing the following assertion error. I'm not sure if this is a real bug or an issue that the test case should ignore. {code} java.lang.AssertionError at org.apache.lucene.index.SegmentReader$Ref.decRef(SegmentReader.java:104) at org.apache.lucene.index.SegmentReader.decRef(SegmentReader.java:249) at org.apache.lucene.index.MultiSegmentReader.doClose(MultiSegmentReader.java:413) at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:157) at org.apache.lucene.index.IndexReader.close(IndexReader.java:990) at org.apache.lucene.index.TestIndexReaderReopen$9.run(TestIndexReaderReopen.java:703) at org.apache.lucene.index.TestIndexReaderReopen$ReaderThread.run(TestIndexReaderReopen.java:818) {code} > IndexReader.clone > - > > Key: LUCENE-1314 > URL: https://issues.apache.org/jira/browse/LUCENE-1314 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.3.1 >Reporter: Jason Rutherglen >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, > LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, > LUCENE-1314.patch, LUCENE-1314.patch, lucene-1314.patch, lucene-1314.patch, > lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, > lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, > lucene-1314.patch, lucene-1314.patch > > > Based on discussion > http://www.nabble.com/IndexReader.reopen-issue-td18070256.html. The problem > is reopen returns the same reader if there are no changes, so if docs are > deleted from the new reader, they are also reflected in the previous reader > which is not always desired behavior. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662038#action_12662038 ] markrmil...@gmail.com edited comment on LUCENE-1483 at 1/8/09 9:15 AM: - It's the ORD_SUBORD again (which I don't think we will use) and the two Policies. Odd because it's the last hit of 10 that fails for all 3. I'll ferret it out tonight. - Mark *EDIT* yup... it's always the last entry that's wrong, no matter the queue size - for all 3, which is odd because ORD_SUBORD doesn't have too much of a relationship to the two policies. Will be a fun one. was (Author: markrmil...@gmail.com): It's the ORD_SUBORD again (which I don't think we will use) and the two Policies. Odd because it's the last hit of 10 that fails for all 3. I'll ferret it out tonight. - Mark > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > sortBench.py, sortCollate.py > > > FieldCache and Filters are forced down to a single segment reader, allowing > for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1482) Replace infoSteram by a logging framework (SLF4J)
[ https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662039#action_12662039 ] Shai Erera commented on LUCENE-1482: Grant, given what I wrote below about having Lucene use the NOP adapter, are you still worried about the performance implications? If there is a general reluctance to add a dependency on SLF4J, can we review the other options I suggested - using infoStream as a class with static methods? That at least will allow adding more prints from other classes, w/o changing their API. I prefer SLF4J because IMO logging is important, but having infoStream as a service class is better than what exists today (and I don't believe anyone can argue that calling a static method has any significant performance implications, if any at all). If the committers want to drop this issue, please let me know and I'll close it. I don't like to nag :-) > Replace infoSteram by a logging framework (SLF4J) > - > > Key: LUCENE-1482 > URL: https://issues.apache.org/jira/browse/LUCENE-1482 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1482-2.patch, LUCENE-1482.patch, > slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar > > > Lucene makes use of infoStream to output messages in its indexing code only. > For debugging purposes, when the search application is run on the customer > side, getting messages from other code flows, like search, query parsing, > analysis etc can be extremely useful. > There are two main problems with infoStream today: > 1. It is owned by IndexWriter, so if I want to add logging capabilities to > other classes I need to either expose an API or propagate infoStream to all > classes (see for example DocumentsWriter, which receives its infoStream > instance from IndexWriter). > 2. I can either turn debugging on or off, for the entire code. > Introducing a logging framework can allow each class to control its logging > independently, and more importantly, allows the application to turn on > logging for only specific areas in the code (i.e., org.apache.lucene.index.*). > I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, > as its name states, a facade over different logging frameworks. As such, you > can include the slf4j.jar in your application, and it recognizes at deploy > time what is the actual logging framework you'd like to use. SLF4J comes with > several adapters for Java logging, Log4j and others. If you know your > application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in > your classpath, and your logging statements will use Java logging underneath > the covers. > This makes the logging code very simple. For a class A the logger will be > instantiated like this: > public class A { > private static final Logger logger = LoggerFactory.getLogger(A.class); > } > And will later be used like this: > public class A { > private static final Logger logger = LoggerFactory.getLogger(A.class); > public void foo() { > if (logger.isDebugEnabled()) { > logger.debug("message"); > } > } > } > That's all! > Checking for isDebugEnabled is very quick, at least using the JDK14 adapter > (but I assume it's fast also over other logging frameworks). > The important thing is, every class controls its own logger. Not all classes > have to output logging messages, and we can improve Lucene's logging > gradually, w/o changing the API, by adding more logging messages to > interesting classes. 
> I will submit a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
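For reference, one possible shape of the "infoStream as a class with static methods" fallback mentioned above. This is a hypothetical sketch only; the class name and methods are illustrative and do not come from the attached patch:
{code}
import java.io.PrintStream;

public final class InfoStream {
  private static volatile PrintStream stream; // null means messaging is off

  private InfoStream() {}

  public static void setStream(PrintStream s) { stream = s; }
  public static boolean isEnabled() { return stream != null; }

  public static void message(String component, String message) {
    PrintStream s = stream;
    if (s != null) {
      s.println(component + ": " + message);
    }
  }
}
{code}
Any class could then emit messages without API changes, e.g. if (InfoStream.isEnabled()) InfoStream.message("IW", "flush triggered");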
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662038#action_12662038 ] Mark Miller commented on LUCENE-1483: - It's the ORD_SUBORD again (which I don't think we will use) and the two Policies. Odd because it's the last hit of 10 that fails for all 3. I'll ferret it out tonight. - Mark > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > sortBench.py, sortCollate.py > > > FieldCache and Filters are forced down to a single segment reader, allowing > for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662033#action_12662033 ] Jason Rutherglen commented on LUCENE-1476: -- Marvin: "The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed." It does improve the write speed. I found, in making the realtime search write speed fast enough, that writing to individual files per segment can become too costly (they accumulate fast, appending to a single file is faster than creating new files, deleting the files becomes costly). For example, when writing small individual files per commit, if the number of segments is large and a delete spans multiple segments, many files will be generated. This is variable based on how often the updates are expected to occur. I modeled this after the extreme case of the frequency of updates of a MySQL instance backing data for a web application. The MySQL design, translated to Lucene, is a transaction log per index, where the updates, consisting of documents and deletes, are written to the transaction log file. If Lucene crashed for some reason, the transaction log would be replayed. The in-memory indexes and newly deleted document bitvectors would be held in RAM (LUCENE-1314) until flushed, either manually or based on memory usage. Many users may not want a transaction log, as they may be storing the updates in a separate SQL database instance (this is the case where I work), and so a transaction log is redundant and should be optional. The first implementation of this will not have a transaction log. Marvin: "I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?" It would be exposed as a combination reader/writer that manages the transaction status of each update. The internal architecture is such that after each update, a new reader representing the new documents and deletes for the transaction is generated and put onto a stack. The reader stack is drained based on whether a reader is too old to be useful anymore (i.e. no references to it, or it has N readers ahead of it). > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
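A minimal sketch of the transaction-log flow described above. Every name here (TransactionLog, Op, replayFrom) is hypothetical, invented for illustration; only Term is real Lucene API:
{code}
// Hypothetical API, for illustration only.
TransactionLog txLog = new TransactionLog(new File("/index/translog"));

// Normal operation: append each update to the log before applying it to the
// in-memory index and the deleted-docs bitvector.
txLog.append(Op.add(doc));
txLog.append(Op.delete(new Term("id", "42")));

// Crash recovery: replay everything appended since the last successful flush.
for (Op op : txLog.replayFrom(lastFlushedGeneration)) {
  op.applyTo(ramIndex, deletedDocs);
}
{code}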
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662028#action_12662028 ] Mark Miller commented on LUCENE-1483: - bq. It runs legacy vs new sort and asserts that they are the same. Clever. Very good idea. I'll fix it up. Also, if you have any ideas about what Policies you want to start with, I'd be happy to push those around a bit too. > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > sortBench.py, sortCollate.py > > > FieldCache and Filters are forced down to a single segment reader, allowing > for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1483: --- Attachment: LUCENE-1483.patch Attached full patch (though you'll get failed hunks because of the annoying $Id$ expansion problem). I fixed various small issues, and added a new TestStressSort test. It runs legacy vs new sort and asserts that they are the same. It is currently failing... but I haven't spent any time digging into why. Mark, could you dig and try to figure out why it's failing? I think we should resolve it before running (or, trusting) perf tests. Also: I wonder if we can remove the null checking in the compare methods for String*Comparator? EG maybe we need new FieldCache.getString{s,Index} methods that optionally take a "fillNulls" param, and if true nulls are replaced with empty string? However... that would unfortunately cause a difference whereby "" would be equal to null (whereas now null sorts ahead of ""), which is not back compatible. I guess we could make a "non-null" comparator and use it whenever it's known there are no nulls in the FieldCache array. It may not be worth the hassle. If the value is never null, the CPU will predict the branch correctly every time, so the penalty should be small (yet non-zero!). > Change IndexSearcher multisegment searches to search each individual segment > using a single HitCollector > > > Key: LUCENE-1483 > URL: https://issues.apache.org/jira/browse/LUCENE-1483 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.9 >Reporter: Mark Miller >Priority: Minor > Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, > sortBench.py, sortCollate.py > > > FieldCache and Filters are forced down to a single segment reader, allowing > for individual segment reloading on reopen. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
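To make the null-checking cost concrete, here is a rough sketch of the two comparator variants being weighed (illustrative code, not taken from the patch):
{code}
// Null-tolerant compare: null sorts ahead of "" (the current behavior).
int compare(String a, String b) {
  if (a == null) return b == null ? 0 : -1;
  if (b == null) return 1;
  return a.compareTo(b);
}

// Non-null variant: only usable when the FieldCache array is known to
// contain no nulls; it skips the branches entirely.
int compareNonNull(String a, String b) {
  return a.compareTo(b);
}
{code}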
[jira] Resolved: (LUCENE-1497) Minor changes to SimpleHTMLFormatter
[ https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1497. Resolution: Fixed Fix Version/s: (was: 2.4.1) Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed revision 732739. Thanks Shai! > Minor changes to SimpleHTMLFormatter > > > Key: LUCENE-1497 > URL: https://issues.apache.org/jira/browse/LUCENE-1497 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Reporter: Shai Erera >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1497.patch > > > I'd like to make few minor changes to SimpleHTMLFormatter. > 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default > constructor. This will not trigger String lookups by the JVM whenever the > highlighter is instantiated. > 2. Create the StringBuffer in highlightTerm with the right number of > characters from the beginning. Even though StringBuffer's default constructor > allocates 16 chars, which will probably be enough for most highlighted terms > (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's > better to allocate SB with the right # of chars in advance, to avoid char[] > allocations in the middle. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1497) Minor changes to SimpleHTMLFormatter
[ https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662020#action_12662020 ] Michael McCandless commented on LUCENE-1497: Ahh, OK, then let's leave your approach (dedicated single StringBuffer). I'll commit shortly. > Minor changes to SimpleHTMLFormatter > > > Key: LUCENE-1497 > URL: https://issues.apache.org/jira/browse/LUCENE-1497 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Reporter: Shai Erera >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1497.patch > > > I'd like to make few minor changes to SimpleHTMLFormatter. > 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default > constructor. This will not trigger String lookups by the JVM whenever the > highlighter is instantiated. > 2. Create the StringBuffer in highlightTerm with the right number of > characters from the beginning. Even though StringBuffer's default constructor > allocates 16 chars, which will probably be enough for most highlighted terms > (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's > better to allocate SB with the right # of chars in advance, to avoid char[] > allocations in the middle. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1497) Minor changes to SimpleHTMLFormatter
[ https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662004#action_12662004 ] Shai Erera commented on LUCENE-1497: If I understand you correctly, you propose to change the code to: preTag + originalText + postTag. That creates 2 (or 3) StringBuffers actually. Java implements + by allocating a StringBuffer and appending both Strings to it. What I propose is to create the StringBuffer large enough from the beginning such that there won't be additional allocations. > Minor changes to SimpleHTMLFormatter > > > Key: LUCENE-1497 > URL: https://issues.apache.org/jira/browse/LUCENE-1497 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Reporter: Shai Erera >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1497.patch > > > I'd like to make few minor changes to SimpleHTMLFormatter. > 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default > constructor. This will not trigger String lookups by the JVM whenever the > highlighter is instantiated. > 2. Create the StringBuffer in highlightTerm with the right number of > characters from the beginning. Even though StringBuffer's default constructor > allocates 16 chars, which will probably be enough for most highlighted terms > (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's > better to allocate SB with the right # of chars in advance, to avoid char[] > allocations in the middle. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
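Concretely, the presized version described in the issue would look something like this (a sketch; the preTag/postTag field names are assumed from the description, and the signature follows the highlighter's Formatter interface):
{code}
public String highlightTerm(String originalText, TokenGroup tokenGroup) {
  if (tokenGroup.getTotalScore() <= 0) {
    return originalText;
  }
  // Size the buffer exactly once, so append() never reallocates the char[].
  StringBuffer sb = new StringBuffer(
      preTag.length() + originalText.length() + postTag.length());
  sb.append(preTag);
  sb.append(originalText);
  sb.append(postTag);
  return sb.toString();
}
{code}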
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661998#action_12661998 ] Mark Miller commented on LUCENE-1476: - bq. I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no? That's a fairly old version, I think (based on using MultiSearcher as a hack). Now we are using one queue and running it through each subreader of the MultiReader. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661995#action_12661995 ] Marvin Humphrey commented on LUCENE-1476: - Mike McCandless: > For a TermQuery (one term) the cost of the two approaches should be > the same. It'll be close, but I don't think that's quite true. TermScorer pre-fetches document numbers in batches from the TermDocs object. At present, only non-deleted doc nums get cached. If we move the deletions filtering up, then we'd increase traffic through that cache. However, filling it would be slightly cheaper, because we wouldn't be performing the deletions check. In theory. I'm not sure there's a way to streamline away that deletions check in TermDocs and maintain backwards compatibility. And while this is a fun brainstorm, I'm still far from convinced that having TermDocs.next() and Scorer.next() return deleted docs by default is a good idea. > For AND (and other) queries I'm not sure. In theory, having to > process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery > may see much higher net cost. If you were applying deletions filtering after Scorer.next(), then it seems likely that costs would go up because of extra hit processing. However, if you use Scorer.skipTo() to jump past deletions, as in the loop I provided above, then PhraseScorer etc. shouldn't incur any more costs themselves. > a costly per-docID search > with a very restrictive filter could be far more efficient if you > applied the Filter earlier in the chain. Under the skipTo() loop, I think the filter effectively *does* get applied earlier in the chain. Does that make sense? I think the potential performance downside comes down to prefetching in TermScorer, unless there are other classes that do similar prefetching. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
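The skipTo() loop referred to above appears in an earlier comment that is not quoted here; a hedged reconstruction of the idea, using the 2.4-era Scorer/HitCollector API, might look like:
{code}
// The Scorer knows nothing about deletions; the caller skips it past any
// deleted doc it lands on, so deletions act as a quasi-filter.
boolean more = scorer.next();
while (more) {
  int doc = scorer.doc();
  if (deletedDocs.get(doc)) {       // random-access deletions check
    more = scorer.skipTo(doc + 1);  // jump past the deleted doc
    continue;
  }
  hitCollector.collect(doc, scorer.score());
  more = scorer.next();
}
{code}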
[jira] Commented: (LUCENE-1497) Minor changes to SimpleHTMLFormatter
[ https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661992#action_12661992 ] Michael McCandless commented on LUCENE-1497: In fact I think it may be faster to not even use StringBuffer in highlightTerm? Since we know we are concatenating 3 strings can we just + them? I suspect that'd give better net performance (pure speculation!). > Minor changes to SimpleHTMLFormatter > > > Key: LUCENE-1497 > URL: https://issues.apache.org/jira/browse/LUCENE-1497 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Reporter: Shai Erera >Priority: Minor > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1497.patch > > > I'd like to make few minor changes to SimpleHTMLFormatter. > 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default > constructor. This will not trigger String lookups by the JVM whenever the > highlighter is instantiated. > 2. Create the StringBuffer in highlightTerm with the right number of > characters from the beginning. Even though StringBuffer's default constructor > allocates 16 chars, which will probably be enough for most highlighted terms > (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's > better to allocate SB with the right # of chars in advance, to avoid char[] > allocations in the middle. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-1497) Minor changes to SimpleHTMLFormatter
[ https://issues.apache.org/jira/browse/LUCENE-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1497: -- Assignee: Michael McCandless > Minor changes to SimpleHTMLFormatter > > > Key: LUCENE-1497 > URL: https://issues.apache.org/jira/browse/LUCENE-1497 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Reporter: Shai Erera >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1497.patch > > > I'd like to make few minor changes to SimpleHTMLFormatter. > 1. Define DEFAULT_PRE_TAG and DEFAULT_POST_TAG and use them in the default > constructor. This will not trigger String lookups by the JVM whenever the > highlighter is instantiated. > 2. Create the StringBuffer in highlightTerm with the right number of > characters from the beginning. Even though StringBuffer's default constructor > allocates 16 chars, which will probably be enough for most highlighted terms > (pre + post tags are 7 chars, which leaves 9 chars for terms), I think it's > better to allocate SB with the right # of chars in advance, to avoid char[] > allocations in the middle. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661982#action_12661982 ] Marvin Humphrey commented on LUCENE-1476: - How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) instead of a true iterator that keeps state? You can do that now by implementing BitVector.nextSetBit(int tick) and using that in TermDocs to set a nextDeletion member var instead of checking every doc num with BitVector.get(). That way, the object that provides deletions can still be shared. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
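A sketch of that idea, assuming a BitVector.nextSetBit(int) modeled on java.util.BitSet.nextSetBit(int) were added (no such method exists in Lucene today):
{code}
// Inside TermDocs: keep a cursor into the deletions instead of calling
// deletedDocs.get() on every doc num. The BitVector itself stays stateless,
// so it can still be shared between readers.
private int nextDeletion = deletedDocs.nextSetBit(0); // -1 if no deletions

private boolean isDeleted(int doc) {
  if (nextDeletion != -1 && nextDeletion < doc) {
    nextDeletion = deletedDocs.nextSetBit(doc); // advance the cursor
  }
  return nextDeletion == doc;
}
{code}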
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661977#action_12661977 ] Marvin Humphrey commented on LUCENE-1476: - Paul Elschot: > How about a SegmentSearcher? I like the idea of a SegmentSearcher in general. A little while back, I wondered whether exposing SegmentReaders was really the best way to handle segment-centric search. Upon reflection, I think it is. Segments are a good unit. They're pure inverted indexes (notwithstanding doc stores and tombstones); the larger composite only masquerades as one. I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no? And then, to respond to the original suggestion, the SegmentSearcher level seems like a good place to handle application of a deletions quasi-filter. I think we could avoid having to deal with segment-start offsets that way. > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader
[ https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661956#action_12661956 ] Robert Newson commented on LUCENE-1510: --- Looks good to me. I wonder if you should add: private static final byte[] EMPTY = new byte[0]; and refer to that, as your todo suggests? > InstantiatedIndexReader throws NullPointerException in norms() when used with > a MultiReader > --- > > Key: LUCENE-1510 > URL: https://issues.apache.org/jira/browse/LUCENE-1510 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Robert Newson >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: TestWithMultiReader.java > > > When using InstantiatedIndexReader under a MultiReader where the other Reader > contains documents, a NullPointerException is thrown here; > public void norms(String field, byte[] bytes, int offset) throws IOException > { > byte[] norms = > getIndex().getNormsByFieldNameAndDocumentNumber().get(field); > System.arraycopy(norms, 0, bytes, offset, norms.length); > } > the 'norms' variable is null. Performing the copy only when norms is not null > does work, though I'm sure it's not the right fix. > java.lang.NullPointerException > at > org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297) > at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273) > at > org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112) > at org.apache.lucene.search.Searcher.search(Searcher.java:136) > at org.apache.lucene.search.Searcher.search(Searcher.java:146) > at > org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at junit.framework.TestCase.runTest(TestCase.java:164) > at junit.framework.TestCase.runBare(TestCase.java:130) > at junit.framework.TestResult$1.protect(TestResult.java:106) > at junit.framework.TestResult.runProtected(TestResult.java:124) > at junit.framework.TestResult.run(TestResult.java:109) > at junit.framework.TestCase.run(TestCase.java:120) > at junit.framework.TestSuite.runTest(TestSuite.java:230) > at junit.framework.TestSuite.run(TestSuite.java:225) > at > org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130) > at > org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
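Combining the workaround from the report with the suggested shared constant gives something like this (per the TODO, possibly still not the "right" fix, but it avoids both the NPE and a per-call allocation):
{code}
private static final byte[] EMPTY = new byte[0];

public void norms(String field, byte[] bytes, int offset) throws IOException {
  byte[] norms = getIndex().getNormsByFieldNameAndDocumentNumber().get(field);
  if (norms == null) {
    norms = EMPTY; // field has no norms in this reader; copy nothing
  }
  System.arraycopy(norms, 0, bytes, offset, norms.length);
}
{code}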
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661944#action_12661944 ] Paul Elschot commented on LUCENE-1476: -- bq. To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level. How about a SegmentSearcher? > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661934#action_12661934 ] Michael McCandless commented on LUCENE-1476: {quote} > PostingList would be completely ignorant of deletions, as would classes like > NOTScorer and MatchAllScorer: {quote} This is a neat idea! Deletions are then applied just like a Filter. For a TermQuery (one term) the cost of the two approaches should be the same. For OR'd Term queries, it actually seems like your proposed approach may be lower cost? Ie rather than each TermDocs doing the "AND NOT deleted" intersection, you only do it once at the top. There is added cost in that each TermDocs is now returning more docIDs than before, but the deleted ones are eliminated before scoring. For AND (and other) queries I'm not sure. In theory, having to process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery may see much higher net cost. We should test. Conceivably, a future "search optimization phase" could pick & choose the best point to inject the "AND NOT deleted" filter. In fact, it could also pick when to inject a Filter... a costly per-docID search with a very restrictive filter could be far more efficient if you applied the Filter earlier in the chain. I'm also curious what cost you see of doing the merge sort for every search; I think it could be uncomfortably high, since it's so branch-intensive and hard for the CPU to predict. We could take the first search that doesn't use skipTo and save the result of the merge sort, essentially doing an in-RAM-only "merge" of those deletes, and let subsequent searches use that single merged stream. (This is not MMAP friendly, though). In my initial rough testing, I switched to an iterator API for SegmentTermDocs and found that if the percentage of deletes was < 10%, the search was a bit faster using an iterator vs random access, but above that it was slower. This was with an already "merged" list of in-order docIDs. Switching to an iterator API for accessing field values for many docs (LUCENE-831 -- new FieldCache API, LUCENE-1231 -- column stride fields) shouldn't have this same problem since it's the "top level" that's accessing the values (ie, one iterator per field X query). > BitVector implement DocIdSet > > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
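As a rough illustration of "only do it once at the top" for an OR query (2.4-era API, names illustrative): the per-term TermDocs streams skip their own deletion checks, and a single check runs on each candidate the union scorer produces:
{code}
// unionScorer ORs together TermDocs streams that are ignorant of deletions;
// one top-level check replaces one check per term per doc.
while (unionScorer.next()) {
  int doc = unionScorer.doc();
  if (!deletedDocs.get(doc)) {
    hitCollector.collect(doc, unionScorer.score());
  }
}
{code}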
Re: [jira] Commented: (LUCENE-1476) BitVector implement DocIdSet
robert engels wrote: Then why not always write segment_N.del, where N is incremented. This is what Lucene does today. It's "write once". Each file may be compressed or uncompressed based on the number of deletions it contains. Lucene also does this. Still, as Marvin pointed out, the cost of committing a delete is in proportion to either the number of deletes already on the segment (if written sparse) or the number of documents in the segment (if written non-sparse). It doesn't scale well... though the constant factor may be very small (ie it may not matter that much in practice?). With tombstones the commit cost would be in proportion to how many deletes you did (scales perfectly), at the expense of added per-search cost and search iterator state. For realtime search this could be a good tradeoff to make (lower latency on add/delete -> refreshed searcher, at higher per-search cost), but... in the realtime search discussion we are now thinking that the deletes live with the reader and are carried in RAM over to the reopened reader (LUCENE-1314), bypassing having to commit to the filesystem at all. One downside to this is that it's single-JRE only, ie to do distributed realtime search you'd have to also re-apply the deletes to the head IndexReader on each JRE. (Whereas added docs would be written with a single IndexWriter and propagated via the filesystem.) If we go forward with this model then indeed slowish commit times for new deletes are less important, since they're for crash recovery and not for opening a new reader. But we'd have many "control" issues to work through... eg how the reader can re-open against old segments right after a new merge is committed (because the newly merged segment isn't warmed yet), and how IndexReader can open segments written by the writer but not truly committed (sync'd). Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader
[ https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1510. --- Resolution: Fixed Fix Version/s: 2.9 > InstantiatedIndexReader throws NullPointerException in norms() when used with > a MultiReader > --- > > Key: LUCENE-1510 > URL: https://issues.apache.org/jira/browse/LUCENE-1510 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Robert Newson >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: TestWithMultiReader.java > > > When using InstantiatedIndexReader under a MultiReader where the other Reader > contains documents, a NullPointerException is thrown here; > public void norms(String field, byte[] bytes, int offset) throws IOException > { > byte[] norms = > getIndex().getNormsByFieldNameAndDocumentNumber().get(field); > System.arraycopy(norms, 0, bytes, offset, norms.length); > } > the 'norms' variable is null. Performing the copy only when norms is not null > does work, though I'm sure it's not the right fix. > java.lang.NullPointerException > at > org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297) > at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273) > at > org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112) > at org.apache.lucene.search.Searcher.search(Searcher.java:136) > at org.apache.lucene.search.Searcher.search(Searcher.java:146) > at > org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at junit.framework.TestCase.runTest(TestCase.java:164) > at junit.framework.TestCase.runBare(TestCase.java:130) > at junit.framework.TestResult$1.protect(TestResult.java:106) > at junit.framework.TestResult.runProtected(TestResult.java:124) > at junit.framework.TestResult.run(TestResult.java:109) > at junit.framework.TestCase.run(TestCase.java:120) > at junit.framework.TestSuite.runTest(TestSuite.java:230) > at junit.framework.TestSuite.run(TestSuite.java:225) > at > org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130) > at > org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader
[ https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661908#action_12661908 ] Karl Wettin commented on LUCENE-1510: - Thanks for the report, Robert! I've committed a fix in revision 732661. Please check it out and let me know how it works for you. There were a few discrepancies between how InstantiatedIndexReader handled null norms compared to SegmentReader. I think these problems are fixed now. > InstantiatedIndexReader throws NullPointerException in norms() when used with > a MultiReader > --- > > Key: LUCENE-1510 > URL: https://issues.apache.org/jira/browse/LUCENE-1510 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Robert Newson >Assignee: Karl Wettin > Attachments: TestWithMultiReader.java > > > When using InstantiatedIndexReader under a MultiReader where the other Reader > contains documents, a NullPointerException is thrown here; > public void norms(String field, byte[] bytes, int offset) throws IOException > { > byte[] norms = > getIndex().getNormsByFieldNameAndDocumentNumber().get(field); > System.arraycopy(norms, 0, bytes, offset, norms.length); > } > the 'norms' variable is null. Performing the copy only when norms is not null > does work, though I'm sure it's not the right fix. > java.lang.NullPointerException > at > org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297) > at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273) > at > org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112) > at org.apache.lucene.search.Searcher.search(Searcher.java:136) > at org.apache.lucene.search.Searcher.search(Searcher.java:146) > at > org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at junit.framework.TestCase.runTest(TestCase.java:164) > at junit.framework.TestCase.runBare(TestCase.java:130) > at junit.framework.TestResult$1.protect(TestResult.java:106) > at junit.framework.TestResult.runProtected(TestResult.java:124) > at junit.framework.TestResult.run(TestResult.java:109) > at junit.framework.TestCase.run(TestCase.java:120) > at junit.framework.TestSuite.runTest(TestSuite.java:230) > at junit.framework.TestSuite.run(TestSuite.java:225) > at > org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130) > at > org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org