Re: Realtime Search
Jason Rutherglen jason.rutherg...@gmail.com wrote:

>> We'd also need to ensure when a merge kicks off, the SegmentReaders used by the merging are not newly reopened but also borrowed from
>
> The IW merge code currently opens the SegmentReader with a 4096 buffer size (different than the 1024 default); how will this case be handled?

I think we'd just use 1024 when merging.

>> reopen would then flush any added docs to new segments
>
> IR.reopen would call IW.flush?

I think it has to? (Whether it is IR.reopen, or a class that sits on top of both IR & IW, I'm not sure.) Ie, the interface would be: you add/delete/updateDoc, setNorm a bunch of times, during which none of these changes are visible to your currently open reader, followed by reopen to get a reader that then sees those changes? (This is all still brainstorming at this point, of course.)

>> When IW.commit is called, it also then asks each SegmentReader to commit. Ie, IR.commit would not be used.
>
> Why is this? SegmentReader.commitChanges would be called instead?

Because IR.commit is doing other stuff (invoking the deletion policy, syncing newly referenced files, writing the new segments file, rollback logic on hitting an exception, etc.) that overlaps what IW.commit also does. It'd be great to factor this common stuff out so IW and IR would share a single source. (Yes, SR.commitChanges would be called directly, I think.)

>> Then when reopen is called, we must internally reopen that clone() such that its deleted docs are carried over to the newly reopened reader and newly flushed docs from IW are visible as new SegmentReaders.
>
> If deletes are made to the external reader (meaning the one obtained by IW.getReader), then deletes are made via IW.deleteDocument, then reopen is called, what happens in this case? We will need to merge the del docs from the internal clone into the newly reopened reader?

I guess we could merge them. Ie, deletes made through the reader (by docID) are immediately visible, but through the writer are buffered until a flush or reopen? Still, I don't like exposing two ways to do deletions, with two different behaviours (buffered or not). It's weird. Maybe, instead, all deletes done via IW would be immediate? It seems like either 1) all deletes are buffered until reopen, or 2) all deletes are immediately materialized. I think half/half is too strange.

>> the IR becomes transactional as well -- deletes are not visible immediately until reopen is called
>
> Interesting. I'd rather somehow merge the IW and external reader's deletes, otherwise it seems like we're radically changing how IR works. Perhaps the IW keeps a copy of the external IR that has the write lock (thinking of IR.clone, where the write lock is passed onto the latest clone). This way IW.getReader is about the same as reopen/clone (because it will call reopen on presumably the latest IR).

We'd only be radically changing how the RealTimeReader works. I think the initial approach here might be to simply open up enough package-private APIs or subclass-ability on IR and IW so that we can experiment with these realtime ideas. Then we iterate w/ different experiments to see how things flesh out...

Actually, could you redo LUCENE-1516 now that LUCENE-1314 is in?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
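The option (1) semantics discussed above -- all adds and deletes buffered inside the writer until reopen materializes them into a new point-in-time reader -- can be sketched with a toy model. Class and method names here are made up for illustration; this is not Lucene's API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model: adds and deletes are buffered in the writer and become
// visible only when reopen() hands back a new point-in-time reader.
public class ToyRealtimeIndex {
    private final List<String> committedDocs = new ArrayList<>();
    private final Set<Integer> committedDeletes = new HashSet<>();
    private final List<String> bufferedDocs = new ArrayList<>();
    private final Set<Integer> bufferedDeletes = new HashSet<>();

    public void addDocument(String doc) { bufferedDocs.add(doc); }
    public void deleteDocument(int docId) { bufferedDeletes.add(docId); }

    /** Point-in-time snapshot; unaffected by later writer changes. */
    public static class Reader {
        private final List<String> docs;
        private final Set<Integer> deletes;
        Reader(List<String> docs, Set<Integer> deletes) {
            this.docs = new ArrayList<>(docs);
            this.deletes = new HashSet<>(deletes);
        }
        public int numDocs() { return docs.size() - deletes.size(); }
    }

    /** Materialize buffered adds/deletes, then return a fresh reader. */
    public Reader reopen() {
        committedDocs.addAll(bufferedDocs);
        committedDeletes.addAll(bufferedDeletes);
        bufferedDocs.clear();
        bufferedDeletes.clear();
        return new Reader(committedDocs, committedDeletes);
    }

    public static void main(String[] args) {
        ToyRealtimeIndex idx = new ToyRealtimeIndex();
        idx.addDocument("doc0");
        idx.addDocument("doc1");
        Reader r1 = idx.reopen();  // sees both adds
        idx.deleteDocument(0);     // buffered: r1 still sees 2 docs
        Reader r2 = idx.reopen();  // materializes the delete
        System.out.println(r1.numDocs() + " " + r2.numDocs()); // 2 1
    }
}
```

The alternative, option (2), would apply the delete to the committed state immediately, making it visible to readers without waiting for reopen.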
Re: Realtime Search
> deletes made through reader (by docID) are immediately visible, but through writer are buffered until a flush or reopen?

This is what I was thinking: IW buffers deletes, IR does not. Making IW deletes visible immediately, by applying them to the IR, makes sense as well. What should be the behavior of IW.updateDocument?

LUCENE-1314 is in, and we've agreed IR.reopen causes an IW.flush, so I'll continue the LUCENE-1516 patch.
Re: Realtime Search
> We'd also need to ensure when a merge kicks off, the SegmentReaders used by the merging are not newly reopened but also borrowed from

The IW merge code currently opens the SegmentReader with a 4096 buffer size (different than the 1024 default); how will this case be handled?

> reopen would then flush any added docs to new segments

IR.reopen would call IW.flush?

> When IW.commit is called, it also then asks each SegmentReader to commit. Ie, IR.commit would not be used.

Why is this? SegmentReader.commitChanges would be called instead?

> Then when reopen is called, we must internally reopen that clone() such that its deleted docs are carried over to the newly reopened reader and newly flushed docs from IW are visible as new SegmentReaders.

If deletes are made to the external reader (meaning the one obtained by IW.getReader), then deletes are made via IW.deleteDocument, then reopen is called, what happens in this case? We will need to merge the del docs from the internal clone into the newly reopened reader?

> the IR becomes transactional as well -- deletes are not visible immediately until reopen is called

Interesting. I'd rather somehow merge the IW and external reader's deletes, otherwise it seems like we're radically changing how IR works. Perhaps the IW keeps a copy of the external IR that has the write lock (thinking of IR.clone, where the write lock is passed onto the latest clone). This way IW.getReader is about the same as reopen/clone (because it will call reopen on presumably the latest IR).
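The shared-write-lock idea threaded through this exchange can be modeled in miniature: the writer takes the index's single write lock, the reader it hands out piggybacks on that same lock, and an independent second writer is refused. All names here are hypothetical; Lucene's real lock is a file-based Directory lock, approximated below with an in-memory flag:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy sketch: the joint IR/IW pair shares one write lock, so both may
// mutate the index, while a second independent writer is locked out.
public class SharedWriteLockDemo {
    static class Index { final AtomicBoolean writeLock = new AtomicBoolean(false); }

    static class Writer {
        final Index index;
        Writer(Index index) {
            this.index = index;
            // Non-reentrant acquire, like Lucene's write.lock file.
            if (!index.writeLock.compareAndSet(false, true))
                throw new IllegalStateException("index locked by another writer");
        }
        /** Reader handed out by this writer piggybacks on the same lock. */
        Reader getReader() { return new Reader(index); }
        void close() { index.writeLock.set(false); }
    }

    static class Reader {
        final Index index;
        Reader(Index index) { this.index = index; }
        void deleteDocument(int docId) {
            // Allowed: the owning writer already holds the shared lock.
            if (!index.writeLock.get())
                throw new IllegalStateException("no write lock held");
        }
    }

    public static boolean secondWriterBlocked() {
        Index index = new Index();
        Writer w1 = new Writer(index);
        w1.getReader().deleteDocument(0); // fine: shares w1's lock
        boolean blocked;
        try { new Writer(index); blocked = false; }
        catch (IllegalStateException e) { blocked = true; }
        w1.close();
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println(secondWriterBlocked()); // true
    }
}
```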
Re: Realtime Search
Jason Rutherglen wrote:

>> But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open we may have to block deletions via IW. Not sure...
>
> Can't IW use the IR to do its deletions? Currently, deletions in IW are implemented in DocumentsWriter.applyDeletes by loading a segment with SegmentReader.get() and making the deletions, which causes term-index load overhead per flush. If IW has an internal IR then the deletion process can use it (not SegmentReader.get) and there should not be a conflict anymore between the IR and IW deletion processes.

Today, IW quickly opens each SegmentReader, applies deletes, then commits & closes it, because we have considered it too costly to leave these readers open. But if you've opened a persistent IR via the IndexWriter anyway, we should use the SegmentReaders from that IR instead.

It seems like the joint IR+IW would allow you to do adds, deletes, setNorms, all of which are not visible in the exposed IR until IR.reopen is called. reopen would then flush any added docs to new segments, materialize any buffered deletes into the BitVectors (or a future transactional sorted-int-tree thingy), likewise for norms, and then return a new IR. Ie, the IR becomes transactional as well -- deletes are not visible immediately until reopen is called (unlike today, when you delete via IR).

I think this means, internally, when IW wants to make changes to the shared IR, it should make a clone() and do the changes privately to that instance. Then when reopen is called, we must internally reopen that clone() such that its deleted docs are carried over to the newly reopened reader, and newly flushed docs from IW are visible as new SegmentReaders. And on reopen, the deletes should not be flushed to the Directory -- they only need to be moved into each SegmentReader's deletedDocs.

We'd also need to ensure, when a merge kicks off, that the SegmentReaders used by the merging are not newly reopened but also borrowed from the already open IR. This could actually mean that some deleted docs get merged away before the deletions ever get flushed to the Directory.

> we may have to block deletions via IW

Hopefully they can be buffered.

> Where else does the write lock need to be coordinated between IR and IW?
>
>> somehow IW & IR have to split the write lock else we may need to merge deletions somehow.
>
> This is a part I'd like to settle on before the start of implementation. It looks like in IW deletes are buffered as terms or queries until flushed. I don't think there needs to be a lock until the flush is performed? For the merge changes to the index, the deletion policy can be used to ensure a reader still has access to the segments it needs from the main directory. The write lock is held to prevent multiple writers from buffering and then writing changes to the index.

Since we will have this joint IR/IW share state, as long as we properly synchronize/share things between IR/IW, it's fine if they both share the write lock. It seems like IR.reopen suddenly means "have IW materialize all pending stuff and give me a new reader", where "stuff" is adds & deletes. Adds must materialize via the directory. Deletes can materialize entirely in RAM. Likewise for norms. When IW.commit is called, it also then asks each SegmentReader to commit. Ie, IR.commit would not be used.

>> We have to test performance to measure the net add -> search latency. For many apps this approach may be plenty fast. If your IO system is an SSD it could be extremely fast. Swapping in RAMDir just makes it faster w/o changing the basic approach.
>
> It is true that this is the best way to start, and in fact may be good enough for many users. It could help new users to expose a reader from IW, so the delineation between them is removed and Lucene becomes easier to use. At the very least, this system allows concurrently updateable IR and IW due to sharing the write lock, something that is currently incorrect in Lucene.

I wouldn't call it incorrect. It was an explicit design tradeoff to make the division between IR & IW, and done for many good reasons. We are now talking about relaxing that, and it clearly raises a number of challenging issues...

>> Besides the transaction log (for crash recovery), which should fit above Lucene nicely, what else is needed for realtime beyond the single-transaction support Lucene already provides?
>
> What we have described above (exposing IR via IW) will be sufficient, and realtime will live above it.

OK, good. In this model, the combined IR+IW is still jointly transactional, in that the IW's commit() method still behaves as it does today. It's just that the IR that's linked to the IW is allowed to see changes, shared only in RAM, that a freshly opened IR on the index would not see until commit has been called.
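The applyDeletes change Jason describes in this exchange -- reusing the persistent reader's SegmentReaders instead of opening a fresh one per flush -- amounts to keeping a pool keyed by segment name. A toy sketch with made-up stand-in classes:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch: reuse pooled per-segment readers when the writer applies
// buffered deletes, instead of opening a fresh SegmentReader per flush.
public class ReaderPoolDemo {
    static class SegReader {
        static int opens = 0;                 // counts costly opens
        final Set<Integer> deletedDocs = new HashSet<>();
        SegReader() { opens++; }              // stands in for SegmentReader.get()
    }

    // Pool shared by the writer and the externally exposed reader.
    static final Map<String, SegReader> pool = new HashMap<>();

    static SegReader get(String segment) {
        return pool.computeIfAbsent(segment, s -> new SegReader());
    }

    /** Writer-side: materialize a buffered delete into the pooled reader. */
    static void applyDelete(String segment, int docId) {
        get(segment).deletedDocs.add(docId);  // no reopen, no commit/close
    }

    public static void main(String[] args) {
        applyDelete("_0", 3);
        applyDelete("_0", 7);   // reuses the pooled reader: still one open
        applyDelete("_1", 2);
        System.out.println(SegReader.opens);              // 2
        System.out.println(get("_0").deletedDocs.size()); // 2
    }
}
```

Because the pool is shared, deletes applied by the writer land directly in the same deletedDocs the exposed reader will pick up on reopen, with no per-flush term-index load.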
Re: Realtime Search
>> Patch #2: Implement a realtime ram index class
>
> I think this one is optional, or, rather, an optimization that we can swap in later if/when necessary? Ie for starters little segments are written into the main Directory.

John, Zoie could be of use for this patch. In addition, we may want to implement flushing the IW ram buffer to a RAMDir for reading, as M.M. suggested. First, though, the IW to IR integration (LUCENE-1516) needs to be implemented; otherwise it's not possible to properly execute updates in realtime.
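The RAMDir idea can be pictured as a two-tier directory: freshly flushed little segments land in RAM and are searchable immediately, while a background step later migrates them to the main Directory without changing what readers see. A toy sketch (illustrative names; plain maps stand in for Lucene Directory implementations):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of a two-tier directory: flush to RAM first, migrate to
// the main (disk) directory later; a reader sees the union of both.
public class TwoTierDirectoryDemo {
    static final Map<String, byte[]> ramDir = new LinkedHashMap<>();
    static final Map<String, byte[]> mainDir = new LinkedHashMap<>();

    /** Flush lands in RAM: cheap, immediately searchable. */
    static void flushSegment(String name, byte[] data) { ramDir.put(name, data); }

    /** Background migration to the main Directory. */
    static void migrate(String name) {
        byte[] data = ramDir.remove(name);
        if (data != null) mainDir.put(name, data);
    }

    /** A reader searches the union of both tiers. */
    static Map<String, byte[]> visibleSegments() {
        Map<String, byte[]> all = new LinkedHashMap<>(mainDir);
        all.putAll(ramDir);
        return all;
    }

    public static void main(String[] args) {
        flushSegment("_0", new byte[] {1});
        System.out.println(visibleSegments().keySet()); // [_0]
        migrate("_0");                                  // moved to the disk tier
        System.out.println(visibleSegments().keySet()); // [_0] (still visible)
    }
}
```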
Re: Realtime Search
Grant,

Do you have a proposal in mind? It would help to suggest some classes and methods, to help understand an alternative to what is being discussed.

-J
Re: Realtime Search
Just thinking out loud... haven't looked at your patch yet (one of these days I will be back up for air).

My initial thought is that you would have a factory that produced both the Reader and the Writer as a pair, or was at least aware of what to go get from the Writer. Something like:

class IndexFactory {
  IndexWriter getWriter()
  IndexReader getReader()
  // Not sure if this is needed yet, but:
  IndexReader getReader(IndexWriter)
}

The factory (or whatever you want to call it) is responsible for making sure the Writer and Reader have the pieces they need, i.e. the SegmentInfos. The first getReader will get you the plain old Reader that everyone knows and loves today (assuming there is a benefit to keeping it around); the second one knows what to get off the Writer to create the appropriate Reader. It's nothing particularly hard to implement over what you are proposing, I don't think. Just trying to keep the Reader out of the Writer from an API cleanliness standpoint.

-Grant

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
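Grant's IndexFactory sketch could be fleshed out roughly as follows, with stub Reader/Writer types standing in for Lucene's. This is hypothetical code showing only the shape of the API: the writer-aware getReader shares the writer's in-memory SegmentInfos, while the plain getReader does not:

```java
// Toy fleshing-out of the IndexFactory idea; all classes are stubs.
public class IndexFactoryDemo {
    static class IndexWriter {
        final Object segmentInfos = new Object(); // in-memory segment state
    }
    static class IndexReader {
        final Object segmentInfos;
        IndexReader(Object infos) { this.segmentInfos = infos; }
    }

    /** Hands out readers and writers, wiring up the shared pieces. */
    static class IndexFactory {
        private final IndexWriter writer = new IndexWriter();

        IndexWriter getWriter() { return writer; }

        /** Plain reader: sees only committed state (stubbed here). */
        IndexReader getReader() { return new IndexReader(new Object()); }

        /** Realtime reader: built from the writer's own SegmentInfos. */
        IndexReader getReader(IndexWriter w) {
            return new IndexReader(w.segmentInfos);
        }
    }

    public static void main(String[] args) {
        IndexFactory f = new IndexFactory();
        IndexReader rt = f.getReader(f.getWriter());
        // The realtime reader shares the writer's uncommitted segment state.
        System.out.println(rt.segmentInfos == f.getWriter().segmentInfos); // true
    }
}
```

The design point is that neither Reader nor Writer references the other directly; only the factory knows how to wire the shared SegmentInfos.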
Re: Realtime Search
Jason Rutherglen wrote:

> Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock

I tentatively like this approach so far... That reader is opened using IndexWriter's SegmentInfos instance, so it can read segments & deletions that have been flushed but not committed. It's allowed to do its own deletions & norms updating. When reopen() is called, it grabs the writer's SegmentInfos again.

> Patch #2: Implement a realtime ram index class

I think this one is optional, or, rather, an optimization that we can swap in later if/when necessary? Ie for starters little segments are written into the main Directory.

> Patch #3: Implement realtime transactions in IndexWriter or in a subclass of IndexWriter by implementing a createTransaction method that generates a realtime Transaction object. When the transaction is flushed, the transaction index modifications are available via the getReader method of IndexWriter

Can't this be layered on top? Or... are you looking to add support for multiple transactions in flight at once on IndexWriter?

Mike
Re: Realtime Search
Marvin Humphrey mar...@rectangular.com wrote:

> The goal is to improve worst-case write performance. ... In between the time when the background merge writer starts up and the time it finishes consolidating segment data, we assume that the primary writer will have modified the index.
>
> * New docs have been added in new segments.
> * Tombstones have been added which suppress documents in segments which didn't even exist when the background merge writer started up.
> * Tombstones have been added which suppress documents in segments which existed when the background merge writer started up, but were not merged.
> * Tombstones have been added which suppress documents in segments which have just been merged.
>
> Only the last category of deletions matters. At this point, the background merge writer acquires an exclusive write lock on the index. It examines recently added tombstones, translates the document numbers, and writes a tombstone file against itself. Then it writes the snapshot file to commit its changes and releases the write lock.

OK, now I understand KS's two-writer model. Lucene has already solved this with the ConcurrentMergeScheduler -- all segment merges are done in the BG (by default). We also have to compute the deletions against the new segment, to include deletions that happened to the merged segments after the merge kicked off.

Still, it's not a panacea, since often the IO system has horrible degradation in performance while a merge is running. If only we could mark all IO (reads & writes) associated with merging as low priority and have the OS actually do the right thing...

> It's true that we are decoupling the process of making logical changes to the index from the process of internal consolidation. I probably wouldn't describe that as being done from the reader's standpoint, though.
Right, we have a different problem in Lucene (because we must warm a reader before using it): after a large merge, warming the new IndexReader that includes that segment can be costly (though that cost is going down with LUCENE-1483, and eventually column-stride fields). But we can solve this by allowing a reopened reader to use the old segments, until the new segment is warmed.

Mike
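The warming idea here -- keep serving the old segments until the merged segment is warmed, then swap -- can be sketched as an atomic publish. Names are illustrative; real warming would preload norms, field caches, and so on:

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy sketch: searches read a volatile reference that is swapped only
// after the new snapshot has been warmed, so they never pay the warming
// cost and never block on the swap.
public class WarmThenSwap {
    static class Snapshot {
        final String segments;
        boolean warmed;
        Snapshot(String segments) { this.segments = segments; }
        void warm() { warmed = true; } // e.g. preload norms, caches
    }

    private final AtomicReference<Snapshot> current =
        new AtomicReference<>(new Snapshot("_0"));

    Snapshot acquire() { return current.get(); }

    /** Called after a merge finishes: warm privately, then publish. */
    void publish(Snapshot merged) {
        merged.warm();        // costly work happens off the search path
        current.set(merged);  // atomic swap; searches see it next acquire
    }

    public static void main(String[] args) {
        WarmThenSwap index = new WarmThenSwap();
        Snapshot before = index.acquire();
        index.publish(new Snapshot("_0+_1 merged"));
        Snapshot after = index.acquire();
        System.out.println(before.segments + " -> " + after.segments);
        System.out.println(after.warmed); // true
    }
}
```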
Re: Realtime Search
M.M.: That reader is opened using IndexWriter's SegmentInfos instance, so it can read segments & deletions that have been flushed but not committed. It's allowed to do its own deletions & norms updating. When reopen() is called, it grabs the writer's SegmentInfos again.

Are you referring to the IW.pendingCommit SegmentInfos variable? When you say "flushed", are you referring to the IW.prepareCommit method?

I think step #1 is important and should be generally useful outside of realtime search; however, it's unclear how/when calls to IW.deleteDocument will be reflected in IW.getReader? I assumed that IW.commit would result in IW.deleteDocument changes showing up in IW.getReader. Calls to Transaction.deleteDocument/flush would show up immediately; otherwise the semantics of the realtime indexing vs. IW-based batch indexing use cases are generally unclear to the user. With IW indexing, one adds documents and deletes documents, then does a global commit to the main directory. Interleaving deletes with documents added isn't possible, because if the documents are in the IW ram buffer, they are not necessarily deleted. So it seems that if the semantics are such that IW.commit or IW.prepareCommit exposes deletes via IW.getReader, what is the difference compared to IndexReader.reopen on the index, except the shared write lock? Ok, perhaps this is all one gets, and as you mentioned, the rest is placed on a level above IW, which hopefully does not confuse the user.

M.M.: Patch #2: Implement a realtime ram index class -- I think this one is optional, or, rather, an optimization that we can swap in later if/when necessary? Ie for starters little segments are written into the main Directory.

If this is swapped in later, how is the system realtime, except perhaps for deletes?

M.M.: Can't this be layered on top? Or... are you looking to add support for multiple transactions in flight at once on IndexWriter?

The initial version can be layered on top; that will make testing easier. Adding support for multiple transactions at once on IndexWriter, outside of the realtime transactions, seems to require a lot of refactoring.
Re: Realtime Search
Jason Rutherglen jason.rutherg...@gmail.com wrote:

> Are you referring to the IW.pendingCommit SegmentInfos variable?

No, I'm referring to segmentInfos. (pendingCommit is the snapshot of segmentInfos taken when committing...)

> When you say flushed you are referring to the IW.prepareCommit method?

No, I'm referring to flush... it writes a new segment but not a new segments_N, does not sync the files, and does not invoke the deletion policy.

> I think step #1 is important and should be generally useful outside of realtime search, however it's unclear how/when calls to IW.deleteDocument will be reflected in IW.getReader?

You'd have to flush (to materialize pending deletions inside IW), then reopen the reader, to see any deletions done via the writer. But I think instead realtime search would do deletions via the reader (because if you use IW you're updating deletes through the Directory = too slow).

> Interleaving deletes with documents added isn't possible because if the documents are in the IW ram buffer, they are not necessarily deleted

Well, we buffer the delete and then on flush we materialize the delete. So if you add a doc with field X=77, then delete-by-term X:77, then flush, you'll flush a 1-document segment whose only document is marked as deleted.

But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact, if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open, we may have to block deletions via IW. Not sure... somehow IW & IR have to split the write lock, else we may need to merge deletions somehow.

> If this is swapped in later how is the system realtime except perhaps deletes?

We have to test performance to measure the net add -> search latency. For many apps this approach may be plenty fast. If your IO system is an SSD it could be extremely fast. Swapping in RAMDir just makes it faster w/o changing the basic approach.
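Mike's X=77 example can be modeled in a few lines: a delete-by-term is buffered next to the RAM buffer and materialized against the newly flushed segment, yielding a one-document segment whose only document is deleted. This is toy code; the real IW also tracks which docs were added before each buffered delete, bookkeeping that is elided here:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of buffered deletes: delete-by-term is remembered alongside
// the RAM buffer and materialized on flush against the new segment.
public class BufferedDeleteDemo {
    static class Segment {
        final int numDocs;
        final Set<Integer> deletedDocs;
        Segment(int numDocs, Set<Integer> deletedDocs) {
            this.numDocs = numDocs;
            this.deletedDocs = deletedDocs;
        }
    }

    private final List<Map<String, String>> ramBuffer = new ArrayList<>();
    private final List<String[]> bufferedDeleteTerms = new ArrayList<>();

    void addDocument(Map<String, String> doc) { ramBuffer.add(doc); }
    void deleteByTerm(String field, String value) {
        bufferedDeleteTerms.add(new String[] {field, value});
    }

    /** Write buffered docs as a segment, then apply buffered deletes. */
    Segment flush() {
        Set<Integer> deleted = new HashSet<>();
        for (String[] term : bufferedDeleteTerms)
            for (int i = 0; i < ramBuffer.size(); i++)
                if (term[1].equals(ramBuffer.get(i).get(term[0])))
                    deleted.add(i);
        Segment seg = new Segment(ramBuffer.size(), deleted);
        ramBuffer.clear();
        bufferedDeleteTerms.clear();
        return seg;
    }

    public static void main(String[] args) {
        BufferedDeleteDemo w = new BufferedDeleteDemo();
        w.addDocument(Map.of("X", "77")); // add doc with field X=77
        w.deleteByTerm("X", "77");        // buffered delete-by-term
        Segment seg = w.flush();
        // A one-document segment whose only document is marked deleted:
        System.out.println(seg.numDocs + " " + seg.deletedDocs); // 1 [0]
    }
}
```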
Adding support for multiple transactions at once on IndexWriter outside of the realtime transactions seems to require a lot of refactoring. Besides the transaction log (for crash recovery), which should fit above Lucene nicely, what else is needed for realtime beyond the single-transaction support Lucene already provides? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
On Jan 9, 2009, at 8:39 AM, Michael McCandless wrote: Jason Rutherglen wrote: Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock I tentatively like this approach so far... That reader is opened using IndexWriter's SegmentInfos instance, so it can read segments and deletions that have been flushed but not committed. It's allowed to do its own deletions and norms updating. When reopen() is called, it grabs the writer's SegmentInfos again. Minor design nit... We've spent a lot of time up until now getting write functionality out of the Reader, and now we are going to add read functionality into the Writer? Is that the right thing to do? Perhaps there is an interface or some shared objects to be used/exposed, or maybe people should get Readers/Writers from a factory and you could have a RT Factory and a default Factory? Not trying to distract from the deeper issues here, but I don't think it makes sense to have the Writer coupled to the Reader. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
I realize we aren't adding read functionality to the Writer, but it would be coupling the Writer to the Reader nonetheless. I understand it is brainstorming (like I said, not trying to distract from the discussion), just saying that if the Reader and the Writer both need access to the underlying data structures, then we should refactor to make that possible, not just glom the Reader onto the Writer. I suspect if that is done, anyway, that it may make the bigger picture a bit clearer, too. On Jan 9, 2009, at 2:53 PM, Michael McCandless wrote: Grant Ingersoll wrote: We've spent a lot of time up until now getting write functionality out of the Reader, and now we are going to add read functionality into the Writer? Well... we're not really adding read functionality into IW; instead, we are asking IW to open the reader for us, except the reader is provided the SegmentInfos it should use from IW (instead of trying to find the latest segments_N file in the Directory). Ie, what IW.getReader returns is an otherwise normal MultiSegmentReader. The goal is to allow an IndexReader to access segments flushed but not yet committed by IW. These segments are normally private to IW, in memory in its SegmentInfos instance. And this is all just thinking-out-loud-brainstorming. There are still many details to work through... Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open we may have to block deletions via IW. Not sure... Can't IW use the IR to do its deletions? Currently deletions in IW are implemented in DocumentsWriter.applyDeletes by loading a segment with SegmentReader.get() and making the deletions, which causes term index load overhead per flush. If IW has an internal IR then the deletion process can use it (not SegmentReader.get) and there should not be a conflict anymore between the IR and IW deletion processes. we may have to block deletions via IW Hopefully they can be buffered. Where else does the write lock need to be coordinated between IR and IW? somehow IW and IR have to split the write lock, else we may need to merge deletions somehow. This is a part I'd like to settle on before the start of implementation. It looks like in IW deletes are buffered as terms or queries until flushed. I don't think there needs to be a lock until the flush is performed? For the merge changes to the index, the deletion policy can be used to ensure a reader still has access to the segments it needs from the main directory. We have to test performance to measure the net add-to-search latency. For many apps this approach may be plenty fast. If your IO system is an SSD it could be extremely fast. Swapping in RAMDir just makes it faster w/o changing the basic approach. It is true that this is the best way to start and in fact may be good enough for many users. It could help new users to expose a reader from IW so the delineation between them is removed and Lucene becomes easier to use. At the very least this system allows concurrently updateable IR and IW due to sharing the write lock, something that is currently not possible in Lucene.
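The pooling Jason suggests above can be sketched with plain Java (class and method names here are illustrative, not Lucene's real API): instead of SegmentReader.get() re-opening a segment, and re-loading its term index, on every applyDeletes, the writer hands out one cached reader per segment.

```java
import java.util.*;

// Hypothetical sketch of per-segment reader pooling. The timesOpened counter
// stands in for the term-index load cost paid each time a segment is opened.
class ReaderPoolSketch {
    static class SegmentReaderStub {
        final String segment;
        int timesOpened; // real cost: loading the term index, norms, etc.
        SegmentReaderStub(String segment) { this.segment = segment; timesOpened = 1; }
    }

    private final Map<String, SegmentReaderStub> pool = new HashMap<>();

    /** Return the pooled reader for a segment, opening it at most once. */
    SegmentReaderStub get(String segment) {
        return pool.computeIfAbsent(segment, SegmentReaderStub::new);
    }
}
```

With the pool, repeated applyDeletes passes over the same segment reuse one reader instead of paying the open cost per flush, which is the overhead the message above is pointing at.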
Besides the transaction log (for crash recovery), which should fit above Lucene nicely, what else is needed for realtime beyond the single-transaction support Lucene already provides? What we have described above (exposing IR via IW) will be sufficient and realtime will live above it. On Fri, Jan 9, 2009 at 11:15 AM, Michael McCandless luc...@mikemccandless.com wrote: Jason Rutherglen jason.rutherg...@gmail.com wrote: Are you referring to the IW.pendingCommit SegmentInfos variable? No, I'm referring to segmentInfos. (pendingCommit is the snapshot of segmentInfos taken when committing...). When you say flushed you are referring to the IW.prepareCommit method? No, I'm referring to flush... it writes a new segment but not a new segments_N, does not sync the files, and does not invoke the deletion policy. I think step #1 is important and should be generally useful outside of realtime search, however it's unclear how/when calls to IW.deleteDocument will reflect in IW.getReader? You'd have to flush (to materialize pending deletions inside IW) then reopen the reader, to see any deletions done via the writer. But I think instead realtime search would do deletions via the reader (because if you use IW you're updating deletes through the Directory = too slow). Interleaving deletes with documents added isn't possible because if the documents are in the IW ram buffer, they are not necessarily deleted Well, we buffer the delete and then on flush we materialize the delete. So if you add a doc with field X=77, then delete-by-term X:77, then flush, you'll flush a 1 document segment whose only document is marked as deleted. But I think for realtime we don't want to be using IW's deletion at all. We should do all deletes via the IndexReader. In fact if IW has handed out a reader (via getReader()) and that reader (or a reopened derivative) remains open we may have to block deletions via IW. Not sure... 
somehow IW and IR have to split the write lock, else we may need to merge deletions somehow. If this is swapped in later how is the system realtime except perhaps deletes? We have to test performance to measure the net add-to-search latency. For many apps this approach may be plenty fast. If your IO system is an SSD it could be extremely fast. Swapping in RAMDir just makes it faster w/o changing the basic approach. Adding support for multiple transactions at once on IndexWriter outside of the realtime transactions seems to require a lot of refactoring. Besides the transaction log (for crash recovery), which should fit above Lucene nicely, what else is needed for realtime beyond the single-transaction support Lucene already provides? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
I think the IW integrated IR needs a rule regarding the behavior of IW.flush and IR.flush. There will need to be a flush lock that is shared between the IW and IR. The lock is acquired at the beginning of a flush and released immediately after a successful or unsuccessful call. We will need to share this lock down to the SegmentReader level, as presumably IR.getSequentialSubReaders may be called and the subreaders operated on individually. A few questions need to be answered as to desired behavior. What happens when IW flushes w/deletes and IR has pending deletes not flushed yet? Can we automatically flush the IR deletes? If not automatically flushed, are the IR deletes still valid, and can the IR later flush them and not create a conflict (I think this is doable)? Or does the reader become readonly and IR.reopen must be called to obtain the new deletes? In the reverse scenario where IW has pending deletes and IR flushes deletes, are there issues that arise when IW later flushes? I think if it's made clear to the user what the implications are of using IR and IW in combination for deletes, then there should not be an issue with supporting deletes from IR and IW. (I found another way to format with hard line breaks http://emailformattool.com/)
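The shared flush lock described above can be sketched with a ReentrantLock (names are hypothetical): writer-side and reader-side flushes both acquire the same lock, and it is released whether the flush succeeds or throws.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a flush lock shared between writer and reader. The lock is held
// for the duration of either flush and released in a finally block, so a
// failed flush cannot leave it held.
class FlushLockSketch {
    private final ReentrantLock flushLock = new ReentrantLock();

    void writerFlush(Runnable doFlush) { runLocked(doFlush); }
    void readerFlush(Runnable doFlush) { runLocked(doFlush); }

    private void runLocked(Runnable doFlush) {
        flushLock.lock();
        try {
            doFlush.run();          // the flush may succeed or throw
        } finally {
            flushLock.unlock();     // released on success or failure
        }
    }

    boolean isLocked() { return flushLock.isLocked(); }
}
```

Sharing the same lock object down to the SegmentReader level would extend this pattern: each per-segment flush would take the one lock rather than its own.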
Re: Realtime Search
Based on our discussions, it seems best to get realtime search going in small steps. Below are some possible steps to take. Patch #1: Expose an IndexWriter.getReader method that returns the current reader and shares the write lock Patch #2: Implement a realtime ram index class Patch #3: Implement realtime transactions in IndexWriter or in a subclass of IndexWriter by implementing a createTransaction method that generates a realtime Transaction object. When the transaction is flushed, the transaction index modifications are available via the getReader method of IndexWriter The remaining question is how to synchronize the flushes to disk with IndexWriter's other index update locking mechanisms. The flushing could simply use IW.addIndexes, which has a locking mechanism in place. After flushing to disk, queued deletes would be applied to the newly copied disk segments. I think this entails opening the newly copied disk segments and applying deletes that occurred to the corresponding ram segments by cloning the new disk segments and replacing the deleted-docs bit vector, then flushing the deleted docs to disk. This system would allow us to avoid using UID in documents. The API needs to clearly separate realtime transactions vs. the existing index update methods such as addDocument, deleteDocuments, and updateDocument. I don't think it's possible to transparently implement both because the underlying implementations behave differently. It is expected that multiple transactions may be created at once, however the Transaction.flush method would block.
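The createTransaction shape outlined above might look like the following stdlib-only sketch (all names hypothetical, not a real Lucene API): several transactions can be open at once, their modifications are invisible until flushed, and Transaction.flush serializes on a shared lock so it blocks while another flush is in progress.

```java
import java.util.*;

// Hypothetical sketch of the proposed createTransaction API. Pending docs are
// private to each transaction until flush, and flushes serialize on one lock.
class TransactionSketch {
    private final Object flushLock = new Object();
    private final List<String> committedDocs =
        Collections.synchronizedList(new ArrayList<>());

    class Transaction {
        private final List<String> pending = new ArrayList<>();

        void addDocument(String doc) { pending.add(doc); }

        /** Blocks while any other transaction is flushing. */
        void flush() {
            synchronized (flushLock) {
                committedDocs.addAll(pending); // modifications become visible here
                pending.clear();
            }
        }
    }

    Transaction createTransaction() { return new Transaction(); }
    int numCommitted() { return committedDocs.size(); }
}
```

Two transactions can buffer documents concurrently; neither's documents are visible (e.g. via a getReader analogue) until its flush completes.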
Re: Realtime Search
We have worked on this problem on the server level as well. We have also open sourced it at: http://code.google.com/p/zoie/ wiki on the realtime aspect: http://code.google.com/p/zoie/wiki/ZoieSystem -John On Fri, Dec 26, 2008 at 12:34 PM, Robert Engels reng...@ix.netcom.com wrote: If you move to either the embedded or server model, the post reopen is trivial, as the structures can be created as the segment is written. It is the networked shared access model that causes a lot of these optimizations to be far more complex than needed. Would it maybe be simpler to move to the embedded or server model, and add a network shared file (e.g. nfs) access model as a layer? The latter is going to perform far worse anyway. I guess I don't understand why Lucene continues to try and support this model. NO ONE does it any more. This is the way MS Access worked, and everyone that wanted performance needed to move to SQL server for the server model. -Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 26, 2008 12:53 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote: 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin can you describe more detail here? The goal is to improve worst-case write performance. Currently, writes are quick most of the time, but occasionally you'll trigger a big merge and get stuck. To solve this problem, we can assign a merge policy to our primary writer which tells it to merge no more than mergeThreshold documents. The value of mergeThreshold will need tuning depending on document size, change rate, and so on, but the idea is that we want this writer to do as much merging as it can while still keeping worst-case write performance down to an acceptable number. Doing only small merges just puts off the day of reckoning, of course.
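The capped merge policy Marvin describes can be modeled with a short sketch (names and selection heuristic are illustrative): the primary writer only merges a run of small segments whose combined doc count stays within mergeThreshold, leaving bigger consolidations to the background writer.

```java
import java.util.*;

// Sketch of a mergeThreshold-capped policy: pick the smallest segments first
// and stop before the combined doc count would exceed the cap.
class CappedMergePolicySketch {
    /** Returns the doc counts of the segments chosen for merging (possibly none). */
    static List<Integer> pickMerge(List<Integer> segmentDocCounts, int mergeThreshold) {
        List<Integer> sorted = new ArrayList<>(segmentDocCounts);
        Collections.sort(sorted); // smallest segments are cheapest to merge
        List<Integer> merge = new ArrayList<>();
        int total = 0;
        for (int docs : sorted) {
            if (total + docs > mergeThreshold) break;
            merge.add(docs);
            total += docs;
        }
        // A merge needs at least two participants to be worthwhile.
        return merge.size() >= 2 ? merge : Collections.emptyList();
    }
}
```

With segments of 100, 5, 20, and 1000 docs and a threshold of 200, only the 5-, 20-, and 100-doc segments merge; the 1000-doc segment is left for the background writer, keeping the primary writer's worst-case write latency bounded.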
By avoiding big consolidations, we are slowly accumulating small-to-medium sized segments and causing a gradual degradation of search-time performance. What we'd like is a separate write process, operating (mostly) in the background, dedicated solely to merging segments which contain at least mergeThreshold docs. If all we have to do is add documents to the index, adding that second write process isn't a big deal. We have to worry about competition for segment, snapshot, and temp file names, but that's about it. Deletions make matters more complicated, but with a tombstone-based deletions mechanism, the problems are solvable. When the background merge writer starts up, it will see a particular view of the index in time, including deletions. It will perform nearly all of its operations based on this view of the index, mapping around documents which were marked as deleted at init time. In between the time when the background merge writer starts up and the time it finishes consolidating segment data, we assume that the primary writer will have modified the index. * New docs have been added in new segments. * Tombstones have been added which suppress documents in segments which didn't even exist when the background merge writer started up. * Tombstones have been added which suppress documents in segments which existed when the background merge writer started up, but were not merged. * Tombstones have been added which suppress documents in segments which have just been merged. Only the last category of deletions matters. At this point, the background merge writer acquires an exclusive write lock on the index. It examines recently added tombstones, translates the document numbers and writes a tombstone file against itself. Then it writes the snapshot file to commit its changes and releases the write lock.
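The docID translation in that last step can be sketched concretely (names are illustrative, not Lucy/KS code): the merger "maps around" docs deleted at init time, producing an old-to-new docID map, and then translates any tombstones the primary writer added during the merge into the merged segment's numbering.

```java
import java.util.*;

// Sketch of tombstone translation after a background merge. Docs deleted at
// init time are dropped from the merged segment, so surviving docs get
// compacted new docIDs; later tombstones must be remapped through that table.
class TombstoneTranslateSketch {
    /**
     * Build the old->new docID map for a merge over oldMaxDoc docs, where
     * deletedAtInit marks docs the merge mapped around (-1 = not in merge).
     */
    static int[] buildDocMap(int oldMaxDoc, BitSet deletedAtInit) {
        int[] map = new int[oldMaxDoc];
        int newDocID = 0;
        for (int i = 0; i < oldMaxDoc; i++) {
            map[i] = deletedAtInit.get(i) ? -1 : newDocID++;
        }
        return map;
    }

    /** Translate tombstones added during the merge into new-segment docIDs. */
    static BitSet translate(int[] docMap, BitSet newTombstones) {
        BitSet result = new BitSet();
        for (int i = newTombstones.nextSetBit(0); i >= 0; i = newTombstones.nextSetBit(i + 1)) {
            if (docMap[i] >= 0) result.set(docMap[i]); // already-dropped docs need nothing
        }
        return result;
    }
}
```

A tombstone against a doc that was already deleted at init time simply disappears, matching Marvin's point that only deletions against just-merged, still-live documents matter.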
Worst case update performance for the system is now the sum of the time it takes the background merge writer to consolidate tombstones and the worst-case performance of the primary writer. It sounds like this is your solution for decoupling segments changes due to merges from changes from docs being indexed, from a reader's standpoint? It's true that we are decoupling the process of making logical changes to the index from the process of internal consolidation. I probably wouldn't describe that as being done from the reader's standpoint, though. With mmap and data structures optimized for it, we basically solve the read-time responsiveness cost problem. From the client perspective, the delay between firing off a change order and seeing that change made live is now dominated by the time it takes to actually update the index. The time between the commit and having an IndexReader which can see that commit is negligible in comparison. Since you are using mmap to achieve near zero brand-new IndexReader creation, whereas in Lucene we are moving towards
Re: Realtime Search
Andrzej Bialecki wrote: No matter whether you are right or wrong, please keep a civil tone on this public forum. +1 Ad-hominem remarks are anti-community. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Robert Engels wrote: Do what you like. You obviously will. This is the problem with the Lucene managers - the problems are only the ones they see - same with the solutions. If the solution (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone that is designed to limit any further questions (especially those that might question their ability and/or understanding). There are no Lucene managers. We are a collaborative community. As with any community, all are not equally informed in all matters, and some may not realize they are uninformed. Consensus building is an art. One cannot simply assert that one is correct. One must rather convince others. Offending them is not a good start. Polite persistence, illustrative examples and patches are often successful. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Then your comments are misdirected. On Jan 5, 2009, at 1:19 PM, Doug Cutting wrote: Robert Engels wrote: Do what you like. You obviously will. This is the problem with the Lucene managers - the problems are only the ones they see - same with the solutions. If the solution (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone that is designed to limit any further questions (especially those that might question their ability and/or understanding). There are no Lucene managers. We are a collaborative community. As with any community, all are not equally informed in all matters, and some may not realize they are uninformed. Consensus building is an art. One cannot simply assert that one is correct. One must rather convince others. Offending them is not a good start. Polite persistence, illustrative examples and patches are often successful. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
+1 Agreed, the initial version should use RAMDirectory in order to keep things simple and to benchmark against other MemoryIndex-like index representations. On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting cutt...@apache.org wrote: Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from DocumentsWriter RAM buffer. +1 This sounds like a good approach to me. I don't see any fundamental reasons why we need different representations, and fewer implementations of IndexWriter and IndexReader is generally better, unless they get way too hairy. Mostly it seems that real-time can be done with our existing toolbox of datastructures, but with some slightly different control structures. Once we have the control structure in place then we should look at optimizing data structures as needed. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Marvin Humphrey mar...@rectangular.com wrote: 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin can you describe more detail here? It sounds like this is your solution for decoupling segments changes due to merges from changes from docs being indexed, from a reader's standpoint? Since you are using mmap to achieve near zero brand-new IndexReader creation, whereas in Lucene we are moving towards achieving real-time by always reopening a current IndexReader (not a brand new one), it seems like you should not actually have to worry about the case of reopening a reader after a large merge has finished? We need to deal with this case (background the warming) because creating that new SegmentReader (on the newly merged segment) can take a non-trivial amount of time. Mike
Re: Realtime Search
That could very well be, but I was referencing your statement: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. The only reason to do this (or have it happen) is if you perform a binary search on the term index. Using a 2 file system is going to be WAY slower - I'll bet lunch. It might be workable if the files were on a striped drive, or put each file on a different drive/controller, but requiring such specially configured hardware is not a good idea. In the common case (single drive), you are going to be seeking all over the place. Saving the memory structure from the write of the segment is going to offer far superior performance - you can binary seek on the memory structure, not the mmap file. The only problem with this is that there is going to be a minimum memory requirement. Also, the mmap is only suitable for 64 bit platforms, since there is no way in Java to unmap, you are going to run out of address space as segments are rewritten. -Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 24, 2008 1:31 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search On Wed, Dec 24, 2008 at 12:02:24PM -0600, robert engels wrote: As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index open/reopen. No. That idea was entertained briefly and quickly discarded. There seems to be an awful lot of irrelevant noise in the current thread arising due to lack of familiarity with the ongoing discussions in JIRA. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Also, if you are really set on the mmap strategy, why not use the single file with fixed length pages, using the header I proposed (and key compression). You don't need any fancy partial page stuff, just waste a small amount of space at the end of pages. I think this is going to be far faster than a file of fixed length offsets (I assume you would also put the entry data length in file #1 as well), and a file of data (file #2). Mainly because the final page(s) can be more efficiently searched, and since you can use compression (since you have pages), the files are going to be significantly smaller (improving the write time, and the cache efficiency). -Original Message- From: Robert Engels reng...@ix.netcom.com Sent: Dec 26, 2008 11:30 AM To: java-dev@lucene.apache.org, java-dev@lucene.apache.org Subject: Re: Realtime Search That could very well be, but I was referencing your statement: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. The only reason to do this (or have it happen) is if you perform a binary search on the term index. Using a 2 file system is going to be WAY slower - I'll bet lunch. It might be workable if the files were on a striped drive, or put each file on a different drive/controller, but requiring such specially configured hardware is not a good idea. In the common case (single drive), you are going to be seeking all over the place. Saving the memory structure from the write of the segment is going to offer far superior performance - you can binary seek on the memory structure, not the mmap file. The only problem with this is that there is going to be a minimum memory requirement. Also, the mmap is only suitable for 64 bit platforms, since there is no way in Java to unmap, you are going to run out of address space as segments are rewritten.
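For reference, the two-file layout Robert is critiquing can be modeled in memory (an illustrative model, not any real index format): "file #1" holds fixed-length offset entries and "file #2" the concatenated sorted term bytes, so a term lookup binary-searches the offsets, touching both files at every probe, which is the seek pattern he argues will thrash a single drive.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;

// In-memory model of the two-file term index: fixed-length offsets ("file #1")
// plus concatenated sorted term data ("file #2"), searched by binary search.
class TwoFileIndexSketch {
    private final int[] offsets;   // "file #1": one fixed-length entry per term
    private final byte[] data;     // "file #2": sorted term bytes, back to back

    TwoFileIndexSketch(List<String> sortedTerms) {
        offsets = new int[sortedTerms.size() + 1];
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < sortedTerms.size(); i++) {
            offsets[i] = buf.size();
            byte[] b = sortedTerms.get(i).getBytes(StandardCharsets.UTF_8);
            buf.write(b, 0, b.length);
        }
        offsets[sortedTerms.size()] = buf.size();
        data = buf.toByteArray();
    }

    private String termAt(int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
    }

    /** Binary search: each probe reads an offset entry, then the term bytes. */
    int find(String term) {
        int lo = 0, hi = offsets.length - 2;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = termAt(mid).compareTo(term);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1; // term not present
    }
}
```

On disk, each of those probes is a seek into each file; in RAM or via mmap the same access pattern is cheap, which is the crux of the disagreement in this exchange.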
-Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 24, 2008 1:31 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search On Wed, Dec 24, 2008 at 12:02:24PM -0600, robert engels wrote: As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index open/reopen. No. That idea was entertained briefly and quickly discarded. There seems to be an awful lot of irrelevant noise in the current thread arising due to lack of familiarity with the ongoing discussions in JIRA. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from DocumentsWriter RAM buffer. +1 This sounds like a good approach to me. I don't see any fundamental reasons why we need different representations, and fewer implementations of IndexWriter and IndexReader is generally better, unless they get way too hairy. Mostly it seems that real-time can be done with our existing toolbox of datastructures, but with some slightly different control structures. Once we have the control structure in place then we should look at optimizing data structures as needed. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
The addition of docs into tiny segments using the current data structures seems the right way to go. Sometime back one of my engineers implemented pseudo real-time using MultiSearcher by having an in-memory (RAM based) short-term index that auto-merged into a disk-based long term index that eventually got merged into archive indexes. Index optimization would take place during these merges. The search we required was very time-sensitive (searching last-minute breaking news wires). The advantage of having an archive index is that very old documents in our applications were not usually searched on unless archives were explicitly selected. -- Joaquin On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting cutt...@apache.org wrote: Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from DocumentsWriter RAM buffer. +1 This sounds like a good approach to me. I don't see any fundamental reasons why we need different representations, and fewer implementations of IndexWriter and IndexReader is generally better, unless they get way too hairy. Mostly it seems that real-time can be done with our existing toolbox of datastructures, but with some slightly different control structures. Once we have the control structure in place then we should look at optimizing data structures as needed. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote: 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin can you describe more detail here? The goal is to improve worst-case write performance. Currently, writes are quick most of the time, but occasionally you'll trigger a big merge and get stuck. To solve this problem, we can assign a merge policy to our primary writer which tells it to merge no more than mergeThreshold documents. The value of mergeThreshold will need tuning depending on document size, change rate, and so on, but the idea is that we want this writer to do as much merging as it can while still keeping worst-case write performance down to an acceptable number. Doing only small merges just puts off the day of reckoning, of course. By avoiding big consolidations, we are slowly accumulating small-to-medium sized segments and causing a gradual degradation of search-time performance. What we'd like is a separate write process, operating (mostly) in the background, dedicated solely to merging segments which contain at least mergeThreshold docs. If all we have to do is add documents to the index, adding that second write process isn't a big deal. We have to worry about competition for segment, snapshot, and temp file names, but that's about it. Deletions make matters more complicated, but with a tombstone-based deletions mechanism, the problems are solvable. When the background merge writer starts up, it will see a particular view of the index in time, including deletions. It will perform nearly all of its operations based on this view of the index, mapping around documents which were marked as deleted at init time. In between the time when the background merge writer starts up and the time it finishes consolidating segment data, we assume that the primary writer will have modified the index. * New docs have been added in new segments.
* Tombstones have been added which suppress documents in segments which didn't even exist when the background merge writer started up. * Tombstones have been added which suppress documents in segments which existed when the background merge writer started up, but were not merged. * Tombstones have been added which suppress documents in segments which have just been merged. Only the last category of deletions matters. At this point, the background merge writer acquires an exclusive write lock on the index. It examines recently added tombstones, translates the document numbers and writes a tombstone file against itself. Then it writes the snapshot file to commit its changes and releases the write lock. Worst case update performance for the system is now the sum of the time it takes the background merge writer to consolidate tombstones and the worst-case performance of the primary writer. It sounds like this is your solution for decoupling segments changes due to merges from changes from docs being indexed, from a reader's standpoint? It's true that we are decoupling the process of making logical changes to the index from the process of internal consolidation. I probably wouldn't describe that as being done from the reader's standpoint, though. With mmap and data structures optimized for it, we basically solve the read-time responsiveness cost problem. From the client perspective, the delay between firing off a change order and seeing that change made live is now dominated by the time it takes to actually update the index. The time between the commit and having an IndexReader which can see that commit is negligible in comparison. Since you are using mmap to achieve near zero brand-new IndexReader creation, whereas in Lucene we are moving towards achieving real-time by always reopening a current IndexReader (not a brand new one), it seems like you should not actually have to worry about the case of reopening a reader after a large merge has finished?
Even though we can rely on mmap rather than slurping, there are potentially a lot of files to open and a lot of JSON-encoded metadata to parse, so I'm not certain that Lucy/KS will never have to worry about the time it takes to open a new IndexReader. Fortunately, we can implement reopen() if we need to. We need to deal with this case (background the warming) because creating that new SegmentReader (on the newly merged segment) can take a non-trivial amount of time. Yes. Without mmap or some other solution, I think improvements to worst-case update performance in Lucene will continue to be constrained by post-commit IndexReader opening costs. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
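The tombstone-translation step Marvin describes (the background merge writer remapping recently added tombstones into the merged segment's doc numbering before writing a tombstone file against itself) might look roughly like this sketch. All names here are invented for illustration, not Lucy/KS or Lucene code; in particular, we assume the merge produced a per-source-segment oldDocId -> newDocId map, with -1 for docs the merger skipped because they were already deleted when it started:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: translate tombstones recorded against pre-merge
// segments into doc numbers valid for the newly merged segment.
public class TombstoneTranslator {
    public static List<Integer> translate(int[][] tombstones, int[][] docMaps) {
        List<Integer> translated = new ArrayList<>();
        for (int seg = 0; seg < tombstones.length; seg++) {
            for (int oldDoc : tombstones[seg]) {
                int newDoc = docMaps[seg][oldDoc];
                if (newDoc != -1) {      // doc survived the merge; suppress it now
                    translated.add(newDoc);
                }
            }
        }
        return translated;               // would be written as a tombstone file
    }

    public static void main(String[] args) {
        // Segment 0 had 4 docs; doc 1 was already deleted when the merge started.
        int[][] docMaps = { {0, -1, 1, 2}, {3, 4} };
        // Deletes that arrived while the merge ran (the "last category" above):
        int[][] tombstones = { {2}, {0} };
        System.out.println(translate(tombstones, docMaps)); // [1, 3]
    }
}
```

Only this remapping needs the exclusive write lock; everything else the merge writer does happens against its init-time snapshot.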
Re: Realtime Search
One thing that I forgot to mention is that in our implementation the real-time indexing took place with many folder-based listeners writing to many tiny in-memory indexes partitioned by sub-sources, with fewer long-term and archive indexes per box. Overall distributed search across various Lucene-based search services was done using a federator component, very much like shard-based search is done today (I believe). -- Joaquin. On Fri, Dec 26, 2008 at 10:48 AM, J. Delgado joaquin.delg...@gmail.com wrote: The addition of docs into tiny segments using the current data structures seems the right way to go. Sometime back one of my engineers implemented pseudo real-time using MultiSearcher by having an in-memory (RAM based) short-term index that auto-merged into a disk-based long-term index that eventually got merged into archive indexes. Index optimization would take place during these merges. The search we required was very time-sensitive (searching last-minute breaking news wires). The advantage of having an archive index is that very old documents in our applications were not usually searched on unless archives were explicitly selected. -- Joaquin On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting cutt...@apache.org wrote: Michael McCandless wrote: So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from the DocumentsWriter RAM buffer. +1 This sounds like a good approach to me.
I don't see any fundamental reasons why we need different representations, and fewer implementations of IndexWriter and IndexReader is generally better, unless they get way too hairy. Mostly it seems that real-time can be done with our existing toolbox of data structures, but with some slightly different control structures. Once we have the control structure in place then we should look at optimizing data structures as needed. Doug
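Joaquin's short-term/long-term tiering can be illustrated with a toy, non-Lucene sketch (all classes here are hypothetical; a real implementation would use something like a RAMDirectory-backed index, an FSDirectory-backed index, and a MultiSearcher over both):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy sketch of RAM/disk tiering: new docs land in a small RAM index and are
// searchable immediately; once the RAM tier exceeds a threshold it is merged
// into the long-term tier. Searches consult both tiers.
public class TieredIndex {
    private final Map<String, Set<Integer>> ramIndex = new HashMap<>();
    private final Map<String, Set<Integer>> diskIndex = new HashMap<>(); // stand-in for the disk tier
    private int ramDocCount = 0;
    private final int mergeThreshold;

    public TieredIndex(int mergeThreshold) { this.mergeThreshold = mergeThreshold; }

    public void addDocument(int docId, String... terms) {
        for (String t : terms) ramIndex.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
        if (++ramDocCount >= mergeThreshold) mergeRamToDisk();
    }

    // "Auto-merge" of the RAM tier into the long-term tier (a real system would
    // do this in a background thread as a segment merge).
    private void mergeRamToDisk() {
        for (Map.Entry<String, Set<Integer>> e : ramIndex.entrySet())
            diskIndex.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        ramIndex.clear();
        ramDocCount = 0;
    }

    // Search consults both tiers, like a MultiSearcher over the two indexes.
    public Set<Integer> search(String term) {
        Set<Integer> hits = new TreeSet<>(diskIndex.getOrDefault(term, Set.of()));
        hits.addAll(ramIndex.getOrDefault(term, Set.of()));
        return hits;
    }

    public static void main(String[] args) {
        TieredIndex idx = new TieredIndex(2);
        idx.addDocument(0, "breaking", "news");
        idx.addDocument(1, "news");       // triggers merge into the long-term tier
        idx.addDocument(2, "breaking");   // still in RAM, yet immediately searchable
        System.out.println(idx.search("breaking")); // [0, 2]
    }
}
```

The point of the sketch is the control structure: documents become searchable before any disk merge happens, and the merge only changes where postings live, not what a search sees.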
Re: Realtime Search
This is what we mostly do, but we serialize the documents to a log file first, so if the server crashes before the background merge of the RAM segments into the disk segments completes, we can replay the operations on server restart. Since the serialize is a sequential write to an already open file, it is very fast. I realize that many users do not wrap Lucene in a server process, so it doesn't seem that writing only to the RAM segments will work? How will the other processes/servers see them? Doesn't seem it would be real-time for them. Maybe restrict the real-time search to server Lucene installations? If you are concerned about performance in the first place, that seems a requirement anyway. On this note, maybe to allow greater advancement of Lucene, Lucene should move to a design approach similar to many databases. You have an embedded version, which is designed for a single process with multiple threads, and a server version which wraps the embedded version allowing multiple clients. Seems to be a far simpler architecture. I know I have brought this up in the past, but maybe it is time to revisit? It was the core of Unix design (no file locks needed), and works well for many dbs (e.g. Derby) -Original Message- From: Doug Cutting cutt...@apache.org Sent: Dec 26, 2008 12:20 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search
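The log-then-apply scheme Robert describes (serialize each operation to a log before applying it to the RAM segments, and replay the log on restart after a crash) could be sketched like this. The format and class names are invented for illustration, and a real implementation would keep the log file open across appends rather than reopening it each time:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Minimal operation-log sketch: appends are sequential writes, so they are cheap;
// after a crash, replay() returns the operations that must be re-applied.
public class OpLog {
    public static void append(Path log, String op) throws IOException {
        // A real server would hold one open Writer and flush per op.
        try (Writer w = Files.newBufferedWriter(log,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            w.write(op);
            w.write('\n');
        }
    }

    public static List<String> replay(Path log) throws IOException {
        return Files.exists(log) ? Files.readAllLines(log) : List.of();
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("oplog", ".log");
        append(log, "ADD doc1");
        append(log, "DELETE doc0");
        System.out.println(replay(log)); // [ADD doc1, DELETE doc0]
        Files.delete(log);
    }
}
```

Once the RAM segments are durably merged to disk, the log (or the prefix covering those operations) can be truncated.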
Re: Realtime Search
If you move to either the embedded or server model, the post-reopen is trivial, as the structures can be created as the segment is written. It is the networked shared access model that causes a lot of these optimizations to be far more complex than needed. Would it maybe be simpler to move to the embedded or server model, and add a network shared file (e.g. nfs) access model as a layer? The latter is going to perform far worse anyway. I guess I don't understand why Lucene continues to try and support this model. NO ONE does it any more. This is the way MS Access worked, and everyone that wanted performance needed to move to SQL Server for the server model. -Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 26, 2008 12:53 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search
Re: Realtime Search
There is also the distributed model - but in that case each node is running some sort of server anyway (as in Hadoop). It seems that the distributed model would be easier to develop using Hadoop over the embedded model. -Original Message- From: Robert Engels reng...@ix.netcom.com Sent: Dec 26, 2008 2:34 PM To: java-dev@lucene.apache.org Subject: Re: Realtime Search
Re: Realtime Search
Robert, Three exchanges ago in this thread, you made the incorrect assumption that the motivation behind using mmap was read speed, and that memory mapping was being waved around as some sort of magic wand: Is there something that I am missing? I see lots of references to using memory mapped files to dramatically improve performance. I don't think this is the case at all. At the lowest levels, it is somewhat more efficient from a CPU standpoint, but with a decent OS cache the IO performance difference is going to negligible. In response, I indicated that the mmap design had been discussed in JIRA, and pointed you at a particular issue. There have been substantial discussions about this design in JIRA, notably LUCENE-1458. The dramatic improvement is WRT to opening/reopening an IndexReader. Apparently, you did not go back to read that JIRA thread, because you subsequently offered a critique of a purely invented design you assumed we must have arrived at, and continued to argue with a straw man about read speed: 1. with fixed size terms, the additional IO (larger pages) probably offsets a lot of the random access benefit. This is why compressed disks on a fast machine (CPU) are often faster than uncompressed - more data is read during every IO access. While my reply did not specifically point back to LUCENE-1458 again, I hoped that having your foolish assumption exposed would motivate you to go back and read it, so that you could offer an informed critique of the *actual* design. I also linked to a specific comment in LUCENE-831 which explained how mmap applied to sort caches. Additionally, sort caches would be written at index time in three files, and memory mapped as laid out in https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150. Apparently you still didn't go back and read up, because you subsequently made a third incorrect assumption, this time about plans to do away with the term dictionary index. 
In response I griped about JIRA again, using slightly stronger but still intentionally indirect language. No. That idea was entertained briefly and quickly discarded. There seems to be an awful lot of irrelevant noise in the current thread arising due to lack of familiarity with the ongoing discussions in JIRA. Unfortunately, this must not have worked either, because you have now offered a fourth message based on incorrect assumptions which would have been remedied by bringing yourself up to date with the relevant JIRA threads. That could very well be, but I was referencing your statement: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. The only reason to do this (or have it happen) is if you perform a binary search on the term index. No. As discussed in LUCENE-1458, LUCENE-1483, the specific link I pointed you towards in LUCENE-831, the message where I provided you with that link, and elsewhere in this thread... loading the term dictionary index is important, but the cost pales in comparison to the cost of loading sort caches. Using a 2 file system is going to be WAY slower - I'll bet lunch. It might be workable if the files were on a striped drive, or put each file on a different drive/controller, but requiring such specially configured hardware is not a good idea. In the common case (single drive), you are going to be seeking all over the place. Mike McCandless and I had an extensive debate about the pros and cons of depending on the OS cache to hold the term dictionary index under LUCENE-1458. The concerns you express here were fully addressed, and even resolved under an agree to disagree design. Also, the mmap is only suitable for 64 bit platforms, since there is no way in Java to unmap, you are going to run out of address space as segments are rewritten. 
The discussion of how the mmap design translates from Lucy to Lucene is an important one, but I despair of having it if we have to rehash all of LUCENE-1458, LUCENE-831, and possibly LUCENE-1476 and LUCENE-1483 because you cannot be troubled to bring yourself up to speed before commenting. You are obviously knowledgeable on the subject of low-level memory issues. Me and Mike McCandless ain't exactly chopped liver, though, and neither are a lot of other people around here who *are* bothering to keep up with the threads in JIRA. I request that you show the rest of us more respect. Our time is valuable, too. Marvin Humphrey
Re: Realtime Search
You are full of crap. From your own comments in Lucene 1458: The work on streamlining the term dictionary is excellent, but perhaps we can do better still. Can we design a format that allows us rely upon the operating system's virtual memory and avoid caching in process memory altogether? Say that we break up the index file into fixed-width blocks of 1024 bytes. Most blocks would start with a complete term/pointer pairing, though at the top of each block, we'd need a status byte indicating whether the block contains a continuation from the previous block in order to handle cases where term length exceeds the block size. For Lucy/KinoSearch our plan would be to mmap() on the file, but accessing it as a stream would work, too. Seeking around the index term dictionary would involve seeking the stream to multiples of the block size and performing binary search, rather than performing binary search on an array of cached terms. There would be increased processor overhead; my guess is that since the second stage of a term dictionary seek – scanning through the primary term dictionary – involves comparatively more processor power than this, the increased costs would be acceptable. and then you state farther down Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case. We could also explore something in-between, eg it'd be nice to genericize MultiLevelSkipListWriter so that it could index arbitrary files, then we could use that to index the terms dict. You could choose to spend dedicated process RAM on the higher levels of the skip tree, and then tentatively trust IO cache for the lower levels. That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks. It's also very complicated, which of course bothers me more than it bothers you. 
So I imagine we'll choose different paths. The thing I find funny is that many are approaching these issues as if new ground is being broken. These are ALL standard, long-known issues that any database engineer has already worked with, and there are accepted designs given the applicable constraints. This is why I've tried to point folks towards alternative designs that open the door much wider to increased performance/reliability/robustness. Do what you like. You obviously will. This is the problem with the Lucene managers - the problems are only the ones they see - same with the solutions. If the solution (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone that is designed to limit any further questions (especially those that might question their ability and/or understanding). -Original Message- From: Marvin Humphrey mar...@rectangular.com Sent: Dec 26, 2008 3:53 PM To: java-dev@lucene.apache.org, Robert Engels reng...@ix.netcom.com Subject: Re: Realtime Search
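The fixed-width-block term dictionary quoted from LUCENE-1458 can be sketched as follows. The block size, status byte, and length-prefixed encoding are illustrative details for this sketch, not the actual Lucy/KS file format; the point is that fixed-width blocks let a reader seek the (mmap-able) index to multiples of the block size and binary-search, rather than binary-searching an array of terms cached in process memory:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch of a fixed-width-block term index: each block starts with a status byte
// (0 = block begins with a complete term; 1 would mark a continuation of a long
// term from the previous block) followed by a length-prefixed term.
public class BlockTermIndex {
    static final int BLOCK_SIZE = 32; // 1024 in the proposal; small here for demo

    static ByteBuffer write(List<String> terms) {
        ByteBuffer buf = ByteBuffer.allocate(terms.size() * BLOCK_SIZE);
        for (String t : terms) {              // one term per block, for simplicity
            byte[] b = t.getBytes(StandardCharsets.UTF_8);
            int start = buf.position();
            buf.put((byte) 0);                // status: complete term follows
            buf.put((byte) b.length);         // length-prefixed term
            buf.put(b);
            buf.position(start + BLOCK_SIZE); // pad out the fixed-width block
        }
        buf.flip();
        return buf;
    }

    static String firstTermOf(ByteBuffer buf, int block) {
        int off = block * BLOCK_SIZE;         // status byte at off is 0 here
        int len = buf.get(off + 1);
        byte[] b = new byte[len];
        for (int i = 0; i < len; i++) b[i] = buf.get(off + 2 + i);
        return new String(b, StandardCharsets.UTF_8);
    }

    // Binary search over block offsets: returns the block whose first term is the
    // greatest term <= target, i.e. where a full term-dict seek would continue.
    static int findBlock(ByteBuffer buf, String target) {
        int lo = 0, hi = buf.limit() / BLOCK_SIZE - 1, ans = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (firstTermOf(buf, mid).compareTo(target) <= 0) { ans = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return ans;
    }

    public static void main(String[] args) {
        ByteBuffer idx = write(List.of("apple", "lucene", "search", "zebra"));
        System.out.println(findBlock(idx, "merge")); // 1 (block starting "lucene")
    }
}
```

With an mmap'd file, the ByteBuffer here would come from the OS page cache, so "opening" the index costs nothing and forks share the backing buffers.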
Re: Realtime Search
Robert Engels wrote: You are full of **beep** *beep* ... No matter whether you are right or wrong, please keep a civil tone on this public forum. We are professionals here, so let's discuss and disagree if must be - but in a professional and grown-up way. Thank you. -- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
Re: Realtime Search
I think the necessary low-level changes to Lucene for real-time are actually already well underway... The biggest barrier is how we now ask for FieldCache values at the Multi*Reader level. This makes reopen cost catastrophic for a large index. Once we succeed in making FieldCache usage within Lucene segment-centric (LUCENE-1483 = sorting becomes segment-centric; LUCENE-831 = deprecate the old FieldCache API in favor of a segment-centric or iteration API), we are most of the way there. LUCENE-1231 (column-stride fields) should make initing the per-segment FieldCache much faster, though I think that's a nice-to-have for real-time search (because either 1) warming will happen in the BG, or 2) the segment is tiny). So then I think we should start with approach #2 (build real-time on top of the Lucene core) and iterate from there. Newly added docs go into tiny segments, which IndexReader.reopen pulls in. Replaced or deleted docs record the delete against the right SegmentReader (and LUCENE-1314 lets reopen carry those pending deletes forward, in RAM). I would take the simple approach first: use ordinary SegmentReader on a RAMDirectory for the tiny segments. If that proves too slow, swap in Memory/InstantiatedIndex for the tiny segments. If that proves too slow, build a reader impl that reads from the DocumentsWriter RAM buffer. One challenge is reopening after a big merge finishes... we'd need a way to 1) allow the merge to be committed, then 2) start warming a new reader in the BG, but 3) allow newly flushed segments to use the old SegmentReaders reading the segments that were merged (because they are still warm), and 4) once the new reader is warm, we decref old segments and use the new reader going forwards. Alternatively, and maybe simpler, a merge is not allowed to commit until a new SegmentReader has been warmed against the newly merged segment. I'm not sure how best to do this... we may need more info in SegmentInfo[s] to track the genealogy of each segment, or something.
We may need to have IndexWriter give more info when it's modifying SegmentInfos, eg we'd need the reader to access newly flushed segments (IndexWriter does not write a new segments_N until commit). Maybe IndexWriter needs to warm readers... maybe IndexReader.open/reopen needs to be given an IndexWriter and then access its un-flushed in-memory SegmentInfos... not sure. We'd need to fix SegmentReader.get to provide a single instance for a given segment. I agree we'd want a specialized merge policy. EG it should merge RAM segments w/ higher priority, and probably not merge mixed RAM/disk segments. Mike Jason Rutherglen jason.rutherg...@gmail.com wrote: We've discussed realtime search before; it looks like after the next release we can get some sort of realtime search working. I was going to open a new issue but decided it might be best to discuss realtime search on the dev list. Lucene can implement realtime search as the ability to add, update, or delete documents with latency in the sub-5-millisecond range. A couple of different options are available. 1) Expose a rolling set of realtime readers over the memory index used by IndexWriter. Requires incrementally updating field caches and filters, and it is somewhat unclear how IndexReader versioning would work (for example versions of the term dictionary). 2) Implement realtime search by incrementally creating and merging readers in memory. The system would use MemoryIndex or InstantiatedIndex to quickly (more quickly than RAMDirectory) create indexes from added documents. The in-memory indexes would be periodically merged in the background and, according to RAM used, written to disk. Each update would generate a new IndexReader or MultiSearcher that includes the new updates. Field caches and filters could be cached per IndexReader according to how Lucene works today.
The downside of this approach is that the indexing will not be as fast as #1 because of the in-memory merging, which is similar to pre-2.3 Lucene, which merged in-memory segments using RAMDirectory. Are there other implementation options? A new patch would focus on providing in-memory indexing as part of the core of Lucene. The work of LUCENE-1483 and LUCENE-1314 would be used. I am not sure if option #2 can become part of core if it relies on a contrib module? It makes sense to provide a new realtime-oriented merge policy that merges segments based on the number of deletes rather than a merge factor. The realtime merge policy would keep the segments within a minimum and maximum size in kilobytes to limit the time consumed by merging, which it is assumed would occur frequently. LUCENE-1313 includes a transaction log with rollback and was designed with distributed search in mind; it may be retired or its components split out.
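Mike's "a merge is not allowed to commit until a new SegmentReader has been warmed" alternative is essentially a warm-then-publish control structure. Here is a minimal non-Lucene sketch of that idea; every type and method name is invented for illustration:

```java
import java.util.concurrent.atomic.AtomicReference;

// Control-structure sketch: after a big merge, a reader over the merged segment
// is warmed before it is published, so searches never block on a cold reader;
// until the swap, they keep using the old, already-warm reader.
public class WarmThenSwap {
    interface Reader { String describe(); }

    private final AtomicReference<Reader> current;

    WarmThenSwap(Reader initial) { current = new AtomicReference<>(initial); }

    Reader acquire() { return current.get(); }   // searches always get a warm reader

    void onMergeFinished(Reader cold) {
        warm(cold);                              // e.g. preload norms, FieldCache...
        Reader old = current.getAndSet(cold);    // publish only after warming
        // a refcounted implementation would decref 'old' here and close it
        // once in-flight searches release it
    }

    private void warm(Reader r) { r.describe(); /* touch the data structures */ }

    public static void main(String[] args) {
        WarmThenSwap mgr = new WarmThenSwap(() -> "segments _0 _1 _2");
        mgr.onMergeFinished(() -> "merged segment _3");
        System.out.println(mgr.acquire().describe()); // merged segment _3
    }
}
```

The alternative Mike lists first (commit the merge, warm in the background, let new flushes keep referencing the old SegmentReaders) moves the warm() call off the merge path but needs the segment-genealogy bookkeeping he mentions.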
Re: Realtime Search
Thinking about this some more, you could use fixed-length pages for the term index, with a page header containing a count of entries, and use key compression (to avoid the constant entry size). The problem with this is that you still have to decode the entries (slowing the processing - since a simple binary search on the page is not possible). But, if you also add a 'least term and greatest term' to the page header (you can avoid the duplicate storage of these entries as well), you can perform a binary search of the term index much faster. You only need to decode the index page containing (maybe) the desired entry. If you were doing a prefix/range search, you will still end up decoding lots of pages... This is why a database has its own page cache, and usually caches the decoded form (for index pages) for faster processing - at the expense of higher memory usage. Usually data pages are not cached in the decoded/uncompressed form. In most cases the database vendor will recommend removing the OS page cache on the database server, and allocating all of the memory to the database process. You may be able to avoid some of the warm-up of an index using memory-mapped files, but with proper ordering of the writing of the index, it probably isn't necessary. Beyond that, processing the term index directly using NIO does not appear that it will be faster than using an in-process cache of the term index (similar to the skip-to memory index now). The BEST approach is probably to have the index writer build the in-memory skip-to structure as it writes the segment, and then include this in the segment during the reopen - no warming required! As long as the reader and writer are in the same process, it will be a winner! On Dec 23, 2008, at 11:02 PM, robert engels wrote: Seems doubtful you will be able to do this without increasing the index size dramatically.
Since it will need to be stored unpacked (in order to have random access), yet the terms are variable length - leading to using a maximum=minimum size for every term. In the end I highly doubt it will make much difference in speed - here are several reasons why... 1. With fixed-size terms, the additional IO (larger pages) probably offsets a lot of the random access benefit. This is why compressed disks on a fast machine (CPU) are often faster than uncompressed - more data is read during every IO access. 2. With a reopen, only new segments are read, and since it is a new segment, it is most likely already in the disk cache (from the write), so the reopen penalty is negligible (especially if the term index skip-to is written last). 3. If the reopen is after an optimize - when the OS cache has probably been obliterated - then the warm-up time is going to be similar in most cases anyway, since the index pages will also not be in core (in the case of memory-mapped files). Again, writing the skip-to last can help with this. Just because a file is memory mapped does not mean its pages will have a greater likelihood to be in the cache. The locality of reference is going to control this, just as most/often access controls it in the OS disk cache. Also, most OSs will take real memory from the virtual address space and add it to the disk cache if the process is doing lots of IO. If you have a memory-mapped term index, you are still going to need to perform a binary search to find the correct term page, and after an optimize the visited pages will not be in the cache (or in core). On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote: On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote: Is there something that I am missing? Yes. I see lots of references to using memory mapped files to dramatically improve performance. There have been substantial discussions about this design in JIRA, notably LUCENE-1458.
The dramatic improvement is WRT opening/reopening an IndexReader. Presently in both KS and Lucene, certain data structures have to be read at IndexReader startup and unpacked into process memory -- in particular, the term dictionary index and sort caches. If those data structures can be represented by a memory-mapped file rather than built up from scratch, we save big. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
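The page-header scheme described in the message above can be sketched in a few lines: fixed-length pages whose headers carry a least/greatest term, binary-searched so that at most one page ever needs decoding for an exact lookup. This is an illustrative sketch with hypothetical names, with a String[] standing in for the key-compressed page entries.

```java
import java.util.Arrays;

/**
 * Sketch of the page-header idea: the term index is split into pages,
 * each header recording its least and greatest term. A lookup
 * binary-searches the headers, and only the single candidate page needs
 * to be decoded. Hypothetical names throughout, not a Lucene API.
 */
class PagedTermIndex {
    static class Page {
        final String leastTerm, greatestTerm;
        final String[] terms; // stands in for the key-compressed entries
        Page(String[] terms) {
            this.terms = terms;
            this.leastTerm = terms[0];
            this.greatestTerm = terms[terms.length - 1];
        }
    }

    private final Page[] pages; // pages are in term order

    PagedTermIndex(Page[] pages) { this.pages = pages; }

    /** Returns the index of the page that may contain term, or -1. */
    int findPage(String term) {
        int lo = 0, hi = pages.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Page p = pages[mid];
            if (term.compareTo(p.leastTerm) < 0) hi = mid - 1;
            else if (term.compareTo(p.greatestTerm) > 0) lo = mid + 1;
            else return mid; // least <= term <= greatest: decode only this page
        }
        return -1;
    }

    /** Decodes (here: scans) just the one candidate page. */
    boolean contains(String term) {
        int p = findPage(term);
        return p >= 0 && Arrays.asList(pages[p].terms).contains(term);
    }
}
```

As the thread notes, this helps exact lookups; a prefix/range scan would still walk and decode many consecutive pages.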
Re: Realtime Search
On Wednesday 24 December 2008 17:51:04, robert engels wrote: Thinking about this some more, you could use fixed-length pages for the term index, with a page header containing a count of entries, and use key compression (to avoid the constant entry size). The problem with this is that you still have to decode the entries (slowing the processing - since a simple binary search on the page is not possible). The cache between the pages and the CPU is also a bottleneck nowadays. See here: Super-Scalar RAM-CPU Cache Compression, M. Zukowski, S. Heman, N. Nes, P. Boncz - cwi.nl, currently available from this link: http://www.cwi.nl/htbin/ins1/publications?request=pdf&gzkey=ZuHeNeBo:ICDE:06 Also, some preliminary results on Lucene indexes are available at LUCENE-1410. Regards, Paul Elschot
Re: Realtime Search
Jason Rutherglen wrote: 2) Implement realtime search by incrementally creating and merging readers in memory. The system would use MemoryIndex or InstantiatedIndex to quickly (more quickly than RAMDirectory) create indexes from added documents. As a baseline, how fast is it to simply use RAMDirectory? If one, e.g., flushes changes every 10ms or so, and has a background thread that uses IndexReader.reopen() to keep a fresh version for reading? Also, what are the requirements? Must a document be visible to search within 10ms of being added? Or must it be visible to search from the time that the call to add it returns? In the latter case one might still use an approach like the above. Writing a small new segment to a RAMDirectory and then, with no merging, calling IndexReader.reopen(), should be quite fast. All merging could be done in the background, as should post-merge reopens() that involve large segments. In short, I wonder if new reader and writer implementations are in fact required or whether, perhaps with a few optimizations, the existing implementations might meet this need. Doug - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
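Doug's baseline - buffer writes, then periodically publish a fresh reader, analogous to flushing to a RAMDirectory and calling IndexReader.reopen() - can be sketched without the Lucene API at all. The names below are illustrative; a List of strings stands in for the index, and refresh() plays the role of the background reopen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of the buffered-write / background-reopen pattern: writes go to
 * a writer-side buffer, and refresh() publishes an immutable snapshot
 * that searchers read without blocking, analogous to IndexReader.reopen()
 * over a RAMDirectory. Illustrative names, not Lucene's API.
 */
class SnapshotIndex {
    private final List<String> pending = new ArrayList<String>();   // writer side
    private final AtomicReference<List<String>> published =
        new AtomicReference<List<String>>(new ArrayList<String>()); // reader side

    synchronized void addDocument(String doc) {
        pending.add(doc); // buffered: not yet visible to searches
    }

    /** The "reopen": publish a fresh immutable snapshot of everything so far. */
    synchronized void refresh() {
        List<String> snap = new ArrayList<String>(published.get());
        snap.addAll(pending);
        pending.clear();
        published.set(snap); // atomic swap; in-flight searches keep the old snapshot
    }

    boolean isVisible(String doc) {
        return published.get().contains(doc); // searches never block on writes
    }
}
```

Run refresh() from a background thread every ~10ms and added documents become searchable within that interval, which is the trade-off Doug is asking about: per-interval visibility versus visibility on return from add.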
Re: Realtime Search
As I pointed out in another email, I understand the benefits of compression (compressed disks vs. uncompressed, etc.). PFOR is definitely a winner ! As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index open/reopen. I was attempting to point out that this in-memory index is still needed, but there are ways to improve the current process. I don't think a mapped file for the term index is going to work for a variety of reasons. Mapped files are designed as a programming simplification - mainly for older systems that use line delimited files - rather than having to create page/section caches when processing very large files (when only a small portion is used at any given time - ie. the data visible on the screen). When you end up visiting a large portion of the file anyway (to do a full repagination), an in-process intelligent cache is going to be far superior. My review of the Java Buffer related classes does not give me the impression it is going to be faster - in fact it will be slower- than a single copy into user space, and process/decompress there. The Buffer system is suitable when perform little inspection, and then direct copy to another buffer (think reading from a file, and sending out on a socket). If you end up inspecting the buffer, it is going to be very slow. On Dec 24, 2008, at 11:33 AM, Paul Elschot wrote: Op Wednesday 24 December 2008 17:51:04 schreef robert engels: Thinking about this some more, you could use fixed length pages for the term index, with a page header containing a count of entries, and use key compression (to avoid the constant entry size). The problem with this is that you still have to decode the entries (slowing the processing - since a simple binary search on the page is not possible). The cache between the pages and the cpu is also a bottleneck nowadays. 
See here: Super-Scalar RAM-CPU Cache Compression M Zukowski, S Heman, N Nes, P Boncz - cwi.nl currently available from this link: http://www.cwi.nl/htbin/ins1/publications? request=pdfgzkey=ZuHeNeBo:ICDE:06 Also, some preliminary results on lucene indexes are available at LUCENE-1410. Regards, Paul Elschot But, if you also add a 'least term and greatest term' to the page header (you can avoid the duplicate storage of these entries as well), you can perform a binary search of the term index much faster. You only need to decode the index page containing (maybe) the desired entry. If you were doing a prefix/range search, you will still end up decoding lots of pages... This is why a database has their own page cache, and usually caches the decoded form (for index pages) for faster processing - at the expense of higher memory usage. Usually data pages are not cached in the decoded/uncompressed form. In most cases the database vendor will recommend removing the OS page cache on the database server, and allocating all of the memory to the database process. You may be able to avoid some of the warm-up of an index using memory mapped files, but with proper ordering of the writing of the index, it probably isn't necessary. Beyond that, processing the term index directly using NIO does not appear that it will be faster than using an in-process cache of the term index (similar to the skip-to memory index now). The BEST approach is probably to have the index writer build the memory skip to structure as it writes the segment, and then include this in the segment during the reopen - no warming required !. As long as the reader and writer are in the same process, it will be a winner ! On Dec 23, 2008, at 11:02 PM, robert engels wrote: Seems doubtful you will be able to do this without increasing the index size dramatically. 
Since it will need to be stored unpacked (in order to have random access), yet the terms are variable length - leading to using a maximum=minimum size for every term. In the end I highly doubt it will make much difference in speed - here's several reasons why... 1. with fixed size terms, the additional IO (larger pages) probably offsets a lot of the random access benefit. This is why compressed disks on a fast machine (CPU) are often faster than uncompressed - more data is read during every IO access. 2. with a reopen, only new segments are read, and since it is a new segment, it is most likely already in the disk cache (from the write), so the reopen penalty is negligible (especially if the term index skip to is written last). 3. If the reopen is after an optimize - when the OS cache has probably been obliterated, then the warm up time is going to be similar in most cases anyway, since the index pages will also not be in core (in the case of memory mapped files). Again, writing the skip to last can help with this. Just because a file is memory mapped does not mean its pages will have an greater likelihood to be in the cache. The locality of reference is
Re: Realtime Search
Also, what are the requirements? Must a document be visible to search within 10ms of being added? 0-5ms. Otherwise it's not realtime, it's batch indexing. The realtime system can support small batches by encoding them into RAMDirectories if they are of sufficient size. Or must it be visible to search from the time that the call to add it returns? Most people probably expect the update latency offered by SQL databases. As a baseline, how fast is it to simply use RAMDirectory? It depends on how fast searches over the realtime index need to be. The detriment to speed occurs with having many small segments that are continuously decoded (terms, postings, etc). The advantage of MemoryIndex and InstantiatedIndex is an actual increase in search speed compared with RAMDirectory (see the Performance Notes at http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/memory/MemoryIndex.html) and no need to continuously decode segments that are short lived. Anecdotal tests indicated the merging overhead of using RAMDirectory, as compared with MI or II, is significant enough to make it only useful for doing batches in the 1000s, which does not seem to be what people expect from realtime search.
Re: Realtime Search
On Dec 24, 2008, at 12:23 PM, Jason Rutherglen wrote: Also, what are the requirements? Must a document be visible to search within 10ms of being added? 0-5ms. Otherwise it's not realtime, it's batch indexing. The realtime system can support small batches by encoding them into RAMDirectories if they are of sufficient size. Or must it be visible to search from the time that the call to add it returns? Most people probably expect the update latency offered by SQL databases. This is the problem spot. In an SQL database, when an update/add occurs, the same connection/transaction will see the changes when requested IMMEDIATELY - there is 0 latency. In order to do this you MUST have the concept of transactions and/or connections. OR you must make it so that every update/add is immediately available - this is probably simpler. You just need to always search the RAM and the disk index. The deletions must be mapped to the disk index, and the latest version of the document must be obtained from the RAM index (if it is there). You just need to merge the RAM and disk in the background... and continually create new/merged RAM disks. The memory requirements are going to go up, but you can always add a block so that if the background merger gets too far behind, the system blocks any current requests (to avoid the system running out of memory).
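The "always search RAM and disk, with deletions mapped to the disk index" scheme robert describes can be sketched as a layered view: the RAM side shadows the disk side, and a tombstone set marks disk documents as superseded or deleted. A minimal sketch with hypothetical names; maps stand in for the two indexes, and the background merger is left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of a RAM-over-disk view: every lookup consults the RAM index
 * first, deletes against the immutable disk index are tracked in a
 * tombstone set, and a background merger would periodically fold the RAM
 * side into a new disk segment. Illustrative names, not a Lucene API.
 */
class RamOverDiskView {
    private final Map<String, String> disk;  // id -> stored doc (immutable segment)
    private final Map<String, String> ram = new HashMap<String, String>();
    private final Set<String> diskDeletes = new HashSet<String>();

    RamOverDiskView(Map<String, String> disk) { this.disk = disk; }

    void update(String id, String doc) {
        ram.put(id, doc);    // immediately searchable: 0 latency
        diskDeletes.add(id); // shadow any older disk version
    }

    void delete(String id) {
        ram.remove(id);
        diskDeletes.add(id);
    }

    /** The latest version comes from RAM if present, else undeleted disk. */
    String get(String id) {
        String r = ram.get(id);
        if (r != null) return r;
        return diskDeletes.contains(id) ? null : disk.get(id);
    }
}
```

The memory-pressure backstop robert mentions would sit on top of this: if the RAM side grows past a bound before the merger catches up, update() blocks.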
Re: Realtime Search
On Tue, Dec 23, 2008 at 11:02:56PM -0600, robert engels wrote: Seems doubtful you will be able to do this without increasing the index size dramatically. Since it will need to be stored unpacked (in order to have random access), yet the terms are variable length - leading to using a maximum=minimum size for every term. Wow. That's a spectacularly awful design. Its worst case -- one outlier term, say, 1000 characters in length, in a field where the average term length is in the single digits -- would explode the index size and incur wasteful IO overhead, just as you say. Good thing we've never considered it. :) I'm hoping we can improve on this, but for now, we've ended up at a two-file design for the term dictionary index. 1) Stacked 64-bit file pointers. 2) Variable-length character and term info data, interpreted using a pluggable codec. In the index at least, each entry would contain the full term text, encoded as UTF-8. Probably the primary term dictionary would continue to use string diffs. That design offers no significant benefits other than those that flow from compatibility with mmap: faster IndexReader open/reopen, lower RAM usage under multiple processes by way of buffer sharing. IO bandwidth requirements and speed are probably a little better, but lookups on the term dictionary index are not a significant search-time bottleneck. Additionally, sort caches would be written at index time in three files, and memory mapped as laid out in https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150. 1) Stacked 64-bit file pointers. 2) Character data. 3) Doc num to ord mapping. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
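The two-file layout Marvin describes avoids the fixed-width-term problem: the pointer file has fixed-width 64-bit entries, so entry i is found by arithmetic alone, while the term data stays variable length. A sketch under stated assumptions - in-memory arrays stand in for the two mapped files, and the class and method names are hypothetical, not the Lucy/KS format.

```java
import java.nio.charset.Charset;

/**
 * Sketch of a two-file term dictionary index: "file 1" is a stack of
 * fixed-width 64-bit pointers, "file 2" holds variable-length UTF-8 term
 * text. Fixed-width pointers mean entry i is located by arithmetic, so
 * the structure could be memory mapped and binary searched with no
 * up-front unpacking. Arrays stand in for the mapped files here.
 */
class TwoFileTermIndex {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private final long[] pointers; // "file 1": byte offset of each term in the blob
    private final byte[] blob;     // "file 2": concatenated UTF-8 term text

    TwoFileTermIndex(String[] sortedTerms) {
        pointers = new long[sortedTerms.length];
        StringBuilder sb = new StringBuilder();
        long off = 0;
        for (int i = 0; i < sortedTerms.length; i++) {
            pointers[i] = off;
            off += sortedTerms[i].getBytes(UTF8).length;
            sb.append(sortedTerms[i]);
        }
        blob = sb.toString().getBytes(UTF8);
    }

    /** Reads entry i straight out of the "mapped" data: no decode pass. */
    String term(int i) {
        int start = (int) pointers[i];
        int end = (i + 1 < pointers.length) ? (int) pointers[i + 1] : blob.length;
        return new String(blob, start, end - start, UTF8);
    }

    /** Binary search over entries without unpacking the whole index. */
    int find(String target) {
        int lo = 0, hi = pointers.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = term(mid).compareTo(target);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1;
    }
}
```

This is what makes reopen cheap in Marvin's argument: nothing is built at IndexReader startup, because the on-disk representation is already directly searchable.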
Re: Realtime Search
On Wed, Dec 24, 2008 at 12:02:24PM -0600, robert engels wrote: As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index open/reopen. No. That idea was entertained briefly and quickly discarded. There seems to be an awful lot of irrelevant noise in the current thread arising due to lack of familiarity with the ongoing discussions in JIRA. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
On Tue, Dec 23, 2008 at 05:51:43PM -0800, Jason Rutherglen wrote: Are there other implementation options? Here's the plan for Lucy/KS: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. 2) Enable segment-centric sorted search. (LUCENE-1483) 3) Implement tombstone-based deletions, so that the cost of deleting documents scales with the number of deletions rather than the size of the index. 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
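Plan item 3 above - tombstone-based deletions - can be shown in a few lines. The contrast is with a per-segment deletions bit vector, whose write cost scales with segment size; with tombstones, each delete is a constant-cost append, and hits are filtered against the tombstone set at search time. Hypothetical names; this is a sketch of the idea, not the Lucy/KS implementation.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of tombstone-based deletions: instead of rewriting a bit vector
 * sized to the whole segment on every commit, each delete records a
 * tombstone, so the cost of deleting scales with the number of deletions
 * rather than the size of the index. Hits are filtered at search time.
 */
class TombstoneDeletes {
    private final Set<Integer> tombstones = new HashSet<Integer>();

    /** O(1) per delete, independent of segment size. */
    void delete(int docId) { tombstones.add(docId); }

    boolean isDeleted(int docId) { return tombstones.contains(docId); }

    /** Filter a postings walk against the tombstones. */
    int countLiveDocs(int[] postings) {
        int live = 0;
        for (int docId : postings) {
            if (!isDeleted(docId)) live++;
        }
        return live;
    }
}
```

The trade-off is the usual one: deletes get cheaper to write, searches pay a per-hit membership check until the tombstones are merged away.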
Re: Realtime Search
Is there something that I am missing? I see lots of references to using memory mapped files to dramatically improve performance. I don't think this is the case at all. At the lowest levels, it is somewhat more efficient from a CPU standpoint, but with a decent OS cache the IO performance difference is going to be negligible. The primary benefit of memory mapped files is simplicity in code (although in Java there is another layer needed - think C), and the file can be treated as a randomly accessible memory array. From my OS design experience, the page at http://en.wikipedia.org/wiki/Memory-mapped_file is incorrect. Even if the memory mapped file is mapped into the virtual memory space, unless you have specialized memory controllers and disk systems, when a page fault occurs, the OS loads the page just as any other. The difference with direct IO is that there is first a simple translation from position to disk page, and the OS disk page cache is checked. Almost exactly the same thing occurs with a memory mapped file. The memory address is accessed; if not in memory, a page fault occurs, and the page is loaded from the file (it may be loaded from the OS disk cache in this process). The point being, if the page is not in the cache (which is probably the case with a large index), the time to load the page is far greater than the difference between the IO address translation and the memory address lookup. If all of the pages of the index can fit in memory, a properly configured system is going to have them in the page cache anyway. On Dec 23, 2008, at 8:22 PM, Marvin Humphrey wrote: On Tue, Dec 23, 2008 at 05:51:43PM -0800, Jason Rutherglen wrote: Are there other implementation options? Here's the plan for Lucy/KS: 1) Design index formats that can be memory mapped rather than slurped, bringing the cost of opening/reopening an IndexReader down to a negligible level. 2) Enable segment-centric sorted search.
(LUCENE-1483) 3) Implement tombstone-based deletions, so that the cost of deleting documents scales with the number of deletions rather than the size of the index. 4) Allow 2 concurrent writers: one for small, fast updates, and one for big background merges. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote: Is there something that I am missing? Yes. I see lots of references to using memory mapped files to dramatically improve performance. There have been substantial discussions about this design in JIRA, notably LUCENE-1458. The dramatic improvement is WRT opening/reopening an IndexReader. Presently in both KS and Lucene, certain data structures have to be read at IndexReader startup and unpacked into process memory -- in particular, the term dictionary index and sort caches. If those data structures can be represented by a memory-mapped file rather than built up from scratch, we save big. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Seems doubtful you will be able to do this without increasing the index size dramatically. Since it will need to be stored unpacked (in order to have random access), yet the terms are variable length - leading to using a maximum=minimum size for every term. In the end I highly doubt it will make much difference in speed - here are several reasons why... 1. With fixed-size terms, the additional IO (larger pages) probably offsets a lot of the random access benefit. This is why compressed disks on a fast machine (CPU) are often faster than uncompressed - more data is read during every IO access. 2. With a reopen, only new segments are read, and since it is a new segment, it is most likely already in the disk cache (from the write), so the reopen penalty is negligible (especially if the term index skip-to is written last). 3. If the reopen is after an optimize - when the OS cache has probably been obliterated - then the warm-up time is going to be similar in most cases anyway, since the index pages will also not be in core (in the case of memory-mapped files). Again, writing the skip-to last can help with this. Just because a file is memory mapped does not mean its pages will have a greater likelihood to be in the cache. The locality of reference is going to control this, just as most/often access controls it in the OS disk cache. Also, most OSs will take real memory from the virtual address space and add it to the disk cache if the process is doing lots of IO. If you have a memory-mapped term index, you are still going to need to perform a binary search to find the correct term page, and after an optimize the visited pages will not be in the cache (or in core). On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote: On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote: Is there something that I am missing? Yes. I see lots of references to using memory mapped files to dramatically improve performance.
There have been substantial discussions about this design in JIRA, notably LUCENE-1458. The dramatic improvement is WRT opening/reopening an IndexReader. Presently in both KS and Lucene, certain data structures have to be read at IndexReader startup and unpacked into process memory -- in particular, the term dictionary index and sort caches. If those data structures can be represented by a memory-mapped file rather than built up from scratch, we save big. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Realtime Search
Also, if you are thinking that accessing the buffer directly will be faster than parsing the packed structure, I'm not so sure. You can review the source for the various buffers, and since there is no struct support in Java, you end up combining bytes to make longs, etc. Also, a lot of the accesses are through Unsafe, which is slower than the indirection on a Java object to access a field. My review of these classes makes me think that parsing the skip-to index once into Java objects for later use is going to be a lot faster overall than accessing the entire mapped file directly on every invocation. On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote: On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote: Is there something that I am missing? Yes. I see lots of references to using memory mapped files to dramatically improve performance. There have been substantial discussions about this design in JIRA, notably LUCENE-1458. The dramatic improvement is WRT opening/reopening an IndexReader. Presently in both KS and Lucene, certain data structures have to be read at IndexReader startup and unpacked into process memory -- in particular, the term dictionary index and sort caches. If those data structures can be represented by a memory-mapped file rather than built up from scratch, we save big. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
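The "combining bytes to make longs" point above is concrete: with no struct support, a direct buffer access re-assembles each primitive on every call, while a one-time parse pays that cost once and leaves plain array indexing. A small illustration of the two access patterns (the trade-off claim itself is robert's; this just shows what each pattern looks like).

```java
import java.nio.ByteBuffer;

/**
 * Illustration of the two access patterns discussed above:
 * decode-per-access against a raw buffer versus decoding the whole
 * structure once into plain Java longs.
 */
class BufferVsParsed {
    /** Decode-per-access: bytes are combined into a long on every call. */
    static long entryFromBuffer(ByteBuffer buf, int i) {
        return buf.getLong(i * 8);
    }

    /** Decode-once: unpack the buffer up front; later reads are array indexing. */
    static long[] parseOnce(ByteBuffer buf) {
        long[] entries = new long[buf.remaining() / 8];
        for (int i = 0; i < entries.length; i++) {
            entries[i] = buf.getLong(i * 8);
        }
        return entries;
    }
}
```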
Re: Realtime Search for Social Networks Collaboration
archive based indexes which were used less (yes, the search engine default search was on data no more than 1 month old, though the user could open the time window by including archives). As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR). Notice that real-time CRUD operations and transactionality are core DB concepts and have been studied and developed by database communities for quite a long time. There have been recent efforts on how to efficiently integrate Lucene into relational databases (see the Lucene JVM ORACLE integration, http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html). I think we should seriously look at joining efforts with open-source database engine projects written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM once and for all. -- Joaquin I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion? If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast.
But Lucene itself offers no replication mechanism, so maybe the replication is something to figure out separately, say on the Solr level, later on once we get there. I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding more! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, September 4, 2008 10:13:32 AM Subject: Re: Realtime Search for Social Networks Collaboration On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote: I also think it's got a lot of things now which makes integration difficult to do properly. I agree, and that's why the major bump in version number rather than minor - we recognize that some features will need some amount of rearchitecture. I think the problem with integration with SOLR is it was designed with a different problem set in mind than Ocean, originally the CNET shopping application. That was the first use of Solr, but it actually existed before that w/o any defined use other than to be a plan B alternative to MySQL-based search servers (that's actually where some of the parameter names come from... the default /select URL instead of /search, the rows parameter, etc). But you're right... some things like the replication strategy were designed (well, borrowed from Doug to be exact) with the idea that it would be OK to have slightly stale views of the data in the range of minutes. It just made things easier/possible at the time.
But tons of Solr and Lucene users want almost instantaneous visibility of added documents, if they can get it. It's hardly restricted to social network applications. Bottom line is that Solr aims to be a general enterprise search platform, and getting as real-time as we can get, and as scalable as we can get are some of the top priorities going forward. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands
Re: Realtime Search for Social Networks Collaboration
Hi Mike, How do column-stride fields work for StringIndex field caching? I have been working on the tag index, which may be more suitable for field caching and makes range queries faster. It is something that would be good to integrate into core Lucene as well. It may be more suitable for many situations. Perhaps the column stride and tag index can be merged? What is the progress on cs? Reopen then must only materialize any buffered deletes by Term or Query, unless we decide to move up that materialization into the actual delete call, since we will have SegmentReaders open anyway. I think I'm leaning towards that approach... best to pay the cost as you go, instead of an aggregated cost on reopen? I don't follow this part. There is an IndexReader exposed from IndexWriter. I think the individual SegmentReaders should be exposed as well; I don't see any reason not to, and there are many cases where it has been frustrating that SegmentReaders are package protected. I am not sure from what you mentioned how the deletedDocs bitvector is handled. On Fri, Sep 19, 2008 at 8:30 AM, Michael McCandless [EMAIL PROTECTED] wrote: Jason Rutherglen wrote: Mike, The other issue that will occur that I addressed is the field caches. The underlying smaller IndexReaders will need to be exposed because of the field caching. Currently in Ocean realtime search the individual readers are searched using a MultiSearcher in order to search in parallel and reuse the field caches. How will field caching work with the IndexWriter approach? It seems like it would need a dynamically growing field cache array? That is a bit tricky. By doing in-memory merging in Ocean, the field caches last longer and do not require growing arrays.
First off, I think the combination of LUCENE-1231 and LUCENE-831, which should result in a FieldCache that is distributed down to each SegmentReader and much faster to initialize, should make incrementally updating the FieldCache much more efficient (ie, on calling IndexReader.reopen, it should only be the new segments that need to populate their FieldCache). Hopefully these land before real-time search, because then I have more API flexibility to expose column-stride fields on the in-RAM documents. There is still some trickiness, because an ordinary IndexWriter would never hold the column-stride fields in RAM. They'd be flushed to the Directory, immediately per document, just like stored fields and term vectors are today. So, maybe, the first RAMReader you get from the IndexWriter would load back in these fields, triggering IndexWriter to add to them as documents are added (maybe using exponentially growing arrays as the underlying store, or, perhaps, separate array fragments, to prevent synchronization when reading from them), such that subsequent reopens simply resync their max docID. How do you plan to handle rapidly deleting the docs of the disk segments? Can the SegmentReader clone patch be used for this? I was thinking we'd flush new .del files every time a reopen is called, but that could very well be costly. Instead, we can keep the deletes pending in the SegmentReaders we're holding open, and then go back to flushing on IndexWriter's normal schedule. Reopen then must only materialize any buffered deletes by Term or Query, unless we decide to move up that materialization into the actual delete call, since we will have SegmentReaders open anyway. I think I'm leaning towards that approach... best to pay the cost as you go, instead of an aggregated cost on reopen? Mike
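The per-segment FieldCache idea above (only new segments pay the load cost on reopen) can be sketched roughly as follows. This is a hypothetical illustration, not Lucene's actual API: the class and method names are invented, and an int array stands in for values loaded off disk.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Sketch: a field cache keyed per segment, so that after a reopen only
// segments not seen before populate a cache entry; unchanged segments
// reuse the arrays already loaded.
class PerSegmentFieldCache {
    // Weak keys let entries be collected once a segment reader goes away.
    private final Map<Object, int[]> cache = new WeakHashMap<>();
    int loads = 0; // counts how many segments actually had to load

    int[] getInts(Object segmentKey, int maxDoc) {
        int[] values = cache.get(segmentKey);
        if (values == null) {
            values = new int[maxDoc]; // stand-in for reading field values from the segment
            cache.put(segmentKey, values);
            loads++;
        }
        return values;
    }
}
```

On a simulated reopen, asking again for a segment already in the cache returns the same array without reloading; only a newly flushed segment increments the load count.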
Re: Realtime Search for Social Networks Collaboration
Mike, The other issue that will occur that I addressed is the field caches. The underlying smaller IndexReaders will need to be exposed because of the field caching. Currently in Ocean realtime search the individual readers are searched using a MultiSearcher in order to search in parallel and reuse the field caches. How will field caching work with the IndexWriter approach? It seems like it would need a dynamically growing field cache array? That is a bit tricky. By doing in-memory merging in Ocean, the field caches last longer and do not require growing arrays. How do you plan to handle rapidly deleting the docs of the disk segments? Can the SegmentReader clone patch be used for this? Jason On Thu, Sep 11, 2008 at 8:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: Right, there would need to be a snapshot taken of all terms when IndexWriter.getReader() is called. This snapshot would 1) hold a frozen int docFreq per term, and 2) sort the terms so TermEnum can just step through them. (We might be able to delay this sorting until the first time something asks for it). Also, it must merge this data from all threads, since each thread holds its hash per field. I've got a rough start at coding this up... The costs are clearly growing, in order to keep the point-in-time feature of this RAMIndexReader, but I think they are still well contained unless you have a really huge RAM buffer. Flushing is still tricky because we cannot recycle the byte block buffers until all running TermDocs/TermPositions iterations are finished. Alternatively, I may just allocate new byte blocks and allow the old ones to be GC'd on their own once running iterations are finished. Mike Jason Rutherglen wrote: Hi Mike, There would be a new sorted list or something to replace the hashtable? Seems like an issue that is not solved.
Jason On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think. Mike Jason Rutherglen wrote: Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. Flushing is somewhat tricky because any open RAM readers would then have to cut over to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes a SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files.
Or, maybe, just open new IndexInputs? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
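The point-in-time RAMReader described in this thread amounts to enumerating the live postings with a cap on docID recorded at open time: documents added after the "open" have higher docIDs and are simply never returned. A minimal sketch of that cap, with invented class names standing in for the real postings structures:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the RAMReader idea: a point-in-time view over live, still-growing
// postings is just the shared list plus a max-docID limit taken when the
// reader was opened. Enumeration stops at the limit, so docs added later
// stay invisible to that reader. Names here are illustrative, not Lucene's.
class LivePostings {
    private final List<Integer> docIDs = new ArrayList<>(); // ascending docIDs for one term

    void addDoc(int docID) { docIDs.add(docID); }

    int maxDocID() { return docIDs.isEmpty() ? -1 : docIDs.get(docIDs.size() - 1); }

    // Enumerate only up to the point-in-time cap recorded at "open" time.
    List<Integer> snapshotDocs(int maxDocIDInclusive) {
        List<Integer> out = new ArrayList<>();
        for (int d : docIDs) {
            if (d > maxDocIDInclusive) break; // beyond the snapshot limit: stop
            out.add(d);
        }
        return out;
    }
}
```

A reader "opened" before a later addDoc sees only the earlier docs, while a fresh cap sees everything, which is the r1/r2 behavior discussed above.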
Re: Realtime Search for Social Networks Collaboration
Right, there would need to be a snapshot taken of all terms when IndexWriter.getReader() is called. This snapshot would 1) hold a frozen int docFreq per term, and 2) sort the terms so TermEnum can just step through them. (We might be able to delay this sorting until the first time something asks for it). Also, it must merge this data from all threads, since each thread holds its hash per field. I've got a rough start at coding this up... The costs are clearly growing, in order to keep the point in time feature of this RAMIndexReader, but I think are still well contained unless you have a really huge RAM buffer. Flushing is still tricky because we cannot recycle the byte block buffers until all running TermDocs/TermPositions iterations are finished. Alternatively, I may just allocate new byte blocks and allow the old ones to be GC'd on their own once running iterations are finished. Mike Jason Rutherglen wrote: Hi Mike, There would be a new sorted list or something to replace the hashtable? Seems like an issue that is not solved. Jason On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think. Mike Jason Rutherglen wrote: Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialiazed into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. 
Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Realtime Search for Social Networks Collaboration
Hi Mike, There would be a new sorted list or something to replace the hashtable? Seems like an issue that is not solved. Jason On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless [EMAIL PROTECTED] wrote: This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think. Mike Jason Rutherglen wrote: Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialiazed into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. 
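The "copied away on reopen" idea can be sketched as follows. This is a hypothetical stand-in for the live DocumentsWriter hashtable, not actual Lucene code: the writer mutates a term -> docFreq map in place, and each reopen takes an immutable copy so a reader's statistics stay stable while indexing continues.

```java
// Hypothetical stand-in for the live DocumentsWriter term hashtable: the
// writer updates docFreq in place; each reopened reader takes a copy so its
// statistics do not shift underneath it.
import java.util.HashMap;
import java.util.Map;

public class DocFreqSnapshot {
    private final Map<String, Integer> live = new HashMap<>();

    /** Called by the writer as it indexes a term into a new document. */
    public void addDoc(String term) {
        live.merge(term, 1, Integer::sum);
    }

    /** The docFreq copy taken when a reader is (re)opened. */
    public Map<String, Integer> reopenSnapshot() {
        return new HashMap<>(live); // this copy is the added cost of reopen
    }
}
```

The copy is exactly the reopen cost Mike mentions: proportional to the number of live terms, paid once per reader view.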
Re: Realtime Search for Social Networks Collaboration
Yonik Seeley wrote: On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. What about something like term freq? Would it need to count the number of docs after the local maxDoc or is there a better way? Good question... I think we'd have to take a full copy of the term -> termFreq on reopen? I don't see how else to do it (I don't understand your suggestion above). So, this will clearly add to the cost of reopen. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Hmmm, seems like a case of our nice and simple Directory model not having quite enough features in this case. I think we can simply open IndexInputs on these files. I believe Java does the right thing on windows, such that if we are already writing to the file, it does not prevent another file handle from opening the file for reading. Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Anything that you want to incrementally update and uses an IndexReader as a key. Mostly caches I would think... Solr has user-level (application specific) caches, faceting caches, etc. Ahh ok. We should just open up access and mark this as advanced?
Mike
Re: Realtime Search for Social Networks Collaboration
This would just tap into the live hashtable that DocumentsWriter* maintain for the posting lists... except the docFreq will need to be copied away on reopen, I think. Mike Jason Rutherglen wrote: Term dictionary? I'm curious how that would be solved?
Re: Realtime Search for Social Networks Collaboration
On Tue, Sep 9, 2008 at 5:28 AM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: What about something like term freq? Would it need to count the number of docs after the local maxDoc or is there a better way? Good question... I think we'd have to take a full copy of the term -> termFreq on reopen? I don't see how else to do it (I don't understand your suggestion above). So, this will clearly add to the cost of reopen. One could adjust the freq by iterating over the term's documents... skipTo(localMaxDoc) and count how many are after that, then subtract from the freq. I didn't say it was a *good* idea :-) For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Hmmm, seems like a case of our nice and simple Directory model not having quite enough features in this case. I think we can simply open IndexInputs on these files. I believe Java does the right thing on windows, such that if we are already writing to the file, it does not prevent another file handle from opening the file for reading. Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. -Yonik
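Yonik's (self-deprecated) alternative would look roughly like this in miniature: instead of copying the freq on reopen, count the term's postings at or past the snapshot boundary and subtract them from the live freq. Names are illustrative and a sorted List of docIDs stands in for a real postings enumeration.

```java
// Sketch of adjusting a live docFreq down to a snapshot: count postings at
// or beyond localMaxDoc (docs added after the snapshot) and subtract.
// AdjustedDocFreq is a made-up name, not a Lucene class.
import java.util.List;

public class AdjustedDocFreq {
    /**
     * @param postings    sorted docIDs for one term, including docs added
     *                    after the snapshot was taken
     * @param liveFreq    the writer's current docFreq for the term
     * @param localMaxDoc the snapshot's maxDoc limit
     */
    public static int docFreqAsOf(List<Integer> postings, int liveFreq, int localMaxDoc) {
        int beyond = 0;
        for (int docID : postings) {
            if (docID >= localMaxDoc) {
                beyond++; // this doc arrived after the snapshot; uncount it
            }
        }
        return liveFreq - beyond;
    }
}
```

The trade-off the thread identifies is visible here: reopen gets cheap, but every docFreq lookup pays a postings walk, which is why the copy-on-reopen approach was preferred.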
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 4:23 PM, Yonik Seeley [EMAIL PROTECTED] wrote: I thought an index reader which supports real-time search no longer maintains a static view of an index? It seems advantageous to just make it really cheap to get a new view of the index (if you do it for every search, it amounts to the same thing, right?) Sounds like these light-weight views of the index are backed up by something dynamic, right? Quite a bit of code in Lucene assumes a static view of the Index I think (even IndexSearcher), and it's nice to have a stable index view for the duration of a single request. Agree. On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley [EMAIL PROTECTED] wrote: Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. We cannot assume it's always RandomAccessFile, can we? So we may have to flush after writing each document. Even so, this may not be sufficient for some FS such as HDFS... Is it reasonable in this case to keep in memory everything including stored fields and term vectors? Cheers, Ning
Re: Realtime Search for Social Networks Collaboration
On Tue, Sep 9, 2008 at 11:42 AM, Ning Li [EMAIL PROTECTED] wrote: On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley [EMAIL PROTECTED] wrote: Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. We cannot assume it's always RandomAccessFile, can we? No, it would essentially be a change in the semantics that all implementations would need to support. So we may have to flush after writing each document. Flush when creating a new index view (which could possibly be after every document is added, but doesn't have to be). Even so, this may not be sufficient for some FS such as HDFS... Is it reasonable in this case to keep in memory everything including stored fields and term vectors? We could maybe do something like a proxy IndexInput/IndexOutput that would allow updating the read buffer from the writer buffer. -Yonik
Re: Realtime Search for Social Networks Collaboration
Yonik Seeley wrote: On Tue, Sep 9, 2008 at 5:28 AM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: What about something like term freq? Would it need to count the number of docs after the local maxDoc or is there a better way? Good question... I think we'd have to take a full copy of the term -> termFreq on reopen? I don't see how else to do it (I don't understand your suggestion above). So, this will clearly add to the cost of reopen. One could adjust the freq by iterating over the term's documents... skipTo(localMaxDoc) and count how many are after that, then subtract from the freq. I didn't say it was a *good* idea :-) Ahh, OK :) For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Hmmm, seems like a case of our nice and simple Directory model not having quite enough features in this case. I think we can simply open IndexInputs on these files. I believe Java does the right thing on windows, such that if we are already writing to the file, it does not prevent another file handle from opening the file for reading. Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. All writes to these files are append only, and, when we open the IndexInput we would never read beyond its current length (once we flush our IndexOutput) because that's the local maxDocID limit.
Mike
Re: Realtime Search for Social Networks Collaboration
Yonik Seeley wrote: On Tue, Sep 9, 2008 at 11:42 AM, Ning Li [EMAIL PROTECTED] wrote: On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley [EMAIL PROTECTED] wrote: Yeah, I think the underlying RandomAccessFile might do the right thing, but IndexInput isn't required to see any changes on the fly (and current implementations don't) so at a minimum it would be a change of IndexInput semantics. Maybe there would need to be a refresh() function added, or we would need to require a specific Directory impl? OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. We cannot assume it's always RandomAccessFile, can we? No, it would essentially be a change in the semantics that all implementations would need to support. Right, which is that you are allowed to open an IndexInput on a file when an IndexOutput has that same file open and is still appending to it. So we may have to flush after writing each document. Flush when creating a new index view (which could possibly be after every document is added, but doesn't have to be). Assuming we can make the above semantics requirement change to IndexInput, we don't need to flush on opening a new RAM reader? Even so, this may not be sufficient for some FS such as HDFS... Is it reasonable in this case to keep in memory everything including stored fields and term vectors? We could maybe do something like a proxy IndexInput/IndexOutput that would allow updating the read buffer from the writer buffer. Does HDFS disallow a reader from reading a file that's still open for append? Mike
Re: Realtime Search for Social Networks Collaboration
On Tue, Sep 9, 2008 at 12:41 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: OR, if all writes are append-only, perhaps we don't ever need to invalidate the read buffer and would just need to remove the current logic that caches the file length and then let the underlying RandomAccessFile do the EOF checking. All writes to these files are append only, and, when we open the IndexInput we would never read beyond its current length (once we flush our IndexOutput) because that's the local maxDocID limit. Right, but it would be nice to not have to open a new IndexInput for each snapshot... opening a file is not a quick operation. -Yonik
Re: Realtime Search for Social Networks Collaboration
On Tue, Sep 9, 2008 at 12:45 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: No, it would essentially be a change in the semantics that all implementations would need to support. Right, which is that you are allowed to open an IndexInput on a file when an IndexOutput has that same file open and is still appending to it. Not just that, but that the size can actually grow after the IndexInput has been opened, and that should be visible. That would seem necessary for sharing the IndexInput (via a clone). So we may have to flush after writing each document. Flush when creating a new index view (which could possibly be after every document is added, but doesn't have to be). Assuming we can make the above semantics requirement change to IndexInput, we don't need to flush on opening a new RAM reader? Yes, we would need to flush... I was just pointing out that you don't necessarily need a new RAM reader for every document added (but that is the worst case scenario). -Yonik
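The semantics being debated here can be checked with plain java.io, independent of Lucene's Directory abstraction: a reader opened on a file that a writer still has open can see later appends, provided it re-queries the file length instead of caching it. This is only a behavioral sketch (AppendWhileReading is a made-up name), not a proposed IndexInput change.

```java
// Behavioral check: open a file for reading while a writer still has it
// open and keeps appending; re-query the length on the reader side rather
// than caching it, as discussed for append-only IndexInput semantics.
import java.io.File;
import java.io.RandomAccessFile;

public class AppendWhileReading {
    public static String readAfterAppend() throws Exception {
        File f = File.createTempFile("append-demo", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile writer = new RandomAccessFile(f, "rw");
             RandomAccessFile reader = new RandomAccessFile(f, "r")) {
            writer.writeBytes("abc");
            // The reader was opened before this second append; because
            // RandomAccessFile writes are unbuffered, the bytes are visible.
            writer.writeBytes("def");
            byte[] buf = new byte[(int) reader.length()]; // re-query length
            reader.readFully(buf);
            return new String(buf, "US-ASCII");
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAfterAppend()); // prints abcdef
    }
}
```

This matches Mike's observation that both file handles can coexist; Yonik's caveat is that Lucene's IndexInput implementations add their own length caching and buffering on top, which is exactly what would have to change.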
Re: Realtime Search for Social Networks Collaboration
Even so, this may not be sufficient for some FS such as HDFS... Is it reasonable in this case to keep in memory everything including stored fields and term vectors? We could maybe do something like a proxy IndexInput/IndexOutput that would allow updating the read buffer from the writer buffer. Does HDFS disallow a reader from reading a file that's still open for append? HDFS allows that. A reader is guaranteed to be able to read data that was 'flushed' before the reader opened the file. However, it may not see the latest appends (after open) even if they are flushed. Yonik's comments below also apply in this case. Right, but it would be nice to not have to open a new IndexInput for each snapshot... opening a file is not a quick operation. Cheers, Ning
Re: Realtime Search for Social Networks Collaboration
Hi Joaquin, Using HBase with realtime Lucene would be in line with what Google does. However the question is whether or not this is completely necessary or the most simple approach. That probably can only be answered by doing a live comparison of the two! Unfortunately that would require probably quite a bit of work and resources. For now, Ocean stores the data in the Lucene indexes because it works, it's easy to implement etc. I have looked at other options, however they need to be prioritized in terms of need vs cost. I would put the HBase solution possibly at the high end of the resource scale. I think usually it's best to keep things as simple as possible and as cheap as possible. More complexity in a scalable realtime search solution would mean more people, more expertise, and more possibilities for breakage. It would need to be clear what HBase or other solutions for storing the data brought to the table, which because I don't have time to look at them, I cannot answer. Nonetheless it is somewhat interesting. Cheers, Jason Rutherglen On Sun, Sep 7, 2008 at 11:16 AM, J. Delgado [EMAIL PROTECTED] wrote: On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED] wrote: for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were semi-structured systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard. Hey, maybe the right way to go for a truly scalable and high performance semi-structured database is to marry HBase (Big-table like data storage) with SOLR/Lucene. I concur with you in the sense that simplistic data models coupled with high performance are the killer.
Let me quote this from the original Bigtable paper from Google: Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Re: Realtime Search for Social Networks Collaboration
Hi, We experimented using HBase's scalable infrastructure to scale out Lucene: http://www.mail-archive.com/[EMAIL PROTECTED]/msg01143.html There is the concern on the impact of HDFS's random read performance on Lucene search performance. And we can discuss if HBase's architecture is best for scale-out Lucene. But to me, the general idea of reusing a scalable infrastructure (if a suitable one exists) is appealing - such an infrastructure already handles repartitioning for scalability, fault tolerance etc. I agree with Otis that the first step for Lucene is probably to support real-time search. The instantiated index in contrib seems to be something close... Cheers, Ning
Re: Realtime Search for Social Networks Collaboration
Ning Li wrote: I agree with Otis that the first step for Lucene is probably to support real-time search. The instantiated index in contrib seems to be something close... Maybe we should start fleshing out what we want in realtime search on the wiki? Could it be as simple as making InstantiatedIndex realtime (allow writes/reads at the same time?). Then you could search over your IndexReader as well as the InstantiatedIndex. Writes go to both the Writer and the InstantiatedIndex. Nothing is actually permanent until the true commit, but stuff is visible pretty fast... a new IndexReader view starts a fresh InstantiatedIndex... Jason's realtime patch is still pretty large... would be nice if we could accomplish this with as few changes as possible...
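Mark's dual-write scheme reduces to a small model. The sketch below uses hypothetical names, with Lists standing in for IndexWriter and InstantiatedIndex and substring matching standing in for real search: adds go to both the durable writer and an in-memory index, search spans the committed view plus the in-memory view, and commit promotes the writer's buffer to the committed view and starts a fresh in-memory index.

```java
// Toy model of the dual-write idea: documents are searchable immediately via
// the in-memory side, and become part of the committed reader view at commit.
// All names are illustrative stand-ins, not Lucene/InstantiatedIndex API.
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class DualWriteIndex {
    private final List<String> writerBuffer = new ArrayList<>(); // IndexWriter stand-in
    private List<String> committedView = new ArrayList<>();      // IndexReader stand-in
    private final List<String> inMemory = new ArrayList<>();     // InstantiatedIndex stand-in

    public void addDocument(String doc) {
        writerBuffer.add(doc); // durable, but not searchable until commit
        inMemory.add(doc);     // searchable right away
    }

    public void commit() {
        committedView = new ArrayList<>(writerBuffer); // new reader view
        inMemory.clear();                              // fresh in-memory index
    }

    /** Counts matching docs across the committed view and the in-memory index. */
    public long search(String term) {
        return Stream.concat(committedView.stream(), inMemory.stream())
                     .filter(d -> d.contains(term))
                     .count();
    }
}
```

Because commit replaces the committed view and clears the in-memory side in one step, no document is ever counted twice, which is the property that makes the scheme workable.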
Re: Realtime Search for Social Networks Collaboration
InstantiatedIndex isn't quite realtime. Instead a new InstantiatedIndex is created per transaction in Ocean and managed thereafter. This however is fairly easy to build and could offer realtime in Lucene without adding the transaction logging. It would be good to find out what scope is acceptable for a Lucene core version of realtime. Perhaps this basic feature set is good enough.
Re: Realtime Search for Social Networks Collaboration
I'd also like to make time to explore the approach of creating an IndexReader impl. that searches IndexWriter's RAM buffer. I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Mike
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 12:33 PM, Michael McCandless [EMAIL PROTECTED] wrote: I'd also like to make time to explore the approach of creating an IndexReader impl. that searches IndexWriter's RAM buffer. That seems like it could possibly be the best performing approach in the long run. I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. -Yonik
Re: Realtime Search for Social Networks Collaboration
I need to point out that the only thing I know InstantiatedIndex to be great at is read access in the inverted index. It consumes a lot more heap than RAMDirectory and InstantiatedIndexWriter is slightly less efficient than IndexWriter. Please let me know if your experience differs from the above statement. On 8 Sep 2008, at 16:36, Jason Rutherglen wrote: InstantiatedIndex isn't quite realtime. Instead a new InstantiatedIndex is created per transaction in Ocean and managed thereafter. This however is fairly easy to build and could offer realtime in Lucene without adding the transaction logging. It would be good to find out what scope is acceptable for a Lucene core version of realtime. Perhaps this basic feature set is good enough.
Re: Realtime Search for Social Networks Collaboration
Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update?
Mike
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley [EMAIL PROTECTED] wrote: But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? I thought an index reader which supports real-time search no longer maintains a static view of an index? Similar to InstantiatedIndexReader, it will be in sync with an index writer. IndexReader r = indexWriter.getIndexReader(); getIndexReader() (i.e. get real-time index reader) returns the same reader instance for a writer instance. On Mon, Sep 8, 2008 at 12:33 PM, Michael McCandless [EMAIL PROTECTED] wrote: Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Now this won't be a problem any more. Cheers, Ning
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 3:56 PM, Ning Li [EMAIL PROTECTED] wrote: On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley [EMAIL PROTECTED] wrote: But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? I thought an index reader which supports real-time search no longer maintains a static view of an index? It seems advantageous to just make it really cheap to get a new view of the index (if you do it for every search, it amounts to the same thing, right?) Quite a bit of code in Lucene assumes a static view of the index I think (even IndexSearcher), and it's nice to have a stable index view for the duration of a single request. Similar to InstantiatedIndexReader, it will be in sync with an index writer. Right... that's why I was clarifying. You can still make stable views of the index with multiple InstantiatedIndex instances, but it doesn't seem as efficient. -Yonik
Re: Realtime Search for Social Networks Collaboration
That sounds about correct and I don't think it matters much. I cap the number of documents stored in InstantiatedIndex at 100 by default, so the heap size doesn't become a problem. On Mon, Sep 8, 2008 at 2:58 PM, Karl Wettin [EMAIL PROTECTED] wrote: I need to point out that the only thing I know InstantiatedIndex to be great at is read access in the inverted index. It consumes a lot more heap than RAMDirectory and InstantiatedIndexWriter is slightly less efficient than IndexWriter. Please let me know if your experience differs from the above statement. On 8 Sep 2008, at 16:36, Jason Rutherglen wrote: InstantiatedIndex isn't quite realtime. Instead a new InstantiatedIndex is created per transaction in Ocean and managed thereafter. This however is fairly easy to build and could offer realtime in Lucene without adding the transaction logging. It would be good to find out what scope is acceptable for a Lucene core version of realtime. Perhaps this basic feature set is good enough. On Mon, Sep 8, 2008 at 10:23 AM, Mark Miller [EMAIL PROTECTED] wrote: Ning Li wrote: I agree with Otis that the first step for Lucene is probably to support real-time search. The instantiated index in contrib seems to be something close. Maybe we should start fleshing out what we want in realtime search on the wiki? Could it be as simple as making InstantiatedIndex realtime (allow writes/reads at the same time)? Then you could search over your IndexReader as well as the InstantiatedIndex. Writes go to both the Writer and the InstantiatedIndex. Nothing is actually permanent until the true commit, but stuff is visible pretty fast... a new IndexReader view starts a fresh InstantiatedIndex... Jason's realtime patch is still pretty large... would be nice if we could accomplish this with as few changes as possible...
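Mark Miller's dual-write idea above can be sketched as a plain-Java stand-in: each added document goes both to the durable writer path and to a small in-memory index that is searchable at once, and opening a fresh reader view absorbs the writes and starts a new in-memory index. Class and method names here are illustrative, not Lucene or InstantiatedIndex APIs:

```java
import java.util.ArrayList;
import java.util.List;

class DualWriteIndex {
    private final List<String> writer = new ArrayList<>();   // stands in for IndexWriter (durable path)
    private List<String> readerView = new ArrayList<>();     // stands in for the last opened IndexReader
    private List<String> ramIndex = new ArrayList<>();       // stands in for InstantiatedIndex

    void addDocument(String doc) {
        writer.add(doc);    // goes to the Writer (visible only after reopen)
        ramIndex.add(doc);  // and to the in-memory index (visible immediately)
    }

    /** Search covers both the committed reader view and fresh RAM additions. */
    boolean isSearchable(String doc) {
        return readerView.contains(doc) || ramIndex.contains(doc);
    }

    /** A new reader view picks up all writes and starts a fresh RAM index. */
    void reopen() {
        readerView = new ArrayList<>(writer);
        ramIndex = new ArrayList<>();
    }
}
```

Capping the RAM side (as Karl does at 100 documents) bounds the heap cost of the immediately-visible portion between reopens.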
Re: Realtime Search for Social Networks Collaboration
Term dictionary? I'm curious how that would be solved? On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Yonik Seeley wrote: I think it's quite feasible, but, it'd still have a reopen cost in that any buffered delete by term or query would have to be materialized into docIDs on reopen. Though, if this somehow turns out to be a problem, in the future we could do this materializing immediately, instead of buffering, if we already have a reader open. Right... it seems like re-using readers internally is something we could already be doing in IndexWriter. True. Flushing is somewhat tricky because any open RAM readers would then have to cutover to the newly flushed segment once the flush completes, so that the RAM buffer can be recycled for the next segment. Re-use of a RAM buffer doesn't seem like such a big deal. But, how would you maintain a static view of an index...? IndexReader r1 = indexWriter.getCurrentIndex() indexWriter.addDocument(...) IndexReader r2 = indexWriter.getCurrentIndex() I assume r1 will have a view of the index before the document was added, and r2 after? Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult.
Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Mike
Re: Realtime Search for Social Networks Collaboration
On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless [EMAIL PROTECTED] wrote: Right, getCurrentIndex would return a MultiReader that includes SegmentReader for each segment in the index, plus a RAMReader that searches the RAM buffer. That RAMReader is a tiny shell class that would basically just record the max docID it's allowed to go up to (the docID as of when it was opened), and stop enumerating docIDs (eg in the TermDocs) when it hits a docID beyond that limit. What about something like term freq? Would it need to count the number of docs after the local maxDoc or is there a better way? For reading stored fields and term vectors, which are now flushed immediately to disk, we need to somehow get an IndexInput from the IndexOutputs that IndexWriter holds open on these files. Or, maybe, just open new IndexInputs? Hmmm, seems like a case of our nice and simple Directory model not having quite enough features in this case. Another thing that will help is if users could get their hands on the sub-readers of a multi-segment reader. Right now that is hidden in MultiSegmentReader and makes updating anything incrementally difficult. Besides what's handled by MultiSegmentReader.reopen already, what else do you need to incrementally update? Anything that you want to incrementally update and uses an IndexReader as a key. Mostly caches I would think... Solr has user-level (application specific) caches, faceting caches, etc. -Yonik
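Yonik's point about caches keyed by IndexReader is why exposing sub-readers matters: if a cache is keyed per sub-reader (segment) rather than per top-level reader, a reopen that keeps most segments also keeps most cache entries. A minimal sketch of that keying, with an illustrative `SegmentCache` class (not a Solr or Lucene API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class SegmentCache<V> {
    // Keyed by the sub-reader's identity, not by the enclosing MultiReader,
    // so entries survive a reopen that reuses unchanged segments.
    private final Map<Object, V> bySegment = new HashMap<>();

    V get(Object segmentReader, Function<Object, V> compute) {
        return bySegment.computeIfAbsent(segmentReader, compute);
    }

    int size() { return bySegment.size(); }
}
```

After a reopen, only the new segments' keys miss the cache; unchanged segments hit their existing entries with no recomputation.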
Re: Realtime Search for Social Networks Collaboration
have some discussion going (Karl...), so I look forward to understanding more! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, September 4, 2008 10:13:32 AM Subject: Re: Realtime Search for Social Networks Collaboration On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote: I also think it's got a lot of things now which makes integration difficult to do properly. I agree, and that's why the major bump in version number rather than minor - we recognize that some features will need some amount of rearchitecture. I think the problem with integration with SOLR is it was designed with a different problem set in mind than Ocean, originally the CNET shopping application. That was the first use of Solr, but it actually existed before that w/o any defined use other than to be a plan B alternative to MySQL based search servers (that's actually where some of the parameter names come from... the default /select URL instead of /search, the rows parameter, etc). But you're right... some things like the replication strategy were designed (well, borrowed from Doug to be exact) with the idea that it would be OK to have slightly stale views of the data in the range of minutes. It just made things easier/possible at the time. But tons of Solr and Lucene users want almost instantaneous visibility of added documents, if they can get it. It's hardly restricted to social network applications. Bottom line is that Solr aims to be a general enterprise search platform, and getting as real-time as we can get, and as scalable as we can get are some of the top priorities going forward. -Yonik
Re: Realtime Search for Social Networks Collaboration
-- Marcelo F. Ochoa http://marceloochoa.blogspot.com/ http://marcelo.ochoa.googlepages.com/home __ Do you Know DBPrism? Look @ DB Prism's Web Site http://www.dbprism.com.ar/index.html More info? Chapter 17 of the book Programming the Oracle Database using
Re: Realtime Search for Social Networks Collaboration
On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. Otis, what do you mean exactly by adding real-time search to Lucene? Note that Lucene, being an indexing/search library (and not a full-blown search engine), is by definition real-time: once you add/write a document to the index it becomes immediately searchable, and if a document is logically deleted it is no longer returned in a search, though physical deletion happens during an index optimization. Now, the problem of adding/deleting documents in bulk, as part of a transaction and making these documents available for search immediately after the transaction is committed, sounds more like a search engine problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be I/O expensive and thus are usually implemented as batched processes with some kind of sync mechanism, which makes them non real-time. For example, in my previous life, I designed and helped implement a quasi-realtime enterprise search engine using Lucene, having a set of multi-threaded indexers hitting a set of multiple indexes allocated across different search services which powered a broker-based distributed search interface. The most recent documents provided to the indexers were always added to the smaller in-memory (RAM) indexes, which usually could absorb the load of a bulk add transaction, and later would be merged into larger disk-based indexes and then flushed to make them ready to absorb new fresh docs. We even had further partitioning of the indexes that reflected time periods, with caps on size for them to be merged into older, more archive-based indexes which were used less (yes, the search engine default search was on data no more than 1 month old, though the user could open the time window by including archives).
As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR). Notice that real-time CRUD operations and transactionality are core DB concepts and have been studied and developed by database communities for quite a long time. There have been recent efforts on how to efficiently integrate Lucene into relational databases (see Lucene JVM ORACLE integration, see http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html ) I think we should seriously look at joining efforts with open-source Database engine projects, written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM for once and for all. -- Joaquin I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion? If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast. But Lucene itself offers no replication mechanism, so maybe the replication is something to figure out separately, say on the Solr level, later on once we get there.
I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding more! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, September 4, 2008 10:13:32 AM Subject: Re: Realtime Search for Social Networks Collaboration On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote: I also think it's got a lot of things now which makes integration difficult to do properly. I agree, and that's why
Re: Realtime Search for Social Networks Collaboration
Interesting discussion. I think we should seriously look at joining efforts with open-source Database engine projects I posted some initial dabblings here with a couple of the databases on your list: http://markmail.org/message/3bu5klzzc5i6uhl7 but this is not really a scalable solution (which is what Jason and others need) for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were semi-structured systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard. Cheers, Mark.
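The trade Mark describes, giving up joins in exchange for scale-out, usually means denormalizing at index time: the "join" is performed once, when documents are built, so queries never need one. A toy sketch of that flattening step, with invented field names and plain maps standing in for Lucene documents:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Index-time denormalization: each order row is flattened together with
 * its customer row into a single self-contained document, so the search
 * system never has to join at query time.
 */
class Denormalizer {
    /** customers maps customerId -> customerName; each order is {orderId, customerId, item}. */
    static List<Map<String, String>> flatten(Map<String, String> customers,
                                             List<String[]> orders) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (String[] order : orders) {
            Map<String, String> doc = new HashMap<>();
            doc.put("orderId", order[0]);
            doc.put("item", order[2]);
            // the "join", done once at index time rather than per query
            doc.put("customerName", customers.get(order[1]));
            docs.add(doc);
        }
        return docs;
    }
}
```

The cost is duplicated data and re-indexing when the parent record changes, which is exactly the weakness-and-strength trade-off Mark mentions.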
Re: Realtime Search for Social Networks Collaboration
On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED] wrote: for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were semi-structured systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard. Hey, maybe the right way to go for a truly scalable and high-performance semi-structured database is to marry HBase (Bigtable-like data storage) with SOLR/Lucene. I concur with you in the sense that simplistic data models coupled with high performance are the killer combination. Let me quote this from the original Bigtable paper from Google: Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Re: Realtime Search for Social Networks Collaboration
BTW, quoting Marcelo Ochoa (the developer behind the Oracle/Lucene implementation), the three minimal features a transactional DB should support for Lucene integration are: 1) The ability to define new functions (e.g. lcontains(), lscore()) which would allow binding queries to Lucene and obtaining document/row scores 2) An API that would allow DML intercepts, like Oracle's ODCI. 3) The ability to extend and/or implement new types of domain indexes that the engine's query evaluation and execution/optimization planner can use efficiently. Thanks Marcelo. -- Joaquin On Sun, Sep 7, 2008 at 8:16 AM, J. Delgado [EMAIL PROTECTED] wrote: On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED] wrote: for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were semi-structured systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard. Hey, maybe the right way to go for a truly scalable and high-performance semi-structured database is to marry HBase (Bigtable-like data storage) with SOLR/Lucene. I concur with you in the sense that simplistic data models coupled with high performance are the killer combination. Let me quote this from the original Bigtable paper from Google: Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings.
Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Re: Realtime Search for Social Networks Collaboration
Hi, - Original Message From: J. Delgado [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Sunday, September 7, 2008 4:04:58 AM Subject: Re: Realtime Search for Social Networks Collaboration On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. Otis, what do you mean exactly by adding real-time search to Lucene? Note that Lucene, being an indexing/search library (and not a full-blown search engine), is by definition real-time: once you add/write a document to the index it becomes immediately searchable, and if a document is deleted it is logically deleted and no longer returned in searches, though physical deletion happens during an index optimization. OG: When I think about real-time search I see it as: Make the newly added document show up in search results without closing and reopening the whole index with IndexWriter. In other words, minimize re-reading of the old/unchanged data just to be able to see the newly added data. I believe this is similar to what IndexReader.reopen does and Jason does make use of it. Otis Now, the problem of adding/deleting documents in bulk, as part of a transaction, and making these documents available for search immediately after the transaction is committed sounds more like a search engine problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be I/O expensive and thus are usually implemented as batched processes with some kind of sync mechanism, which makes them non-real-time. For example, in my previous life, I designed and helped implement a quasi-realtime enterprise search engine using Lucene, with a set of multi-threaded indexers hitting a set of multiple indexes allocated across different search services, which powered a broker-based distributed search interface.
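Otis's definition of real-time search, making new documents visible while re-reading only the changed portion of the index (what IndexReader.reopen does), can be modeled with a toy segment-snapshot reader. This is a plain-Java illustration of the semantics, not the real IndexReader API; all names are invented:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy index: each flush appends one immutable segment. */
class ToyIndex {
    final List<List<String>> segments = new ArrayList<>();

    void flushSegment(List<String> docs) { segments.add(new ArrayList<>(docs)); }

    ToyReader open() { return new ToyReader(this, segments.size()); }
}

/**
 * A reader is a snapshot of the segments that existed when it was opened.
 * reopen() reuses the already-visible segments and only picks up newly
 * flushed ones -- the "minimize re-reading of unchanged data" idea.
 */
class ToyReader {
    private final ToyIndex index;
    private final int segmentCount;  // segments visible to this snapshot

    ToyReader(ToyIndex index, int segmentCount) {
        this.index = index;
        this.segmentCount = segmentCount;
    }

    /** Cheap when nothing changed: returns this same reader. */
    ToyReader reopen() {
        return index.segments.size() == segmentCount
                ? this
                : new ToyReader(index, index.segments.size());
    }

    int numDocs() {
        int n = 0;
        for (int i = 0; i < segmentCount; i++) n += index.segments.get(i).size();
        return n;
    }
}
```

In real Lucene, reopen additionally shares the underlying per-segment readers between the old and new reader, which is where the efficiency win comes from.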
The most recent documents provided to the indexers were always added to the smaller in-memory (RAM) indexes, which usually could absorb the load of a bulk add transaction, and later would be merged into larger disk-based indexes and then flushed to make them ready to absorb new fresh docs. We even had further partitioning of the indexes that reflected time periods, with caps on size, for them to be merged into older, more archive-based indexes which were used less (yes, the search engine's default search was on data no more than 1 month old, though users could open the time window by including archives). As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR). Notice that real-time CRUD operations and transactionality are core DB concepts and have been studied and developed by database communities for quite a long time. There have been recent efforts to efficiently integrate Lucene into relational databases (see the Lucene JVM ORACLE integration: http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html) I think we should seriously look at joining efforts with open-source database engine projects written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM once and for all. -- Joaquin I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion?
If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast. But Lucene itself offers no replication mechanism, so maybe replication is something to figure out separately, say on the Solr level, later on once we get there. I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only
Re: Realtime Search for Social Networks Collaboration
requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding more! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Thursday, September 4, 2008 10:13:32 AM Subject: Re: Realtime Search for Social Networks Collaboration On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote: I also think it's got a lot of things now which makes integration difficult to do properly. I agree, and that's why the major bump in version number rather than minor - we recognize that some features will need some amount of rearchitecture. I think the problem with integration with SOLR is it was designed with a different problem set in mind than Ocean, originally the CNET shopping application. That was the first use of Solr, but it actually existed before that w/o any defined use other than to be a plan B alternative to MySQL-based search servers (that's actually where some of the parameter names come from... the default /select URL instead of /search, the rows parameter, etc). But you're right... some things like the replication strategy were designed (well, borrowed from Doug to be exact) with the idea that it would be OK to have slightly stale views of the data, in the range of minutes. It just made things easier/possible at the time. But tons of Solr and Lucene users want almost instantaneous visibility of added documents, if they can get it. It's hardly restricted to social network applications.
Bottom line is that Solr aims to be a general enterprise search platform, and getting as real-time as we can get, and as scalable as we can get, are some of the top priorities going forward. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Realtime Search for Social Networks Collaboration
There's a good percent of the Solr community that is looking to add everything you are (from a functional point of view). Some of the other little things that we haven't considered (like a remote Java API) sound cool... no reason not to add that also. We're also planning on adding alternatives to some of the things you don't currently like about Solr (HTTP, XML config, etc). Apache has always emphasized community over code... and it's a large part of what open source is about here. It's not always easier and faster to work in an open community, making compromises and trying to reach general consensus, but it tends to be good for projects in the long term. -Yonik
Re: Realtime Search for Social Networks Collaboration
Hi Yonik, I fully agree with good for projects in the long term. I just figured it would be best if someone went ahead and built the things and they could be integrated later into other projects; that's why I checked them into Apache as patches. Sounds like a few folks like Shalin and Noble would like to build a SOLR-specific realtime search. I think that's a good idea that I may be able to offer some help on. Realtime is relative anyways; for many projects database-like updates are probably not necessary, neither is replication, or perhaps even 100% uptime and scalability. I just want the features, and if someone would like to work with me to get them into the core Lucene and SOLR projects that would be cool. If not, at least the code is out there to get ideas from. These discussions are a good starting point. Cheers, Jason On Sat, Sep 6, 2008 at 11:21 AM, Yonik Seeley [EMAIL PROTECTED] wrote: There's a good percent of the Solr community that is looking to add everything you are (from a functional point of view). Some of the other little things that we haven't considered (like a remote Java API) sound cool... no reason not to add that also. We're also planning on adding alternatives to some of the things you don't currently like about Solr (HTTP, XML config, etc). Apache has always emphasized community over code... and it's a large part of what open source is about here. It's not always easier and faster to work in an open community, making compromises and trying to reach general consensus, but it tends to be good for projects in the long term. -Yonik
Re: Realtime Search for Social Networks Collaboration
Hi Jason, I think this is a misunderstanding. I only want to add these features incrementally so that users can use them as soon as possible, rather than delay them to a later release by re-architecting (which may take more time and shift our focus from our users). The features are more important than the code, but it will of course help a lot too. I think a good starting point for us (Lucene/Solr folks) would be to study Ocean's source and any documentation that you can provide so that we can also suggest an optimal integration strategy or alternate implementation ideas. Until now the bulk of such work has been on your shoulders. I appreciate your patience and the amount of work you have put in. These features will be a huge value proposition for our users, and a collaboration will be good for the community in the long term. On Sat, Sep 6, 2008 at 9:11 PM, Jason Rutherglen [EMAIL PROTECTED] wrote: Hi Yonik, I fully agree with good for projects in the long term. I just figured it would be best if someone went ahead and built the things and they could be integrated later into other projects; that's why I checked them into Apache as patches. Sounds like a few folks like Shalin and Noble would like to build a SOLR-specific realtime search. I think that's a good idea that I may be able to offer some help on. Realtime is relative anyways; for many projects database-like updates are probably not necessary, neither is replication, or perhaps even 100% uptime and scalability. I just want the features, and if someone would like to work with me to get them into the core Lucene and SOLR projects that would be cool. If not, at least the code is out there to get ideas from. These discussions are a good starting point. Cheers, Jason On Sat, Sep 6, 2008 at 11:21 AM, Yonik Seeley [EMAIL PROTECTED] wrote: There's a good percent of the Solr community that is looking to add everything you are (from a functional point of view).
Some of the other little things that we haven't considered (like a remote Java API) sound cool... no reason not to add that also. We're also planning on adding alternatives to some of the things you don't currently like about Solr (HTTP, XML config, etc). Apache has always emphasized community over code... and it's a large part of what open source is about here. It's not always easier and faster to work in an open community, making compromises and trying to reach general consensus, but it tends to be good for projects in the long term. -Yonik -- Regards, Shalin Shekhar Mangar.
Re: Realtime Search for Social Networks Collaboration
On Sep 6, 2008, at 4:36 AM, Otis Gospodnetic wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion? If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast. But Lucene itself offers no replication mechanism, so maybe replication is something to figure out separately, say on the Solr level, later on once we get there. I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think). Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? Yeah, I agree. There's a place for RT search in Lucene, but it seems to me we have a pretty good search server in Solr that needs some things going forward that are reasonable to work on there.
It makes sense to me not to duplicate efforts on all of those fronts and have two projects/communities that share 80-90% of their functionality (either existing, or planned). As Yonik says, it may take longer than just doing it by oneself, but in the long run, the outcome is usually better. My two cents, Grant
Re: Realtime Search for Social Networks Collaboration
On Saturday 06 September 2008 18:53:39, Shalin Shekhar Mangar wrote: ... The features are more important than the code but it will of course help a lot too. I think a good starting point for us (Lucene/Solr folks) would be to study Ocean's source and any documentation that you can provide so that we can also suggest an optimal integration strategy or alternate implementation ideas. Until now the bulk of such work has been on your shoulders. I appreciate your patience and the amount of work you have put in. These features will be a huge value proposition for our users and a collaboration will be good for the community in the long term. Some experience from larger patches: - stepwise is good, - so plan for steps, in which - each step is an improvement on its own. Then: - try to keep the first step as small as possible, - with some luck, someone else will improve the first step, - learn from the improvement, - repeat, and never hurry. Some comments on the current patch at LUCENE-1313: - Copyright is assigned to individual authors; better to assign that to ASF. - Individual authors are mentioned in the code; that's not Lucene policy at the moment. - Some files do not contain an ASF licence; not a real problem. - The directory structure could also be in contrib/ocean as top directory. - There is a whole package of logging in there, but there's no logging in Lucene at the moment. - There is at least one empty class, SearcherPolicy. - Unseen so far: - the second half of the patch, - the Java code within the class {...} statements (sorry.) Even though the patch is down to 25% of its first size, it's still 474 kb, which is large by any standard. So the question is: is there a first step to be taken from this patch that would be an improvement on its own? Regards, Paul Elschot
Re: Realtime Search for Social Networks Collaboration
Hello Shalin, When I tried to integrate before, it seemed fairly simple. However, the Ocean core code wasn't quite up to par yet, so that needed work. It will help to work with SOLR people directly who can figure out how they want to integrate, such as yourself. Right now I'm finishing up the OceanDatabase portion (sorry for all the Ocean names and things; these can be changed, doesn't matter, but it should be something we agree on). The methods on TransactionSystem are like IndexWriter's. The update method for OceanDatabase is perform(Action action). There are 3 actions: Insert, Update, Delete. To execute queries, the whole thing is abstracted out as a Task. The method is Object run(Task task), where the task gets a reference to the TransactionSystem. I implemented a MultiThreadSearchTask that, as the name suggests, executes a query in multiple threads over the latest Snapshot. The reason for the Task abstraction is to give the client complete access to the server via a potentially dynamically loaded subclass of Task. OceanDatabase should be the main class for most uses of the realtime system because it implements optimistic concurrency. I prefer the simplicity of the main entry point into the search server being only two methods, with the run method offering unlimited functionality without recompiling, building, and deploying the server for each new piece of functionality required. Regards, Jason On Sat, Sep 6, 2008 at 12:53 PM, Shalin Shekhar Mangar [EMAIL PROTECTED] wrote: Hi Jason, I think this is a misunderstanding. I only want to add these features incrementally so that users can use them as soon as possible, rather than delay them to a later release by re-architecting (which may take more time and shift our focus from our users). The features are more important than the code, but it will of course help a lot too.
I think a good starting point for us (Lucene/Solr folks) would be to study Ocean's source and any documentation that you can provide so that we can also suggest an optimal integration strategy or alternate implementation ideas. Until now the bulk of such work has been on your shoulders. I appreciate your patience and the amount of work you have put in. These features will be a huge value proposition for our users, and a collaboration will be good for the community in the long term. On Sat, Sep 6, 2008 at 9:11 PM, Jason Rutherglen [EMAIL PROTECTED] wrote: Hi Yonik, I fully agree with good for projects in the long term. I just figured it would be best if someone went ahead and built the things and they could be integrated later into other projects; that's why I checked them into Apache as patches. Sounds like a few folks like Shalin and Noble would like to build a SOLR-specific realtime search. I think that's a good idea that I may be able to offer some help on. Realtime is relative anyways; for many projects database-like updates are probably not necessary, neither is replication, or perhaps even 100% uptime and scalability. I just want the features, and if someone would like to work with me to get them into the core Lucene and SOLR projects that would be cool. If not, at least the code is out there to get ideas from. These discussions are a good starting point. Cheers, Jason On Sat, Sep 6, 2008 at 11:21 AM, Yonik Seeley [EMAIL PROTECTED] wrote: There's a good percent of the Solr community that is looking to add everything you are (from a functional point of view). Some of the other little things that we haven't considered (like a remote Java API) sound cool... no reason not to add that also. We're also planning on adding alternatives to some of the things you don't currently like about Solr (HTTP, XML config, etc). Apache has always emphasized community over code... and it's a large part of what open source is about here.
It's not always easier and faster to work in an open community, making compromises and trying to reach general consensus, but it tends to be good for projects in the long term. -Yonik -- Regards, Shalin Shekhar Mangar.
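The two-method entry point Jason describes, perform(Action) for updates and run(Task) for arbitrary queries, can be roughly reconstructed as a toy sketch. All signatures here are guesses from the prose above, not Ocean's actual code, and a plain list stands in for the index:

```java
import java.util.ArrayList;
import java.util.List;

/** An update action; Insert/Update/Delete would each implement this. */
interface Action { void apply(List<String> docs); }

class Insert implements Action {
    final String doc;
    Insert(String doc) { this.doc = doc; }
    public void apply(List<String> docs) { docs.add(doc); }
}

class Delete implements Action {
    final String doc;
    Delete(String doc) { this.doc = doc; }
    public void apply(List<String> docs) { docs.remove(doc); }
}

/** Client-supplied code run against the latest snapshot; can do anything. */
interface Task { Object run(List<String> snapshot); }

/** Toy stand-in for OceanDatabase: the whole public surface is two methods. */
class ToyDatabase {
    private final List<String> docs = new ArrayList<>();

    public void perform(Action action) { action.apply(docs); }

    public Object run(Task task) {
        // hand the task a snapshot, so queries see a stable view
        return task.run(new ArrayList<>(docs));
    }
}
```

The appeal of the design is visible even in the toy: new query functionality is just a new Task implementation, with no change to the server's API.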
Re: Realtime Search for Social Networks Collaboration
Hi Grant, I think the way to integrate with SOLR and Lucene is for people who are committers to the respective projects to work with me (if they want) on the integration, which will make it fairly straightforward, as it was designed and intended to be. Cheers, Jason On Sat, Sep 6, 2008 at 3:16 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: On Sep 6, 2008, at 4:36 AM, Otis Gospodnetic wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still some fuzziness about the whole thing in my head - is Ocean something that already works, a separate project on googlecode.com? I think so. If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as real-time search, so there is no confusion? If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented? I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast. But Lucene itself offers no replication mechanism, so maybe replication is something to figure out separately, say on the Solr level, later on once we get there. I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think).
Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me. Am I the only one who finds this hard? Yeah, I agree. There's a place for RT search in Lucene, but it seems to me we have a pretty good search server in Solr that needs some things going forward that are reasonable to work on there. It makes sense to me not to duplicate efforts on all of those fronts and have two projects/communities that share 80-90% of their functionality (either existing, or planned). As Yonik says, it may take longer than just doing it by oneself, but in the long run, the outcome is usually better. My two cents, Grant
Re: Realtime Search for Social Networks Collaboration
Hi Paul, It's unfortunate the code is larger than most contribs. The libraries can be factored out. The next patch includes OceanDatabase. The Ocean package and class names can be removed in favor of realtime? - There is a whole package of logging in there, but there's no logging in Lucene at the moment. Can be removed, in favor of the IndexWriter-style logging? Is this really the best way to go? It makes debugging more painful, with no automatic method and class insertion in the log entries. I can do it, just thinking of other folks who work on it. The locking and such uses JDK 1.5; I can downgrade it, but for such locking, and with 3.0 possibly coming out soon, is that best? SearcherPolicy: it's a marker class like MergePolicy or Serializable. - Individual authors are mentioned in the code; that's not Lucene policy at the moment. Agreed, Eclipse throws them in, I delete them, maybe some made it in. Maybe the @author should be removed from FieldCacheImpl, FieldDoc, and FieldCache. On Sat, Sep 6, 2008 at 3:41 PM, Paul Elschot [EMAIL PROTECTED] wrote: On Saturday 06 September 2008 18:53:39, Shalin Shekhar Mangar wrote: ... The features are more important than the code but it will of course help a lot too. I think a good starting point for us (Lucene/Solr folks) would be to study Ocean's source and any documentation that you can provide so that we can also suggest an optimal integration strategy or alternate implementation ideas. Until now the bulk of such work has been on your shoulders. I appreciate your patience and the amount of work you have put in. These features will be a huge value proposition for our users and a collaboration will be good for the community in the long term. Some experience from larger patches: - stepwise is good, - so plan for steps, in which - each step is an improvement on its own.
Then: - try to keep the first step as small as possible, - with some luck, someone else will improve the first step, - learn from the improvement, - repeat, and never hurry. Some comments on the current patch at LUCENE-1313: - Copyright is assigned to individual authors; better to assign that to ASF. - Individual authors are mentioned in the code; that's not Lucene policy at the moment. - Some files do not contain an ASF licence; not a real problem. - The directory structure could also be in contrib/ocean as top directory. - There is a whole package of logging in there, but there's no logging in Lucene at the moment. - There is at least one empty class, SearcherPolicy. - Unseen so far: - the second half of the patch, - the Java code within the class {...} statements (sorry.) Even though the patch is down to 25% of its first size, it's still 474 kb, which is large by any standard. So the question is: is there a first step to be taken from this patch that would be an improvement on its own? Regards, Paul Elschot