fast update handlers
I'm trying to set up a system with very low indexing latency (1-2 seconds), and one of the javadocs intrigued me: "DirectUpdateHandler2 implements an UpdateHandler where documents are added directly to the main Lucene index as opposed to adding to a separate smaller index." The plain DirectUpdateHandler has the same text in its docs. Does this imply that there used to be another handler that could send docs to a smaller, faster index and then merge them into a larger one, or that someone could write one in the future? I read through a good bit of the code and didn't see how it could be handled from a searcher perspective, but perhaps I'm missing some key piece.

- will
Re: fast update handlers
On 5/10/07, Will Johnson wrote:
> Does this imply that there used to be another handler that could send docs to a small/faster index and then merge them in with a larger one or that someone could in the future?

That was the original design, before I thought of the current method in DUH2. DirectUpdateHandler was just meant to get things working to establish the external interface (it's only for testing... very slow at overwriting docs).

Adding documents to a separate index and then merging would have no real indexing speed advantage (it's essentially what Lucene does anyway when adding to a large index). There would be some advantage for index distribution, but it would complicate things greatly. High latency is caused by segment merges... this would happen when you periodically had to merge the smaller index into the larger anyway.

You could do some other tricks for more predictable index times... set a large mergeFactor and then call optimize after you have added your batch of documents.

Stay tuned though... there has been some work on a Lucene patch to do merging in a background thread.

-Yonik
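The batch-then-optimize trick can be sketched roughly as follows, in Python. This is only an illustration: the update URL and the function names are assumptions, not anything from this thread, and mergeFactor itself is set in solrconfig.xml and is not shown here. The `<add>` and `<optimize/>` messages are the standard Solr XML update messages.

```python
from urllib import request
from xml.sax.saxutils import escape, quoteattr

# Assumed URL of the update handler; adjust to your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def add_message(docs):
    """Build a Solr XML <add> message from a list of {field: value} dicts."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append("<field name=%s>%s</field>"
                         % (quoteattr(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def post(xml, url=SOLR_UPDATE_URL):
    """POST one XML update message to Solr (needs a running server)."""
    req = request.Request(url, data=xml.encode("utf-8"),
                          headers={"Content-Type": "text/xml; charset=utf-8"})
    with request.urlopen(req) as resp:
        return resp.read()


def index_batch(docs, send=post):
    """Add the whole batch first, then optimize once at a time you choose."""
    send(add_message(docs))
    send("<optimize/>")
```

The `send` parameter is there only so the message sequence can be inspected without a server, for example `index_batch(docs, send=print)`.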
RE: fast update handlers
I guess I was more concerned with doing the frequent commits and how that would affect the caches. Say I have 2M docs in my main index but I want to add docs every 2 seconds, all while doing queries. If I do commits every 2 seconds I basically lose any caching advantage and my faceting performance goes down the tube. If, however, I were to add things to a smaller index and then roll it into the larger one every ~30 minutes, then I only take the hit of computing the larger index's filter caches on that interval. Further, if my smaller index were based on a RAMDirectory instead of an FSDirectory, I assume computing the filter sets for the smaller index should be fast enough even every 2 seconds.

- will
Re: fast update handlers
On 5/10/07, Will Johnson wrote:
> If however, I were to add things to a smaller index and then roll it into the larger one every ~30 minutes then I only take the hit on computing the larger filters caches on that interval.

There isn't currently any support for incrementally updating filters.

-Yonik
RE: fast update handlers
What about issuing separate commits to the index on a regularly scheduled basis? For example, you add documents to the index every 2 seconds, or however often, but these operations don't commit. Instead, you have a cron'd script or something that just issues a commit every 5 or 10 minutes or whatever interval you'd like. I had to do something similar when I was running a re-index of my entire dataset. My program wasn't issuing commits, so I just cron'd a commit for every half hour so it didn't overload the server.

Thanks,
Charlie
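For illustration, the same idea (adds go through immediately, commits happen on a separate schedule) could also be done in-process instead of from cron. The Python sketch below is a variant under assumptions: `CommitScheduler` and its parameters are hypothetical names, and `send` is whatever function posts an XML message to the update handler. Note that documents stay invisible to searches until the next commit, so this does not give 1-2 second visibility.

```python
import time


class CommitScheduler:
    """Send adds right away, but issue <commit/> at most once per interval."""

    def __init__(self, send, interval_seconds=300.0, clock=time.monotonic):
        self.send = send                  # callable taking one XML message
        self.interval = interval_seconds  # minimum seconds between commits
        self.clock = clock                # injectable for testing
        self.last_commit = clock()
        self.pending = False              # True while uncommitted adds exist

    def add(self, add_xml):
        """Forward an <add> message, then commit if one is due."""
        self.send(add_xml)
        self.pending = True
        self.maybe_commit()

    def maybe_commit(self):
        """Commit only if adds are pending and the interval has elapsed."""
        if self.pending and self.clock() - self.last_commit >= self.interval:
            self.send("<commit/>")
            self.last_commit = self.clock()
            self.pending = False
            return True
        return False
```

A commit is only triggered from `add` here, so a timer or a final explicit `maybe_commit()` call is still needed to flush the last batch.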
RE: fast update handlers
The problem is that I want the newly added documents to be searchable within 1-2 seconds, so I need the commits. I was hoping that the caches could be tied to the IndexSearcher, so that a MultiSearcher could take advantage of the multiple sub-indexes and their respective caches. I think the best approach now will be to write a top-level federator that can merge results from the large, mostly static index and the smaller, more dynamic index.

- will
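A top-level federator of this kind might, as a rough sketch, merge the two ranked hit lists like this. Everything here is an assumption for illustration: hits are plain `(doc_id, score)` tuples already sorted by descending score, and `federate` is a hypothetical name. Scores from two separate indexes are computed from different term statistics, so treating them as directly comparable is a simplification.

```python
import heapq
from itertools import islice


def federate(static_hits, dynamic_hits, rows=10):
    """Merge two hit lists, each sorted by descending score.

    A document found in the dynamic index replaces its older copy
    from the static index.
    """
    fresh_ids = {doc_id for doc_id, _ in dynamic_hits}
    static_kept = [hit for hit in static_hits if hit[0] not in fresh_ids]
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(static_kept, dynamic_hits, key=lambda hit: -hit[1])
    return list(islice(merged, rows))
```

The de-duplication step only sees documents that appear in the dynamic hit list for this query; an updated document that no longer matches the query would still surface from the static index unless deletes are handled separately.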
Re: fast update handlers
I don't know if this helps, but... do *all* your queries need to include the fast updates? I have a setup where some cases need the newest stuff but most cases can wait 5 minutes or so. For that, I have two Solr instances pointing to the same index files. One is used for updates and for queries that need everything. The other is a read-only instance that serves the majority of queries. What is nice about this is that you can set different cache sizes and auto-warming for the two cases.

ryan
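On the client side, the two-instance setup amounts to choosing which instance to query. A minimal Python sketch, assuming two hypothetical select URLs (the ports and the names `FRESH_URL`, `CACHED_URL`, and `select_url` are made up for illustration):

```python
from urllib.parse import urlencode

# Hypothetical endpoints: both instances read the same index files.
FRESH_URL = "http://localhost:8983/solr/select"   # takes updates, commits often
CACHED_URL = "http://localhost:8984/solr/select"  # read-only, large warmed caches


def select_url(params, needs_fresh=False):
    """Build a query URL, routing to the fresh instance only when required."""
    base = FRESH_URL if needs_fresh else CACHED_URL
    return base + "?" + urlencode(params)
```

For example, `select_url({"q": "solr"}, needs_fresh=True)` targets the instance that sees the latest commits, while the default goes to the read-only instance.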
RE: fast update handlers
: want to add docs every 2 seconds all while doing queries. if I do
: commits every 2 seconds I basically loose any caching advantage and my
: faceting performance goes down the tube. If however, I were to add
: things to a smaller index and then roll it into the larger one every ~30
: minutes then I only take the hit on computing the larger filters caches

searching across both of these indexes (the big and the little) would require something like a MultiReader, a way to unify DocSets between the two, and the ability to cache on the sub indexes and on the main MultiReader.

fortunately, a MultiReader is exactly what Lucene uses under the covers when dealing with an FSDirectory, so we're half way there. something like these might get us the rest of the way...

https://issues.apache.org/jira/browse/LUCENE-831
https://issues.apache.org/jira/browse/LUCENE-743

-Hoss
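To make "unify DocSets" and "cache on the sub indexes" a bit more concrete, here is a toy Python sketch. It is not Solr's DocSet API or Lucene code: `SubIndex` and `unified_filter` are hypothetical, documents are plain dicts, and deletes are ignored. It relies only on the way a MultiReader numbers documents, where each sub-index's local doc ids are shifted by the total size of the sub-indexes before it.

```python
class SubIndex:
    """Stand-in for one sub-index with its own filter cache."""

    def __init__(self, docs):
        self.docs = docs          # list of dicts; list position = local doc id
        self.filter_cache = {}    # (field, value) -> frozenset of local doc ids
        self.computations = 0     # number of cache misses, for illustration

    def filter(self, field, value):
        """Return the cached set of local doc ids matching field == value."""
        key = (field, value)
        if key not in self.filter_cache:
            self.computations += 1
            self.filter_cache[key] = frozenset(
                i for i, doc in enumerate(self.docs) if doc.get(field) == value)
        return self.filter_cache[key]


def unified_filter(subs, field, value):
    """Combine per-sub-index doc sets into one set of global doc ids."""
    result, base = set(), 0
    for sub in subs:
        result.update(base + local for local in sub.filter(field, value))
        base += len(sub.docs)     # the next sub-index starts after this one
    return result
```

With this layout, replacing only the small sub-index after a commit leaves the big sub-index's cached filters untouched, which is the saving the original question was after.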