fast update handlers

2007-05-10 Thread Will Johnson

I'm trying to set up a system with very low indexing latency (1-2
seconds), and one of the javadocs intrigued me:

"DirectUpdateHandler2 implements an UpdateHandler where documents are
added directly to the main Lucene index as opposed to adding to a
separate smaller index"

The plain DirectUpdateHandler has the same text in its docs.  Does this
imply that there used to be another handler that could send docs to a
small/faster index and then merge them into a larger one, or that
someone could write one in the future?  I read through a good bit of the
code and didn't see how it could be handled from a searcher perspective,
but perhaps I'm missing some key piece.

- will

Re: fast update handlers

2007-05-10 Thread Yonik Seeley


That was the original design, before I thought of the current method
in DUH2. DirectUpdateHandler was just meant to get things working to
establish the external interface (it's only for testing... very slow
at overwriting docs).

Adding documents to a separate index and then merging would have no
real indexing speed advantage (it's essentially what Lucene does
anyway when adding to a large index).  There would be some advantage
for index distribution, but it would complicate things greatly.

High latency is caused by segment merges... this would happen when you
periodically had to merge the smaller index into the larger anyway.
You could do some other tricks for more predictable index times... set
a large mergeFactor and then call optimize after you have added your
batch of documents.

Stay tuned though... there has been some work on a lucene patch to do
merging in a background thread.

-Yonik
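
To illustrate why segment merges make indexing latency uneven, here is a toy model in Python. It is only a sketch under simplifying assumptions, not Lucene's actual merge policy: every flush is assumed to create one segment of size 1, and whenever merge_factor segments of the same level exist they are merged into one segment at the next level. The function name merge_costs is made up for this example.

```python
def merge_costs(num_flushes, merge_factor):
    """Return, for each flush, how many documents get rewritten by merges.

    Toy model (an assumption, not Lucene's real policy): a segment at
    level i holds merge_factor**i documents, and merge_factor segments
    of one level are merged into a single segment of the next level.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    levels = [0]  # levels[i] = number of segments currently at level i
    costs = []
    for _ in range(num_flushes):
        levels[0] += 1
        cost = 0
        level = 0
        # A merge at one level can cascade into the next level.
        while levels[level] == merge_factor:
            cost += merge_factor ** (level + 1)  # documents rewritten
            levels[level] = 0
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            level += 1
        costs.append(cost)
    return costs
```

In this model, merge_costs(100, 10) is mostly zeros, with a cost of 10 on every tenth flush and a spike of 110 on the hundredth. With merge_factor set to 1000 the same 100 flushes trigger no merges at all, which is the "large mergeFactor, then optimize after the batch" trick: the merge work is deferred to a moment you choose.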

RE: fast update handlers

2007-05-10 Thread Will Johnson

I guess I was more concerned with doing the frequent commits and how
that would affect the caches.  Say I have 2M docs in my main index but I
want to add docs every 2 seconds, all while doing queries.  If I do
commits every 2 seconds I basically lose any caching advantage and my
faceting performance goes down the tubes.  If, however, I were to add
things to a smaller index and then roll it into the larger one every ~30
minutes, then I only take the hit on computing the larger index's filter
caches on that interval.  Further, if my smaller index were based on a
RAMDirectory instead of an FSDirectory, I assume computing the filter sets
for the smaller index should be fast enough even every 2 seconds.

- will





Re: fast update handlers

2007-05-10 Thread Yonik Seeley


There isn't currently any support for incrementally updating filters.

-Yonik

RE: fast update handlers

2007-05-10 Thread Charlie Jackson

What about issuing separate commits to the index on a regularly
scheduled basis? For example, you add documents to the index every 2
seconds, or however often, but these operations don't commit. Instead,
you have a cron'd script or something that just issues a commit every 5
or 10 minutes or whatever interval you'd like. 

I had to do something similar when I was running a re-index of my entire
dataset. My program wasn't issuing commits, so I just cron'd a commit
for every half hour so it didn't overload the server. 

Thanks,
Charlie
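
A scheduled commit of this kind could be sketched in Python roughly as follows. The update URL is an assumption (a default local Solr install), and the function names are made up for this example; the only Solr-specific part is posting the XML message <commit/> to the update handler.

```python
import urllib.request

# Assumed location of the Solr update handler; adjust for your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_commit_request(update_url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying a <commit/> message."""
    return urllib.request.Request(
        update_url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def send_commit(update_url=SOLR_UPDATE_URL, timeout=30):
    """Send the commit and return the HTTP status code."""
    request = build_commit_request(update_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A script that ends with a call to send_commit() can then be run from cron at whatever interval you like, while the indexing clients only ever send adds.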



RE: fast update handlers

2007-05-10 Thread Will Johnson

The problem is that I want the newly added documents to be made searchable
every 1-2 seconds, so I need the commits.  I was hoping that the caches
could be tied to the IndexSearcher, so that a MultiSearcher could
take advantage of the multiple sub-indexes and their respective caches.

I think the best approach for now will be to write a top-level federator
that can merge results from the large, mostly static index and the
smaller, more dynamic index.

- will
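
One way such a federator could merge results is sketched below in Python. This is an illustration under stated assumptions, not an existing Solr feature: each index is assumed to return hits as (score, doc_id) pairs already sorted by descending score, and overridden_ids is a hypothetical set of ids whose newer version lives in the dynamic index. Note that scores computed against two separate indexes are not strictly comparable, since term statistics differ between them.

```python
import heapq
from itertools import islice


def federate(static_hits, dynamic_hits, overridden_ids=frozenset(), rows=10):
    """Merge two score-sorted hit lists into one top-`rows` list.

    static_hits / dynamic_hits: lists of (score, doc_id), best first.
    overridden_ids: ids in the static index that were re-added to the
    dynamic index, so the stale static copy must be dropped.
    """
    live_static = (hit for hit in static_hits if hit[1] not in overridden_ids)
    merged = heapq.merge(
        live_static, dynamic_hits, key=lambda hit: hit[0], reverse=True
    )
    return list(islice(merged, rows))
```

Because both inputs are already sorted, the merge is lazy and only looks at as many hits as are needed to fill the requested page.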




Re: fast update handlers

2007-05-10 Thread Ryan McKinley



I don't know if this helps, but...

Do *all* your queries need to include the fast updates?  I have a setup
where some cases need the newest stuff but most cases can
wait 5 minutes (or so).

In that case, I have two Solr instances pointing to the same index
files.  One is used for updates and for queries that need everything.  The
other is a read-only instance that serves the majority of queries.

What is nice about this is that you can set different cache sizes and
auto-warming for the different cases.


ryan
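
On the client side, a two-instance setup like this comes down to picking a base URL per query. A minimal Python sketch, assuming two hypothetical local instances on ports 8983 and 8984 and the standard /select handler:

```python
from urllib.parse import urlencode

# Hypothetical addresses; both instances read the same index files.
FRESH_INSTANCE = "http://localhost:8983/solr"   # takes updates, commits often
CACHED_INSTANCE = "http://localhost:8984/solr"  # read-only, large warmed caches


def select_url(query, needs_newest=False):
    """Return the query URL, routed by whether fresh results are required."""
    base = FRESH_INSTANCE if needs_newest else CACHED_INSTANCE
    return base + "/select?" + urlencode({"q": query})
```

Only callers that pass needs_newest=True pay for the cold caches on the frequently committed instance; everything else goes to the instance that reopens its searcher every few minutes.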



RE: fast update handlers

2007-05-10 Thread Chris Hostetter

: want to add docs every 2 seconds all while doing queries.  if I do
: commits every 2 seconds I basically lose any caching advantage and my
: faceting performance goes down the tube.  If however, I were to add
: things to a smaller index and then roll it into the larger one every ~30
: minutes then I only take the hit on computing the larger filters caches

searching across both of these indexes (the big and the little) would
require something like a MultiReader, a way to unify DocSets
between the two, and the ability to cache on the sub-indexes and on the
main MultiReader.

fortunately, a MultiReader is exactly what Lucene uses under the covers
when dealing with an FSDirectory, so we're halfway there.  something like
these might get us the rest of the way...

https://issues.apache.org/jira/browse/LUCENE-831
https://issues.apache.org/jira/browse/LUCENE-743




-Hoss
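
The idea of caching per sub-index and unifying the doc sets can be sketched in Python as below. This is a toy model, not Solr or Lucene code: the class and function names are invented, documents are plain dicts, and a "filter" is just an exact field match. The one borrowed idea is that a combined reader numbers documents by adding each sub-index's starting offset to its local doc ids.

```python
class CachedSubIndex:
    """A sub-index that caches filter results by (field, value)."""

    def __init__(self, docs):
        self.docs = docs          # list of dicts; list position = local doc id
        self.cache = {}
        self.computations = 0     # how often a filter was really evaluated

    def doc_set(self, field, value):
        key = (field, value)
        if key not in self.cache:
            self.computations += 1
            self.cache[key] = frozenset(
                i for i, doc in enumerate(self.docs) if doc.get(field) == value
            )
        return self.cache[key]


def unified_doc_set(sub_indexes, field, value):
    """Combine per-sub-index doc sets into one set of global doc ids."""
    result = set()
    base = 0  # global id of the first document in the current sub-index
    for sub in sub_indexes:
        result.update(base + local for local in sub.doc_set(field, value))
        base += len(sub.docs)
    return result
```

When the small index is replaced after a commit, only its doc set has to be recomputed; the cached set for the large index is reused as long as the large index itself has not changed.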
