RE: Scaling indexes with high document count

2010-03-11 Thread Peter S

Hi,

 

Thanks for your reply (and apologies for the original message being sent multiple times
to the list - googlemail problems).

 

I actually meant to put 'maxBufferedDocs'. I admit I'm not that familiar with
this parameter, but as I understand it, it is the number of documents that are
held in RAM before flushing to disk. I've noticed that ramBufferSizeMB is a
similar parameter, but uses memory as the threshold rather than the number of docs.
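
From the Solr example configs, both appear to be set in the <indexDefaults> (or
<mainIndex>) section of solrconfig.xml - something like the sketch below, with
values purely for illustration:

    <indexDefaults>
      <!-- flush the in-memory buffer to disk once it reaches ~64MB... -->
      <ramBufferSizeMB>64</ramBufferSizeMB>
      <!-- ...or once this many docs are buffered (whichever limit is hit first) -->
      <!-- <maxBufferedDocs>10000</maxBufferedDocs> -->
    </indexDefaults>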

 

Is it best not to set these too high on indexers?

 

In my environment, all writes are done via SolrJ: documents are placed in a
SolrDocumentList and commit()ed when the list reaches 1000 documents (default value),
or when a configured commit-thread interval elapses (default is 20s), whichever
comes first. I suppose this is a SolrJ-side version of 'maxBufferedDocs', so
maybe I don't need to set maxBufferedDocs in solrconfig? (The SolrJ 'client' is
on the same machine as the index.)
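
Roughly, the batching on the SolrJ side looks like the sketch below (simplified -
the class name and core URL are placeholders, and the real code also runs the
20s commit timer):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    /** Simplified sketch of the client-side batching described above. */
    public class BatchingIndexer {
        private static final int BATCH_SIZE = 1000;   // commit when the list reaches this size
        private final SolrServer server;
        private final List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

        public BatchingIndexer(String coreUrl) throws Exception {
            // e.g. "http://localhost:8983/solr/index-core" (placeholder URL)
            this.server = new CommonsHttpSolrServer(coreUrl);
        }

        public synchronized void add(SolrInputDocument doc) throws Exception {
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                flush();
            }
        }

        /** Also called from a timer thread every ~20s, whichever comes first. */
        public synchronized void flush() throws Exception {
            if (batch.isEmpty()) return;
            server.add(batch);    // one request for the whole batch
            server.commit();
            batch.clear();
        }
    }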

 

For the indexer cores (essentially write-only indexes), I wasn't planning on
configuring extra memory for read caches (Lucene value cache or filter cache),
as no queries would/should be received on these. Should I reconsider this?
There'll be plenty of RAM available for the indexers to use while still leaving
enough for the OS file system cache to do its thing. Do you have any suggestions
as to what would be the best way to use this memory to achieve optimal indexing
speed?

The main things I do now to tune for fast indexing are: 

 * committing lists of docs rather than each one separately

 * not optimizing too often

 * bumping up the mergeFactor (I use a value of 25 - see the config sketch below)
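
For reference, the mergeFactor lives in solrconfig.xml; a rough sketch of the
relevant bits is below (the autoCommit part is an assumption - it would stay
disabled if commits are only driven from the SolrJ client):

    <indexDefaults>
      <mergeFactor>25</mergeFactor>
    </indexDefaults>

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- autoCommit left commented out; the SolrJ client drives the commits
      <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>60000</maxTime>
      </autoCommit>
      -->
    </updateHandler>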

 

 

Many Thanks!

Peter

 

 

 
> Date: Thu, 11 Mar 2010 09:19:12 -0800
> From: hossman_luc...@fucit.org
> To: solr-user@lucene.apache.org
> Subject: Re: Scaling indexes with high document count
> 
> 
> : I wonder if anyone might have some insight/advice on index scaling for high
> : document count vs size deployments...
> 
> Your general approach sounds reasonable, although specifics of how you'll 
> need to tune the caches and how much hardware you'll need will largely 
> depend on the specifics of the data and the queries.
> 
> I'm not sure what you mean by this though...
> 
> 
> : As searching would always be performed on replicas - the indexing cores
> : wouldn't be tuned with much autowarming/read cache, but have loads of
> : 'maxdocs' cache. The searchers would be the other way 'round - lots of
> 
> what do you mean by "'maxdocs' cache" ?
> 
> 
> 
> -Hoss
> 
  

Re: Scaling indexes with high document count

2010-03-11 Thread Chris Hostetter

: I wonder if anyone might have some insight/advice on index scaling for high
: document count vs size deployments...

Your general approach sounds reasonable, although specifics of how you'll 
need to tune the caches and how much hardware you'll need will largely 
depend on the specifics of the data and the queries.

I'm not sure what you mean by this though...


: As searching would always be performed on replicas - the indexing cores
: wouldn't be tuned with much autowarming/read cache, but have loads of
: 'maxdocs' cache. The searchers would be the other way 'round - lots of

what do you mean by "'maxdocs' cache" ?



-Hoss



Scaling indexes with high document count

2010-03-09 Thread Peter Sturge
Hello,

I wonder if anyone might have some insight/advice on index scaling for high
document count vs size deployments...

The nature of the incoming data is a steady stream of, on average, 4GB per
day. Importantly, the number of documents inserted during this time is
~7 million (i.e. lots of small entries).
The plan is to partition shards on a per-month basis, and hold 6 months of
data.

On the search side, this would mean 6 shards (as replicas), each holding
~120GB with ~210 million document entries (~4GB and ~7 million docs per day
over a ~30-day month).
It is envisioned to deploy 2 indexing cores, of which one is active at a
time. When the active core gets 'full' (e.g. a month has passed), the other
core kicks in for live indexing while the first completes its replication to
its searcher(s). It's then cleared, ready for the next time period. Each time
there is a 'switch', the next available replica is cleared and pointed at the
newly active indexing core to replicate from. After 6 months, the first
replica is re-used, and so on...
This type of layout allows indexing to carry on pretty much uninterrupted,
and makes it relatively easy to manage replicas separately from the indexers
(e.g. add replicas to store, say, 9 months, backup, forward etc.).
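
Assuming the built-in (Solr 1.4) Java replication is used for the indexer ->
searcher copying, the wiring would look roughly like the sketch below (host,
core names and poll interval are placeholders):

    <!-- On the active indexing core (master) - solrconfig.xml -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- On each searcher replica (slave) - solrconfig.xml -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://indexer-host:8983/solr/2010-03/replication</str>
        <str name="pollInterval">00:05:00</str>
      </lst>
    </requestHandler>

A 'switch' could then be done by clearing the next replica and re-pointing it
at the new master, e.g. via /replication?command=fetchindex&masterUrl=... at
runtime.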

As searching would always be performed on replicas - the indexing cores
wouldn't be tuned with much autowarming/read cache, but have loads of
'maxdocs' cache. The searchers would be the other way 'round - lots of
filter/fieldvalue cache. Please correct me if I'm wrong about these. (btw,
client searches use faceting in a big way)
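
On the searcher side, the 'lots of filter/fieldvalue cache' would presumably
translate into something like this in the <query> section of solrconfig.xml
(sizes purely illustrative - they'd need tuning against real queries, and the
Lucene FieldCache memory used by faceting is allocated outside these settings):

    <query>
      <filterCache      class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>
      <queryResultCache class="solr.LRUCache"     size="4096"  initialSize="1024" autowarmCount="512"/>
      <documentCache    class="solr.LRUCache"     size="16384" initialSize="4096"/>
    </query>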

The 120GB disk footprint per shard is perfectly reasonable. Searching across
potentially 1.3 billion document entries (6 shards x ~210 million each), each
with up to 30-80 facets (+ potentially lots of unique values), plus date
faceting and range queries, while still keeping search performance up is where
I could use some advice.
Is this a case of simply throwing enough tin at the problem to handle the
caching/faceting/distributed searches?
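
For context, a typical client search here would end up as a distributed,
faceted request along these lines (host, core and field names are made up for
illustration, and the shards list is abridged):

    http://search-host:8983/solr/2010-03/select?q=*:*
        &shards=replica1:8983/solr/2010-01,replica2:8983/solr/2010-02,...,replica6:8983/solr/2010-06
        &facet=true&facet.field=source_host&facet.field=severity&facet.limit=20&facet.mincount=1
        &facet.date=timestamp&facet.date.start=NOW/DAY-30DAYS&facet.date.end=NOW/DAY&facet.date.gap=%2B1DAY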

What advice would you give to get the best performance out of such a
scenario?
Any experiences/insight etc. is greatly appreciated.

Thanks,
Peter

BTW: Many thanks, Yonik and Lucid for your excellent Mastering Solr webinar
- really useful and highly informative!