RE: huge shards (300GB each) and load balancing

2011-06-15 Thread Burton-West, Tom
Hi Dimitry,

>>The parameters you have mentioned -- termInfosIndexDivisor and
>>termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you 
>>using SOLR 3.1?

I'm pretty sure that the termIndexInterval (the ratio of the tii file to the tis 
file) is in the 1.4.1 example solrconfig.xml file, although I don't have a copy 
to check at the moment.  We are using a 3.1 dev version.  As for the 
termInfosIndexDivisor, I'm also pretty sure it works with 1.4.1, but you might 
have to ask the list to be sure.  As you can see from the blog posts, those 
settings really reduced our memory requirements.  We haven't been doing 
faceting, so we expect memory use to go up again once we add faceting, but at 
least we are starting at a 4GB baseline instead of a 20-32GB baseline.
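
For reference, those two settings end up in solrconfig.xml roughly as sketched 
below. This is from memory, so double-check the element and class names against 
the example solrconfig.xml that ships with your version; the values 1024 and 8 
are just what we settled on:

    <!-- index-time: write only every 1024th term to the tii terms index -->
    <indexDefaults>
      <termIndexInterval>1024</termIndexInterval>
    </indexDefaults>

    <!-- search-time: load only every 8th tii entry into memory -->
    <indexReaderFactory name="IndexReaderFactory"
                        class="org.apache.solr.core.StandardIndexReaderFactory">
      <int name="setTermIndexDivisor">8</int>
    </indexReaderFactory>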

>>Did you do logical sharding or document hash based?

On the indexing side we just assign documents to a particular shard on a 
round-robin basis and use a database to keep track of which document is in 
which shard, so if we need to update a document we update the right shard (see 
the "Forty days" article on the blog for a more detailed description and some 
diagrams).  We hope that this distributes the documents evenly enough to avoid 
problems with Solr's lack of global idf.

>>Do you have a load balancer between the front SOLR (or front entity) and the shards,

As far as load balancing and which shard acts as the head/front shard: again, 
our app layer just randomly picks one of the shards to be the head shard.  We 
originally planned to run tests to determine whether it was better to have one 
dedicated machine configured as the head shard, but never got around to that.  
We have a very low query request rate, so we haven't had to look seriously at 
load balancing.

>>do you do merging? 

I'm not sure what you mean by "do you do merging".  We are just using the 
default Solr distributed search.  In theory our documents should be randomly 
distributed among the shards, so the lack of global idf should not hurt the 
merging process.  Andrzej Bialecki gave a recent presentation on Solr 
distributed search that talks about less-than-optimal results merging and some 
ideas for dealing with it:
http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/AndrzejBialecki-Buzzwords-2011_0.pdf
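
In case it helps to make "default distributed search" concrete: the head shard 
is just whichever Solr instance receives the query with a shards parameter 
listing all of the shards, along these lines (hostnames made up):

    http://head-shard:8983/solr/select?q=dog&shards=shard1:8983/solr,shard2:8983/solr,...,shard12:8983/solr

That instance queries the other shards, merges the per-shard results by score, 
and returns the combined response.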

>>Each shard currently is allocated max 12GB memory. 

I'm curious about how much memory you leave to the OS for disk caching.  Can 
you give any details about the number of shards per machine and the total 
memory on each machine?


Tom Burton-West
 http://www.hathitrust.org/blogs/large-scale-search




From: Dmitry Kan [dmitry....@gmail.com]
Sent: Tuesday, June 14, 2011 2:15 PM
To: solr-user@lucene.apache.org
Subject: Re: huge shards (300GB each) and load balancing

Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte total index
size, and we have split our index over 10 shards (horizontal scaling, no
replication). Each shard currently is allocated max 12GB memory. We use
facet search a lot and non-facet search with parameter values generated by
facet search (hence a more focused search that hits a small portion of the Solr
documents).

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you
using SOLR 3.1? Did you do logical sharding or document hash based? Do
you have a load balancer between the front SOLR (or front entity) and the
shards; do you do merging?





Re: huge shards (300GB each) and load balancing

2011-06-14 Thread Dmitry Kan
Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte total index
size, and we have split our index over 10 shards (horizontal scaling, no
replication). Each shard currently is allocated max 12GB memory. We use
facet search a lot and non-facet search with parameter values generated by
facet search (hence a more focused search that hits a small portion of the Solr
documents).

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you
using SOLR 3.1? Did you do logical sharding or document hash based? Do
you have a load balancer between the front SOLR (or front entity) and the
shards; do you do merging?

On Wed, Jun 8, 2011 at 10:23 PM, Burton-West, Tom wrote:

> Hi Dmitry,
>
> I am assuming you are splitting one very large index over multiple shards
> rather than replicating an index multiple times.
>
> Just for a point of comparison, I thought I would describe our experience
> with large shards. At HathiTrust, we run a 6 terabyte index over 12 shards.
>  This is split over 4 machines with 3 shards per machine and our shards are
> about 400-500GB.  We get average response times of around 200 ms with the
> 99th percentile queries up around 1-2 seconds. We have a very low qps rate,
> i.e. less than 1 qps.  We also index offline on a separate machine and
> update the indexes nightly.
>
> Some of the issues we have found with very large shards are:
> 1) Because of the very large shard size, I/O tends to be the bottleneck,
> with phrase queries containing common words being the slowest.
> 2) Because of the I/O issues, running cache-warming queries to get postings
> into the OS disk cache is important, as is leaving significant free memory
> for the OS to use for disk caching.
> 3) Because of the I/O issues, using stop words or CommonGrams produces a
> significant performance increase.
> 4) We have a huge number of unique terms in our indexes.  In order to
> reduce the amount of memory needed by the in-memory terms index we set the
> termInfosIndexDivisor to 8, which causes Solr to only load every 8th term
> from the tii file into memory. This reduced memory use from over 18GB to
> below 3GB and got rid of 30-second stop-the-world Java garbage collections.
> (See
> http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
> for details.)  We later ran into memory problems when indexing, so instead
> we changed the index-time parameter termIndexInterval from 128 to 1024.
>
> (More details here: http://www.hathitrust.org/blogs/large-scale-search)
>
> Tom Burton-West
>
>


-- 
Regards,

Dmitry Kan


RE: huge shards (300GB each) and load balancing

2011-06-08 Thread Burton-West, Tom
Hi Dmitry,

I am assuming you are splitting one very large index over multiple shards 
rather than replicating an index multiple times.

Just for a point of comparison, I thought I would describe our experience with 
large shards. At HathiTrust, we run a 6 terabyte index over 12 shards.  This is 
split over 4 machines with 3 shards per machine and our shards are about 
400-500GB.  We get average response times of around 200 ms with the 99th 
percentile queries up around 1-2 seconds. We have a very low qps rate, i.e. 
less than 1 qps.  We also index offline on a separate machine and update the 
indexes nightly.

Some of the issues we have found with very large shards are:
1) Because of the very large shard size, I/O tends to be the bottleneck, with 
phrase queries containing common words being the slowest.
2) Because of the I/O issues, running cache-warming queries to get postings into 
the OS disk cache is important, as is leaving significant free memory for the OS 
to use for disk caching.
3) Because of the I/O issues, using stop words or CommonGrams produces a 
significant performance increase (a sketch of a CommonGrams field type follows 
this list).
4) We have a huge number of unique terms in our indexes.  In order to reduce 
the amount of memory needed by the in-memory terms index we set the 
termInfosIndexDivisor to 8, which causes Solr to only load every 8th term from 
the tii file into memory. This reduced memory use from over 18GB to below 3GB 
and got rid of 30-second stop-the-world Java garbage collections. (See 
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for 
details.)  We later ran into memory problems when indexing, so instead we 
changed the index-time parameter termIndexInterval from 128 to 1024.
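
Regarding point 3, a CommonGrams field type in schema.xml looks roughly like the 
sketch below; the file name commonwords.txt and the tokenizer choice are just 
placeholders. The important parts are CommonGramsFilterFactory at index time and 
CommonGramsQueryFilterFactory at query time, which join a common word and its 
neighbor into a single bigram token with a much shorter postings list:

    <fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- emit word+word bigrams for words listed in commonwords.txt -->
        <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- at query time keep the bigrams (dropping the common unigrams) so
             phrase queries read the short bigram postings instead -->
        <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>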

(More details here: http://www.hathitrust.org/blogs/large-scale-search)

Tom Burton-West



Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Dmitry Kan
Hi, Bill. Thanks, always nice to have options!

Dmitry

On Wed, Jun 8, 2011 at 4:47 PM, Bill Bell  wrote:

> Re Amazon elb.
>
> This is not exactly true. The ELB does load balance internal IPs. But the
> ELB IP address must be external. Still a major issue unless you use
> authentication. Nginx and others can also do load balancing.
>
> Bill Bell
> Sent from mobile
>
>
> On Jun 8, 2011, at 3:32 AM, "Upayavira"  wrote:
>
> >
> >
> > On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" 
> > wrote:
> >> Hello list,
> >>
> >> Thanks for attending to my previous questions so far, have learnt a lot.
> >> Here is another one, I hope it will be interesting to answer.
> >>
> >>
> >>
> >> We run our SOLR shards and front end SOLR on the Amazon high-end
> >> machines.
> >> Currently we have 6 shards with around 200GB in each. Currently we have
> >> only
> >> one front end SOLR which, given a client query, redirects it to all the
> >> shards. Our shards are constantly growing, data is at times reindexed
> (in
> >> batches, which is done by removing a decent chunk before replacing it
> >> with
> >> updated data), constant stream of new data is coming every hour (usually
> >> hits the latest shard in time, but can also hit other shards, which have
> >> older data). Since the front end SOLR has started to be a SPOF, we are
> >> thinking about setting up some sort of load balancer.
> >>
> >> 1) do you think ELB from Amazon is a good solution for starters? We
> don't
> >> need to maintain sessions between SOLR and client.
> >> 2) What other load balancers have been used specifically with SOLR?
> >>
> >>
> >> Overall: does SOLR scale to such size (200GB in an index) and what can
> be
> >> recommended as next step -- resharding (cutting existing shards to
> >> smaller
> >> chunks), replication?
> >
> > Really, it is going to be up to you to work out what works in your
> > situation. You may be reaching the limit of what a Lucene index can
> > handle, don't know. If your query traffic is low, you might find that
> > two 100Gb cores in a single instance performs better. But then, maybe
> > not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
> > :-)
> >
> > The principal issue with Amazon's load balancers (at least when I was
> > using them last year) is that the ports that they balance need to be
> > public. You can't use an Amazon load balancer as an internal service
> > within a security group. For a service such as Solr, that can be a bit
> > of a killer.
> >
> > If they've fixed that issue, then they'd work fine (I used them quite
> > happily in another scenario).
> >
> > When looking at resolving single points of failure, handling search is
> > pretty easy (as you say, stateless load balancer). You will need to give
> > more attention though to how you handle it regarding indexing.
> >
> > Hope that helps a bit!
> >
> > Upayavira
> >
> >
> >
> >
> >
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> >
>



-- 
Regards,

Dmitry Kan


Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Bill Bell
Re Amazon elb.

This is not exactly true. The ELB does load balance internal IPs. But the ELB 
IP address must be external. Still a major issue unless you use 
authentication. Nginx and others can also do load balancing.
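
If you go the nginx route, a minimal setup in front of two Solr front ends looks 
roughly like this (goes inside the http block of nginx.conf; hostnames and ports 
are placeholders):

    upstream solr_front {
        server solr-front1.internal:8983;
        server solr-front2.internal:8983;
    }

    server {
        listen 8080;
        location /solr/ {
            # round-robin by default; both front ends must point at the same shards
            proxy_pass http://solr_front;
        }
    }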

Bill Bell
Sent from mobile


On Jun 8, 2011, at 3:32 AM, "Upayavira"  wrote:

> 
> 
> On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" 
> wrote:
>> Hello list,
>> 
>> Thanks for attending to my previous questions so far, have learnt a lot.
>> Here is another one, I hope it will be interesting to answer.
>> 
>> 
>> 
>> We run our SOLR shards and front end SOLR on the Amazon high-end
>> machines.
>> Currently we have 6 shards with around 200GB in each. Currently we have
>> only
>> one front end SOLR which, given a client query, redirects it to all the
>> shards. Our shards are constantly growing, data is at times reindexed (in
>> batches, which is done by removing a decent chunk before replacing it
>> with
>> updated data), constant stream of new data is coming every hour (usually
>> hits the latest shard in time, but can also hit other shards, which have
>> older data). Since the front end SOLR has started to be a SPOF, we are
>> thinking about setting up some sort of load balancer.
>> 
>> 1) do you think ELB from Amazon is a good solution for starters? We don't
>> need to maintain sessions between SOLR and client.
>> 2) What other load balancers have been used specifically with SOLR?
>> 
>> 
>> Overall: does SOLR scale to such size (200GB in an index) and what can be
>> recommended as next step -- resharding (cutting existing shards to
>> smaller
>> chunks), replication?
> 
> Really, it is going to be up to you to work out what works in your
> situation. You may be reaching the limit of what a Lucene index can
> handle, don't know. If your query traffic is low, you might find that
> two 100Gb cores in a single instance performs better. But then, maybe
> not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
> :-)
> 
> The principal issue with Amazon's load balancers (at least when I was
> using them last year) is that the ports that they balance need to be
> public. You can't use an Amazon load balancer as an internal service
> within a security group. For a service such as Solr, that can be a bit
> of a killer.
> 
> If they've fixed that issue, then they'd work fine (I used them quite
> happily in another scenario).
> 
> When looking at resolving single points of failure, handling search is
> pretty easy (as you say, stateless load balancer). You will need to give
> more attention though to how you handle it regarding indexing.
> 
> Hope that helps a bit!
> 
> Upayavira
> 
> 
> 
> 
> 
> --- 
> Enterprise Search Consultant at Sourcesense UK, 
> Making Sense of Open Source
> 


Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Dmitry Kan
Hi Upayavira,

Thanks for sharing insights and experience on this.

As we have 6 shards at the moment, it is pretty hard (almost impossible) to
keep them all on a single box, so that's why we decided to shard across
machines. On the other hand, we have never tried a multicore architecture, so
that's a good point, thanks.

On the indexing side, we do it rather straightforwardly, that is, by updating
the online shards. This should hopefully be improved with an [offline update /
http swap] system, as even now, updating online 200GB shards at times
produces OOMs, freezes and other issues.
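
(By "http swap" I mean something along the lines of Solr's CoreAdmin SWAP: build
the new index in an offline core on the same box, then swap it in with a single
call, e.g. -- core names made up:

    http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=rebuild

so the online core never has to absorb the heavy indexing load directly.)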



Does anyone have other experience with, or pointers to, load balancer software
that has been tried with SOLR?

Dmitry

On Wed, Jun 8, 2011 at 12:32 PM, Upayavira  wrote:

>
>
> On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" 
> wrote:
> > Hello list,
> >
> > Thanks for attending to my previous questions so far, have learnt a lot.
> > Here is another one, I hope it will be interesting to answer.
> >
> >
> >
> > We run our SOLR shards and front end SOLR on the Amazon high-end
> > machines.
> > Currently we have 6 shards with around 200GB in each. Currently we have
> > only
> > one front end SOLR which, given a client query, redirects it to all the
> > shards. Our shards are constantly growing, data is at times reindexed (in
> > batches, which is done by removing a decent chunk before replacing it
> > with
> > updated data), constant stream of new data is coming every hour (usually
> > hits the latest shard in time, but can also hit other shards, which have
> > older data). Since the front end SOLR has started to be a SPOF, we are
> > thinking about setting up some sort of load balancer.
> >
> > 1) do you think ELB from Amazon is a good solution for starters? We don't
> > need to maintain sessions between SOLR and client.
> > 2) What other load balancers have been used specifically with SOLR?
> >
> >
> > Overall: does SOLR scale to such size (200GB in an index) and what can be
> > recommended as next step -- resharding (cutting existing shards to
> > smaller
> > chunks), replication?
>
> Really, it is going to be up to you to work out what works in your
> situation. You may be reaching the limit of what a Lucene index can
> handle, don't know. If your query traffic is low, you might find that
> two 100Gb cores in a single instance performs better. But then, maybe
> not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
> :-)
>
> The principal issue with Amazon's load balancers (at least when I was
> using them last year) is that the ports that they balance need to be
> public. You can't use an Amazon load balancer as an internal service
> within a security group. For a service such as Solr, that can be a bit
> of a killer.
>
> If they've fixed that issue, then they'd work fine (I used them quite
> happily in another scenario).
>
> When looking at resolving single points of failure, handling search is
> pretty easy (as you say, stateless load balancer). You will need to give
> more attention though to how you handle it regarding indexing.
>
> Hope that helps a bit!
>
> Upayavira
>
>
>
>
>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>


Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Upayavira


On Wed, 08 Jun 2011 10:42 +0300, "Dmitry Kan" 
wrote:
> Hello list,
> 
> Thanks for attending to my previous questions so far, have learnt a lot.
> Here is another one, I hope it will be interesting to answer.
> 
> 
> 
> We run our SOLR shards and front end SOLR on the Amazon high-end
> machines.
> Currently we have 6 shards with around 200GB in each. Currently we have
> only
> one front end SOLR which, given a client query, redirects it to all the
> shards. Our shards are constantly growing, data is at times reindexed (in
> batches, which is done by removing a decent chunk before replacing it
> with
> updated data), constant stream of new data is coming every hour (usually
> hits the latest shard in time, but can also hit other shards, which have
> older data). Since the front end SOLR has started to be a SPOF, we are
> thinking about setting up some sort of load balancer.
> 
> 1) do you think ELB from Amazon is a good solution for starters? We don't
> need to maintain sessions between SOLR and client.
> 2) What other load balancers have been used specifically with SOLR?
> 
> 
> Overall: does SOLR scale to such size (200GB in an index) and what can be
> recommended as next step -- resharding (cutting existing shards to
> smaller
> chunks), replication?

Really, it is going to be up to you to work out what works in your
situation. You may be reaching the limit of what a Lucene index can
handle, don't know. If your query traffic is low, you might find that
two 100Gb cores in a single instance performs better. But then, maybe
not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
:-)
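
(For completeness, two cores in a single instance is just a solr.xml along these
lines -- a sketch, with names and instanceDir paths made up:

    <solr persistent="true">
      <cores adminPath="/admin/cores">
        <core name="shard-a" instanceDir="shard-a"/>
        <core name="shard-b" instanceDir="shard-b"/>
      </cores>
    </solr>

each core then being queried at /solr/shard-a/select and /solr/shard-b/select.)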

The principal issue with Amazon's load balancers (at least when I was
using them last year) is that the ports that they balance need to be
public. You can't use an Amazon load balancer as an internal service
within a security group. For a service such as Solr, that can be a bit
of a killer.

If they've fixed that issue, then they'd work fine (I used them quite
happily in another scenario).

When looking at resolving single points of failure, handling search is
pretty easy (as you say, stateless load balancer). You will need to give
more attention though to how you handle it regarding indexing.

Hope that helps a bit!

Upayavira





--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source