Re: solr multicore vs sharding vs 1 big collection

2015-08-04 Thread Shawn Heisey
On 8/4/2015 3:30 PM, Jay Potharaju wrote:
> For the last few days I have been trying to correlate the timeouts with GC.
> I noticed in the GC logs that full GC takes a long time once in a while. Does
> this mean that the JVM memory is set too high or too low?



> 1973953.560: [GC 4474277K->3300411K(4641280K), 0.0423129 secs]
> 1973960.674: [GC 4536894K->3371225K(4630016K), 0.0560341 secs]
> 1973960.731: [Full GC 3371225K->3339436K(5086208K), 15.5285889 secs]
> 1973990.516: [GC 4548268K->3405111K(5096448K), 0.0657788 secs]
> 1973998.191: [GC 4613934K->3527257K(5086208K), 0.1304232 secs]

Based on what I can see there, it looks like 6GB might be enough heap. 
Your low points are all in the 3GB range, which is only half of that.  A
6GB heap is not very big in the Solr world.
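
If you want to scan a whole GC log for this instead of eyeballing it, a
rough Java sketch like the one below would do it.  It assumes the default
log format shown in your excerpt, and the one-second pause threshold is
just an example:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scan a GC log for long pauses and for the post-collection heap low
// points (a rough guide to how much live data the heap really holds).
public class GcLogScan {
    // e.g. "1973960.731: [Full GC 3371225K->3339436K(5086208K), 15.5285889 secs]"
    private static final Pattern LINE = Pattern.compile(
        "([\\d.]+): \\[(Full GC|GC) (\\d+)K->(\\d+)K\\((\\d+)K\\), ([\\d.]+) secs\\]");

    public static void main(String[] args) throws Exception {
        double pauseThreshold = 1.0;   // seconds; adjust to taste
        long lowestAfterK = Long.MAX_VALUE;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        for (String line; (line = in.readLine()) != null; ) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) continue;
            long afterK = Long.parseLong(m.group(4));
            double secs = Double.parseDouble(m.group(6));
            lowestAfterK = Math.min(lowestAfterK, afterK);
            if (secs >= pauseThreshold) {
                System.out.printf("%s pause of %.2fs at %s, heap %sK -> %sK%n",
                    m.group(2), secs, m.group(1), m.group(3), m.group(4));
            }
        }
        in.close();
        System.out.printf("Lowest heap after any collection: %dK%n", lowestAfterK);
    }
}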

Based on that GC log and my own experiences, I'm guessing that your GC
isn't tuned.  The default collector that Java chooses is *terrible* for
Solr.  Even just switching collectors to CMS or G1 will not, by itself,
improve the situation much; Solr requires extensive GC tuning for good
performance.

The SolrPerformanceProblems wiki page that I pointed you to previously
contains a little bit of info on GC tuning, and it also links to the
following page, which is my personal page on the wiki, and documents
some of my garbage collection journey with Solr:

https://wiki.apache.org/solr/ShawnHeisey

Thanks,
Shawn



Re: solr multicore vs sharding vs 1 big collection

2015-08-04 Thread Jay Potharaju
For the last few days I have been trying to correlate the timeouts with GC.
I noticed in the GC logs that full GC takes a long time once in a while. Does
this mean that the JVM memory is set too high or too low?


 [GC 4730643K->3552794K(4890112K), 0.0433146 secs]
1973853.751: [Full GC 3552794K->2926402K(4635136K), 0.3123954 secs]
1973864.170: [GC 4127554K->2972129K(4644864K), 0.0418248 secs]
1973873.341: [GC 4185569K->2990123K(4640256K), 0.0451723 secs]
1973882.452: [GC 4201770K->2999178K(4645888K), 0.0611839 secs]
1973890.684: [GC 4220298K->3010751K(4646400K), 0.0302890 secs]
1973900.539: [GC 4229514K->3015049K(4646912K), 0.0470857 secs]
1973911.179: [GC 4237193K->3040837K(4646912K), 0.0373900 secs]
1973920.822: [GC 4262981K->3072045K(4655104K), 0.0450480 secs]
1973927.136: [GC 4307501K->3129835K(4635648K), 0.0392559 secs]
1973933.057: [GC 4363058K->3178923K(4647936K), 0.0426612 secs]
1973940.981: [GC 4405163K->3210677K(4648960K), 0.0557622 secs]
1973946.680: [GC 4436917K->3239408K(4656128K), 0.0430889 secs]
1973953.560: [GC 4474277K->3300411K(4641280K), 0.0423129 secs]
1973960.674: [GC 4536894K->3371225K(4630016K), 0.0560341 secs]
1973960.731: [Full GC 3371225K->3339436K(5086208K), 15.5285889 secs]
1973990.516: [GC 4548268K->3405111K(5096448K), 0.0657788 secs]
1973998.191: [GC 4613934K->3527257K(5086208K), 0.1304232 secs]
1974006.505: [GC 4723801K->3597899K(5132800K), 0.0899599 secs]
1974014.748: [GC 4793955K->3654280K(5163008K), 0.0989430 secs]
1974025.349: [GC 4880823K->3672457K(5182464K), 0.0683296 secs]
1974037.517: [GC 4899721K->3681560K(5234688K), 0.1028356 secs]
1974050.066: [GC 4938520K->3718901K(5256192K), 0.0796073 secs]
1974061.466: [GC 4974356K->3726357K(5308928K), 0.1324846 secs]
1974071.726: [GC 5003687K->3757516K(5336064K), 0.0734227 secs]
1974081.917: [GC 5036492K->3777662K(5387264K), 0.1475958 secs]
1974091.853: [GC 5074558K->3800799K(5421056K), 0.0799311 secs]
1974101.882: [GC 5097363K->3846378K(5434880K), 0.3011178 secs]
1974109.234: [GC 5121936K->3930457K(5478912K), 0.0956342 secs]
1974116.082: [GC 5206361K->3974011K(5215744K), 0.1967284 secs]

Thanks
Jay

On Mon, Aug 3, 2015 at 1:53 PM, Bill Bell  wrote:

> Yeah, a separate collection by month or year is good and can really help in this case.
>
> Bill Bell
> Sent from mobile
>
>
> > On Aug 2, 2015, at 5:29 PM, Jay Potharaju  wrote:
> >
> > Shawn,
> > Thanks for the feedback. I agree that increasing the timeout might
> > alleviate the timeout issue. The main problem with increasing the timeout
> > is the detrimental effect it will have on the user experience, so I can't
> > increase it.
> > I have looked at the queries that threw errors, but the next time I try
> > them everything seems to work fine. Not sure how to reproduce the error.
> > My concern with increasing the memory to 32GB is what happens when the
> > index size grows over the next few months.
> > One of the other solutions I have been thinking about is to rebuild the
> > index (weekly), create a new collection, and use it. Are there any good
> > references for doing that?
> > Thanks
> > Jay
> >
> >> On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey wrote:
> >>
> >>> On 8/2/2015 8:29 AM, Jay Potharaju wrote:
> >>> The document contains around 30 fields and has stored set to true for
> >>> almost 15 of them. And these stored fields are queried and updated all
> >>> the time. You will notice that the deleted documents are almost 30% of the
> >>> docs.  And it has stayed around that percent and has not come down.
> >>> I did try optimize but that was disruptive as it caused search errors.
> >>> I have been playing with merge factor to see if that helps with deleted
> >>> documents or not. It is currently set to 5.
> >>>
> >>> The server has 24 GB of memory out of which memory consumption is
> around
> >> 23
> >>> GB normally and the jvm is set to 6 GB. And have noticed that the
> >> available
> >>> memory on the server goes to 100 MB at times during a day.
> >>> All the updates are run through DIH.
> >>
> >> Using all available memory is completely normal operation for ANY
> >> operating system.  If you hold up Windows as an example of one that
> >> doesn't ... it lies to you about "available" memory.  All modern
> >> operating systems will utilize memory that is not explicitly allocated
> >> for the OS disk cache.
> >>
> >> The disk cache will instantly give up any of the memory it is using for
> >> programs that request it.  Linux doesn't try to hide the disk cache from
> >> you, but older versions of Windows do.  In the newer versions of Windows
> >> that have the Resource Monitor, you can go there to see the actual
> >> memory usage including the cache.
> >>
> >>> Every day at least once I see the following error, which results in
> >>> search errors on the front end of the site.
> >>>
> >>> ERROR org.apache.solr.servlet.SolrDispatchFilter -
> >>> null:org.eclipse.jetty.io.EofException
> >>>
> >>> From what I have read these are mainly due to timeouts, and my timeout
> >>> is set to 30 seconds and can't be set to a higher number.

Re: solr multicore vs sharding vs 1 big collection

2015-08-03 Thread Bill Bell
Yeah, a separate collection by month or year is good and can really help in this case.

Bill Bell
Sent from mobile


> On Aug 2, 2015, at 5:29 PM, Jay Potharaju  wrote:
> 
> Shawn,
> Thanks for the feedback. I agree that increasing the timeout might alleviate
> the timeout issue. The main problem with increasing the timeout is the
> detrimental effect it will have on the user experience, so I can't
> increase it.
> I have looked at the queries that threw errors, but the next time I try
> them everything seems to work fine. Not sure how to reproduce the error.
> My concern with increasing the memory to 32GB is what happens when the
> index size grows over the next few months.
> One of the other solutions I have been thinking about is to rebuild the
> index (weekly), create a new collection, and use it. Are there any good
> references for doing that?
> Thanks
> Jay
> 
>> On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey  wrote:
>> 
>>> On 8/2/2015 8:29 AM, Jay Potharaju wrote:
>>> The document contains around 30 fields and has stored set to true for
>>> almost 15 of them. And these stored fields are queried and updated all
>>> the time. You will notice that the deleted documents are almost 30% of the
>>> docs.  And it has stayed around that percent and has not come down.
>>> I did try optimize but that was disruptive as it caused search errors.
>>> I have been playing with merge factor to see if that helps with deleted
>>> documents or not. It is currently set to 5.
>>> 
>>> The server has 24 GB of memory out of which memory consumption is around
>> 23
>>> GB normally and the jvm is set to 6 GB. And have noticed that the
>> available
>>> memory on the server goes to 100 MB at times during a day.
>>> All the updates are run through DIH.
>> 
>> Using all available memory is completely normal operation for ANY
>> operating system.  If you hold up Windows as an example of one that
>> doesn't ... it lies to you about "available" memory.  All modern
>> operating systems will utilize memory that is not explicitly allocated
>> for the OS disk cache.
>> 
>> The disk cache will instantly give up any of the memory it is using for
>> programs that request it.  Linux doesn't try to hide the disk cache from
>> you, but older versions of Windows do.  In the newer versions of Windows
>> that have the Resource Monitor, you can go there to see the actual
>> memory usage including the cache.
>> 
>>> Every day at least once I see the following error, which results in search
>>> errors on the front end of the site.
>>> 
>>> ERROR org.apache.solr.servlet.SolrDispatchFilter -
>>> null:org.eclipse.jetty.io.EofException
>>> 
>>> From what I have read these are mainly due to timeouts, and my timeout is
>>> set to 30 seconds and can't be set to a higher number. I was thinking that
>>> maybe high memory usage sometimes leads to bad performance/errors.
>> 
>> Although this error can be caused by timeouts, it has a specific
>> meaning.  It means that the client disconnected before Solr responded to
>> the request, so when Solr tried to respond (through jetty), it found a
>> closed TCP connection.
>> 
>> Client timeouts need to either be completely removed, or set to a value
>> much longer than any request will take.  Five minutes is a good starting
>> value.
>> 
>> If all your client timeouts are set to 30 seconds and you are seeing
>> EofExceptions, that means that your requests are taking longer than 30
>> seconds, and you likely have some performance issues.  It's also
>> possible that some of your client timeouts are set a lot shorter than 30
>> seconds.
>> 
>>> My objective is to stop the errors; adding more memory to the server is
>>> not a good scaling strategy. That is why I was thinking maybe there is an
>>> issue with the way things are set up that needs to be revisited.
>> 
>> You're right that adding more memory to the servers is not a good
>> scaling strategy for the general case ... but in this situation, I think
>> it might be prudent.  For your index and heap sizes, I would want the
>> company to pay for at least 32GB of RAM.
>> 
>> Having said that ... I've seen Solr installs work well with a LOT less
>> memory than the ideal.  I don't know that adding more memory is
>> necessary, unless your system (CPU, storage, and memory speeds) is
>> particularly slow.  Based on your document count and index size, your
>> documents are quite small, so I think your memory size is probably good
>> -- if the CPU, memory bus, and storage are very fast.  If one or more of
>> those subsystems aren't fast, then make up the difference with lots of
>> memory.
>> 
>> Some light reading, where you will learn why I think 32GB is an ideal
>> memory size for your system:
>> 
>> https://wiki.apache.org/solr/SolrPerformanceProblems
>> 
>> It is possible that your 6GB heap is not quite big enough for good
>> performance, or that your GC is not well-tuned.  These topics are also
>> discussed on that wiki page.  If you increase your heap size, then the
>> likelihood of needing more memory in the system becomes greater, because
>> there will be less memory available for the disk cache.

Re: solr multicore vs sharding vs 1 big collection

2015-08-03 Thread Upayavira
There are two things that are likely to cause the timeouts you are
seeing, I'd say.

Firstly, your server is overloaded - that can be handled by adding
additional replicas.

However, it doesn't seem like this is the case, because the second query
works fine.

Secondly, you are hitting garbage collection issues. This seems more
likely to me. You have 40m documents inside a 6GB heap. That seems
relatively tight to me. What that means is that Java may well not have
enough space to create all the objects it needs inside a single commit
cycle, forcing a garbage collection which can cause application pauses,
which would fit with what you are seeing.

I'd suggest using the jstat -gcutil command (I think I have that right)
to watch the number of garbage collections taking place. You will
quickly see from that if garbage collection is your issue. The
simplistic remedy would be to allow your JVM a bit more memory.
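
If attaching jstat to the running process is awkward, roughly the same
counters are available through the GC MXBeans over JMX. A rough sketch
(it assumes the Solr JVM was started with remote JMX enabled, and the
port below is only a placeholder):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Poll the GC MXBeans of a remote JVM over JMX, similar in spirit to
// watching jstat -gcutil.  Port 18983 is a placeholder.
public class GcWatch {
    public static void main(String[] args) throws Exception {
        String url = "service:jmx:rmi:///jndi/rmi://localhost:18983/jmxrmi";
        MBeanServerConnection conn =
            JMXConnectorFactory.connect(new JMXServiceURL(url)).getMBeanServerConnection();
        Set<ObjectName> gcNames =
            conn.queryNames(new ObjectName("java.lang:type=GarbageCollector,*"), null);
        while (true) {
            for (ObjectName name : gcNames) {
                GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(
                    conn, name.toString(), GarbageCollectorMXBean.class);
                System.out.printf("%-25s collections=%d totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(10000);   // poll every 10 seconds
        }
    }
}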

The other concern I have is that Solr (and Lucene) is intended for high
read/low write scenarios. Its index structure is highly tuned for this
scenario. If you are doing a lot of writes, then you will be creating a
lot of index churn which will require more frequent merges, consuming
both CPU and memory in the process. It may be worth looking at *how* you
use Solr, and see whether, for example, you can separate your documents
into slow-moving and fast-moving parts, to better suit the Lucene index
structures. Or to consider whether a Lucene based system is best for
what you are attempting to achieve.

For garbage collection, see here for a good Solr-related write-up:

  http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

Upayavira

On Mon, Aug 3, 2015, at 12:29 AM, Jay Potharaju wrote:
> Shawn,
> Thanks for the feedback. I agree that increasing the timeout might alleviate
> the timeout issue. The main problem with increasing the timeout is the
> detrimental effect it will have on the user experience, so I can't
> increase it.
> I have looked at the queries that threw errors, but the next time I try
> them everything seems to work fine. Not sure how to reproduce the error.
> My concern with increasing the memory to 32GB is what happens when the
> index size grows over the next few months.
> One of the other solutions I have been thinking about is to rebuild the
> index (weekly), create a new collection, and use it. Are there any good
> references for doing that?
> Thanks
> Jay
> 
> On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey wrote:
> 
> > On 8/2/2015 8:29 AM, Jay Potharaju wrote:
> > > The document contains around 30 fields and has stored set to true for
> > > almost 15 of them. And these stored fields are queried and updated all
> > > the time. You will notice that the deleted documents are almost 30% of the
> > > docs.  And it has stayed around that percent and has not come down.
> > > I did try optimize but that was disruptive as it caused search errors.
> > > I have been playing with merge factor to see if that helps with deleted
> > > documents or not. It is currently set to 5.
> > >
> > > The server has 24 GB of memory out of which memory consumption is around
> > 23
> > > GB normally and the jvm is set to 6 GB. And have noticed that the
> > available
> > > memory on the server goes to 100 MB at times during a day.
> > > All the updates are run through DIH.
> >
> > Using all available memory is completely normal operation for ANY
> > operating system.  If you hold up Windows as an example of one that
> > doesn't ... it lies to you about "available" memory.  All modern
> > operating systems will utilize memory that is not explicitly allocated
> > for the OS disk cache.
> >
> > The disk cache will instantly give up any of the memory it is using for
> > programs that request it.  Linux doesn't try to hide the disk cache from
> > you, but older versions of Windows do.  In the newer versions of Windows
> > that have the Resource Monitor, you can go there to see the actual
> > memory usage including the cache.
> >
> > > Every day at least once I see the following error, which results in search
> > > errors on the front end of the site.
> > >
> > > ERROR org.apache.solr.servlet.SolrDispatchFilter -
> > > null:org.eclipse.jetty.io.EofException
> > >
> > > From what I have read these are mainly due to timeouts, and my timeout is
> > > set to 30 seconds and can't be set to a higher number. I was thinking that
> > > maybe high memory usage sometimes leads to bad performance/errors.
> >
> > Although this error can be caused by timeouts, it has a specific
> > meaning.  It means that the client disconnected before Solr responded to
> > the request, so when Solr tried to respond (through jetty), it found a
> > closed TCP connection.
> >
> > Client timeouts need to either be completely removed, or set to a value
> > much longer than any request will take.  Five minutes is a good starting
> > value.
> >
> > If all your client timeouts are set to 30 seconds and you are seeing
> > EofExceptions, that means that your requests are taking longer than 30
> > seconds, and you likely have some performance issues.

Re: solr multicore vs sharding vs 1 big collection

2015-08-02 Thread Jay Potharaju
Shawn,
Thanks for the feedback. I agree that increasing the timeout might alleviate
the timeout issue. The main problem with increasing the timeout is the
detrimental effect it will have on the user experience, so I can't
increase it.
I have looked at the queries that threw errors, but the next time I try
them everything seems to work fine. Not sure how to reproduce the error.
My concern with increasing the memory to 32GB is what happens when the
index size grows over the next few months.
One of the other solutions I have been thinking about is to rebuild the
index (weekly), create a new collection, and use it. Are there any good
references for doing that?
Thanks
Jay

On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey  wrote:

> On 8/2/2015 8:29 AM, Jay Potharaju wrote:
> > The document contains around 30 fields and has stored set to true for
> > almost 15 of them. And these stored fields are queried and updated all
> > the time. You will notice that the deleted documents are almost 30% of the
> > docs.  And it has stayed around that percent and has not come down.
> > I did try optimize but that was disruptive as it caused search errors.
> > I have been playing with merge factor to see if that helps with deleted
> > documents or not. It is currently set to 5.
> >
> > The server has 24 GB of memory out of which memory consumption is around
> 23
> > GB normally and the jvm is set to 6 GB. And have noticed that the
> available
> > memory on the server goes to 100 MB at times during a day.
> > All the updates are run through DIH.
>
> Using all available memory is completely normal operation for ANY
> operating system.  If you hold up Windows as an example of one that
> doesn't ... it lies to you about "available" memory.  All modern
> operating systems will utilize memory that is not explicitly allocated
> for the OS disk cache.
>
> The disk cache will instantly give up any of the memory it is using for
> programs that request it.  Linux doesn't try to hide the disk cache from
> you, but older versions of Windows do.  In the newer versions of Windows
> that have the Resource Monitor, you can go there to see the actual
> memory usage including the cache.
>
> > Every day at least once I see the following error, which results in search
> > errors on the front end of the site.
> >
> > ERROR org.apache.solr.servlet.SolrDispatchFilter -
> > null:org.eclipse.jetty.io.EofException
> >
> > From what I have read these are mainly due to timeouts, and my timeout is
> > set to 30 seconds and can't be set to a higher number. I was thinking that
> > maybe high memory usage sometimes leads to bad performance/errors.
>
> Although this error can be caused by timeouts, it has a specific
> meaning.  It means that the client disconnected before Solr responded to
> the request, so when Solr tried to respond (through jetty), it found a
> closed TCP connection.
>
> Client timeouts need to either be completely removed, or set to a value
> much longer than any request will take.  Five minutes is a good starting
> value.
>
> If all your client timeouts are set to 30 seconds and you are seeing
> EofExceptions, that means that your requests are taking longer than 30
> seconds, and you likely have some performance issues.  It's also
> possible that some of your client timeouts are set a lot shorter than 30
> seconds.
>
> > My objective is to stop the errors; adding more memory to the server is
> > not a good scaling strategy. That is why I was thinking maybe there is an
> > issue with the way things are set up that needs to be revisited.
>
> You're right that adding more memory to the servers is not a good
> scaling strategy for the general case ... but in this situation, I think
> it might be prudent.  For your index and heap sizes, I would want the
> company to pay for at least 32GB of RAM.
>
> Having said that ... I've seen Solr installs work well with a LOT less
> memory than the ideal.  I don't know that adding more memory is
> necessary, unless your system (CPU, storage, and memory speeds) is
> particularly slow.  Based on your document count and index size, your
> documents are quite small, so I think your memory size is probably good
> -- if the CPU, memory bus, and storage are very fast.  If one or more of
> those subsystems aren't fast, then make up the difference with lots of
> memory.
>
> Some light reading, where you will learn why I think 32GB is an ideal
> memory size for your system:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> It is possible that your 6GB heap is not quite big enough for good
> performance, or that your GC is not well-tuned.  These topics are also
> discussed on that wiki page.  If you increase your heap size, then the
> likelihood of needing more memory in the system becomes greater, because
> there will be less memory available for the disk cache.
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju


Re: solr multicore vs sharding vs 1 big collection

2015-08-02 Thread Shawn Heisey
On 8/2/2015 8:29 AM, Jay Potharaju wrote:
> The document contains around 30 fields and has stored set to true for
> almost 15 of them. And these stored fields are queried and updated all the
> time. You will notice that the deleted documents are almost 30% of the
> docs.  And it has stayed around that percent and has not come down.
> I did try optimize but that was disruptive as it caused search errors.
> I have been playing with merge factor to see if that helps with deleted
> documents or not. It is currently set to 5.
> 
> The server has 24 GB of memory out of which memory consumption is around 23
> GB normally and the jvm is set to 6 GB. And have noticed that the available
> memory on the server goes to 100 MB at times during a day.
> All the updates are run through DIH.

Using all available memory is completely normal operation for ANY
operating system.  If you hold up Windows as an example of one that
doesn't ... it lies to you about "available" memory.  All modern
operating systems will utilize memory that is not explicitly allocated
for the OS disk cache.

The disk cache will instantly give up any of the memory it is using for
programs that request it.  Linux doesn't try to hide the disk cache from
you, but older versions of Windows do.  In the newer versions of Windows
that have the Resource Monitor, you can go there to see the actual
memory usage including the cache.

> Every day at least once I see the following error, which results in search
> errors on the front end of the site.
> 
> ERROR org.apache.solr.servlet.SolrDispatchFilter -
> null:org.eclipse.jetty.io.EofException
> 
> From what I have read these are mainly due to timeouts, and my timeout is set
> to 30 seconds and can't be set to a higher number. I was thinking that maybe
> high memory usage sometimes leads to bad performance/errors.

Although this error can be caused by timeouts, it has a specific
meaning.  It means that the client disconnected before Solr responded to
the request, so when Solr tried to respond (through jetty), it found a
closed TCP connection.

Client timeouts need to either be completely removed, or set to a value
much longer than any request will take.  Five minutes is a good starting
value.
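
If the client side is SolrJ, the knobs look roughly like this.  This is a
sketch only, against the 4.x HttpSolrServer API; the URL is a placeholder
and the values just reflect the suggestion above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Sketch: set SolrJ (4.x) client timeouts well above any expected
// query time.  URL and values here are examples only.
public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        server.setConnectionTimeout(15000);   // 15 seconds to open the connection
        server.setSoTimeout(300000);          // 5 minute socket (read) timeout
        System.out.println(server.query(new SolrQuery("*:*")).getResults().getNumFound());
    }
}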

If all your client timeouts are set to 30 seconds and you are seeing
EofExceptions, that means that your requests are taking longer than 30
seconds, and you likely have some performance issues.  It's also
possible that some of your client timeouts are set a lot shorter than 30
seconds.

> My objective is to stop the errors; adding more memory to the server is not
> a good scaling strategy. That is why I was thinking maybe there is an issue
> with the way things are set up that needs to be revisited.

You're right that adding more memory to the servers is not a good
scaling strategy for the general case ... but in this situation, I think
it might be prudent.  For your index and heap sizes, I would want the
company to pay for at least 32GB of RAM.
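
Roughly, the arithmetic behind that number: a 25GB index plus a 6GB heap is
about 31GB before the OS and anything else on the box gets any memory, so
32GB is about the smallest size that lets the whole index sit in the disk
cache alongside the heap.  The 24GB you have now leaves only about 18GB of
cache for a 25GB index.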

Having said that ... I've seen Solr installs work well with a LOT less
memory than the ideal.  I don't know that adding more memory is
necessary, unless your system (CPU, storage, and memory speeds) is
particularly slow.  Based on your document count and index size, your
documents are quite small, so I think your memory size is probably good
-- if the CPU, memory bus, and storage are very fast.  If one or more of
those subsystems aren't fast, then make up the difference with lots of
memory.

Some light reading, where you will learn why I think 32GB is an ideal
memory size for your system:

https://wiki.apache.org/solr/SolrPerformanceProblems

It is possible that your 6GB heap is not quite big enough for good
performance, or that your GC is not well-tuned.  These topics are also
discussed on that wiki page.  If you increase your heap size, then the
likelihood of needing more memory in the system becomes greater, because
there will be less memory available for the disk cache.

Thanks,
Shawn



Re: solr multicore vs sharding vs 1 big collection

2015-08-02 Thread Jay Potharaju
The document contains around 30 fields and has stored set to true for
almost 15 of them. And these stored fields are queried and updated all the
time. You will notice that the deleted documents are almost 30% of the
docs.  And it has stayed around that percent and has not come down.
I did try optimize but that was disruptive as it caused search errors.
I have been playing with merge factor to see if that helps with deleted
documents or not. It is currently set to 5.

The server has 24 GB of memory out of which memory consumption is around 23
GB normally and the jvm is set to 6 GB. And have noticed that the available
memory on the server goes to 100 MB at times during a day.
All the updates are run through DIH.

Every day at least once I see the following error, which results in search
errors on the front end of the site.

ERROR org.apache.solr.servlet.SolrDispatchFilter -
null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts, and my timeout is set
to 30 seconds and can't be set to a higher number. I was thinking that maybe
high memory usage sometimes leads to bad performance/errors.

My objective is to stop the errors; adding more memory to the server is not
a good scaling strategy. That is why I was thinking maybe there is an issue
with the way things are set up that needs to be revisited.

Thanks


On Sat, Aug 1, 2015 at 7:06 PM, Shawn Heisey  wrote:

> On 8/1/2015 6:49 PM, Jay Potharaju wrote:
> > I currently have a single collection with 40 million documents and an index
> > size of 25 GB. The collection gets updated every n minutes, and as a result
> > the number of deleted documents is constantly growing. The data in the
> > collection is an amalgamation of records from more than 1,000 customers. The
> > number of documents per customer is around 100,000 on average.
> >
> > That being said, I'm trying to get a handle on the growing number of deleted
> > documents. Because of the growing index size, both disk space and memory are
> > being used up, and I would like to reduce it to a manageable size.
> >
> > I have been thinking of splitting the data into multiple cores, one for each
> > customer. This would allow me to manage the smaller collections easily and
> > to create/update them quickly. My concern is that the number of collections
> > might become an issue. Any suggestions on how to address this problem? What
> > are my alternatives to moving to multicore collections?
> >
> > Solr: 4.9
> > Index size:25 GB
> > Max doc: 40 million
> > Doc count:29 million
> >
> > Replication:4
> >
> > 4 servers in solrcloud.
>
> Creating 1000+ collections in SolrCloud is definitely problematic.  If
> you need to choose between a lot of shards and a lot of collections, I
> would definitely go with a lot of shards.  I would also want a lot of
> servers for an index with that many pieces.
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
> I don't think it would matter how many collections or shards you have
> when it comes to how many deleted documents are in your index.  If you
> want to clean up a large number of deletes in an index, the best option
> is an optimize.  An optimize requires a large amount of disk I/O, so it
> can be extremely disruptive if the query volume is high.  It should be
> done when the query volume is at its lowest.  For the index you
> describe, a nightly or weekly optimize seems like a good option.
>
> Aside from having a lot of deleted documents in your index, what kind of
> problems are you trying to solve?
>
> Thanks,
> Shawn
>
>


-- 
Thanks
Jay Potharaju


Re: solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Shawn Heisey
On 8/1/2015 6:49 PM, Jay Potharaju wrote:
> I currently have a single collection with 40 million documents and an index
> size of 25 GB. The collection gets updated every n minutes, and as a result
> the number of deleted documents is constantly growing. The data in the
> collection is an amalgamation of records from more than 1,000 customers. The
> number of documents per customer is around 100,000 on average.
> 
> That being said, I'm trying to get a handle on the growing number of deleted
> documents. Because of the growing index size, both disk space and memory are
> being used up, and I would like to reduce it to a manageable size.
> 
> I have been thinking of splitting the data into multiple cores, one for each
> customer. This would allow me to manage the smaller collections easily and
> to create/update them quickly. My concern is that the number of collections
> might become an issue. Any suggestions on how to address this problem? What
> are my alternatives to moving to multicore collections?
> 
> Solr: 4.9
> Index size:25 GB
> Max doc: 40 million
> Doc count:29 million
> 
> Replication:4
> 
> 4 servers in solrcloud.

Creating 1000+ collections in SolrCloud is definitely problematic.  If
you need to choose between a lot of shards and a lot of collections, I
would definitely go with a lot of shards.  I would also want a lot of
servers for an index with that many pieces.

https://issues.apache.org/jira/browse/SOLR-7191

I don't think it would matter how many collections or shards you have
when it comes to how many deleted documents are in your index.  If you
want to clean up a large number of deletes in an index, the best option
is an optimize.  An optimize requires a large amount of disk I/O, so it
can be extremely disruptive if the query volume is high.  It should be
done when the query volume is at its lowest.  For the index you
describe, a nightly or weekly optimize seems like a good option.
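
If you script that nightly or weekly run with SolrJ it is essentially a
one-liner per core; a rough sketch (the URL is a placeholder, and a plain
HTTP request to /update?optimize=true from cron works just as well):

import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Sketch: trigger an optimize (force merge) from a scheduled job.
// Run it when query volume is lowest, as noted above.
public class NightlyOptimize {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder URL
        server.optimize();   // waits for the merge to finish before returning
        server.shutdown();
    }
}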

Aside from having a lot of deleted documents in your index, what kind of
problems are you trying to solve?

Thanks,
Shawn



Re: solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Erick Erickson
40 million docs isn't really very many by modern standards,
although if they're huge documents then that might be an issue.

So is this a single shard or multiple shards? If you're really facing
performance issues, simply making a new collection with more
than one shard (independent of how many replicas each has) is
probably simplest.
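
If you go that route, a single Collections API CREATE call does it.  A
rough sketch, with the collection name, shard/replica counts, and
configName as placeholders for your own values:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: create a new multi-shard collection through the Collections API.
// Name, counts, and config name below are placeholders.
public class CreateShardedCollection {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:8983/solr/admin/collections?action=CREATE"
            + "&name=mydata_v2"                // new collection name (placeholder)
            + "&numShards=4"                   // spread the index over 4 shards
            + "&replicationFactor=2"           // 2 copies of each shard
            + "&maxShardsPerNode=2"
            + "&collection.configName=myconf"; // config already uploaded to ZooKeeper
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);   // Solr returns a status payload
        }
        in.close();
    }
}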

The number of deleted documents really shouldn't be a problem.
Typically the deleted documents are purged during segment
merging that happens automatically as you add documents. I often
see 10-15% of the corpus consisting of deleted documents.

You can force this purge by doing a force merge (aka optimization), but that
is usually not recommended unless you have a strange situation where
you have lots and lots of docs that have been deleted, as measured
on the Admin UI page by the "deleted docs" entry relative to the maxDoc
number.
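
If you'd rather track that ratio from a script than from the admin UI,
SolrJ's Luke request returns the same numbers.  A rough sketch, assuming
your SolrJ version exposes the numDocs/maxDoc getters shown here (they come
from the same /admin/luke data the UI reads):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

// Sketch: report the deleted-document ratio (maxDoc - numDocs) / maxDoc
// for one core, the same figures the Admin UI shows.
public class DeletedRatio {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder
        LukeRequest luke = new LukeRequest();
        luke.setNumTerms(0);               // index-level stats only, skip per-term data
        LukeResponse rsp = luke.process(server);
        int numDocs = rsp.getNumDocs();
        int maxDoc = rsp.getMaxDoc();
        System.out.printf("numDocs=%d maxDoc=%d deleted=%.1f%%%n",
            numDocs, maxDoc, 100.0 * (maxDoc - numDocs) / maxDoc);
        server.shutdown();
    }
}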

So show us what you're seeing that's concerning. Typically, especially
on an index that's continually getting updates, it's adequate to just
let the background segment merging take care of things.

Best,
Erick

On Sat, Aug 1, 2015 at 8:49 PM, Jay Potharaju  wrote:
> Hi
>
> I currently have a single collection with 40 million documents and an index
> size of 25 GB. The collection gets updated every n minutes, and as a result
> the number of deleted documents is constantly growing. The data in the
> collection is an amalgamation of records from more than 1,000 customers. The
> number of documents per customer is around 100,000 on average.
>
> That being said, I'm trying to get a handle on the growing number of deleted
> documents. Because of the growing index size, both disk space and memory are
> being used up, and I would like to reduce it to a manageable size.
>
> I have been thinking of splitting the data into multiple cores, one for each
> customer. This would allow me to manage the smaller collections easily and
> to create/update them quickly. My concern is that the number of collections
> might become an issue. Any suggestions on how to address this problem? What
> are my alternatives to moving to multicore collections?
>
> Solr: 4.9
> Index size:25 GB
> Max doc: 40 million
> Doc count:29 million
>
> Replication:4
>
> 4 servers in solrcloud.
>
> Thanks
> Jay


solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Jay Potharaju
Hi

I currently have a single collection with 40 million documents and an index
size of 25 GB. The collection gets updated every n minutes, and as a result
the number of deleted documents is constantly growing. The data in the
collection is an amalgamation of records from more than 1,000 customers. The
number of documents per customer is around 100,000 on average.

That being said, I'm trying to get a handle on the growing number of deleted
documents. Because of the growing index size, both disk space and memory are
being used up, and I would like to reduce it to a manageable size.

I have been thinking of splitting the data into multiple cores, one for each
customer. This would allow me to manage the smaller collections easily and
to create/update them quickly. My concern is that the number of collections
might become an issue. Any suggestions on how to address this problem? What
are my alternatives to moving to multicore collections?

Solr: 4.9
Index size:25 GB
Max doc: 40 million
Doc count:29 million

Replication:4

4 servers in solrcloud.

Thanks
Jay