Re: Shard size variation

2018-05-03 Thread Erick Erickson
"We generally try not to change defaults when possible, sounds like
there will be new default settings for the segment sizes and merging
policy?"

Usually wise.

No, there won't be any change in the default settings.

What _will_ change is the behavior of a forceMerge (aka optimize) and
expungeDeletes when using TieredMergePolicy (the default) in that they
will, by default, respect maxSegmentSizeMB, which has defaulted to 5G
since forever. The fact that optimize merged down to a single segment
by default is, in one view, a bug.

The current implementation can hover around 50% deleted docs in an
index; that behavior won't change with LUCENE-7976. The percentage
could possibly be larger if you've optimized before; see the problem
statement on that JIRA.

The other behavior that'll change: if you _have_ merged down to one
segment, that very large segment will be eligible for merging in
situations where it wasn't before, so your index should hover around
50% deleted docs if it's large to begin with. See Mike's blog here:
https://www.elastic.co/blog/lucenes-handling-of-deleted-documents

LUCENE-8263 is where we're discussing adding a new parameter to TMP
(TieredMergePolicy); that won't be in LUCENE-7976.

If you absolutely _insist_ on having one large segment as you get now,
you will still be able to use the existing maxSegments option for
optimize and get your one large segment back. It's strongly recommended
that you _don't_ do that, though. Optimize will still purge all deleted
docs from the index as it does now; it will just respect the max
segment size.
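
A minimal SolrJ sketch of the two behaviors (assuming a SolrClient
instance named "client"; the three-argument optimize is the same call
used elsewhere in this thread):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;

public class OptimizeExamples {
    // After LUCENE-7976, a plain optimize respects the configured
    // maximum segment size rather than merging everything into one segment.
    public static void optimizeRespectingMaxSegmentSize(SolrClient client)
            throws SolrServerException, IOException {
        client.optimize(true, true);
    }

    // maxSegments=1 restores the old single-segment behavior.
    // Still available, but strongly discouraged.
    public static void optimizeToOneSegment(SolrClient client)
            throws SolrServerException, IOException {
        client.optimize(true, true, 1);
    }
}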

"Am I right in thinking that expungeDeletes will (in theory) be a 7.4
forwards option?"

expungeDeletes is a current option and has been around for quite a
while; see:
https://lucene.apache.org/solr/guide/7_3/uploading-data-with-index-handlers.html.
It suffers from the same problem forceMerge/optimize does, however,
since it can create very large segments. That operation will also
respect the max segment size as of LUCENE-7976.
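
A hedged SolrJ sketch of issuing a commit with expungeDeletes (the
collection name and client are assumptions for the example; the
expungeDeletes parameter is the one documented on the page above):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class ExpungeDeletesExample {
    public static void expungeDeletes(SolrClient client, String collection)
            throws SolrServerException, IOException {
        UpdateRequest req = new UpdateRequest();
        // Issue a commit and ask the update handler to expunge deleted docs.
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        req.setParam("expungeDeletes", "true");
        req.process(client, collection);
    }
}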

Best,
Erick


On Thu, May 3, 2018 at 7:02 AM, Michael Joyner  wrote:
> We generally try not to change defaults when possible, sounds like there
> will be new default settings for the segment sizes and merging policy?
>
> Am I right in thinking that expungeDeletes will (in theory) be a 7.4
> forwards option?
>
>
> On 05/02/2018 01:29 PM, Erick Erickson wrote:
>>
>> You can always increase the maximum segment size. For large indexes
>> that should reduce the number of segments. But watch your indexing
>> stats, I can't predict the consequences of bumping it to 100G for
>> instance. I'd _expect_ bursty I/O when those large segments started
>> to be created or merged.
>>
>> You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably)
>> the idea of increasing the segment sizes and/or a related JIRA that
>> allows you to tweak how aggressively solr merges segments that have
>> deleted docs.
>>
>> NOTE: that JIRA has the consequence that _by default_ the optimize
>> with no parameters respects the maximum segment size, which is a
>> change from now.
>>
>> Finally, expungeDeletes may be useful as that too will respect max
>> segment size, again after LUCENE-7976 is committed.
>>
>> Best,
>> Erick
>>
>> On Wed, May 2, 2018 at 9:22 AM, Michael Joyner  wrote:
>>>
>>> The main reason we go this route is that after a while (with default
>>> settings) we end up with hundreds of shards and performance of course
>>> drops abysmally as a result. By using a stepped optimize a) we don't run
>>> into the need for 3x+ head room, and b) the performance penalty during
>>> the optimize is less than the penalty of leaving hundreds of shards
>>> unoptimized.
>>>
>>> BTW, as we use a batched insert/update cycle [once daily] we only
>>> optimize down to 1 segment after a complete batch has been run. Though
>>> during the batch we reduce segment counts down to a max of 16 every 250K
>>> insert/updates to prevent the large-segment-count performance penalty.
>>>
>>>
>>> On 04/30/2018 07:10 PM, Erick Erickson wrote:

 There's really no good way to purge deleted documents from the index
 other than to wait until merging happens.

 Optimize/forceMerge and expungeDeletes both suffer from the problem
 that they create massive segments that then stick around for a very
 long time, see:


 https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

 Best,
 Erick

 On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner 
 wrote:
>
> Based on experience, 2x head room is not always enough,
> sometimes
> not even 3x, if you are optimizing from many segments down to 1 segment
> in a
> single go.
>
> We have however figured out a way that can work with as little as 51%
> free
> space via the following iteration cycle:
>
> public void solrOptimize() {
>   int initialMaxSegments = 256;
>   i

Re: Shard size variation

2018-05-03 Thread Michael Joyner
We generally try not to change defaults when possible, sounds like there 
will be new default settings for the segment sizes and merging policy?


Am I right in thinking that expungeDeletes will (in theory) be a 7.4 
forwards option?



On 05/02/2018 01:29 PM, Erick Erickson wrote:

You can always increase the maximum segment size. For large indexes
that should reduce the number of segments. But watch your indexing
stats, I can't predict the consequences of bumping it to 100G for
instance. I'd _expect_ bursty I/O when those large segments started
to be created or merged.

You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably)
the idea of increasing the segment sizes and/or a related JIRA that
allows you to tweak how aggressively solr merges segments that have
deleted docs.

NOTE: that JIRA has the consequence that _by default_ the optimize
with no parameters respects the maximum segment size, which is a
change from now.

Finally, expungeDeletes may be useful as that too will respect max
segment size, again after LUCENE-7976 is committed.

Best,
Erick

On Wed, May 2, 2018 at 9:22 AM, Michael Joyner  wrote:

The main reason we go this route is that after a while (with default
settings) we end up with hundreds of shards and performance of course drops
abysmally as a result. By using a stepped optimize a) we don't run into the
need for 3x+ head room, and b) the performance penalty during the optimize
is less than the penalty of leaving hundreds of shards unoptimized.

BTW, as we use a batched insert/update cycle [once daily] we only do
optimize down to 1 segment after a complete batch has been run. Though
during the batch we reduce segment counts down to a max of 16 every 250K
insert/updates to prevent the large-segment-count performance penalty.


On 04/30/2018 07:10 PM, Erick Erickson wrote:

There's really no good way to purge deleted documents from the index
other than to wait until merging happens.

Optimize/forceMerge and expungeDeletes both suffer from the problem
that they create massive segments that then stick around for a very
long time, see:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Best,
Erick

On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner 
wrote:

Based on experience, 2x head room is not always enough, sometimes
not even 3x, if you are optimizing from many segments down to 1 segment
in a
single go.

We have however figured out a way that can work with as little as 51%
free
space via the following iteration cycle:

public void solrOptimize() {
  int initialMaxSegments = 256;
  int finalMaxSegments = 1;
  if (isShowSegmentCounter()) {
  log.info("Optimizing ...");
  }
  try (SolrClient solrServerInstance = getSolrClientInstance()){
  for (int segments=initialMaxSegments;
segments>=finalMaxSegments; segments--) {
  if (isShowSegmentCounter()) {
  System.out.println("Optimizing to a max of
"+segments+"
segments.");
  }
  solrServerInstance.optimize(true, true, segments);
  }
  } catch (SolrServerException | IOException e) {
  throw new RuntimeException(e);

  }
  }


On 04/30/2018 04:23 PM, Walter Underwood wrote:

You need 2X the minimum index size in disk space anyway, so don’t worry
about keeping the indexes as small as possible. Worry about having
enough
headroom.

If your indexes are 250 GB, you need 250 GB of free space.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 30, 2018, at 1:13 PM, Antony A  wrote:

Thanks Erick/Deepak.

The cloud is running on baremetal (128 GB/24 cpu).

Is there an option to run a compact on the data files to make the size
equal on both the clouds? I am trying to find all the options before I add
the
new fields into the production cloud.

Thanks
AA

On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson

wrote:


Anthony:

You are probably seeing the results of removing deleted documents from
the shards as they're merged. Even on replicas in the same _shard_,
the size of the index on disk won't necessarily be identical. This has
to do with which segments are selected for merging, which are not
necessarily coordinated across replicas.

The test is if the number of docs on each collection is the same. If
it is, then don't worry about index sizes.

Best,
Erick

On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel 
wrote:

Could you please also give the machine details of the two clouds you
are
running?



Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Mon, Apr 30, 2018 at 9:51 PM, Antony A 

wro

Re: Shard size variation

2018-05-02 Thread Erick Erickson
You can always increase the maximum segment size. For large indexes
that should reduce the number of segments. But watch your indexing
stats; I can't predict the consequences of bumping it to 100G, for
instance. I'd _expect_ bursty I/O when those large segments started
to be created or merged.
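
In Solr this is the maxMergedSegmentMB setting on TieredMergePolicyFactory
in solrconfig.xml. For anyone embedding Lucene directly, a rough Java
sketch of the same knob (the analyzer here is just a placeholder for the
example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicyExample {
    public static IndexWriterConfig configWithLargerMaxSegments() {
        TieredMergePolicy tmp = new TieredMergePolicy();
        // The max merged segment size defaults to ~5 GB;
        // raising it yields fewer, larger segments.
        tmp.setMaxMergedSegmentMB(20 * 1024);
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergePolicy(tmp);
        return iwc;
    }
}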

You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably)
the idea of increasing the segment sizes and/or a related JIRA that
allows you to tweak how aggressively solr merges segments that have
deleted docs.

NOTE: that JIRA has the consequence that _by default_ the optimize
with no parameters respects the maximum segment size, which is a
change from now.

Finally, expungeDeletes may be useful as that too will respect max
segment size, again after LUCENE-7976 is committed.

Best,
Erick

On Wed, May 2, 2018 at 9:22 AM, Michael Joyner  wrote:
> The main reason we go this route is that after a while (with default
> settings) we end up with hundreds of shards and performance of course drops
> abysmally as a result. By using a stepped optimize a) we don't run into the
> need for 3x+ head room, and b) the performance penalty during the optimize
> is less than the penalty of leaving hundreds of shards unoptimized.
>
> BTW, as we use a batched insert/update cycle [once daily] we only do
> optimize down to 1 segment after a complete batch has been run. Though
> during the batch we reduce segment counts down to a max of 16 every 250K
> insert/updates to prevent the large-segment-count performance penalty.
>
>
> On 04/30/2018 07:10 PM, Erick Erickson wrote:
>>
>> There's really no good way to purge deleted documents from the index
>> other than to wait until merging happens.
>>
>> Optimize/forceMerge and expungeDeletes both suffer from the problem
>> that they create massive segments that then stick around for a very
>> long time, see:
>>
>> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner 
>> wrote:
>>>
>>> Based on experience, 2x head room is not always enough, sometimes
>>> not even 3x, if you are optimizing from many segments down to 1 segment
>>> in a
>>> single go.
>>>
>>> We have however figured out a way that can work with as little as 51%
>>> free
>>> space via the following iteration cycle:
>>>
>>> public void solrOptimize() {
>>>  int initialMaxSegments = 256;
>>>  int finalMaxSegments = 1;
>>>  if (isShowSegmentCounter()) {
>>>  log.info("Optimizing ...");
>>>  }
>>>  try (SolrClient solrServerInstance = getSolrClientInstance()){
>>>  for (int segments=initialMaxSegments;
>>> segments>=finalMaxSegments; segments--) {
>>>  if (isShowSegmentCounter()) {
>>>  System.out.println("Optimizing to a max of
>>> "+segments+"
>>> segments.");
>>>  }
>>>  solrServerInstance.optimize(true, true, segments);
>>>  }
>>>  } catch (SolrServerException | IOException e) {
>>>  throw new RuntimeException(e);
>>>
>>>  }
>>>  }
>>>
>>>
>>> On 04/30/2018 04:23 PM, Walter Underwood wrote:

 You need 2X the minimum index size in disk space anyway, so don’t worry
 about keeping the indexes as small as possible. Worry about having
 enough
 headroom.

 If your indexes are 250 GB, you need 250 GB of free space.

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)

> On Apr 30, 2018, at 1:13 PM, Antony A  wrote:
>
> Thanks Erick/Deepak.
>
> The cloud is running on baremetal (128 GB/24 cpu).
>
> Is there an option to run a compact on the data files to make the size
> equal on both the clouds? I am trying to find all the options before I add
> the
> new fields into the production cloud.
>
> Thanks
> AA
>
> On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson
> 
> wrote:
>
>> Anthony:
>>
>> You are probably seeing the results of removing deleted documents from
>> the shards as they're merged. Even on replicas in the same _shard_,
>> the size of the index on disk won't necessarily be identical. This has
>> to do with which segments are selected for merging, which are not
>> necessarily coordinated across replicas.
>>
>> The test is if the number of docs on each collection is the same. If
>> it is, then don't worry about index sizes.
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel 
>> wrote:
>>>
>>> Could you please also give the machine details of the two clouds you
>>> are
>>> running?
>>>
>>>
>>>
>>> Deepak
>>> "The greatness of a nation can be judged by the way its animals are
>>> treated. Please stop cruelty to Animals, become a Vegan"
>

Re: Shard size variation

2018-05-02 Thread Michael Joyner
The main reason we go this route is that after a while (with default
settings) we end up with hundreds of shards and performance of course
drops abysmally as a result. By using a stepped optimize a) we don't run
into the need for 3x+ head room, and b) the performance penalty during
the optimize is less than the penalty of leaving hundreds of shards
unoptimized.


BTW, as we use a batched insert/update cycle [once daily] we only
do optimize down to 1 segment after a complete batch has been run.
Though during the batch we reduce segment counts down to a max of 16
every 250K insert/updates to prevent the large-segment-count
performance penalty.



On 04/30/2018 07:10 PM, Erick Erickson wrote:

There's really no good way to purge deleted documents from the index
other than to wait until merging happens.

Optimize/forceMerge and expungeDeletes both suffer from the problem
that they create massive segments that then stick around for a very
long time, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Best,
Erick

On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner  wrote:

Based on experience, 2x head room is not always enough, sometimes
not even 3x, if you are optimizing from many segments down to 1 segment in a
single go.

We have however figured out a way that can work with as little as 51% free
space via the following iteration cycle:

public void solrOptimize() {
 int initialMaxSegments = 256;
 int finalMaxSegments = 1;
 if (isShowSegmentCounter()) {
 log.info("Optimizing ...");
 }
 try (SolrClient solrServerInstance = getSolrClientInstance()){
 for (int segments=initialMaxSegments;
segments>=finalMaxSegments; segments--) {
 if (isShowSegmentCounter()) {
 System.out.println("Optimizing to a max of "+segments+"
segments.");
 }
 solrServerInstance.optimize(true, true, segments);
 }
 } catch (SolrServerException | IOException e) {
 throw new RuntimeException(e);

 }
 }


On 04/30/2018 04:23 PM, Walter Underwood wrote:

You need 2X the minimum index size in disk space anyway, so don’t worry
about keeping the indexes as small as possible. Worry about having enough
headroom.

If your indexes are 250 GB, you need 250 GB of free space.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 30, 2018, at 1:13 PM, Antony A  wrote:

Thanks Erick/Deepak.

The cloud is running on baremetal (128 GB/24 cpu).

Is there an option to run a compact on the data files to make the size
equal on both the clouds? I am trying to find all the options before I add
the
new fields into the production cloud.

Thanks
AA

On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson

wrote:


Anthony:

You are probably seeing the results of removing deleted documents from
the shards as they're merged. Even on replicas in the same _shard_,
the size of the index on disk won't necessarily be identical. This has
to do with which segments are selected for merging, which are not
necessarily coordinated across replicas.

The test is if the number of docs on each collection is the same. If
it is, then don't worry about index sizes.

Best,
Erick

On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel  wrote:

Could you please also give the machine details of the two clouds you
are
running?



Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Mon, Apr 30, 2018 at 9:51 PM, Antony A 

wrote:

Hi Shawn,

The cloud is running version 6.2.1. with ClassicIndexSchemaFactory

The sum of size from admin UI on all the shards is around 265 G vs 224
G
between the two clouds.

I created the collection using "numShards" so compositeId router.

If you need more information, please let me know.

Thanks
AA

On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey 
wrote:


On 4/30/2018 9:51 AM, Antony A wrote:


I am running two separate solr clouds. I have 8 shards in each with
a
total
of 300 million documents. Both the clouds are indexing the document

from

the same source/configuration.

I am noticing there is a difference in the size of the collection

between

them. I am planning to add more shards to see if that helps solve
the
issue. Has anyone come across similar issue?


There's no information here about exactly what you are seeing, what

you

are expecting to see, and why you believe that what you are seeing is

wrong.

You did say that there is "a difference in size".  That is a very

vague

problem description.

FYI, unless a SolrCloud collection is using the implicit router, you
cannot add shards.  And if it *IS* using the imp

Re: Shard size variation

2018-04-30 Thread Shawn Heisey
On 4/30/2018 2:56 PM, Michael Joyner wrote:
> Based on experience, 2x head room is not always enough,
> sometimes not even 3x, if you are optimizing from many segments down
> to 1 segment in a single go.

In all situations a user is likely to encounter in the wild, having
enough extra disk space for all Solr indexes to triple in size
temporarily should be plenty.  A situation where that is not enough
would be pathological and extremely unlikely to occur in the wild.

An optimize operation will never require more than 2X the index size *at
the moment the optimize begins*.  If changes to the index are made while
the optimize is underway, then those changes would require additional space.
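
To put numbers on the 2X figure (borrowing the 250 GB example from
earlier in the thread): an optimize that starts on a 250 GB index may
temporarily need up to another ~250 GB, because the old segments are
only deleted after the newly merged segments have been fully written.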

There is a situation people have encountered in the wild where the total
temporary space required is 3X the **final** index size.  But this is a
situation where there is more happening than just an optimize.

Thanks,
Shawn



Re: Shard size variation

2018-04-30 Thread Antony A
Thank you all. I have around 70% free space in production. I will compute the
space needed for the additional fields.


Sent from my mobile. Please excuse any typos.

> On Apr 30, 2018, at 5:10 PM, Erick Erickson  wrote:
> 
> There's really no good way to purge deleted documents from the index
> other than to wait until merging happens.
> 
> Optimize/forceMerge and expungeDeletes both suffer from the problem
> that they create massive segments that then stick around for a very
> long time, see:
> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
> 
> Best,
> Erick
> 
>> On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner  wrote:
>> Based on experience, 2x head room is not always enough, sometimes
>> not even 3x, if you are optimizing from many segments down to 1 segment in a
>> single go.
>> 
>> We have however figured out a way that can work with as little as 51% free
>> space via the following iteration cycle:
>> 
>> public void solrOptimize() {
>>int initialMaxSegments = 256;
>>int finalMaxSegments = 1;
>>if (isShowSegmentCounter()) {
>>log.info("Optimizing ...");
>>}
>>try (SolrClient solrServerInstance = getSolrClientInstance()){
>>for (int segments=initialMaxSegments;
>> segments>=finalMaxSegments; segments--) {
>>if (isShowSegmentCounter()) {
>>System.out.println("Optimizing to a max of "+segments+"
>> segments.");
>>}
>>solrServerInstance.optimize(true, true, segments);
>>}
>>} catch (SolrServerException | IOException e) {
>>throw new RuntimeException(e);
>> 
>>}
>>}
>> 
>> 
>>> On 04/30/2018 04:23 PM, Walter Underwood wrote:
>>> 
>>> You need 2X the minimum index size in disk space anyway, so don’t worry
>>> about keeping the indexes as small as possible. Worry about having enough
>>> headroom.
>>> 
>>> If your indexes are 250 GB, you need 250 GB of free space.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
 On Apr 30, 2018, at 1:13 PM, Antony A  wrote:
 
 Thanks Erick/Deepak.
 
 The cloud is running on baremetal (128 GB/24 cpu).
 
 Is there an option to run a compact on the data files to make the size
 equal on both the clouds? I am trying to find all the options before I add
 the
 new fields into the production cloud.
 
 Thanks
 AA
 
 On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson
 
 wrote:
 
> Anthony:
> 
> You are probably seeing the results of removing deleted documents from
> the shards as they're merged. Even on replicas in the same _shard_,
> the size of the index on disk won't necessarily be identical. This has
> to do with which segments are selected for merging, which are not
> necessarily coordinated across replicas.
> 
> The test is if the number of docs on each collection is the same. If
> it is, then don't worry about index sizes.
> 
> Best,
> Erick
> 
>> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel  wrote:
>> 
>> Could you please also give the machine details of the two clouds you
>> are
>> running?
>> 
>> 
>> 
>> Deepak
>> "The greatness of a nation can be judged by the way its animals are
>> treated. Please stop cruelty to Animals, become a Vegan"
>> 
>> +91 73500 12833
>> deic...@gmail.com
>> 
>> Facebook: https://www.facebook.com/deicool
>> LinkedIn: www.linkedin.com/in/deicool
>> 
>> "Plant a Tree, Go Green"
>> 
>> Make In India : http://www.makeinindia.com/home
>> 
>> On Mon, Apr 30, 2018 at 9:51 PM, Antony A 
> 
> wrote:
>>> 
>>> Hi Shawn,
>>> 
>>> The cloud is running version 6.2.1. with ClassicIndexSchemaFactory
>>> 
>>> The sum of size from admin UI on all the shards is around 265 G vs 224
>>> G
>>> between the two clouds.
>>> 
>>> I created the collection using "numShards" so compositeId router.
>>> 
>>> If you need more information, please let me know.
>>> 
>>> Thanks
>>> AA
>>> 
>>> On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey 
>>> wrote:
>>> 
> On 4/30/2018 9:51 AM, Antony A wrote:
> 
> I am running two separate solr clouds. I have 8 shards in each with
> a
> total
> of 300 million documents. Both the clouds are indexing the document
> 
> from
> 
> the same source/configuration.
> 
> I am noticing there is a difference in the size of the collection
>>> 
>>> between
> 
> them. I am planning to add more shards to see if that helps solve
> the
> issue. Has anyone come across similar issue?
> 
 There's no information here about exactly what you are seeing, what
> 

Re: Shard size variation

2018-04-30 Thread Erick Erickson
There's really no good way to purge deleted documents from the index
other than to wait until merging happens.

Optimize/forceMerge and expungeDeletes both suffer from the problem
that they create massive segments that then stick around for a very
long time, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Best,
Erick

On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner  wrote:
> Based on experience, 2x head room is not always enough, sometimes
> not even 3x, if you are optimizing from many segments down to 1 segment in a
> single go.
>
> We have however figured out a way that can work with as little as 51% free
> space via the following iteration cycle:
>
> public void solrOptimize() {
> int initialMaxSegments = 256;
> int finalMaxSegments = 1;
> if (isShowSegmentCounter()) {
> log.info("Optimizing ...");
> }
> try (SolrClient solrServerInstance = getSolrClientInstance()){
> for (int segments=initialMaxSegments;
> segments>=finalMaxSegments; segments--) {
> if (isShowSegmentCounter()) {
> System.out.println("Optimizing to a max of "+segments+"
> segments.");
> }
> solrServerInstance.optimize(true, true, segments);
> }
> } catch (SolrServerException | IOException e) {
> throw new RuntimeException(e);
>
> }
> }
>
>
> On 04/30/2018 04:23 PM, Walter Underwood wrote:
>>
>> You need 2X the minimum index size in disk space anyway, so don’t worry
>> about keeping the indexes as small as possible. Worry about having enough
>> headroom.
>>
>> If your indexes are 250 GB, you need 250 GB of free space.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>> On Apr 30, 2018, at 1:13 PM, Antony A  wrote:
>>>
>>> Thanks Erick/Deepak.
>>>
>>> The cloud is running on baremetal (128 GB/24 cpu).
>>>
>>> Is there an option to run a compact on the data files to make the size
>>> equal on both the clouds? I am trying to find all the options before I add
>>> the
>>> new fields into the production cloud.
>>>
>>> Thanks
>>> AA
>>>
>>> On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson
>>> 
>>> wrote:
>>>
 Anthony:

 You are probably seeing the results of removing deleted documents from
 the shards as they're merged. Even on replicas in the same _shard_,
 the size of the index on disk won't necessarily be identical. This has
 to do with which segments are selected for merging, which are not
 necessarily coordinated across replicas.

 The test is if the number of docs on each collection is the same. If
 it is, then don't worry about index sizes.

 Best,
 Erick

 On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel  wrote:
>
> Could you please also give the machine details of the two clouds you
> are
> running?
>
>
>
> Deepak
> "The greatness of a nation can be judged by the way its animals are
> treated. Please stop cruelty to Animals, become a Vegan"
>
> +91 73500 12833
> deic...@gmail.com
>
> Facebook: https://www.facebook.com/deicool
> LinkedIn: www.linkedin.com/in/deicool
>
> "Plant a Tree, Go Green"
>
> Make In India : http://www.makeinindia.com/home
>
> On Mon, Apr 30, 2018 at 9:51 PM, Antony A 

 wrote:
>>
>> Hi Shawn,
>>
>> The cloud is running version 6.2.1. with ClassicIndexSchemaFactory
>>
>> The sum of size from admin UI on all the shards is around 265 G vs 224
>> G
>> between the two clouds.
>>
>> I created the collection using "numShards" so compositeId router.
>>
>> If you need more information, please let me know.
>>
>> Thanks
>> AA
>>
>> On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey 
>> wrote:
>>
>>> On 4/30/2018 9:51 AM, Antony A wrote:
>>>
 I am running two separate solr clouds. I have 8 shards in each with
 a
 total
 of 300 million documents. Both the clouds are indexing the document

 from

 the same source/configuration.

 I am noticing there is a difference in the size of the collection
>>
>> between

 them. I am planning to add more shards to see if that helps solve
 the
 issue. Has anyone come across similar issue?

>>> There's no information here about exactly what you are seeing, what

 you
>>>
>>> are expecting to see, and why you believe that what you are seeing is
>>
>> wrong.
>>>
>>> You did say that there is "a difference in size".  That is a very

 vague
>>>
>>> problem description.
>>>
>>> FYI, unless a SolrCloud collection is using the implicit router, you
>>> cannot add shards.  And if it *IS* using the implicit router,

Re: Shard size variation

2018-04-30 Thread Michael Joyner
Based on experience, 2x head room is not always enough,
sometimes not even 3x, if you are optimizing from many segments down to 
1 segment in a single go.


We have however figured out a way that can work with as little as 51% 
free space via the following iteration cycle:


public void solrOptimize() {
    // Step the optimize down from 256 max segments to 1, one step at a
    // time, so each pass only rewrites part of the index; this is how the
    // cycle can get by with ~51% free disk space (see note above).
    int initialMaxSegments = 256;
    int finalMaxSegments = 1;
    if (isShowSegmentCounter()) {
        log.info("Optimizing ...");
    }
    try (SolrClient solrServerInstance = getSolrClientInstance()) {
        for (int segments = initialMaxSegments;
                segments >= finalMaxSegments; segments--) {
            if (isShowSegmentCounter()) {
                System.out.println("Optimizing to a max of " + segments
                        + " segments.");
            }
            // optimize(waitFlush, waitSearcher, maxSegments)
            solrServerInstance.optimize(true, true, segments);
        }
    } catch (SolrServerException | IOException e) {
        throw new RuntimeException(e);
    }
}


On 04/30/2018 04:23 PM, Walter Underwood wrote:

You need 2X the minimum index size in disk space anyway, so don’t worry about 
keeping the indexes as small as possible. Worry about having enough headroom.

If your indexes are 250 GB, you need 250 GB of free space.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Apr 30, 2018, at 1:13 PM, Antony A  wrote:

Thanks Erick/Deepak.

The cloud is running on baremetal (128 GB/24 cpu).

Is there an option to run a compact on the data files to make the size
equal on both the clouds? I am trying to find all the options before I add the
new fields into the production cloud.

Thanks
AA

On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson 
wrote:


Anthony:

You are probably seeing the results of removing deleted documents from
the shards as they're merged. Even on replicas in the same _shard_,
the size of the index on disk won't necessarily be identical. This has
to do with which segments are selected for merging, which are not
necessarily coordinated across replicas.

The test is if the number of docs on each collection is the same. If
it is, then don't worry about index sizes.

Best,
Erick

On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel  wrote:

Could you please also give the machine details of the two clouds you are
running?



Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Mon, Apr 30, 2018 at 9:51 PM, Antony A 

wrote:

Hi Shawn,

The cloud is running version 6.2.1. with ClassicIndexSchemaFactory

The sum of size from admin UI on all the shards is around 265 G vs 224 G
between the two clouds.

I created the collection using "numShards" so compositeId router.

If you need more information, please let me know.

Thanks
AA

On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey 
wrote:


On 4/30/2018 9:51 AM, Antony A wrote:


I am running two separate solr clouds. I have 8 shards in each with a
total
of 300 million documents. Both the clouds are indexing the document

from

the same source/configuration.

I am noticing there is a difference in the size of the collection

between

them. I am planning to add more shards to see if that helps solve the
issue. Has anyone come across similar issue?


There's no information here about exactly what you are seeing, what

you

are expecting to see, and why you believe that what you are seeing is

wrong.

You did say that there is "a difference in size".  That is a very

vague

problem description.

FYI, unless a SolrCloud collection is using the implicit router, you
cannot add shards.  And if it *IS* using the implicit router, then you

are

100% in control of document routing -- Solr cannot influence that at

all.

Thanks,
Shawn








Re: Shard size variation

2018-04-30 Thread Walter Underwood
You need 2X the minimum index size in disk space anyway, so don’t worry about 
keeping the indexes as small as possible. Worry about having enough headroom.

If your indexes are 250 GB, you need 250 GB of free space.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 30, 2018, at 1:13 PM, Antony A  wrote:
> 
> Thanks Erick/Deepak.
> 
> The cloud is running on baremetal (128 GB/24 cpu).
> 
> Is there an option to run a compact on the data files to make the size
> equal on both the clouds? I am trying to find all the options before I add the
> new fields into the production cloud.
> 
> Thanks
> AA
> 
> On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson 
> wrote:
> 
>> Anthony:
>> 
>> You are probably seeing the results of removing deleted documents from
>> the shards as they're merged. Even on replicas in the same _shard_,
>> the size of the index on disk won't necessarily be identical. This has
>> to do with which segments are selected for merging, which are not
>> necessarily coordinated across replicas.
>> 
>> The test is if the number of docs on each collection is the same. If
>> it is, then don't worry about index sizes.
>> 
>> Best,
>> Erick
>> 
>> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel  wrote:
>>> Could you please also give the machine details of the two clouds you are
>>> running?
>>> 
>>> 
>>> 
>>> Deepak
>>> "The greatness of a nation can be judged by the way its animals are
>>> treated. Please stop cruelty to Animals, become a Vegan"
>>> 
>>> +91 73500 12833
>>> deic...@gmail.com
>>> 
>>> Facebook: https://www.facebook.com/deicool
>>> LinkedIn: www.linkedin.com/in/deicool
>>> 
>>> "Plant a Tree, Go Green"
>>> 
>>> Make In India : http://www.makeinindia.com/home
>>> 
>>> On Mon, Apr 30, 2018 at 9:51 PM, Antony A 
>> wrote:
>>> 
 Hi Shawn,
 
 The cloud is running version 6.2.1. with ClassicIndexSchemaFactory
 
 The sum of size from admin UI on all the shards is around 265 G vs 224 G
 between the two clouds.
 
 I created the collection using "numShards" so compositeId router.
 
 If you need more information, please let me know.
 
 Thanks
 AA
 
 On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey 
 wrote:
 
> On 4/30/2018 9:51 AM, Antony A wrote:
> 
>> I am running two separate solr clouds. I have 8 shards in each with a
>> total
>> of 300 million documents. Both the clouds are indexing the document
>> from
>> the same source/configuration.
>> 
>> I am noticing there is a difference in the size of the collection
 between
>> them. I am planning to add more shards to see if that helps solve the
>> issue. Has anyone come across similar issue?
>> 
> 
> There's no information here about exactly what you are seeing, what
>> you
> are expecting to see, and why you believe that what you are seeing is
 wrong.
> 
> You did say that there is "a difference in size".  That is a very
>> vague
> problem description.
> 
> FYI, unless a SolrCloud collection is using the implicit router, you
> cannot add shards.  And if it *IS* using the implicit router, then you
 are
> 100% in control of document routing -- Solr cannot influence that at
>> all.
> 
> Thanks,
> Shawn
> 
> 
 
>> 



Re: Shard size variation

2018-04-30 Thread Antony A
Thanks Erick/Deepak.

The cloud is running on baremetal (128 GB/24 cpu).

Is there an option to run a compact on the data files to make the size
equal on both the clouds? I am trying to find all the options before I add the
new fields into the production cloud.

Thanks
AA

On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson 
wrote:

> Anthony:
>
> You are probably seeing the results of removing deleted documents from
> the shards as they're merged. Even on replicas in the same _shard_,
> the size of the index on disk won't necessarily be identical. This has
> to do with which segments are selected for merging, which are not
> necessarily coordinated across replicas.
>
> The test is if the number of docs on each collection is the same. If
> it is, then don't worry about index sizes.
>
> Best,
> Erick
>
> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel  wrote:
> > Could you please also give the machine details of the two clouds you are
> > running?
> >
> >
> >
> > Deepak
> > "The greatness of a nation can be judged by the way its animals are
> > treated. Please stop cruelty to Animals, become a Vegan"
> >
> > +91 73500 12833
> > deic...@gmail.com
> >
> > Facebook: https://www.facebook.com/deicool
> > LinkedIn: www.linkedin.com/in/deicool
> >
> > "Plant a Tree, Go Green"
> >
> > Make In India : http://www.makeinindia.com/home
> >
> > On Mon, Apr 30, 2018 at 9:51 PM, Antony A 
> wrote:
> >
> >> Hi Shawn,
> >>
> >> The cloud is running version 6.2.1. with ClassicIndexSchemaFactory
> >>
> >> The sum of size from admin UI on all the shards is around 265 G vs 224 G
> >> between the two clouds.
> >>
> >> I created the collection using "numShards" so compositeId router.
> >>
> >> If you need more information, please let me know.
> >>
> >> Thanks
> >> AA
> >>
> >> On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey 
> >> wrote:
> >>
> >> > On 4/30/2018 9:51 AM, Antony A wrote:
> >> >
> >> >> I am running two separate solr clouds. I have 8 shards in each with a
> >> >> total
> >> >> of 300 million documents. Both the clouds are indexing the document
> from
> >> >> the same source/configuration.
> >> >>
> >> >> I am noticing there is a difference in the size of the collection
> >> between
> >> >> them. I am planning to add more shards to see if that helps solve the
> >> >> issue. Has anyone come across similar issue?
> >> >>
> >> >
> >> > There's no information here about exactly what you are seeing, what
> you
> >> > are expecting to see, and why you believe that what you are seeing is
> >> wrong.
> >> >
> >> > You did say that there is "a difference in size".  That is a very
> vague
> >> > problem description.
> >> >
> >> > FYI, unless a SolrCloud collection is using the implicit router, you
> >> > cannot add shards.  And if it *IS* using the implicit router, then you
> >> are
> >> > 100% in control of document routing -- Solr cannot influence that at
> all.
> >> >
> >> > Thanks,
> >> > Shawn
> >> >
> >> >
> >>
>


Re: Shard size variation

2018-04-30 Thread Erick Erickson
Anthony:

You are probably seeing the results of removing deleted documents from
the shards as they're merged. Even on replicas in the same _shard_,
the size of the index on disk won't necessarily be identical. This has
to do with which segments are selected for merging, which are not
necessarily coordinated across replicas.

The test is if the number of docs on each collection is the same. If
it is, then don't worry about index sizes.

Best,
Erick

On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel  wrote:
> Could you please also give the machine details of the two clouds you are
> running?
>
>
>
> Deepak
> "The greatness of a nation can be judged by the way its animals are
> treated. Please stop cruelty to Animals, become a Vegan"
>
> +91 73500 12833
> deic...@gmail.com
>
> Facebook: https://www.facebook.com/deicool
> LinkedIn: www.linkedin.com/in/deicool
>
> "Plant a Tree, Go Green"
>
> Make In India : http://www.makeinindia.com/home
>
> On Mon, Apr 30, 2018 at 9:51 PM, Antony A  wrote:
>
>> Hi Shawn,
>>
>> The cloud is running version 6.2.1. with ClassicIndexSchemaFactory
>>
>> The sum of size from admin UI on all the shards is around 265 G vs 224 G
>> between the two clouds.
>>
>> I created the collection using "numShards" so compositeId router.
>>
>> If you need more information, please let me know.
>>
>> Thanks
>> AA
>>
>> On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey 
>> wrote:
>>
>> > On 4/30/2018 9:51 AM, Antony A wrote:
>> >
>> >> I am running two separate solr clouds. I have 8 shards in each with a
>> >> total
>> >> of 300 million documents. Both the clouds are indexing the document from
>> >> the same source/configuration.
>> >>
>> >> I am noticing there is a difference in the size of the collection
>> between
>> >> them. I am planning to add more shards to see if that helps solve the
>> >> issue. Has anyone come across similar issue?
>> >>
>> >
>> > There's no information here about exactly what you are seeing, what you
>> > are expecting to see, and why you believe that what you are seeing is
>> wrong.
>> >
>> > You did say that there is "a difference in size".  That is a very vague
>> > problem description.
>> >
>> > FYI, unless a SolrCloud collection is using the implicit router, you
>> > cannot add shards.  And if it *IS* using the implicit router, then you
>> are
>> > 100% in control of document routing -- Solr cannot influence that at all.
>> >
>> > Thanks,
>> > Shawn
>> >
>> >
>>


Re: Shard size variation

2018-04-30 Thread Deepak Goel
Could you please also give the machine details of the two clouds you are
running?



Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Mon, Apr 30, 2018 at 9:51 PM, Antony A  wrote:

> Hi Shawn,
>
> The cloud is running version 6.2.1. with ClassicIndexSchemaFactory
>
> The sum of size from admin UI on all the shards is around 265 G vs 224 G
> between the two clouds.
>
> I created the collection using "numShards" so compositeId router.
>
> If you need more information, please let me know.
>
> Thanks
> AA
>
> On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey 
> wrote:
>
> > On 4/30/2018 9:51 AM, Antony A wrote:
> >
> >> I am running two separate solr clouds. I have 8 shards in each with a
> >> total
> >> of 300 million documents. Both the clouds are indexing the document from
> >> the same source/configuration.
> >>
> >> I am noticing there is a difference in the size of the collection
> between
> >> them. I am planning to add more shards to see if that helps solve the
> >> issue. Has anyone come across similar issue?
> >>
> >
> > There's no information here about exactly what you are seeing, what you
> > are expecting to see, and why you believe that what you are seeing is
> wrong.
> >
> > You did say that there is "a difference in size".  That is a very vague
> > problem description.
> >
> > FYI, unless a SolrCloud collection is using the implicit router, you
> > cannot add shards.  And if it *IS* using the implicit router, then you
> are
> > 100% in control of document routing -- Solr cannot influence that at all.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Shard size variation

2018-04-30 Thread Antony A
Hi Shawn,

The cloud is running version 6.2.1 with ClassicIndexSchemaFactory.

The sum of the index sizes from the admin UI across all the shards is around
265 GB vs 224 GB between the two clouds.

I created the collection using "numShards" so compositeId router.

If you need more information, please let me know.

Thanks
AA

On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey  wrote:

> On 4/30/2018 9:51 AM, Antony A wrote:
>
>> I am running two separate solr clouds. I have 8 shards in each with a
>> total
>> of 300 million documents. Both the clouds are indexing the document from
>> the same source/configuration.
>>
>> I am noticing there is a difference in the size of the collection between
>> them. I am planning to add more shards to see if that helps solve the
>> issue. Has anyone come across similar issue?
>>
>
> There's no information here about exactly what you are seeing, what you
> are expecting to see, and why you believe that what you are seeing is wrong.
>
> You did say that there is "a difference in size".  That is a very vague
> problem description.
>
> FYI, unless a SolrCloud collection is using the implicit router, you
> cannot add shards.  And if it *IS* using the implicit router, then you are
> 100% in control of document routing -- Solr cannot influence that at all.
>
> Thanks,
> Shawn
>
>


Re: Shard size variation

2018-04-30 Thread Shawn Heisey

On 4/30/2018 9:51 AM, Antony A wrote:

I am running two separate solr clouds. I have 8 shards in each with a total
of 300 million documents. Both the clouds are indexing the document from
the same source/configuration.

I am noticing there is a difference in the size of the collection between
them. I am planning to add more shards to see if that helps solve the
issue. Has anyone come across similar issue?


There's no information here about exactly what you are seeing, what you 
are expecting to see, and why you believe that what you are seeing is wrong.


You did say that there is "a difference in size".  That is a very vague 
problem description.


FYI, unless a SolrCloud collection is using the implicit router, you 
cannot add shards.  And if it *IS* using the implicit router, then you 
are 100% in control of document routing -- Solr cannot influence that at 
all.


Thanks,
Shawn