[jira] [Comment Edited] (LUCENE-8263) Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging

2018-07-16 Thread Marc Morissette (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545788#comment-16545788
 ] 

Marc Morissette edited comment on LUCENE-8263 at 7/16/18 10:02 PM:
---

{quote}I've gone back and forth on this. Now that optimize and forceMerge 
respect maxSegmentSize I've been thinking that those operations would suffice 
for those real-world edge cases.

forceMergeDeletes (expungeDeletes) has a maximum percent of deletes allowed per 
segment for instance that must be between 0 and 100. 0 is roughly equivalent to 
forceMerge/optimize at this point. And will not create any segments over 
maxSegmentSizeMB.
{quote}
I hadn't considered using forceMergeDeletes to address these edge cases but the 
more I think about it, the more I like it. Consider me convinced.

My only remaining concern with forceMergeDeletes as it is currently designed 
(and if I'm reading the code correctly) is that if enough segments somehow end 
up having a delete % above forceMergeDeletesPctAllowed, then it is possible for 
it to use a lot of disk space. Perhaps we could find a way to configure an 
upper limit on the number of merges that forceMergeDeletes can perform per 
call? When configured this way, each forceMergeDeletes could only claim a 
maximum amount of disk space before returning. Repeated calls would be 
necessary to "clean" an entire index but if each one were accompanied by a soft 
commit, then the amount of free disk space required to perform the entire 
operation would be more predictable.
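For context, here is a minimal sketch (not from the attached patch) of how forceMergeDeletes is driven against TieredMergePolicy with the existing API; the per-call merge cap suggested above does not exist yet and would be a new option:

{code:java}
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ForceMergeDeletesSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Segments with more than 10% deleted docs become eligible for rewriting.
    tmp.setForceMergeDeletesPctAllowed(10.0);
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(tmp);
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
      // Rewrites every eligible segment in one call, so transient disk usage is
      // bounded only by how many segments currently exceed the threshold.
      writer.forceMergeDeletes();
      writer.commit();
    }
  }
}
{code}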


was (Author: marc.morissette):
{quote}I've gone back and forth on this. Now that optimize and forceMerge 
respect maxSegmentSize I've been thinking that those operations would suffice 
for those real-world edge cases.

forceMergeDeletes (expungeDeletes) has a maximum percent of deletes allowed per 
segment for instance that must be between 0 and 100. 0 is roughly equivalent to 
forceMerge/optimize at this point. And will not create any segments over 
maxSegmentSizeMB.
{quote}
I hadn't considered using forceMergeDeletes to address these edge cases but the 
more I think about it, the more I like it. Consider me convinced.

My only remaining concern with forceMergeDeletes as it is currently designed 
(and if I'm reading the code correctly) is that if enough segments somehow end 
up having a delete % above forceMergeDeletesPctAllowed, then it is possible for 
it to use a lot of disk space. Perhaps we could find a way to configure an 
upper limit on the number of merges that forceMergeDeletes can perform per 
call? When configured this way, each forceMergeDeletes could only claim a 
maximum amount of disk space before returning. Repeated calls would be 
necessary to "clean" an entire index but if each one were accompanied by a soft 
commit, then the amount of free disk space required to perform the operation 
would be more predictable.

> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more 
> aggressive merging
> 
>
> Key: LUCENE-8263
> URL: https://issues.apache.org/jira/browse/LUCENE-8263
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-8263.patch
>
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large 
> indexes. This parameter will do more aggressive merging of segments with 
> deleted documents when the _total_ percentage of deleted docs in the entire 
> index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20% 
> caused the first cut at this to increase I/O roughly 10%. Setting it to 10% 
> caused about a 50% increase in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits 
> that reference this new parameter. After it's checked in we can bring this 
> back. That should be less work than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly 
> less space wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation 
> warnings about not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in 
> 7976. The first cut at  this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the 
> segment was over maxSegmentSize/2 bytes in order to be eligible for merging. 
> Empirically, using the same percentage for both caused the merging to hover 
> around the value specified for this parameter.

[jira] [Commented] (LUCENE-8263) Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging

2018-07-16 Thread Marc Morissette (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545788#comment-16545788
 ] 

Marc Morissette commented on LUCENE-8263:
-

{quote}I've gone back and forth on this. Now that optimize and forceMerge 
respect maxSegmentSize I've been thinking that those operations would suffice 
for those real-world edge cases.

forceMergeDeletes (expungeDeletes) has a maximum percent of deletes allowed per 
segment for instance that must be between 0 and 100. 0 is roughly equivalent to 
forceMerge/optimize at this point. And will not create any segments over 
maxSegmentSizeMB.
{quote}
I hadn't considered using forceMergeDeletes to address these edge cases but the 
more I think about it, the more I like it. Consider me convinced.

My only remaining concern with forceMergeDeletes as it is currently designed 
(and if I'm reading the code correctly) is that if enough segments somehow end 
up having a delete % above forceMergeDeletesPctAllowed, then it is possible for 
it to use a lot of disk space. Perhaps we could find a way to configure an 
upper limit on the number of merges that forceMergeDeletes can perform per 
call? When configured this way, each forceMergeDeletes could only claim a 
maximum amount of disk space before returning. Repeated calls would be 
necessary to "clean" an entire index but if each one were accompanied by a soft 
commit, then the amount of free disk space required to perform the operation 
would be more predictable.

> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more 
> aggressive merging
> 
>
> Key: LUCENE-8263
> URL: https://issues.apache.org/jira/browse/LUCENE-8263
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-8263.patch
>
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large 
> indexes. This parameter will do more aggressive merging of segments with 
> deleted documents when the _total_ percentage of deleted docs in the entire 
> index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20% 
> caused the first cut at this to increase I/O roughly 10%. Setting it to 10% 
> caused about a 50% increase in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits 
> that reference this new parameter. After it's checked in we can bring this 
> back. That should be less work than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly 
> less space wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation 
> warnings about not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in 
> 7976. The first cut at  this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the 
> segment was over maxSegmentSize/2 bytes in order to be eligible for merging. 
> Empirically, using the same percentage for both caused the merging to hover 
> around the value specified for this parameter.
> My proposal for <3> would be to have the parameter do double-duty. Assuming 
> my preliminary results hold, you specify this parameter at, say, 20% and once 
> the index hits that % deleted docs it hovers right around there, even if 
> you've forceMerged earlier down to 1 segment. This seems in line with what 
> I'd expect and adding another parameter seems excessively complicated to no 
> good purpose. We could always add something like that later if we wanted.






[jira] [Commented] (LUCENE-8263) Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging

2018-07-14 Thread Marc Morissette (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544392#comment-16544392
 ] 

Marc Morissette commented on LUCENE-8263:
-

{quote}the above simulations suggest around 2.1x more merging with 10% of 
allowed deletes but I wouldn't be surprised that it could be much worse in 
practice in production under certain conditions.{quote}
I understand why you would rather not give users another way to shoot 
themselves in the foot but I think you may underestimate how diverse and 
idiosyncratic some use cases can get. There are many real-world situations 
where a setting lower than 20% might be very appropriate:
 * Super large indexes that are not updated often i.e. where size is way more 
important than IO
 * Indexes where large documents are updated more often than small documents 
which skews TieredMergePolicy's estimate of delete%
 * Query-heavy update-light indexes where update IO is a tiny fraction of query 
IO

Users who will be looking to alter deletesPctAllowed will presumably be doing 
so because the default is inappropriate for their use case. I feel that 20-50% 
might be too narrow a range for some significant percentage of these use cases.

I think documenting the danger of setting too low a value and letting users do 
their own experiments is the better course of action.
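For illustration, a sketch of how such a setting is configured on TieredMergePolicy, assuming the parameter is exposed as a setter named setDeletesPctAllowed (the name it ended up with in later Lucene releases); whether values below 20% are accepted depends on where the floor discussed here lands:

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class DeletesPctAllowedSketch {
  static IndexWriterConfig newConfig() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Target percentage of deleted docs allowed index-wide before merging
    // becomes more aggressive; 20 matches the proposed default, lower values
    // trade extra I/O for a smaller index.
    tmp.setDeletesPctAllowed(20.0);
    tmp.setMaxMergedSegmentMB(5 * 1024); // keep the default 5 GB max segment size explicit
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(tmp);
    return iwc;
  }
}
{code}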

 

> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more 
> aggressive merging
> 
>
> Key: LUCENE-8263
> URL: https://issues.apache.org/jira/browse/LUCENE-8263
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-8263.patch
>
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large 
> indexes. This parameter will do more aggressive merging of segments with 
> deleted documents when the _total_ percentage of deleted docs in the entire 
> index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20% 
> caused the first cut at this to increase I/O roughly 10%. Setting it to 10% 
> caused about a 50% increase in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits 
> that reference this new parameter. After it's checked in we can bring this 
> back. That should be less work than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly 
> less space wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation 
> warnings about not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in 
> 7976. The first cut at  this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the 
> segment was over maxSegmentSize/2 bytes in order to be eligible for merging. 
> Empirically, using the same percentage for both caused the merging to hover 
> around the value specified for this parameter.
> My proposal for <3> would be to have the parameter do double-duty. Assuming 
> my preliminary results hold, you specify this parameter at, say, 20% and once 
> the index hits that % deleted docs it hovers right around there, even if 
> you've forceMerged earlier down to 1 segment. This seems in line with what 
> I'd expect and adding another parameter seems excessively complicated to no 
> good purpose. We could always add something like that later if we wanted.






[jira] [Comment Edited] (LUCENE-8263) Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging

2018-07-13 Thread Marc Morissette (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543616#comment-16543616
 ] 

Marc Morissette edited comment on LUCENE-8263 at 7/13/18 7:37 PM:
--

I would like to argue against a 20% floor.

Some indexes contain documents of wildly different sizes with the larger 
documents experiencing much higher turnover. I have seen indexes with around 
20% deletions that were more than 2x their optimized size because of this 
phenomenon.

In such situations, a deletesPctAllowed of around 10-15% would make a lot of 
sense. I say keep the floor at 10%.

Or maybe simply issue a warning instead?
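To make the size skew concrete, a hypothetical example with made-up numbers: when deletes concentrate on the larger documents, a modest deleted-doc count hides a much larger byte overhead.

{code}
maxDoc = 1,000,000 docs; 20% deleted = 200,000 deleted, 800,000 live
live docs:     800,000 x 1 KB = 0.8 GB
deleted docs:  200,000 x 6 KB = 1.2 GB (still on disk until merged away)
index on disk: 2.0 GB  vs  optimized size: 0.8 GB  (2.5x)
{code}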


was (Author: marc.morissette):
I would like to argue against a 20% floor.

Some indexes contain documents of wildly different sizes with the larger 
documents experiencing much higher turnover. I have seen indexes with around 
20% deletions that were more than 2x their optimized size because of this 
phenomenon.

In such situations, a deletesPctAllowed of around 10-15% would make a lot of 
sense. I say keep the floor at 10%.

> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more 
> aggressive merging
> 
>
> Key: LUCENE-8263
> URL: https://issues.apache.org/jira/browse/LUCENE-8263
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-8263.patch
>
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large 
> indexes. This parameter will do more aggressive merging of segments with 
> deleted documents when the _total_ percentage of deleted docs in the entire 
> index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20% 
> caused the first cut at this to increase I/O roughly 10%. Setting it to 10% 
> caused about a 50% increase in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits 
> that reference this new parameter. After it's checked in we can bring this 
> back. That should be less work than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly 
> less space wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation 
> warnings about not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in 
> 7976. The first cut at  this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the 
> segment was over maxSegmentSize/2 bytes in order to be eligible for merging. 
> Empirically, using the same percentage for both caused the merging to hover 
> around the value specified for this parameter.
> My proposal for <3> would be to have the parameter do double-duty. Assuming 
> my preliminary results hold, you specify this parameter at, say, 20% and once 
> the index hits that % deleted docs it hovers right around there, even if 
> you've forceMerged earlier down to 1 segment. This seems in line with what 
> I'd expect and adding another parameter seems excessively complicated to no 
> good purpose. We could always add something like that later if we wanted.






[jira] [Commented] (LUCENE-8263) Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging

2018-07-13 Thread Marc Morissette (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543616#comment-16543616
 ] 

Marc Morissette commented on LUCENE-8263:
-

I would like to argue against a 20% floor.

Some indexes contain documents of wildly different sizes with the larger 
documents experiencing much higher turnover. I have seen indexes with around 
20% deletions that were more than 2x their optimized size because of this 
phenomenon.

In such situations, a deletesPctAllowed of around 10-15% would make a lot of 
sense. I say keep the floor at 10%.

> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more 
> aggressive merging
> 
>
> Key: LUCENE-8263
> URL: https://issues.apache.org/jira/browse/LUCENE-8263
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-8263.patch
>
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large 
> indexes. This parameter will do more aggressive merging of segments with 
> deleted documents when the _total_ percentage of deleted docs in the entire 
> index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20% 
> caused the first cut at this to increase I/O roughly 10%. Setting it to 10% 
> caused about a 50% increase in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits 
> that reference this new parameter. After it's checked in we can bring this 
> back. That should be less work than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly 
> less space wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation 
> warnings about not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in 
> 7976. The first cut at  this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the 
> segment was over maxSegmentSize/2 bytes in order to be eligible for merging. 
> Empirically, using the same percentage for both caused the merging to hover 
> around the value specified for this parameter.
> My proposal for <3> would be to have the parameter do double-duty. Assuming 
> my preliminary results hold, you specify this parameter at, say, 20% and once 
> the index hits that % deleted docs it hovers right around there, even if 
> you've forceMerged earlier down to 1 segment. This seems in line with what 
> I'd expect and adding another parameter seems excessively complicated to no 
> good purpose. We could always add something like that later if we wanted.






[jira] [Updated] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize

2018-07-12 Thread Marc Morissette (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Morissette updated SOLR-12550:
---
Description: 
We're in a situation where we need to optimize some of our collections. These 
optimizations are done with waitSearcher=true as a simple throttling mechanism 
to prevent too many collections from being optimized at once.

We're seeing these optimize commands return without error after 10 minutes but 
well before the end of the operation. Our Solr logs show errors with 
socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
has no effect.

See the links section for my patch.

It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
commands to a private HttpSolrClient but fails to pass along its builder's 
timeouts to that client.


  was:
We're in a situation where we need to optimize some of our collections. These 
optimizations are done with waitSearcher=true as a simple throttling mechanism 
to prevent too many collections from being optimized at once.

We're seeing these optimize commands return without error after 10 minutes but 
well before the end of the operation. Our Solr logs show errors with 
socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
has no effect.

It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
commands to a private HttpSolrClient but fails to pass along its builder's 
timeouts to that client.


Environment: 
[~elyograg] I am going to assume you didn't see that a patch with a unit test 
is attached to this bug (it's in the links section; it looks like GitHub has 
stopped adding comments when a new pull request is detected).

Also, maybe I wasn't clear in my description but we don't use 
ConcurrentUpdateSolrClient in our client code. The issue is in SolrCloud itself 
where timeouts may occur in the ConcurrentUpdateSolrClient Solr uses to relay 
commit and optimize commands to its shards.

> ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
> 
>
> Key: SOLR-12550
> URL: https://issues.apache.org/jira/browse/SOLR-12550
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
> Environment: [~elyograg] I am going to assume you didn't see that a 
> patch with a unit test is attached to this bug (it's in the links section; it 
> looks like GitHub has stopped adding comments when a new pull request is detected).
> Also, maybe I wasn't clear in my description but we don't use 
> ConcurrentUpdateSolrClient in our client code. The issue is in SolrCloud 
> itself where timeouts may occur in the ConcurrentUpdateSolrClient Solr uses 
> to relay commit and optimize commands to its shards.
>Reporter: Marc Morissette
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We're in a situation where we need to optimize some of our collections. These 
> optimizations are done with waitSearcher=true as a simple throttling 
> mechanism to prevent too many collections from being optimized at once.
> We're seeing these optimize commands return without error after 10 minutes 
> but well before the end of the operation. Our Solr logs show errors with 
> socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
> has no effect.
> See the links section for my patch.
> It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
> commands to a private HttpSolrClient but fails to pass along its builder's 
> timeouts to that client.






[jira] [Updated] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize

2018-07-12 Thread Marc Morissette (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Morissette updated SOLR-12550:
---
Environment: (was: [~elyograg] I am going to assume you didn't see that 
a patch with a unit test is attached to this bug (It's in the links section. It 
looks Github has stopped adding comments when a new pull request is detected).

Also, maybe I wasn't clear in my description but we don't use 
ConcurrentUpdateSolrClient in our client code. The issue is in SolrCloud itself 
where timeouts may occur in the ConcurrentUpdateSolrClient Solr uses to relay 
commit and optimize commands to its shards.)
Description: 
We're in a situation where we need to optimize some of our collections. These 
optimizations are done with waitSearcher=true as a simple throttling mechanism 
to prevent too many collections from being optimized at once.

We're seeing these optimize commands return without error after 10 minutes but 
well before the end of the operation. Our Solr logs show errors with 
socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
has no effect.

See the links section for my patch.

It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
commands to a private HttpSolrClient but fails to pass along its builder's 
timeouts to that client.

A patch is attached in the links section.


  was:
We're in a situation where we need to optimize some of our collections. These 
optimizations are done with waitSearcher=true as a simple throttling mechanism 
to prevent too many collections from being optimized at once.

We're seeing these optimize commands return without error after 10 minutes but 
well before the end of the operation. Our Solr logs show errors with 
socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
has no effect.

See the links section for my patch.

It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
commands to a private HttpSolrClient but fails to pass along its builder's 
timeouts to that client.



[~elyograg] I am going to assume you didn't see that a patch with a unit test 
is attached to this bug (It's in the links section. It looks Github has stopped 
adding comments when a new pull request is detected).

Also, maybe I wasn't clear in my description but we don't use 
ConcurrentUpdateSolrClient in our client code. The issue is in SolrCloud itself 
where timeouts may occur in the ConcurrentUpdateSolrClient Solr uses to relay 
commit and optimize commands to its shards.

> ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
> 
>
> Key: SOLR-12550
> URL: https://issues.apache.org/jira/browse/SOLR-12550
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We're in a situation where we need to optimize some of our collections. These 
> optimizations are done with waitSearcher=true as a simple throttling 
> mechanism to prevent too many collections from being optimized at once.
> We're seeing these optimize commands return without error after 10 minutes 
> but well before the end of the operation. Our Solr logs show errors with 
> socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
> has no effect.
> See the links section for my patch.
> It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
> commands to a private HttpSolrClient but fails to pass along its builder's 
> timeouts to that client.
> A patch is attached in the links section.






[jira] [Commented] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize

2018-07-12 Thread Marc Morissette (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542261#comment-16542261
 ] 

Marc Morissette commented on SOLR-12550:


By the way, I have not investigated why the optimize() command returns without 
an error despite the fact that it did not complete normally. 

> ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize
> 
>
> Key: SOLR-12550
> URL: https://issues.apache.org/jira/browse/SOLR-12550
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We're in a situation where we need to optimize some of our collections. These 
> optimizations are done with waitSearcher=true as a simple throttling 
> mechanism to prevent too many collections from being optimized at once.
> We're seeing these optimize commands return without error after 10 minutes 
> but well before the end of the operation. Our Solr logs show errors with 
> socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
> has no effect.
> It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
> commands to a private HttpSolrClient but fails to pass along its builder's 
> timeouts to that client.






[jira] [Created] (SOLR-12550) ConcurrentUpdateSolrClient doesn't respect timeouts for commits and optimize

2018-07-12 Thread Marc Morissette (JIRA)
Marc Morissette created SOLR-12550:
--

 Summary: ConcurrentUpdateSolrClient doesn't respect timeouts for 
commits and optimize
 Key: SOLR-12550
 URL: https://issues.apache.org/jira/browse/SOLR-12550
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Marc Morissette


We're in a situation where we need to optimize some of our collections. These 
optimizations are done with waitSearcher=true as a simple throttling mechanism 
to prevent too many collections from being optimized at once.

We're seeing these optimize commands return without error after 10 minutes but 
well before the end of the operation. Our Solr logs show errors with 
socketTimeout stack traces. Setting distribUpdateSoTimeout to a higher value 
has no effect.

It turns out that ConcurrentUpdateSolrClient delegates commit and optimize 
commands to a private HttpSolrClient but fails to pass along its builder's 
timeouts to that client.
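For context, a minimal SolrJ sketch of the builder timeouts in question (URL and collection are placeholders, and it assumes the 7.x builder methods withConnectionTimeout/withSocketTimeout); per this report, the internal HttpSolrClient used to relay commit/optimize is built without them, so the socket timeout below is not honoured for a long optimize:

{code:java}
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;

public class OptimizeTimeoutSketch {
  public static void main(String[] args) throws Exception {
    try (ConcurrentUpdateSolrClient client =
        new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/my_collection")
            .withQueueSize(10)
            .withThreadCount(2)
            .withConnectionTimeout(10_000)      // ms
            .withSocketTimeout(30 * 60 * 1000)  // ms; intended to cover a long optimize
            .build()) {
      client.optimize(true, true); // waitFlush=true, waitSearcher=true
    }
  }
}
{code}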







[jira] [Commented] (LUCENE-8365) ArrayIndexOutOfBoundsException in UnifiedHighlighter

2018-06-19 Thread Marc Morissette (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517729#comment-16517729
 ] 

Marc Morissette commented on LUCENE-8365:
-

The fix is on GitHub.

> ArrayIndexOutOfBoundsException in UnifiedHighlighter
> 
>
> Key: LUCENE-8365
> URL: https://issues.apache.org/jira/browse/LUCENE-8365
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/highlighter
>Affects Versions: 7.3.1
>Reporter: Marc Morissette
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We see ArrayIndexOutOfBoundsExceptions coming out of the UnifiedHighlighter 
> in our production logs from time to time:
> {code}
> java.lang.ArrayIndexOutOfBoundsException
>   at java.base/java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.lucene.search.uhighlight.PhraseHelper$SpanCollectedOffsetsEnum.add(PhraseHelper.java:386)
>   at 
> org.apache.lucene.search.uhighlight.PhraseHelper$OffsetSpanCollector.collectLeaf(PhraseHelper.java:341)
>   at org.apache.lucene.search.spans.TermSpans.collect(TermSpans.java:121)
>   at 
> org.apache.lucene.search.spans.NearSpansOrdered.collect(NearSpansOrdered.java:149)
>   at 
> org.apache.lucene.search.spans.NearSpansUnordered.collect(NearSpansUnordered.java:171)
>   at 
> org.apache.lucene.search.spans.FilterSpans.collect(FilterSpans.java:120)
>   at 
> org.apache.lucene.search.uhighlight.PhraseHelper.createOffsetsEnumsForSpans(PhraseHelper.java:261)
> ...
> {code}
> It turns out that there is an "off by one" error in the UnifiedHighlighter's 
> code that, as far as I can tell, is only triggered when two nested 
> SpanNearQueries contain the same term.
> The resulting behaviour depends on the content of the highlighted document. 
> Either, some highlighted terms go missing or an 
> ArrayIndexOutOfBoundsException is thrown.






[jira] [Updated] (LUCENE-8365) ArrayIndexOutOfBoundsException in UnifiedHighlighter

2018-06-19 Thread Marc Morissette (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Morissette updated LUCENE-8365:

Description: 
We see ArrayIndexOutOfBoundsExceptions coming out of the UnifiedHighlighter in 
our production logs from time to time:

{code}
java.lang.ArrayIndexOutOfBoundsException
at java.base/java.lang.System.arraycopy(Native Method)
at 
org.apache.lucene.search.uhighlight.PhraseHelper$SpanCollectedOffsetsEnum.add(PhraseHelper.java:386)
at 
org.apache.lucene.search.uhighlight.PhraseHelper$OffsetSpanCollector.collectLeaf(PhraseHelper.java:341)
at org.apache.lucene.search.spans.TermSpans.collect(TermSpans.java:121)
at 
org.apache.lucene.search.spans.NearSpansOrdered.collect(NearSpansOrdered.java:149)
at 
org.apache.lucene.search.spans.NearSpansUnordered.collect(NearSpansUnordered.java:171)
at 
org.apache.lucene.search.spans.FilterSpans.collect(FilterSpans.java:120)
at 
org.apache.lucene.search.uhighlight.PhraseHelper.createOffsetsEnumsForSpans(PhraseHelper.java:261)
...
{code}

It turns out that there is an "off by one" error in the UnifiedHighlighter's 
code that, as far as I can tell, is only triggered when two nested 
SpanNearQueries contain the same term.

The resulting behaviour depends on the content of the highlighted document. 
Either, some highlighted terms go missing or an ArrayIndexOutOfBoundsException 
is thrown.
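For illustration, a sketch of the query shape described above: two nested SpanNearQueries sharing the term "quick". Field and term names are made up, and whether the bug actually trips depends on the highlighted document:

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;

public class NestedSpanHighlightSketch {
  static String[] highlight(IndexSearcher searcher, Analyzer analyzer, TopDocs hits)
      throws IOException {
    SpanQuery inner = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "quick")),
        new SpanTermQuery(new Term("body", "fox"))
    }, 1, true);
    SpanQuery outer = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "quick")), // same term as in the inner clause
        inner
    }, 3, false);
    UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher, analyzer);
    return highlighter.highlight("body", outer, hits);
  }
}
{code}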

  was:
We see an ArrayOutOfBoundsExceptions coming out of the UnifiedHighlighter in 
our production logs from time to time:

{code}
java.lang.ArrayIndexOutOfBoundsException
at java.base/java.lang.System.arraycopy(Native Method)
at 
org.apache.lucene.search.uhighlight.PhraseHelper$SpanCollectedOffsetsEnum.add(PhraseHelper.java:386)
at 
org.apache.lucene.search.uhighlight.PhraseHelper$OffsetSpanCollector.collectLeaf(PhraseHelper.java:341)
at org.apache.lucene.search.spans.TermSpans.collect(TermSpans.java:121)
at 
org.apache.lucene.search.spans.NearSpansOrdered.collect(NearSpansOrdered.java:149)
at 
org.apache.lucene.search.spans.NearSpansUnordered.collect(NearSpansUnordered.java:171)
at 
org.apache.lucene.search.spans.FilterSpans.collect(FilterSpans.java:120)
at 
org.apache.lucene.search.uhighlight.PhraseHelper.createOffsetsEnumsForSpans(PhraseHelper.java:261)
...
{code}

It turns out that there is an "off by one" error in UnifiedHighlighter code 
that, as far as I can tell, is currently only invoked when two nested 
SpanNearQueries contain the same term.

The behaviour depends on the highlighted document. In most cases, some terms 
will fail to be highlighted. In others, an Exception is thrown.


> ArrayIndexOutOfBoundsException in UnifiedHighlighter
> 
>
> Key: LUCENE-8365
> URL: https://issues.apache.org/jira/browse/LUCENE-8365
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/highlighter
>Affects Versions: 7.3.1
>Reporter: Marc Morissette
>Priority: Major
>
> We see ArrayIndexOutOfBoundsExceptions coming out of the UnifiedHighlighter 
> in our production logs from time to time:
> {code}
> java.lang.ArrayIndexOutOfBoundsException
>   at java.base/java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.lucene.search.uhighlight.PhraseHelper$SpanCollectedOffsetsEnum.add(PhraseHelper.java:386)
>   at 
> org.apache.lucene.search.uhighlight.PhraseHelper$OffsetSpanCollector.collectLeaf(PhraseHelper.java:341)
>   at org.apache.lucene.search.spans.TermSpans.collect(TermSpans.java:121)
>   at 
> org.apache.lucene.search.spans.NearSpansOrdered.collect(NearSpansOrdered.java:149)
>   at 
> org.apache.lucene.search.spans.NearSpansUnordered.collect(NearSpansUnordered.java:171)
>   at 
> org.apache.lucene.search.spans.FilterSpans.collect(FilterSpans.java:120)
>   at 
> org.apache.lucene.search.uhighlight.PhraseHelper.createOffsetsEnumsForSpans(PhraseHelper.java:261)
> ...
> {code}
> It turns out that there is an "off by one" error in the UnifiedHighlighter's 
> code that, as far as I can tell, is only triggered when two nested 
> SpanNearQueries contain the same term.
> The resulting behaviour depends on the content of the highlighted document. 
> Either, some highlighted terms go missing or an 
> ArrayIndexOutOfBoundsException is thrown.






[jira] [Created] (LUCENE-8365) ArrayIndexOutOfBoundsException in UnifiedHighlighter

2018-06-19 Thread Marc Morissette (JIRA)
Marc Morissette created LUCENE-8365:
---

 Summary: ArrayIndexOutOfBoundsException in UnifiedHighlighter
 Key: LUCENE-8365
 URL: https://issues.apache.org/jira/browse/LUCENE-8365
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 7.3.1
Reporter: Marc Morissette


We see ArrayIndexOutOfBoundsExceptions coming out of the UnifiedHighlighter in 
our production logs from time to time:

{code}
java.lang.ArrayIndexOutOfBoundsException
at java.base/java.lang.System.arraycopy(Native Method)
at 
org.apache.lucene.search.uhighlight.PhraseHelper$SpanCollectedOffsetsEnum.add(PhraseHelper.java:386)
at 
org.apache.lucene.search.uhighlight.PhraseHelper$OffsetSpanCollector.collectLeaf(PhraseHelper.java:341)
at org.apache.lucene.search.spans.TermSpans.collect(TermSpans.java:121)
at 
org.apache.lucene.search.spans.NearSpansOrdered.collect(NearSpansOrdered.java:149)
at 
org.apache.lucene.search.spans.NearSpansUnordered.collect(NearSpansUnordered.java:171)
at 
org.apache.lucene.search.spans.FilterSpans.collect(FilterSpans.java:120)
at 
org.apache.lucene.search.uhighlight.PhraseHelper.createOffsetsEnumsForSpans(PhraseHelper.java:261)
...
{code}

It turns out that there is an "off by one" error in UnifiedHighlighter code 
that, as far as I can tell, is currently only invoked when two nested 
SpanNearQueries contain the same term.

The behaviour depends on the highlighted document. In most cases, some terms 
will fail to be highlighted. In others, an Exception is thrown.






[jira] [Commented] (LUCENE-7976) Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of very large segments

2018-04-05 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427982#comment-16427982
 ] 

Marc Morissette commented on LUCENE-7976:
-

[~erickerickson] Thanks for tackling this.

Regarding singleton merges: if I read your code correctly and am right about 
how Lucene works, I think that, on a large enough collection, your patch could 
generate ~50% more reads/writes when re-indexing the whole collection:
 * I think new documents are typically flushed once and merged 2-3 times before 
ending up in a large segment.
 * With a 20% delete threshold, old documents would, on average, be singleton 
merged 4 times before being expunged vs only one merge at a 50% delete 
threshold. In Latex notation:

{code:java}
20% deleted docs threshold:
\sum_{n=1}^\infty (1 - 0.2)^n = (1 / (1 - (1 - 0.2))) - 1 = 4

50% deleted docs threshold:
\sum_{n=1}^\infty (1 - 0.5)^n = (1 / (1 - (1 - 0.5))) - 1 = 1{code}
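For reference, both sums above are instances of the standard geometric-series identity:

{code}
\sum_{n=1}^{\infty} r^n = \frac{r}{1 - r}

r = 0.8 (20% threshold): 0.8 / 0.2 = 4 rewrites per document on average
r = 0.5 (50% threshold): 0.5 / 0.5 = 1 rewrite
{code}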
On the odd chance that my math bears any resemblance to reality, I would 
suggest that you disable singleton merges when the short term deletion rate of 
a segment is above a certain threshold (say 0.5% per hour). This should prevent 
performance degradations during heavy re-indexation while maintaining the 
desired behaviour on seldom updated indexes.

> Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of 
> very large segments
> -
>
> Key: LUCENE-7976
> URL: https://issues.apache.org/jira/browse/LUCENE-7976
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-7976.patch, LUCENE-7976.patch
>
>
> We're seeing situations "in the wild" where there are very large indexes (on 
> disk) handled quite easily in a single Lucene index. This is particularly 
> true as features like docValues move data into MMapDirectory space. The 
> current TMP algorithm allows on the order of 50% deleted documents as per a 
> dev list conversation with Mike McCandless (and his blog here:  
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate, (think many 
> TB) solutions like "you need to distribute your collection over more shards" 
> become very costly. Additionally, the tempting "optimize" button exacerbates 
> the issue since once you form, say, a 100G segment (by 
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the 
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like 
>  (no, that's not a serious name, suggestions 
> welcome) which would default to 100 (or the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO 
> > MATTER HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with 
> >> smaller segments to bring the resulting segment up to 5G. If no smaller 
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize). 
> >> It would be rewritten into a single segment removing all deleted docs no 
> >> matter how big it is to start. The 100G example above would be rewritten 
> >> to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the 
> default would be the same behavior we see now. As it stands now, though, 
> there's no way to recover from an optimize/forceMerge except to re-index from 
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the 
> wild" with 10s of  shards replicated 3 or more times. And that doesn't even 
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A 
> new merge policy is certainly an alternative.






[jira] [Commented] (SOLR-11508) Make coreRootDirectory configurable via an environment variable (SOLR_CORE_HOME)

2017-12-04 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277849#comment-16277849
 ] 

Marc Morissette commented on SOLR-11508:


[~elyograg] This is an interesting idea but I'm not sure how this solves the 
problem. It would be nice if Solr could start without solr.xml but it would 
condemn cloud mode users to choose between sticking to the default settings or 
mixing their configuration and data. 

It's either that or we would need to externalize every configuration parameter 
available in solr.xml (and there are a lot).

> Make coreRootDirectory configurable via an environment variable 
> (SOLR_CORE_HOME)
> 
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> (Heavily edited)
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker where data must be stored in a directory which is independent from 
> the rest of the container.
> While this works well in standalone mode, it doesn't in Cloud mode as the 
> core.properties automatically created by Solr are still stored in 
> coreRootDirectory and cores created that way disappear when the Solr Docker 
> container is redeployed.
> The solution is to configure coreRootDirectory to an empty directory that can 
> be mounted outside the Docker container.
> The incoming patch makes this easier to do by allowing coreRootDirectory to 
> be configured via a solr.core.home system property and SOLR_CORE_HOME 
> environment variable.






[jira] [Commented] (SOLR-11508) Make coreRootDirectory configurable via an environment variable (SOLR_CORE_HOME)

2017-12-04 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277362#comment-16277362
 ] 

Marc Morissette commented on SOLR-11508:


I've started work on a patch that adds the ability to set coreRootDirectory via 
an environment variable and command line option: 
https://github.com/morissm/lucene-solr/commit/95cbd1410fb4bdf97fd9ffec8737117a7931054d

I'm starting to have second thoughts though. Solr already has a steep learning 
curve and I'm loath to add yet another option if there is a way to avoid it.

What if core.properties files were stored in SOLR_DATA_HOME only when Solr is 
in cloud mode? Unless I'm mistaken, all configuration is stored in Zookeeper in 
cloud mode so that is the only file that matters. As I've argued earlier, 
core.properties files in cloud mode are mostly an implementation detail and 
belong with the data. 

The only issue would be how to handle the transition for people who have set 
SOLR_DATA_HOME in cloud mode pre 7.2. I've thought of many automated ways to 
handle the transition but this might not be easy to accomplish without 
introducing some potential unintended behaviours.

Comments?

> Make coreRootDirectory configurable via an environment variable 
> (SOLR_CORE_HOME)
> 
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> (Heavily edited)
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker where data must be stored in a directory which is independent from 
> the rest of the container.
> While this works well in standalone mode, it doesn't in Cloud mode as the 
> core.properties automatically created by Solr are still stored in 
> coreRootDirectory and cores created that way disappear when the Solr Docker 
> container is redeployed.
> The solution is to configure coreRootDirectory to an empty directory that can 
> be mounted outside the Docker container.
> The incoming patch makes this easier to do by allowing coreRootDirectory to 
> be configured via a solr.core.home system property and SOLR_CORE_HOME 
> environment variable.






[jira] [Updated] (SOLR-11508) Make coreRootDirectory configurable via an environment variable (SOLR_CORE_HOME)

2017-12-03 Thread Marc Morissette (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Morissette updated SOLR-11508:
---
Description: 
(Heavily edited)

Since Solr 7, it is possible to store Solr cores in separate disk locations 
using solr.data.home (see SOLR-6671). This is very useful when running Solr in 
Docker where data must be stored in a directory which is independent from the 
rest of the container.

While this works well in standalone mode, it doesn't in Cloud mode as the 
core.properties automatically created by Solr are still stored in 
coreRootDirectory and cores created that way disappear when the Solr Docker 
container is redeployed.

The solution is to configure coreRootDirectory to an empty directory that can 
be mounted outside the Docker container.

The incoming patch makes this easier to do by allowing coreRootDirectory to be 
configured via a solr.core.home system property and SOLR_CORE_HOME environment 
variable.

  was:
Since Solr 7, it is possible to store Solr cores in separate disk locations 
using solr.data.home (see SOLR-6671). This is very useful where running Solr in 
Docker where data must be stored in a directory which is independent from the 
rest of the container.

Unfortunately, while core data is stored in 
{{$\{solr.data.home}/$\{core.name}/index/...}}, core.properties is stored in 
{{$\{solr.solr.home}/$\{core.name}/core.properties}}.

Reading SOLR-6671 comments, I think this was the expected behaviour but I don't 
think it is the correct one.

In addition to being inelegant and counterintuitive, this has the drawback of 
stripping a core of its metadata and breaking core discovery when a Solr 
installation is redeployed, whether in Docker or not.

core.properties is mostly metadata and although it contains some configuration, 
this configuration is specific to the core it accompanies. I believe it should 
be stored in solr.data.home, with the rest of the data it describes.

Summary: Make coreRootDirectory configurable via an environment 
variable (SOLR_CORE_HOME)  (was: core.properties should be stored 
$solr.data.home/$core.name)

> Make coreRootDirectory configurable via an environment variable 
> (SOLR_CORE_HOME)
> 
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> (Heavily edited)
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker where data must be stored in a directory which is independent from 
> the rest of the container.
> While this works well in standalone mode, it doesn't in Cloud mode as the 
> core.properties automatically created by Solr are still stored in 
> coreRootDirectory and cores created that way disappear when the Solr Docker 
> container is redeployed.
> The solution is to configure coreRootDirectory to an empty directory that can 
> be mounted outside the Docker container.
> The incoming patch makes this easier to do by allowing coreRootDirectory to 
> be configured via a solr.core.home system property and SOLR_CORE_HOME 
> environment variable.






[jira] [Commented] (SOLR-11508) core.properties should be stored $solr.data.home/$core.name

2017-12-01 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16274784#comment-16274784
 ] 

Marc Morissette commented on SOLR-11508:


[~elyograg], unfortunately what you propose is not really compatible with 
Docker. In Docker, configuration remains part of the image and users customize 
that configuration by either extending base images, mapping configuration files 
during deployment or configuring environment variables. Data must go in a 
separate directory, ideally one that can be empty without adverse effects. 
SOLR_HOME is thus not a good solution because it contains configsets and 
solr.xml.

SOLR_DATA_HOME is a good solution for people who use Solr in standalone mode 
and I will readily admit my patch addresses this use case poorly. I did not 
completely understand this variable's purpose at first and thought it was 
somehow "wrong" but it's not. I'm not arguing any change to it anymore.

In Cloud mode however, we deal with collections. Cores are more of an 
implementation detail. In Cloud Mode, I'd argue individual core.properties are 
closer to segment descriptors in their purpose which is why it makes more sense 
to keep them with the rest of the data. This is why I believe coreRootDirectory 
is the best way to separate configuration from data in Cloud mode.

To summarize, after reading everyone's viewpoint, I believe all 3 configuration 
variables are necessary as they address different use cases. [~dsmiley] and I 
are simply arguing for an easier way to configure coreRootDirectory. If no one 
sees an objection to that, I'll change the description of this bug as it's 
getting pretty stale and I'll find some time to work on a new patch to address 
that.

> core.properties should be stored $solr.data.home/$core.name
> ---
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker where data must be stored in a directory which is independent from 
> the rest of the container.
> Unfortunately, while core data is stored in 
> {{$\{solr.data.home}/$\{core.name}/index/...}}, core.properties is stored in 
> {{$\{solr.solr.home}/$\{core.name}/core.properties}}.
> Reading SOLR-6671 comments, I think this was the expected behaviour but I 
> don't think it is the correct one.
> In addition to being inelegant and counterintuitive, this has the drawback of 
> stripping a core of its metadata and breaking core discovery when a Solr 
> installation is redeployed, whether in Docker or not.
> core.properties is mostly metadata and although it contains some 
> configuration, this configuration is specific to the core it accompanies. I 
> believe it should be stored in solr.data.home, with the rest of the data it 
> describes.






[jira] [Commented] (SOLR-11508) core.properties should be stored $solr.data.home/$core.name

2017-11-30 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273899#comment-16273899
 ] 

Marc Morissette commented on SOLR-11508:


[~dsmiley] I was thinking the same thing. 

What should the environment variable be called? 
* SOLR_CORE_HOME fits well with SOLR_HOME and SOLR_DATA_HOME
* SOLR_CORE_ROOT_DIRECTORY is most similar to coreRootDirectory.

I think I like SOLR_CORE_HOME a little bit better.

What should the behaviour be if coreRootDirectory is already defined in 
solr.xml? Should the environment variable override solr.xml or vice-versa? I 
guess environment variables/command line parameters usually override 
configuration files?

> core.properties should be stored $solr.data.home/$core.name
> ---
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker, where data must be stored in a directory independent of the rest 
> of the container.
> Unfortunately, while core data is stored in 
> {{$\{solr.data.home}/$\{core.name}/index/...}}, core.properties is stored in 
> {{$\{solr.solr.home}/$\{core.name}/core.properties}}.
> Reading SOLR-6671 comments, I think this was the expected behaviour but I 
> don't think it is the correct one.
> In addition to being inelegant and counterintuitive, this has the drawback of 
> stripping a core of its metadata and breaking core discovery when a Solr 
> installation is redeployed, whether in Docker or not.
> core.properties is mostly metadata and although it contains some 
> configuration, this configuration is specific to the core it accompanies. I 
> believe it should be stored in solr.data.home, with the rest of the data it 
> describes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-11508) core.properties should be stored $solr.data.home/$core.name

2017-11-30 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273577#comment-16273577
 ] 

Marc Morissette edited comment on SOLR-11508 at 11/30/17 10:47 PM:
---

As to Erick's question, I believe:

* solr.solr.home contains the server-wide config i.e. solr.xml and the 
configsets.
* coreRootDirectory is where core discovery happens. It contains the 
core.properties files and conf directories. Defaults to solr.solr.home.
* solr.data.home is where core data is stored. It's a directory structure that 
is completely parallel to the one that contains the core.properties (see Core 
Discovery documentation). Defaults to coreRootDirectory.
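
As a concrete illustration of those defaults, a core named mycore (a made-up 
name) ends up laid out roughly like this, following the paths described in 
this issue:

{code}
${solr.solr.home}/solr.xml                   # server-wide config
${solr.solr.home}/configsets/...             # configsets
${coreRootDirectory}/mycore/core.properties  # core discovery + metadata
${coreRootDirectory}/mycore/conf/...         # per-core config
${solr.data.home}/mycore/index/...           # core data
{code}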

The issue here is that the doc says:

{quote}
  -t   Sets the solr.data.home system property, where Solr will store data (index).
       If not set, Solr uses solr.solr.home for config and data.
{quote}
 
The doc suggests that core config will be stored in the directory indicated by 
-t. That is currently not the case, but I think it should be.

coreRootDirectory has been there for a long time because it makes sense for 
people to want to store their cores away from their server configuration (1). 
solr.data.home addresses what I think might be a less popular requirement: to 
store core config away from core data (2).

The problem is that since 7.0, the command-line options and defaults make it 
quite easy to think you're addressing need (1) when, in reality, you're 
configuring for need (2).


was (Author: marc.morissette):
As to Erick's question, I believe:

* solr.solr.home contains the server config i.e. solr.xml and the configsets
* coreRootDirectory is where core discovery happens. It contains the 
core.properties files and conf directories. Defaults to solr.solr.home.
* solr.data.home is where the core data is stored. It's a directory structure 
that is parallel to the one that contains the core.properties. Defaults to 
coreRootDirectory.

> core.properties should be stored $solr.data.home/$core.name
> ---
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker, where data must be stored in a directory independent of the rest 
> of the container.
> Unfortunately, while core data is stored in 
> {{$\{solr.data.home}/$\{core.name}/index/...}}, core.properties is stored in 
> {{$\{solr.solr.home}/$\{core.name}/core.properties}}.
> Reading SOLR-6671 comments, I think this was the expected behaviour but I 
> don't think it is the correct one.
> In addition to being inelegant and counterintuitive, this has the drawback of 
> stripping a core of its metadata and breaking core discovery when a Solr 
> installation is redeployed, whether in Docker or not.
> core.properties is mostly metadata and although it contains some 
> configuration, this configuration is specific to the core it accompanies. I 
> believe it should be stored in solr.data.home, with the rest of the data it 
> describes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11508) core.properties should be stored $solr.data.home/$core.name

2017-11-30 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273577#comment-16273577
 ] 

Marc Morissette commented on SOLR-11508:


As to Erick's question, I believe:

* solr.solr.home contains the server config i.e. solr.xml and the configsets
* coreRootDirectory is where core discovery happens. It contains the 
core.properties files and conf directories. Defaults to solr.solr.home.
* solr.data.home is where the core data is stored. It's a directory structure 
that is parallel to the one that contains the core.properties. Defaults to 
coreRootDirectory.

> core.properties should be stored $solr.data.home/$core.name
> ---
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker, where data must be stored in a directory independent of the rest 
> of the container.
> Unfortunately, while core data is stored in 
> {{$\{solr.data.home}/$\{core.name}/index/...}}, core.properties is stored in 
> {{$\{solr.solr.home}/$\{core.name}/core.properties}}.
> Reading SOLR-6671 comments, I think this was the expected behaviour but I 
> don't think it is the correct one.
> In addition to being inelegant and counterintuitive, this has the drawback of 
> stripping a core of its metadata and breaking core discovery when a Solr 
> installation is redeployed, whether in Docker or not.
> core.properties is mostly metadata and although it contains some 
> configuration, this configuration is specific to the core it accompanies. I 
> believe it should be stored in solr.data.home, with the rest of the data it 
> describes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11508) core.properties should be stored $solr.data.home/$core.name

2017-11-30 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273556#comment-16273556
 ] 

Marc Morissette commented on SOLR-11508:


I think there might be a way to minimize problems with existing Solr 
installations.

Instead of changing coreRootDirectory's default behaviour, the vanilla solr.xml 
could be modified to default coreRootDirectory to $\{solr.data.home:} (see the 
sketch below).
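
Concretely, the sketch I have in mind for the shipped solr.xml is simply:

{code}
<solr>
  <!-- Proposed default: fall back to solr.data.home when coreRootDirectory is not set explicitly -->
  <str name="coreRootDirectory">${solr.data.home:}</str>
</solr>
{code}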

Users with existing installations that have used the service installation 
scripts would typically remain on the old solr.xml. I'd venture that the subset 
of users who define SOLR_DATA_HOME and use the default SOLR_HOME and default 
solr.xml is probably quite small.

> core.properties should be stored $solr.data.home/$core.name
> ---
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker, where data must be stored in a directory independent of the rest 
> of the container.
> Unfortunately, while core data is stored in 
> {{$\{solr.data.home}/$\{core.name}/index/...}}, core.properties is stored in 
> {{$\{solr.solr.home}/$\{core.name}/core.properties}}.
> Reading SOLR-6671 comments, I think this was the expected behaviour but I 
> don't think it is the correct one.
> In addition to being inelegant and counterintuitive, this has the drawback of 
> stripping a core of its metadata and breaking core discovery when a Solr 
> installation is redeployed, whether in Docker or not.
> core.properties is mostly metadata and although it contains some 
> configuration, this configuration is specific to the core it accompanies. I 
> believe it should be stored in solr.data.home, with the rest of the data it 
> describes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11508) core.properties should be stored $solr.data.home/$core.name

2017-11-30 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273530#comment-16273530
 ] 

Marc Morissette commented on SOLR-11508:


I've created this bug because a lot of documentation (including the 
command-line help) indicates that SOLR_DATA_HOME is how you store your data 
outside the installation. It's true but quite misleading because a lot of what 
is needed to load that data remains in coreRootDirectory.

core.properties and the conf directory are not just config but metadata. If you 
delete a core's directory, you would expect the metadata to follow. If you 
download a new version of Solr and point it to your solr.data.home, you would 
expect Solr to load your cores without breaking a sweat. Cores are databases 
and their individual configuration should live with them, not with the server 
(except for configsets).

Now, I understand why this makes less sense to veterans who have known Solr for 
a long time, but please understand how unintuitive this feels to SolrCloud 
users and to less experienced users.

My patch does not add or remove any feature. You can still configure different 
values for SOLR_DATA_HOME and coreRootDirectory. I've simply changed the 
defaults to something I consider more intuitive (God knows Solr could use a 
little more of that). 

Yes, changing the default could break some installations (those that have 
defined SOLR_DATA_HOME but not coreRootDirectory) but that is why I've added 
the release note. I feel this is acceptable as long as it makes Solr easier to 
use. Believe me, I'm not the first one to be tripped up by this issue.


> core.properties should be stored $solr.data.home/$core.name
> ---
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker, where data must be stored in a directory independent of the rest 
> of the container.
> Unfortunately, while core data is stored in 
> {{$\{solr.data.home}/$\{core.name}/index/...}}, core.properties is stored in 
> {{$\{solr.solr.home}/$\{core.name}/core.properties}}.
> Reading SOLR-6671 comments, I think this was the expected behaviour but I 
> don't think it is the correct one.
> In addition to being inelegant and counterintuitive, this has the drawback of 
> stripping a core of its metadata and breaking core discovery when a Solr 
> installation is redeployed, whether in Docker or not.
> core.properties is mostly metadata and although it contains some 
> configuration, this configuration is specific to the core it accompanies. I 
> believe it should be stored in solr.data.home, with the rest of the data it 
> describes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-11508) core.properties should be stored $solr.data.home/$core.name

2017-10-18 Thread Marc Morissette (JIRA)
Marc Morissette created SOLR-11508:
--

 Summary: core.properties should be stored 
$solr.data.home/$core.name
 Key: SOLR-11508
 URL: https://issues.apache.org/jira/browse/SOLR-11508
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Marc Morissette


Since Solr 7, it is possible to store Solr cores in separate disk locations 
using solr.data.home (see SOLR-6671). This is very useful when running Solr in 
Docker, where data must be stored in a directory independent of the rest of 
the container.

Unfortunately, while core data is stored in 
{{$\{solr.data.home}/$\{core.name}/index/...}}, core.properties is stored in 
{{$\{solr.solr.home}/$\{core.name}/core.properties}}.

Reading SOLR-6671 comments, I think this was the expected behaviour but I don't 
think it is the correct one.

In addition to being inelegant and counterintuitive, this has the drawback of 
stripping a core of its metadata and breaking core discovery when a Solr 
installation is redeployed, whether in Docker or not.

core.properties is mostly metadata and although it contains some configuration, 
this configuration is specific to the core it accompanies. I believe it should 
be stored in solr.data.home, with the rest of the data it describes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11508) core.properties should be stored $solr.data.home/$core.name

2017-10-18 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209794#comment-16209794
 ] 

Marc Morissette commented on SOLR-11508:


Are there any objection before I begin work on a patch?

> core.properties should be stored $solr.data.home/$core.name
> ---
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Marc Morissette
>
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker, where data must be stored in a directory independent of the rest 
> of the container.
> Unfortunately, while core data is stored in 
> {{$\{solr.data.home}/$\{core.name}/index/...}}, core.properties is stored in 
> {{$\{solr.solr.home}/$\{core.name}/core.properties}}.
> Reading SOLR-6671 comments, I think this was the expected behaviour but I 
> don't think it is the correct one.
> In addition to being inelegant and counterintuitive, this has the drawback of 
> stripping a core of its metadata and breaking core discovery when a Solr 
> installation is redeployed, whether in Docker or not.
> core.properties is mostly metadata and although it contains some 
> configuration, this configuration is specific to the core it accompanies. I 
> believe it should be stored in solr.data.home, with the rest of the data it 
> describes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Deleted] (SOLR-11399) UnifiedHighlighter ignores hl.fragsize value if hl.bs.type=SEPARATOR

2017-09-25 Thread Marc Morissette (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Morissette updated SOLR-11399:
---
Comment: was deleted

(was: I've created a pull request that fixes this issue: 
https://github.com/apache/lucene-solr/pull/253)

> UnifiedHighlighter ignores hl.fragsize value if hl.bs.type=SEPARATOR
> 
>
> Key: SOLR-11399
> URL: https://issues.apache.org/jira/browse/SOLR-11399
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Marc Morissette
>
> The UnifiedHighlighter always acts as if hl.fragsize=-1 when 
> hl.bs.type=SEPARATOR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11399) UnifiedHighlighter ignores hl.fragsize value if hl.bs.type=SEPARATOR

2017-09-25 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179869#comment-16179869
 ] 

Marc Morissette commented on SOLR-11399:


I've created a pull request that fixes this issue: 
https://github.com/apache/lucene-solr/pull/253

> UnifiedHighlighter ignores hl.fragsize value if hl.bs.type=SEPARATOR
> 
>
> Key: SOLR-11399
> URL: https://issues.apache.org/jira/browse/SOLR-11399
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Marc Morissette
>
> The UnifiedHighlighter always acts as if hl.fragsize=-1 when 
> hl.bs.type=SEPARATOR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-11399) UnifiedHighlighter ignores hl.fragsize value if hl.bs.type=SEPARATOR

2017-09-25 Thread Marc Morissette (JIRA)
Marc Morissette created SOLR-11399:
--

 Summary: UnifiedHighlighter ignores hl.fragsize value if 
hl.bs.type=SEPARATOR
 Key: SOLR-11399
 URL: https://issues.apache.org/jira/browse/SOLR-11399
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: highlighter
Reporter: Marc Morissette


The UnifiedHighlighter always acts as if hl.fragsize=-1 when 
hl.bs.type=SEPARATOR.
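
For anyone trying to reproduce, a request along these lines shows the problem 
(the field name, separator and fragsize are arbitrary examples):

{code}
q=text:lucene
hl=true
hl.method=unified
hl.fl=text
hl.bs.type=SEPARATOR
hl.bs.separator=|
hl.fragsize=100
# observed: fragments are cut only at the separator, as if hl.fragsize=-1 had been passed
{code}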



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10059) In SolrCloud, every fq added via <appends> is computed twice.

2017-03-07 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900371#comment-15900371
 ] 

Marc Morissette commented on SOLR-10059:


[~hossman] It might be that existing parameters are not descriptive enough to 
handle every use case. We could add a new parameter to CommonParams: 
"handler.chain" or "distrib.call.stack" or something similar. It would be a 
comma-delimited list of all the handlers that were involved in a distributed 
operation and that have forwarded their parameters to the current 
RequestHandler. A handler would be identified by its collection or core name 
followed by the request handler, e.g. 
distrib.call.stack=MyCollection/MyHandler,MyCollection2/MyHandler2,... 
RequestHandlerBase could use this parameter to determine whether defaults, 
appends and initParams were already applied by the same handler up the chain 
(see the sketch below).

It would not handle the case of appends in initParams that apply to different 
handlers in the same call chain but I would assume this rarely occurs in 
practice.

I'd rather not add more parameters to Solr given how messy the current 
parameter namespace already is but I don't see a better solution. What do you 
think? 
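
To make the idea concrete, here is a rough sketch of the check I imagine; the 
parameter name and the helper are hypothetical, not existing Solr API, and only 
the comma-delimited format described above is assumed:

{code}
import java.util.Arrays;

// Rough sketch of the proposed check. "distrib.call.stack" and this helper are
// hypothetical; they illustrate the comma-delimited format described above.
public class DistribCallStackSketch {

  /** True if collection/handler already appears in the comma-delimited call stack. */
  static boolean alreadyApplied(String callStack, String collection, String handler) {
    if (callStack == null || callStack.isEmpty()) {
      return false;
    }
    String self = collection + "/" + handler;
    return Arrays.asList(callStack.split(",")).contains(self);
  }

  public static void main(String[] args) {
    String stack = "MyCollection/MyHandler,MyCollection2/MyHandler2";
    System.out.println(alreadyApplied(stack, "MyCollection", "MyHandler"));   // true -> skip appends
    System.out.println(alreadyApplied(stack, "MyCollection", "MyHandler3"));  // false -> apply them
  }
}
{code}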

> In SolrCloud, every fq added via <appends> is computed twice.
> 
>
> Key: SOLR-10059
> URL: https://issues.apache.org/jira/browse/SOLR-10059
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 6.4.0
>Reporter: Marc Morissette
>  Labels: performance
>
> While researching another issue, I noticed that parameters appended to a 
> query via SearchHandler's <appends> are added to the query twice 
> in SolrCloud: once on the aggregator and again on the shard.
> The FacetComponent corrects this automatically by removing duplicates. Field 
> queries added in this fashion are however computed twice and that hinders 
> performance on filter queries that aren't simple bitsets such as those 
> produced by the CollapsingQueryParser.
> To reproduce the issue, simply test this handler on a large enough 
> collection, then replace "appends" with "defaults". You'll notice significant 
> performance improvements.
> {code}
> {!collapse field=routingKey hint=top_fc}
> {code}
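
For reference, a handler of the sort described above would look roughly like 
this; only the fq value appears in the report, so the handler name and 
surrounding elements are assumptions:

{code}
<!-- Sketch only: the handler name is an assumption -->
<requestHandler name="/collapsing-select" class="solr.SearchHandler">
  <lst name="appends">
    <str name="fq">{!collapse field=routingKey hint=top_fc}</str>
  </lst>
</requestHandler>
{code}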



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] (SOLR-10059) In SolrCloud, every fq added via <appends> is computed twice.

2017-01-30 Thread Marc Morissette (JIRA)

Marc Morissette updated SOLR-10059 with a revised description:

While researching another issue, I noticed that parameters appended to a query 
via SearchHandler's <appends> are added to the query twice in SolrCloud: once 
on the aggregator and again on the shard. The FacetComponent corrects this 
automatically by removing duplicates. Field queries added in this fashion are 
however computed twice and that hinders performance on filter queries that 
aren't simple bitsets such as those produced by the CollapsingQueryParser. To 
reproduce the issue, simply test this handler on a large enough collection, 
then replace "appends" with "defaults". You'll notice significant performance 
improvements.

{code}
{!collapse field=routingKey hint=top_fc}
{code}

This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)



[jira] (SOLR-10059) In SolrCloud, every fq added via <appends> is computed twice.

2017-01-30 Thread Marc Morissette (JIRA)

Marc Morissette commented on SOLR-10059:


I am willing to work on a patch but I'd like some guidance. I see two ways to 
solve this:

* Eliminate duplicate filter queries. Other parameters might however suffer 
from the same duplication issue so it seems like too narrow a solution.
* Disable RequestHandler "appends" when ShardParams.IS_SHARD is true. This 
seems like the better solution since the appended parameters should already 
have been added to the query by the aggregating node. I don't know if there 
are some corner cases that I haven't considered though.

I'd appreciate some feedback.

This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)



[jira] (SOLR-10059) In SolrCloud, every fq added via <appends> is computed twice.

2017-01-30 Thread Marc Morissette (JIRA)
Marc Morissette created SOLR-10059:
--

 Summary: In SolrCloud, every fq added via <appends> is computed twice.
 Key: SOLR-10059
 URL: https://issues.apache.org/jira/browse/SOLR-10059
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
  Affects Versions: 6.4.0
  Created: 31/Jan/17 04:30
  Labels: performance
  Priority: Major
Reporter: Marc Morissette


While researching another issue, I noticed that parameters appended to a query 
via SearchHandler's <appends> are added to the query twice in SolrCloud: once 
on the aggregator and again on the shard.

The FacetComponent corrects this automatically by removing duplicates. Field 
queries added in this fashion are however computed twice.

This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] [Commented] (LUCENE-7431) Allow negative pre/post values in SpanNotQuery

2016-11-08 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648417#comment-15648417
 ] 

Marc Morissette commented on LUCENE-7431:
-

Thanks David!

> Allow negative pre/post values in SpanNotQuery
> --
>
> Key: LUCENE-7431
> URL: https://issues.apache.org/jira/browse/LUCENE-7431
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Marc Morissette
>Assignee: David Smiley
>Priority: Minor
> Fix For: 6.4
>
> Attachments: LUCENE-7431.patch
>
>
> I need to be able to specify a certain range of allowed overlap between the 
> include and exclude parameters of SpanNotQuery.
> Since this behaviour is the inverse of the behaviour implemented by the pre 
> and post constructor arguments, I suggest that this be implemented with 
> negative pre and post values.
> Patch incoming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7431) Allow negative pre/post values in SpanNotQuery

2016-10-06 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552928#comment-15552928
 ] 

Marc Morissette edited comment on LUCENE-7431 at 10/6/16 7:30 PM:
--

Can I get a review of this patch please? It's rather small and includes tests.


was (Author: marc.morissette):
Can I get a review of this patch please? It's rather small and code complete.

> Allow negative pre/post values in SpanNotQuery
> --
>
> Key: LUCENE-7431
> URL: https://issues.apache.org/jira/browse/LUCENE-7431
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Marc Morissette
>Priority: Minor
> Attachments: LUCENE-7431.patch
>
>
> I need to be able to specify a certain range of allowed overlap between the 
> include and exclude parameters of SpanNotQuery.
> Since this behaviour is the inverse of the behaviour implemented by the pre 
> and post constructor arguments, I suggest that this be implemented with 
> negative pre and post values.
> Patch incoming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7431) Allow negative pre/post values in SpanNotQuery

2016-10-06 Thread Marc Morissette (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552928#comment-15552928
 ] 

Marc Morissette commented on LUCENE-7431:
-

Can I get a review of this patch please? It's rather small and code complete.

> Allow negative pre/post values in SpanNotQuery
> --
>
> Key: LUCENE-7431
> URL: https://issues.apache.org/jira/browse/LUCENE-7431
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Marc Morissette
>Priority: Minor
> Attachments: LUCENE-7431.patch
>
>
> I need to be able to specify a certain range of allowed overlap between the 
> include and exclude parameters of SpanNotQuery.
> Since this behaviour is the inverse of the behaviour implemented by the pre 
> and post constructor arguments, I suggest that this be implemented with 
> negative pre and post values.
> Patch incoming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7431) Allow negative pre/post values in SpanNotQuery

2016-08-30 Thread Marc Morissette (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Morissette updated LUCENE-7431:

Attachment: LUCENE-7431.patch

> Allow negative pre/post values in SpanNotQuery
> --
>
> Key: LUCENE-7431
> URL: https://issues.apache.org/jira/browse/LUCENE-7431
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Marc Morissette
>Priority: Minor
> Attachments: LUCENE-7431.patch
>
>
> I need to be able to specify a certain range of allowed overlap between the 
> include and exclude parameters of SpanNotQuery.
> Since this behaviour is the inverse of the behaviour implemented by the pre 
> and post constructor arguments, I suggest that this be implemented with 
> negative pre and post values.
> Patch incoming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7431) Allow negative pre/post values in SpanNotQuery

2016-08-30 Thread Marc Morissette (JIRA)
Marc Morissette created LUCENE-7431:
---

 Summary: Allow negative pre/post values in SpanNotQuery
 Key: LUCENE-7431
 URL: https://issues.apache.org/jira/browse/LUCENE-7431
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Marc Morissette
Priority: Minor


I need to be able to specify a certain range of allowed overlap between the 
include and exclude parameters of SpanNotQuery.

Since this behaviour is the inverse of the behaviour implemented by the pre and 
post constructor arguments, I suggest that this be implemented with negative 
pre and post values.

Patch incoming.
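
To illustrate what the patch enables, construction would look something like 
this; the field, terms and offsets are arbitrary examples, and the 
negative-value semantics are only what this issue proposes:

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Rough sketch of the proposed usage; terms and offsets are arbitrary examples.
public class SpanNotOverlapSketch {
  public static void main(String[] args) {
    SpanQuery include = new SpanTermQuery(new Term("body", "apache"));
    SpanQuery exclude = new SpanTermQuery(new Term("body", "lucene"));
    // pre/post of -1 would tolerate one position of overlap on each side,
    // per the behaviour proposed in this issue.
    SpanNotQuery query = new SpanNotQuery(include, exclude, -1, -1);
    System.out.println(query);
  }
}
{code}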



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org