Re: Distributed commits in CloudSolrServer

2014-04-16 Thread Peter Keegan
>Are distributed commits also done in parallel across shards?
Correction: I meant to ask whether they are done 'sequentially' across shards.


On Wed, Apr 16, 2014 at 9:08 AM, Peter Keegan wrote:

> Are distributed commits also done in parallel across shards?
>
> Peter
>
>


Re: Distributed commits in CloudSolrServer

2014-04-16 Thread Peter Keegan
Are distributed commits also done in parallel across shards?

Peter




Re: Distributed commits in CloudSolrServer

2014-04-15 Thread Mark Miller
Inline responses below.
-- 
Mark Miller
about.me/markrmiller

On April 15, 2014 at 2:12:31 PM, Peter Keegan (peterlkee...@gmail.com) wrote:

I have a SolrCloud index, 1 shard, with a leader and one replica, and 3 
ZKs. The Solr indexes are behind a load balancer. There is one 
CloudSolrServer client updating the indexes. The index schema includes 3 
ExternalFileFields. When the CloudSolrServer client issues a hard commit, I 
observe that the commits occur sequentially, not in parallel, on the leader 
and replica. The duration of each commit is about a minute. Most of this 
time is spent reloading the 3 ExternalFileField files. Because of the 
sequential commits, there is a period of time (1 minute+) when the index 
searchers will return different results, which can cause a bad user 
experience. This will get worse as replicas are added to handle 
auto-scaling. The goal is to keep all replicas in sync w.r.t. the user 
queries. 

My questions: 

1. Is there a reason that the distributed commits are done in sequence, not 
in parallel? Is there a way to change this behavior? 


The reason is that updates are currently done this way - it’s the only safe way 
to do it without solving some more problems. I don’t think you can easily 
change this. I think we should probably file a JIRA issue to track a better 
solution for commit handling. I think there are some complications because of 
how commits can be added on update requests, but it's something we probably want 
to try and solve before tackling *all* updates to replicas in parallel with the 
leader.



2. If instead, the commits were done in parallel by a separate client via a 
GET to each Solr instance, how would this client get the host/port values 
for each Solr instance from zookeeper? Are there any downsides to doing 
commits this way? 

Not really, other than the extra management.
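[Editor's note] A sketch of what such a separate committing client could look like. This is not SolrJ and not an official API: it assumes the Solr 4.x-style /clusterstate.json layout in ZooKeeper (collection -> shards -> replicas, each replica carrying a `base_url`), assumes the caller has already fetched that JSON with a ZooKeeper client such as kazoo, and the host names are made up. The `distrib=false` parameter is intended to keep each node from re-forwarding the commit; verify it against your Solr version.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def live_base_urls(clusterstate_json, collection):
    """Pull every replica's base_url out of a clusterstate.json document."""
    state = json.loads(clusterstate_json)
    urls = []
    for shard in state[collection]["shards"].values():
        for replica in shard["replicas"].values():
            urls.append(replica["base_url"])
    return urls

def commit_url(base_url, core):
    # distrib=false asks the node to commit locally instead of re-forwarding.
    return "%s/%s/update?commit=true&distrib=false" % (base_url.rstrip("/"), core)

def commit_all(base_urls, core, timeout=120):
    """Fire the commit at every replica concurrently; returns HTTP statuses."""
    with ThreadPoolExecutor(max_workers=len(base_urls)) as pool:
        return list(pool.map(
            lambda u: urlopen(commit_url(u, core), timeout=timeout).status,
            base_urls))
```

The "extra management" Mark mentions is visible here: this client must track cluster state itself (replicas joining, leaving, or recovering) instead of letting CloudSolrServer do it.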





Thanks, 
Peter 


Distributed commits in CloudSolrServer

2014-04-15 Thread Peter Keegan
I have a SolrCloud index, 1 shard, with a leader and one replica, and 3
ZKs. The Solr indexes are behind a load balancer. There is one
CloudSolrServer client updating the indexes. The index schema includes 3
ExternalFileFields. When the CloudSolrServer client issues a hard commit, I
observe that the commits occur sequentially, not in parallel, on the leader
and replica. The duration of each commit is about a minute. Most of this
time is spent reloading the 3 ExternalFileField files. Because of the
sequential commits, there is a period of time (1 minute+) when the index
searchers will return different results, which can cause a bad user
experience. This will get worse as replicas are added to handle
auto-scaling. The goal is to keep all replicas in sync w.r.t. the user
queries.

My questions:

1. Is there a reason that the distributed commits are done in sequence, not
in parallel? Is there a way to change this behavior?

2. If instead, the commits were done in parallel by a separate client via a
GET to each Solr instance, how would this client get the host/port values
for each Solr instance from zookeeper? Are there any downsides to doing
commits this way?

Thanks,
Peter
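[Editor's note] A rough way to see why the problem grows with auto-scaling: if each node's commit (dominated here by reloading the three ExternalFileField files) takes about T seconds and the commits run one node after another, the first and last searcher swaps are roughly (N-1)*T apart. A toy model, not Solr code, assuming equal and non-overlapping per-node commit times:

```python
def out_of_sync_window(per_commit_secs, num_nodes, parallel):
    """Seconds between the first and last searcher swap across nodes.

    Sequential commits finish at T, 2T, ..., N*T, so the window is (N-1)*T;
    ideal parallel commits all finish at ~T, so the window collapses to ~0.
    """
    if parallel:
        return 0.0
    return per_commit_secs * (num_nodes - 1)

# With ~60 s commits: leader + 1 replica (2 nodes) gives the ~1 minute
# window observed in the thread; leader + 4 replicas would stretch it to
# about 4 minutes of divergent search results.
```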