Re: New leader/replica solution for HDFS

2015-03-04 Thread longsan
Our update load is very heavy, so we ran into several performance problems:
1) After running for a while, replicas could not keep up with the leader's
indexing speed and had to recover, but recovery was very slow and often failed.
2) Data was inconsistent between leader and replica: running the same query
twice could return different results.
3) Replicas used much more CPU than the leader.
4) Recovery failed when heavy update requests arrived at the same time.

So in the end we gave up on replicas.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735p4191086.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: New leader/replica solution for HDFS

2015-03-04 Thread longsan
I'm happy to hear that. It's a good option for a Solr + HDFS solution. It
could avoid many performance issues.





Re: New leader/replica solution for HDFS

2015-02-26 Thread Joseph Obernberger

Great!  Thank you!

I had a 4 shard setup - no replicas.  Index size was 2.0TBytes stored in 
HDFS with each node having approximately 500G of index.  I added four 
more shards on four other machines as replicas.  One thing that happened 
was that all 4 replicas ran out of HDFS block cache memory
 (SnapPull failed: java.lang.RuntimeException: The max direct memory is 
likely too low.  Either increase it (by adding 
-XX:MaxDirectMemorySize=<size>g -XX:+UseLargePages to your containers 
startup args) or disable direct allocation using 
solr.hdfs.blockcache.direct.memory.allocation=false in solrconfig.xml.  
If you are putting the block cache on the heap, your java heap size 
might not be large enough.  Failed allocating)


I was using 160 slabs (20GBytes of RAM).  I dropped the config to 80 
slabs and restarted the replicas.  Two of the replicas came up OK, but 
the other 2 have stayed in 'Recovering'.  I stopped those two and 
restarted them - now I have 3 OK, but one is still in Recovering.
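
As a rough check on those slab numbers, here is a minimal sketch of the
arithmetic. It assumes the stock slab geometry (16384 blocks per slab via
solr.hdfs.blockcache.blocksperbank, 8 KiB per block); neither value is
stated in this thread, so adjust them if your solrconfig.xml differs.

```python
# Assumed defaults, not taken from the thread: 16384 blocks per slab and
# 8 KiB per block, which works out to 128 MiB per slab.
BLOCKS_PER_SLAB = 16384
BLOCK_SIZE_BYTES = 8 * 1024


def block_cache_bytes(slab_count: int) -> int:
    """Memory the HDFS block cache tries to allocate for `slab_count` slabs."""
    return slab_count * BLOCKS_PER_SLAB * BLOCK_SIZE_BYTES


for slabs in (160, 80):
    gib = block_cache_bytes(slabs) / 2**30
    print(f"{slabs} slabs -> {gib:.0f} GiB of block cache")
```

With direct allocation enabled, the JVM's -XX:MaxDirectMemorySize has to be
larger than this figure, which is probably why the 160-slab setting failed
on the new machines.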


Given that each replica does indexing as well, I was expecting the 
amount of HDFS disk usage to double, but that has not happened. Once I 
get the last replica to come up, I'll run some tests.


-Joe

On 2/26/2015 10:45 AM, Mark Miller wrote:

I’ll be working on this at some point: 
https://issues.apache.org/jira/browse/SOLR-6237

- Mark

http://about.me/markrmiller


Re: New leader/replica solution for HDFS

2015-02-26 Thread Mark Miller
I’ll be working on this at some point: 
https://issues.apache.org/jira/browse/SOLR-6237

- Mark

http://about.me/markrmiller

> On Feb 25, 2015, at 2:12 AM, longsan  wrote:
> 
> We use HDFS as our Solr index storage and we have a really heavy update
> load. We have run into many problems with the current leader/replica
> solution. There is duplicate index computing on the replica side. And the
> data sync between leader/replica is always a problem.
> 
> As HDFS already provides data replication at the data layer, could Solr
> provide just service-layer replication?
> 
> My thought is that the leader and the replica all bind to the same data
> index directory. The leader would build the index for new requests, and the
> replica would just keep its index version up to date with the leader (such
> as through a periodic soft commit?). If the leader is lost, the replica
> would take over immediately.
> 
> Thanks for any suggestions on this idea.



Re: New leader/replica solution for HDFS

2015-02-25 Thread William Bell
Use DocValues.

On Wed, Feb 25, 2015 at 3:14 PM, Joseph Obernberger  wrote:

> Thank you!  I'm mainly concerned about facet performance.  When we have
> indexing turned on, our facet performance suffers significantly.
> I will add replicas and measure the performance change.
>
> -Joe Obernberger

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: New leader/replica solution for HDFS

2015-02-25 Thread Joseph Obernberger
Thank you!  I'm mainly concerned about facet performance.  When we have 
indexing turned on, our facet performance suffers significantly.

I will add replicas and measure the performance change.

-Joe Obernberger

On 2/25/2015 4:31 PM, Erick Erickson wrote:

bq: Is adding replicas going to increase search performance?

Absolutely, assuming you've maxed out Solr. You can scale the Solr
queries-per-second rate nearly linearly by adding replicas, regardless of
whether it's over HDFS or not.

Having multiple replicas per shard _also_ increases fault tolerance,
so you get both. Even with HDFS, though, a single replica (just a
leader) per shard means that you don't have any redundancy if the
motherboard on that server dies even though HDFS has multiple copies
of the _data_.

Best,
Erick



Re: New leader/replica solution for HDFS

2015-02-25 Thread Erick Erickson
bq: Is adding replicas going to increase search performance?

Absolutely, assuming you've maxed out Solr. You can scale the Solr
queries-per-second rate nearly linearly by adding replicas, regardless of
whether it's over HDFS or not.

Having multiple replicas per shard _also_ increases fault tolerance,
so you get both. Even with HDFS, though, a single replica (just a
leader) per shard means that you don't have any redundancy if the
motherboard on that server dies even though HDFS has multiple copies
of the _data_.

Best,
Erick

On Wed, Feb 25, 2015 at 12:01 PM, Joseph Obernberger
 wrote:
> I am also confused on this.  Is adding replicas going to increase search
> performance?  I'm not sure I see the point of any replicas when using HDFS.
> Is there one?
> Thank you!
>
> -Joe


Re: New leader/replica solution for HDFS

2015-02-25 Thread Joseph Obernberger
I am also confused on this.  Is adding replicas going to increase search 
performance?  I'm not sure I see the point of any replicas when using 
HDFS.  Is there one?

Thank you!

-Joe

On 2/25/2015 10:57 AM, Erick Erickson wrote:

bq: And the data sync between leader/replica is always a problem

Not quite sure what you mean by this. There shouldn't need to be
any syncing in the sense that the index gets replicated; the
incoming documents should be sent to each node (and indexed
to HDFS) as they come in.

bq: There is duplicate index computing on the replica side.

Yes, that's the design of SolrCloud, explicitly to provide data safety.
If you instead rely on the leader to index and somehow pull that
indexed form to the replica, then you will lose data if the leader
goes down before sending the indexed form.

bq: My thought is that the leader and the replica all bind to the same data
index directory.

This is unsafe. They would both then try to _write_ to the same
index, which can easily corrupt indexes and/or all but the first
one to access the index would be locked out.

All that said, the HDFS triple-redundancy compounded with the
Solr leaders/replicas redundancy means a bunch of extra
storage. You can turn the HDFS replication down to 1, but that has
other implications.

Best,
Erick



Re: New leader/replica solution for HDFS

2015-02-25 Thread Erick Erickson
bq: And the data sync between leader/replica is always a problem

Not quite sure what you mean by this. There shouldn't need to be
any syncing in the sense that the index gets replicated; the
incoming documents should be sent to each node (and indexed
to HDFS) as they come in.

bq: There is duplicate index computing on the replica side.

Yes, that's the design of SolrCloud, explicitly to provide data safety.
If you instead rely on the leader to index and somehow pull that
indexed form to the replica, then you will lose data if the leader
goes down before sending the indexed form.
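
To make that trade-off concrete, here is a toy simulation (not Solr code).
The Node class, both functions and the pull interval are hypothetical names
invented for illustration; the only point is to compare what a replica
holds at the moment the leader dies.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    docs: list = field(default_factory=list)


def index_forwarding(updates, leader, replica):
    """SolrCloud-style: each update is sent to every node, which indexes it itself."""
    for doc in updates:
        leader.docs.append(doc)
        replica.docs.append(doc)


def index_with_pull(updates, leader, replica, pull_every):
    """Hypothetical alternative: only the leader indexes, and the replica
    copies the leader's index after every `pull_every` updates."""
    for count, doc in enumerate(updates, start=1):
        leader.docs.append(doc)
        if count % pull_every == 0:
            replica.docs = list(leader.docs)


updates = [f"doc{i}" for i in range(1, 8)]  # 7 updates, then the leader dies

a_leader, a_replica = Node(), Node()
index_forwarding(updates, a_leader, a_replica)

b_leader, b_replica = Node(), Node()
index_with_pull(updates, b_leader, b_replica, pull_every=5)

print("forwarding: replica has", len(a_replica.docs), "of", len(updates))
print("pull:       replica has", len(b_replica.docs), "of", len(updates))
```

In the pull variant the two updates indexed after the last copy exist only
on the dead leader, which is the data-loss window described above.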

bq: My thought is that the leader and the replica all bind to the same data
index directory.

This is unsafe. They would both then try to _write_ to the same
index, which can easily corrupt indexes and/or all but the first
one to access the index would be locked out.
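
The "locked out" case can be sketched with a plain lock file. This is a
simplified stand-in on the local filesystem, assuming a lock that is taken
by exclusive file creation; it is not Lucene's actual lock factory, and
try_acquire_write_lock is a hypothetical helper. Only the write.lock file
name mirrors what Lucene uses.

```python
import os
import tempfile


def try_acquire_write_lock(index_dir: str) -> bool:
    """Return True if this caller created the lock file, False if it already exists."""
    lock_path = os.path.join(index_dir, "write.lock")
    try:
        # O_EXCL makes creation fail if another writer already holds the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


with tempfile.TemporaryDirectory() as shared_index_dir:
    leader_ok = try_acquire_write_lock(shared_index_dir)
    replica_ok = try_acquire_write_lock(shared_index_dir)
    print("leader got lock:", leader_ok)
    print("replica got lock:", replica_ok)
```

If the lock is bypassed or removed instead, both writers modify the same
segment files, which is the corruption case.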

All that said, the HDFS triple-redundancy compounded with the
Solr leaders/replicas redundancy means a bunch of extra
storage. You can turn the HDFS replication down to 1, but that has
other implications.
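
A back-of-the-envelope sketch of that storage multiplication follows. It
assumes every Solr copy of a shard writes its own full index and that HDFS
replicates every block uniformly. The 2.0 TB figure is the index size
quoted elsewhere in this thread; raw_storage_tb is a hypothetical helper.

```python
def raw_storage_tb(index_tb: float, solr_copies: int, hdfs_replication: int) -> float:
    """Raw disk used if each Solr copy stores a full index and HDFS keeps
    `hdfs_replication` copies of every block."""
    return index_tb * solr_copies * hdfs_replication


INDEX_TB = 2.0  # logical index size, taken from another message in the thread

for solr_copies, hdfs_replication in [(1, 3), (2, 3), (2, 1)]:
    total = raw_storage_tb(INDEX_TB, solr_copies, hdfs_replication)
    print(f"{solr_copies} Solr copies x HDFS replication {hdfs_replication}: {total:.1f} TB")
```

This is an upper-bound estimate; actual usage depends on merges, deleted
documents and whether all replicas have finished recovering.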

Best,
Erick

On Tue, Feb 24, 2015 at 11:12 PM, longsan  wrote:
> We use HDFS as our Solr index storage and we have a really heavy update
> load. We have run into many problems with the current leader/replica
> solution. There is duplicate index computing on the replica side. And the
> data sync between leader/replica is always a problem.
>
> As HDFS already provides data replication at the data layer, could Solr
> provide just service-layer replication?
>
> My thought is that the leader and the replica all bind to the same data
> index directory. The leader would build the index for new requests, and the
> replica would just keep its index version up to date with the leader (such
> as through a periodic soft commit?). If the leader is lost, the replica
> would take over immediately.
>
> Thanks for any suggestions on this idea.