RE: Use multiple istance simultaneously

2015-12-17 Thread Gian Maria Ricci - aka Alkampfer
Hi,

I have a quick question about ZooKeeper: how can I run ZooKeeper as a service
on Linux so that it starts automatically when the instance is rebooted? The
only information I've found on the internet is at this link
http://positivealex.github.io/blog/posts/how-to-install-zookeeper-as-service-on-centos
and it seems to be slightly out of date.

--
Gian Maria Ricci
Cell: +39 320 0136949


-Original Message-
From: outlook_288fbf38c031d...@outlook.com 
[mailto:outlook_288fbf38c031d...@outlook.com] On Behalf Of Gian Maria Ricci - 
aka Alkampfer
Sent: Saturday, 12 December 2015 11:39
To: solr-user@lucene.apache.org
Subject: RE: Use multiple istance simultaneously

Thanks a lot for all the clarifications.

Actually resources are not a big problem, I think customer can afford 4 GB RAM 
Red Hat linux machines for Zookeeper. Solr Machines will have in production 64 
or 96 GB of ram, depending on the dimension of the index.

My primary concern is maintenance of the structure. With single independent 
machines, the situation is trivial, we can stop solr on one of the machine 
during the night, and issue a full backup of the indexes. With a full backup of 
the indexes, rebuilding a machine from scratch in case of disaster is simple, 
just spin off a new Virtual machine, restore the backup, restart solr and 
everything is ok.

If for any reason the SolrCloud cluster stops working, restoring everything is 
somewhat more complicated. Are there any best practice for SolrCloud to backup 
everything so we can restore the entire cluster if anything goes wrong?

Thanks a lot for the interesting discussion and for the really useful 
information you gave me.

--
Gian Maria Ricci
Cell: +39 320 0136949


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Friday, 11 December 2015 17:11
To: solr-user@lucene.apache.org
Subject: Re: Use multiple istance simultaneously

On 12/11/2015 8:19 AM, Gian Maria Ricci - aka Alkampfer wrote:
> Thanks for all of your clarification. I know that solrcloud is a 
> really better configuration than any other, but actually it has a 
> complexity that is really higher. I just want to give you the pain 
> point I've noticed while I was gathering all the info I can got on SolrCloud.
> 
> 1) zookeeper documentation says that to have the best experience you 
> should have a dedicated filesystem for the persistence and it should 
> never swap to disk. I've not found any guidelines on how I should 
> dimension zookeeper machine, how much ram, disk? Can I install 
> zookeeper in the same machines where Solr resides ( I suspect no, 
> because Solr machine are under stress and if zookeeper start swapping is can 
> lead to problem)?

Standalone zookeeper doesn't require much in the way of resources.
Unless the SolrCloud installation is enormous, a machine with 1-2GB of RAM is 
probably plenty, if the only thing it is doing is zookeeper and it's not 
running Windows.  If the SolrCloud install has a lot of collections, shards, 
and/or servers, then you might need more, because the zookeeper database will 
be larger.

> 2) What about the update? If I need to update my solrcloud instance 
> and the new version requires a new version of zookeeper which is the 
> path to go? I need to first update zookeeper, or upgrading solr to existing 
> machine or?
> Maybe I did not search well but I did not find a comprehensive 
> guideline that told me how to upgrade my SolrCloud installation in various 
> situation.

If you're following recommendations and using standalone zookeeper, then 
upgrading it is entirely separate from upgrading Solr.  It's probably a good 
idea to upgrade your three (or more) zookeeper servers first.

Here's a FAQ entry from zookeeper about upgrades:

https://wiki.apache.org/hadoop/ZooKeeper/FAQ#A6

> 3) Which are the best practices to run DIH in solrcloud? I think I can 
> round robin triggering DIH import on different server composing the 
> cloud infrastructure, or there is a better way to go? (I probably need 
> to trigger a DIH each 5/10 minutes but the number of new records is 
> really small)

When checking the status of an import, you must send the status request to the 
same machine where you sent the command to start the import.

If you're only ever going to run one DIH at a time, then I don't see any reason 
to involve multiple servers.  If you want to run more than one simultaneously, 
then you might want to run each one on a different machine.

> 4) Since I believe that it is not best practice to install zookeeper 
> on same SolrMachine (as separated process, not the built in 
> zookeeper), I need at least three more machine to maintain / monitor / 
> upgrade and I need also to monitor zookeeper, a new appliance that 
> need to be mastered by IT Infrastructure.

The only real reason to avoid zookeeper and Solr on the same machine is 
performance under high load, and mostly that comes d

RE: Use multiple istance simultaneously

2015-12-12 Thread Gian Maria Ricci - aka Alkampfer
Thanks a lot for all the clarifications.

Actually, resources are not a big problem; I think the customer can afford
Red Hat Linux machines with 4 GB of RAM for ZooKeeper. In production, the Solr
machines will have 64 or 96 GB of RAM, depending on the size of the index.

My primary concern is maintenance of the structure. With single independent
machines, the situation is trivial: we can stop Solr on one of the machines
during the night and take a full backup of the indexes. With a full backup of
the indexes, rebuilding a machine from scratch in case of disaster is simple:
just spin up a new virtual machine, restore the backup, restart Solr, and
everything is OK.

If for any reason the SolrCloud cluster stops working, restoring everything is
somewhat more complicated. Are there any best practices for backing up
SolrCloud so that we can restore the entire cluster if anything goes wrong?

Thanks a lot for the interesting discussion and for the really useful 
information you gave me.

--
Gian Maria Ricci
Cell: +39 320 0136949


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Friday, 11 December 2015 17:11
To: solr-user@lucene.apache.org
Subject: Re: Use multiple istance simultaneously

On 12/11/2015 8:19 AM, Gian Maria Ricci - aka Alkampfer wrote:
> Thanks for all of your clarification. I know that solrcloud is a 
> really better configuration than any other, but actually it has a 
> complexity that is really higher. I just want to give you the pain 
> point I've noticed while I was gathering all the info I can got on SolrCloud.
> 
> 1) zookeeper documentation says that to have the best experience you 
> should have a dedicated filesystem for the persistence and it should 
> never swap to disk. I've not found any guidelines on how I should 
> dimension zookeeper machine, how much ram, disk? Can I install 
> zookeeper in the same machines where Solr resides ( I suspect no, 
> because Solr machine are under stress and if zookeeper start swapping is can 
> lead to problem)?

Standalone zookeeper doesn't require much in the way of resources.
Unless the SolrCloud installation is enormous, a machine with 1-2GB of RAM is 
probably plenty, if the only thing it is doing is zookeeper and it's not 
running Windows.  If the SolrCloud install has a lot of collections, shards, 
and/or servers, then you might need more, because the zookeeper database will 
be larger.

> 2) What about the update? If I need to update my solrcloud instance 
> and the new version requires a new version of zookeeper which is the 
> path to go? I need to first update zookeeper, or upgrading solr to existing 
> machine or?
> Maybe I did not search well but I did not find a comprehensive 
> guideline that told me how to upgrade my SolrCloud installation in various 
> situation.

If you're following recommendations and using standalone zookeeper, then 
upgrading it is entirely separate from upgrading Solr.  It's probably a good 
idea to upgrade your three (or more) zookeeper servers first.

Here's a FAQ entry from zookeeper about upgrades:

https://wiki.apache.org/hadoop/ZooKeeper/FAQ#A6

> 3) Which are the best practices to run DIH in solrcloud? I think I can 
> round robin triggering DIH import on different server composing the 
> cloud infrastructure, or there is a better way to go? (I probably need 
> to trigger a DIH each 5/10 minutes but the number of new records is 
> really small)

When checking the status of an import, you must send the status request to the 
same machine where you sent the command to start the import.

If you're only ever going to run one DIH at a time, then I don't see any reason 
to involve multiple servers.  If you want to run more than one simultaneously, 
then you might want to run each one on a different machine.

> 4) Since I believe that it is not best practice to install zookeeper 
> on same SolrMachine (as separated process, not the built in 
> zookeeper), I need at least three more machine to maintain / monitor / 
> upgrade and I need also to monitor zookeeper, a new appliance that 
> need to be mastered by IT Infrastructure.

The only real reason to avoid zookeeper and Solr on the same machine is 
performance under high load, and mostly that comes down to I/O performance, so 
if you can put zookeeper on a separate set of disks, you're probably good.  If 
the query/update load will not be high, then sharing machines will likely work 
well, even if the disks are all shared.

> Is there any guidelines on how to automate promoting a slave as a 
> master in classic Master Slave situation? I did not find anything 
> official, because auto promoting a slave into master could solve my problem.

I don't know of any explicit information explaining how to promote a new 
master.  Basically what you have to do is reconfigure the new master's 
replication (so it stops trying to be a slav

Re: Use multiple istance simultaneously

2015-12-11 Thread Shawn Heisey
On 12/11/2015 8:19 AM, Gian Maria Ricci - aka Alkampfer wrote:
> Thanks for all of your clarification. I know that solrcloud is a really
> better configuration than any other, but actually it has a complexity that
> is really higher. I just want to give you the pain point I've noticed while
> I was gathering all the info I can got on SolrCloud.
> 
> 1) zookeeper documentation says that to have the best experience you should
> have a dedicated filesystem for the persistence and it should never swap to
> disk. I've not found any guidelines on how I should dimension zookeeper
> machine, how much ram, disk? Can I install zookeeper in the same machines
> where Solr resides ( I suspect no, because Solr machine are under stress and
> if zookeeper start swapping is can lead to problem)?

Standalone zookeeper doesn't require much in the way of resources.
Unless the SolrCloud installation is enormous, a machine with 1-2GB of
RAM is probably plenty, if the only thing it is doing is zookeeper and
it's not running Windows.  If the SolrCloud install has a lot of
collections, shards, and/or servers, then you might need more, because
the zookeeper database will be larger.

> 2) What about the update? If I need to update my solrcloud instance and the
> new version requires a new version of zookeeper which is the path to go? I
> need to first update zookeeper, or upgrading solr to existing machine or?
> Maybe I did not search well but I did not find a comprehensive guideline
> that told me how to upgrade my SolrCloud installation in various situation. 

If you're following recommendations and using standalone zookeeper, then
upgrading it is entirely separate from upgrading Solr.  It's probably a
good idea to upgrade your three (or more) zookeeper servers first.

Here's a FAQ entry from zookeeper about upgrades:

https://wiki.apache.org/hadoop/ZooKeeper/FAQ#A6
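
The reason three is the usual minimum for a standalone ensemble is that ZooKeeper needs a strict majority of its servers to be up. A minimal sketch of that arithmetic follows; the function names are hypothetical and exist only for illustration, and the sketch says nothing about version compatibility between releases, which the FAQ entry above covers.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def can_restart_one_at_a_time(ensemble_size):
    """True if taking one server down for an upgrade still leaves a majority."""
    return tolerated_failures(ensemble_size) >= 1
```

With three servers, one can be stopped for an upgrade while the other two keep serving; with one or two servers, stopping any of them loses the majority.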

> 3) Which are the best practices to run DIH in solrcloud? I think I can round
> robin triggering DIH import on different server composing the cloud
> infrastructure, or there is a better way to go? (I probably need to trigger
> a DIH each 5/10 minutes but the number of new records is really small)

When checking the status of an import, you must send the status request
to the same machine where you sent the command to start the import.

If you're only ever going to run one DIH at a time, then I don't see any
reason to involve multiple servers.  If you want to run more than one
simultaneously, then you might want to run each one on a different machine.
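
A minimal sketch of the "same machine" rule, assuming the DIH handler is registered at /dataimport in solrconfig.xml (the path is set by configuration and may differ). The class name, host, and core name are hypothetical. The idea is to build the start and status URLs from one remembered host, so the status request cannot drift to another node.

```python
from urllib.parse import urlencode


class DihSession:
    """Remembers which Solr host receives the import command."""

    def __init__(self, base_url, core, handler="/dataimport"):
        # base_url is something like http://host:8983/solr (assumed layout)
        self.base_url = base_url.rstrip("/")
        self.core = core
        self.handler = handler

    def _url(self, command):
        query = urlencode({"command": command, "wt": "json"})
        return f"{self.base_url}/{self.core}{self.handler}?{query}"

    def start_url(self, command="delta-import"):
        """URL that starts an import on the remembered host."""
        return self._url(command)

    def status_url(self):
        """URL that asks the same host for the import status."""
        return self._url("status")
```

A scheduler would create one DihSession per import run, request start_url(), and then poll status_url() on that same object rather than going through a load balancer.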

> 4) Since I believe that it is not best practice to install zookeeper on same
> SolrMachine (as separated process, not the built in zookeeper), I need at
> least three more machine to maintain / monitor / upgrade and I need also to
> monitor zookeeper, a new appliance that need to be mastered by IT
> Infrastructure.

The only real reason to avoid zookeeper and Solr on the same machine is
performance under high load, and mostly that comes down to I/O
performance, so if you can put zookeeper on a separate set of disks,
you're probably good.  If the query/update load will not be high, then
sharing machines will likely work well, even if the disks are all shared.

> Is there any guidelines on how to automate promoting a slave as a master in
> classic Master Slave situation? I did not find anything official, because
> auto promoting a slave into master could solve my problem.

I don't know of any explicit information explaining how to promote a new
master.  Basically what you have to do is reconfigure the new master's
replication (so it stops trying to be a slave), reconfigure every slave
to point to the new master, and reconfigure every client that makes
index updates.  DNS changes *might* be able to automate the slave and
update client reconfig, but the master reconfig requires changing Solr's
configuration, which at the very least will require reloading or
restarting that server.  That could be automated, but it's up to you to
write the automation.

Thanks,
Shawn



RE: Use multiple istance simultaneously

2015-12-11 Thread Gian Maria Ricci - aka Alkampfer
Thanks for all of your clarifications. I know that SolrCloud is a much better
configuration than any other, but it is also considerably more complex. I just
want to share the pain points I've noticed while gathering all the information
I could find on SolrCloud.

1) The ZooKeeper documentation says that for the best experience you should
have a dedicated filesystem for persistence and that ZooKeeper should never
swap to disk. I've not found any guidelines on how to size a ZooKeeper
machine: how much RAM and disk? Can I install ZooKeeper on the same machines
where Solr resides? (I suspect not, because the Solr machines are under
stress, and if ZooKeeper starts swapping it can lead to problems.)

2) What about upgrades? If I need to upgrade my SolrCloud instance and the
new version requires a new version of ZooKeeper, which path should I take?
Do I upgrade ZooKeeper first, or upgrade Solr on the existing machines first?
Maybe I did not search well, but I did not find a comprehensive guideline
that tells me how to upgrade my SolrCloud installation in various situations.

3) What are the best practices for running DIH in SolrCloud? I think I can
trigger DIH imports round-robin on the different servers composing the cloud
infrastructure, or is there a better way to go? (I probably need to trigger
a DIH import every 5-10 minutes, but the number of new records is really
small.)

4) Since I believe that it is not best practice to install ZooKeeper on the
same machine as Solr (as a separate process, not the built-in ZooKeeper), I
need at least three more machines to maintain, monitor, and upgrade, and I
also need to monitor ZooKeeper, a new appliance that needs to be mastered by
IT Infrastructure.

Are there any guidelines on how to automate promoting a slave to master in a
classic master-slave setup? I did not find anything official, and
auto-promoting a slave to master could solve my problem.

--
Gian Maria Ricci
Cell: +39 320 0136949


-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com] 
Sent: Tuesday, 8 December 2015 11:25
To: solr-user@lucene.apache.org
Subject: Re: Use multiple istance simultaneously

Can you tolerate having
indices in different state or you plan to keep them in sync with
controlled commits. DIH-ing content from source when new machine is needed
will probably be slow and I am afraid that you will end up simulating
master-slave model (copying state from one of healthy nodes and DIH-ing
diff). I would recommend using SolrCloud with single shard and let Solr do
the hard work.

Regards,
Emir

On 04.12.2015 14:37, Gian Maria Ricci - aka Alkampfer wrote:
> Many thanks for your response.
>
> I worked with Solr until early version 4.0, then switched to 
> ElasticSearch for a variety of reasons. I've used replication in the 
> past with SolR, but with Elasticsearch basically I had no problem 
> because it works similar to SolrCloud by default and with almost zero
> configuration.
>
> Now I've a customer that want to use Solr, and he want the simplest 
> possible stuff to maintain in production. Since most of the work will 
> be done by Data Import Handler, having multiple parallel and 
> independent machines is easy to
> maintain. If one machine fails, it is enough to configure another 
> machine, configure core and restart DIH.
>
> I'd like to know if other people went through this path in the past.
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>  
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Thursday, 3 December 2015 10:15
> To: solr-user@lucene.apache.org
> Subject: Re: Use multiple istance simultaneously
>
> On 12/3/2015 1:25 AM, Gian Maria Ricci - aka Alkampfer wrote:
>> In such a scenario could it be feasible to simply configure 2 or 3 
>> identical instance of Solr and configure the application that 
>> transfer data to solr to all the instances simultaneously (the 
>> approach will be a DIH incremental for some core and an external 
>> application that push data continuously for other cores)? Which could 
>> be the drawback of using this approach?
> When I first set up Solr, I used replication.  Then version 3.1.0 was 
> released, including a non-backward-compatible upgrade to javabin, and it was
> not possible to replicate between 1.x and 3.x.
>
> This incompatibility meant that it would not be possible to do a 
> gradual upgrade to 3.x, where the slaves are upgraded first and then the
> master.
>
> To get around the problem, I basically did exactly what you've described.
> I turned off replication and configured a second copy of my build 
> program to update what used to be slave servers.
>
> Later, when I moved to a SolrJ program for index maintenance, I made 
> one copy of the maintenance program capable of updating multiple 
> copies of the index in parallel.

Re: Use multiple istance simultaneously

2015-12-08 Thread Emir Arnautovic
Can you tolerate having indices in different states, or do you plan to keep
them in sync with controlled commits? DIH-ing content from the source when a
new machine is needed will probably be slow, and I am afraid that you will
end up simulating the master-slave model (copying state from one of the
healthy nodes and DIH-ing the diff). I would recommend using SolrCloud with a
single shard and letting Solr do the hard work.


Regards,
Emir

On 04.12.2015 14:37, Gian Maria Ricci - aka Alkampfer wrote:
> Many thanks for your response.
>
> I worked with Solr until early version 4.0, then switched to ElasticSearch
> for a variety of reasons. I've used replication in the past with SolR, but
> with Elasticsearch basically I had no problem because it works similar to
> SolrCloud by default and with almost zero configuration.
>
> Now I've a customer that want to use Solr, and he want the simplest possible
> stuff to maintain in production. Since most of the work will be done by Data
> Import Handler, having multiple parallel and independent machines is easy to
> maintain. If one machine fails, it is enough to configure another machine,
> configure core and restart DIH.
>
> I'd like to know if other people went through this path in the past.
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Thursday, 3 December 2015 10:15
> To: solr-user@lucene.apache.org
> Subject: Re: Use multiple istance simultaneously
>
> On 12/3/2015 1:25 AM, Gian Maria Ricci - aka Alkampfer wrote:
>> In such a scenario could it be feasible to simply configure 2 or 3
>> identical instance of Solr and configure the application that transfer
>> data to solr to all the instances simultaneously (the approach will be
>> a DIH incremental for some core and an external application that push
>> data continuously for other cores)? Which could be the drawback of
>> using this approach?
>
> When I first set up Solr, I used replication.  Then version 3.1.0 was
> released, including a non-backward-compatible upgrade to javabin, and it was
> not possible to replicate between 1.x and 3.x.
>
> This incompatibility meant that it would not be possible to do a gradual
> upgrade to 3.x, where the slaves are upgraded first and then the master.
>
> To get around the problem, I basically did exactly what you've described.
> I turned off replication and configured a second copy of my build program to
> update what used to be slave servers.
>
> Later, when I moved to a SolrJ program for index maintenance, I made one
> copy of the maintenance program capable of updating multiple copies of the
> index in parallel.
>
> I have stuck with this architecture through 4.x and moving into 5.x, even
> though I could go back to replication or switch to SolrCloud.
> Having completely independent indexes allows a great deal of flexibility
> with upgrades and testing new configurations, flexibility that isn't
> available with SolrCloud or master-slave replication.
>
> Thanks,
> Shawn



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Use multiple istance simultaneously

2015-12-07 Thread Shawn Heisey
On 12/4/2015 6:37 AM, Gian Maria Ricci - aka Alkampfer wrote:
> Many thanks for your response.
> 
> I worked with Solr until early version 4.0, then switched to ElasticSearch
> for a variety of reasons. I've used replication in the past with SolR, but
> with Elasticsearch basically I had no problem because it works similar to
> SolrCloud by default and with almost zero configuration.
> 
> Now I've a customer that want to use Solr, and he want the simplest possible
> stuff to maintain in production. Since most of the work will be done by Data
> Import Handler, having multiple parallel and independent machines is easy to
> maintain. If one machine fails, it is enough to configure another machine,
> configure core and restart DIH.
> 
> I'd like to know if other people went through this path in the past.

Even though I don't use SolrCloud myself for my primary indexes, if I
were setting up a brand new install of Solr for someone else to manage
after I'm finished with it, I would use SolrCloud.  SolrCloud has no
master, no single point of failure.  Handling multiple shards and
multiple replicas is mostly automatic.  If the clients use SolrJ,
there's no need for a load balancer.

I've never used elasticsearch, but I've looked a little bit at its
configuration.  There are aspects of it that are much easier than Solr.
 Solr does not hide very much of the lower-level complexity from the
administrator.  This makes the learning curve for Solr a lot steeper
than the learning curve for ES, but once that is tackled, the Solr
administrator understands the inner workings a lot better than the ES
administrator.

I've seen claims that ES is much faster than Solr ... but if the
benchmarks supporting those claims are using the out-of-the-box
configurations, then it is an unfair comparison -- Solr's out of the box
configuration has much more capability turned on and is going to run
slower as a result.  I have not seen any numbers where Solr and ES are
set up with configurations that are as identical as possible.  I have to
wonder if this is because the performance would be similar.

Thanks,
Shawn



RE: Use multiple istance simultaneously

2015-12-04 Thread Gian Maria Ricci - aka Alkampfer
Many thanks for your response.

I worked with Solr until the early 4.0 versions, then switched to Elasticsearch
for a variety of reasons. I've used replication with Solr in the past, but
with Elasticsearch I basically had no problems, because it works similarly to
SolrCloud by default and with almost zero configuration.

Now I have a customer who wants to use Solr, and he wants the simplest possible
setup to maintain in production. Since most of the work will be done by the
Data Import Handler, having multiple parallel and independent machines is easy
to maintain. If one machine fails, it is enough to set up another machine,
configure the core, and restart DIH.

I'd like to know if other people have gone down this path in the past.

--
Gian Maria Ricci
Cell: +39 320 0136949


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Thursday, 3 December 2015 10:15
To: solr-user@lucene.apache.org
Subject: Re: Use multiple istance simultaneously

On 12/3/2015 1:25 AM, Gian Maria Ricci - aka Alkampfer wrote:
> In such a scenario could it be feasible to simply configure 2 or 3 
> identical instance of Solr and configure the application that transfer 
> data to solr to all the instances simultaneously (the approach will be 
> a DIH incremental for some core and an external application that push 
> data continuously for other cores)? Which could be the drawback of 
> using this approach?

When I first set up Solr, I used replication.  Then version 3.1.0 was
released, including a non-backward-compatible upgrade to javabin, and it was
not possible to replicate between 1.x and 3.x.

This incompatibility meant that it would not be possible to do a gradual
upgrade to 3.x, where the slaves are upgraded first and then the master.

To get around the problem, I basically did exactly what you've described.
I turned off replication and configured a second copy of my build program to
update what used to be slave servers.

Later, when I moved to a SolrJ program for index maintenance, I made one
copy of the maintenance program capable of updating multiple copies of the
index in parallel.

I have stuck with this architecture through 4.x and moving into 5.x, even
though I could go back to replication or switch to SolrCloud.
Having completely independent indexes allows a great deal of flexibility
with upgrades and testing new configurations, flexibility that isn't
available with SolrCloud or master-slave replication.

Thanks,
Shawn



Re: Use multiple istance simultaneously

2015-12-03 Thread Shawn Heisey
On 12/3/2015 1:25 AM, Gian Maria Ricci - aka Alkampfer wrote:
> In such a scenario could it be feasible to simply configure 2 or 3
> identical instance of Solr and configure the application that transfer
> data to solr to all the instances simultaneously (the approach will be a
> DIH incremental for some core and an external application that push data
> continuously for other cores)? Which could be the drawback of using this
> approach?

When I first set up Solr, I used replication.  Then version 3.1.0 was
released, including a non-backward-compatible upgrade to javabin, and it
was not possible to replicate between 1.x and 3.x.

This incompatibility meant that it would not be possible to do a gradual
upgrade to 3.x, where the slaves are upgraded first and then the master.

To get around the problem, I basically did exactly what you've
described.  I turned off replication and configured a second copy of my
build program to update what used to be slave servers.

Later, when I moved to a SolrJ program for index maintenance, I made one
copy of the maintenance program capable of updating multiple copies of
the index in parallel.
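
The parallel-update idea can be sketched roughly as follows. This is not the SolrJ program described above; it is a standard-library Python illustration that assumes each index copy accepts JSON documents at its own update URL. The function names and the shape of the result are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import json
import urllib.request


def post_json(update_url, docs):
    """POST a batch of documents to one Solr update handler; return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


def update_all(update_urls, docs, send=post_json):
    """Send the same batch to every independent index copy in parallel.

    Returns {url: result}, where result is whatever `send` returned,
    or the exception it raised for that copy.
    """
    def attempt(url):
        try:
            return send(url, docs)
        except Exception as error:  # keep going so one dead copy does not block the rest
            return error

    with ThreadPoolExecutor(max_workers=max(1, len(update_urls))) as pool:
        results = list(pool.map(attempt, update_urls))
    return dict(zip(update_urls, results))
```

Because the copies are independent, a failure on one of them is reported per URL instead of aborting the batch; the caller has to decide how to bring a lagging copy back in sync.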

I have stuck with this architecture through 4.x and moving into 5.x,
even though I could go back to replication or switch to SolrCloud.
Having completely independent indexes allows a great deal of flexibility
with upgrades and testing new configurations, flexibility that isn't
available with SolrCloud or master-slave replication.

Thanks,
Shawn