RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
Hi all,

I'd love to share the diagram, I'm just not sure how to do that on the list (it's a Word document that I tried to send as an attachment).

Jens, to answer your questions:

1. Correct. In our setup the source of the data is a DB, from which we pull the data using DIH (search the list for my previous post "DIH - deleting documents, high performance (delta) imports, and passing parameters" if you want info about that). We were lucky enough to have the data sharded at the DB level before we started using Solr, so using the same shards was an easy extension. Note that we're not (yet...) using SolrCloud; it was just something I thought you should consider.

2. I got the idea for the aggregator from the Solr book (PACKT). I don't remember if that term was used in the book or if I made it up (if Google doesn't know it, I probably made it up...), but I think it conveys what this part of the puzzle does. As you said, this is simply a Solr instance which doesn't hold its own index, but shares the same schema as the slaves and masters. I actually defined the default query handler on this instance to include the shards parameter (see below), so the client doesn't have to know anything about the internal workings of the sharded setup. It just hits the aggregator load balancer with a regular query and everything is handled behind the scenes. This simplifies the client and allows me to change the architecture in the future (e.g. change the number of shards or their DNS names) without requiring a client change.

Sharded query handler:

    <requestHandler name="sharded" class="solr.SearchHandler" default="${aggregator:false}">
      <!-- default values for query parameters -->
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="shards">${slaveUrls:null}</str>
      </lst>
    </requestHandler>

All of our Solr instances share the same configs (solrconfig.xml, schema.xml, etc.), and different instances take different roles according to properties defined in solr.xml, which is generated by a script specifically for each Solr instance (the script has a map of which instances should be on which host, and has to be run once on each host). In this case, this is how the generated solr.xml looks:

    <solr sharedLib="../lib" persistent="true">
      <property name="name" value="aggregator" />
      <!-- just a name that appears in Solr management
           to make it easier to know which instance you're on -->
      <property name="aggregator" value="true" />
      <!-- this tells the instance it is an aggregator,
           so it should use the sharded request handler by default;
           masters and slaves have master/slave accordingly to define
           replication, a regular default search handler for slaves,
           and DIH on masters -->
      <property name="shardID" value="" />
      <!-- this is used by instances which are shards in order to determine which
           DB they should import from (masters)
           and which master they should replicate from (slaves) -->
      <property name="slaveUrls" value="long,list.of,shard.urls" />
      <!-- used by the sharded request handler -->
      <property name="HealthCheckDir" value="/data/servers/x_solr/aggregator/core0/conf" />
      <!-- used by load balancer to
           know if this instance is alive -->
      <cores adminPath="/admin/cores" defaultCoreName="prod">
        <core name="prod" instanceDir="core0/" />
        <!-- just one core for this instance;
             indexers have 2 cores, one prod and one for full reindex -->
      </cores>
    </solr>

Let me know if I can assist any further.

Ephraim Ofir
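A minimal sketch of what the aggregator buys the client, assuming hypothetical host names, the port 8983, and a /solr context path (none of these come from the thread). It builds the comma-separated value that would go into the slaveUrls property, then contrasts a plain query against the aggregator with a query that has to carry the shards parameter itself. The helper names are made up for illustration.

```python
from urllib.parse import urlencode


def slave_urls(shard_hosts, port=8983, path="solr"):
    """Join shard addresses into a comma-separated list (host:port/path),
    the shape used for the shards parameter. Hosts and port are assumptions."""
    return ",".join(f"{host}:{port}/{path}" for host in shard_hosts)


def aggregator_query(aggregator_base, q, rows=10):
    """Plain query URL: no shards parameter, because the aggregator's
    default request handler supplies it from its own configuration."""
    return f"{aggregator_base}/select?{urlencode({'q': q, 'rows': rows})}"


def explicit_query(base, q, shards, rows=10):
    """Query URL for a client that must know the shard layout itself."""
    params = {"q": q, "rows": rows, "shards": shards}
    return f"{base}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical load-balanced names, one per shard's slave pool.
    shards = slave_urls(["shard1-slaves.example", "shard2-slaves.example"])
    print(shards)
    print(aggregator_query("http://aggregator-lb.example:8983/solr", "title:solr"))
    print(explicit_query("http://shard1-slaves.example:8983/solr", "title:solr", shards))
```

With the aggregator in place, changing the number of shards only changes the value computed by slave_urls on the server side; the URL produced by aggregator_query stays the same.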
Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
Hello Ephraim,

thank you so much for the great Document/Scaling-Concept!! First, I think you really should publish this on the Solr wiki. This approach is not documented anywhere there and is not really obvious for newbies, and your document is great and explains it very well!

Please allow me two further questions regarding your document:

1.) Is it correct that by DB you mean the original data source of the data that is fed into the Solr Cloud for searching?

2.) Solr Aggregator: This term did not yield any Google results, but it is a very important aspect of your design (and this was the missing piece for me when thinking about Solr architectures). Is it correct that the aggregators are simply Tomcat instances with the Solr webapp deployed? These aggregators do not have their own index but only run the Solr webapp, and I access them via the ?shards= parameter, giving the shards I want to query? (So in the end they aggregate the data of the shards but do not have their own data.) This is really an important aspect that is not documented well enough in the Solr documentation.

Thank you very much!
Jens

2011/4/5 Ephraim Ofir ephra...@icq.com

Of course the attachment didn't get to the list, so here it is if you want it...

Ephraim Ofir

-----Original Message-----
From: Ephraim Ofir
Sent: Tuesday, April 05, 2011 10:20 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: Very very large scale Solr Deployment = how to do (Expert Question)?

I'm not sure about the scale you're aiming for, but you probably want to do both sharding and replication. There's no central server which would be the bottleneck. The guidelines should probably be something like:

1. Split your index into enough shards so it can keep up with the update rate.
2. Have enough replicas of each shard master to keep up with the rate of queries.
3. Have enough aggregators in front of the shard replicas so the aggregation doesn't become a bottleneck.
4. Make sure you have good load balancing across your system.

Attached is a diagram of the setup we have. You might want to look into SolrCloud as well.

Ephraim Ofir

-----Original Message-----
From: Jens Mueller [mailto:supidupi...@googlemail.com]
Sent: Tuesday, April 05, 2011 4:25 AM
To: solr-user@lucene.apache.org
Subject: Very very large scale Solr Deployment = how to do (Expert Question)?

Hello Experts,

I am a Solr newbie but have read quite a lot of docs. I still do not understand what would be the best way to set up very large scale deployments:

Goal (theoretical):
A) Index size: 1 petabyte (1 document is about 5 KB in size)
B) Queries: 10 queries per second
C) Updates: 10 updates per second

Solr offers:
1.) Replication = scales well for B), BUT A) and C) are not satisfied
2.) Sharding = scales well for A), BUT B) and C) are not satisfied (as I understand the sharding approach, everything goes through a central server that dispatches the updates and assembles the query results retrieved from the different shards. But this central server also has some capacity limits...)

What is the right approach to handle such large deployments? I would be thankful for just a rough sketch of the concepts so I can experiment/search further...

Maybe I am missing something very trivial, as I think some of the Solr users/use cases on the homepage are that kind of large deployment. How are they implemented?

Thank you very much!!!
Jens
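To put rough numbers on the goal in the original question and on the first two guidelines, here is a back-of-the-envelope sketch. The index size, document size, and query rate are the figures from the question (read as decimal units); the per-shard document capacity and per-replica query throughput are pure placeholders that would have to be measured on real hardware, and the function names are invented for this example. It relies on the fact that a distributed query fans out to every shard, so each shard's replica pool has to absorb the full query rate.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding on large counts."""
    return -(-a // b)


PETABYTE = 10**15       # decimal petabyte, as an assumption
DOC_BYTES = 5 * 10**3   # "about 5 KB" per document, from the question


def document_count(index_bytes=PETABYTE, doc_bytes=DOC_BYTES):
    """Approximate number of documents implied by index size and doc size."""
    return index_bytes // doc_bytes


def size_cluster(total_docs, docs_per_shard, target_qps, qps_per_replica):
    """Estimate shard and replica counts.

    docs_per_shard and qps_per_replica are hypothetical capacities,
    not figures from the thread.
    """
    shards = ceil_div(total_docs, docs_per_shard)
    # Every query touches every shard, so each shard needs enough
    # replicas to serve the whole query rate on its own.
    replicas = max(1, ceil_div(target_qps, qps_per_replica))
    return {
        "shards": shards,
        "replicas_per_shard": replicas,
        "query_instances": shards * replicas,
    }


if __name__ == "__main__":
    docs = document_count()
    print(docs)
    # Placeholder capacities: 100 million docs per shard, 50 qps per replica.
    print(size_cluster(docs, 100_000_000, target_qps=10, qps_per_replica=50))
```

The point of the sketch is only that the shard count is driven by data volume and update rate while the replica count is driven by query rate; the placeholder capacities should be replaced with measured values before drawing any conclusion.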
RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
Hi Jen,

Can you please forward the diagram attachment too that Ephraim sent. :-)

Thanks,
Tirthankar
Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
Hi Ephraim/Jen,

Can you share that diagram with all? It may really help all of us.

Thanks,
Isan Fulia.
Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
I third that request. Would greatly appreciate taking a look at that diagram!

Regards,
Jonathan