RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Ephraim Ofir
Hi all,
I'd love to share the diagram, just not sure how to do that on the list
(it's a Word document I tried to send as an attachment).

Jens, to answer your questions:
1. Correct, in our setup the source of the data is a DB from which we
pull the data using DIH (search the list for my previous post "DIH -
deleting documents, high performance (delta) imports, and passing
parameters" if you want info about that).  We were lucky enough to have
the data sharded at the DB level before we started using Solr, so using
the same shards was an easy extension.  Note that we're not (yet...)
using SolrCloud, it was just something I thought you should consider.
2. I got the idea for the aggregator from the Solr book (PACKT).  I
don't remember if that term was used in the book or if I made it up (if
Google doesn't know it, I probably made it up...), but I think it conveys
what this part of the puzzle does.  As you said, this is simply a Solr
instance which doesn't hold its own index, but shares the same schema as
the slaves and masters.  I actually defined the default query handler on
this instance to include the shards parameter (see below), so the client
doesn't have to know anything about the internal workings of the sharded
setup, it just hits the aggregator load balancer with a regular query
and everything is handled behind the scenes.  This simplifies the client
and allows me to change the architecture in the future (e.g. change the
number of shards or their DNS names) without requiring a client change.

Sharded query handler:

  <requestHandler name="sharded" class="solr.SearchHandler"
                  default="${aggregator:false}">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="shards">${slaveUrls:null}</str>
    </lst>
  </requestHandler>
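
To illustrate what this default handler saves the client, here is a minimal
Python sketch comparing the two request URLs. The host names (aggregator-lb,
shard1-lb, shard2-lb), the port, and the /solr/select path are assumptions
made up for the example, not our actual setup.

```python
from urllib.parse import urlencode

# Hypothetical names: replace with your own load balancer and shard URLs.
AGGREGATOR = "http://aggregator-lb:8080/solr/select"
SHARDS = ["shard1-lb:8080/solr", "shard2-lb:8080/solr"]


def explicit_shards_url(query: str) -> str:
    """Query URL when the client has to list the shards itself."""
    params = {"q": query, "shards": ",".join(SHARDS)}
    return AGGREGATOR + "?" + urlencode(params)


def aggregator_url(query: str) -> str:
    """Query URL when the handler defaults already carry the shards list."""
    return AGGREGATOR + "?" + urlencode({"q": query})


if __name__ == "__main__":
    print(explicit_shards_url("title:solr"))
    print(aggregator_url("title:solr"))
```

With the shards list in the handler defaults, the second form is all the
client ever needs, so the shard list can change without touching the client.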

All of our Solr instances share the same configs (solrconfig.xml,
schema.xml, etc.) and different instances take different roles according
to properties defined in solr.xml which is generated by a script
specifically for each Solr instance (the script has a map of which
instances should be on which host, and has to be run once on each host).
In this case, this is how the generated solr.xml looks:

<solr sharedLib="../lib" persistent="true">
   <!-- just a name that appears in Solr management,
        to make it easier to know which instance you're on -->
   <property name="name" value="aggregator" />

   <!-- this tells the instance it is an aggregator, so it should use
        the sharded request handler by default; masters and slaves have
        master/slave properties accordingly, to define replication, a
        regular default search handler for slaves, and DIH on masters -->
   <property name="aggregator" value="true" />

   <!-- this is used by instances which are shards in order to determine
        which DB they should import from (masters) and which master they
        should replicate from (slaves) -->
   <property name="shardID" value="" />

   <!-- used by the sharded request handler -->
   <property name="slaveUrls" value="long,list.of,shard.urls" />

   <!-- used by load balancer to know if this instance is alive -->
   <property name="HealthCheckDir"
             value="/data/servers/x_solr/aggregator/core0/conf" />

   <cores adminPath="/admin/cores" defaultCoreName="prod">
      <!-- just one core for this instance;
           indexers have 2 cores, one prod and one for full reindex -->
      <core name="prod" instanceDir="core0/" />
   </cores>
</solr>
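
For anyone who wants to try the same approach, a generator script could look
roughly like the Python sketch below. This is not our actual script: the
INSTANCES map, the host names, and the property values are hypothetical, and
it only emits the legacy solr.xml layout shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical map of host -> role properties; a real map would list every instance.
INSTANCES = {
    "host-a": {"name": "aggregator", "aggregator": "true",
               "slaveUrls": "shard1-lb:8080/solr,shard2-lb:8080/solr"},
    "host-b": {"name": "master1", "master": "true", "shardID": "1"},
}


def build_solr_xml(host: str) -> str:
    """Render a legacy-style solr.xml for the instance assigned to `host`."""
    solr = ET.Element("solr", sharedLib="../lib", persistent="true")
    # One <property> element per role setting of this instance.
    for key, value in INSTANCES[host].items():
        ET.SubElement(solr, "property", name=key, value=value)
    cores = ET.SubElement(solr, "cores", adminPath="/admin/cores",
                          defaultCoreName="prod")
    ET.SubElement(cores, "core", name="prod", instanceDir="core0/")
    return ET.tostring(solr, encoding="unicode")


if __name__ == "__main__":
    print(build_solr_xml("host-a"))
```

Keeping the role in solr.xml properties is what lets every instance share
the same solrconfig.xml and schema.xml.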


Let me know if I can assist any further.
Ephraim Ofir



Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread Jens Mueller
Hello Ephraim,

thank you so much for the great Document/Scaling-Concept!!

First I think you really should publish this on the solr wiki. This approach
is nowhere documented there and not really obvious for newbies and your
document is great and explains this very well!

Please allow me two further questions regarding your document:
1.) Is it correct, that you mean by DB the Origin-Data-Source of the data
that is fed into the Solr Cloud for searching?

2.) Solr Aggregator: This term did not yield any Google results, but is a
very important aspect of your design (and this was the missing piece for me
when thinking about Solr architectures): Is it correct that the
aggregators are simply Tomcat instances with the Solr webapp deployed?
These aggregators do not have their own index but only run the Solr webapp,
and I access them via the ?shards= parameter giving the shards I want to
query? (So in the end they aggregate the data of the shards but do not have
their own data.) This is really an important aspect that is not documented
well enough in the Solr documentation.

Thank you very much!
Jens


2011/4/5 Ephraim Ofir ephra...@icq.com

 of course the attachment didn't get to the list, so here it is if you
 want it...

 Ephraim Ofir


 -Original Message-
 From: Ephraim Ofir
 Sent: Tuesday, April 05, 2011 10:20 AM
 To: 'solr-user@lucene.apache.org'
 Subject: RE: Very very large scale Solr Deployment = how to do (Expert
 Question)?

 I'm not sure about the scale you're aiming for, but you probably want to
 do both sharding and replication.  There's no central server which would
 be the bottleneck. The guidelines should probably be something like:
 1. Split your index to enough shards so it can keep up with the update
 rate.
 2. Have enough replicates of each shard master to keep up with the rate
 of queries.
 3. Have enough aggregators in front of the shard replicates so the
 aggregation doesn't become a bottleneck.
 4. Make sure you have good load balancing across your system.
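
 Guidelines 1 to 3 can be turned into a back-of-envelope calculation. The
 Python sketch below is only that: the function name and all per-node
 capacities are hypothetical and would have to be measured on your own
 hardware, and it ignores how much index a single shard can hold.

```python
import math


def size_cluster(update_rate, query_rate, shard_update_capacity,
                 replica_query_capacity, aggregator_query_capacity):
    """Rough node counts from measured per-node rates (all per second)."""
    # Guideline 1: enough shards to keep up with the update rate.
    shards = math.ceil(update_rate / shard_update_capacity)
    # Guideline 2: a distributed query touches every shard, so each shard
    # needs enough replicas to absorb the full query rate.
    replicas_per_shard = math.ceil(query_rate / replica_query_capacity)
    # Guideline 3: enough aggregators so merging results is not the bottleneck.
    aggregators = math.ceil(query_rate / aggregator_query_capacity)
    return {"shards": shards,
            "replicas_per_shard": replicas_per_shard,
            "aggregators": aggregators}


if __name__ == "__main__":
    # Made-up rates purely to show the shape of the calculation.
    print(size_cluster(1000, 500, 250, 100, 200))
```

 With the made-up rates in the example this gives 4 shards, 5 replicas per
 shard and 3 aggregators; the point is the shape of the reasoning, not the
 numbers.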

 Attached is a diagram of the setup we have.  You might want to look into
 SolrCloud as well.

 Ephraim Ofir


 -Original Message-
 From: Jens Mueller [mailto:supidupi...@googlemail.com]
 Sent: Tuesday, April 05, 2011 4:25 AM
 To: solr-user@lucene.apache.org
 Subject: Very very large scale Solr Deployment = how to do (Expert
 Question)?

 Hello Experts,



 I am a Solr newbie but have read quite a lot of docs. I still do not
 understand what would be the best way to set up very large scale
 deployments:



 Goal (theoretical):

  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)

  B) Queries: 10 Queries/ per Second

  C) Updates: 10 Updates / per Second
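
 As a rough check on what goal A implies, the Python sketch below derives the
 document count and a lower bound on the number of shards. It assumes decimal
 units, that index size is simply document count times document size (a
 simplification), and a per-index ceiling of about 2^31 documents in Lucene.
 The function name is hypothetical.

```python
def min_shards_for(index_bytes: int, doc_bytes: int, max_docs_per_index: int):
    """Return (document count, minimum shard count) for a target index size."""
    docs = index_bytes // doc_bytes
    # Ceiling division: a partially filled shard still counts as a shard.
    shards = -(-docs // max_docs_per_index)
    return docs, shards


if __name__ == "__main__":
    # 1 PB at about 5 KB per document, with a ceiling of 2**31 - 1 documents
    # per index (an assumption about the underlying Lucene limit).
    docs, shards = min_shards_for(10**15, 5 * 10**3, 2**31 - 1)
    print(docs, shards)
```

 Under these assumptions that is 200 billion documents and at least 94
 shards from the document-count ceiling alone, before update or query load
 is considered.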




 Solr offers:

 1.) Replication = Scales well for B) BUT A) and C) are not satisfied


 2.) Sharding = Scales well for A) BUT B) and C) are not satisfied
 (= As I understand the sharding approach, everything goes through a central
 server that dispatches the updates and assembles the query results retrieved
 from the different shards. But this central server also has some capacity
 limits...)




 What is the right approach to handle such large deployments? I would be
 thankful for just a rough sketch of the concepts so I can
 experiment/search further...


 Maybe I am missing something very trivial as I think some of the Solr
 Users/Use Cases on the homepage are that kind of large deployments. How
 are they implemented?



 Thank you very much!!!

 Jens



RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread Tirthankar Chatterjee
Hi Jens,
Can you please also forward the diagram attachment that Ephraim sent? :-)
Thanks,
Tirthankar 



Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread Isan Fulia
Hi Ephraim/Jens,

Can you share that diagram with everyone? It may really help all of us.
Thanks,
Isan Fulia.



Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-05 Thread Jonathan DeMello
I third that request.

Would greatly appreciate taking a look at that diagram!

Regards,

Jonathan
