Any inputs on this would be really helpful. Looking for
suggestions/viewpoints from you guys.
One area where you might have issues is with date range queries. If
you have many docs, then you can run into OOM errors. There was a
recent thread about this, where Yonik (and others) had some good
suggestions for ways to avoid this problem.
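For context, the usual fix that comes up in those threads is to index the timestamps at coarser precision, so a range query touches hundreds of unique date terms instead of millions. A minimal sketch, assuming you can preprocess documents before indexing (the function name is mine, not a Solr API):

```python
from datetime import datetime

def round_to_day(ts: datetime) -> datetime:
    """Drop sub-day precision from a timestamp before indexing it."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

# With millisecond-precision dates, a year of logs can contain millions of
# unique date terms; rounded to the day, it contains at most 366.
doc_date = round_to_day(datetime(2008, 12, 6, 21, 41, 17))
solr_value = doc_date.strftime("%Y-%m-%dT%H:%M:%SZ")  # Solr's date format
```

If you need finer-grained filtering occasionally, you can index a second, full-precision field and only query it over narrow windows.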
I don't know what the impact would be of merging results that use
date ranges - I'm guessing low, but Yonik would know best.
As to how well a 50-server configuration would work: at 10B docs across
50 boxes, that's about 200M docs per server, which is a large number
even if each doc is small (~1K). But real performance will be heavily
driven by the nature of the data and the types of queries.
You'll also need to think about how you distribute the data to avoid
skew, as performance is constrained by the worst case of any of the
searchers. With Nutch we wound up having to add termination logic to
avoid having long-running queries clog things up, primarily when
dealing with load.
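To make the skew point concrete, here's one way to sketch the routing: keep each week's docs on one box, but spread adjacent weeks across different boxes, so a multi-month query fans out rather than hammering one server. The host names and the routing function are hypothetical; only the comma-separated shards parameter at the end reflects Solr's actual distributed-search syntax:

```python
from datetime import date

def shard_for(day: date, hosts: list) -> str:
    """Route a document to a shard by ISO week, spreading weeks across hosts."""
    year, week, _ = day.isocalendar()
    return hosts[(year * 52 + week) % len(hosts)]

hosts = ["solr%02d:8983" % i for i in range(50)]  # hypothetical host names

# A 6-month query then touches ~26 boxes instead of one hot server.
# Solr's distributed search takes the shard list as a request parameter:
shards_param = ",".join(h + "/solr" for h in hosts)
```

The worst-case-searcher problem remains, though: however you route, the merge can't finish until the slowest shard answers.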
The best first step is to set up a single Solr box with representative
data and see how well it performs. My guess is that your issues will be
more about the limits of one box holding 200M docs than about the
distributed layer itself. Though keeping 50-60 servers alive and happy
is a significant ops task in its own right.
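When you run that single-box test, look at tail latencies, not just the average: since the merge waits on the slowest searcher, the distributed setup inherits each shard's worst case. A minimal helper for that (the latency numbers below are made up for illustration):

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile of a list of query latencies (milliseconds)."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

# In a real test you'd collect these by timing representative queries
# against the test box; these samples are invented.
samples = [12, 15, 18, 22, 30, 45, 60, 90, 250, 900]
p50, p95 = percentile(samples, 50), percentile(samples, 95)
```

If the p95/p99 on one box is already ugly, 50 boxes will make it worse, not better.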
Finally, you'd want to decide early on whether this is search or query.
In other words, is it OK if a result set is missing a doc because the
server holding it is down or timed out? If not, you're looking more at
a query-type solution, where Solr would be less interesting.
-- Ken
-----Original Message-----
From: souravm
Sent: Saturday, December 06, 2008 9:41 PM
To: solr-user@lucene.apache.org
Subject: Limitations of Distributed Search ....
Hi,
We are planning to use Solr for processing a large volume of
application log files (around 10 billion documents, 5-6 TB in total).
One approach we are considering is to use Distributed Search
extensively.
What we have in mind is distributing the log files across multiple
boxes on a monthly or weekly basis, where a single week's volume can
reach 200M documents. A search query can then span many weeks (e.g.,
the count of a given transaction for the first 6 months of a year).
However, we are not sure how well distributed search would scale when
we use around 50-60 boxes to hold the weekly indexes. The specific
questions I have in mind are:
a) What would the performance impact be when a query spreads over 50
boxes?
b) Is there any hard limit on the number of slaves that can be
contacted from the master server?
c) How much load will this approach create on the master server for
merging results and tracking whether a slave is down?
d) Are there any other manageability issues with so many slaves?
If any of you have deployed Solr in such an environment, it would be
great if you could share your experience.
Thanks in advance.
Regards,
Sourav
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"