Re: What should focus be on hardware for solr servers?

Michael Della Bitta Thu, 14 Feb 2013 07:56:13 -0800

Or perhaps we should develop our own, Solr-based benchmark...

Michael Della Bitta


------------------------------------------------
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Feb 14, 2013 at 10:54 AM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:
> My dual-core, HT-enabled Dell Latitude from last year has this CPU:
> model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> bogomips: 4988.65
>
> An m3.xlarge reports:
> model name : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
> bogomips : 4000.14
>
> I tried running geekbench and phoronix-test-suite and failed at both...
> Anybody have a favorite, free, CLI benchmarking suite?
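The two readings above look like lines from /proc/cpuinfo on Linux. A minimal sketch for pulling the same two fields out of that kind of text, so several machines can be compared side by side, might look like this. The function name `parse_cpuinfo` and the sample string are illustrative assumptions, and bogomips is only a kernel calibration value, not a real benchmark result.

```python
def parse_cpuinfo(text):
    """Return (model name, bogomips) for the first CPU entry in /proc/cpuinfo-style text."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip().lower()
        # Keep only the first occurrence of each field (one entry per logical CPU follows).
        if key in ("model name", "bogomips") and key not in info:
            info[key] = " ".join(value.split())  # collapse runs of spaces
    model = info.get("model name")
    bogomips = float(info["bogomips"]) if "bogomips" in info else None
    return model, bogomips


# Sample text copied from the m3.xlarge reading quoted above.
sample = """model name : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
bogomips : 4000.14"""

print(parse_cpuinfo(sample))
```

On a real Linux box the input would come from `open("/proc/cpuinfo").read()` instead of the sample string.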
>
> Michael Della Bitta
>
>
> On Thu, Feb 14, 2013 at 8:10 AM, Jack Krupansky <j...@basetechnology.com> 
> wrote:
>> That raises the question of how your average professional notebook computer
>> (PC or Mac or Linux) compares to a garden-variety cloud server such as an
>> Amazon EC2 m1.large (or m3.xlarge) in terms of performance such as document
>> ingestion rate or how many documents you can load before load and/or query
>> performance starts to fall off the cliff. Anybody have any numbers? I mean,
>> is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel?
>> (With all the usual caveats that "it all depends" and "your mileage will
>> vary.") But the intent would be for a similar workload on both (like loading
>> the Wikipedia dump.)
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Erick Erickson
>> Sent: Thursday, February 14, 2013 7:31 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: What should focus be on hardware for solr servers?
>>
>>
>> One data point: I can comfortably index and search the Wikipedia dump (11M
>> articles, 5M with text) on my Macbook Pro. Admittedly not heavy-duty
>> queries, but....
>>
>> Erick
>>
>>
>> On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro <m...@mshapiro.net> wrote:
>>
>>> Excellent, thank you very much for the reply!
>>>
>>> On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>>>
>>> > Matthew Shapiro [m...@mshapiro.net] wrote:
>>> >
>>> > > Sorry, I should clarify our current statistics.  First of all I meant 183k
>>> > > documents (not 183, whoops). Around 100k of those are full-fledged HTML
>>> > > articles (not web pages but articles in our CMS with HTML content inside
>>> > > of them),
>>> >
>>> > If an article is around 10-30 pages (or the equivalent), this is still a
>>> > small corpus.
>>> >
>>> > > the rest of the data are more like key/value data records with a lot
>>> > > of attached meta data for searching.
>>> >
>>> > If the number of unique categories (model, author, playtime, lix,
>>> > favorite_band, year...) in the meta data is in the lower hundreds, you
>>> > should be fine.
>>> >
>>> > > Also, what I meant by search without a search term is that probably > 80%
>>> > > (hard to confirm due to the lack of stats given by the GSA) of our searches
>>> > > are done on pure metadata clauses without any searching through the content
>>> > > itself,
>>> >
>>> > That clarifies a lot, thanks. So we have roughly speaking 4000*5
>>> > queries/day ~= 14 queries/minute. Guessing wildly that your peak time
>>> > traffic is about 5 times that, we end up with about 1 query/second. That is
>>> > a very light load for the Solr installation we're discussing.
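The load estimate above can be checked with a few lines of arithmetic. This is only a sketch of that back-of-the-envelope calculation: the factors 4000 and 5 and the peak multiplier of 5 are taken from the message as given, and the variable names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60

queries_per_day = 4000 * 5          # the two factors quoted in the message
avg_per_minute = queries_per_day / MINUTES_PER_DAY

peak_factor = 5                     # the "guessing wildly" peak multiplier
peak_per_second = avg_per_minute * peak_factor / 60

print(f"average: {avg_per_minute:.1f} queries/minute")
print(f"peak:    {peak_per_second:.2f} queries/second")
```

The average works out to a little under 14 queries per minute and the assumed peak to a bit over 1 query per second, which matches the rounded figures in the message.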
>>> >
>>> > > so for example "give me documents that have a content type of
>>> > > video, that are marked for client X, have a category of Y or Z, and were
>>> > > published to platform A, ordered by date published".
>>> >
>>> > That is a near-trivial query and you should get a reply very fast on
>>> > modest hardware.
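For a rough idea of what such a metadata-only request could look like in Solr, here is a sketch that builds the query string with the standard `q`, `fq`, `sort` and `wt` parameters. The field names (`content_type`, `client`, `category`, `platform`, `date_published`) are hypothetical, since the real schema is not shown in the thread, and the descending sort order is likewise an assumption.

```python
from urllib.parse import urlencode

# Hypothetical field names; the actual schema is not given in the thread.
params = [
    ("q", "*:*"),                        # no search term: match all documents
    ("fq", "content_type:video"),        # each fq is a separate filter query
    ("fq", "client:X"),
    ("fq", "category:(Y OR Z)"),
    ("fq", "platform:A"),
    ("sort", "date_published desc"),     # sort direction is an assumption
    ("wt", "json"),
]

query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to the select handler URL of whatever Solr instance is in use. Keeping each clause in its own `fq` lets Solr cache the filters independently, which suits a workload where the same metadata clauses recur.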
>>> >
>>> > > The searches that use a search term are more like: use the same query as in
>>> > > the previous example, but find me all the documents that have the string
>>> > > "My Video" in their title and description.
>>> >
>>> > Unless you experiment with fuzzy matches and phrase slop, this should also
>>> > be fast. Ignoring analyzers, there is practically no difference between a
>>> > meta data field and a larger content field in Solr.
>>> >
>>> > Your current search (guessing here) iterates all terms in the content
>>> > fields and takes a comparatively large penalty when a large document is
>>> > encountered. The inverted index in Solr means that the search terms are
>>> > looked up in a dictionary that refers to the documents they belong to. The
>>> > penalty for having thousands or millions of terms as compared to tens or
>>> > hundreds in a field in an inverted index is very small.
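The inverted-index point can be illustrated with a toy example. This is a deliberately simplified sketch, not how Lucene or Solr is implemented: it tokenizes on whitespace only, and the function names and sample documents are invented for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One dictionary lookup per query term, regardless of document length.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "My Video about Solr",
    2: "a long article with many many terms and one video",
    3: "my notes",
}
index = build_inverted_index(docs)
print(search(index, "my video"))
print(search(index, "video"))
```

The cost of a lookup depends on the number of query terms and the size of the matching posting sets, not on how long each document is, which is why a large content field is barely more expensive to search than a short metadata field.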
>>> >
>>> > We're still in "any random machine you've got available"-land so I second
>>> > Michael's suggestion.
>>> >
>>> > Regards,
>>> > Toke Eskildsen
>>>
>>
