That raises the question of how an average professional notebook computer (PC or Mac or Linux) compares to a garden-variety cloud server such as an Amazon EC2 m1.large (or m3.xlarge) on metrics such as document ingestion rate, or how many documents you can load before load and/or query performance starts to fall off a cliff. Anybody have any numbers? I mean, is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel? (With all the usual caveats that "it all depends" and "your mileage will vary.") But the intent would be to run a similar workload on both (like loading the Wikipedia dump).

-- Jack Krupansky

-----Original Message-----
From: Erick Erickson
Sent: Thursday, February 14, 2013 7:31 AM
To: solr-user@lucene.apache.org
Subject: Re: What should focus be on hardware for solr servers?

One data point: I can comfortably index and search the Wikipedia dump (11M
articles, 5M with text) on my MacBook Pro. Admittedly not heavy-duty
queries, but....

Erick


On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro <m...@mshapiro.net> wrote:

Excellent, thank you very much for the reply!

On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> Matthew Shapiro [m...@mshapiro.net] wrote:
>
> > Sorry, I should clarify our current statistics. First of all, I meant
> > 183k documents (not 183, whoops). Around 100k of those are full-fledged
> > html articles (not web pages, but articles in our CMS with html content
> > inside of them),
>
> If an article is around 10-30 pages (or the equivalent), this is still a
> small corpus.
>
> > the rest of the data are more like key/value data records with a lot
> > of attached meta data for searching.
>
> If the number of unique categories (model, author, playtime, lix,
> favorite_band, year...) in the meta data is in the lower hundreds, you
> should be fine.
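
As a sketch (field names invented for illustration, not from the actual
schema), that kind of metadata typically ends up as simple fields in Solr's
schema.xml:

    <field name="content_type"   type="string" indexed="true" stored="true"/>
    <field name="category"       type="string" indexed="true" stored="true"
           multiValued="true"/>
    <field name="published_date" type="date"   indexed="true" stored="true"/>

A few hundred such fields is unremarkable, which lines up with Toke's point.
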
>
> > Also, what I meant by search without a search term is that probably 80%
> > (hard to confirm due to the lack of stats given by the GSA) of our
> > searches are done on pure metadata clauses without any searching through
> > the content itself,
>
> That clarifies a lot, thanks. So we have roughly speaking 4000*5
> queries/day ~= 14 queries/minute. Guessing wildly that your peak time
> traffic is about 5 times that, we end up with about 1 query/second. That
> is a very light load for the Solr installation we're discussing.
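
Spelled out, Toke's estimate above is:

    4000 * 5                =  20,000 queries/day
    20,000 / 1,440 minutes ~=  14 queries/minute (~0.23 queries/second)
    0.23 * 5 (peak guess)  ~=  1.2 queries/second at peak
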
>
> > so for example "give me documents that have a content type of video,
> > that are marked for client X, have a category of Y or Z, and were
> > published to platform A, ordered by date published".
>
> That is a near-trivial query and you should get a reply very fast on
> modest hardware.
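
For illustration only (hypothetical field names, standard Solr 4.x URL),
that metadata-only query maps naturally onto filter queries:

    curl "http://localhost:8983/solr/collection1/select?q=*:*\
    &fq=content_type:video&fq=client:X&fq=category:(Y+OR+Z)\
    &fq=platform:A&sort=published_date+desc"

Each fq result is cached independently in Solr's filter cache, which is
part of why queries like this stay cheap under repetition.
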
>
> > The searches that use a search term are more like: use the same query
> > from the example as before, but find me all the documents that have the
> > string "My Video" in their title and description.
>
> Unless you experiment with fuzzy matches and phrase slop, this should
> also be fast. Ignoring analyzers, there is practically no difference
> between a meta data field and a larger content field in Solr.
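
Again purely as a sketch (same hypothetical fields), the free-text variant
just adds a main query on top of the same filters:

    curl "http://localhost:8983/solr/collection1/select?\
    q=title:%22My+Video%22+OR+description:%22My+Video%22\
    &fq=content_type:video&fq=client:X"

The fq clauses are unchanged, so their cached results are reused; only the
q part does new work.
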
>
> Your current search (guessing here) iterates all terms in the content
> fields and takes a comparatively large penalty when a large document is
> encountered. The inverted index in Solr means that search terms are
> looked up in a dictionary that refers to the documents they belong to.
> The penalty for having thousands or millions of terms in a field, as
> compared to tens or hundreds, is very small in an inverted index.
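
A toy sketch of that inversion in plain Python (nothing Solr-specific):

    # Build a minimal inverted index: term -> set of doc ids.
    docs = {1: "my video about cats", 2: "video editing tips"}
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)

    # A query is now a dictionary lookup, independent of document length:
    print(index.get("video", set()))  # -> {1, 2}

Lookup cost depends on the term dictionary, not on how long each document
is, which is why large content fields barely slow down term queries.
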
>
> We're still in "any random machine you've got available"-land, so I
> second Michael's suggestion.
>
> Regards,
> Toke Eskildsen

