Matthew Shapiro [m...@mshapiro.net] wrote:

> Sorry, I should clarify our current statistics.  First of all I meant 183k
> documents (not 183, woops). Around 100k of those are full fledged html 
> articles (not web pages but articles in our CMS with html content inside 
> of them),

If an article is around 10-30 pages (or the equivalent), this is still a small 
corpus.

> the rest of the data are more like key/value data records with a lot
> of attached meta data for searching.

If the number of unique categories (model, author, playtime, lix, 
favorite_band, year...) in the meta data is in the lower hundreds, you should 
be fine.

> Also, what I meant by search without a search term is that probably 80%
> (hard to confirm due to the lack of stats given by the GSA) of our searches
> are done on pure metadata clauses without any searching through the content
> itself,

That clarifies a lot, thanks. So we have roughly speaking 4000*5 queries/day ~= 
14 queries/minute. Guessing wildly that your peak time traffic is about 5 times 
that, we end up with about 1 query/second. That is a very light load for the 
Solr installation we're discussing.
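For reference, the back-of-the-envelope arithmetic spelled out as a small Python sketch. The parameter names are mine, and the peak factor of 5 is the same wild guess as above, not a measurement:

```python
# Rough load estimate. 4000 and 5 are the two factors used above;
# peak_factor=5 is a guess, not something measured on your system.
def estimate_load(base_count=4000, multiplier=5, peak_factor=5):
    queries_per_day = base_count * multiplier
    queries_per_minute = queries_per_day / (24 * 60)
    peak_queries_per_second = queries_per_minute * peak_factor / 60
    return queries_per_day, queries_per_minute, peak_queries_per_second


if __name__ == "__main__":
    per_day, per_minute, peak_per_second = estimate_load()
    print(f"{per_day} queries/day, {per_minute:.1f} queries/minute, "
          f"{peak_per_second:.2f} queries/second at peak")
```

Plug in your own numbers once you have real statistics; the conclusion only changes if the result is off by a couple of orders of magnitude.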

> so for example "give me documents that have a content type of
> video, that are marked for client X, have a category of Y or Z, and was
> published to platform A, ordered by date published". 

That is a near-trivial query and you should get a reply very fast on modest 
hardware.
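As a rough sketch of what such a request could look like, here it is expressed as Solr parameters and built in Python. The field names (content_type, client, category, platform, date_published), the core name and the host are all made up, since I do not know your schema; q, fq and sort are the standard Solr parameters:

```python
from urllib.parse import urlencode


def metadata_query_params():
    """Parameters for the pure-metadata example; field names are hypothetical."""
    return [
        ("q", "*:*"),                     # no search term: match all documents
        ("fq", "content_type:video"),     # each fq is a separate filter query
        ("fq", "client:X"),
        ("fq", "category:(Y OR Z)"),
        ("fq", "platform:A"),
        ("sort", "date_published desc"),  # newest first
    ]


def build_select_url(base_url, params):
    """Append URL-encoded parameters to the /select handler of a core."""
    return base_url + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical host and core name.
    print(build_select_url("http://localhost:8983/solr/articles",
                           metadata_query_params()))
```

Putting each clause in its own fq lets Solr cache the filters independently, which suits a workload where the same metadata restrictions are repeated across many searches.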

> The searches that use a search term are more like use the same query from the 
> example as before, but find me all the documents that have the string "My 
> Video" 
> in it's title and description.

Unless you experiment with fuzzy matches and phrase slop, this should also be 
fast. Ignoring analyzers, there is practically no difference between a meta 
data field and a larger content field in Solr.

Your current search (guessing here) iterates over all terms in the content 
fields and takes a comparatively large penalty when a large document is 
encountered. Solr uses an inverted index: each search term is looked up in a 
dictionary, and the dictionary entry refers to the documents that contain the 
term. The penalty for having thousands or millions of terms in a field, as 
compared to tens or hundreds, is very small in an inverted index.
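To illustrate the idea, here is a toy inverted index in Python. It is a deliberately simplified sketch, not how Lucene/Solr is implemented: tokenization is plain whitespace splitting, and the search is an AND of the terms rather than a real phrase match.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One dictionary lookup per query term; document length plays no role here.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "My Video about cats",
        2: "my holiday photos",
        3: "video " + "filler " * 10000,  # a very large document
    }
    index = build_inverted_index(docs)
    print(search(index, "My Video"))
```

The cost of a search depends on the number of query terms and the size of their posting sets, not on how long the individual documents are. The large document only costs extra when the index is built.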

We're still in "any random machine you've got available"-land so I second 
Michael's suggestion.

Regards,
Toke Eskildsen
