Re: What should focus be on hardware for solr servers?

2013-02-19 Thread Michael Della Bitta
I actually tried to get the Phoronix test suite working, but for some
reason, it was timing out when it was listing the suites of tests that
were available. Maybe I don't know enough about it...

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Tue, Feb 19, 2013 at 3:18 AM, Dotan Cohen  wrote:
>
> I'll suggest that the Phoronix team include some Solr tests in their
> suite. Solr seems like a perfect fit for Phoronix, and much more
> relevant for some readers than John the Ripper or Quake.


Re: What should focus be on hardware for solr servers?

2013-02-19 Thread Dotan Cohen
On Thu, Feb 14, 2013 at 5:54 PM, Michael Della Bitta
 wrote:
> My dual-core, HT-enabled Dell Latitude from last year has this CPU:
> model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> bogomips: 4988.65
>
> An m3.xlarge reports:
> model name : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz
> bogomips : 4000.14
>
> I tried running Geekbench and phoronix-test-suite and failed at both...
> Anybody have a favorite, free, CLI benchmarking suite?
>

I'll suggest that the Phoronix team include some Solr tests in their
suite. Solr seems like a perfect fit for Phoronix, and much more
relevant for some readers than John the Ripper or Quake.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: What should focus be on hardware for solr servers?

2013-02-14 Thread Otis Gospodnetic
You could run the Lucene benchmark module and compare. Or look at
ActionGenerator from Sematext on GitHub, which you could also use for
performance testing and comparison.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Feb 14, 2013 10:56 AM, "Michael Della Bitta" <
michael.della.bi...@appinions.com> wrote:

> Or perhaps we should develop our own, Solr-based benchmark...

RE: What should focus be on hardware for solr servers?

2013-02-14 Thread Toke Eskildsen
Steve Rowe [sar...@gmail.com] wrote:
> On Feb 14, 2013, at 11:24 AM, Walter Underwood  wrote:
> > Laptop disks are slower than the EC2 disks.

> My laptop disk is an SSD.

So it's not a disk? ...Sorry, couldn't resist.

Unfortunately Amazon only has two SSD-backed solutions and they are #3 and #2 
in terms of cost/hour (http://www.ec2instances.info/). To make matters worse, 
one of them has only 240GB of storage, which leaves the $3.10/hour for 2TB of 
SSD as the only choice right now.

At Berlin Buzzwords 2012 there was a very interesting talk about indexing 24 
billion tweets, with the clear conclusion that it was a lot cheaper to buy your 
own hardware (with SSDs) instead of going Amazon. At that point in time, for 
that kind of corpus yadda yadda. There's a recording at 
http://2012.berlinbuzzwords.de/sessions/you-know-search-querying-24-billion-records-900ms

Regards,
Toke Eskildsen

Re: What should focus be on hardware for solr servers?

2013-02-14 Thread Steve Rowe

On Feb 14, 2013, at 11:24 AM, Walter Underwood  wrote:
> Laptop disks are slower than the EC2 disks.

My laptop disk is an SSD.


Re: What should focus be on hardware for solr servers?

2013-02-14 Thread Michael Della Bitta
Just for the sake of comparison: http://www.ec2instances.info/

At the low end, EC2 CPUs come in 1, 2, 2.5, and 3.25 compute-unit sizes. An
m2.xlarge uses 3.25-unit CPUs, so one would have to step up to the
high-storage, high-I/O, or cluster-compute nodes to do better than that
at single-threaded tasks.

Good thing Solr isn't single threaded, or my company would be bankrupt! :)


Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Feb 14, 2013 at 11:24 AM, Walter Underwood
 wrote:
> Just using a single CPU (log processing with Python), my MacBook Pro (2GHz 
> Intel Core i7) is twice as fast as an m2.xlarge EC2 instance.
>
> Laptop disks are slower than the EC2 disks.
>
> EC2 is for quantity, not quality.
>
> wunder

Re: What should focus be on hardware for solr servers?

2013-02-14 Thread Walter Underwood
Just using a single CPU (log processing with Python), my MacBook Pro (2GHz 
Intel Core i7) is twice as fast as an m2.xlarge EC2 instance.

Laptop disks are slower than the EC2 disks.

EC2 is for quantity, not quality.

wunder

On Feb 14, 2013, at 5:10 AM, Jack Krupansky wrote:

> That raises the question of how your average professional notebook computer 
> (PC or Mac or Linux) compares to a garden-variety cloud server such as an 
> Amazon EC2 m1.large (or m3.xlarge) in terms of performance such as document 
> ingestion rate or how many documents you can load before load and/or query 
> performance starts to fall off the cliff. Anybody have any numbers? I mean, 
> is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel? 
> (With all the usual caveats that "it all depends" and "your mileage will 
> vary.") But the intent would be for a similar workload on both (like loading 
> the Wikipedia dump.)
> 
> -- Jack Krupansky

--
Walter Underwood
wun...@wunderwood.org





Re: What should focus be on hardware for solr servers?

2013-02-14 Thread Michael Della Bitta
Or perhaps we should develop our own, Solr-based benchmark...

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Feb 14, 2013 at 10:54 AM, Michael Della Bitta
 wrote:
> My dual-core, HT-enabled Dell Latitude from last year has this CPU:
> model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> bogomips: 4988.65
>
> An m3.xlarge reports:
> model name : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz
> bogomips : 4000.14
>
> I tried running Geekbench and phoronix-test-suite and failed at both...
> Anybody have a favorite, free, CLI benchmarking suite?


Re: What should focus be on hardware for solr servers?

2013-02-14 Thread Michael Della Bitta
My dual-core, HT-enabled Dell Latitude from last year has this CPU:
model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
bogomips: 4988.65

An m3.xlarge reports:
model name : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz
bogomips : 4000.14

I tried running Geekbench and phoronix-test-suite and failed at both...
Anybody have a favorite, free, CLI benchmarking suite?
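
In the absence of a real suite, a crude stand-in is to run the same short,
single-threaded, CPU-bound script on both machines and compare wall-clock
time. A minimal sketch (it exercises only integer arithmetic, so treat the
numbers as relative, not absolute):

import time

def bench(n=5000000):
    # a fixed amount of CPU-bound work
    start = time.time()
    total = 0
    for i in range(n):
        total += i * i % 7
    return time.time() - start

# best of three runs to smooth out noise from other processes
print("best of 3: %.2f s" % min(bench() for _ in range(3)))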

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Feb 14, 2013 at 8:10 AM, Jack Krupansky  wrote:
> That raises the question of how your average professional notebook computer
> (PC or Mac or Linux) compares to a garden-variety cloud server such as an
> Amazon EC2 m1.large (or m3.xlarge) in terms of performance such as document
> ingestion rate or how many documents you can load before load and/or query
> performance starts to fall off the cliff. Anybody have any numbers? I mean,
> is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel?
> (With all the usual caveats that "it all depends" and "your mileage will
> vary.") But the intent would be for a similar workload on both (like loading
> the Wikipedia dump.)
>
> -- Jack Krupansky


Re: What should focus be on hardware for solr servers?

2013-02-14 Thread Jack Krupansky
That raises the question of how your average professional notebook computer 
(PC or Mac or Linux) compares to a garden-variety cloud server such as an 
Amazon EC2 m1.large (or m3.xlarge) in terms of performance such as document 
ingestion rate or how many documents you can load before load and/or query 
performance starts to fall off the cliff. Anybody have any numbers? I mean, 
is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel? 
(With all the usual caveats that "it all depends" and "your mileage will 
vary.") But the intent would be for a similar workload on both (like loading 
the Wikipedia dump.)
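
For the ingestion-rate half of the question, a back-of-the-envelope
measurement is easy to script. A sketch (it assumes the requests library and
a Solr core at localhost:8983 that accepts JSON updates; the core name and
field names are placeholders):

import json, time, requests

URL = "http://localhost:8983/solr/collection1/update"  # placeholder core name

docs = [{"id": str(i), "title_t": "document %d" % i} for i in range(10000)]

start = time.time()
for i in range(0, len(docs), 1000):
    # Solr's JSON update format accepts a plain list of documents
    r = requests.post(URL, data=json.dumps(docs[i:i + 1000]),
                      headers={"Content-Type": "application/json"})
    r.raise_for_status()
requests.get(URL, params={"commit": "true"})  # one commit at the end
print("%.0f docs/sec" % (len(docs) / (time.time() - start)))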


-- Jack Krupansky

-----Original Message-----
From: Erick Erickson

Sent: Thursday, February 14, 2013 7:31 AM
To: solr-user@lucene.apache.org
Subject: Re: What should focus be on hardware for solr servers?

One data point: I can comfortably index and search the Wikipedia dump (11M
articles, 5M with text) on my MacBook Pro. Admittedly not heavy-duty
queries, but...

Erick






Re: What should focus be on hardware for solr servers?

2013-02-14 Thread Erick Erickson
One data point: I can comfortably index and search the Wikipedia dump (11M
articles, 5M with text) on my MacBook Pro. Admittedly not heavy-duty
queries, but...

Erick


On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro  wrote:

> Excellent, thank you very much for the reply!


Re: What should focus be on hardware for solr servers?

2013-02-13 Thread Matthew Shapiro
Excellent, thank you very much for the reply!



RE: What should focus be on hardware for solr servers?

2013-02-13 Thread Toke Eskildsen
Matthew Shapiro [m...@mshapiro.net] wrote:

> Sorry, I should clarify our current statistics.  First of all I meant 183k
> documents (not 183, woops). Around 100k of those are full fledged html 
> articles (not web pages but articles in our CMS with html content inside 
> of them),

If an article is around 10-30 pages (or the equivalent), this is still a small 
corpus.

> the rest of the data are more like key/value data records with a lot
> of attached meta data for searching.

If the number of unique categories (model, author, playtime, lix, 
favorite_band, year...) in the metadata is in the low hundreds, you should 
be fine.

> Also, what I meant by search without a search term is that probably 80%
> (hard to confirm due to the lack of stats given by the GSA) of our searches
> are done on pure metadata clauses without any searching through the content
> itself,

That clarifies a lot, thanks. So we have roughly speaking 4000*5 = 20,000 
queries/day ~= 14 queries/minute. Guessing wildly that your peak-time traffic is 
about 5 times that, we end up with about 1 query/second. That is a very light 
load for the Solr installation we're discussing.

> so for example "give me documents that have a content type of
> video, that are marked for client X, have a category of Y or Z, and was
> published to platform A, ordered by date published". 

That is a near-trivial query and you should get a reply very fast on modest 
hardware.
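
For what it's worth, that kind of request maps directly onto Solr filter
queries plus a sort. A sketch (the field names are invented for illustration;
each fq clause is cached independently, which helps a lot when the same
clauses recur):

import requests

params = {
    "q": "*:*",                      # no search term: match everything
    "fq": ["content_type:video",     # repeated fq params = ANDed filters
           "client_id:X",
           "category:(Y OR Z)",
           "platform:A"],
    "sort": "date_published desc",
    "wt": "json",
}
r = requests.get("http://localhost:8983/solr/collection1/select", params=params)
print(r.json()["response"]["numFound"])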

> The searches that use a search term are more like: use the same query from the 
> example as before, but find me all the documents that have the string "My Video" 
> in its title and description.

Unless you experiment with fuzzy matches and phrase slop, this should also be 
fast. Ignoring analyzers, there is practically no difference between a metadata 
field and a larger content field in Solr.

Your current search (guessing here) iterates over all terms in the content 
fields and takes a comparatively large penalty when a large document is 
encountered. The inverted index in Solr means that the search terms are looked 
up in a dictionary that refers to the documents they belong to. The penalty for 
having thousands or millions of terms in a field, as compared to tens or 
hundreds, is very small in an inverted index.
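
The idea in miniature (a toy sketch, nothing like Lucene's actual data
structures): the index is built once, and at query time each term is a single
dictionary lookup instead of a scan over every document's contents.

from collections import defaultdict

docs = {1: "my video about cats", 2: "annual report", 3: "my cat video"}

# build once: term -> set of document ids ("dictionary" plus postings)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# query time: one lookup per term, then an intersection; the cost no
# longer depends on how long each individual document is
print(index["my"] & index["video"])   # -> {1, 3}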

We're still in "any random machine you've got available"-land so I second 
Michael's suggestion.

Regards,
Toke Eskildsen

Re: What should focus be on hardware for solr servers?

2013-02-13 Thread Matthew Shapiro
That definitely will be a useful tool in this conversion, thanks.

On Wed, Feb 13, 2013 at 12:25 PM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> Ooops: https://code.google.com/p/solrmeter/

Re: What should focus be on hardware for solr servers?

2013-02-13 Thread Michael Della Bitta
Ooops: https://code.google.com/p/solrmeter/



Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Wed, Feb 13, 2013 at 12:25 PM, Michael Della Bitta
 wrote:
> Matthew,
>
> With an index that small, you should be able to build a proof of
> concept on your own hardware and discover how it performs using
> something like SolrMeter:


Re: What should focus be on hardware for solr servers?

2013-02-13 Thread Michael Della Bitta
Matthew,

With an index that small, you should be able to build a proof of
concept on your own hardware and discover how it performs using
something like SolrMeter:
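
(If SolrMeter doesn't fit, a few lines of Python make a crude load generator.
A sketch only -- the core name and queries are placeholders, and SolrMeter's
reporting is far better:)

import time, requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8983/solr/collection1/select"  # placeholder core
QUERIES = ["*:*", "title_t:video", "category:Y"]       # placeholder queries

def one_query(i):
    r = requests.get(URL, params={"q": QUERIES[i % len(QUERIES)], "wt": "json"})
    r.raise_for_status()
    return r.elapsed.total_seconds()

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:        # 8 concurrent clients
    latencies = list(pool.map(one_query, range(500)))
elapsed = time.time() - start
print("%.1f queries/sec, avg %.0f ms"
      % (len(latencies) / elapsed, 1000 * sum(latencies) / len(latencies)))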


Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game




Re: What should focus be on hardware for solr servers?

2013-02-13 Thread Matthew Shapiro
Thanks for the reply.

> If the majority of searches are exactly the same (e.g. the empty search),
> the result will be cached. If 5,683 searches/month is the real count, this
> sounds like a very low number of searches in a very limited corpus. Just
> about any machine should be fine. I guess I am missing something here.
> Could you elaborate a bit? How large is a document, how many do you expect
> to handle, what do you expect a query to look like, how should the result
> be presented?


Sorry, I should clarify our current statistics.  First of all I meant 183k
documents (not 183, woops).  Around 100k of those are full fledged html
articles (not web pages but articles in our CMS with html content inside of
them), the rest of the data are more like key/value data records with a lot
of attached meta data for searching.

Also, what I meant by search without a search term is that probably 80%
(hard to confirm due to the lack of stats given by the GSA) of our searches
are done on pure metadata clauses without any searching through the content
itself, so for example "give me documents that have a content type of
video, that are marked for client X, have a category of Y or Z, and was
published to platform A, ordered by date published".  The searches that use
a search term are more like: use the same query from the example as before,
but find me all the documents that have the string "My Video" in its title
and description.  From the way that the GSA provides us statistics (which
are pretty bare), it appears like they do not count "no search term"
searches in part of those statistics (the GSA is not really built for not
using search terms either, and we've had various issues using it in this
way because of it).

The reason we are using the GSA for this and not our MSSql database is
because some of this data requires multiple, and expensive, joins and we do
need full text search for when users want to use that option.  Also for
faceting.




RE: What should focus be on hardware for solr servers?

2013-02-13 Thread Toke Eskildsen
Matthew Shapiro [m...@mshapiro.net] wrote:

[Hardware for Solr]

> What type of hardware (at a high level) should I be looking for?  Are the
> main constraints disk I/O, memory size, processing power, etc...?

That depends on what you are trying to achieve. Broadly speaking, "simple" 
search and retrieval is mainly I/O bound. The easy way to handle that is to use 
SSDs as storage. However, a lot of people like the old-school solution and 
compensate for the slow seeks of spinning drives by adding RAM and doing 
warmup of the searcher or index files. So either SSD or RAM on the I/O side. If 
the corpus is non-trivial in size, that is, which brings us to...
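
One low-tech way to do the warmup-of-index-files part is simply to read the
index directory once before putting the box in service, so the OS page cache
is hot. A sketch (the path is a placeholder; plain cat or dd does the same
job):

import os

INDEX_DIR = "/path/to/solr/data/index"  # placeholder

for name in os.listdir(INDEX_DIR):
    # read each index file in 1 MB chunks and discard the data; the side
    # effect is that the OS page cache now holds the file contents
    with open(os.path.join(INDEX_DIR, name), "rb") as f:
        while f.read(1 << 20):
            pass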

> Right now we have about 183 documents stored in the GSA (which will go up a
> lot once we are on Solr since the GSA is limiting).  The search systems are
> used to display core information on several of our homepages, so our search
> traffic is pretty significant (the GSA reports 5,683 searches in the last
> month, however I am 99% sure this is not correct and is not counting search
> requests without any search terms, which consists of most of our search
> traffic).

If the majority of searches are exactly the same (e.g. the empty search), the 
result will be cached. If 5,683 searches/month is the real count, this sounds 
like a very low number of searches in a very limited corpus. Just about any 
machine should be fine. I guess I am missing something here. Could you 
elaborate a bit? How large is a document, how many do you expect to handle, 
what do you expect a query to look like, how should the result be presented?

Regards,
Toke Eskildsen