That definitely will be a useful tool in this conversion, thanks.

On Wed, Feb 13, 2013 at 12:25 PM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:
> Ooops: https://code.google.com/p/solrmeter/
>
> Michael Della Bitta
>
> ------------------------------------------------
> Appinions
> 18 East 41st Street, 2nd Floor
> New York, NY 10017-6271
>
> www.appinions.com
>
> Where Influence Isn’t a Game
>
>
> On Wed, Feb 13, 2013 at 12:25 PM, Michael Della Bitta
> <michael.della.bi...@appinions.com> wrote:
>
>> Matthew,
>>
>> With an index that small, you should be able to build a proof of
>> concept on your own hardware and discover how it performs using
>> something like SolrMeter:
>>
>>
>> On Wed, Feb 13, 2013 at 12:21 PM, Matthew Shapiro <m...@mshapiro.net> wrote:
>>
>>> Thanks for the reply.
>>>
>>>> If the bulk of the searches are exactly the same (e.g. the empty search),
>>>> the result will be cached. If 5,683 searches/month is the real count, this
>>>> sounds like a very low number of searches over a very limited corpus. Just
>>>> about any machine should be fine. I guess I am missing something here.
>>>> Could you elaborate a bit? How large is a document, how many do you expect
>>>> to handle, what do you expect a query to look like, how should the result
>>>> be presented?
>>>
>>> Sorry, I should clarify our current statistics. First of all, I meant 183k
>>> documents (not 183, whoops). Around 100k of those are full-fledged HTML
>>> articles (not web pages, but articles in our CMS with HTML content inside
>>> them); the rest of the data are more like key/value records with a lot of
>>> attached metadata for searching.
>>>
>>> Also, what I meant by "search without a search term" is that probably 80%
>>> (hard to confirm due to the lack of stats given by the GSA) of our searches
>>> are done on pure metadata clauses, without any searching through the content
>>> itself. For example: "give me documents that have a content type of video,
>>> are marked for client X, have a category of Y or Z, and were published to
>>> platform A, ordered by date published". The searches that do use a search
>>> term look much like that same query, but additionally ask for all documents
>>> that have the string "My Video" in their title and description. From the way
>>> the GSA provides us statistics (which are pretty bare), it appears it does
>>> not count "no search term" searches as part of those numbers (the GSA is not
>>> really built for searching without terms either, and we've had various
>>> issues using it this way because of it).
>>>
>>> The reason we are using the GSA for this, and not our MSSQL database, is
>>> that some of this data requires multiple, expensive joins, and we do need
>>> full-text search for when users want that option. Also for faceting.
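For reference, the metadata-only request described above maps almost directly onto Solr filter queries, with faceting handled in the same request. Below is a minimal SolrJ sketch, assuming a hypothetical core named "cms" and made-up field names (contentType, client, category, platform, publishedDate):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class MetadataQueryExample {
        public static void main(String[] args) throws Exception {
            // Core URL and field names below are hypothetical examples.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/cms");

            SolrQuery q = new SolrQuery("*:*");        // no search term: match everything
            q.addFilterQuery("contentType:video",      // metadata clauses become filter queries
                             "client:clientX",
                             "category:(Y OR Z)",
                             "platform:A");
            q.setSort("publishedDate", SolrQuery.ORDER.desc);
            q.setFacet(true);                          // facet counts come back with the same request
            q.addFacetField("category", "platform");
            q.setRows(20);

            QueryResponse rsp = solr.query(q);
            System.out.println("Hits: " + rsp.getResults().getNumFound());
        }
    }

The full-text variant keeps the same filters and only swaps the main query, e.g. new SolrQuery("title:\"My Video\" OR description:\"My Video\"").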
>>>
>>> On Wed, Feb 13, 2013 at 11:24 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>>>
>>>> Matthew Shapiro [m...@mshapiro.net] wrote:
>>>>
>>>> [Hardware for Solr]
>>>>
>>>>> What type of hardware (at a high level) should I be looking for? Are the
>>>>> main constraints disk I/O, memory size, processing power, etc.?
>>>>
>>>> That depends on what you are trying to achieve. Broadly speaking, "simple"
>>>> search and retrieval is mainly I/O bound. The easy way to handle that is to
>>>> use SSDs as storage. However, a lot of people like the old-school solution
>>>> and compensate for the slow seeks of spinning drives by adding RAM and
>>>> warming the searcher or index files. So: either SSD or RAM on the I/O side.
>>>> That is, if the corpus is non-trivial in size, which brings us to...
>>>>
>>>>> Right now we have about 183 documents stored in the GSA (which will go up a
>>>>> lot once we are on Solr, since the GSA is limiting). The search systems are
>>>>> used to display core information on several of our homepages, so our search
>>>>> traffic is pretty significant (the GSA reports 5,683 searches in the last
>>>>> month; however, I am 99% sure this is not correct and is not counting
>>>>> search requests without any search terms, which make up most of our search
>>>>> traffic).
>>>>
>>>> If the bulk of the searches are exactly the same (e.g. the empty search),
>>>> the result will be cached. If 5,683 searches/month is the real count, this
>>>> sounds like a very low number of searches over a very limited corpus. Just
>>>> about any machine should be fine. I guess I am missing something here.
>>>> Could you elaborate a bit? How large is a document, how many do you expect
>>>> to handle, what do you expect a query to look like, how should the result
>>>> be presented?
>>>>
>>>> Regards,
>>>> Toke Eskildsen
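On the warming point: Solr can prime its caches itself through firstSearcher/newSearcher listeners in solrconfig.xml, but a cruder external variant is simply to replay the handful of filter-only queries the homepages actually issue after each deploy or commit, so the first real visitors hit warm caches. A rough SolrJ sketch along those lines, with the core URL and filter clauses again made up for illustration:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class WarmupExample {
        public static void main(String[] args) throws Exception {
            // Core URL and filter combinations are hypothetical examples.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/cms");

            // The few filter combinations the homepages actually use.
            List<String[]> commonFilters = Arrays.asList(
                    new String[] {"contentType:video", "platform:A"},
                    new String[] {"contentType:article", "client:clientX"});

            for (String[] filters : commonFilters) {
                SolrQuery q = new SolrQuery("*:*");
                q.addFilterQuery(filters);
                q.setSort("publishedDate", SolrQuery.ORDER.desc);
                q.setRows(0);      // results are discarded; the point is to touch caches and index files
                solr.query(q);
            }
        }
    }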