On 9/8/06, Clare <[EMAIL PROTECTED]> wrote:
> Thanks David
>
> I will try both options. I am in fact doing some performance testing now.
> I have created a 100,000-record search result set and it takes around 5 seconds
> (end to end) on my internal server to be returned (with 1 user). I am
> only doing 6 significant searches on this set: one for the main results
> and one for each of the top level categories. This is only on my test server and
> not in the larger production server and I am happy with this
> performance. If, however, I were to do my second level category search
> that has around 40 nodes in it, that would be 40 searches. I am not sure
> how this would perform.
>
> What I am seeing is CPU hungry search but not memory hungry. This makes
> sense to me.
>
> Q - I have test data set up in my tests that has some random junk in it
> and then a word such as "fish" at the end. I am starting to think that
> I may have set up the test data wrong and should use a lot of different
> words in the result set, because I am sure that Ferret will cache the
> search. This would give me a false impression of search speed.
Firstly, searches don't get cached; only filters do. If you want to
cache the results from a query (which you would in this instance) then
you should use a QueryFilter.
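As a sketch of how that might look (the two-document index and field
names here are made up, and I'm assuming QueryFilter's usual
constructor, which takes a query):

require 'rubygems'
require 'ferret'
include Ferret

index = I.new
index << {:category => "fish",  :title => "rainbow trout"}
index << {:category => "boats", :title => "rowing boat"}

# Build the filter once and hang on to it; the set of documents
# matching the filter query is cached after the first search.
fish_filter = Search::QueryFilter.new(Search::TermQuery.new(:category, "fish"))

# Only documents that pass the (cached) filter come back.
index.search_each("title:trout OR title:boat", :filter => fish_filter) do |doc, score|
  puts index[doc][:title]  # => "rainbow trout"
end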
Secondly, I'm not sure exactly what you mean when you say your tests
have some random junk and then the word "fish". If you are putting data
like this into every document:
index << "asdlgkjhasd askdj asdg asdg asdg asdg lkjh asd fish"
then you probably should work on your test data. As far as search
performance goes, this will be no different from doing this:
index << "fish"
What is important to remember is that TermQueries (fish) perform a lot
better than BooleanQueries (fish AND rod) and PhraseQueries ("fishing
rod"), which in turn perform better than WildCardQueries (fi*), so you
should try these query types too.
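A quick way to compare them is to time one of each query type against
the same index (Benchmark is in the standard library; the tiny index
here is just for illustration):

require 'rubygems'
require 'ferret'
require 'benchmark'
include Ferret

index = I.new
index << {:title => "fishing rod"}
index << {:title => "fish"}

['title:fish',                 # TermQuery
 'title:fish AND title:rod',   # BooleanQuery
 'title:"fishing rod"',        # PhraseQuery
 'title:fi*'].each do |query|  # WildCardQuery
  secs = Benchmark.realtime { index.search(query) }
  puts "%-30s => %0.6f secs" % [query, secs]
end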
Here is a much better way to create random test data:
WORDS = %w{one two three}

def random_sentence(min_size, max_size)
  # Pick a random sentence length between min_size and max_size words.
  len = min_size + rand(max_size - min_size)
  sentence = []
  # Math.sqrt skews the pick towards the end of the WORDS array.
  len.times { sentence << WORDS[Math.sqrt(rand(WORDS.size * WORDS.size))] }
  sentence.join(" ")
end

10.times { puts random_sentence(10, 100) }
The Math.sqrt stuff makes sure that words aren't evenly distributed,
which is more realistic. Words appearing later in the WORDS array will
be much more common; with the three words above, "three" will come up
roughly five times as often as "one". Even better than this would be to
use a copy of the real data that you will be using, though.
> I will create more test data at the weekend, but my instinct is
> that your method outlined above may be faster.
>
> I have 5 top level categories and this will not change much. Depending
> on the search there will be a lot more results in one category than the
> rest after the initial search.
>
> Drilling into the second level categories, the most nodes I have in a
> single second level category is around 40 at the moment, although this is
> likely to be added to over time. The results again will not be normally
> distributed over the results set but assuming for now that they were and
> I had 500,000 records, and drilled into the second tier category
> structure I would have 100,000 records in this category. I would be
> doing 40 searches over 100,000 records.
>
> Q - What do you think will perform faster in this instance?
Impossible to say without testing. Both methods are pretty simple
though so I'd try both with a variety of search strings.
> I would love to have the time to build an x-dimensional memory resident
> result (bucket set) that kept all the results parameterised for all the
> categories, built at the initial time of the search. Would be memory
> hungry but would make searching through categories and nodes and
> parameters in subsequent searches lightning fast.
>
> Would be a great addition or am I missing something?
As far as I'm concerned, this functionality is already there with the
filter_proc parameter. Make it any less general than this and it isn't
much use any more. For example:
require 'rubygems'
require 'ferret'
include Ferret

# Index 100,000 documents, each with a zero-padded id and a random word.
index = I.new
words = %w{one two three four five}
100000.times do |i|
  index << {:id => "%05d" % i, :word => words[rand(words.size)]}
end

# The filter_proc sees every matching document, so it can group the
# full result set by word even though only one hit is returned.
groups = {}
filter_proc = lambda do |doc, score, searcher|
  word = searcher[doc][:word]
  (groups[word] ||= []) << doc
end

resultset = index.search("id:[09900 10000}", :limit => 1,
                         :filter_proc => filter_proc)
puts resultset.total_hits
puts groups.inspect
puts groups["two"].size
I really can't see how you could make it any easier than that.
> I am really interested in the performance testing scenarios. As stated
> above, I only have one word "FISH" in my test data, with random junk
> made up beforehand, e.g. "sadssderssdaatg FISH" etc.
>
> Q - Would I be better using more words in my test data?
See above.
> Also - I am interested in the round trip performance of search. The
> length of time it takes from when the user clicks on search and gets the
> results back. I will do this on the production server in the production
> environment. My rule of thumb is that it should not take longer than 8
> seconds to return the results or the user will refresh (even worse for
> performance). With one user on my test system with 6 searches over
> 100,000 records it takes 5 seconds at the moment.
5 seconds seems like a long time. Try optimizing your index and see
how you go then. The example above took 0.028109 seconds. Personally,
I would be worried about anything taking over 1 second, which is the
whole reason I wrote Ferret in C.
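Optimizing merges all of the segments in the index into a single
segment, which usually makes a noticeable difference to search times on
a large index. It is a one-liner (the path here is hypothetical):

index = Index::Index.new(:path => '/path/to/your/index')
index.optimize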
> I am expecting a large number of concurrent searches happening. I am
> defining concurrency as someone searching at the same time as another
> user is either searching or waiting for the results to be returned.
>
> Most testing tools that I can see only show you what is happening on the
> server. I am interested in the user's perspective.
>
> I had a thought of setting up a script that would open a number of
> browser sessions and do random searches concurrently, hammering
> the server to see when it 1) breaks search, 2) breaks something else, or 3)
> search goes over the 8 second limit.
>
> Q - Does anyone have any experience in this area? Even better, does
> anyone have a script to do this? If not, and I do write a script to do
> this, would it be of value to the greater community?
If I were you, I'd test plain old search performance before I tested
performance through a browser. And, again, it is pretty hard to
generalize a script like this since so many people have different
search needs. In my opinion, Ruby makes it easy enough to write this
from scratch each time.
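If you do end up writing one, a bare-bones version needs little more
than threads, net/http and benchmark. Something like this (the host,
port and query parameter are all made up; point it at your own search
action):

require 'net/http'
require 'benchmark'

WORDS = %w{fish rod boat trout}

# 10 concurrent "users", each doing 5 random searches and printing
# the round-trip time for each request.
threads = (1..10).map do
  Thread.new do
    5.times do
      word = WORDS[rand(WORDS.size)]
      secs = Benchmark.realtime do
        Net::HTTP.get('localhost', "/search?q=#{word}", 3000)
      end
      puts "#{word}: %0.3f secs" % secs
    end
  end
end
threads.each {|thread| thread.join}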
> Sorry for the long winded post. My search page and category search is
> the most critical part of my site and I am anal about its performance
> because if it does not work then my site will not work.
>
> Thanks once again for all your assistance. Sorry for any stupid or
> ignorant thoughts/remarks.
>
> Ferret rocks!
You're welcome,
Dave
> Clare
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk