Re: [Ferret-talk] AAF Sorting by date - what am I doing wrong?

David Balmain Sun, 03 Sep 2006 21:26:10 -0700

On 9/4/06, Ian Zabel <[EMAIL PROTECTED]> wrote:
> Thanks for all the help, everyone.
>
> I am now using this statement in my model: acts_as_ferret :fields => {
> 'comment' => {}, :forum_id => {:index => :untokenized}, 'mod_type' =>
> {:index => :untokenized} , 'user_id' => {:index => :untokenized} ,
> 'ferret_created_at' => {:index => :untokenized} }
>
> I rebuilt the index, and sorting now seems to work properly with both
> "ferret_created_at" and "id", like so
>
> sort_fields = []
> sort_fields <<
> Ferret::Search::SortField.new("ferret_created_at",:reverse => :true)
> or
> sort_fields << Ferret::Search::SortField.new("id",:reverse => :true)
> Comment.find_by_contents("test", :sort => sort_fields, :limit => 5)
>
> Sorting by id is now MUCH faster, as well.


Great to hear.

> The only thing I notice now is that the index is MUCH larger. The index
> is now about 91MB, whereas before I changed the aaf settings for the
> model, it was about 20MB. I guess untokenized values take up a lot more
> space?


That can be correct, but it is surprising for your schema. For example,
imagine the following six documents:

    "one two three" (13-bytes)
    "one three two"
    "two three one"
    "two one three"
    "three one two"
    "three two one"

If you tokenized the fields you'd have three terms, "one" (3 bytes),
"two" (3 bytes) and "three" (5 bytes), and each term would use six bytes
to store the doc_ids of the documents it occurs in. So you'd have 3 +
3 + 5 + 3*6 = 29 bytes. Storing the fields as untokenized would take
13 bytes per field plus 1 byte to signify the document each field
occurs in, which would be (13 + 1) * 6 = 84 bytes. Of course this is a
simplification of what is really going on. There is a lot of
compression happening, and a lot of other data is stored as well, like
term positions, term frequencies and term vectors, as well as the
actual field data.
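
The arithmetic above can be checked with a few lines of plain Ruby. This
is only a sketch of the same simplified model (one byte per doc_id, no
compression, no positions or term vectors), not of what Ferret actually
writes to disk, and the variable names are made up for illustration:

```ruby
# Simplified size model from the example above. It ignores compression,
# term positions, term frequencies, term vectors and stored fields.
docs = [
  "one two three", "one three two", "two three one",
  "two one three", "three one two", "three two one"
]

# Tokenized: each distinct term is stored once, plus one byte for every
# document it occurs in (its doc_id list).
postings = Hash.new { |hash, term| hash[term] = [] }
docs.each_with_index do |doc, doc_id|
  doc.split.uniq.each { |term| postings[term] << doc_id }
end
tokenized_bytes = postings.sum { |term, doc_ids| term.bytesize + doc_ids.size }

# Untokenized: the whole field value is a single term, plus one byte for
# the document it occurs in. All six values here are distinct.
untokenized_bytes = docs.sum { |doc| doc.bytesize + 1 }

puts "tokenized:   #{tokenized_bytes} bytes"
puts "untokenized: #{untokenized_bytes} bytes"
```

Under this model, an untokenized field gets larger the more distinct
values it holds, since every distinct value becomes its own term.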

Now, if you want to save space, there are a few other parameters you
can set. You can start by discarding :term_vectors. These are used for
excerpts and match highlighting but are unnecessary in most cases.
Also, there is no need to store all your data. Often, the only fields
you'll want to store are the model IDs. If you never read a field's
value back from the Ferret index, don't bother storing it. So, for
example, :ferret_created_at could be:

    :ferret_created_at => {:index => :untokenized, :store => :no,
                           :term_vectors => :no}

Note also I recommend always using Symbols for your field names rather
than Strings.
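
Putting the two suggestions together, the field options could be built
as an ordinary Hash with Symbol keys. This is just a sketch: the constant
names are invented for illustration, the option spellings are copied
from this post, and applying the same options to every untokenized field
(not only :ferret_created_at) is an assumption you should check against
how you actually use those fields.

```ruby
# Options for fields that are only sorted or filtered on and never read
# back from the index. Option names are as written in this post.
UNSTORED_UNTOKENIZED = {
  :index => :untokenized, :store => :no, :term_vectors => :no
}.freeze

# Field list taken from the model quoted above, with Symbol keys
# throughout instead of a mix of Strings and Symbols.
FERRET_FIELDS = {
  :comment           => {},
  :forum_id          => UNSTORED_UNTOKENIZED,
  :mod_type          => UNSTORED_UNTOKENIZED,
  :user_id           => UNSTORED_UNTOKENIZED,
  :ferret_created_at => UNSTORED_UNTOKENIZED
}.freeze

FERRET_FIELDS.each do |name, options|
  puts "#{name.inspect} => #{options.inspect}"
end
```

In the model this would then be passed as
acts_as_ferret :fields => FERRET_FIELDS, followed by an index rebuild.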

Cheers,
Dave
