Re: [Ferret-talk] Experience using ferret to index log files

John Leach Sat, 23 Feb 2008 05:33:16 -0800

Hi Chris,

On Fri, 2008-02-22 at 20:09 +0000, Chris TenHarmsel wrote:


> First, I'm not really that knowledgeable on the tokenizing that is
> happening.  I looked through the docs and I think I understand the
> basics, but I'm not even sure how I would go about doing my own
> tokenizing to create more meaningful tokens.  Is a token basically a
> thing that can be searched for? 

Tokenizing is splitting the input text into words that can be searched
for.  Sometimes you can just split the text up by whitespace, but I'm
thinking that log files might need some specific attention.

> So if I had a token of "sometoken" and searched for "some" would it
> find it?

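No, an exact search for "some" only matches the token "some".  If you
searched for "some*" instead, Ferret would first look through the
indexed tokens for ones matching the pattern (one of which would be
"sometoken"), then run the search on those matching tokens.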
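
To make that concrete, here is a rough plain-Ruby sketch of the idea.
It does not use Ferret at all: TERMS is a made-up stand-in for the
index's list of tokens, and expand_prefix and exact_match? are
hypothetical helper names, not Ferret methods.

```ruby
# Made-up stand-in for the tokens stored in an index.
TERMS = %w[error sometoken something token warning].freeze

# Expand a trailing-wildcard pattern such as "some*" into the
# tokens that start with the given prefix.
def expand_prefix(pattern, terms)
  prefix = pattern.chomp("*")
  terms.select { |term| term.start_with?(prefix) }
end

# An exact search only matches a token identical to the query.
def exact_match?(query, terms)
  terms.include?(query)
end

p exact_match?("some", TERMS)      # => false
p expand_prefix("some*", TERMS)    # => ["sometoken", "something"]
```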

You might write a clever tokenizer to recognise that "sometoken" was
actually two words without a space and return them as the separate
tokens "some" and "token".
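
One simple (and fairly naive) way to do that kind of splitting is to
check candidate split points against a list of known words.  This is
only a sketch in plain Ruby: WORDS and split_compound are names I made
up for illustration, and a real log tokenizer would need a better word
list and rules.

```ruby
require "set"

# Hypothetical list of known words; a real one would be much larger.
WORDS = Set.new(%w[some token log file]).freeze

# Try every split point and return the two halves if both are known
# words; otherwise return the text unchanged as a single token.
def split_compound(text, words)
  (1...text.length).each do |i|
    head = text[0...i]
    tail = text[i..-1]
    return [head, tail] if words.include?(head) && words.include?(tail)
  end
  [text]
end

p split_compound("sometoken", WORDS)  # => ["some", "token"]
p split_compound("kernel", WORDS)     # => ["kernel"]
```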

> From what I can tell, I would have to subclass the TokenStream class
> and implement "text=()" to split the input into my "tokens" and then
> have the "next" method just return them in order, correct?

I'm not sure off the top of my head, but that sounds about right.  You
then need to make an Analyzer class that uses your new tokenizer.  I
have an example but I've not got time to extract it right now, sorry!
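
In the meantime, here is the general shape as a plain-Ruby stand-in
that runs without Ferret installed.  LogToken, LogTokenStream and
LogAnalyzer are hypothetical names, and none of them inherit from
Ferret's classes.  With Ferret you would, as far as I remember,
subclass TokenStream and Analyzer from Ferret::Analysis and return
Ferret's own Token objects, so check the API docs before relying on
the details.

```ruby
# Stand-in for a token: the text plus its start/stop character offsets.
LogToken = Struct.new(:text, :start, :stop)

# Stand-in token stream with the two methods discussed above:
# text= splits the input, next hands the tokens back in order.
class LogTokenStream
  # Runs of letters and digits become tokens; punctuation such as
  # "[", "]" and ":" in log lines acts as a separator.
  TOKEN_RE = /[A-Za-z0-9]+/

  def initialize(text = "")
    self.text = text
  end

  def text=(text)
    @tokens = []
    text.scan(TOKEN_RE) do
      m = Regexp.last_match
      @tokens << LogToken.new(m[0].downcase, m.begin(0), m.end(0))
    end
    @index = 0
  end

  # Returns the next token, or nil once the stream is exhausted.
  def next
    token = @tokens[@index]
    @index += 1 if token
    token
  end
end

# Stand-in analyzer: returns a token stream for a field's text.
class LogAnalyzer
  def token_stream(_field, text)
    LogTokenStream.new(text)
  end
end

stream = LogAnalyzer.new.token_stream(:message, "sshd[4242]: Failed password")
while (tok = stream.next)
  puts "#{tok.text} #{tok.start}-#{tok.stop}"
end
```

The offsets are kept on each token because that is the information a
highlighter needs later on.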

> Secondly, I'm not sure what you mean by looking at the term_vector to
> find the position.    If I do a search and get
> "Hits" (http://ferret.davebalmain.com/api/classes/Ferret/Search/Hit.html) 
> back, I thought all I got was the doc id and the score.  Can you explain a 
> little more on this?

The term vector stores the offset of each match within the document
(byte position and length).  It's often used for highlighting search
matches.

I've not actually used them myself.  From a quick look at the API, it
sounds like they're used internally by the highlight method.  You can
get to them using some methods on the index_reader, which return
TermVector objects:

index_reader.term_vector(doc_id, field)

http://ferret.davebalmain.com/api/classes/Ferret/Index/TermVector.html
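
To show why stored offsets are handy for highlighting, here is a
small plain-Ruby sketch.  It does not use Ferret or its TermVector
class; term_offsets and highlight are hypothetical helpers, and the
offsets are character positions (the same as byte positions for plain
ASCII log lines).

```ruby
# Record [start, stop] offsets for every occurrence of each term.
def term_offsets(text)
  offsets = Hash.new { |hash, key| hash[key] = [] }
  text.scan(/\w+/) do
    m = Regexp.last_match
    offsets[m[0].downcase] << [m.begin(0), m.end(0)]
  end
  offsets
end

# Wrap each occurrence of the term in markers.  Working from the last
# occurrence backwards keeps the earlier offsets valid.
def highlight(text, offsets, term, open = "<b>", close = "</b>")
  result = text.dup
  offsets.fetch(term, []).reverse_each do |start, stop|
    result.insert(stop, close)
    result.insert(start, open)
  end
  result
end

line = "disk error on sda: error 5"
offsets = term_offsets(line)
p offsets["error"]                      # => [[5, 10], [19, 24]]
puts highlight(line, offsets, "error")  # disk <b>error</b> on sda: <b>error</b> 5
```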

John.

-- 
http://www.brightbox.co.uk - UK/EU Ruby on Rails Hosting
http://johnleach.co.uk

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
