Note: Sorry if this was double posted, I sent it from the wrong email
address before.
Hi John,
Thanks for the tips. Currently I'm using these tunables for my indexer:
:max_buffer_memory => 204857600,
:max_buffered_docs => 1000000,
:merge_factor => 100000,
For some reason, if I set max_buffered_docs to 1000001 or higher, Ferret
segfaults, so I'm stuck at that.
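For context, here's roughly how I'm passing them in (the path is just a
placeholder):
require 'ferret'
index = Ferret::Index::Index.new(
  :path              => '/var/ferret/log_index',  # placeholder path
  :max_buffer_memory => 204857600,
  :max_buffered_docs => 1000000,
  :merge_factor      => 100000
)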
I wasn't aware that Ferret by default only indexes the first 10,000 terms, so
I will definitely have to change that for log-file-level indexing.
Could you maybe elaborate a little more on a couple of things:
First, I'm not really that knowledgeable about the tokenizing that's
happening. I looked through the docs and I think I understand the basics,
but I'm not sure how I would go about doing my own tokenizing to create
more meaningful tokens. Is a token basically the unit that can be searched
for? So if I had a token of "sometoken" and searched for "some", would it
find it? From what I can tell, I would have to subclass the TokenStream
class, implement "text=()" to split the input into my tokens, and then
have the "next" method return them in order. Is that correct?
Secondly, I'm not sure what you mean by looking at the term_vector to find
the position. If I do a search and get "Hits" back (
http://ferret.davebalmain.com/api/classes/Ferret/Search/Hit.html), I
thought all I got was the doc id and the score. Can you explain this a
little more?
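All I'm doing with the results at the moment is along these lines (:file is
just a stored field in my schema):
index.search_each('servfail') do |doc_id, score|
  puts "#{score}: #{index[doc_id][:file]}"
end
so I don't see where a position would come from.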
Thanks,
On Fri, Feb 22, 2008 at 6:52 PM, John Leach <[EMAIL PROTECTED]> wrote:
> Hi Chris,
>
> I've been toying with the idea of a Ferret log indexer for my Linux
> systems so this is rather interesting.
>
> Regarding performance of the one-Ferret-document-per-line approach, you
> should look into the various tunables. An obvious one is ensuring
> auto_flush is disabled, but the next most likely is :max_buffered_docs.
> By default, this is set to flush to the index every 10,000 documents, and
> your log file lines will be hitting that regularly. Also consider
> :max_buffer_memory.
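>
> A minimal sketch of what I mean (double-check the option names against
> your version):
>
> index = Ferret::Index::Index.new(
>   :path              => '/var/ferret/logs',  # wherever your index lives
>   :auto_flush        => false,               # don't flush on every add
>   :max_buffered_docs => 100_000              # default is 10,000
> )
> log_lines.each { |l| index << {:line => l} }  # log_lines: your parsed lines
> index.flush                                   # one flush at the end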
>
> As log files will often have lots of unique but "useless" terms (such as
> the timestamps), I'd recommend pre-parsing your log lines. If it's
> syslog files you're indexing, parse the timestamp, convert it to
> 200802221816 format, and add that as a separate untokenized field to the
> index. Cut it down to the maximum accuracy you'll need, as this will
> reduce the number of unique terms in the index (maybe you'll only ever
> need to find logs down to the day, not the hour and minute).
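>
> Something along these lines (an untested sketch; the field setup is done
> through FieldInfos, and the example line is made up):
>
> require 'ferret'
> require 'time'
>
> field_infos = Ferret::Index::FieldInfos.new
> field_infos.add_field(:stamp, :index => :untokenized, :store => :yes)
> index = Ferret::Index::Index.new(:field_infos => field_infos)
>
> line = 'Feb 22 18:16:05 lion sshd[123]: example line'
> time = Time.parse(line[0, 15])    # syslog stamps carry no year
> index << {:stamp => time.strftime('%Y%m%d%H%M'), # '%Y%m%d' for day accuracy
>           :line  => line}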
>
> Also, disable term vectors, as this will save disk space.
>
> I've also found using a field as the id is slooow, so avoid that
> (that's usually only done with the primary key from a database, though,
> so I doubt you're doing it).
>
> Regarding performance and index size for one Ferret document per log
> file: by default, Ferret only indexes the first 10,000 terms of each
> document, so it might only be faster because it's indexing less! Ditto
> for the index file size :S See the :max_field_length option.
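>
> Raising it is just another option to the index:
>
> index = Ferret::Index::Index.new(:max_field_length => 1_000_000) # default 10,000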
>
> Write your own custom stop words list to skip indexing hugely common
> words - this will reduce the size of your index.
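>
> StandardAnalyzer takes a stop word array, so something like this (the
> extra words here are made up; use whatever is noisy in your own logs):
>
> stop_words = Ferret::Analysis::FULL_ENGLISH_STOP_WORDS +
>              %w(connection session closed)
> analyzer = Ferret::Analysis::StandardAnalyzer.new(stop_words)
> index = Ferret::Index::Index.new(:analyzer => analyzer)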
>
> Consider writing your own Analyzer to do tokenization that reduces the
> number of unique terms. For example, take the following line from a log
> file on my system:
>
> Feb 21 05:13:10 lion named[15722]: unexpected RCODE (SERVFAIL) resolving '
> ns1.rfrjqrkfccysqlycevtyz.info/AAAA/IN': 194.168.8.100#53
>
> I'm not sure exactly how the default analyzer would tokenize this, but
> an ideal list of tokens would probably be:
>
> lion named unexpected RCODE SERVFAIL resolving
> ns1.rfrjqrkfccysqlycevtyz.info AAAA IN 194.168.8.100 53
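>
> A quick way to experiment is RegExpAnalyzer, which tokenizes on a regexp
> you supply. This pattern is only a rough first stab, the hostname is
> shortened here, and you'd want to strip the timestamp beforehand (see the
> date field tip above):
>
> pattern  = /\d+(?:\.\d+)+|[A-Za-z0-9][\w.]*\w|\w+/
> analyzer = Ferret::Analysis::RegExpAnalyzer.new(pattern, false) # keep case
>
> stream = analyzer.token_stream(:line,
>   "unexpected RCODE (SERVFAIL) resolving 'ns1.example.info/AAAA/IN'")
> while token = stream.next
>   puts token.text  # unexpected RCODE SERVFAIL resolving ns1.example.info AAAA IN
> end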
>
> If you still want to stick to one document per log file, you can use the
> term_vectors to find the offset of the match in the log file - then you
> just open the log file and jump to that position (store the log
> filename in the index). It does use a bit more disk space per term
> indexed, but it's useful!
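>
> Roughly like this: the field needs :term_vector => :with_positions_offsets,
> and the method names here are from memory, so check the IndexReader docs:
>
> tv   = index.reader.term_vector(doc_id, :content)
> term = tv.terms.find { |t| t.text == 'servfail' }  # your query term
> term.positions.each do |p|
>   offset = tv.offsets[p]                  # byte range of this occurrence
>   File.open(index[doc_id][:file]) do |f|  # :file = stored log filename
>     f.seek(offset.start)
>     puts f.read(200)                      # some context after the match
>   end
> end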
>
> Also, omitting norms will save 1 byte per field per document, a huge
> saving I'm sure you'll agree ;) Use :index => :yes_omit_norms.
>
> Um, I think I'm done. The Ferret shortcut book by the Ferret author
> covers all this stuff - it's cheap and good:
>
> http://www.oreilly.com/catalog/9780596527853/index.html
>
> John.
> --
> http://johnleach.co.uk
> http://www.brightbox.co.uk - UK/EU Ruby on Rails hosting
>
> On Thu, 2008-02-21 at 17:35 +0000, Chris TenHarmsel wrote:
> > Hi everyone,
> > I've been exploring using ferret for indexing large amounts of
> > production log files. Right now we have a homemade system for
> > searching through the logs that involves specifying a date/time range
> > and then grepping through the relevant files. This can take a long
> > time.
> >
> > My initial tests (on 2GB of log files) have been promising. I've taken
> > two separate approaches:
> > The first is loading each line of each log file as a "document". The
> > plus side is that a search gets you individual log lines as the
> > results, which is what I want. The downside is that indexing takes a
> > long, long time and the index is very large even when not storing the
> > contents of the lines. This approach is not viable for indexing all of
> > our logs.
> >
> > The second approach is indexing whole log files as documents. This is
> > relatively fast, 211 seconds for 2GB of logs, and the index size is a
> > nice 12% of the sample size. The downside is that after figuring out
> > which files match your search terms, you have to crawl through each
> > "hit" document to find the relevant lines.
> >
> > For the sake of full disclosure: at any given time we keep roughly 30
> > days of logs, which comes to about 800GB of log files. Each file is
> > roughly 15MB before it gets rotated.
> >
> > Has anyone else tackled a problem like this who can offer ideas on how
> > to go about searching these logs? The best idea I can come up with
> > (which I haven't implemented yet, so I have no real numbers) is to
> > index a certain number of log files by line, say the last 2 days, and
> > then another set by file (say the last week). This would give fast
> > results for the more recent logs; you would just have to be patient
> > with the slightly older ones.
> >
> > Any ideas/help?
> >