On Thu, 20 Jul 2006, Arya, Manish Kumar wrote:

>    I want to use htdig for searching syslog-ng log
> output. I have installed Apache, running on
> 127.0.0.1:8080 (with the document root set to the log
> directory), and in the htdig conf I have given this URL
> for indexing. A second Apache runs on the public
> interface to serve the htdig HTML/CGI.
>   My question is: can htdig handle 5-10 GB of data per
> day for indexing? I am planning to rebuild the index
> every 6 hours or so.

I have never worked with this much data on a daily basis, so there is 
little I can say about the specifics. In general, how much you can 
index in a given day is going to depend almost entirely on how much 
hardware you throw at the problem. About the only way you are going 
to get a definitive answer for your circumstances is to just start 
indexing and see what happens.

You should also keep in mind that htdig indexes at the file level. If 
you try to index very large log files, your results are going to be 
very coarse: a hit may tell you nothing more than that a term was 
found somewhere in a file that is tens or hundreds of megabytes in size.

>   Second question: I want to customize htdig to show
> the complete log message in the search output, meaning
> the message up to the terminating newline "\n", and I
> don't want to show the URL of each search result (it
> points at the 127.0.0.1 interface, so it is of no use).

If you want to retain all log messages in a manner that allows them to 
be displayed, you will need to set max_head_length very high (larger 
than your largest log file). This is going to result in an extremely 
large set of databases and significantly increase indexing time.
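For example, assuming your largest raw log file is around 100 MB, the 
relevant htdig.conf line would look roughly like this (max_head_length 
is in bytes; the value here is only an illustration, not a 
recommendation):

    max_head_length: 100000000

The excerpts database grows roughly in proportion to that value, so 
expect the disk and time costs I mentioned above.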

Picking out individual log messages for display in the results is not 
going to work out of the box. htdig doesn't know anything about what 
constitutes a log entry, and I don't think any notion of newlines is 
even retained in the excerpts. I am fairly certain that you would need 
to hack the code that handles display to even attempt what you need here.

About the only way I can think of to reasonably use ht://Dig for the 
type of task you are describing is to add an extra component that splits
each log file into individual files, one per log entry. Then index and
search on those files.
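
A rough sketch of such a splitter follows. The one-entry-per-line 
assumption and the file naming are mine, not anything htdig provides; 
syslog-ng normally writes one message per line, which is what this 
relies on:

    #!/usr/bin/env python3
    # split_log.py -- split a log file into one small file per entry,
    # so htdig can index and report hits at entry granularity
    # instead of whole-file granularity.
    import os
    import sys

    def split_log(log_path, out_dir):
        """Write each line of log_path to its own file under out_dir.

        Assumes one log entry per line, the usual syslog-ng layout."""
        os.makedirs(out_dir, exist_ok=True)
        with open(log_path, errors="replace") as log:
            for n, entry in enumerate(log):
                entry = entry.rstrip("\n")
                if not entry:
                    continue
                name = os.path.join(out_dir, "entry-%08d.txt" % n)
                with open(name, "w") as out:
                    out.write(entry + "\n")

    if __name__ == "__main__":
        split_log(sys.argv[1], sys.argv[2])

Be warned that at 5-10 GB a day this produces an enormous number of 
small files, which is a filesystem problem in its own right.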

There are various ways to map URLs if you can overcome the other issues.
See for example the documentation on url_rewrite_rules,
search_rewrite_rules, and url_part_aliases.
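
As an illustration only (check the attribute documentation for the 
exact regex escaping; the 127.0.0.1:8080 prefix is taken from your 
setup), search_rewrite_rules takes pairs of regular expressions and 
replacements applied to the URLs shown in search results, so something 
along these lines could strip the useless prefix:

    search_rewrite_rules: http://127\.0\.0\.1:8080/(.*)  \1

url_rewrite_rules works the same way but is applied at indexing time 
rather than at display time.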


Jim
