Thanks to the suggestions, I finally found a way to create a bad_words
file. First, I should have read the documentation more carefully; I would have
found that (as Gilles pointed out) in the 3.2 series one has to use the
htdump program to create an ASCII version of the dbs. Then, to create a
bad_words file, I just have to:
1) run rundig
2) run htdump
3) look at the db.worddump file and count the frequency of each word
4) put in the bad_words file all the words I don't want to be indexed
5) repeat steps 1)-4) until satisfied

I created a simple shell script to dump the dbs and count the word frequency.
My little contribution.

Maurizio

--- cut here ---
#! /bin/sh
#
# Word frequency in db.worddump
# To be used in conjunction with ht://Dig version 3.2.x

HTDUMP=/usr/local/htdig-3.2.0b5/bin/htdump
WORDDUMP=/var/htdig/db/db.worddump

# Make sure the htdump program is there and can be executed
if [ ! -x "$HTDUMP" ]
then
    echo "File \"$HTDUMP\" does not exist or is not executable."
    exit 1
fi

# Dump the db.word file
$HTDUMP

# Make sure htdump actually produced the word dump
if [ ! -f "$WORDDUMP" ]
then
    echo "File \"$WORDDUMP\" does not exist."
    exit 1
fi

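# Count the word frequency: drop the comment lines, keep the word column,
# and sort so that uniq -c sees identical words on adjacent lines
sed '/^#/d' "$WORDDUMP" | cut -f1 | sort | uniq -c | sort -nr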

exit 0
--- cut here ---
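
For step 4), a small helper along these lines could pick out the candidates from the frequency list. This is only a sketch: the function name candidate_bad_words and the threshold are my own assumptions, not part of ht://Dig, and the result should still be reviewed by hand before it goes into the bad_words file.

```shell
#! /bin/sh
#
# Hypothetical helper (not part of ht://Dig): read "count word" lines,
# as printed by uniq -c, on standard input and print every word whose
# count is at least the threshold given as first argument.

candidate_bad_words() {
    # awk splits on blanks, so the leading spaces from uniq -c are harmless;
    # "+ 0" forces a numeric comparison of count and threshold
    awk -v min="$1" '$1 + 0 >= min + 0 { print $2 }'
}
```

If the script above is saved as, say, wordfreq.sh (an assumed name), something like `sh wordfreq.sh | candidate_bad_words 500` would list the words that occur 500 times or more; the right threshold depends on the size of the indexed site.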


