Thanks to the suggestions, I finally found a way to create a bad_words
file. First, I should have read the documentation more carefully: I would have
found that (as Gilles pointed out) in the 3.2 series one has to use the
htdump program to create an ASCII version of the dbs. Then, to create a
bad_words file, I just have to:
1) run rundig
2) run htdump
3) look at the db.worddump file and count the frequency of each word
4) put in the bad_words file all the words I don't want to be indexed
5) repeat steps 1)-4) until satisfied
I created a simple shell script to dump the dbs and count the word frequency.
My little contribution.
Maurizio
--- cut here ---
#! /bin/sh
#
# Word frequency in db.worddump
# To be used in conjunction with ht://Dig version 3.2.x

HTDUMP=/usr/local/htdig-3.2.0b5/bin/htdump
WORDDUMP=/var/htdig/db/db.worddump

# Make sure the htdump program is available
if [ ! -x "$HTDUMP" ]
then
    echo "File \"$HTDUMP\" does not exist or is not executable." >&2
    exit 1
fi

# Dump the dbs to ASCII (this writes db.worddump)
"$HTDUMP"

if [ ! -f "$WORDDUMP" ]
then
    echo "File \"$WORDDUMP\" does not exist." >&2
    exit 1
fi

# Count the word frequency: drop comment lines, keep the first
# (tab-separated) field, and sort so that uniq -c sees equal words together
sed '/^#/d' "$WORDDUMP" | cut -f1 | sort | uniq -c | sort -nr

exit 0
--- cut here ---
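
As a rough sketch of step 4), the frequency listing can be filtered to get
candidates for the bad_words file. This is only an illustration: the function
name bad_word_candidates and the threshold are my own assumptions, not part of
ht://Dig, and the sample data is made up. It expects "<count> <word>" lines,
which is what uniq -c prints.

```shell
#! /bin/sh
# Hypothetical helper (not part of ht://Dig): print every word whose
# count is at least the threshold given as the first argument.
# Input on stdin: lines of "<count> <word>", as printed by uniq -c.
bad_word_candidates() {
    min_count=$1
    awk -v min="$min_count" '$1 + 0 >= min + 0 { print $2 }'
}

# Example with made-up counts (not real ht://Dig output)
printf '%s\n' '   120 the' '    95 and' '     3 htdig' | bad_word_candidates 50
```

The candidates are best reviewed by hand before they are appended to the
bad_words file, since a frequent word may still be worth indexing.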