Hi Gilles,
Thanks a ton for the information and for developing htdig. I have
installed it on our intranet newsgroup server and it works like
a charm.
I need further help on a small issue ...
Does it also index the contents inside an <A HREF .....> tag ?
I am using HyperNews for our newsgroup. Each page has a few buttons
for member list, subscription etc. Somehow some of these words (like
search, members, subscription) seem to be getting indexed.
I am attaching a portion of the HTML page that htsearch located
when I looked for the word "search". The only portion of the page
which had the word "search" is pasted here ...
-----------------------
<A HREF="http://www.imgnews.com/search.html"
onMouseOver="return window.help('Search for Messages');"
onMouseOut="return window.help('');">
<IMG SRC="http://www.imgnews.com/Icons/search.gif" BORDER=0
WIDTH=60 HEIGHT=17
ALT="Search"
></A>
-----------------------
Is there a way we can avoid this getting indexed ? I tried using
bad_words to filter this out .. but I guess that is not the correct
place ..
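To narrow down where the word could be coming from, here is a rough sketch in Python (standard library only) that lists every place in the snippet above where "search" occurs. It is only an illustration and says nothing about how htdig itself parses HTML; the class name SearchFinder and the variable names are made up for this example.

```python
from html.parser import HTMLParser

# The HTML fragment pasted above, unchanged.
SNIPPET = '''<A HREF="http://www.imgnews.com/search.html"
onMouseOver="return window.help('Search for Messages');"
onMouseOut="return window.help('');">
<IMG SRC="http://www.imgnews.com/Icons/search.gif" BORDER=0
WIDTH=60 HEIGHT=17
ALT="Search"
></A>'''


class SearchFinder(HTMLParser):
    """Hypothetical helper: records where a word occurs in tags and text."""

    def __init__(self, word):
        super().__init__()
        self.word = word.lower()
        self.hits = []  # (tag or "text", attribute name or text) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if value and self.word in value.lower():
                self.hits.append((tag, name))

    def handle_data(self, data):
        # Plain text between tags, i.e. what a browser would display.
        if self.word in data.lower():
            self.hits.append(("text", data.strip()))


finder = SearchFinder("search")
finder.feed(SNIPPET)
finder.close()
for tag, where in finder.hits:
    print(tag, where)
```

The word shows up only in attribute values (HREF, onMouseOver, SRC and ALT) and never in the visible text between tags, so whatever is being indexed must come from one of those attributes, with the ALT text being my first guess.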
Regards,
- Vikram
----------------------------------------------------------
Vikram Lele Imagine Technologies, Pune
91-20-605 2190
91-20-605 1913
On Thu, 6 Jul 2000, Gilles Detillieux wrote:
GD > According to Vikram Lele:
GD > > I am looking for a search engine for our intranet news group.
GD > > We are using HyperNews as our intranet newsgroup.
GD > >
GD > > Is htDig suitable for such applications ? As you know newsgroup
GD > > contents change almost on hourly basis .. so the search engine must be
GD > > capable of incremental indexing.
GD > >
GD > > From your site I couldn't figure if incremental indexing is supported.
GD > > Please advise.
GD >
GD > A number of ht://Dig users use it for indexing mailing lists, which
GD > would be quite similar, so perhaps some of them would care to comment
GD > on the issues involved.
GD >
GD > The htdig program is capable of incremental indexing, but it tends to
GD > recheck all indexed documents to see if they've changed, which in itself
GD > can mean a lot of overhead when you have lots of static content with a
GD > smaller proportion of new content. There are a few ways to avoid this.
GD > One is to make sure you index via the local filesystem rather than via
GD > HTTP, so the modification time checks for unchanged files would be very
GD > quick. Another is to break up your index into multiple databases for
GD > various "ages" of documents (e.g. one per month or one per year, depending
GD > on how far back you go), so that only the most recent database needs to
GD > be updated, and then you can merge them all to get a full search database.
GD >
GD > Depending on the size of the data you're indexing, this may or may not be
GD > an issue.
GD >
GD > --
GD > Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
GD > Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
GD > Dept. Physiology, U. of Manitoba Phone: (204)789-3766
GD > Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
GD >