At 17:31 on Friday, 26 Mar 2004, Duane Burchell wrote:


I've been scouring the htdig site and faq, and need help understanding the
most basic of things :


When a search is performed, where does the description that appears with
the result come from?

This depends on the settings in your configuration file (htdig.conf).



From the faq, people refer to Meta Descriptions and Excerpts. I'm
assuming the meta description is whatever is defined within the meta tags
of the html pages that are crawled. Where is this Excerpt information
gathered from?


These are all relevant attributes, and you should read up on them in the configuration file section of the web site:
description_meta_tag_names
max_meta_description_length
no_excerpt_text
no_excerpt_show_top
excerpt_length
excerpt_show_top
use_meta_description
max_head_length


How they are set will determine the answer to your question.

By default htdig stores 512 bytes of each document (starting from the top and excluding HTML markup).
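
As a rough sketch, an htdig.conf fragment spelling out the attributes listed above might look like this. The values shown are the documented defaults as far as I remember them, so treat them as assumptions and check the attribute reference for your version (not every attribute exists in every release).

```
# Sketch only: values are believed defaults, verify against your version's docs.
max_head_length:             512
excerpt_length:              300
excerpt_show_top:            false
no_excerpt_show_top:         false
no_excerpt_text:             <em>(None of the search words were found in the top of this document.)</em>
use_meta_description:        false
max_meta_description_length: 512
# description_meta_tag_names may only exist in newer (3.2.x) releases.
description_meta_tag_names:  description
```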


Also, I've seen the discussion of the use of the excerpt_show_top and no_excerpt_show_top configuration attributes. Dumb question : what exactly is considered the "top" of the html page?

The "top" is the text from the start of the HTML document onwards, added until the limit set in max_head_length is reached.



I've inherited a search server to administer (and have very limited experience with htdig), but here's the problem I'm having :

My search results have garbage in the descriptions.  Literally, it is
picking up image 'alt' values and whatnot, and displaying it as the
description.

So as a result, I get :

Title of matching document

... Info Frequently Asked Questions About Us Contact ...


which are clearly parts of our menu structure within that html page.


Anyone who can help me understand how htdig and htsearch grab this
information would have my eternal gratitude.


Simply put (I hope), htdig stores a fixed amount of text for each document found and links words to documents.

The default behaviour for the search results is to display links to the documents that contained the word, plus an excerpt from the stored text. (This can be changed in the results templates.)

The excerpt displayed in the results is a snippet of the stored text, with the search word in the middle, up to the number of characters set in "excerpt_length".

That is, the default is to try to display an excerpt containing the word. If "excerpt_show_top" is set to true then the excerpt will be the top of the document instead (what you see depends on how much of the document you have stored).

If the word isn't in the stored text of the page (it might come from the link to the page - see title_factor, where title is from <a href="" title="">) then "no_excerpt_text" and "no_excerpt_show_top" come into play. This means you will get either a no-excerpt message (the default is: (None of the search words were found in the top of this document.) ) or the top of the document.

If you set "use_meta_description" to true, the meta description for the page will be used instead of any excerpt. If the document has no meta description, htsearch tries to display an excerpt as described above.

Getting garbage in the results suggests that you're not storing enough text, or that you have "excerpt_show_top" set to true.

I like to set "use_meta_description" to true and store more of the document (increase max_head_length) and leave excerpt_show_top at its default of false.

Try adjusting those settings (you'll need to re-index the site to see the changes).
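
A minimal sketch of the combination I described, for htdig.conf. The max_head_length figure is only an illustrative value I picked, not a recommendation; choose one that covers enough of your pages to get past the navigation.

```
# Prefer the page's meta description when it has one.
use_meta_description: true
# Store more of each document than the 512-byte default (illustrative value).
max_head_length:      10000
# Keep the default so excerpts are built around the matched word.
excerpt_show_top:     false
```

After changing these, re-run the indexing (for example with the rundig script, if your installation uses it) before judging the results.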


To further refine the results you can also set:


"ignore_alt_text"
If set to true, htdig will not index text in the ALT attribute of IMG tags, nor include this text in excerpts.
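
In htdig.conf that would be a one-line change, something like the following (assuming your version supports this attribute; I believe it defaults to false):

```
# Skip IMG ALT text when indexing and when building excerpts.
ignore_alt_text: true
```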



If you have control over the pages you are indexing you can also make use of the "noindex_start" and "noindex_end" attribute settings, and add these markers to your pages:

<!--htdig_noindex-->

text not to be indexed

<!--/htdig_noindex-->

to prevent parts of your page (like navigation) from being indexed.

All this info is in the configuration section of the website.
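
For reference, the two markers shown above are, as far as I know, the default values of those attributes, so you normally don't need to set anything. A fragment that spells them out explicitly in htdig.conf would look like this (a sketch; check the attribute reference for your version):

```
# Text between these two markers is skipped by htdig.
noindex_start: <!--htdig_noindex-->
noindex_end:   <!--/htdig_noindex-->
```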



Hope that helps a bit.

Tony







