Re: AW: [htdig] irrelevant pages in search
Hartmut Steffin wrote:
> > Thanks for the answer,
> >
> > htmerge does not seem to honour the TMPDIR variable which
> > IS properly set
> this seems to be an individual problem on my machine. there is even a
> difference in running rundig from commandline (ok) and via cron/batch
> (erroneous)

It's not a plot against you, honest. :)

If you get different results from the command line and from cron, it
simply means that cron's environment is different from the shell's. You
might try setting the TMPDIR environment variable explicitly in the
crontab file and see if that improves things.

Good luck,
Doug
--
"Welcome to the desert of the real."
- Laurence Fishburne as Morpheus, "The Matrix"

To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You'll receive a message confirming the unsubscription.
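One way to see what cron actually hands to a program is to run a tiny helper from both the shell and the crontab and compare what it reports. A minimal sketch in C follows. The fallback to /tmp when TMPDIR is unset or empty is an assumption made for illustration, not something confirmed for htmerge or sort, and the names effective_tmpdir and report_tmpdir are made up here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return the directory a TMPDIR-aware program would likely use.
   Assumption: fall back to /tmp when the variable is unset or empty. */
const char *effective_tmpdir(const char *env_value)
{
    if (env_value != NULL && env_value[0] != '\0')
        return env_value;
    return "/tmp";
}

/* Print what this process sees; call it from a program started by the
   shell and again from one started by cron, then compare. */
void report_tmpdir(FILE *out)
{
    const char *raw = getenv("TMPDIR");

    fprintf(out, "TMPDIR is %s; effective temp dir: %s\n",
            raw ? "set" : "not set", effective_tmpdir(raw));
}
```

If the cron run reports "not set" while the shell run does not, the difference in behaviour is most likely down to the environment rather than to htmerge itself.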
Re: SV: [htdig] Foreign chars (Swedish)
According to Philippe Ramkvist-Henry:
> > Are the hits all capitalized, or do some of them have the lowercase ä?
> > Does this problem happen consistently with certain accented letters, and
> > not others? Do you have certain uppercase letters appearing in db.wordlist?
>
> With hits you mean the actual words from the document I guess. Well only
> those which are supposed to be capitalized are. For example: A search for
> "ättestupan" renders 0 hits while a search for "Ättestupan" renders 18.
> The word is in the documents always written as "Ättestupan" so this would
> be natural if the search was case sensitive.
> The problem is that "Åsa" and "åsa" give the exact same hits and it's
> also always referred to as "Åsa". The problem only exists (as far as I
> can test) for "äÄ".
>
> The db.wordlist only contains lowercase letters.

OK, so the word Ättestupan appears in there as ättestupan, correct?
Very strange. So searches for words containing Ä will find words with ä
in its place, as expected, but searches for words containing ä will match
neither ä nor Ä, is that right? I'm at a bit of a loss to explain it, but
at some point it would seem that htsearch is mangling the lower case ä.

Do you have any documents containing a lower case ä somewhere in a word,
and if so, does that word make it into db.wordlist correctly?

I still suspect a problem with ctype for your locale. Could you compile
and run the following C program on your system, and send me the output?
(Run it with the name of your locale, "sv", as an argument.) Does using
a locale of sv_SE (or even something else entirely like fr or fr_FR) make
any difference in your results?

And for the long-shot question, do your documents use ISO 8859-1 (Latin 1)
encoding, or are there some that use a 7-bit encoding for Sweden?
---
#include <stdio.h>
#include <ctype.h>
#include <locale.h>

/* Print the ctype classification of every 8-bit character code under
   the locale named on the command line. */
int main(int ac, char **av)
{
    int i;
    unsigned char c;

    if (ac > 1)
        setlocale(LC_ALL, av[1]);
    for (i = 0; i < 256; ++i)
    {
        printf("%3d 0x%02X: ", i, i);
        c = i;
        /* Show the character itself, or a ^X / ~X stand-in. */
        if (isprint(c))
            printf(" %c", c);
        else if (c < 0x80 && isprint(c ^ '@'))
            printf("^%c", c ^ '@');
        else if (isprint((c & 0x7F) ^ '@'))
            printf("~%c", (c & 0x7F) ^ '@');
        else
            printf("  ");
        /* One flag letter per ctype test, '-' where the test fails. */
        printf(" %c%c%c%c%c%c%c%c%c%c%c%c%c\n",
            isascii(c) ? 'A' : '-',
            isalpha(c) ? 'a' : '-',
            islower(c) ? 'l' : '-',
            isupper(c) ? 'u' : '-',
            isalnum(c) ? 'n' : '-',
            isdigit(c) ? 'd' : '-',
            isxdigit(c) ? 'x' : '-',
            isgraph(c) ? 'g' : '-',
            isprint(c) ? 't' : '-',
            ispunct(c) ? 'p' : '-',
            iscntrl(c) ? 'c' : '-',
            isspace(c) ? 's' : '-',
#ifdef isblank
            isblank(c) ? 'b' : '-'
#else
            '?'
#endif
            );
    }
    return 0;
}
---

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
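For comparison with the output of the program above: in ISO 8859-1 the upper-case letters occupy 0xC0 to 0xDE (except 0xD7, the multiplication sign), and each has its lower-case form 0x20 higher, so Ä (0xC4) should fold to ä (0xE4) just as Å (0xC5) folds to å (0xE5). A locale-independent sketch of that mapping follows; latin1_tolower is a hypothetical helper for checking results by hand and is not part of ht://Dig.

```c
/* Fold one ISO 8859-1 character code to lower case without consulting
   the locale's ctype tables. */
unsigned char latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + 0x20);
    /* 0xC0..0xDE are upper-case letters, except 0xD7 (multiplication sign). */
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char)(c + 0x20);
    return c;
}
```

If the locale's tables are healthy, tolower() under the sv locale would be expected to agree with this function for 0xC4 and 0xC5 alike; a disagreement for 0xC4 only would point at the ctype data.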
Re: [htdig] word_list columns
According to Aaron Turner:
> there are 6 columns in the wordlist file. Obviously col1 is the word.
> What are the others? (i, l, w, c a)

First field: indexed word (lower case)
i: doc ID (to match up with records in db.docs.index)
l: location of word in doc (0-1000, i.e. tenth of a percent units)
w: weight of word in searches
c: no. of occurrences of word in document, if > 1
a: index into anchor list in db.docdb record, to indicate which anchor
   name, if any, preceded this word

Fields are tab separated. All of this info gets put into db.words.db by
htmerge, so htsearch doesn't actually look at db.wordlist.

-- Gilles R. Detillieux
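For anyone who wants to read db.wordlist from their own tools, the field list above translates fairly directly into a small parser. This is only a sketch under the description given in this message (tab-separated fields, one-letter tags followed by a colon). The struct and function names are invented here, the buffer sizes are arbitrary, and defaulting a missing a: field to 0 is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* One record of db.wordlist, following the field list above. */
struct word_entry {
    char word[64];   /* first field: indexed word (lower case) */
    long doc_id;     /* i: */
    long location;   /* l: 0-1000 */
    long weight;     /* w: */
    long count;      /* c: only written to the file when > 1 */
    long anchor;     /* a: 0 when absent (assumption) */
};

/* Parse one tab-separated line. Returns 0 on success, -1 if the line
   has no word field. */
int parse_wordlist_line(const char *line, struct word_entry *e)
{
    char buf[256];
    char *field;

    strncpy(buf, line, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    buf[strcspn(buf, "\r\n")] = '\0';      /* drop the line ending */

    memset(e, 0, sizeof *e);
    e->count = 1;                          /* c: is omitted when 1 */

    field = strtok(buf, "\t");
    if (field == NULL)
        return -1;
    strncpy(e->word, field, sizeof e->word - 1);

    while ((field = strtok(NULL, "\t")) != NULL) {
        if (field[0] == '\0' || field[1] != ':')
            continue;                      /* not of the form "x:value" */
        switch (field[0]) {
        case 'i': e->doc_id   = strtol(field + 2, NULL, 10); break;
        case 'l': e->location = strtol(field + 2, NULL, 10); break;
        case 'w': e->weight   = strtol(field + 2, NULL, 10); break;
        case 'c': e->count    = strtol(field + 2, NULL, 10); break;
        case 'a': e->anchor   = strtol(field + 2, NULL, 10); break;
        default:  break;                   /* ignore unknown tags */
        }
    }
    return 0;
}
```

Unknown tags are skipped rather than treated as errors, so the sketch would keep working if a later version of the file carried extra fields.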
Re: [htdig] i need help on htdig database format
According to ronald:
> when htdig exports results from an index as textformat it generates two
> files. The files look like this :
>
> file1:
> 0 u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software
> a:0 m:936027636 s:373 h: h: l:940510479 L:2 I:373
> d:http://www.htdig.org/ www.htdig.org ht://Dig Search Software (yes,
> the developers use it) ht://Dig Parent Directory A:

First field: doc ID
u: URL of doc
t: doc title
a: doc state (refer to source)
m: date/time last modified, sec since 1970-01-01 00:00:00 UTC
s: doc size in bytes
h: doc head (excerpt of first max_head_length bytes of doc)
h: (2nd) meta description contents (this 2nd h is a bug - it really
   should be a unique value like D or something)
l: date/time document was indexed (sec since 1970)
L: no. of links doc has to other docs
I: "docImageSize" - has nothing to do with images, but seems to contain
   document size, and may be cumulative in some circumstances - can
   anyone else make any sense of this?
d: link descriptions - text of links to this doc, ^A separated
A: anchor names (bookmarks) in doc, ^A separated

All fields are tab (^I) separated. Sub-fields of d & A use ^A separator.
doc head field has all runs of white space (space, tab, newline, etc.)
collapsed to single spaces.

> file2:

This is db.wordlist...

> 01oct99    i:115  l:0    w:100998  c:2
> 01oct99    i:116  l:0    w:100998  c:2
> 01oct99    i:45   l:6    w:100381  c:2
> 01oct99    i:46   l:0    w:100998  c:2
> 02aug1999  i:48   l:361  w:639     a:2
> 02jun1999  i:50   l:262  w:1382    c:2  a:2
> 02mar1999  i:53   l:378  w:622     a:2
> 02may1999  i:51   l:280  w:1349    c:2  a:2

First field: indexed word (lower case)
i: doc ID (to match up with records from above)
l: location of word in doc (0-1000, i.e. tenth of a percent units)
w: weight of word in searches
c: no. of occurrences of word in document, if > 1
a: index into "A:" list above, to indicate which anchor name, if any,
   preceded this word

-- Gilles R. Detillieux
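The m: and l: fields described above are plain counts of seconds since 1970-01-01 00:00:00 UTC, so they can be turned into readable dates with the standard C library. A small sketch follows; format_epoch_utc is a name chosen here for illustration, and the output layout is just one reasonable choice.

```c
#include <time.h>

/* Format a seconds-since-1970 value (as found in the m: and l: fields)
   as a UTC date and time. Returns 0 on success, -1 on failure. */
int format_epoch_utc(long seconds, char *out, size_t outlen)
{
    time_t t = (time_t)seconds;
    struct tm *utc = gmtime(&t);

    if (utc == NULL)
        return -1;
    if (strftime(out, outlen, "%Y-%m-%d %H:%M:%S", utc) == 0)
        return -1;
    return 0;
}
```

For the m:936027636 value in the sample record this gives 1999-08-30 15:40:36 (UTC).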
SV: [htdig] Foreign chars (Swedish)
> Are the hits all capitalized, or do some of them have the lowercase ä?
> Does this problem happen consistently with certain accented letters, and
> not others? Do you have certain uppercase letters appearing in db.wordlist?

With hits you mean the actual words from the document I guess. Well only
those which are supposed to be capitalized are. For example: A search for
"ättestupan" renders 0 hits while a search for "Ättestupan" renders 18.
The word is in the documents always written as "Ättestupan" so this would
be natural if the search was case sensitive.
The problem is that "Åsa" and "åsa" give the exact same hits and it's
also always referred to as "Åsa". The problem only exists (as far as I
can test) for "äÄ".

The db.wordlist only contains lowercase letters.

> > I asked a guy here at the University and he said that there might be
> > complications with "unsigned char" and "char". He gave me the example
> > below. Please answer at a novice level, my C++ and Unix knowledge is very
> > limited.
>
> Good hunch, but given that some accented letters work and some give
> problems, I wouldn't expect that it's a problem with sign extension.
> This seems to point to a problem with the ctype tables for your locale,
> but there could be something else that I'm missing here. Please keep
> us posted.

I'm also looking for a synonym wordlist in Swedish... If anyone has one,
please send me a copy.
[htdig] word_list columns
there are 6 columns in the wordlist file. Obviously col1 is the word.
What are the others? (i, l, w, c a)

--
Aaron Turner, Core Developer       http://vodka.linuxkb.org/~aturner/
Linux Knowledge Base Organization  http://linuxkb.org/
Because world domination requires quality open documentation.
aka: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Re: [htdig] Foreign chars (Swedish)
According to Philippe Ramkvist-Henry:
> I'm having problems with some foreign chars when using htdig to index and
> search a Swedish site. The locale is set right (sv) and is working in
> other applications. The problem I have is somewhat weird, maybe it has
> something to do with "uppercase" "lowercase"?
>
> Well, I can search words like "Åsa,åsa,Öl,öl" and get the same matches.
> But when I try to search "bäst" I get no hits. With "bÄst" I get several
> hits...

Are the hits all capitalized, or do some of them have the lowercase ä?
Does this problem happen consistently with certain accented letters, and
not others? Do you have certain uppercase letters appearing in db.wordlist?

> I asked a guy here at the University and he said that there might be
> complications with "unsigned char" and "char". He gave me the example
> below. Please answer at a novice level, my C++ and Unix knowledge is very
> limited.

Good hunch, but given that some accented letters work and some give
problems, I wouldn't expect that it's a problem with sign extension.
This seems to point to a problem with the ctype tables for your locale,
but there could be something else that I'm missing here. Please keep
us posted.

> htlib/StringMatch.cc
>
>     while ((unsigned char)string[pos])
>     {
>         new_state = table[trans[string[pos]]][state];
>
> Should be? or?
>
>     while (string[pos])

You don't need to take off the type cast on the "while" condition above,
but the trans[] array subscript below definitely should be type cast!
I'll fix this in the source. However, this seems to be a problem only in
the StringMatch::Compare() method, which isn't used for looking at words
in documents or in the database. It only affects a few internal
ASCII-only string matches, and the robots.txt disallow comparisons, so
unless you use upper-half characters in URLs, this bug shouldn't be a
problem (which explains how it's evaded detection this long).

>     {
>         new_state = table[trans[(unsigned char)string[pos]]][state];

-- Gilles R. Detillieux
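To make the sign-extension point concrete at a novice level: on systems where plain char is signed, a byte such as 0xE4 (ä) has a negative value, and using it directly as an array subscript reads memory before the start of the table. Casting to unsigned char first keeps the subscript in the range 0 to 255. The sketch below is a stand-alone illustration of that cast and is not the StringMatch code; translate and init_identity are names made up for the example.

```c
/* Map one character of a string through a 256-entry translation table.
   The cast matters: without it, a signed char holding 0xE4 would be
   used as a negative index and read outside the table. */
unsigned char translate(const unsigned char trans[256],
                        const char *string, int pos)
{
    return trans[(unsigned char)string[pos]];
}

/* Fill a table with the identity mapping, for demonstration only. */
void init_identity(unsigned char trans[256])
{
    int i;

    for (i = 0; i < 256; ++i)
        trans[i] = (unsigned char)i;
}
```

With an identity table, translating the second byte of "bäst" in Latin 1 (0xE4) returns 0xE4; the same lookup without the cast would have undefined behaviour wherever char is signed.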
Re: [htdig] Exclude URLs from search
According to Jason Carvalho:
> I currently use the following feature in my search form:
>
> Personal pages:
> Include
> Exclude
>
> This enables people to exclude personal/public pages from their
> search.
>
> I would now like to add an additional feature which enables people to
> exclude another area from their search (/cww/).
>
> Has anybody used multiple excludes in their search forms before? I
> would be interested to know how it is done.

I believe this has been done before, for either restrict or exclude,
using radio buttons. You could probably also define multiple select
lists for the exclude parameter. Either should work as long as you have
htsearch 3.1.2 or later.

-- Gilles R. Detillieux
Re: [htdig] Rundig
According to Jason Carvalho:
> When I run 'rundig', it crawls my web site then when it comes to the
> merge stage, it outputs:
>
> Deleted, no excerpt :2156 http://ww...etc. for loads of my pages.
>
> All in all, it found about 9500 pages but only merged 7500, giving the
> above message for the rest.
>
> What does this mean?

The two most common causes are:

a) the document contained no text, or the text was excluded by noindex
   meta tags, or
b) the document was disallowed by the server's robots.txt file.

If you ran htdig or rundig with -vvv, then htdig's output should give
you more of an indication of which situation arose with these pages.

-- Gilles R. Detillieux
Re: [htdig] parse_doc.pl alterations
According to David Adams:
> I have downloaded the parse_doc.pl script, and the xpdf and catdoc
> utilities, and I am now using them to extend our search index to include
> Word and PDF files. It all works well and with a bit of alteration to
> the Perl script does exactly what I want. My thanks to the developers!

I forgot to ask before, what were your alterations? Something very
specific to your needs, or something worth sharing with others?

-- Gilles R. Detillieux
Re: [htdig] WordPerfect parser?
According to David Adams:
> I have downloaded the parse_doc.pl script, and the xpdf and catdoc
> utilities, and I am now using them to extend our search index to include
> Word and PDF files. It all works well and with a bit of alteration to
> the Perl script does exactly what I want. My thanks to the developers!
>
> We also have a need to index WordPerfect documents, including those
> produced by WP 6.1 and later. Can anyone recommend a utility that will
> run under IRIX 6.5 ?

I haven't come across any open source/freeware WP to text converters.
The reason I put the WP hooks in there originally was because some sites
had .doc files that were WP rather than Word documents, and the WP
documents caused catdoc to blow chunks. Same story for .doc files in RTF
format. I then realised there are all sorts of .doc files that aren't
MS-Word, so I put in explicit checks for MS-Word magic numbers rather
than using catdoc by default, but still kept the WP and RTF hooks in by
way of example.

If WordPerfect for UNIX is available for IRIX, and it contains the cvt
utility as WP for Linux does, you could write a script that uses that,
or adapt the parse_doc.pl script to use it directly. Its usage is:

    /usr/local/wplinux/shbin10/cvt -l file.wpd file.txt asci > /dev/null

-- Gilles R. Detillieux
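The idea of checking magic numbers before picking a converter can be sketched as follows. The signatures used are the commonly documented ones (the OLE2 compound-file header used by Word and other Office formats, the WordPerfect "\377WPC" prefix, and the "{\rtf" RTF prefix). They are not necessarily the exact checks that parse_doc.pl performs, and sniff_doc and the enum names are invented for this example.

```c
#include <stddef.h>
#include <string.h>

enum doc_kind { DOC_UNKNOWN, DOC_MSWORD_OLE, DOC_WORDPERFECT, DOC_RTF };

/* Guess a document type from its leading bytes rather than from its
   file name extension. */
enum doc_kind sniff_doc(const unsigned char *buf, size_t len)
{
    /* OLE2 compound file: container for Word (and other Office) files. */
    static const unsigned char ole[] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    /* WordPerfect 5.x and later documents start with 0xFF "WPC". */
    static const unsigned char wpc[] = { 0xFF, 'W', 'P', 'C' };
    /* RTF documents start with "{\rtf". */
    static const unsigned char rtf[] = { '{', '\\', 'r', 't', 'f' };

    if (len >= sizeof ole && memcmp(buf, ole, sizeof ole) == 0)
        return DOC_MSWORD_OLE;
    if (len >= sizeof wpc && memcmp(buf, wpc, sizeof wpc) == 0)
        return DOC_WORDPERFECT;
    if (len >= sizeof rtf && memcmp(buf, rtf, sizeof rtf) == 0)
        return DOC_RTF;
    return DOC_UNKNOWN;
}
```

A wrapper script could read the first few bytes of each .doc file, call something like this, and only hand DOC_MSWORD_OLE files to catdoc. Note that the OLE2 header alone does not prove the file is a Word document, only that it is not WP or RTF.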
Re: [htdig] htdig 3.1.3 freezes
According to Geoff Hutchison:
> At 7:04 PM +0100 11/24/99, Marcus Ertl wrote:
> >Hi!
> > I have installed htdig 3.1.3 last night, and now I try to dig a new
> >database. But it always freezes on some pages. For example on
> >http://www.dilettanten.de/welt/haduloa.htm ... why? what can I do
> >against this?
>
> Taking a look at the page, I think it's probably a problem with the link to:
> http://service.kundenserver.de/cgi-bin/guestbook/guestbook.cgi?action=display&gb_domain=dilettanten.de&gb_id=1
>
> I'm surprised it's *freezing*, but there is a known bug with parsing
> URLs of this form in 3.1.3. Try this patch:
>
> http://www.htdig.org/files/contrib/other/htdig-3.1.3-urlparmbug.patch

I grabbed a copy of haduloa.htm and ran an unpatched copy of htdig 3.1.3
against it, and had no hanging, so it's not hanging in the parser. That's
not to say the patch won't solve the problem for you - it might if your
guestbook.cgi is being called with bad parameters and it's the cause of
the hang. If the patch does solve the problem, you may want to look into
the possibility of making your cgi script more robust.

If that doesn't work, I'd suggest trying to reduce the problem. If you
dig a smaller set of pages, does it hang at the same documents? Does
running htdig with -vvv give any clearer indication of where it's
hanging, and perhaps even why? (-vvv will produce LOTS of debugging
output)

-- Gilles R. Detillieux
Re: [htdig] System specifications
According to Udaya Bhasker:
> We downloaded your search engine after searching the web for more than
> one week. Your search engine fitted our bill perfectly.
> I downloaded it in my home directory and I made a symbolic link for
> /htdoc/index.html file to our index.html file.
>
> We have doubts about the entries that have to be made in the CONFIG
> file. We made entries in the CONFIG file which we thought as relevant
> ones. We opened our site through a browser and opened the symbolic link
> file. We encountered the message "URL forbidden".

It seems to me you're trying to read the htdoc documentation from your
web server. It may be that your web server isn't configured to allow
following symbolic links ("Options FollowSymLinks" in Apache), or the
symbolic link points to a directory that the web server can't access
(look for execute permissions turned off on the htdoc directory, or any
directory above it up to root).

You said you installed the source in your home directory - many times
users have their home directory permissions set to rwx------, which
blocks out any access to the directory (or anything under it) from any
user other than yourself. Your web server runs under a different user ID
than your own, of course.

You can also browse the ht://Dig documentation on-line at
http://www.htdig.org/

-- Gilles R. Detillieux
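The permission check described above (is the execute bit on for other users, on the directory and on each directory above it) can be done by hand with ls -ld, or with a few lines of C as sketched below. This is a simplification: it only looks at the "other" execute bit and ignores the case where the web server's user is in the directory's group. The function names are made up for the example.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* True if users other than the owner and group may enter a directory
   with this mode, i.e. the "other" execute (search) bit is set. */
int others_can_traverse(mode_t mode)
{
    return (mode & S_IXOTH) != 0;
}

/* Check one directory on disk. Returns 1 if it is traversable by
   others, 0 if not, -1 if stat() fails or the path is not a directory. */
int dir_open_to_others(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
        return -1;
    return others_can_traverse(st.st_mode);
}
```

A home directory with mode rwx------ (0700) fails this test, while rwxr-xr-x (0755) passes; the check would need to be repeated for every directory on the path from root down to the one the symbolic link points into.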
[htdig] i need help on htdig database format
when htdig exports results from an index as textformat it generates two
files. The files look like this :

file1:

0 u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software
  a:0 m:936027636 s:373 h: h: l:940510479 L:2 I:373
  d:http://www.htdig.org/ www.htdig.org ht://Dig Search Software (yes,
  the developers use it) ht://Dig Parent Directory A:

1 u:http://www.htdig.org/contents.html t:ht://Dig Table of Contents
  a:0 m:936027636 s:3539
  h: Contents General ht://Dig Features and Requirements Where to get it
  Installation Configuration FAQ Mailing list Uses of ht://Dig License
  information Reference htdig htmerge htnotify htfuzzy htsearch
  Configuration file META tags Other How it works Contributors Release
  notes ChangeLog TODO Bug Reporting Contributed Work Website stats
  Developer Site Quick Search:
  h: l:940510479 L:25 I:3539 d:/contents.html A:

2 u:http://www.htdig.org/main.html t:ht://Dig: Overview
  a:0 m:940044123 s:3717
  h: WWW Search Engine Software ht://Dig Copyright (c) 1995-1999 The
  ht://Dig Group Please see the file COPYING for license information.
  Recent News * 22 Sep 1999: A new stable release of ht://Dig,
  htdig-3.1.3, is released. This release is recommended for all
  production systems. It solves most of the outstanding bugs in the
  3.1.x releases. See the release notes or download it. * 1 June 1999:
  Unfortunately, due to lack of interest from key developers, the
  ht://Dig Conference from Aug 19-20 will be cancelled. We hope
  h: l:940510480 L:10 I:3717 d:ht://Dig /main.html A:

3 and so on.

file2:

01oct99    i:115  l:0    w:100998  c:2
01oct99    i:116  l:0    w:100998  c:2
01oct99    i:45   l:6    w:100381  c:2
01oct99    i:46   l:0    w:100998  c:2
02aug1999  i:48   l:361  w:639     a:2
02jun1999  i:50   l:262  w:1382    c:2  a:2
02mar1999  i:53   l:378  w:622     a:2
02may1999  i:51   l:280  w:1349    c:2  a:2

and so on

Can anyone please tell me exactly what these fields mean ?

Ronald

Ronald Tournier
Stichting De Digitale Stad
1011 TD Amsterdam
tel. 020 6257493
fax. 020 6382817
tel direkt: 020 5205335
e-mail: [EMAIL PROTECTED]
Re: [htdig] pure numbers as search words
At 3:37 PM +0100 11/25/99, [EMAIL PROTECTED] wrote:
>a string consisting of digits only is completely disregarded.
>Is there a way to reconfigure this?

See http://www.htdig.org/attrs.html#allow_numbers

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
Re: [htdig] top of page?
According to Benson Yeh:
> Ok. Now it works. For some reason, when I had it before, the command
> excerpt_show_top: yes
> was on the bottom of the .conf file. I believe that moving it up some
> has fixed the problem.

We've had reports before of problems with attributes at the end of the
conf file being ignored. It turns out that Configuration::Read() ends up
ignoring the last line if it doesn't end with a newline character,
because it reaches EOF before seeing a complete line. I'll try to fix
this, but in the meantime, watch out for that last line, and make sure
you terminate it with a newline.

-- Gilles R. Detillieux
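The pitfall described here, losing the final line when the file does not end in a newline, is easy to reproduce and to avoid in a line-reading loop. The sketch below is not the Configuration::Read() code. It is a stand-alone C illustration that relies on fgets() returning the partial last line before it reports end of file. It assumes '#' starts a comment and that lines fit in the buffer; count_attribute_lines is a name invented here.

```c
#include <stdio.h>
#include <string.h>

/* Count "name: value" lines in a config-style stream. The final line is
   counted even when the file ends without a newline, because fgets()
   still hands back the partial line before signalling end of file. */
int count_attribute_lines(FILE *in)
{
    char line[1024];
    int count = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip newline if present */
        if (line[0] == '\0' || line[0] == '#')
            continue;                          /* blank line or comment */
        if (strchr(line, ':') != NULL)
            ++count;
    }
    return count;
}
```

Fed a two-line file whose second line is "excerpt_show_top: yes" with no trailing newline, the loop still sees both attributes. A reader that only accepts a line once it has seen the newline would see one.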
[htdig] pure numbers as search words
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date: Thu, 25 Nov 1999 15:37:10 +0100
Subject: pure numbers as search words

Hi everybody,

as a new user of htdig I have the following problem: Although search
strings combined of letters and digits are properly found, a string
consisting of digits only is completely disregarded.
Is there a way to reconfigure this?

Thanks in advance
Florian Nill
[htdig] Foreign chars (Swedish)
Hello!

I'm having problems with some foreign chars when using htdig to index and
search a Swedish site. The locale is set right (sv) and is working in
other applications. The problem I have is somewhat weird, maybe it has
something to do with "uppercase" "lowercase"?

Well, I can search words like "Åsa,åsa,Öl,öl" and get the same matches.
But when I try to search "bäst" I get no hits. With "bÄst" I get several
hits...

I asked a guy here at the University and he said that there might be
complications with "unsigned char" and "char". He gave me the example
below. Please answer at a novice level, my C++ and Unix knowledge is very
limited.

Thanks
Philippe Ramkvist-Henry

htlib/StringMatch.cc

    while ((unsigned char)string[pos])
    {
        new_state = table[trans[string[pos]]][state];

Should be? or?

    while (string[pos])
    {
        new_state = table[trans[(unsigned char)string[pos]]][state];
AW: [htdig] irrelevant pages in search
Thanks for the answer,

> > htmerge does not seem to honour the TMPDIR variable which
> > IS properly set

this seems to be an individual problem on my machine. there is even a
difference in running rundig from commandline (ok) and via cron/batch
(erroneous)

> > in ANY case,
> > 1. htmerge should do a better error message (I even used -v)
>
> We're open to suggestions, but if the problem is the sort program
> that fails silently, there isn't much that htmerge can do to guess
> at why.

hmm, maybe this was me yelling out too loud without thinking. I think
you cannot do more than supplying stderr of sort plus maybe errno or the
exit value as a hint.

> > 2. htsearch should be able to identify a corrupt db
>
> I too would like to see more error checking to detect such problems,
> but I wouldn't know where to begin in adding code, and what to look
> for in terms of database problems. Anyone else have any ideas?

IMHO this is the most important part. I did not have a look at sources
so far, but isn't it possible to have a flag "under_construction"
somewhere (as part of the db itself) that is set as long as different
files of the db are not reflecting the status quo?

I am not in internals, but i feel you even have bad results between
running htdig and htmerge? so the flag could even state "ok", "htdig
running", "sorting", "merging" (and possibly count in the presence of
the -i flag if necessary)

htsearch could read this flag and tell if a search might be unreliable
right now. (or even give this wonderful message "contact the webmaster"
:(

Just ideas, I don't know how practicable.

Hardy
[htdig] Exclude URLs from search
I currently use the following feature in my search form:

Personal pages:
Include
Exclude

This enables people to exclude personal/public pages from their search.

I would now like to add an additional feature which enables people to
exclude another area from their search (/cww/).

Has anybody used multiple excludes in their search forms before? I
would be interested to know how it is done.

Many Thanks!

--
Jason Carvalho
Web Analyst
Cranfield University
[EMAIL PROTECTED]
Re: [htdig] Reducing the importance of pages.
> Is it possible to reduce the importance of certain pages? We have
> some pages on our site that are directories and contain thousands of
> entries. As a result they always seem to come up as top results
> whenever we search for anything. I don't really want to remove these
> pages from a search but I would like them to appear lower down the
> list. Is this at all possible (perhaps by using negative weighting or
> similar?)?
>
> Thanks!
>
> --
> Jason Carvalho
> Web Analyst
> Cranfield University
> [EMAIL PROTECTED]

You could increase the weighting of other pages by encouraging the use
of and in their headers. On our site we have increased the weighting of
keywords to 200.

You might consider not indexing the directory pages at all by placing in
their headers. Links in them will still be followed, but htdig will not
index the words in them.

--
David J Adams <[EMAIL PROTECTED]>
Computing Services
University of Southampton
[htdig] Reducing the importance of pages.
Is it possible to reduce the importance of certain pages? We have some
pages on our site that are directories and contain thousands of entries.
As a result they always seem to come up as top results whenever we
search for anything. I don't really want to remove these pages from a
search but I would like them to appear lower down the list. Is this at
all possible (perhaps by using negative weighting or similar?)?

Thanks!

--
Jason Carvalho
Web Analyst
Cranfield University
[EMAIL PROTECTED]
[htdig] Rundig
When I run 'rundig', it crawls my web site then when it comes to the
merge stage, it outputs:

Deleted, no excerpt :2156 http://ww...etc. for loads of my pages.

All in all, it found about 9500 pages but only merged 7500, giving the
above message for the rest.

What does this mean?

--
Jason Carvalho
Web Analyst
Cranfield University
[EMAIL PROTECTED]
[htdig] WordPerfect parser?
I have downloaded the parse_doc.pl script, and the xpdf and catdoc
utilities, and I am now using them to extend our search index to include
Word and PDF files. It all works well and with a bit of alteration to
the Perl script does exactly what I want. My thanks to the developers!

We also have a need to index WordPerfect documents, including those
produced by WP 6.1 and later. Can anyone recommend a utility that will
run under IRIX 6.5 ?

Thanks.

--
David J Adams <[EMAIL PROTECTED]>
Computing Services
University of Southampton