Re: AW: [htdig] irrelevant pages in search

1999-11-25 Thread Doug Barton

Hartmut Steffin wrote:
> 
> Thanks for the answer,
> 
> > > htmerge does not seem to honour the TMPDIR variable which
> > IS properly set
> this seems to be an individual problem on my machine. there is even a
> difference in running rundig from commandline (ok) and via cron/batch
> (erroneous)

It's not a plot against you, honest. :) If you get different results from
the command line and from cron it simply means that cron's environment is
different from the shell's. You might try setting the TMPDIR environment
variable explicitly in the crontab file and see if that improves things.
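
If you want to see what each environment actually provides, here is a
minimal sketch in C. The function names (effective_tmpdir, current_tmpdir)
are made up for this example, and the "/tmp" fallback is only the usual
Unix convention, not something taken from the htmerge source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: choose a temp directory from a TMPDIR value.
 * Falls back to "/tmp" (an assumed default) when the value is unset
 * or empty, as it often is in cron's minimal environment. */
const char *effective_tmpdir(const char *tmpdir_value)
{
    if (tmpdir_value == NULL || strlen(tmpdir_value) == 0)
        return "/tmp";
    return tmpdir_value;
}

/* Convenience wrapper that reads the real environment. */
const char *current_tmpdir(void)
{
    const char *dir = effective_tmpdir(getenv("TMPDIR"));
    assert(dir != NULL);   /* effective_tmpdir never returns NULL */
    return dir;
}
```

Printing current_tmpdir() from a small program run once from the shell and
once from cron should show whether the two environments differ.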

Good luck,

Doug
-- 
"Welcome to the desert of the real." 

- Laurence Fishburne as Morpheus, "The Matrix"


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You'll receive a message confirming the unsubscription.



Re: SV: [htdig] Foreign chars (Swedish)

1999-11-25 Thread Gilles Detillieux

According to Philippe Ramkvist-Henry:
> > Are the hits all capitalized, or do some of them have the lowercase ä?
> > Does this problem happen consistently with certain accented letters, and
> > not others?  Do you have certain uppercase letters appearing in db.wordlist?
> 
> With hits you mean the actual words from the document I guess. Well only those
> which are supposed to be capitalized are. For example: A search for "ättestupan"
> renders 0 hits while a search for "Ättestupan" renders 18. The word is in the
> documents always written as "Ättestupan" so this would be natural if the search
> was case sensitive.
> The problem is that "Åsa" and "åsa" give the exact same hits and it's also always
> referred to as "Åsa". The problem only exists (as far as I can test) for "äÄ".
>
> The db.wordlist only contains lowercase letters.

OK, so the word Ättestupan appears in there as ättestupan, correct?
Very strange.  So searches for words containing Ä will find words with
ä in its place, as expected, but searches for words containing ä will
match neither ä nor Ä, is that right?  I'm at a bit of a loss to explain
it, but at some point it would seem that htsearch is mangling the lower
case ä.  Do you have any documents containing a lower case ä somewhere
in a word, and if so, does that word make it into db.wordlist correctly?

I still suspect a problem with ctype for your locale.  Could you compile
and run the following C program on your system, and send me the output?
(Run it with the name of your locale, "sv", as an argument.)

Does using a locale of sv_SE (or even something else entirely like fr or
fr_FR) make any difference in your results?  And for the long-shot question,
do your documents use ISO 8859-1 (Latin 1) encoding, or are there some
that use a 7-bit encoding for Sweden?

---
#include <stdio.h>
#include <ctype.h>
#include <locale.h>

/* Print the ctype classification of every 8-bit character code under
   the locale named on the command line (e.g. "sv"). */
int main(int ac, char **av)
{
    int i;
    unsigned char c;

    if (ac > 1) setlocale(LC_ALL, av[1]);

    for (i = 0; i < 256; ++i) {
        printf("%3d 0x%02X: ", i, i);
        c = i;
        /* Show the character itself, or a ^X / ~X stand-in if unprintable */
        if (isprint(c))
            printf(" %c", c);
        else if (c < 0x80 && isprint(c ^ '@'))
            printf("^%c", c ^ '@');
        else if (isprint((c & 0x7F) ^ '@'))
            printf("~%c", (c & 0x7F) ^ '@');
        else
            printf("  ");
        /* One flag letter per ctype class, '-' where the test fails */
        printf("  %c%c%c%c%c%c%c%c%c%c%c%c%c\n",
            isascii(c)  ? 'A' : '-',
            isalpha(c)  ? 'a' : '-',
            islower(c)  ? 'l' : '-',
            isupper(c)  ? 'u' : '-',
            isalnum(c)  ? 'n' : '-',
            isdigit(c)  ? 'd' : '-',
            isxdigit(c) ? 'x' : '-',
            isgraph(c)  ? 'g' : '-',
            isprint(c)  ? 't' : '-',
            ispunct(c)  ? 'p' : '-',
            iscntrl(c)  ? 'c' : '-',
            isspace(c)  ? 's' : '-',
#ifdef  isblank
            isblank(c)  ? 'b' : '-'
#else
            '?'
#endif
        );
    }
    return 0;
}
---

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] word_list columns

1999-11-25 Thread Gilles Detillieux

According to Aaron Turner:
> there are 6 columns in the wordlist file.  Obviously col1 is the word.
> What are the others? (i, l, w, c a)

First field:  indexed word (lower case)
i:            doc ID (to match up with records in db.docs.index)
l:            location of word in doc (0-1000, i.e. tenth of a percent units)
w:            weight of word in searches
c:            no. of occurrences of word in document, if > 1
a:            index into anchor list in db.docdb record, to indicate which
              anchor name, if any, preceded this word

Fields are tab separated.  All of this info gets put into db.words.db by
htmerge, so htsearch doesn't actually look at db.wordlist.
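
As a rough illustration of that layout, here is a minimal C sketch that
splits one db.wordlist line into its fields. The struct and function names
(word_rec, parse_wordlist_line) are made up for this example and are not
part of ht://Dig; the 64-byte word buffer is an arbitrary limit, and leaving
the anchor at 0 when the a: field is absent is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One db.wordlist record, following the field list above. */
struct word_rec {
    char word[64];   /* indexed word (lower case); 64 is an arbitrary cap */
    long doc_id;     /* i: */
    long location;   /* l: 0-1000 */
    long weight;     /* w: */
    long count;      /* c: defaults to 1 when the field is absent */
    long anchor;     /* a: left at 0 when absent (assumption) */
};

/* Parse one tab-separated line. Returns 0 on success, -1 if the word
 * is missing or too long. Unknown keys are ignored. */
int parse_wordlist_line(const char *line, struct word_rec *rec)
{
    const char *p;
    size_t len;

    assert(line != NULL && rec != NULL);

    len = strcspn(line, "\t\n");
    if (len == 0 || len >= sizeof rec->word)
        return -1;
    memcpy(rec->word, line, len);
    rec->word[len] = '\0';
    rec->doc_id = rec->location = rec->weight = rec->anchor = 0;
    rec->count = 1;

    p = line + len;
    while (*p == '\t') {
        ++p;                                  /* skip the tab */
        if (p[0] != '\0' && p[1] == ':') {
            long value = strtol(p + 2, NULL, 10);
            switch (p[0]) {
            case 'i': rec->doc_id = value; break;
            case 'l': rec->location = value; break;
            case 'w': rec->weight = value; break;
            case 'c': rec->count = value; break;
            case 'a': rec->anchor = value; break;
            default: break;
            }
        }
        p += strcspn(p, "\t\n");              /* advance to the next field */
    }
    return 0;
}
```

The count starts at 1 because the c: field is only written when a word
occurs more than once in the document.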

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] i need help on htdig database format

1999-11-25 Thread Gilles Detillieux

According to ronald:
> when htdig exports results from an index as textformat it generates two
> files. The files look like this :
> 
> file1:
> 0 u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software   a:0
> m:936027636 s:373   h:  h:  l:940510479 L:2 I:373
> d:http://www.htdig.org/www.htdig.orght://Dig Search Software (yes, the developers
> use it)ht://DigParent Directory   A:

First field:  doc ID
u:            URL of doc
t:            doc title
a:            doc state (refer to source)
m:            date/time last modified, sec since 1970-01-01 00:00:00 UTC
s:            doc size in bytes
h:            doc head (excerpt of first max_head_length bytes of doc)
h: (2nd)      meta description contents
              (this 2nd h is a bug - it really should be a unique value
              like D or something)
l:            date/time document was indexed (sec since 1970)
L:            no. of links doc has to other docs
I:            "docImageSize" - has nothing to do with images, but seems to
              contain document size, and may be cumulative in some
              circumstances - can anyone else make any sense of this?
d:            link descriptions - text of links to this doc, ^A separated
A:            anchor names (bookmarks) in doc, ^A separated

All fields are tab (^I) separated.  Sub-fields of d & A use ^A separator.
doc head field has all runs of white space (space, tab, newline, etc.)
collapsed to single spaces.

> file2:

This is db.wordlist...

> 01oct99    i:115  l:0    w:100998  c:2
> 01oct99    i:116  l:0    w:100998  c:2
> 01oct99    i:45   l:6    w:100381  c:2
> 01oct99    i:46   l:0    w:100998  c:2
> 02aug1999  i:48   l:361  w:639     a:2
> 02jun1999  i:50   l:262  w:1382    c:2  a:2
> 02mar1999  i:53   l:378  w:622     a:2
> 02may1999  i:51   l:280  w:1349    c:2  a:2

First field:  indexed word (lower case)
i:            doc ID (to match up with records from above)
l:            location of word in doc (0-1000, i.e. tenth of a percent units)
w:            weight of word in searches
c:            no. of occurrences of word in document, if > 1
a:            index into "A:" list above, to indicate which anchor name,
              if any, preceded this word

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





SV: [htdig] Foreign chars (Swedish)

1999-11-25 Thread Philippe Ramkvist-Henry


> Are the hits all capitalized, or do some of them have the lowercase ä?
> Does this problem happen consistently with certain accented letters, and
> not others?  Do you have certain uppercase letters appearing in db.wordlist?

With hits you mean the actual words from the document I guess. Well only those
which are supposed to be capitalized are. For example: A search for "ättestupan"
renders 0 hits while a search for "Ättestupan" renders 18. The word is in the
documents always written as "Ättestupan" so this would be natural if the search
was case sensitive.
The problem is that "Åsa" and "åsa" give the exact same hits and it's also always
referred to as "Åsa". The problem only exists (as far as I can test) for "äÄ".

The db.wordlist only contains lowercase letters.

> > I asked a guy here at the University and he said that there might be
> > complications with "unsigned char" and "char". He gave me the example
> > below. Please answer at a novice level, my C++ and Unix knowledge is very
> > limited.  
> 
> Good hunch, but given that some accented letters work and some give
> problems, I wouldn't expect that it's a problem with sign extension.
> This seems to point to a problem with the ctype tables for your locale,
> but there could be something else that I'm missing here.  Please keep
> us posted.

I'm also looking for a synonym wordlist in swedish... If anyone has one, please 
send me a copy.






[htdig] word_list columns

1999-11-25 Thread Aaron Turner


there are 6 columns in the wordlist file.  Obviously col1 is the word.
What are the others? (i, l, w, c a)

--
Aaron Turner, Core Developer   http://vodka.linuxkb.org/~aturner/
Linux Knowledge Base Organization  http://linuxkb.org/
Because world domination requires quality open documentation.
aka: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]






Re: [htdig] Foreign chars (Swedish)

1999-11-25 Thread Gilles Detillieux

According to Philippe Ramkvist-Henry:
> I'm having problems with some foreign chars when using htdig to index and
> search a Swedish site. The locale is set right (sv) and is working in
> other applications. The problem I have is somewhat weird, maybe it has
> something to do with "uppercase" "lowercase"?
> 
> Well, I can search words like "Åsa,åsa,Öl,öl" and get the same matches.
> But when I try to search "bäst" I get no hits. With "bÄst" I get several
> hits...

Are the hits all capitalized, or do some of them have the lowercase ä?
Does this problem happen consistently with certain accented letters, and
not others?  Do you have certain uppercase letters appearing in db.wordlist?

> I asked a guy here at the University and he said that there might be
> complications with "unsigned char" and "char". He gave me the example
> below. Please answer at a novice level, my C++ and Unix knowledge is very
> limited.  

Good hunch, but given that some accented letters work and some give
problems, I wouldn't expect that it's a problem with sign extension.
This seems to point to a problem with the ctype tables for your locale,
but there could be something else that I'm missing here.  Please keep
us posted.

>  htlib/StringMatch.cc
>  
>  while ((unsigned char)string[pos])
>  {
>  new_state = table[trans[string[pos]]][state];
>  
> Should be? or? 
>  
>  while (string[pos])

You don't need to take off the type cast on the "while" condition above,
but the trans[] array subscript below definitely should be type cast!
I'll fix this in the source.  However, this seems to be a problem only
in the StringMatch::Compare() method, which isn't used for looking at
words in documents or in the database.  It only affects a few internal
ASCII-only string matches, and the robots.txt disallow comparisons, so
unless you use upper-half characters in URLs, this bug shouldn't be a
problem (which explains how it's evaded detection this long).

>  {
>  new_state = table[trans[(unsigned char)string[pos]]][state];
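
To show why the cast on the subscript matters, here is a minimal sketch.
The helper names (table_index, raw_index) are made up for illustration and
are not from StringMatch.cc. On platforms where plain char is signed, an
upper-half character such as 0xE4 (ä in Latin 1) becomes a negative
subscript unless it is cast first.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: turn a char into a safe index for a
 * 256-entry lookup table such as trans[]. */
int table_index(char c)
{
    int idx = (unsigned char)c;
    assert(idx >= 0 && idx <= UCHAR_MAX);
    return idx;
}

/* What the uncast subscript evaluates to; this can be negative
 * where plain char is a signed type. */
int raw_index(char c)
{
    return c;
}
```

With the cast, every possible char value lands in the range 0-255, so the
lookup stays inside the table no matter how the compiler treats char.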

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] Exclude URLs from search

1999-11-25 Thread Gilles Detillieux

According to Jason Carvalho:
> I currently use the following feature in my search form:
> 
> Personal pages: 
>   Include
>   Exclude
> 
> 
> This enables people to exclude personal/public pages from their
> search.
> 
> I would now like to add an additional feature which enables people to
> exclude another area from their search (/cww/).
> 
> Has anybody used multiple excludes in their search forms before?  I
> would be interested to know how it is done.

I believe this has been done before, for either restrict or exclude, using
radio buttons.  You could probably also define multiple select lists
for the exclude parameter.  Either should work as long as you have htsearch
3.1.2 or later.

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] Rundig

1999-11-25 Thread Gilles Detillieux

According to Jason Carvalho:
> When I run 'rundig', it crawls my web site then when it comes to the
> merge stage, it outputs:
> 
> Deleted, no excerpt :2156 http://ww...etc.   for loads of my pages.
> 
> All in all, it found about 9500 pages but only merged 7500, giving the
> above message for the rest.
> 
> What does this mean?

The two most common causes are:  a) the document contained no text, or
the text was excluded by noindex meta tags, or b) the document was
disallowed by the server's robots.txt file.  If you ran htdig or rundig
with -vvv, then htdig's output should give you more of an indication of
which situation arose with these pages.

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] parse_doc.pl alterations

1999-11-25 Thread Gilles Detillieux

According to David Adams:
> I have downloaded the parse_doc.pl script, and the xpdf and catdoc
> utilities, and I am now using them to extend our search index to include
> Word and PDF files.  It all works well and with a bit of alteration to
> the Perl script does exactly what I want.  My thanks to the developers!

I forgot to ask before, what were your alterations?  Something very
specific to your needs, or something worth sharing with others?

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] WordPerfect parser?

1999-11-25 Thread Gilles Detillieux

According to David Adams:
> I have downloaded the parse_doc.pl script, and the xpdf and catdoc
> utilities, and I am now using them to extend our search index to include
> Word and PDF files.  It all works well and with a bit of alteration to
> the Perl script does exactly what I want.  My thanks to the developers!
> 
> We also have a need to index WordPerfect documents, including those
> produced by WP 6.1 and later.  Can anyone recommend a utility that will
> run under IRIX 6.5 ?

I haven't come across any open source/freeware WP to text converters.
The reason I put the WP hooks in there originally was because some sites
had .doc files that were WP rather than Word documents, and the WP documents
caused catdoc to blow chunks.  Same story for .doc files in RTF format.
I then realised there are all sorts of .doc files that aren't MS-Word,
so I put in explicit checks for MS-Word magic numbers rather than using
catdoc by default, but still kept the WP and RTF hooks in by way of
example.

If WordPerfect for UNIX is available for IRIX, and it contains the cvt
utility as WP for Linux does, you could write a script that uses that,
or adapt the parse_doc.pl script to use it directly.  Its usage is:

/usr/local/wplinux/shbin10/cvt -l file.wpd file.txt asci > /dev/null

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] htdig 3.1.3 freezes

1999-11-25 Thread Gilles Detillieux

According to Geoff Hutchison:
> At 7:04 PM +0100 11/24/99, Marcus Ertl wrote:
> >Hi!
> >   I have installed htdig 3.1.3 last night, and now I try to dig a new
> >database. But it always freezes on some pages. For example on
> >http://www.dilettanten.de/welt/haduloa.htm ... why? what can I do
> >against this?
> 
> Taking a look at the page, I think it's probably a problem with the link to:
> http://service.kundenserver.de/cgi-bin/guestbook/guestbook.cgi?action= 
> display&gb_domain=dilettanten.de&gb_id=1
> 
> I'm surprised it's *freezing*, but there is a known bug with parsing 
> URLs of this form in 3.1.3. Try this patch:
> 
> http://www.htdig.org/files/contrib/other/htdig-3.1.3-urlparmbug.patch

I grabbed a copy of haduloa.html and ran an unpatched copy of htdig
3.1.3 against it, and had no hanging, so it's not hanging in the parser.
That's not to say the patch won't solve the problem for you - it might
if your guestbook.cgi is being called with bad parameters and it's the
cause of the hang.  If the patch does solve the problem, you may want
to look into the possibility of making your cgi script more robust.

If that doesn't work, I'd suggest trying to reduce the problem.  If you
dig a smaller set of pages, does it hang at the same documents?  Does
running htdig with -vvv give any clearer indication of where it's hanging,
and perhaps even why?  (-vvv will produce LOTS of debugging output)

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





Re: [htdig] System specifications

1999-11-25 Thread Gilles Detillieux

According to Udaya  Bhasker:
> We downloaded your search engine after searching the web for more than
> one week. Your search engine fitted our bill perfectly.
> I downloaded it in my home directory and I made a symbolic link for
> /htdoc/index.html file to our index.html file.
>
> We have doubts about the entries that have to be made in the CONFIG
> file. We made entries in the CONFIG file which we thought were relevant.
> We opened our site through a browser and opened the symbolic link
> file. We encountered the message "URL forbidden".

It seems to me you're trying to read the htdoc documentation from your
web server.  It may be that your web server isn't configured to allow
following symbolic links ("Options FollowSymLinks" in Apache), or the
symbolic link points to a directory that the web server can't access
(look for execute permissions turned off on the htdoc directory, or
any directory above it up to root).  You said you installed the source
in your home directory - many times users have their home directory
permissions set to rwx------, which blocks out any access to the
directory (or anything under it) from any user other than yourself.
Your web server runs under a different user ID than your own, of course.

You can also browse the ht://Dig documentation on-line at
http://www.htdig.org/

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





[htdig] i need help on htdig database format

1999-11-25 Thread ronald

when htdig exports results from an index as textformat it generates two
files. The files look like this :

file1:
0   u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software
a:0 m:936027636 s:373   h:  h:  l:940510479 L:2 I:373   
d:http://www.htdig.org/
www.htdig.org ht://Dig Search Software (yes, the developers use it)
ht://Dig Parent Directory   A:
1   u:http://www.htdig.org/contents.htmlt:ht://Dig Table of Contentsa:0
m:936027636 s:3539  h: Contents General ht://Dig Features and Requirements
Where to get it Installation Configuration FAQ Mailing list Uses of
ht://Dig License information Reference htdig htmerge htnotify htfuzzy
htsearch Configuration file META tags Other How it works Contributors
Release notes ChangeLog TODO Bug Reporting Contributed Work Website stats
Developer Site Quick Search:h:  l:940510479 L:25I:3539  
d:/contents.htmlA:
2   u:http://www.htdig.org/main.htmlt:ht://Dig: Overviewa:0 
m:940044123
s:3717  h: WWW Search Engine Software ht://Dig Copyright (c) 1995-1999 The
ht://Dig Group Please see the file COPYING for license information. Recent
News * 22 Sep 1999: A new stable release of ht://Dig, htdig-3.1.3, is
released. This release is recommended for all production systems. It solves
most of the outstanding bugs in the 3.1.x releases. See the release notes
or download it. * 1 June 1999: Unfortunately, due to lack of interest from
key developers, the ht://Dig Conference from Aug 19-20 will be cancelled.
We hope h:  l:940510480 L:10I:3717  d:ht://Dig /main.html   A:
3  and so on.


file2:
01oct99    i:115  l:0    w:100998  c:2
01oct99    i:116  l:0    w:100998  c:2
01oct99    i:45   l:6    w:100381  c:2
01oct99    i:46   l:0    w:100998  c:2
02aug1999  i:48   l:361  w:639     a:2
02jun1999  i:50   l:262  w:1382    c:2  a:2
02mar1999  i:53   l:378  w:622     a:2
02may1999  i:51   l:280  w:1349    c:2  a:2
and so on


Can anyone please tell me exactly what these fields mean ? 

Ronald





_
Ronald Tournier
Stichting De Digitale Stad
1011 TD Amsterdam
tel. 020 6257493
fax. 020 6382817
tel direkt: 020 5205335
e-mail: [EMAIL PROTECTED]






Re: [htdig] pure numbers as search words

1999-11-25 Thread Geoff Hutchison

At 3:37 PM +0100 11/25/99, [EMAIL PROTECTED] wrote:
>a string consisting of digits only is completely disregarded.
>Is there a way to reconfigure this?

See http://www.htdig.org/attrs.html#allow_numbers

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/





Re: [htdig] top of page?

1999-11-25 Thread Gilles Detillieux

According to Benson Yeh:
> Ok.  Now it works.  For some reason, when I had it before, the command
> excerpt_show_top:  yes
> was at the bottom of the .conf file.  I believe that moving it up some
> has fixed the problem.

We've had reports before of problems with attributes at the end of the
conf file being ignored.  It turns out that Configuration::Read() ends
up ignoring the last line if it doesn't end with a newline character,
because it reaches EOF before seeing a complete line.  I'll try to fix
this, but in the meantime, watch out for that last line, and make sure
you terminate it correctly.
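
For anyone who wants to check their own file, here is a minimal sketch of
the test involved. The function name last_line_terminated is made up for
this example; it is not the Configuration::Read() code, just a check you
could apply to the contents of a conf file once they are read into memory.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the text is empty or its last line ends with '\n',
 * 0 if the final line is left unterminated. */
int last_line_terminated(const char *text)
{
    size_t len;

    assert(text != NULL);
    len = strlen(text);
    return len == 0 || text[len - 1] == '\n';
}
```

A conf file whose contents make this return 0 is one where the last
attribute is at risk of being ignored.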

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930





[htdig] pure numbers as search words

1999-11-25 Thread florian . nill

From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date: Thu, 25 Nov 1999 15:37:10 +0100
Subject: pure numbers as search words

Hi everybody,

as a new user of htdig I have the following problem:

Although search strings combining letters and digits are properly
found, a string consisting of digits only is completely disregarded.
Is there a way to reconfigure this?

Thanks in advance

Florian Nill




[htdig] Foreign chars (Swedish)

1999-11-25 Thread Philippe Ramkvist-Henry


Hello!

I'm having problems with some foreign chars when using htdig to index and
search a Swedish site. The locale is set right (sv) and is working in
other applications. The problem I have is somewhat weird, maybe it has
something to do with "uppercase" "lowercase"?

Well, I can search words like "Åsa,åsa,Öl,öl" and get the same matches.
But when I try to search "bäst" I get no hits. With "bÄst" I get several
hits...

I asked a guy here at the University and he said that there might be
complications with "unsigned char" and "char". He gave me the example
below. Please answer at a novice level, my C++ and Unix knowledge is very
limited.  

Thanks
Philippe Ramkvist-Henry



 htlib/StringMatch.cc
 
 while ((unsigned char)string[pos])
 {
 new_state = table[trans[string[pos]]][state];
 
Should be? or? 
 
 while (string[pos])
 {
 new_state = table[trans[(unsigned char)string[pos]]][state];






AW: [htdig] irrelevant pages in search

1999-11-25 Thread Hartmut Steffin

Thanks for the answer,

> > htmerge does not seem to honour the TMPDIR variable which
> IS properly set
this seems to be an individual problem on my machine. there is even a
difference in running rundig from commandline (ok) and via cron/batch
(erroneous)

> > in ANY case,
> > 1. htmerge should do a better error message (I even used -v)
>
> We're open to suggestions, but if the problem is the sort
> program that fails
> silently, there isn't much that htmerge can do to guess at why.
hmm, maybe this was me yelling out too loud without thinking. I think you
cannot do more than supply the stderr output of sort, plus maybe errno or
the exit value, as a hint.

> > 2. htsearch should be able to identify a corrupt db
> I too would like to see more error checking to detect such
> problems, but
> I wouldn't know where to begin in adding code, and what to
> look for in terms
> of database problems.  Anyone else have any ideas?
IMHO this is the most important part. I did not have a look at sources so
far, but isn't it possible to have a flag "under_construction" somewhere (as
part of the db itself) that is set as long as different files of the db are
not reflecting the status quo? I am not in internals, but i feel you even
have bad results between running htdig and htmerge? so the flag could even
state "ok", "htdig running", "sorting", "merging"  (and possibly count
in the presence of the -i flag if necessary)
htsearch could read this flag and tell if a search might be unreliable right
now. (or even give this wonderful message "contact the webmaster" :(
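
To make the idea a bit more concrete, here is a minimal sketch in C. All of
it is hypothetical: the names (db_state, search_is_reliable, db_state_name)
are invented for this mail and are not taken from the ht://Dig sources, and
where such a flag would be stored is left open.

```c
#include <assert.h>

/* Hypothetical build states for the database, as proposed above. */
enum db_state { DB_OK, DB_DIGGING, DB_SORTING, DB_MERGING };

/* A search is only trustworthy when no rebuild step is in progress. */
int search_is_reliable(enum db_state state)
{
    return state == DB_OK;
}

/* Short text htsearch could show alongside its results. */
const char *db_state_name(enum db_state state)
{
    switch (state) {
    case DB_OK:      return "ok";
    case DB_DIGGING: return "htdig running";
    case DB_SORTING: return "sorting";
    case DB_MERGING: return "merging";
    }
    assert(!"unknown db_state");
    return "unknown";
}
```

htdig and htmerge would each update the flag as they start and finish, and
htsearch would warn the user whenever search_is_reliable() returns 0.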

Just ideas, I don't know how practicable.
Hardy







[htdig] Exclude URLs from search

1999-11-25 Thread Jason Carvalho

I currently use the following feature in my search form:

Personal pages: 
  Include
  Exclude


This enables people to exclude personal/public pages from their
search.

I would now like to add an additional feature which enables people to
exclude another area from their search (/cww/).

Has anybody used multiple excludes in their search forms before?  I
would be interested to know how it is done.

Many Thanks!
-- 
--
Jason Carvalho
Web Analyst
Cranfield University
[EMAIL PROTECTED]
--





Re: [htdig] Reducing the importance of pages.

1999-11-25 Thread David Adams

> 
> Is it possible to reduce the importance of certain pages?  We have
> some pages on our site that are directories and contain thousands of
> entries.  As a result they always seem to come up as top results
> whenever we search for anything.  I don't really want to remove these
> pages from a search but I would like them to appear lower down the
> list.  Is this at all possible (perhaps by using negative weighting or
> similar?)?
> 
> Thanks!
> 
> -- 
> --
> Jason Carvalho
> Web Analyst
> Cranfield University
> [EMAIL PROTECTED]

You could increase the weighting of other pages by encouraging
the use of



and



in their headers.  On our site we have increased the weighting
of keywords to 200.

You might consider not indexing the directory pages at all by placing



in their headers.  Links in them will still be followed, but htdig
will not index the words in them.

-- 
 
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton





[htdig] Reducing the importance of pages.

1999-11-25 Thread Jason Carvalho

Is it possible to reduce the importance of certain pages?  We have
some pages on our site that are directories and contain thousands of
entries.  As a result they always seem to come up as top results
whenever we search for anything.  I don't really want to remove these
pages from a search but I would like them to appear lower down the
list.  Is this at all possible (perhaps by using negative weighting or
similar?)?

Thanks!

-- 
--
Jason Carvalho
Web Analyst
Cranfield University
[EMAIL PROTECTED]
--





[htdig] Rundig

1999-11-25 Thread Jason Carvalho

When I run 'rundig', it crawls my web site then when it comes to the
merge stage, it outputs:

Deleted, no excerpt :2156 http://ww...etc.   for loads of my pages.

All in all, it found about 9500 pages but only merged 7500, giving the
above message for the rest.

What does this mean?

-- 
--
Jason Carvalho
Web Analyst
Cranfield University
[EMAIL PROTECTED]
--





[htdig] WordPerfect parser?

1999-11-25 Thread David Adams

I have downloaded the parse_doc.pl script, and the xpdf and catdoc
utilities, and I am now using them to extend our search index to include
Word and PDF files.  It all works well and with a bit of alteration to
the Perl script does exactly what I want.  My thanks to the developers!

We also have a need to index WordPerfect documents, including those
produced by WP 6.1 and later.  Can anyone recommend a utility that will
run under IRIX 6.5 ?

Thanks.

-- 
 
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton

