Re: [htdig] Target Tag Matching

2000-07-24 Thread Geoff Hutchison

At 4:16 PM -0700 7/24/00, Jeff Mandel wrote:
>There are folks here that use target tags like keywords. Is there a way
>to extract those words from a target tag and maybe even weight them?
>
>html source would look like this:
>Monitoring

There are much better ways to do this using the HTML spec. The anchor 
tag is intended for jumping to specific places on a page and this is 
how ht://Dig treats it.

You could certainly hack the HTML parser to add these words to the 
database (see HTML.cc, specifically the handling of the TITLE 
attribute). That attribute is not used by ht://Dig directly yet, but 
it will be.

>Besides having them manually add a keyword list to the documents, any
>suggestions?

A META keyword list is going to be the most widely accepted way of 
adding keywords. It's used by almost every search engine and spider 
in existence (obviously including htdig). Beyond that, I'd say to 
stick to the spec.
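
For pages that already carry keywords in anchor names, a small script 
could generate the META tag instead of having people edit each document 
by hand. This is only a minimal sketch in Python, not part of ht://Dig. 
It assumes the keywords sit in the name attribute of the anchor tags, 
separated by commas; that layout and the function names are 
hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class AnchorNameCollector(HTMLParser):
    """Collect the value of every <a name="..."> attribute."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for key, value in attrs:
            if key == "name" and value:
                self.names.append(value)


def anchor_keywords(html_text):
    """Return the unique keywords found in anchor names, in first-seen order."""
    parser = AnchorNameCollector()
    parser.feed(html_text)
    keywords = []
    for name in parser.names:
        # Assumption: several keywords in one name are comma-separated.
        for word in name.split(","):
            word = word.strip()
            if word and word not in keywords:
                keywords.append(word)
    return keywords


def meta_keywords_tag(html_text):
    """Build a META keywords tag from the anchor names in html_text."""
    content = ", ".join(anchor_keywords(html_text))
    return '<meta name="keywords" content="%s">' % escape(content, quote=True)
```

The resulting tag would still have to be placed in the HEAD of each 
document before the next dig.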

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.




Re: [htdig] Update search indexes?

2000-07-24 Thread Geoff Hutchison

At 7:25 PM -0500 7/24/00, David Gibbs wrote:
>Would more memory affect the processing significantly?  Maybe more 
>disk?  (got 6gb now).  Is there a mode of operation where it only 
>looks at changed pages for indexing?

Sure. If you have pre-existing databases then it will use those to 
only look at changed pages. For example, the rundig.sh script uses 
pre-existing .work files to do this. (Granted, I'm a bit partial to 
this script. :-)

More memory or a faster disk might help. Also, if you have access to 
the files through a local filesystem (preferably not NFS), see the 
local_urls attributes:

e.g.
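
As a rough illustration of the idea only: local_urls pairs a URL prefix 
with a local directory, so a matching document can be read from disk 
instead of being fetched over HTTP. The Python sketch below is not 
ht://Dig's code; the function name, the example prefix and the example 
directory are all hypothetical.

```python
def url_to_local_path(url, prefix_map):
    """Map url to a local path using the longest matching URL prefix.

    prefix_map maps a URL prefix to a directory prefix. Returns None
    when no prefix matches, meaning the URL would be fetched over HTTP.
    """
    best = None
    for url_prefix, dir_prefix in prefix_map.items():
        if url.startswith(url_prefix):
            # Prefer the most specific (longest) prefix.
            if best is None or len(url_prefix) > len(best[0]):
                best = (url_prefix, dir_prefix)
    if best is None:
        return None
    return best[1] + url[len(best[0]):]


# Hypothetical mapping: one site served from /var/www/html/.
EXAMPLE_MAP = {"http://www.example.com/": "/var/www/html/"}
```

With that mapping, http://www.example.com/lists/a.html would be read 
from /var/www/html/lists/a.html.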


--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






[htdig] Update search indexes?

2000-07-24 Thread David Gibbs

Ok, this question might be quite basic ... but I couldn't find anything in 
the FAQ that addressed it specifically.

I'm in the process of setting up a web archive of a number of mailing lists 
I run, with htdig as the search engine.

When I ran 'rundig' against the archive to create the initial database it 
took quite a long time (~6 hours on a P2 300 with 64 MB of RAM).

When I update the archive & re-run the 'dig', it seems to take a long time 
also.

I am running it with the '-a' flag, so it doesn't blow away the current 
files ... but I was wondering if there is a way I can make the database 
file creation run faster?

Would more memory affect the processing significantly?  Maybe more 
disk?  (got 6gb now).  Is there a mode of operation where it only looks at 
changed pages for indexing?

Thanks!

david
--
| Internet: [EMAIL PROTECTED]
| WWW:  http://www.midrange.com/david
|
| This message was written and delivered using 100%
| post-consumer (recycled) data bits.







[htdig] Target Tag Matching

2000-07-24 Thread Jeff Mandel

Hello All,

There are folks here that use target tags like keywords. Is there a way
to extract those words from a target tag and maybe even weight them?

html source would look like this:
Monitoring

A search for "Dinosaur", "Monitoring", or "Reactive Dye" would bring up
the doc containing this tag. I've been unable to scare up text that just
appears in tags, so I'm guessing htdig is purposely ignoring these target
tags, yes?

Besides having them manually add a keyword list to the documents, any
suggestions?

Thanks,

Jeff







Re: [htdig] SSL and curl (was Problem with CGI)

2000-07-24 Thread Geoff Hutchison

On Mon, 24 Jul 2000, Gilles Detillieux wrote:

> Out of curiosity, how did "curl" deal with the patent and export
> restrictions on RSA?  Do they have a non-GPL, non-export release
> with https support, and one without?

Curl has the advantage that it's developed outside the US. I don't know
about the legality of the US mirror. In my case, I built from source and
linked against the RSA-licensed SSL library I had from a secure (non-free)
webserver.

I don't want to look too far in the future, but as of early September, it
no longer matters since the patent expires.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







Re: [htdig] WordKey::Compare: key length for a or b < info.minimum_length

2000-07-24 Thread Geoff Hutchison

On Mon, 24 Jul 2000, Adam H. Lewenberg wrote:

> When running htdig with 4 hops I am getting some strange results. Most
> recently, I get the message
> 
>WordKey::Compare: key length for a or b < info.minimum_length

First off, you don't say what version you're using. Since I know that the
WordKey class only exists in the 3.2 code, I assume you're using a
snapshot or beta. My first suggestion is to try the latest snapshot, since
this may (or may not) have fixed bugs in previous code.

Next, does this seem to be coming from one particular page? If you just
index that page, does it have the same problem?

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/








Re: [htdig] Htmerge: "Deleted, invalid"

2000-07-24 Thread Geoff Hutchison

On Mon, 24 Jul 2000, Gilles Detillieux wrote:

> Have you been able to build ht://Dig using SGI's compiler?  I may be

No.

> build of ht://Dig with it.  In all likelihood, it would be a problem in
> the ht://Dig code that just doesn't manifest itself when built with the
> GNU compiler.

Maybe. Except I sometimes have trouble building software that's *supposed*
to work with SGI's compiler (like GCC or Emacs or CVS). Yes, I'd like to
see ht://Dig compile cleanly with various native compilers. But since
there seems to be an OK workaround (if not ideal), I'm not personally
going to put much effort in this direction.

Then again, I haven't used SGI's compiler extensively. I admit readily
that I'd much rather compile GCC (or get binaries) and use them than fuss
with the native one.

In my group here at Northwestern, I'm not alone. Only SGI's Fortran
compiler is used, in part because there isn't a GNU compiler for anything
beyond f77.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







Re: [htdig] Read only volume indexing

2000-07-24 Thread htdig

That is ok.  
I should have clarified, the cdroms are accessed under server.com/mnt/ in
the web browser and then by local_url.

I was just worried that this thing wouldn't scale very well if it had to
always re-read up to 25 cdroms worth of archived html.

Thanks,
Justin







Re: [htdig] Read only volume indexing

2000-07-24 Thread Gilles Detillieux

According to [EMAIL PROTECTED]:
> How does htdig handle indexing of html pages on read only volumes(cdroms)?
> Will it only index it once and just skip the whole volume the next time?

The 3.1.x series of htdig only handles http:// type URLs, so the read-only
volumes would have to be accessible from a web server in order to be
indexed to begin with.  In 3.2.x (currently in beta), you can index
file:// URLs as well, so this gives more flexibility.  In either case,
an update run of htdig would still check the last-modified date of every
document in the database to see if it's been updated, so it wouldn't
exactly skip over the whole volume, but it wouldn't have to re-read
every document either.
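
The date check can be pictured roughly as follows. This is a minimal 
Python sketch of the comparison, not htdig's actual code; the function 
name is hypothetical, and it assumes both dates are HTTP-date strings 
in GMT, as a web server would send in a Last-Modified header.

```python
from email.utils import parsedate_to_datetime


def needs_reindex(stored_last_modified, current_last_modified):
    """Return True when the document should be re-read.

    Both arguments are HTTP-date strings such as
    'Mon, 24 Jul 2000 16:16:00 GMT', or None when the date is unknown.
    """
    if stored_last_modified is None or current_last_modified is None:
        # Without a date on both sides we cannot show it is unchanged.
        return True
    stored = parsedate_to_datetime(stored_last_modified)
    current = parsedate_to_datetime(current_last_modified)
    # Only a strictly newer date triggers a re-read of the document.
    return current > stored
```

On a read-only volume the dates never change, so every document would 
be checked but none would be re-read.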

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: [htdig] Htmerge: "Deleted, invalid"

2000-07-24 Thread Gilles Detillieux

According to David Adams:
> I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a 
> year and I have been very pleased with it.  I would say that we've given it a 
> good workout here.  The problem with the "Deleted, invalid" messages only 
> occurs with a second, relatively new search index.

I guess I should have read your message before responding to Geoff's!

> The first index is made from a single run of htdig covering 33 servers, all in 
> the local domain, and on this week's initial dig htmerge reports 49,233 
> documents and not a single "Deleted, invalid".
> 
> The second index is made from two runs of htdig covering a total of 969 (yes 969 
> !) servers using a proxy.  Htmerge reports a mere 3,096 documents and 86 
> "Deleted, invalid".
> 
> I have looked at the db.wordlist files (which are written to only by htdig - is 
> that right?)

Yes and no.  htdig creates and writes the initial db.wordlist, then htmerge
sorts it, merges words together, and processes flags for page removals.  It
then rewrites this file before creating the word index database.

> and it would appear that htdig is flagging the pages for htmerge 
> to delete and is not finding any words in them.
> 
> I can advance these theories:
> 
> It is not a bug, but is due to the use of a proxy. (I use a proxy 
> because without one, a portion of the sites on any run of htdig were 
> found to be not responding or even unknown.  With a proxy, htdig appears
> to have no such problems.)

Hold on there!  The problem of sites being down (unknown or not
responding) is exactly the sort of thing that causes the "Deleted,
invalid" situation, and I said so last week.  How did you conclude that
htdig appears to have no such problems with a proxy, when it does indeed
appear to be having exactly that problem?  It would make sense that if
a site is not responding, the proxy would inform htdig of this (unless
it happened to quietly substitute a cached copy of the requested page
- assuming it had one), and htdig would respond the same way it would
without a proxy.  I think this is the most likely theory.

> It is a bug due to the use of a proxy.
> 
> It is a bug which only shows when compiled under IRIX.
> 
> It is a bug which only occurs when there are many different servers.
> 
> I intend to re-build the second index using htdig -vvv and perhaps learn 
> something.

The only sure way to rule out an SGI compiler or IRIX-specific problem
would be to run htdig on a Linux box with the same configuration and
the same proxy, and see if you get the same results.  However, based on
what you said about a portion of the sites not responding, I'd guess
this is a more likely problem.  I guess there could also be a problem
with the proxy server itself, causing it to act like a server is down
when it isn't.  You may want to try different proxies as well.  In any
case, a close look at htdig -vvv output should give some clues.

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






[htdig] WordKey::Compare: key length for a or b < info.minimum_length

2000-07-24 Thread Adam H. Lewenberg

When running htdig with 4 hops I am getting some strange results. Most
recently, I get the message

   WordKey::Compare: key length for a or b < info.minimum_length

repeated over and over on my console. 

I downloaded the source and compiled it under Linux RedHat 6.2 on a
dual Pentium Dell server. I have plenty of hard disk space (>10 GB) and
256 MB of RAM. 

Anybody else run into this problem? 

Thanks, A. Lewenberg






Re: [htdig] Problem with CGI

2000-07-24 Thread Gilles Detillieux

According to Geoff Hutchison:
> At 9:52 AM -0400 7/11/00, gil cohen wrote:
> >I don't think SSL support has been put in, nor will it ever. The 
> >thing I'm waiting for is multithreading.
> 
> As Gilles mentioned, the problem with SSL is one of patent problems 
> and export restrictions. Since this is an international community, it 
> makes it a bit hard to do certain things. Certainly when the RSA 
> patent expires in the U.S. in September, I believe life will be a bit 
> easier. AFAIK at the moment, code even with *hooks* to an SSL library 
> like OpenSSL would no longer fulfill the GPL in the U.S. since it 
> would have unnecessary restrictions due to the patent.
> 
> Then again, I'm not a legal expert by any means.
> 
> However, in the 3.2 code there is support for "external transport 
> scripts" so you can write a short script to grab a page based on a 
> URL. This can be used to retrieve HTTPS pages and I have tested this 
> approach with the program curl.
> 
> As far as multithreading, unless someone offers to do it, we'll be 
> waiting quite some time. It's certainly not on my TODO list since I 
> don't have the expertise to even know where to start.

Out of curiosity, how did "curl" deal with the patent and export
restrictions on RSA?  Do they have a non-GPL, non-export release
with https support, and one without?

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: [htdig] Htmerge: "Deleted, invalid"

2000-07-24 Thread Gilles Detillieux

According to Geoff Hutchison:
> 
> At 10:34 AM -0500 7/19/00, Gilles Detillieux wrote:
> >  > I use the standard MIPSpro compiler.  The script I use (thanks to my former
> >  > colleague James Hammick) to setup the Makefile is:
> 
> I have used SGI's compiler on quite a lot of code

Have you been able to build ht://Dig using SGI's compiler?  I may be
wrong, but I recall seeing several error reports from SGI users, and
the response was usually to use the GNU compiler.  I'm not saying SGI's
compiler is bad or buggy, just that I haven't heard of a successful
build of ht://Dig with it.  In all likelihood, it would be a problem in
the ht://Dig code that just doesn't manifest itself when built with the
GNU compiler.

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: AW: [htdig] Fw: & in Titles

2000-07-24 Thread Gilles Detillieux

According to Geoff Hutchison:
> At 3:51 PM -0500 7/7/00, Gilles Detillieux wrote:
> >In 3.2, these troublesome attributes have been removed, so the entities
> >are always translated.  What's really puzzling is that Roger Salisbury
> >has reported the same behaviour in 3.2.  I can't imagine how that could
> >happen unless his database was built with an earlier development snapshot
> >of 3.2, before the removal of translate_amp.
> 
> That would have to be an absolutely *ancient* version of 3.2, dating 
> from the first branch after 3.1.0. Almost the first thing I did in 
> 3.2 was to start writing the new SGML encoding procedures. They 
> weren't turned on immediately, but still... It's also fairly easy to 
> determine the version by using the -? option to get help on the 
> command-line utils.

That they weren't turned on is the main issue, though.  It turns out that
you didn't scrap the translate_* attributes, and fix HtSGMLCodec.cc to
always translate &quot; &amp; &lt; and &gt; until Feb 12, after 3.2.0b1
was released, so if Roger built his database with 3.2.0b1, without
explicitly turning on the translate_* attributes, and hasn't rebuilt it
since, then his database would include the untranslated SGML entities
in the excerpts, regardless of the version he's currently running.
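
To illustrate what the translation amounts to for the four SGML 
entities in question (double quote, ampersand, less-than and 
greater-than), here is a minimal Python sketch. It is not the code in 
HtSGMLCodec.cc, and the function name is hypothetical.

```python
# The ampersand entity is listed last so that text such as "&amp;lt;"
# decodes to "&lt;" rather than being decoded twice into "<".
ENTITIES = {
    "&quot;": '"',
    "&lt;": "<",
    "&gt;": ">",
    "&amp;": "&",
}


def translate_entities(text):
    """Replace the four basic SGML entities with their characters."""
    for entity, char in ENTITIES.items():
        text = text.replace(entity, char)
    return text
```

An excerpt stored without this step keeps the raw entity text, which 
is what then shows up in the search results.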

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






[htdig] Read only volume indexing

2000-07-24 Thread htdig

How does htdig handle indexing of html pages on read only volumes(cdroms)?
Will it only index it once and just skip the whole volume the next time?

Justin






[htdig] Using Htdig to produce an A-Z index for a site

2000-07-24 Thread Joe McFadden

Hi,

Has anyone used / adapted htdig to produce a site index page, i.e. a list of
keywords, with links to (say) the 5 pages most relevant to that keyword? I'd
imagine this might need to be a partly manual process - i.e. get it to spit out
a full index, then edit out less important keywords by hand. Any ideas/ advice
much appreciated.

thanks,

Joe
-- 
Joe McFadden  |   Email:  [EMAIL PROTECTED]
Web Development Manager   | Web:  http://www.icr.ac.uk/
Institute of Cancer Research  |  Direct:  020 7970 6064
123 Old Brompton Road |  Mobile:  0799 0513 710
LONDON SW7 3RP, UK| Fax:  020 7970 6019
Support the everyman campaign at http://www.icr.ac.uk/everyman/






Re: [htdig] Htmerge: "Deleted, invalid"

2000-07-24 Thread David Adams

Quoting Gilles Detillieux <[EMAIL PROTECTED]>:

> According to David Adams:
> > I use the standard MIPSpro compiler.  The script I use (thanks to my
> > former colleague James Hammick) to set up the Makefile is:
> > 
> > #!/bin/sh
> > CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS
> > CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS
> > LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib";
> > export LDFLAGS
> > ./configure --prefix=/opt/local/htdig-3.1.5 \
> >   --with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \
> >   --with-image-dir=/opt/local/htdig-3.1.5/graphics \
> >   --with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample
> > 
> > A lot of that is site-specific, and the "-rpath " option is only
> > needed because the compression library is not in a standard place on the 
> > machine on which htdig is run.
> > 
> > The "-woff all" option suppresses most warning messages.  I will remove it,
> > recompile htdig and send the result directly to Gilles, it might contain a
> > clue.
> 
> As Sinclair mentioned, 'you need to have the 2.95.2 gcc and the latest
> gnu "make".'  I don't know that anyone has ever gotten ht://Dig to work
> with SGI's own compiler.  If fact, we got a lot of reports from folks
> who couldn't even get it to compile.
> 
> If you're really determined to get to the bottom of this and make it work
> with the SGI compiler, I wish you well, but I doubt I can help much.
> I looked at the output you sent me, and didn't really see any red
> flags pointing to an obvious problem.  I know that the Serialize and
> Deserialize functions for the db.docdb records can be a tad finicky, so
> that would probably be a place to look.  There could also be problems
> with incorrect assumptions about word sizes, e.g. if the SGI compiler
> has 64-bit long ints.  I'd also look at the db.wordlist records (they're
> ASCII text) before and after htmerge, to see if htdig is actually telling
> htmerge to remove some of these documents, or if htmerge is deciding to
> do so on its own.
> 
> For the time being, the ht://Dig code hasn't had much of a workout on
> non-GNU compilers, so it doesn't seem to do well on them.  If you can
> help remedy that, great.  If you want to get the package working as
> quickly and easily as possible, I'd suggest trying the GNU C and C++
> compilers.
> 
> -- 
> Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
> 

I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a 
year and I have been very pleased with it.  I would say that we've given it a 
good workout here.  The problem with the "Deleted, invalid" messages only 
occurs with a second, relatively new search index.

The first index is made from a single run of htdig covering 33 servers, all in 
the local domain, and on this week's initial dig htmerge reports 49,233 
documents and not a single "Deleted, invalid".

The second index is made from two runs of htdig covering a total of 969 (yes 969 
!) servers using a proxy.  Htmerge reports a mere 3,096 documents and 86 
"Deleted, invalid".

I have looked at the db.wordlist files (which are written to only by htdig - is 
that right?) and it would appear that htdig is flagging the pages for htmerge 
to delete and is not finding any words in them.

I can advance these theories:

It is not a bug, but is due to the use of a proxy. (I use a proxy 
because without one, a portion of the sites on any run of htdig were 
found to be not responding or even unknown.  With a proxy, htdig appears
to have no such problems.)

It is a bug due to the use of a proxy.

It is a bug which only shows when compiled under IRIX.

It is a bug which only occurs when there are many different servers.

I intend to re-build the second index using htdig -vvv and perhaps learn 
something.

--
David Adams
<[EMAIL PROTECTED]>
Computing Services
Southampton University

