Re: htdig: Pages get indexed, but no results: BUG?

1998-12-22 Thread Gilles Detillieux

Last week I wrote:
> According to Andriu:
> > I actually did run htdig -sivvv and I did see that the pages which are linked from
> > products.htm were definitely indexed - there are at least 100 pages linked from
> > products.htm so it's certain that they have been indexed.
> >
> > But when I search with keywords from these pages, htsearch does not find any
> > results - I made several tests, so I'm sure.
> >
> > htsearch finds pages from that site which do not start from products.htm - no
> > problem there.
> >
> > That is why I'm assuming that the pages get indexed and then deleted again.
> >
> > That is also why I think that it does not help when I take a start URL starting
> > directly from products.htm.
> 
> Well, I throw my hands up on this one.  I was able to reproduce the
> problem here, with 3.1.0b3, but I'm at a loss to explain it.  As far as I
> can tell htdig will index the file if it sees the lowercase URL first,
> and fail to index it if it sees the uppercase URL first.  However,
> it wasn't showing up in the search.  I'm baffled.  Also, if you put a
> page that contains the lowercase URL as the first page in the start_url
> list, it doesn't quite work either.  In this case, the page shows up in
> the search, but it shows up with the uppercase URL!  Weird.  However,
> if you put the products.htm page itself as the first URL in start_url,
> it does seem to work - at least with 3.1.0b3.  But you have to explicitly
> give the limit_urls_to, or htdig seems to get confused.

OK, after looking at Retriever.cc a bit more, for another reason, I came
across something that I think explains some of this behaviour.  Every time
htdig sees an href, it updates the docdb record for that URL, to update
the backlink count and add the new description text.  It also sets the URL
in the database to the URL it got in the latest href.  I'm not sure why
it does this, but it would explain the strange behaviour.  So, to avoid
problems with the missing upper-case file name, you'd have to make sure
that htdig sees the lower-case file name first, so it actually digs the
real document rather than getting an error, plus you have to make sure
that the last href to that document that htdig sees has the lower-case
file name as well, or else the wrong file name ends up in the docdb!
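
A minimal sketch of the docdb behaviour described above, written in Python
rather than htdig's C++.  The names (DocDB, see_href) are hypothetical and
not taken from Retriever.cc, and the sketch assumes the record is keyed by
the lower-cased URL.  It only illustrates how the last href seen decides
which spelling is stored.

```python
class DocDB:
    """Toy model of the document database (assumed keyed by lower-cased URL)."""

    def __init__(self):
        self.records = {}

    def see_href(self, url, description):
        """Update the record for url each time an href to it is seen."""
        key = url.lower()
        rec = self.records.setdefault(
            key, {"url": url, "backlinks": 0, "descriptions": []}
        )
        rec["backlinks"] += 1                     # one more backlink
        rec["descriptions"].append(description)   # add the link text
        rec["url"] = url                          # the latest href overwrites the URL
        return rec


db = DocDB()
db.see_href("http://silly.server.com/products.htm", "roducts and Services")
rec = db.see_href("http://silly.server.com/PRODUCTS.HTM", "P")
print(rec["url"], rec["backlinks"])
```

In this model the stored URL ends up as the upper-case one simply because
that href was seen last, even though the lower-case href was seen first.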

That still doesn't explain why in some cases pages appeared to be dug,
but didn't show up in the searches.  Maybe that'll come to me sometime
in the new year.  :)  I still maintain that this isn't a problem on a
properly configured server, with properly set up hrefs in your documents,
so I don't think I'll go out of my way to solve this one.

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
--
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the body of the message.



Re: htdig: Pages get indexed, but no results: BUG?

1998-12-18 Thread Gilles Detillieux

According to Andriu:
> I actually did run htdig -sivvv and I did see that the pages which are linked from
> products.htm were definitely indexed - there are at least 100 pages linked from
> products.htm so it's certain that they have been indexed.
>
> But when I search with keywords from these pages, htsearch does not find any
> results - I made several tests, so I'm sure.
>
> htsearch finds pages from that site which do not start from products.htm - no
> problem there.
>
> That is why I'm assuming that the pages get indexed and then deleted again.
>
> That is also why I think that it does not help when I take a start URL starting
> directly from products.htm.

Well, I throw my hands up on this one.  I was able to reproduce the
problem here, with 3.1.0b3, but I'm at a loss to explain it.  As far as I
can tell htdig will index the file if it sees the lowercase URL first,
and fail to index it if it sees the uppercase URL first.  However,
it wasn't showing up in the search.  I'm baffled.  Also, if you put a
page that contains the lowercase URL as the first page in the start_url
list, it doesn't quite work either.  In this case, the page shows up in
the search, but it shows up with the uppercase URL!  Weird.  However,
if you put the products.htm page itself as the first URL in start_url,
it does seem to work - at least with 3.1.0b3.  But you have to explicitly
give the limit_urls_to, or htdig seems to get confused.

> Also I dig several sites, so would it make sense to use limit urls to?

It's hard to avoid using it, I think.  By default it's set to ${start_url},
which works for simple cases where start_url lists the main page of one or
more web sites.  However, when you get htdig to start at a deeply nested
page somewhere on one site, you need to explicitly set limit_urls_to to
include everything you want included, from all sites you dig.  E.g. you
can set it to the list of main page URLs for every site you dig:

limit_urls_to:  http://www.mysite.com/ \
                http://www.alma.mater.edu/ \
                http://www.htdig.org/ \
                http://www.something.org/
start_url:      http://www.mysite.com/my/ownpage/products.htm \
                http://www.mysite.com/ \
                http://www.alma.mater.edu/ \
                http://www.htdig.org/ \
                http://www.something.org/
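
To see why the default of ${start_url} is too narrow once a deeply nested
page is listed, here is a rough sketch in Python.  It assumes that
limit_urls_to is applied as a plain substring match against each candidate
URL; the function name within_limits is hypothetical and this is not
htdig's actual code.

```python
def within_limits(url, limit_urls_to):
    """Return True if url contains at least one of the limit patterns."""
    return any(pattern in url for pattern in limit_urls_to)


# Explicit limits: the main page URL of every site being dug.
LIMITS = [
    "http://www.mysite.com/",
    "http://www.alma.mater.edu/",
    "http://www.htdig.org/",
    "http://www.something.org/",
]

# What the limit would be if it only held the deeply nested start page.
NESTED_ONLY = ["http://www.mysite.com/my/ownpage/products.htm"]

page = "http://www.mysite.com/other/page.html"
print(within_limits(page, LIMITS))
print(within_limits(page, NESTED_ONLY))
```

Under that assumption, a page elsewhere on www.mysite.com passes the
explicit list but is rejected when the only pattern is the nested
products.htm URL.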

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: htdig: Pages get indexed, but no results: BUG?

1998-12-17 Thread Netsolution Internet Consulting



Gilles Detillieux wrote:

>
> Who says it's deleting anything?  Does an htdig -vvv seem to suggest that?
>
> What I'm suggesting is that htdig sees the href to PRODUCTS.HTM before any
> href to products.htm, and so it queues up the upper-case URL, but marks
> the lower-case URL as visited (because all visits are recorded in lower
> case).  So, it tries to get PRODUCTS.HTM, and fails, so it never sees the
> real file.  Whenever it sees any of the good hrefs to products.htm, it
> thinks the file was already visited, so it doesn't queue it up again.
>
> Do you have any hard evidence that htdig is indeed fetching products.htm
> from the server, and deleting its hrefs?

I actually did run htdig -sivvv and I did see that the pages which are linked from
products.htm were definitely indexed - there are at least 100 pages linked from
products.htm so it's certain that they have been indexed.

But when I search with keywords from these pages, htsearch does not find any
results - I made several tests, so I'm sure.

htsearch finds pages from that site which do not start from products.htm - no
problem there.

That is why I'm assuming that the pages get indexed and then deleted again.

That is also why I think that it does not help when I take a start URL starting
directly from products.htm.

Also I dig several sites, so would it make sense to use limit urls to?

Andriu






Re: htdig: Pages get indexed, but no results: BUG?

1998-12-17 Thread Gilles Detillieux

According to Rodger Zeisler:
> htdig has a case_sensitive option that would make PRODUCTS.HTM and
> products.htm appear the same.  Since you have 3 character extensions (.htm),
> I am assuming that you are on NT not Unix (.html).  NT (and MS Windows) is
> case insensitive.

The case_sensitive option currently affects only parsing of disallow
statements in the robots.txt file, and not how htdig keeps track of
visited documents.  However, if you want to change how htdig keeps
track of visits, as I suggested earlier, it would be wise to make that
conditional on this option.  Thanks for pointing it out.

The .htm extensions have, unfortunately, polluted a great many Unix
servers over the past few years, as many web developers use M$ systems,
and stick to that ugly 3 character extension limit they've carried over
from DOS, even though Win95 & NT no longer impose that limit.  So, it's
not a safe assumption that a server that has .htm files is not Unix-based.
If Andriu claims it's a Unix server, I'll take his word for it.  In any
case, the problem stems from the fact that the developers assumed the
server was case insensitive, when in fact it's case sensitive, and
therefore not an NT server.  That's why the href to PRODUCTS.HTM fails.
It should be lower-case, and the server cares which case is used.

According to Andriu Isenring Ritsch:
> I just wondered, when PRODUCTS.HTM returns 404 and products.htm returns OK and has
> a lot of links on it to other pages etc., why does htdig assume that the link is
> not ok and delete all pages that have been retrieved starting from the
> products.htm page? I mean, htdig got the pages and deletes them again - they
> definitely must exist, so why delete them?

Who says it's deleting anything?  Does an htdig -vvv seem to suggest that?

What I'm suggesting is that htdig sees the href to PRODUCTS.HTM before any
href to products.htm, and so it queues up the upper-case URL, but marks
the lower-case URL as visited (because all visits are recorded in lower
case).  So, it tries to get PRODUCTS.HTM, and fails, so it never sees the
real file.  Whenever it sees any of the good hrefs to products.htm, it
thinks the file was already visited, so it doesn't queue it up again.
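
The sequence above can be sketched as a toy crawl in Python.  This is only
a model under stated assumptions: the site contents, the crawl function and
its data structures are made up for illustration, and the only behaviour
borrowed from htdig is that visits are recorded in lower case while the
queue keeps the href as written.

```python
from collections import deque

# Hypothetical site: only the lower-case file exists on the server.
EXISTING = {
    "http://silly.server.com/",
    "http://silly.server.com/products.htm",
    "http://silly.server.com/item1.htm",
}
LINKS = {
    "http://silly.server.com/": [
        "http://silly.server.com/PRODUCTS.HTM",  # bad href, seen first
        "http://silly.server.com/products.htm",  # good href, seen second
    ],
    "http://silly.server.com/products.htm": [
        "http://silly.server.com/item1.htm",
    ],
}


def crawl(start_urls):
    """Return the URLs successfully fetched, in fetch order."""
    visited = set()   # lower-cased URLs already queued
    queue = deque()   # URLs to fetch, in the case they were written
    fetched = []

    def enqueue(url):
        key = url.lower()
        if key not in visited:
            visited.add(key)
            queue.append(url)

    for url in start_urls:
        enqueue(url)
    while queue:
        url = queue.popleft()
        if url not in EXISTING:
            continue  # the server answers 404, nothing to parse
        fetched.append(url)
        for href in LINKS.get(url, []):
            enqueue(href)
    return fetched


print(crawl(["http://silly.server.com/"]))
print(crawl(["http://silly.server.com/products.htm",
             "http://silly.server.com/"]))
```

In this model, starting from the site root fetches only the root page,
because the bad href claims the lower-cased key first.  Listing
products.htm ahead of the root in the start list fetches all three pages.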

Do you have any hard evidence that htdig is indeed fetching products.htm
from the server, and deleting its hrefs?

> One could also do it the other way around: htdig could know that one link was not
> ok, but since it was able to follow the page at some point, there must be a valid
> page with that name, so the links from that page must also be valid etc.  (htdig
> could assume that there was an uppercase/lowercase problem).
>
> What do you think about that?

The way I read the code, I can't see how htdig would have tried to dig
both products.htm and PRODUCTS.HTM in the same run.  If it sees the
real file first, it shouldn't try the bogus one at all, and if it sees
the bogus one first, it seems it would ignore the real one, because it
thinks it's the same file.

> (I know, fixing the link should be easier, but still...)

Personally, I find it ridiculous that they want you to put all this work
into setting up and demoing a search engine, and they can't be bothered
to find one person to take 30 seconds to fix one bad URL in one document.
But that's life, I guess.

How's this for a workaround:  put the products.htm file as the first
URL in your start_url option, to make sure htdig sees it before the
bogus href.  If you do so, and your limit_urls_to option refers to
${start_url}, as it does by default, you may want to explicitly specify
the limit_urls_to option.  E.g.:

start_url:      http://silly.server.com/some/path/to/products.htm \
                http://silly.server.com/
limit_urls_to:  http://silly.server.com/


-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: htdig: Pages get indexed, but no results: BUG?

1998-12-17 Thread Rodger Zeisler

htdig has a case_sensitive option that would make PRODUCTS.HTM and
products.htm appear the same.  Since you have 3 character extensions (.htm),
I am assuming that you are on NT not Unix (.html).  NT (and MS Windows) is
case insensitive.


Rodger Zeisler
Everest Software Corp. - http://www.outsourcing-mgmt.com/ - Helping You
Manage Software
InfoServer LLC - http://www.infoserver.com - The Journal For Strategic
Outsourcing Information
[EMAIL PROTECTED]
Work 972.980.0013 x738
Home 972.390.0206


- Original Message -
From: Andriu Isenring Ritsch <[EMAIL PROTECTED]>
To: Gilles Detillieux <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, December 17, 1998 10:57 AM
Subject: Re: htdig: Pages get indexed, but no results: BUG?


>
>
>Gilles Detillieux wrote:
>
>> According to Andriu Isenring Ritsch:
>> > I've noticed, that the links to the pages that don't seem to get indexed
>> > all start on one page called products.htm
>> >
>> > Now there is the problem that the page that links to products.htm has
>> > two links, one to products.htm and one to PRODUCTS.HTM.
>> >
>> > Because it's a Unix server and the page name is really products.htm,
>> > PRODUCTS.HTM gives a page not found error.
>> >
>> > Is it now possible, that htdig removes all pages indexed starting from
>> > products.htm, because PRODUCTS.HTM was not found?
>> >
>> > It seems to me like that...
>> >
>> > What workaround is there? Unfortunately most of the site is linked from
>> > products.htm
>>
>> Is fixing the defective link not an option?  If not, how about making
>> sure htdig sees the good one before the bad one, somehow?
>>
>> As it stands now, htdig keeps track of visited URLs by mapping them to
>> lower-case.  This is valid for case-insensitive servers, but can be a
>> problem with case-sensitive ones - when they're not set up properly!
>> If you're careful to set up your links properly, and you don't use the
>> same name twice (one lower- and one upper-case) for different documents,
>> it shouldn't be a problem.
>>
>> The only other "fix" would be to edit htdig/Retriever.cc, and find all
>> instances where the "visited" object is used.  These are preceded by
>> an url.lowercase() or temp.lowercase(), to map the URL to lower-case.
>> You'd need to remove these, or replace them with calls to a function
>> that would only map the first part of the URL (http://host.name.dom/)
>> to lower-case, and leave the rest of the path as mixed case.  Removing
>> the lowercase() calls altogether would mean you'd have to be consistent
>> in the case used in the hostname part of the URLs - probably not a safe
>> assumption given the fact that your site isn't even consistent in the
>> case expected for the document names.  Mapping the first part of the
>> URL, but leaving the path as mixed case would solve your problem, but
>> could pose a problem if you index any case-insensitive servers.
>>
>> So, to answer the question you pose in the subject line, it's not a bug,
>> it's a feature!  :-)
>
>I just wondered, when PRODUCTS.HTM returns 404 and products.htm returns OK and has
>a lot of links on it to other pages etc., why does htdig assume that the link is
>not ok and delete all pages that have been retrieved starting from the
>products.htm page? I mean, htdig got the pages and deletes them again - they
>definitely must exist, so why delete them?
>One could also do it the other way around: htdig could know that one link was not
>ok, but since it was able to follow the page at some point, there must be a valid
>page with that name, so the links from that page must also be valid etc. (htdig
>could assume that there was an uppercase/lowercase problem).
>
>What do you think about that?
>
>(I know, fixing the link should be easier, but still...)
>
>Andriu
>




Re: htdig: Pages get indexed, but no results: BUG?

1998-12-17 Thread Andriu Isenring Ritsch

>
> >What workaround is there? Unfortunately most of the site is linked from
> >products.htm
>
> Why is there a link to PRODUCTS.HTM if the server will return a 404? Can
> you take out that link? [Geoff ]

I can't - it's a very big company, and every modification takes time.

They did not know about the mistake until htdig found out (so thanks for
that feature ;-)

The link is only a small part of the complete link, it is the "P" from a
link called "Products and Services", unfortunately the "P" has the link to
"PRODUCTS.HTM" that affects the whole indexing process.

Anyway, they are going to fix it someday, I just asked about a workaround
because we have a presentation tomorrow, and as I said, the main part of
that site could not be indexed because of this small error.

Now I just have to use keywords that HAVE been indexed for the presentation
and nobody will notice...

Thanks
Andriu




Re: htdig: Pages get indexed, but no results: BUG?

1998-12-17 Thread Andriu Isenring Ritsch



Gilles Detillieux wrote:

> According to Andriu Isenring Ritsch:
> > I've noticed, that the links to the pages that don't seem to get indexed
> > all start on one page called products.htm
> >
> > Now there is the problem that the page that links to products.htm has
> > two links, one to products.htm and one to PRODUCTS.HTM.
> >
> > Because it's a Unix server and the page name is really products.htm,
> > PRODUCTS.HTM gives a page not found error.
> >
> > Is it now possible, that htdig removes all pages indexed starting from
> > products.htm, because PRODUCTS.HTM was not found?
> >
> > It seems to me like that...
> >
> > What workaround is there? Unfortunately most of the site is linked from
> > products.htm
>
> Is fixing the defective link not an option?  If not, how about making
> sure htdig sees the good one before the bad one, somehow?
>
> As it stands now, htdig keeps track of visited URLs by mapping them to
> lower-case.  This is valid for case-insensitive servers, but can be a
> problem with case-sensitive ones - when they're not set up properly!
> If you're careful to set up your links properly, and you don't use the
> same name twice (one lower- and one upper-case) for different documents,
> it shouldn't be a problem.
>
> The only other "fix" would be to edit htdig/Retriever.cc, and find all
> instances where the "visited" object is used.  These are preceded by
> an url.lowercase() or temp.lowercase(), to map the URL to lower-case.
> You'd need to remove these, or replace them with calls to a function
> that would only map the first part of the URL (http://host.name.dom/)
> to lower-case, and leave the rest of the path as mixed case.  Removing
> the lowercase() calls altogether would mean you'd have to be consistent
> in the case used in the hostname part of the URLs - probably not a safe
> assumption given the fact that your site isn't even consistent in the
> case expected for the document names.  Mapping the first part of the
> URL, but leaving the path as mixed case would solve your problem, but
> could pose a problem if you index any case-insensitive servers.
>
> So, to answer the question you pose in the subject line, it's not a bug,
> it's a feature!  :-)

I just wondered, when PRODUCTS.HTM returns 404 and products.htm returns OK and has
a lot of links on it to other pages etc., why does htdig assume that the link is
not ok and delete all pages that have been retrieved starting from the
products.htm page? I mean, htdig got the pages and deletes them again - they
definitely must exist, so why delete them?
One could also do it the other way around: htdig could know that one link was not
ok, but since it was able to follow the page at some point, there must be a valid
page with that name, so the links from that page must also be valid etc. (htdig
could assume that there was an uppercase/lowercase problem).

What do you think about that?

(I know, fixing the link should be easier, but still...)

Andriu




Re: htdig: Pages get indexed, but no results: BUG?

1998-12-17 Thread Geoff Hutchison

At 1:11 PM -0500 12/16/98, Andriu Isenring Ritsch wrote:
>Is it now possible, that htdig removes all pages indexed starting from
>products.htm, because PRODUCTS.HTM was not found?

This is very likely.

>What workaround is there? Unfortunately most of the site is linked from
>products.htm

Why is there a link to PRODUCTS.HTM if the server will return a 404? Can
you take out that link?


-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/





Re: htdig: Pages get indexed, but no results: BUG?

1998-12-17 Thread Gilles Detillieux

According to Andriu Isenring Ritsch:
> I've noticed, that the links to the pages that don't seem to get indexed
> all start on one page called products.htm
> 
> Now there is the problem that the page that links to products.htm has
> two links, one to products.htm and one to PRODUCTS.HTM.
> 
> Because it's a Unix server and the page name is really products.htm,
> PRODUCTS.HTM gives a page not found error.
> 
> Is it now possible, that htdig removes all pages indexed starting from
> products.htm, because PRODUCTS.HTM was not found?
> 
> It seems to me like that...
> 
> What workaround is there? Unfortunately most of the site is linked from
> products.htm

Is fixing the defective link not an option?  If not, how about making
sure htdig sees the good one before the bad one, somehow?

As it stands now, htdig keeps track of visited URLs by mapping them to
lower-case.  This is valid for case-insensitive servers, but can be a
problem with case-sensitive ones - when they're not set up properly!
If you're careful to set up your links properly, and you don't use the
same name twice (one lower- and one upper-case) for different documents,
it shouldn't be a problem.

The only other "fix" would be to edit htdig/Retriever.cc, and find all
instances where the "visited" object is used.  These are preceded by
an url.lowercase() or temp.lowercase(), to map the URL to lower-case.
You'd need to remove these, or replace them with calls to a function
that would only map the first part of the URL (http://host.name.dom/)
to lower-case, and leave the rest of the path as mixed case.  Removing
the lowercase() calls altogether would mean you'd have to be consistent
in the case used in the hostname part of the URLs - probably not a safe
assumption given the fact that your site isn't even consistent in the
case expected for the document names.  Mapping the first part of the
URL, but leaving the path as mixed case would solve your problem, but
could pose a problem if you index any case-insensitive servers.
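
As a rough illustration of the replacement function suggested above, here
is a Python sketch (htdig itself is C++, so this is not a patch).  The
function name visited_key and its case_sensitive parameter are
hypothetical.  It lower-cases only the scheme and host part of the URL and
leaves the path untouched, with a switch to fall back to lower-casing
everything for case-insensitive servers.

```python
from urllib.parse import urlsplit, urlunsplit


def visited_key(url, case_sensitive=True):
    """Build the key used to remember that a URL was already queued."""
    if not case_sensitive:
        # Behaviour described in this thread: the whole URL is lower-cased.
        return url.lower()
    parts = urlsplit(url)
    # Host names are case-insensitive, so normalise scheme and host only.
    # (This also lower-cases any user:password prefix in the host part.)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,       # path keeps its original case
        parts.query,
        parts.fragment,
    ))


print(visited_key("HTTP://Host.Name.Dom/Some/PRODUCTS.HTM"))
print(visited_key("http://host.name.dom/Some/products.htm"))
```

With case_sensitive left on, PRODUCTS.HTM and products.htm produce
different keys, so a failed fetch of one would not block the other.  The
cost, as noted above, is that a case-insensitive server could then be dug
twice under differently-cased URLs.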

So, to answer the question you pose in the subject line, it's not a bug,
it's a feature!  :-)

-- 
Gilles R. Detillieux  E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



htdig: Pages get indexed, but no results: BUG?

1998-12-16 Thread Andriu Isenring Ritsch

I've noticed, that the links to the pages that don't seem to get indexed
all start on one page called products.htm

Now there is the problem that the page that links to products.htm has
two links, one to products.htm and one to PRODUCTS.HTM.

Because it's a Unix server and the page name is really products.htm,
PRODUCTS.HTM gives a page not found error.

Is it now possible, that htdig removes all pages indexed starting from
products.htm, because PRODUCTS.HTM was not found?

It seems to me like that...

What workaround is there? Unfortunately most of the site is linked from
products.htm

Thanks
Andriu


Hello

When running htdig -isv I can see all pages that get indexed.

Now I take some words from a page that I have seen has been indexed.
These words are unique to that page.

When I search for these words, no results are displayed, that means,
htsearch can not find a page with these words, although the page was
indexed.

(Yes, I did run htmerge, and max_head_length is 50'000, and the words I'm
looking for are within these 50'000)

What could be the problem?

It affects only some pages (especially on one site it affects most of
the pages), and with others there is no problem...

Thanks
Andriu



htdig: Pages get indexed, but no results

1998-12-16 Thread Andriu Isenring Ritsch

Hello

When running htdig -isv I can see all pages that get indexed.

Now I take some words from a page that I have seen has been indexed.
These words are unique to that page.

When I search for these words, no results are displayed, that means,
htsearch can not find a page with these words, although the page was
indexed.

(Yes, I did run htmerge, and max_head_length is 50'000, and the words I'm
looking for are within these 50'000)

What could be the problem?

It affects only some pages (especially on one site it affects most of
the pages), and with others there is no problem...

Thanks
Andriu
