Re: [htdig] Does htmerge remove URL from database ?

2000-11-30 Thread Olivier Korn

At 09:30 27/11/2000 +, David Adams wrote:
>I found that the extra runs of htmerge were necessary when I was merging two
>runs of htdig.  Unless I ran both databases through htmerge before merging
>them I was getting
>
>Deleted, invalid:

I never had this problem.

>against some pages in the htmerge run.  Compared to the time required to run
>htdig, the extra htmerge runs are trivial, so you have little to lose by
>including them.

And this is what I've done, but with no success.

>Use the -v option with both htdig and htmerge and see if you get any message
>re the pages that don't appear in the final index.

I've got to try this out...


Olivier Korn.
Strasbourg, France.



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  <http://www.htdig.org/mail/menu.html>
FAQ:            <http://www.htdig.org/FAQ.html>




Re: [htdig] Does htmerge remove URL from database ?

2000-11-30 Thread Olivier Korn

At 22:07 25/11/2000 -0600, Geoff Hutchison wrote:
>At 2:21 PM +0100 11/23/00, Olivier Korn wrote:
>[snip]
>>Some of the web hosts are case sensitive and some are not. Could it be 
>>the source of my problem ?
>
>I wouldn't think so. But you have to be pretty careful that the URL 
>encodings are shared between your site.conf files. Personally, I make up a 
>"main.conf," include that in the other files and only set the start_url 
>and a minimal number of things in the individual site.conf files. In 
>particular, it makes it easy to change something in all config files at once.

I'm not sure what you mean by "to be careful that the URL 
encodings are shared between your site.conf files".

Each of my site#.conf files contains this "minimal number of things":
database_base:      ${database_dir}/site#
start_url:          http://www.site#.fr/somepath/
limit_urls_to:      ${start_url}    # or something else (it depends on the site #)
case_sensitive:     true            # or false (it depends on the site #)
remove_default_doc: default.htm     # or something else, it depends on...
                                    # ... the site # ! (you guessed ;-)
include:            ${config_dir}/_commun_include

And that's all (everything else is in _commun_include and is the same for 
each site #)

Well... How could I be sure that "the URL encodings are shared between my 
site#.conf files" ?
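One rough way to check this is to list every attribute whose value differs from one site#.conf to the next. The sketch below is only an illustration: the function name differing_attrs is made up, it reads only "name: value" lines set directly in each file, it does not follow include: lines, and it does not report attributes that appear in a single file. As far as I know, the attributes that govern URL encoding in the database are common_url_parts and url_part_aliases, but treat that as an assumption and check it against the attribute reference for your version.

```shell
#!/bin/sh
# Sketch: print the names of attributes whose values differ between
# the htdig config files given as arguments (hypothetical helper).
# Limitations: "include:" files are not followed, and an attribute
# present in only one file is not reported.
differing_attrs() {
    awk '
        /^[ \t]*#/ || !/:/ { next }        # skip comment lines and non-attribute lines
        {
            name = $0
            sub(/:.*/, "", name)           # text before the first colon
            gsub(/[ \t]/, "", name)
            value = $0
            sub(/^[^:]*:[ \t]*/, "", value)
            sub(/[ \t]+#.*/, "", value)    # drop a trailing "# ..." annotation
            sub(/[ \t]+$/, "", value)
            if (name == "include") next
            if (!(name in seen)) seen[name] = value
            else if (seen[name] != value) differs[name] = 1
        }
        END { for (n in differs) print n }
    ' "$@" | sort
}

# Usage (run in the directory holding the config files):
#   differing_attrs site*.conf
```

With a layout like the one above, the output should list only the per-site attributes (database_base, start_url, and so on). Anything else that shows up, especially an attribute related to URL encoding, would be worth a closer look before merging.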

Regards,
Olivier Korn.
Strasbourg, France.







Re: [htdig] Does htmerge remove URL from database ?

2000-11-27 Thread David Adams

I found that the extra runs of htmerge were necessary when I was merging two
runs of htdig.  Unless I ran both databases through htmerge before merging
them I was getting

Deleted, invalid:

against some pages in the htmerge run.  Compared to the time required to run
htdig, the extra htmerge runs are trivial, so you have little to lose by
including them.

Use the -v option with both htdig and htmerge and see if you get any message
re the pages that don't appear in the final index.
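A minimal sketch of that check, assuming the verbose output contains the "Deleted, invalid:" text exactly as quoted above. The helper name deleted_invalid is made up for illustration, and the site1.conf and site2.conf names in the usage comment are the placeholders used elsewhere in this thread.

```shell
#!/bin/sh
# Sketch: keep only the "Deleted, invalid:" lines from verbose output
# read on standard input (hypothetical helper name).
deleted_invalid() {
    # -F matches the text literally; "|| true" keeps the exit status
    # at zero when nothing matches, so an empty result is not an error.
    grep -F 'Deleted, invalid:' || true
}

# Usage (needs ht://Dig installed, so it is not run here):
#   htmerge -v -c site1.conf -m site2.conf 2>&1 | deleted_invalid
```

Redirecting standard error into the pipe is a precaution, since I am not sure which stream the verbose messages go to.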


----- Original Message -----
From: "Geoff Hutchison" <[EMAIL PROTECTED]>
To: "Olivier Korn" <[EMAIL PROTECTED]>
Cc: "Gilles Detillieux" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Sunday, November 26, 2000 4:07 AM
Subject: Re: [htdig] Does htmerge remove URL from database ?


> At 2:21 PM +0100 11/23/00, Olivier Korn wrote:
> >I tried it and it didn't solve the problem. BTW, I don't think that
> >these extra merges are necessary either.
>
> No, they should not be at all necessary unless there's truly
> something horrific wrong with the merging code--it only uses the
> files directly output from htdig. (My idea was that it would be
> faster if you didn't need to run htmerge on intermediate DB.)
>
> >Now, I run :
> >htmerge -c site#.conf
> >then
> >htmerge -c site1.conf -m site#.conf (with # > 1)
> >
> >If I then run
> >htsearch -c site5.conf with words="rénovation tourisme", it finds
> >the document (in first place.)
> >But if I do
> >htsearch -c site1.conf with the same words, it returns the "nomatch" document.
> >
> >Some of the web hosts are case sensitive and some are not. Could it
> >be the source of my problem ?
>
> I wouldn't think so. But you have to be pretty careful that the URL
> encodings are shared between your site.conf files. Personally, I make
> up a "main.conf," include that in the other files and only set the
> start_url and a minimal number of things in the individual site.conf
> files. In particular, it makes it easy to change something in all
> config files at once.
>
> --
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/
>







Re: [htdig] Does htmerge remove URL from database ?

2000-11-25 Thread Geoff Hutchison

At 2:21 PM +0100 11/23/00, Olivier Korn wrote:
>I tried it and it didn't solve the problem. BTW, I don't think that 
>these extra merges are necessary either.

No, they should not be at all necessary unless there's truly 
something horrific wrong with the merging code--it only uses the 
files directly output from htdig. (My idea was that it would be 
faster if you didn't need to run htmerge on intermediate DB.)

>Now, I run :
>htmerge -c site#.conf
>then
>htmerge -c site1.conf -m site#.conf (with # > 1)
>
>If I then run
>htsearch -c site5.conf with words="rénovation tourisme", it finds 
>the document (in first place.)
>But if I do
>htsearch -c site1.conf with the same words, it returns the "nomatch" document.
>
>Some of the web hosts are case sensitive and some are not. Could it 
>be the source of my problem ?

I wouldn't think so. But you have to be pretty careful that the URL 
encodings are shared between your site.conf files. Personally, I make 
up a "main.conf," include that in the other files and only set the 
start_url and a minimal number of things in the individual site.conf 
files. In particular, it makes it easy to change something in all 
config files at once.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






Re: [htdig] Does htmerge remove URL from database ?

2000-11-23 Thread Olivier Korn

At 12:35 22/11/2000 -0600, Gilles Detillieux wrote:
> > 4. After all the sites have been htdigged, I run htmerge in sequence in
> > order to merge all the small databases into one.
> > First call is "htmerge -c site1.conf", subsequent calls are "htmerge -c
> > site1.conf -m site2.conf", "htmerge -c site1.conf -m site3.conf", (and so on.)
>...
> > 2. Now let's hear the amazing part of my story. If I do a "htmerge -c
> > site5.conf" (notice there is no -m this time.) and if I htsearch -c
> > site5.conf with "rénovation tourisme" my document is said to be found !
> > Said in another way, the document was indexed but was certainly ripped out
> > when merging with another database.
>
>I think after each separate htdig -i -c site#.conf you should run a
>separate htmerge -c site#.conf, not just on the first site, before you
>merge everything together.  Try that and see if it solves the problem.
>I think the intention was that these extra merges should not have been
>necessary, but this has come up before, and I think there's a problem
>with merging multiple DBs when they haven't already been cleaned up by
>a simple htmerge.

I tried it and it didn't solve the problem. BTW, I don't think that these 
extra merges are necessary either.

Now, I run :
htmerge -c site#.conf
then
htmerge -c site1.conf -m site#.conf (with # > 1)

If I then run
htsearch -c site5.conf with words="rénovation tourisme", it finds the 
document (in first place.)
But if I do
htsearch -c site1.conf with the same words, it returns the "nomatch" document.

Some of the web hosts are case sensitive and some are not. Could it be the 
source of my problem ?

What are the rules for htmerge ? When does it really remove URLs from 
the database ?

--
Olivier Korn
Strasbourg, France.







Re: [htdig] Does htmerge remove URL from database ?

2000-11-22 Thread Gilles Detillieux

According to Olivier Korn:
> 3. Once a week, htdig is called on each site with "htdig -i -c site1.conf" 
> then "htdig -i -c site2.conf", (and so on.)
> 
> 4. After all the sites have been htdigged, I run htmerge in sequence in 
> order to merge all the small databases into one.
> First call is "htmerge -c site1.conf", subsequent calls are "htmerge -c 
> site1.conf -m site2.conf", "htmerge -c site1.conf -m site3.conf", (and so on.)
...
> 2. Now let's hear the amazing part of my story. If I do a "htmerge -c 
> site5.conf" (notice there is no -m this time.) and if I htsearch -c 
> site5.conf with "rénovation tourisme" my document is said to be found ! 
> Said in another way, the document was indexed but was certainly ripped out 
> when merging with another database.

I think after each separate htdig -i -c site#.conf you should run a
separate htmerge -c site#.conf, not just on the first site, before you
merge everything together.  Try that and see if it solves the problem.
I think the intention was that these extra merges should not have been
necessary, but this has come up before, and I think there's a problem
with merging multiple DBs when they haven't already been cleaned up by
a simple htmerge.
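The sequence suggested here can be sketched as a dry run that only prints the commands, so it can be read and checked without ht://Dig installed. The function name plan_index and the site1, site2, site3 names are placeholders; the htdig and htmerge options are the ones already used in this thread.

```shell
#!/bin/sh
# Dry-run sketch: print the commands in order instead of running them.
# Each argument is a config name without its ".conf" suffix (placeholder names).
plan_index() {
    [ "$#" -ge 1 ] || return 1
    first=$1
    # Step 1: dig each site from scratch, then clean up its own database.
    for site in "$@"; do
        echo "htdig -i -c $site.conf"
        echo "htmerge -c $site.conf"
    done
    # Step 2: merge every other site's database into the first one.
    shift
    for site in "$@"; do
        echo "htmerge -c $first.conf -m $site.conf"
    done
}

plan_index site1 site2 site3
```

Piping the printed lines to sh would run them for real. Whether the extra per-site htmerge pass is actually needed is the open question here, so treat this as one ordering to try rather than a confirmed fix.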

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

