[htdig] Does htmerge remove URLs from databases when merging ?
Hi,

I tried everything that was proposed a few weeks ago, but nothing worked (even with -v -v -v -v -v, my document was not marked as deleted).

So here are three config files that are enough to reproduce my problem. (htmerge-problem.zip was zipped with WinZip but is readable with the free unzip for Unix that can be found on several web sites.)

I'm using locale fr_FR (look at include.conf). That may be inconvenient for you, but I don't have any other short example for the moment (I could try to find one that doesn't use accented words if you like).

Operations:

1. Unzip the three files into the "config_dir" of ht://Dig.
2. htdig -c ${config_dir}/site1.conf
3. htdig -c ${config_dir}/site2.conf
4. htmerge -c ${config_dir}/site1.conf
5. htmerge -c ${config_dir}/site2.conf
6. htmerge -c ${config_dir}/site1.conf -m ${config_dir}/site2.conf
7. htsearch -c ${config_dir}/site2.conf
   7.1. words="rénovation tourisme" (without quotes)
   7.2. htsearch finds http://www.ac-orleans-tours.fr/tourisme/renovation.html (in first place)
8. htsearch -c ${config_dir}/site1.conf
   8.1. words="rénovation tourisme" (as before)
   8.2. htsearch returns the "no match found" page.

It's clear: htmerge lost the http://www.ac-orleans-tours.fr/tourisme/renovation.html page when merging the two databases.

Any suggestions?

Sincerely yours,
Olivier Korn,
Strasbourg, France.

htmerge-problem.zip

To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED]
You will receive a message to confirm this.
List archives: http://www.htdig.org/mail/menu.html
FAQ: http://www.htdig.org/FAQ.html
Re: [htdig] Does htmerge remove URL from database ?
At 22:07 25/11/2000 -0600, Geoff Hutchison wrote:
> At 2:21 PM +0100 11/23/00, Olivier Korn wrote:
> [snip]
>> Some of the web hosts are case sensitive and some are not. Could it be the source of my problem?
>
> I wouldn't think so. But you have to be pretty careful that the URL encodings are shared between your site.conf files. Personally, I make up a "main.conf," include that in the other files and only set the start_url and a minimal number of things in the individual site.conf files. In particular, it makes it easy to change something in all config files at once.

I'm not sure what you mean by "to be careful that the URL encodings are shared between your site.conf files".

Each of my site#.conf files contains this "minimal number of things":

    database_base: ${database_dir}/site#
    start_url: http://www.site#.fr/somepath/
    limit_urls_to: ${start_url}        # or something else (it depends on the site #)
    case_sensitive: true               # or false (it depends on the site #)
    remove_default_doc: default.htm    # or something else, it depends on...
                                       # ... the site # ! (you guessed ;-)
    include: ${config_dir}/_commun_include

And that's all (everything else is in _commun_include and is the same for each site #).

So how can I be sure that "the URL encodings are shared between my site#.conf files"?

Regards,
Olivier Korn.
Strasbourg, France.
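One rough way to check that the per-site files really set only that "minimal number of things" is to list every attribute each file assigns and flag anything outside an allow-list. The sketch below is only an illustration: the function name extra_attributes and the allow-list are hypothetical (the list is copied from the message above), and it assumes every setting is written as "name: value" on one line. Whatever it prints is a candidate for moving into the shared include. That matters most for attributes that affect how URLs are encoded in the database, which, as far as I know, are common_url_parts and url_part_aliases.

```shell
#!/bin/sh
# Attributes each site#.conf is expected to set (taken from the message above).
ALLOWED="database_base start_url limit_urls_to case_sensitive remove_default_doc include"

# Print every attribute name assigned in the given config file that is not in
# $ALLOWED. Lines are assumed to look like "name: value"; comment lines
# starting with '#' and blank lines never match the pattern and are skipped.
extra_attributes() {
    conf=$1
    sed -n 's/^[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\)[[:space:]]*:.*/\1/p' "$conf" |
    sort -u |
    while read -r name; do
        case " $ALLOWED " in
            *" $name "*) ;;        # expected per-site attribute
            *) echo "$name" ;;     # set per site, but not in the allow-list
        esac
    done
}

# Usage: sh check_conf.sh site1.conf site2.conf ...
for conf in "$@"; do
    extra_attributes "$conf" | while read -r name; do
        echo "$conf: $name"
    done
done
```

An empty result for every file suggests that all the sites take their remaining settings from the shared include.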
Re: [htdig] Does htmerge remove URL from database ?
At 09:30 27/11/2000 +, David Adams wrote:
> I found that the extra runs of htmerge were necessary when I was merging two runs of htdig. Unless I ran both databases through htmerge before merging them I was getting
>     Deleted, invalid:

I never had this problem.

> against some pages in the htmerge run. Compared to the time required to run htdig, the extra htmerge runs are trivial, so you have little to lose by including them.

And this is what I've done, but with no success.

> Use the -v option with both htdig and htmerge and see if you get any message about the pages that don't appear in the final index.

I've got to try this out...

Olivier Korn.
Strasbourg, France.
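David's suggestion can be turned into a small check: capture the verbose output of each run in a file, then search it for the missing page. The sketch below assumes the log was captured with something like "htmerge -v -c site1.conf -m site2.conf > merge.log 2>&1". The function names and the log file name are hypothetical, and the only message text it relies on is the "Deleted, invalid" prefix quoted above.

```shell
#!/bin/sh
# Print, with line numbers, every log line that mentions the given URL.
# -F makes this a fixed-string match, so dots in the URL are not treated
# as regular-expression wildcards.
lines_for_url() {
    url=$1
    log=$2
    grep -n -F -- "$url" "$log"
}

# Print only the lines for that URL that carry the "Deleted, invalid" message.
deleted_for_url() {
    lines_for_url "$1" "$2" | grep -F "Deleted, invalid"
}

# Usage: sh check_log.sh URL LOGFILE
if [ "$#" -eq 2 ]; then
    deleted_for_url "$1" "$2" || echo "no 'Deleted, invalid' line for $1 in $2"
fi
```

If the URL shows up in the htdig log but in neither form in the htmerge log, that at least narrows down the step at which the page disappears.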
Re: [htdig] Does htmerge remove URL from database ?
At 12:35 22/11/2000 -0600, Gilles Detillieux wrote:
>> 4. After all the sites have been htdigged, I run htmerge in sequence in order to merge all the small databases into one. First call is "htmerge -c site1.conf", subsequent calls are "htmerge -c site1.conf -m site2.conf", "htmerge -c site1.conf -m site3.conf", (and so on.)
>> ...
>> 2. Now let's hear the amazing part of my story. If I do a "htmerge -c site5.conf" (notice there is no -m this time.) and if I htsearch -c site5.conf with "rénovation tourisme" my document is said to be found! In other words, the document was indexed but was certainly ripped out when merging with another database.
>
> I think after each separate htdig -i -c site#.conf you should run a separate htmerge -c site#.conf, not just on the first site, before you merge everything together. Try that and see if it solves the problem. I think the intention was that these extra merges should not have been necessary, but this has come up before, and I think there's a problem with merging multiple DBs when they haven't already been cleaned up by a simple htmerge.

I tried it and it didn't solve the problem. BTW, I don't think that these extra merges are necessary either.

Now I run:

    htmerge -c site#.conf

then

    htmerge -c site1.conf -m site#.conf    (with # other than 1)

If I then run htsearch -c site5.conf with words="rénovation tourisme", it finds the document (in first place). But if I run htsearch -c site1.conf with the same words, it returns the "nomatch" document.

Some of the web hosts are case sensitive and some are not. Could it be the source of my problem?

What are the rules for htmerge? When does it really remove URLs from the database?

-- 
Olivier Korn
Strasbourg, France.
[htdig] Does htmerge remove URL from database ?
Hi,

We have been using ht://Dig for many months now and have had nothing to complain about, but there is something strange that I don't understand.

The way we're using ht://Dig is described here:

1. We have 20 or so web sites named, say, http://www.site1.fr/a-path/, http://www.site2.fr/a-path-which-does-not-read-the-same-as-site1/, and so on. Some are hosted on MS-IIS, some on Linux/Apache.
2. For each of these sites, I made up a site1.conf, site2.conf (and so on) containing start_url, the URL restrictions (and so on). Each of these .conf files includes a file named "_commun_include". Of course, I changed the database prefix for each of the sites.
3. Once a week, htdig is called on each site with "htdig -i -c site1.conf", then "htdig -i -c site2.conf" (and so on).
4. After all the sites have been htdigged, I run htmerge in sequence in order to merge all the small databases into one. The first call is "htmerge -c site1.conf"; subsequent calls are "htmerge -c site1.conf -m site2.conf", "htmerge -c site1.conf -m site3.conf" (and so on).
5. Everything seems to work perfectly. Using htsearch, I can find documents which are on any of the sites.

Let's note for later that my locale is correctly set, so I don't have any problem with accents (I also use the accents patch, which works fine). I say all this because of the example I give below. ("htfuzzy accents" is run after all the htmerge runs.)

Here is the problem:

1. On site5, there is an HTML document named "Rénovation du BTS tourisme". When searching for "rénovation tourisme" (method=and), the document is not found (ht://Dig even says there is no document containing these words). Using the "restrict=http://www.site5.fr/site5-path-to-docs/" parameter doesn't change anything (this is not a surprise, but I wanted to be sure).
2. Now let's hear the amazing part of my story. If I do a "htmerge -c site5.conf" (notice there is no -m this time) and then run htsearch -c site5.conf with "rénovation tourisme", my document is found!

In other words, the document was indexed but was apparently ripped out when merging with another database.

I'd like to know whether somebody has already run into this particular problem, or whether it is a "feature" of htmerge (deleting entries when merging two databases together). What can I do about it?

I'm really confused about all of this (and this state of mind doesn't help me write correct English, sorry about that).

-- 
Olivier Korn
Strasbourg, France.
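The weekly procedure described in this message, combined with the per-site htmerge pass suggested elsewhere in the thread, could be sketched as the loop below. It defaults to a dry run that only prints the commands. The CONFIG_DIR path, the site names, and the helper names run and weekly_plan are hypothetical. The command-line options are the ones quoted in the messages, except the -c passed to htfuzzy, which is assumed to work the same way as for the other tools.

```shell
#!/bin/sh
CONFIG_DIR="${CONFIG_DIR:-/path/to/config_dir}"   # hypothetical location
DRY_RUN="${DRY_RUN:-1}"                           # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# weekly_plan BASE SITE...: index every site from scratch, clean each
# database with a plain htmerge, merge every other site into BASE, then
# build the accents fuzzy index on the merged database.
weekly_plan() {
    base=$1
    shift
    for site in "$base" "$@"; do
        run htdig -i -c "$CONFIG_DIR/$site.conf" || return 1
    done
    for site in "$base" "$@"; do
        run htmerge -c "$CONFIG_DIR/$site.conf" || return 1
    done
    for site in "$@"; do
        run htmerge -c "$CONFIG_DIR/$base.conf" -m "$CONFIG_DIR/$site.conf" || return 1
    done
    run htfuzzy -c "$CONFIG_DIR/$base.conf" accents
}

# Usage: sh weekly.sh site1 site2 site3 ...
if [ "$#" -gt 0 ]; then
    weekly_plan "$@"
fi
```

Printing the plan first makes it easy to compare the exact order of htdig and htmerge calls against what was actually run before setting DRY_RUN=0.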