According to Ace Suares:
> But your answer confuses me !
>
> > >
> > > url_rewrite_rules: https://(.*) http://\\1
> > > search_rewrite_rules: http://(.*) https://\\1
> > ...
> > > It works one way around (I am using local files, but with https, I
> > > changed that to http as per one of your mails)
> > > but in the search results, stuff doesn't get translated back !
> >
> > In 3.1.x versions, htdig only allows http:// URLs, and it checks for that
> > well before it applies url_rewrite_rules, so it would never get around
> > to processing your https URLs and changing them to http.  However, if
> > your files are all accessible via HTTP, then you don't need to use the
> > url_rewrite_rules line.  Just use http:// URLs in start_url, and then
> > use search_rewrite_rules to rewrite them at search time to https:// URLs.
>
> My pages are only accessible with https.

But you should be aware that url_rewrite_rules is applied to the URL
_before_ the document is fetched.  So, even if you could get htdig 3.1.6
to allow https:// URLs (which you can with a patch, see below), with your
rewrite rules above htdig would still be trying to fetch the documents
using HTTP, not HTTPS.

> So, I decided to search them locally.

If all of your documents are reachable via local_urls, then it doesn't
matter if you "fake up" http:// URLs for them - it will grab them via the
local filesystem and at search time change the fake URLs using your
search_rewrite_rules.  But, local_urls is pretty strict about what it
allows.  See http://www.htdig.org/attrs.html#local_urls for the allowed
file types.  Also, you can't index "bare" directories, i.e. without an
index.html file, via local_urls.

> However, each URL in the documents I am searching contains https
> URLs. I followed your answer found in google and use http in
> start_url and search_rewrite_rules (but, probably, the other way
> around:
>   search_rewrite_rules: https://(.*) http://\\1

No, that would change https:// URLs, assuming you got htdig to accept
them, into http:// URLs at search time, so the links in search results
won't work if the pages are only accessible via https.

> The search and merge work fine. But the output of a search contains
> http urls, not https urls. I thought that with url_rewrite_rules I
> could convert them back, and that I didn't have to have two config
> files.

If by "The search and merge" you mean building the index with htdig and
htmerge, then I don't know how you got it to work fine with an unpatched
3.1.x version of htdig.  On the other hand, if you have applied the
ftp://ftp.ccsf.org/htdig-patches/3.1.6/ssl.9 patch and got that working,
then you should be all set.  Of course, in this case, you don't want to
mess with any URL rewriting, either at indexing time or searching time,
as you want to stick to https:// URLs throughout both indexing and
(ht)search phases.

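As a rough sketch of the two setups described above, the relevant config
lines might look something like this.  The host name www.example.com and
the path /var/www/html/ are placeholders made up for illustration, not
your actual values, and I haven't tested this exact fragment:

```
# Option 1 - unpatched 3.1.x: index via the local filesystem under
# "fake" http:// URLs, and rewrite them to https:// only at search time.
# (www.example.com and /var/www/html/ are hypothetical placeholders.)
start_url:            http://www.example.com/
local_urls:           http://www.example.com/=/var/www/html/
search_rewrite_rules: http://(.*) https://\\1

# Option 2 - 3.1.6 with the ssl.9 patch applied: use https:// URLs
# throughout, with no url_rewrite_rules or search_rewrite_rules at all.
# start_url:          https://www.example.com/
```

These are alternatives, not lines to combine in one config file.
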
If you didn't apply that patch, then how did you get the "search and
merge" to work fine?

Note the distinction I'm drawing between indexing and searching above:
indexing is building the database, with htdig/htmerge, while searching
is querying the database with htsearch.  The "spidering" of documents is
part of indexing, so we don't refer to that as searching.  When you refer
to "search and merge", you seem to be implying the indexing phase, so I
just want to make sure I'm understanding you correctly.

> If you feel this message has ripened enough, I would very much
> appreciate your answer, to find out what is my mistake here.
> Htdig is great, like so many OSS projects, and I love to find out
> this new way to use it.

This time around, it wasn't deliberate "ripening", which I don't do to
followups, even when they're off-list.  However, I got busy and it took
a while to get back to your message.  That's the benefit of keeping even
followup messages on the list - there's always a chance someone else can
answer you, or even clarify my answer, before I can get back to it.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

