This rightly belongs on htdig-general, not htdig-dev...

Session IDs are the bane of search engines, and in general they make it
harder to get around the web.  If getting rid of them altogether
isn't an option, then you can at least remove them while indexing.  See

  http://www.htdig.org/attrs.html#url_rewrite_rules

for an example of how to do this.  This will only work if the documents
can be fetched without a session ID, because that's exactly what htdig
will do: it rewrites the URL before fetching the document, and presents
the rewritten URL (without session ID) in search results.
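
To make the effect concrete, here is a rough Python sketch of the idea
(this is not htdig code, and not the syntax of url_rewrite_rules; the
pattern and the names SID_PARAM and strip_sid are made up for
illustration).  It assumes the session ID always shows up as a
"sid=<32 hex digits>" query parameter, as in the log quoted below:

```python
import re

# Assumed pattern: a "sid=" query parameter followed by 32 hex digits,
# as seen in the quoted access log. Adjust for other forum software.
SID_PARAM = re.compile(r"(?<=[?&])sid=[0-9a-f]{32}(?:&|$)")


def strip_sid(url: str) -> str:
    """Return the URL with any sid parameter removed."""
    stripped = SID_PARAM.sub("", url)
    # Drop a dangling "?" or "&" left behind once the parameter is gone.
    return stripped.rstrip("?&")


# URLs taken from the access log: one page, several session IDs.
crawled = [
    "/forum/index.php",
    "/forum/index.php?sid=6923db608dd988b9167c2464278dcffb",
    "/forum/index.php?sid=27619eca3a821c36bbfe3222b99f62aa",
    "/forum/faq.php?sid=6923db608dd988b9167c2464278dcffb",
    "/forum/faq.php?sid=27619eca3a821c36bbfe3222b99f62aa",
]

# After rewriting, the five URLs collapse to two distinct documents.
unique = sorted({strip_sid(u) for u in crawled})
print(unique)
```

Once the session ID is stripped, every variant maps to the same URL, so
each page is fetched and indexed only once.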

According to Paolo Subiaco:
> Hi all.
> I see there is a problem spidering forums like phpBB and electrifiedpenguin.
> Because these forums return a session ID, htdig ends up with more than
> one SID while spidering the forum pages.
> The result is that
> 1. the same forum page is indexed more than once
> 2. the amount of CPU and time used for indexing is very large.
> 
> Take a look at the log below....
> Thank you. Paolo
> 
> 217.168.237.106 - - [14/Oct/2002:01:06:54 +0200] "GET /forum/index.php 
> HTTP/1.0" 200 35398 "http://www.ir3ip.net/forum/" "htdig/3.1.5 
> 
> 217.168.237.106 - - [14/Oct/2002:01:06:55 +0200] "GET /forum/search.php 
> HTTP/1.0" 200 19754 "http://www.ir3ip.net/forum/" "htdig/3.1.5
> 
> 217.168.237.106 - - [14/Oct/2002:01:06:57 +0200] "GET /forum/faq.php 
> HTTP/1.0" 200 51949 "http://www.ir3ip.net/forum/" "htdig/3.1.5 (ro
> 
> 217.168.237.106 - - [14/Oct/2002:01:07:00 +0200] "GET /forum/memberlist.php 
> HTTP/1.0" 200 22715 "http://www.ir3ip.net/forum/" "htdig/3.
> 
> 217.168.237.106 - - [14/Oct/2002:01:07:02 +0200] "GET 
> /forum/index.php?sid=6923db608dd988b9167c2464278dcffb HTTP/1.0" 200 35398 
> "http:/
> 
> 217.168.237.106 - - [14/Oct/2002:01:07:04 +0200] "GET 
> /forum/faq.php?sid=6923db608dd988b9167c2464278dcffb HTTP/1.0" 200 51949 
> "http://w
> 
> ..... another dozen lines with the same sid were removed....
> after a few minutes, several dozen accesses with another sid:
> 
> 217.168.237.106 - - [14/Oct/2002:01:09:19 +0200] "GET 
> /forum/index.php?sid=27619eca3a821c36bbfe3222b99f62aa HTTP/1.0" 200 35398 
> "http://www.ir3ip.net/forum/viewforum.php?f=12" "htdig/3.1.5
> 
> 217.168.237.106 - - [14/Oct/2002:01:09:21 +0200] "GET 
> /forum/faq.php?sid=27619eca3a821c36bbfe3222b99f62aa HTTP/1.0" 200 51949 
> "http://www.ir3ip.net/forum/viewforum.php?f=12" "htdig/3.1.5 (r
> 
> 217.168.237.106 - - [14/Oct/2002:01:09:23 +0200] "GET 
> /forum/search.php?sid=27619eca3a821c36bbfe3222b99f62aa HTTP/1.0" 200 19754 
> "http://www.ir3ip.net/forum/viewforum.php?f=12" "htdig/3.1.5
> 
> 217.168.237.106 - - [14/Oct/2002:01:09:26 +0200] "GET 
> /forum/memberlist.php?sid=27619eca3a821c36bbfe3222b99f62aa HTTP/1.0" 200 
> 22715 "http://www.ir3ip.net/forum/viewforum.php?f=12" "htdig/3


-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
