On 12/1/06, Christopher Schultz <[EMAIL PROTECTED]> wrote:
> Mikolaj,
>
> Back to the original question...
>
> Mikolaj Rydzewski wrote:
>> As you may know url rewriting feature is not a nice thing when spiders
>> come to index your site -
>> http://gabrito.com/post/javas-seo-blunder-jsessionid.
>
> So, the problem is that your URLs contain ";jsessionid=...", right? When
> does that become a problem?
That becomes a problem when Google (or whomever) crawls your site on
different days and sees the same content with "different" URLs.

> Well, I have a couple of thoughts about that.
> 1. A semi-colon is listed in the HTTP specification as being a valid
> delimiter, despite pretty much every major web server out there
> ignoring it and thinking that it's part of the path.
>
> This is partially the crawler's fault for not following the HTTP
> specification. The ";" character is not technically a valid URL
> character outside of its role as a delimiter, just like "&" or "?".

Whether or not you consider it part of the URL, Google treats it that
way, and so we have to live with it.
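For anyone who hasn't seen it, the URLs in question look something
like this (the session id here is made up):

    http://www.example.com/app/page;jsessionid=0A1B2C3D4E5F60718293A4B5C6D7E8F9?foo=bar

The container treats everything after the ";" as a path parameter and
strips it before matching the request, but a crawler just sees a
different resource name every time the id changes.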
> 2. If you strip-off the jsessionid argument for all of these URLs,
> you will end up with thousands of sessions being created for
> each URL requested by the google bot. Do you think that's a good
> idea?

As far as I can see, that's not a problem - I don't get anywhere near
a thousand live sessions from Google. In fact, Google's crawler seems
to limit itself to about one page per minute (according to my logs), so
with the default 30-minute session timeout there won't be more than a
few dozen live sessions at most.
> 3. If you don't want googlebot to get a session, why are you allocating
> one? If you need sessions to manage site navigation, then you
> cannot turn them off and have things work correctly... can you?

On my site (as on many others) you can browse the site without a
session, but if you want to log in (to add content or to use
personalized settings) you need a session. Sessions aren't required
for site navigation or crawling, but they are required for other
reasons.
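For what it's worth, the way to keep anonymous browsing session-free
is simply never to force a session into existence until the user
actually logs in - roughly like this (the servlet name and the "user"
attribute are just illustrative):

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.*;

    public class PublicPageServlet extends HttpServlet {
        protected void doGet(HttpServletRequest request,
                             HttpServletResponse response)
                throws ServletException, IOException {
            // Don't create a session for anonymous visitors; just see
            // whether one already exists (returns null if it doesn't).
            HttpSession session = request.getSession(false);
            boolean loggedIn = (session != null
                    && session.getAttribute("user") != null);
            response.setContentType("text/html");
            response.getWriter().println(loggedIn ? "Hello again" : "Hello");
        }
    }

In JSPs the equivalent is to turn off the implicit session with
<%@ page session="false" %>. The login action is then the only place
that calls request.getSession(true), so a crawler that never logs in
never gets a jsessionid to begin with.
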
> 4. Consider instructing googlebot not to crawl certain portions of your
> site (those which require a session) by using a robots.txt file.

Not an option, since that would mean not indexing the interesting
parts of the site.
The best solution I could find is to use a filter and an
HttpServletResponse wrapper, as others have described. An
implementation of the wrapper class can be found here:
http://mail-archives.apache.org/mod_mbox/struts-user/200311.mbox/[EMAIL PROTECTED]

The result is that to log in to the site you need a browser that
supports session cookies, but I can accept that. And Google can now
index the site without crawling all over it repeatedly with different
jsessionids. Yay.
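In case that archive link ever goes stale: the idea is a response
wrapper that refuses to rewrite URLs, plus a filter that installs it.
Here's a stripped-down sketch (the class names are mine; see the link
above for the real implementation):

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.*;

    // Wrapper that never appends ;jsessionid=... to URLs. Browsers
    // without cookies simply won't keep a session.
    class NoRewriteResponse extends HttpServletResponseWrapper {
        public NoRewriteResponse(HttpServletResponse response) {
            super(response);
        }
        public String encodeURL(String url) { return url; }
        public String encodeRedirectURL(String url) { return url; }
        // Deprecated variants, overridden for completeness:
        public String encodeUrl(String url) { return url; }
        public String encodeRedirectUrl(String url) { return url; }
    }

    // Filter that wraps every response; map it to /* in web.xml.
    public class NoRewriteFilter implements Filter {
        public void init(FilterConfig config) {}
        public void destroy() {}
        public void doFilter(ServletRequest req, ServletResponse res,
                             FilterChain chain)
                throws IOException, ServletException {
            if (res instanceof HttpServletResponse) {
                res = new NoRewriteResponse((HttpServletResponse) res);
            }
            chain.doFilter(req, res);
        }
    }

Links emitted through response.encodeURL() then come out clean, and
Tomcat still picks up the session cookie from browsers that send one.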
--
Len
---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]