On Aug 25, 2009, at 9:50am, Fuad Efendi wrote:
I forgot to add that for “Allow Redirects” to work properly we also
need cookie handling in HttpClient. Most “stateful” websites generate
links inside the HTML with session tokens if they find that the client
does not support cookies; but if HttpClient supports cookies, we are
forced to allow redirects (although the new version of HttpClient
supports a per-host cookie cache?!). To be verified...
HttpClient 4.0 provides per-user/thread context, which includes
cookies. I don't know of any per-host cookie support, just per-host
routing.
-- Ken
From: Fuad Efendi [mailto:f...@efendi.ca]
Sent: August-25-09 12:42 PM
To: nutch-dev@lucene.apache.org
Subject: Nutch Performance Improvements
Hello,
A few years ago I noticed some performance bottlenecks in Nutch;
checking the source code now, they are still the same...
1. RegexURLNormalizer and similar plugins
It’s a singleton, and its main method is synchronized. It would be
better to have a per-thread, non-synchronized instance; but how do we
make it a plugin then?
2. “Allow Redirects” for HttpClient
By allowing redirects we can avoid HttpSession-related tokens in the
final URLs
(maybe this is not acceptable for a general crawl, but it would be nice
to have such a configuration option)
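
For item 1, one possible shape is sketched below: the plugin stays a single shared object, but the mutable regex state lives in a ThreadLocal, so nothing needs to be synchronized. This is only an illustration and not the actual Nutch code; the class name PerThreadUrlNormalizer and the single jsessionid rule are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch's RegexURLNormalizer.
final class PerThreadUrlNormalizer {

    // Assumed example rule: strip a ";jsessionid=..." path parameter.
    // A compiled Pattern is immutable and safe to share between threads.
    private static final Pattern SESSION_ID =
            Pattern.compile("(?i);jsessionid=[^?#]*");

    // Matcher is NOT thread-safe, so each thread gets its own instance.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> SESSION_ID.matcher(""));

    private PerThreadUrlNormalizer() {
    }

    // No "synchronized" needed: the only mutable state is thread-local.
    static String normalize(String url) {
        return MATCHER.get().reset(url).replaceAll("");
    }
}
```

Callers on any fetcher thread would simply call PerThreadUrlNormalizer.normalize(url). Because the shared entry point is still one object, it could in principle remain a plugin singleton while delegating to per-thread state, though whether that fits the plugin framework would need to be checked.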
Fuad Efendi
==================================
http://www.linkedin.com/in/liferay
http://www.tokenizer.org
http://www.casaGURU.com
==================================
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378