Hi,
I have a question about setting cookies across multiple fetch iterations,
and I hope that someone can help.
I am trying to crawl a URL that behaves as follows when loaded in a
browser:
1. URL #1 does a 301 redirect to URL #2, which sets the following
   cookies:
   a. JSESSIONID
   b. a custom cookie
2. URL #2 does a 301 redirect back to URL #1 with the above cookies, and
   the page then loads.
I am reverse engineering Nutch to crawl an internal site (URL #1), and I am
looking through the Fetcher.java code to see whether I can store the cookies
when a 301 redirect is received.
I want to use those session cookies to make another fetcher request in the
next iteration, but I can't quite figure out how to do that. Can someone help?
Kartik
This is the relevant code in Fetcher.java:
  case ProtocolStatus.MOVED: // redirect
  case ProtocolStatus.TEMP_MOVED:
    int code;
    boolean temp;
    if (status.getCode() == ProtocolStatus.MOVED) {
      code = CrawlDatum.STATUS_FETCH_REDIR_PERM;
      temp = false;
    } else {
      code = CrawlDatum.STATUS_FETCH_REDIR_TEMP;
      temp = true;
    }
    output(fit.url, fit.datum, content, status, code);
    String newUrl = status.getMessage();
    Text redirUrl =
      handleRedirect(fit.url, fit.datum,
                     urlString, newUrl, temp,
                     Fetcher.PROTOCOL_REDIR);
    if (redirUrl != null) {
      queueRedirect(redirUrl, fit);
    } else {
      // stop redirecting
      redirecting = false;
    }
    break;
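To make the goal concrete, here is a minimal standalone sketch of the flow I
am after. It uses no Nutch classes: FakeResponse, fetch, the two example URLs
and the cookie values are all hypothetical stand-ins for the real server and
protocol layer. The only point is that the cookies seen on one hop have to be
sent with the request for the next hop; otherwise URL #1 keeps redirecting to
URL #2 until the redirect limit is hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RedirectCookieSketch {

    /** Hypothetical stand-in for an HTTP response. */
    static final class FakeResponse {
        final int code;
        final String location;         // redirect target, null for non-redirects
        final List<String> setCookies; // "name=value" pairs set by this response

        FakeResponse(int code, String location, List<String> setCookies) {
            this.code = code;
            this.location = location;
            this.setCookies = setCookies;
        }
    }

    // Example URLs only; the real ones are internal.
    static final String URL_1 = "http://internal.example/page";
    static final String URL_2 = "http://internal.example/session";

    /** Fake server: URL #2 sets cookies, URL #1 only loads when it sees JSESSIONID. */
    static FakeResponse fetch(String url, String cookieHeader) {
        if (url.equals(URL_2)) {
            return new FakeResponse(301, URL_1,
                    Arrays.asList("JSESSIONID=abc123", "CUSTOM=xyz"));
        }
        boolean hasSession = cookieHeader != null && cookieHeader.contains("JSESSIONID=");
        if (hasSession) {
            return new FakeResponse(200, null, Collections.<String>emptyList());
        }
        return new FakeResponse(301, URL_2, Collections.<String>emptyList());
    }

    /**
     * Follows redirects and sends every cookie collected so far with the
     * next request. Returns the final status code, or -1 if the redirect
     * limit is exceeded.
     */
    static int fetchFollowingRedirects(String startUrl, int maxRedirects, List<String> trace) {
        List<String> cookies = new ArrayList<>();
        String url = startUrl;
        for (int hop = 0; hop <= maxRedirects; hop++) {
            String cookieHeader = cookies.isEmpty() ? null : String.join("; ", cookies);
            FakeResponse response = fetch(url, cookieHeader);
            trace.add(response.code + " " + url);
            cookies.addAll(response.setCookies); // remember cookies for the next hop
            if (response.code < 300 || response.code >= 400) {
                return response.code;
            }
            url = response.location;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        int code = fetchFollowingRedirects(URL_1, 5, trace);
        System.out.println(code + " after " + trace);
    }
}
```

In Nutch terms, the cookies list would presumably have to live on something
that travels with the redirected URL, and the protocol plugin would have to
read it when building the request. That part is what I have not worked out.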
I have access to the cookie string from the protocol layer.
I have modified getProtocolOutput as follows so that the Set-Cookie value from
the response is stored on the ProtocolOutput object (setCookieString is my own
addition):
  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    String urlString = url.toString();
    try {
      URL u = new URL(urlString);
      long startTime = System.currentTimeMillis();
      Response response = getResponse(u, datum, false); // make a request
      if (this.responseTime) {
        int elapsedTime = (int) (System.currentTimeMillis() - startTime);
        datum.getMetaData().put(RESPONSE_TIME, new IntWritable(elapsedTime));
      }
      int code = response.getCode();
      byte[] content = response.getContent();
      Content c = new Content(u.toString(), u.toString(),
          (content == null ? EMPTY_CONTENT : content),
          response.getHeader("Content-Type"),
          response.getHeaders(), this.conf);
      String cookiesString = response.getHeader("Set-Cookie"); // My custom change
      // BOA Changes
      if (code == 200) { // got a good response
        ProtocolOutput output = new ProtocolOutput(c);
        output.setCookieString(cookiesString); // My custom change
        return output; // return it
      } else if (code >= 300 && code < 400) { // handle redirect
        String location = response.getHeader("Location");
        . . . Nutch Code . . .
        // handle this in the higher layer.
        ProtocolOutput output =
            new ProtocolOutput(c, new ProtocolStatus(protocolStatusCode, u));
        output.setCookieString(cookiesString); // My custom change.
        return output;
      } else if (code == 400) { // bad request, mark as GONE
        if (logger.isTraceEnabled()) { logger.trace("400 Bad request: " + u); }
        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.GONE, u));
      } else if (code == 401) { // requires authorization, but no valid auth provided.
        if (logger.isTraceEnabled()) { logger.trace("401 Authentication Required"); }
        return new ProtocolOutput(c,
            new ProtocolStatus(ProtocolStatus.ACCESS_DENIED,
                "Authentication required: " + urlString));
      } else if (code == 404) {
        return new ProtocolOutput(c,
            new ProtocolStatus(ProtocolStatus.NOTFOUND, u));
      } else if (code == 410) { // permanently GONE
        return new ProtocolOutput(c,
            new ProtocolStatus(ProtocolStatus.GONE, "Http: " + code + " url=" + u));
      } else {
        return new ProtocolOutput(c,
            new ProtocolStatus(ProtocolStatus.EXCEPTION,
                "Http code=" + code + ", url=" + u));
      }
    } catch (Throwable e) {
      logger.error("Failed to get protocol output", e);
      return new ProtocolOutput(null, new ProtocolStatus(e));
    }
  }
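The value captured in cookiesString is a raw Set-Cookie header, so it still
contains attributes such as Path that must not be echoed back to the server.
As a rough sketch of the conversion I think is needed before the next
request, the snippet below turns Set-Cookie values into a single Cookie
request header using the JDK's java.net.HttpCookie. The class name,
toCookieHeader and the sample values are made up for illustration, and it
assumes each Set-Cookie value is available as its own string.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CookieHeaderSketch {

    /**
     * Builds the value of a Cookie request header from raw Set-Cookie
     * header values, keeping only the name=value pair of each cookie.
     */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String raw : setCookieValues) {
            // HttpCookie.parse strips attributes such as Path and HttpOnly.
            for (HttpCookie cookie : HttpCookie.parse(raw)) {
                pairs.add(cookie.getName() + "=" + cookie.getValue());
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Sample values only; the real cookie values come from URL #2.
        List<String> fromServer = Arrays.asList(
                "JSESSIONID=ABC123; Path=/; HttpOnly",
                "CUSTOM=xyz; Path=/");
        System.out.println("Cookie: " + toCookieHeader(fromServer));
    }
}
```

The resulting string would then be sent as the Cookie header on the request
for URL #1. The sketch ignores domain, path and expiry matching, which is
acceptable only because both URLs here are on the same internal site.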