Hi,

I have a question about setting cookies across multiple iterations,
and I hope that someone can help.

I am trying to crawl a URL that behaves as follows when it is loaded in a
browser:


1. URL #1 does a 301 redirect to URL #2, which stores the following
   cookies:

   a. JSESSIONID
   b. a custom cookie

2. URL #2 does a 301 redirect back to URL #1 with the above cookies, and
   then the page loads.
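
The flow above can be sketched as a small, self-contained simulation. This is not Nutch code: the stub server, the URLs, the cookie values, and the cookie name CUSTOM are all hypothetical stand-ins for the behaviour described. It only illustrates why the cookies have to be carried from one redirect hop to the next; without them the two URLs redirect to each other until the redirect limit is reached.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for URL #1 and URL #2; not part of Nutch.
class RedirectCookieFlowSketch {

    static final String URL_1 = "http://internal.example/one"; // assumed name
    static final String URL_2 = "http://internal.example/two"; // assumed name

    /** Minimal response model: status code, redirect target, cookies set by the server. */
    static final class StubResponse {
        final int code;
        final String location;
        final Map<String, String> setCookies;

        StubResponse(int code, String location, Map<String, String> setCookies) {
            this.code = code;
            this.location = location;
            this.setCookies = setCookies;
        }
    }

    /** Simulates the server: URL #1 only serves the page once both cookies are present. */
    static StubResponse serve(String url, Map<String, String> requestCookies) {
        if (url.equals(URL_1)) {
            boolean hasSession = requestCookies.containsKey("JSESSIONID");
            boolean hasCustom = requestCookies.containsKey("CUSTOM");
            if (hasSession && hasCustom) {
                return new StubResponse(200, null, new LinkedHashMap<>());
            }
            return new StubResponse(301, URL_2, new LinkedHashMap<>());
        }
        // URL #2 sets both cookies and redirects back to URL #1.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("CUSTOM", "xyz");
        return new StubResponse(301, URL_1, cookies);
    }

    /** Follows redirects; returns the final status code, or -1 if the limit is hit. */
    static int fetch(String startUrl, boolean carryCookies, int maxRedirects) {
        Map<String, String> jar = new LinkedHashMap<>();
        String url = startUrl;
        for (int hop = 0; hop <= maxRedirects; hop++) {
            StubResponse response = serve(url, jar);
            if (carryCookies) {
                jar.putAll(response.setCookies); // remember cookies for the next hop
            }
            if (response.code < 300 || response.code >= 400) {
                return response.code;
            }
            url = response.location;
        }
        return -1; // gave up: still redirecting after maxRedirects hops
    }

    public static void main(String[] args) {
        System.out.println("with cookies:    " + fetch(URL_1, true, 5));
        System.out.println("without cookies: " + fetch(URL_1, false, 5));
    }
}
```

The jar here is a single flat map and ignores domain, path, and expiry, which a real crawler would probably need to respect.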

I am reverse engineering Nutch to crawl an internal site at URL #1, and I am
looking through the Fetcher.java code to see if I can store the cookies when
a 301 redirect is received.

I want to use the session cookies to make another fetcher request in the
next iteration.

I can't quite figure out how to do that. Can someone help?

Kartik


In Fetcher.java, this is the relevant code:

              case ProtocolStatus.MOVED:         // redirect
              case ProtocolStatus.TEMP_MOVED:
                int code;
                boolean temp;
                if (status.getCode() == ProtocolStatus.MOVED) {
                  code = CrawlDatum.STATUS_FETCH_REDIR_PERM;
                  temp = false;
                } else {
                  code = CrawlDatum.STATUS_FETCH_REDIR_TEMP;
                  temp = true;
                }
                output(fit.url, fit.datum, content, status, code);
                String newUrl = status.getMessage();
                Text redirUrl =
                  handleRedirect(fit.url, fit.datum,
                                 urlString, newUrl, temp,
                                 Fetcher.PROTOCOL_REDIR);
                if (redirUrl != null) {
                  queueRedirect(redirUrl, fit);
                } else {
                  // stop redirecting
                  redirecting = false;
                }
                break;

I have access to the cookie string from the protocol layer. I have modified
getProtocolOutput as follows so that the cookie string from the response is
set on the ProtocolOutput object:

public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {

    String urlString = url.toString();
    try {
      URL u = new URL(urlString);
      long startTime = System.currentTimeMillis();
      Response response = getResponse(u, datum, false); // make a request

      if(this.responseTime) {
        int elapsedTime = (int) (System.currentTimeMillis() - startTime);
        datum.getMetaData().put(RESPONSE_TIME, new IntWritable(elapsedTime));
      }

      int code = response.getCode();
      byte[] content = response.getContent();
      Content c = new Content(u.toString(), u.toString(),
                              (content == null ? EMPTY_CONTENT : content),
                              response.getHeader("Content-Type"),
                              response.getHeaders(), this.conf);
      String cookiesString = response.getHeader("Set-Cookie"); // My custom change
      // BOA Changes
      if (code == 200) { // got a good response
        ProtocolOutput output = new ProtocolOutput(c);
        output.setCookieString(cookiesString); // My custom change
        return output; // return it

      } else if (code >= 300 && code < 400) { // handle redirect
        String location = response.getHeader("Location");
         . . . Nutch Code . . .
        // handle this in the higher layer.
        ProtocolOutput output = new ProtocolOutput(c,
            new ProtocolStatus(protocolStatusCode, u));
        output.setCookieString(cookiesString); // My custom change.
        return output;
      } else if (code == 400) { // bad request, mark as GONE
        if (logger.isTraceEnabled()) { logger.trace("400 Bad request: " + u); }
        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.GONE, u));
      } else if (code == 401) { // requires authorization, but no valid auth provided.
        if (logger.isTraceEnabled()) { logger.trace("401 Authentication Required"); }
        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.ACCESS_DENIED,
            "Authentication required: " + urlString));
      } else if (code == 404) {
        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.NOTFOUND, u));
      } else if (code == 410) { // permanently GONE
        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.GONE,
            "Http: " + code + " url=" + u));
      } else {
        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.EXCEPTION,
            "Http code=" + code + ", url=" + u));
      }
    } catch (Throwable e) {
      logger.error("Failed to get protocol output", e);
      return new ProtocolOutput(null, new ProtocolStatus(e));
    }
  }
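
One thing to check with the change above: URL #2 sets two cookies, and servers normally send each one in its own Set-Cookie header, so a single-valued header lookup may return only one of them (this is an assumption about the response, not something confirmed here). The sketch below is standalone Java, not Nutch code; the class name, method name, and sample values are hypothetical. It shows how several raw Set-Cookie values could be reduced to one Cookie request header value using java.net.HttpCookie.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the names are illustrative and not part of Nutch.
class CookieHeaderSketch {

    /** Turns raw Set-Cookie header values into one Cookie request header value. */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String raw : setCookieValues) {
            if (raw == null || raw.trim().isEmpty()) {
                continue; // skip missing headers instead of failing in parse()
            }
            // parse() drops attributes such as Path and HttpOnly from the result,
            // leaving the name/value pair that belongs in the request header.
            for (HttpCookie cookie : HttpCookie.parse(raw)) {
                pairs.add(cookie.getName() + "=" + cookie.getValue());
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        List<String> fromRedirect = Arrays.asList(
            "JSESSIONID=abc123; Path=/; HttpOnly",
            "CUSTOM=xyz; Path=/");
        System.out.println("Cookie: " + toCookieHeader(fromRedirect));
    }
}
```

The resulting string is what would have to be sent as the Cookie header on the follow-up request to URL #1; where to attach it in the fetcher is the open question.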
