RE: [Nutch-dev] Fetch / Parse errors and a Bug

Chirag Chaman Wed, 29 Dec 2004 07:19:18 -0800

That is strange, because I would expect the lookup to be case-insensitive.
Then again, I have not tested this; I am only going by the code.


You can see that the TreeMap is initialized with String.CASE_INSENSITIVE_ORDER:

private Map parseHeaders(PushbackInputStream in, StringBuffer line)
    throws IOException, HttpException {
    // keys in this map compare without regard to case
    TreeMap headers = new TreeMap(String.CASE_INSENSITIVE_ORDER);
    return parseHeaders(in, line, headers);
}

So I would imagine that a lookup for Content-Type is case-insensitive as
well.
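
A minimal sketch of that expectation, not taken from the Nutch source: a
TreeMap built with String.CASE_INSENSITIVE_ORDER should return the same value
whatever the case of the key. The class name CaseInsensitiveHeaders and the
helper newHeaderMap() are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class CaseInsensitiveHeaders {
    /** Builds a header map whose keys compare without regard to case (hypothetical helper). */
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();
        headers.put("content-type", "text/html");

        // Both lookups hit the same entry even though the case differs.
        System.out.println(headers.get("Content-Type"));        // prints "text/html"
        System.out.println(headers.containsKey("CONTENT-TYPE")); // prints "true"
    }
}
```

This only helps if the header lookup actually goes through such a map; a
lookup against a different, case-sensitive container would not benefit.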


Can you send me the link to a page that has this problem -- I'll run some
tests to see what's causing this.


-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Sven Wende
Sent: Wednesday, December 29, 2004 9:16 AM
To: [EMAIL PROTECTED]
Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug

Chirag:

> I looked at where you mention that the content type is being looked up 
> and is Case Sensitive -- that is not correct. The HTTP protocol is 
> adding the Content-type to the TreeMap which is initialized with the 
> String.CASE_INSENSITIVE_ORDER comparator. Thus it internally will do a 
> case-insensitive match.

Which code do you refer to?

I described a problem in the protocol-http plugin. Just take a look at the
following code snippet from CVS. As you can see, the headers are read in
and stored in a simple Hashtable (a java.util.Properties object).
The problem with case-sensitive header names for content-type occurs, for
example, in the toContent() method.

*********************************************************************************
package net.nutch.protocol.http;

/** An HTTP response. */

public class HttpResponse {
  private Properties headers = new Properties();                    

  /** Returns the value of a named header. */
  public String getHeader(String name) {
    return (String)headers.get(name);
  }

  public Content toContent() {
    String contentType = getHeader("Content-Type");
    if (contentType == null)
      contentType = "";
    return new Content(orig, base, content, contentType, headers);
  }

  private void processHeaderLine(StringBuffer line, TreeMap headers)
    throws IOException, HttpException {
    int colonIndex = line.indexOf(":");       // key is up to colon
    if (colonIndex == -1) {
      int i;
      for (i= 0; i < line.length(); i++)
        if (!Character.isWhitespace(line.charAt(i)))
          break;
      if (i == line.length())
        return;
      throw new HttpException("No colon in header:" + line);
    }
    String key = line.substring(0, colonIndex);

    int valueStart = colonIndex+1;            // skip whitespace
    while (valueStart < line.length()) {
      int c = line.charAt(valueStart);
      if (c != ' ' && c != '\t')
        break;
      valueStart++;
    }
    String value = line.substring(valueStart);

    headers.put(key, value);
  }
}
*********************************************************************************

> I think the problem is that no "content-type" was ever on the page -- 
> this leaves both the content type and the extension/suffix to be blank 
> and that causes a problem. Also, if a character set is not 
> specified, then the fetcher fails as well (as it cannot write to disk).

I tested it, and there was a "content-type" header. When its name was
"Content-Type", everything was OK, but when its name was "content-type", Nutch
internally lost the content-type information because of the code above.
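
The behaviour described here can be reproduced outside Nutch with a small
sketch. It assumes only that the headers live in a java.util.Properties, as in
the snippet; the class HeaderLookupDemo and both helper methods are
hypothetical, and lookupIgnoreCase() is just one possible workaround, not the
fix adopted by the project.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class HeaderLookupDemo {
    /** Exact-match lookup, mirroring a Properties-backed getHeader(). */
    static String lookupExact(Properties headers, String name) {
        return (String) headers.get(name);
    }

    /** Hypothetical workaround: copy the entries into a case-insensitive map first. */
    static String lookupIgnoreCase(Properties headers, String name) {
        Map<String, String> folded = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String key : headers.stringPropertyNames()) {
            folded.put(key, headers.getProperty(key));
        }
        return folded.get(name);
    }

    public static void main(String[] args) {
        Properties headers = new Properties();
        headers.setProperty("content-type", "application/pdf"); // lower-case, as sent by the server

        System.out.println(lookupExact(headers, "Content-Type"));      // prints "null"
        System.out.println(lookupIgnoreCase(headers, "Content-Type")); // prints "application/pdf"
    }
}
```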



> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Chirag Chaman
> Sent: Wednesday, 29 December 2004 13:59
> To: [EMAIL PROTECTED]
> Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> 
> Sven:
> 
> Yes, this is related. Bill Goffe seems to have had the same problem.
> 
> So here's the EASY fix. I tested it over the last few hours with 100k 
> pages and it's working as it should. Simply add "pdf" and "doc" as 
> the pathSuffix of parser-pdf and parser-doc. In my opinion no other 
> parser plugin should have its pathSuffix left blank unless it wants to 
> be the default handler -- HTML should be the only one.
> 
> I looked at where you mention that the content type is being looked up 
> and is Case Sensitive -- that is not correct. The HTTP protocol is 
> adding the Content-type to the TreeMap which is initialized with the 
> String.CASE_INSENSITIVE_ORDER comparator. Thus it internally will do a 
> case-insensitive match.
> 
> I think the problem is that no "content-type" was ever on the page -- 
> this leaves both the content type and the extension/suffix to be blank 
> and that causes a problem. Also, if a character set is not 
> specified, then the fetcher fails as well (as it cannot write to disk).
> 
> I think we need to have global defaults if we encounter such a problem 
> -- the Content type should be set to text/html and the character-set 
> should be ISO-8859 or UTF-8.
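
A rough sketch of the fallback proposed above. The names HeaderDefaults,
contentTypeOrDefault() and charsetOrDefault() are made up for illustration,
and UTF-8 is picked arbitrarily from the two charsets mentioned.

```java
class HeaderDefaults {
    // Assumed global defaults; the message above suggests text/html and ISO-8859 or UTF-8.
    static final String DEFAULT_CONTENT_TYPE = "text/html";
    static final String DEFAULT_CHARSET = "UTF-8";

    /** Returns the header value, or the default when it is missing or blank. */
    static String contentTypeOrDefault(String headerValue) {
        return isBlank(headerValue) ? DEFAULT_CONTENT_TYPE : headerValue.trim();
    }

    /** Returns the declared charset, or the default when none was declared. */
    static String charsetOrDefault(String declaredCharset) {
        return isBlank(declaredCharset) ? DEFAULT_CHARSET : declaredCharset.trim();
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(contentTypeOrDefault(null));              // prints "text/html"
        System.out.println(contentTypeOrDefault("application/pdf")); // prints "application/pdf"
        System.out.println(charsetOrDefault(""));                    // prints "UTF-8"
    }
}
```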
> 
> Doug, since you initially wrote the http protocol, what's the best way 
> to proceed?
> 
> Thanks
> CC
> 
> Just as a side note, it would be AWESOME if we could specify the max fetch 
> length based on the document type. 64k is way too small for a PDF (it 
> causes PDFs not to be parsed), and 1MB, while okay for PDFs, is way too 
> big for an HTML page. This could easily be implemented by adding a key to 
> the plugin.xml for each parser.
> 
> 
> 
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Sven Wende
> Sent: Wednesday, December 29, 2004 5:38 AM
> To: [EMAIL PROTECTED]
> Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> 
> Hi,
> 
> Just a short note. Some weeks ago I described a problem that is 
> closely related to yours:
> 
> Please take a look at
> http://sourceforge.net/mailarchive/message.php?msg_id=10249708 !
> 
> Maybe my considerations can help to find a working solution.
> 
>  
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of 
> > Chirag Chaman
> > Sent: Tuesday, 28 December 2004 20:40
> > To: [EMAIL PROTECTED]
> > Subject: [Nutch-dev] Fetch / Parse errors and a Bug
> > 
> > So, after some research I think one of the 2 issues I reported earlier
> > can get fixed.
> > 
> > To refresh, the error in question is:
> > > fetch okay, but can't parse
> > > http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html,
> > > reason: Content-Type not application/pdf:
> > 
> > The problem is that this page did not specify its content type in the
> > header, and the PDF plugin loads first and has "" for its path
> > suffix. The same goes for the parse-HTML plugin.
> > Therefore, when the fetcher cannot get the content type of a page (i.e.
> > the page does not specify the content type), the PDF plugin gets
> > called.
> > 
> > Now, it's easy to fix this by putting "PDF" for the pathsuffix for the
> > parser-pdf... until I read this in Matt Kangas'
> > documentation of the HTML plugin (Wiki):
> > 
> > "This entry looks a bit strange with the empty pathSuffix value. But
> > that just means that this plugin doesn't match any pathSuffix value.
> > So, parse-html is only used when we fetch remote URLs, not anything 
> > residing on the local filesystem."
> > 
> > Focusing on the sentence "So,.....filesystem": does this mean it's
> > best to leave the pathsuffix blank if we want this invoked for remote
> > URLs?  This was a bit confusing.
> > 
> > ***IS IT OKAY TO ADD PDF for the pathsuffix?  
> > 
> > 
> > 
> > And lastly, I think there may be a bug in the getSuffix() in 
> > ParseFactory.java
> > 
> > We use full URLs including the query string -- at times they may contain
> > "/" or ".". Also, an anchor "#" can be followed by any characters in the URL.
> > 
> > Thus, to account for this, the function should be changed as follows:
> > 
> > - newurl = substring of url up to the first "#" 
> > - newurl = substring of newurl up to "?" 
> >     (this should give us a string that will be the "root" url)
> > - now look for the last "." and return from there to the end of the string.
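
The steps above can be sketched as follows. This is not the actual
ParseFactory code: the class name UrlSuffix is hypothetical, returning the
suffix without the leading dot is an assumption (so that it lines up with
pathSuffix values such as "pdf"), and the check that the dot comes after the
last "/" is an extra guard that is not in the list above.

```java
class UrlSuffix {
    /** Returns the text after the last '.' of the URL's path, or "" if there is none. */
    static String getSuffix(String url) {
        String s = url;

        int hash = s.indexOf('#');          // step 1: drop the anchor
        if (hash != -1) {
            s = s.substring(0, hash);
        }

        int query = s.indexOf('?');         // step 2: drop the query string
        if (query != -1) {
            s = s.substring(0, query);
        }

        int dot = s.lastIndexOf('.');       // step 3: last '.' in what remains
        int slash = s.lastIndexOf('/');
        if (dot == -1 || dot < slash) {     // assumed guard: ignore dots in earlier segments
            return "";
        }
        return s.substring(dot + 1);
    }

    public static void main(String[] args) {
        // Dots and slashes in the query string and anchor no longer confuse the result.
        System.out.println(getSuffix("http://example.com/docs/report.pdf?x=a.b#sec/1.2")); // prints "pdf"
        System.out.println(getSuffix("http://example.com/dir/page").isEmpty());            // prints "true"
    }
}
```

A URL without any path, such as a bare host name, is not handled specially
here and would need its own check.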
> > 
> > 
> > 
> >  
> > 
> > 
> > 
> > 
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide
> > Read honest & candid reviews on hundreds of IT Products from real users.
> > Discover which products truly live up to the hype. Start reading now. 
> > http://productguide.itmanagersjournal.com/
> > _______________________________________________
> > Nutch-developers mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers