RE: [Nutch-dev] Fetch / Parse errors and a Bug

Sven Wende Wed, 29 Dec 2004 08:05:33 -0800

I think it's only case-insensitive in the TreeMap that parseHeaders() produces!


But in the constructor this map is copied into a Properties object.

********************************************
        // parse headers
        headers.putAll(parseHeaders(in, line));
********************************************

Look at the following snippet, which does the same thing as
HttpResponse.class does:

********************************************
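    import java.util.Properties;
    import java.util.TreeMap;

    // HeaderCaseDemo is just a wrapper name so the snippet compiles on its own
    public class HeaderCaseDemo {
        public static void main(String[] args) {
            // same comparator that parseHeaders() uses
            TreeMap<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
            headers.put("content-type", "text");

            // same copy that the HttpResponse constructor performs
            Properties headers2 = new Properties();
            headers2.putAll(headers);

            System.out.println(headers.get("Content-Type"));  // = "text"

            System.out.println(headers2.get("Content-Type")); // = null
        }
    }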
********************************************
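
One possible workaround, sketched under my own assumptions and not taken from the Nutch sources: lower-case every header name when it is copied into the Properties object, and lower-case the name again on lookup. The class and method names below (HeaderLookupSketch, toNormalizedProperties, getHeader) are made up for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: keeps header lookups
// case-insensitive even after the copy into a Properties object.
class HeaderLookupSketch {

    /** Copies the parsed headers into a Properties object, lower-casing every key. */
    static Properties toNormalizedProperties(Map<String, String> parsed) {
        Properties props = new Properties();
        for (Map.Entry<String, String> e : parsed.entrySet()) {
            props.setProperty(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return props;
    }

    /** Looks a header up by name, lower-casing the name the same way. */
    static String getHeader(Properties headers, String name) {
        return headers.getProperty(name.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        parsed.put("Content-type", "application/pdf");

        Properties headers = toNormalizedProperties(parsed);
        System.out.println(getHeader(headers, "Content-Type"));
        System.out.println(getHeader(headers, "content-type"));
    }
}
```

The trade-off is that any code which later reads the Properties object directly would see lower-cased names, so every reader would have to go through the same lookup helper.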

You can use the following URL for your tests:

        http://www.verdi.de/0x0ac80f2b_0x0069a759

It is a PDF file, and the server sends "Content-type: application/pdf"!


> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On 
> Behalf Of Chirag Chaman
> Sent: Wednesday, 29 December 2004 16:22
> To: [EMAIL PROTECTED]
> Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> 
> That is strange, because I would expect it to be case
> insensitive, but then again I have not tested it; I am just
> looking at the code.
> 
> You see how the TreeMap is initialized with 
> String.CASE_INSENSITIVE_ORDER
> 
> private Map parseHeaders(PushbackInputStream in, StringBuffer line)
>     throws IOException, HttpException {
>     TreeMap headers = new TreeMap(String.CASE_INSENSITIVE_ORDER);
>     return parseHeaders(in, line, headers);
> }
> 
> So I would imagine that a look up for Content-Type is case 
> insensitive as well.
> 
> 
> Can you send me the link to a page that has this problem -- 
> I'll run some tests to see what's causing this.
> 
> 
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On 
> Behalf Of Sven Wende
> Sent: Wednesday, December 29, 2004 9:16 AM
> To: [EMAIL PROTECTED]
> Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> 
> Chirag:
> 
> > I looked at where you mention that the content type is being looked up
> > and is Case Sensitive -- that is not correct. The HTTP protocol is
> > adding the Content-type to the TreeMap which is initialized with the
> > String.CASE_INSENSITIVE_ORDER comparator. Thus it internally will do a
> > case-insensitive match.
> 
> Which code do you refer to?
> 
> I described a problem in the protocol-http plugin. Just take
> a look at the following code snippet from CVS. As you can
> see, the headers are read in and stored in a simple Hashtable
> (Properties extends Hashtable). The problem with case-sensitive
> header names for the content type occurs, for example, in the
> toContent() method.
> 
> ********************************************
> package net.nutch.protocol.http;
> 
> /** An HTTP response. */
> 
> public class HttpResponse {
>   private Properties headers = new Properties();                    
> 
>   /** Returns the value of a named header. */
>   public String getHeader(String name) {
>     return (String)headers.get(name);
>   }
> 
>   public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null)
>       contentType = "";
>     return new Content(orig, base, content, contentType, headers);
>   }
> 
>   private void processHeaderLine(StringBuffer line, TreeMap headers)
>     throws IOException, HttpException {
>     int colonIndex = line.indexOf(":");       // key is up to colon
>     if (colonIndex == -1) {
>       int i;
>       for (i= 0; i < line.length(); i++)
>         if (!Character.isWhitespace(line.charAt(i)))
>           break;
>       if (i == line.length())
>         return;
>       throw new HttpException("No colon in header:" + line);
>     }
>     String key = line.substring(0, colonIndex);
> 
>     int valueStart = colonIndex+1;            // skip whitespace
>     while (valueStart < line.length()) {
>       int c = line.charAt(valueStart);
>       if (c != ' ' && c != '\t')
>         break;
>       valueStart++;
>     }
>     String value = line.substring(valueStart);
> 
>     headers.put(key, value);
>   }
> }
> ********************************************
> 
> > I think the problem is that no "content-type" was ever on the page --
> > this leaves both the content type and the extension/suffix to be blank
> > and that causes a problem. Also, if a character-set is also not
> > specified then the fetcher fails as well (as it cannot write to disk).
> 
> I tested it and there was a "content-type" header. If its
> name was "Content-Type", everything was OK, but if its name
> was "content-type", Nutch internally loses the information
> about the content type because of the code above.
> 
> 
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of 
> > Chirag Chaman
> > Sent: Wednesday, 29 December 2004 13:59
> > To: [EMAIL PROTECTED]
> > Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> > 
> > Sven:
> > 
> > Yes, this is related. Bill Goffe seems to have had the same problem.
> > 
> > So here's the EASY fix. I tested it over the last few hours with 100k
> > pages and it's working as it should. Simply add "pdf" and "doc" for
> > the pathSuffix of parser-pdf and parser-doc. In my opinion no other
> > parser plugin should have its pathSuffix left blank unless it wants to
> > be the default handler -- HTML should be the only one.
> > 
> > I looked at where you mention that the content type is being looked up
> > and is Case Sensitive -- that is not correct. The HTTP protocol is
> > adding the Content-type to the TreeMap which is initialized with the
> > String.CASE_INSENSITIVE_ORDER comparator. Thus it internally will do a
> > case-insensitive match.
> > 
> > I think the problem is that no "content-type" was ever on the page --
> > this leaves both the content type and the extension/suffix to be blank
> > and that causes a problem. Also, if a character-set is also not
> > specified then the fetcher fails as well (as it cannot write to disk).
> > 
> > I think we need to have global defaults if we encounter such a problem
> > -- the Content type should be set to text/html and the character-set
> > should be ISO-8859 or UTF-8.
> > 
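
The fallback idea above could look roughly like the following. This is only a sketch: the class and method names are hypothetical and not taken from Nutch, and since the message leaves the choice between "ISO-8859" and UTF-8 open, the code simply assumes UTF-8. It also does not handle quoted charset values.

```java
// Hypothetical helper, not part of Nutch: supplies defaults when a
// response carries no usable Content-Type or charset information.
class ContentDefaultsSketch {
    static final String DEFAULT_CONTENT_TYPE = "text/html";
    static final String DEFAULT_CHARSET = "UTF-8"; // assumption, see lead-in

    /** Returns the header value, or text/html if it is missing or blank. */
    static String contentTypeOrDefault(String headerValue) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return DEFAULT_CONTENT_TYPE;
        }
        return headerValue.trim();
    }

    /** Extracts the charset parameter from a Content-Type value, or falls back. */
    static String charsetOrDefault(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // case-insensitive match on the parameter name "charset="
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String cs = p.substring(8).trim();
                    if (!cs.isEmpty()) {
                        return cs;
                    }
                }
            }
        }
        return DEFAULT_CHARSET;
    }
}
```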
> > Doug, since you initially wrote the http protocol, what's the best way
> > to proceed?
> > 
> > Thanks
> > CC
> > 
> > Just as a side note, it would be AWESOME if we could specify the max fetch
> > length based on the document type. 64k is way too small for a PDF (it
> > causes PDFs not to be parsed), and 1MB, while okay for PDFs, is way too
> > big for an HTML page. This could easily be implemented by adding a key to
> > the plugin.xml for each parser.
> > 
> > 
> > 
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of 
> > Sven Wende
> > Sent: Wednesday, December 29, 2004 5:38 AM
> > To: [EMAIL PROTECTED]
> > Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> > 
> > Hi,
> > 
> > Just a short note. Some weeks ago I described a problem which
> > strongly correlates with yours:
> > 
> > Please take a look at
> > http://sourceforge.net/mailarchive/message.php?msg_id=10249708 !
> > 
> > Maybe my considerations can help to find a working solution.
> > 
> >  
> > 
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On 
> Behalf Of 
> > > Chirag Chaman
> > > Sent: Tuesday, 28 December 2004 20:40
> > > To: [EMAIL PROTECTED]
> > > Subject: [Nutch-dev] Fetch / Parse errors and a Bug
> > > 
> > > > So, after some research I think one of the 2 issues I reported earlier
> > > > can get fixed.
> > > 
> > > > To refresh, the error in question is:
> > > > > fetch okay, but can't parse
> > > > > http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html,
> > > > > reason: Content-Type not application/pdf:
> > > 
> > > > The problem is that this page did not specify its content type in the
> > > > header, and the PDF plugin loads first and has a "" for its path
> > > > suffix. The same goes for the parse-HTML plugin.
> > > > Therefore, when the fetcher cannot get the content type of a page (i.e.
> > > > the page does not specify the content type), the PDF plugin gets
> > > > called.
> > > 
> > > > Now, it's easy to fix this by putting "PDF" for the pathsuffix for the
> > > > parser-pdf... until I read this in Matt Kangas'
> > > > documentation of the HTML plugin (Wiki):
> > > 
> > > "This entry looks a bit strange with the empty pathSuffix
> > value. But
> > > that just means that this plugin doesn't match any 
> pathSuffix value.
> > > So, parse-html is only used when we fetch remote URLs, 
> not anything 
> > > residing on the local filesystem."
> > > 
> > > > Focusing on the sentence "So, ... filesystem": does this mean it's
> > > > best to leave the pathsuffix blank if we want this invoked for remote
> > > > URLs?  This was a bit confusing.
> > > 
> > > ***IS IT OKAY TO ADD PDF for the pathsuffix?  
> > > 
> > > 
> > > 
> > > And lastly, I think there may be a bug in the getSuffix() in 
> > > ParseFactory.java
> > > 
> > > > We use full URLs including the query string -- at times they may contain
> > > > "/" or ".". Also, an anchor "#" can be followed by arbitrary characters
> > > > in the URL.
> > > > 
> > > > Thus, to account for this, the function should be changed as follows:
> > > > 
> > > > - newurl = substring of url up to the first "#"
> > > > - newurl = substring of newurl up to "?"
> > > >   (this should give us a string that will be the "root" url)
> > > > - now look for the last "." and return from there to the end of the string.
> > > 
> > > 
> > > 
> > >  
> > > 
> > > 
> > > 
> > > 
> > > -------------------------------------------------------
> > > > SF email is sponsored by - The IT Product Guide
> > > > Read honest & candid reviews on hundreds of IT Products from real users.
> > > > Discover which products truly live up to the hype. Start reading now.
> > > http://productguide.itmanagersjournal.com/
> > > _______________________________________________
> > > Nutch-developers mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> > > 
> > > 
> > 
> > 
> > 
> > 
> > 
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide Read honest 
> & candid 
> > reviews on hundreds of IT Products from real users.
> > Discover which products truly live up to the hype. Start 
> reading now. 
> > http://productguide.itmanagersjournal.com/
> > _______________________________________________
> > Nutch-developers mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> > 
> > 
> > 
> > 
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide Read honest 
> & candid 
> > reviews on hundreds of IT Products from real users.
> > Discover which products truly live up to the hype. Start 
> reading now. 
> > http://productguide.itmanagersjournal.com/
> > _______________________________________________
> > Nutch-developers mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> > 
> > 
> 
> 
> 
> 
> 
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide Read honest & 
> candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now. 
> http://productguide.itmanagersjournal.com/
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 
> 
> 
> 
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide Read honest & 
> candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now. 
> http://productguide.itmanagersjournal.com/
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 
> 





