RE: [Nutch-dev] Fetch / Parse errors and a Bug

Sven Wende Thu, 30 Dec 2004 16:31:05 -0800

Xin-Yi Liu:

> I believe that the String.CASE_INSENSITIVE_ORDER comparator 
> only affects the way the keys are ordered internally within 
> the TreeMap.  It would not affect lookups, so 
> headers.get(key) would still be case sensitive.


No, it does also affect lookups. Just try the main() method I provided.
The problem is the copy to a Properties object.
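
To make the point concrete, here is a minimal, self-contained sketch using only java.util classes. The class name LookupDemo and its two helper methods are made up for illustration and are not part of Nutch, and the code uses generics, which the Nutch sources do not. It shows that the comparator decides which keys a TreeMap treats as equal, and that this is lost once the entries are copied into a Properties object, which matches keys by hashCode()/equals().

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical demo class, not part of Nutch.
class LookupDemo {

    // Builds a header map ordered by a case-insensitive comparator,
    // the same way parseHeaders() creates its TreeMap.
    static TreeMap<String, String> caseInsensitiveHeaders() {
        TreeMap<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        headers.put("content-type", "application/pdf");
        return headers;
    }

    // Copies the entries into a plain Properties object via putAll().
    static Properties copyToProperties(Map<String, String> headers) {
        Properties copy = new Properties();
        copy.putAll(headers);
        return copy;
    }

    public static void main(String[] args) {
        TreeMap<String, String> headers = caseInsensitiveHeaders();
        Properties copy = copyToProperties(headers);

        // The TreeMap uses the comparator for lookups, not only for ordering.
        System.out.println(headers.get("Content-Type"));         // application/pdf
        System.out.println(headers.containsKey("CONTENT-TYPE")); // true

        // Properties is a Hashtable, so only the exact spelling matches.
        System.out.println(copy.get("Content-Type"));            // null
        System.out.println(copy.get("content-type"));            // application/pdf
    }
}
```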

> Perhaps subclassing Properties to make all gets and puts case 
> insensitive is the best solution.

That's exactly what I have done in a local patch on my system.
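
The local patch itself is not included in this mail. The following is only a rough sketch of what such a subclass could look like, not the actual patch: the class name CaseInsensitiveProperties and the choice to normalize keys to lower case are assumptions made for illustration. putAll() and getProperty() are overridden explicitly because, depending on the Java version, the inherited versions may not go through the overridden put() and get().

```java
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch, not the actual patch: stores every String key in lower case.
class CaseInsensitiveProperties extends Properties {
    private static final long serialVersionUID = 1L;

    // Lower-cases String keys; other key types are passed through unchanged.
    private static Object normalize(Object key) {
        return (key instanceof String) ? ((String) key).toLowerCase(Locale.ROOT) : key;
    }

    @Override
    public synchronized Object put(Object key, Object value) {
        return super.put(normalize(key), value);
    }

    @Override
    public synchronized void putAll(Map<?, ?> t) {
        // Route every entry through put() so that its key is normalized.
        for (Map.Entry<?, ?> e : t.entrySet()) {
            put(e.getKey(), e.getValue());
        }
    }

    @Override
    public synchronized Object get(Object key) {
        return super.get(normalize(key));
    }

    @Override
    public String getProperty(String key) {
        return super.getProperty(key.toLowerCase(Locale.ROOT));
    }

    @Override
    public synchronized boolean containsKey(Object key) {
        return super.containsKey(normalize(key));
    }

    @Override
    public synchronized Object remove(Object key) {
        return super.remove(normalize(key));
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        parsed.put("content-type", "application/pdf");

        CaseInsensitiveProperties headers = new CaseInsensitiveProperties();
        headers.putAll(parsed);

        System.out.println(headers.get("Content-Type"));         // application/pdf
        System.out.println(headers.getProperty("CONTENT-TYPE")); // application/pdf
    }
}
```

One trade-off of this approach is that the original spelling of the header names is lost, since only the lower-cased keys are stored.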


> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On 
> Behalf Of Xin-Yi Liu
> Sent: Thursday, 30 December 2004 23:55
> To: [EMAIL PROTECTED]
> Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> 
> I believe that the String.CASE_INSENSITIVE_ORDER comparator 
> only affects the way the keys are ordered internally within 
> the TreeMap.  It would not affect lookups, so 
> headers.get(key) would still be case sensitive.
> 
> Perhaps subclassing Properties to make all gets and puts case 
> insensitive is the best solution.
> 
> --- Sven Wende <[EMAIL PROTECTED]> wrote:
> 
> > I think it's only case insensitive in the TreeMap that
> > parseHeaders() produces!
> > 
> > But in the constructor this map is copied into a Properties object.
> > 
> > ********************************************
> >     // parse headers
> >     headers.putAll(parseHeaders(in, line));
> > ********************************************
> > 
> > Look at the following snippet, which does the same thing as 
> > HttpResponse.class does:
> > 
> > ********************************************
> >     public static void main(String[] args) {
> >         TreeMap headers = new TreeMap(String.CASE_INSENSITIVE_ORDER);
> >         headers.put("content-type", "text");
> > 
> >         Properties headers2 = new Properties();
> >         headers2.putAll(headers);
> > 
> >         System.out.println(headers.get("Content-Type"));  // = "text"
> >         System.out.println(headers2.get("Content-Type")); // = null
> >     }
> > ********************************************
> > 
> > You can use the following url for your tests:
> >     
> >     http://www.verdi.de/0x0ac80f2b_0x0069a759
> > 
> > It is a PDF file and the server sends
> > "Content-type: application/pdf"!
> > 
> > 
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On
> > > Behalf Of Chirag Chaman
> > > Sent: Wednesday, 29 December 2004 16:22
> > > To: [EMAIL PROTECTED]
> > > Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> > > 
> > > That is strange, because I would expect it to be case insensitive,
> > > but then again I have not tested it, I am just looking at the code.
> > > 
> > > You see how the TreeMap is initialized with 
> > > String.CASE_INSENSITIVE_ORDER
> > > 
> > > private Map parseHeaders(PushbackInputStream in, StringBuffer line)
> > >     throws IOException, HttpException {
> > >     TreeMap headers = new TreeMap(String.CASE_INSENSITIVE_ORDER);
> > >     return parseHeaders(in, line, headers);
> > > }
> > > 
> > > So I would imagine that a lookup for Content-Type is case
> > > insensitive as well.
> > > 
> > > 
> > > Can you send me the link to a page that has this problem --
> > > I'll run some tests to see what's causing this.
> > > 
> > > 
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On
> > > Behalf Of Sven Wende
> > > Sent: Wednesday, December 29, 2004 9:16 AM
> > > To: [EMAIL PROTECTED]
> > > Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> > > 
> > > Chirag:
> > > 
> > > > I looked at where you mention that the content type is being
> > > > looked up and is Case Sensitive -- that is not correct. The HTTP
> > > > protocol is adding the Content-type to the TreeMap which is
> > > > initialized with the String.CASE_INSENSITIVE_ORDER comparator.
> > > > Thus it internally will do a case-insensitive match.
> > > 
> > > Which code do you refer to?
> > > 
> > > I described a problem in the protocol-http plugin. Just take
> > > a look at the following code snippet from the CVS. As you can
> > > see, the headers are read in and stored in a simple Hashtable.
> > > The problem with case sensitive headers for content-type occurs,
> > > for example, in the toContent() method.
> > > 
> > > ********************************************
> > > package net.nutch.protocol.http;
> > > 
> > > /** An HTTP response. */
> > > 
> > > public class HttpResponse {
> > >   private Properties headers = new Properties();
> > > 
> > >   /** Returns the value of a named header. */
> > >   public String getHeader(String name) {
> > >     return (String)headers.get(name);
> > >   }
> > > 
> > >   public Content toContent() {
> > >     String contentType = getHeader("Content-Type");
> > >     if (contentType == null)
> > >       contentType = "";
> > >     return new Content(orig, base, content, contentType, headers);
> > >   }
> > > 
> > >   private void processHeaderLine(StringBuffer line, TreeMap headers)
> > >     throws IOException, HttpException {
> > >     int colonIndex = line.indexOf(":");       // key is up to colon
> > >     if (colonIndex == -1) {
> > >       int i;
> > >       for (i= 0; i < line.length(); i++)
> > >         if (!Character.isWhitespace(line.charAt(i)))
> > >           break;
> > >       if (i == line.length())
> > >         return;
> > >       throw new HttpException("No colon in header:" + line);
> > >     }
> > >     String key = line.substring(0, colonIndex);
> > > 
> > >     int valueStart = colonIndex+1;            // skip whitespace
> > >     while (valueStart < line.length()) {
> > >       int c = line.charAt(valueStart);
> > >       if (c != ' ' && c != '\t')
> > >         break;
> > >       valueStart++;
> > >     }
> > >     String value = line.substring(valueStart);
> > > 
> > >     headers.put(key, value);
> > >   }
> > > }
> > > ********************************************
> > > 
> > > > I think the problem is that no "content-type" was ever on the
> > > > page -- this leaves both the content type and the
> > > > extension/suffix to be blank and that causes a problem. Also, if
> > > > a character-set is also not specified then the fetcher fails as
> > > > well (as it cannot write to disk).
> > > 
> > > I tested it and there was a "content-type" header. If its
> > > name was "Content-Type", everything was OK, but if its name
> > > was "content-type", Nutch internally loses the information
> > > about the content type through the use of the code above.
> > > 
> > 
> === message truncated ===
> 
> -------------------------------------------------------
> The SF.Net email is sponsored by: Beat the post-holiday blues
> Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek.
> It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 
> 





