Jérôme,

Are you mainly concerned with the charset in Content-Type?

Currently, what happens when Content-Type exists both in the HTTP layer and in a META tag (if the content is HTML)?

How does Nutch guess the Content-Type, and when does it need to do that? Is there a situation where the guessed content-type differs from the content-type in the metadata? If so, which class uses which?

-kuro
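To make the question concrete, here is a rough sketch of how option 2 from the quoted message below might look. It is only an illustration under assumptions: the `Content` and `ParseData` classes here are simplified stand-ins and not the real Nutch classes, the metadata is a plain `Map`, and the `Content-Type` key and the accessor names are hypothetical. Only the `GUESSED_CONTENT_TYPE` attribute name comes from the proposal itself.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-ins only; these are NOT the real Nutch classes. */
public class ContentTypeSketch {

    // Key for the raw, protocol-level type (assumed name for this sketch).
    static final String CONTENT_TYPE = "Content-Type";
    // Attribute name suggested in the quoted proposal.
    static final String GUESSED_CONTENT_TYPE = "GUESSED_CONTENT_TYPE";

    /** Simplified Content: keeps both types in its metadata (option 2). */
    static class Content {
        private final Map<String, String> metadata = new HashMap<>();

        Content(String rawType, String guessedType) {
            metadata.put(CONTENT_TYPE, rawType);
            metadata.put(GUESSED_CONTENT_TYPE, guessedType);
        }

        Map<String, String> getMetadata() {
            return metadata;
        }
    }

    /** Simplified ParseData: built from the Content's metadata only. */
    static class ParseData {
        private final Map<String, String> metadata;

        ParseData(Map<String, String> metadata) {
            // Copy so later changes to the Content do not leak in.
            this.metadata = new HashMap<>(metadata);
        }

        String getRawContentType() {
            return metadata.get(CONTENT_TYPE);
        }

        String getGuessedContentType() {
            return metadata.get(GUESSED_CONTENT_TYPE);
        }
    }

    public static void main(String[] args) {
        // A server that labels an HTML page as text/plain.
        Content content = new Content("text/plain", "text/html");
        ParseData parseData = new ParseData(content.getMetadata());

        System.out.println("raw:     " + parseData.getRawContentType());
        System.out.println("guessed: " + parseData.getGuessedContentType());
    }
}
```

With this shape the constructor signature of the parse-data object does not change, which matches the "no impact on APIs" argument for option 2. The cost is that the guessed type becomes one more metadata entry that every reader has to know about.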
> -----Original Message-----
> From: Jérôme Charron [mailto:[EMAIL PROTECTED]
> Sent: 2006-4-13 12:57
> To: [email protected]
> Subject: Re: Content-Type inconsistency?
>
> I would like to come back to this issue:
> The Content object holds two content-types:
> 1. The raw content-type from the protocol layer (the HTTP header in the case of HTTP), stored in the Content's metadata.
> 2. The guessed content-type, stored in a private content-type field.
>
> When a ParseData object is created, it takes only the Content's metadata.
> So the ParseData can access only the raw content-type, not the guessed one.
>
> What I suggest is:
> 1. Add a content-type parameter to the ParseData constructors (so that Parsers can pass the guessed content-type to ParseData).
> 2. Have the Content object store the guessed content-type in its metadata, in a special attribute named for instance GUESSED_CONTENT_TYPE, so that the ParseData can access it.
>
> I think 1. is really the cleanest way to implement this, but a lot of code is impacted => all the parsers.
> Solution 2. has no impact on APIs, so the code changes are very small.
>
> Suggestions? Comments?
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
