[ 
https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892562#comment-15892562
 ] 

Sebastian Nagel commented on NUTCH-2363:
----------------------------------------

Hi Markus,
I'm a little concerned (and agree with Julien) about the complexity 
introduced to handle cookies. But I know that it may be the only way to crawl 
some sites: Nutch does not provide a way to persist per-domain information such 
as cookies, and on the protocol level cookies are persisted only for a single 
fetch task. Some remarks about the patch:

* are you sure that copying arbitrary metadata from a link or redirect source 
to the target CrawlDatum can never do harm? Wouldn't it be better to use a 
restricted and configurable set of keys, or to do this only for the "Cookie" 
metadata key?
* handleCookies(datum, content, queueId) is only called from two places, 
success and redirect (no need for 404s, exceptions, etc.). Maybe call it 
directly there: it's one more line of code, but one indirection less, and 
output(...) gets a shorter argument list, which makes it easier to understand 
what's happening.
* CookieScoringFilter could be simplified by inheriting from 
AbstractScoringFilter
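
To illustrate the first remark, here is a minimal sketch of copying only a 
restricted, configurable set of metadata keys. It uses plain java.util maps 
as a stand-in for the CrawlDatum metadata, so it is not code from the patch. 
The class name, the method names and the comma-separated key list (which 
could come from a configuration property) are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch or of the attached patch.
class RestrictedMetadataCopy {

    // Parses a comma-separated list of allowed keys, e.g. the value of an
    // (assumed) configuration property. Empty entries are skipped.
    static Set<String> parseKeys(String commaSeparated) {
        Set<String> keys = new LinkedHashSet<>();
        for (String key : commaSeparated.split(",")) {
            String trimmed = key.trim();
            if (!trimmed.isEmpty()) {
                keys.add(trimmed);
            }
        }
        return keys;
    }

    // Copies only the allowed keys from source to target. All other
    // metadata of the source is left out of the target.
    static void copyAllowed(Map<String, String> source,
                            Map<String, String> target,
                            Set<String> allowedKeys) {
        for (String key : allowedKeys) {
            String value = source.get(key);
            if (value != null) {
                target.put(key, value);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the metadata of the link or redirect source.
        Map<String, String> source = new HashMap<>();
        source.put("Cookie", "session=abc123");
        source.put("some.other.key", "should not be copied");

        // Stand-in for the metadata of the target datum.
        Map<String, String> target = new HashMap<>();

        copyAllowed(source, target, parseKeys("Cookie"));
        System.out.println(target.keySet());
    }
}
```

With the key list set to just "Cookie", this behaves like the second option 
in the remark: only the cookie is passed on, and any other metadata of the 
source stays where it is.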

> Fetcher support for reading and setting cookies
> -----------------------------------------------
>
>                 Key: NUTCH-2363
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2363
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.13
>
>         Attachments: NUTCH-2363.patch
>
>
> Patch adds basic support for cookies in the fetcher, and a scoring plugin 
> that passes cookies to its outlinks, within the domain. Sub-domain or path 
> based is not supported.
> This is useful if you want to maintain sessions or need to get around a 
> cookie wall.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
