[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490711#comment-14490711
 ] 

Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

A nice, easy fix that really makes things much better. The existing behavior 
of not having a parsing fetcher out of the box is, IMHO, the proper default, 
and this patch works perfectly with that. 

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>         Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content. Default is false,
>   which means that a separate parsing step is required after fetching is
>   finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.
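
The check proposed in the description could look roughly like the following.
This is only a sketch, not the code from the attached patches: the function
names and the PARSE_CMD variable are made up for illustration, and it assumes
the segment lives on the local filesystem (on HDFS the directory test would
need something like `hadoop fs -test -d` instead).

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the "skip parsing if crawl_parse exists"
# idea. Not taken from NUTCH-1854ver1.patch or NUTCH-1854ver2.patch.

# Returns 0 if the segment already contains parse output.
# Assumption: the segment is a directory on the local filesystem.
segment_is_parsed() {
  local segment="$1"
  [ -d "$segment/crawl_parse" ]
}

# Runs the parse step only when the segment has not been parsed yet.
# PARSE_CMD is an assumed override hook; it defaults to `bin/nutch parse`.
parse_if_needed() {
  local segment="$1"
  if segment_is_parsed "$segment"; then
    echo "Skipping parse: $segment/crawl_parse already exists"
  else
    echo "Parsing: $segment"
    ${PARSE_CMD:-bin/nutch parse} "$segment"
  fi
}
```

With a parsing fetcher the segment already has a crawl_parse directory after
the fetch step, so parse_if_needed would log a message and move on instead of
hitting "Segment already parsed!".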



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
