[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482379#comment-14482379
 ] 

Chris A. Mattmann commented on NUTCH-1854:
------------------------------------------

Hey Lewis, 2 specific issues to point you to:

1. NUTCH-1771 - the generic class being implemented there could be reused to 
check, in a workflow-oriented fashion, whether parse_text exists and, if it 
does not, to regenerate it or run a quick parse cycle on any URLs that don't 
have parse_text data.

2. The reason it fails is that it throws an Exception, as Asitang noted. We 
can simply get around this by catching the exception, logging the error, and 
then correcting for it downstream in a crawl (lights out) oriented fashion, 
using NUTCH-1771 and some logic to call the ParseJob for any URLs whose parse 
data is missing before e.g. IndexingJob, etc.

And yes, thanks for the context. I am all for dealing with #1 and #2 above. 
People like [~asitang] and [~chongli] are trying to deal with this too, and 
we can help shepherd it in.
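
A rough sketch of the kind of check discussed here, in the spirit of the 
skip-if-already-parsed trigger suggested in the issue description. It is an 
illustration only, not the actual crawl script: the helper names segment_has 
and parse_if_needed and the PARSE_CMD variable are hypothetical, and it 
assumes segments sit on the local filesystem (on HDFS the directory test 
would have to go through the Hadoop filesystem shell instead).

```shell
#!/usr/bin/env bash
# Sketch only. Assumes a segment is a local directory that contains
# subdirectories such as crawl_parse and parse_text once it has been parsed.

# segment_has SEGMENT_DIR SUBDIR (hypothetical helper)
# Succeeds when the named subdirectory exists inside the segment.
segment_has() {
  [ -d "$1/$2" ]
}

# parse_if_needed SEGMENT_DIR (hypothetical helper)
# Skips the parse step when crawl_parse is already present, so a parsing
# fetcher does not lead to "Segment already parsed!".
parse_if_needed() {
  local segment="$1"
  if segment_has "$segment" crawl_parse; then
    echo "skip $segment"
  else
    echo "parse $segment"
    # PARSE_CMD is an assumption, e.g. PARSE_CMD="bin/nutch parse".
    # It defaults to a no-op so the sketch runs without Nutch installed.
    ${PARSE_CMD:-true} "$segment"
  fi
}
```

Looping over the segments, e.g. `for s in crawl/segments/*; do 
parse_if_needed "$s"; done`, would give the lights-out behaviour, and 
`segment_has "$s" parse_text` is the analogous test for #1.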

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false,
>   which means that a separate parsing step is required after fetching is
>   finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
