crawl fails with a parsing fetcher

Lewis John McGibbney (JIRA) Mon, 06 Apr 2015 18:30:36 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482371#comment-14482371
 ]


Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

In the past, I've seen fetch tasks fail when parsing, invoked during
fetching, fails.
There are various ways to overcome this. As you've said, one is to generate
more, smaller fetch lists, so that if a parsing fetcher fails we limit how
much of the fetch results we lose.

You've also noted that simply checking for the parse directory later on is
a workaround of sorts, but it does not prevent interruptions in a typical
workflow should a parsing fetcher fail.
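
For context, the check mentioned above could look roughly like the sketch
below. This is only an illustration, not the actual change to bin/crawl:
the function names segment_is_parsed and parse_if_needed are hypothetical,
and it assumes the segment lives on the local filesystem (on HDFS the
directory test would have to go through the Hadoop filesystem commands
instead).

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the segment already has parse output,
# i.e. a crawl_parse directory exists inside it (local filesystem only).
segment_is_parsed() {
  local segment="$1"
  [ -d "$segment/crawl_parse" ]
}

# Hypothetical wrapper: the first argument is the segment, the remaining
# arguments are the parse command to run when the segment is not yet parsed.
parse_if_needed() {
  local segment="$1"
  shift
  if segment_is_parsed "$segment"; then
    echo "Skipping parse: $segment already contains crawl_parse"
    return 0
  fi
  "$@"
}
```

It would be called along the lines of
parse_if_needed "$SEGMENT" bin/nutch parse "$SEGMENT", where SEGMENT is
whatever variable holds the current segment path.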

This is a Nutch gotcha which I've been aware of since my early use of
Nutch. It's something that's stuck with me and is probably more habit now
than anything else, Chris. The crawl script reflects this behavior, always
running a separate parse step, which is why it fails when attempting to
reparse a segment. The parsing fetcher is disabled by default, based on the
underlying assumption that Nutch will be invoked as a breadth-first crawl.
This is also reflected in the default settings, which ignore internal links
but follow external links.

I understand that the goal here is to move towards more of an interactive
understanding of Crawldb and Record status, and I am supportive of that. I
hope the above provides some context to Azitang and others.




-- 
*Lewis*


> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1854
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1854
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content. Default is false,
>   which means that a separate parsing step is required after fetching is
>   finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
