[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547425#comment-14547425 ]

Hudson commented on NUTCH-1854:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3125 (See [https://builds.apache.org/job/Nutch-trunk/3125/])
Re-apply NUTCH-1854 after mistakenly rolled back during NUTCH-1973. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1679911)
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentChecker.java

> ./bin/crawl fails with a parsing fetcher
> ----------------------------------------
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Sebastian Nagel
> Labels: memex
> Fix For: 1.10
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.</description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger
> to the crawl script which would check for crawl_parse for a given segment and
> then skip parsing if this is present.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501580#comment-14501580 ]

Hudson commented on NUTCH-1854:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3070 (See [https://builds.apache.org/job/Nutch-trunk/3070/])
NUTCH-1854 bin/crawl fails with a parsing fetcher (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1674581)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentChecker.java

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Sebastian Nagel
> Labels: memex
> Fix For: 1.10
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501463#comment-14501463 ]

Chris A. Mattmann commented on NUTCH-1854:
------------------------------------------

Awesome work [~asitang] - [~wastl-nagel] all yours to commit. Thanks! Thanks for the help [~lewismc]

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Labels: memex
> Fix For: 1.10
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495687#comment-14495687 ]

Asitang Mishra commented on NUTCH-1854:
---------------------------------------

okay done Lewis..

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494778#comment-14494778 ]

Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

[~asitang] can you please use the following template to format your code?
http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml
These patches are grand.

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, NUTCH-1854ver3.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493655#comment-14493655 ]

Sebastian Nagel commented on NUTCH-1854:
----------------------------------------

+1 Great! Needs formatting. Will commit soon.

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, NUTCH-1854ver3.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492286#comment-14492286 ]

Sebastian Nagel commented on NUTCH-1854:
----------------------------------------

Thanks, [~asitang]!
* NUTCH-1771 is committed, can you "rebase" the patch?
* for clarity and brevity I would prefer to implement an explicit check "isParsed" in SegmentChecker and call it immediately at the beginning of the parse function, e.g.:
{code}
public void parse(Path segment) throws IOException {
  // Skip the segment early if it has already been parsed.
  if (SegmentChecker.isParsed(segment, FileSystem.get(getConf()))) {
    LOG.warn("...");
    return;
  }
  // ... rest of the existing parse logic
}
{code}

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch
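The "isParsed" check suggested in the comment above can be sketched without Hadoop on the classpath. This is only an illustration, not the committed patch: it uses java.nio.file in place of Hadoop's FileSystem, the class name SegmentCheckSketch is hypothetical, and it assumes (as the issue description suggests) that the presence of a crawl_parse directory marks a segment as parsed.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the isParsed check discussed in this thread. */
class SegmentCheckSketch {

  /** Assumption: a segment counts as parsed once it contains this directory. */
  static final String CRAWL_PARSE_DIR = "crawl_parse";

  /** Returns true if the segment directory already holds parse output. */
  static boolean isParsed(Path segment) {
    return Files.isDirectory(segment.resolve(CRAWL_PARSE_DIR));
  }

  public static void main(String[] args) throws Exception {
    Path segment = Files.createTempDirectory("segment");
    System.out.println(isParsed(segment)); // fresh segment, no parse output yet
    Files.createDirectory(segment.resolve(CRAWL_PARSE_DIR));
    System.out.println(isParsed(segment)); // parse output now present
  }
}
```

Keeping the check in one small static method means both the parse step and any workflow script wrapper could ask the same question before deciding whether to parse.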
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490711#comment-14490711 ]

Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

A nice easy fix which really makes things much better. The existing behavior of not having a parsing fetcher out of the box is, imho, the proper default, and this patch works perfectly with that.

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487622#comment-14487622 ]

Asitang Mishra commented on NUTCH-1854:
---------------------------------------

Sounds logical to add this check to the SegmentChecker (looked at NUTCH-1771), but the SegmentChecker is not yet added to NUTCH trunk.

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487035#comment-14487035 ]

Sebastian Nagel commented on NUTCH-1854:
----------------------------------------

Definitely: fetcher.store.content=false and fetcher.parse=false are mutually exclusive in "normal" crawler work-flows. There are others where this combination makes sense, e.g., check availability of URLs, do network load tests, etc.

I don't think we can make the work-flow (bin/crawl) safe from any misconfigurations, at least when it's about manually editing properties (it would be different with a config GUI). It's important to provide meaningful messages for common errors (cf. NUTCH-1370 if seeds are excluded by URL filters).

+1 for skipping already parsed segments: it does no harm, but
* the message that parsing a segment is skipped should be a warning - it's still a misconfiguration!
* the "finished at ..., elapsed ..." message is misleading if a segment is skipped
* possibly add the checking code to SegmentChecker (NUTCH-1771)

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
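The first two bullet points in the comment above (warn when a segment is skipped, and do not print a "finished" message for it) might look roughly like the sketch below. Everything here is an assumption for illustration: the class and method names are hypothetical, java.util.logging stands in for the project's logger, a Runnable stands in for the real parse job, and a crawl_parse directory is taken as the marker of an already parsed segment.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

/** Hypothetical sketch: skip parsed segments with a warning, report timing only for real work. */
class SkipParsedSketch {
  private static final Logger LOG = Logger.getLogger(SkipParsedSketch.class.getName());

  /** Returns true if the parse job ran, false if the segment was skipped. */
  static boolean parseIfNeeded(Path segment, Runnable parseJob) {
    if (Files.isDirectory(segment.resolve("crawl_parse"))) {
      // A warning rather than info: an already parsed segment hints at a misconfiguration.
      LOG.warning("Segment " + segment + " is already parsed, skipping. Is fetcher.parse set to true?");
      return false; // return before any "finished" message is logged
    }
    long start = System.currentTimeMillis();
    parseJob.run();
    LOG.info("Parse finished, elapsed: " + (System.currentTimeMillis() - start) + " ms");
    return true;
  }
}
```

Returning a boolean lets a caller distinguish "parsed now" from "skipped" without relying on an exception.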
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485299#comment-14485299 ]

lufeng commented on NUTCH-1854:
-------------------------------

If we set "fetcher.store.content=false" and "fetcher.parse=false", then the "bin/nutch parse" command will throw an exception when it checks whether the input content directory exists.
I think the reason we need this parameter is that sometimes we set "fetcher.parse" to true and don't want to store the content because of a slow disk or limited disk space.
So I think we can remove the "fetcher.store.content" parameter and, if "fetcher.parse=true", not store the page content.

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
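The property combination described in the comment above could be detected up front, before a separate parse step is attempted. The sketch below is hypothetical: java.util.Properties stands in for the real configuration object, the class and method names are invented, and the fallback value for fetcher.store.content is an assumption (only the fetcher.parse default of false is stated in the issue description).

```java
import java.util.Properties;

/** Hypothetical sketch: detect a configuration that leaves a separate parse step with no input. */
class FetcherConfigSketch {

  /**
   * Returns true when content is neither stored nor parsed during fetching,
   * so a later parse step would find no content to work on.
   */
  static boolean nothingToParse(Properties conf) {
    // Assumed default: content is stored unless configured otherwise.
    boolean storeContent = Boolean.parseBoolean(conf.getProperty("fetcher.store.content", "true"));
    // Default per the issue description: the fetcher does not parse.
    boolean parseWhileFetching = Boolean.parseBoolean(conf.getProperty("fetcher.parse", "false"));
    return !storeContent && !parseWhileFetching;
  }
}
```

Since the thread notes that this combination is legitimate in some work-flows (availability checks, load tests), a caller would more plausibly log a warning on a true result than abort.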
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482452#comment-14482452 ]

Chris A. Mattmann commented on NUTCH-1854:
------------------------------------------

ACK!

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482454#comment-14482454 ]

Chris A. Mattmann commented on NUTCH-1854:
------------------------------------------

and a triple bewm +1

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482418#comment-14482418 ]

Asitang Mishra commented on NUTCH-1854:
---------------------------------------

I agree Lewis.

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482415#comment-14482415 ]

Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

Hi Asitang, this seems more logical to me, with minimal interruption to the existing behavior. We would not need to change any existing logic within the crawl script out of the box.

--
*Lewis*

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482412#comment-14482412 ]

Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

Hi Chris, I saw this and it looks nice. Looks like it will make it into the codebase reasonably soon!
The main point I am making is that a failed parsing fetcher, in my own experience, is a primary factor behind corrupt segments. I therefore make best efforts to avoid this practice.
It looks like Asitang has a better solution, e.g. the most recent patch attachment.
Ack and a +1 on top

--
*Lewis*

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482379#comment-14482379 ]

Chris A. Mattmann commented on NUTCH-1854:
------------------------------------------

Hey Lewis, 2 specific issues to point you to:

1. NUTCH-1771 - and the generic class being implemented that could be reused to deal with checking, in a workflow oriented fashion, if the parse_text exists or not, and if not, perhaps then regenerating it, or going through a parse cycle real quick on any urls that don't have parse_text data.
2. the reason it fails is that it throws an Exception, as Asitang noted, and we can simply get around this exception by catching it, logging the error, and then correcting for it downstream in a crawl (lights out) oriented fashion, using NUTCH-1771 and some logic to then call the ParseJob for any URLs that are missing it before e.g., IndexingJob, etc.

And yes, thanks for the context. I am all for dealing with #1 and #2 above, and people like [~asitang] along with [~chongli] are trying to deal with this too, and we can help shepherd it in.

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
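The downstream correction described in the comment above (find segments without parse_text and parse only those before indexing) can be sketched as a plain directory scan. This is a hypothetical illustration, not the NUTCH-1771 class: it works on a local directory via java.io.File rather than Hadoop's FileSystem, and the class and method names are invented.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: list segments that still need a parse cycle before indexing. */
class MissingParseSketch {

  /** Returns the segment directories under segmentsDir that lack a parse_text directory. */
  static List<File> segmentsNeedingParse(File segmentsDir) {
    List<File> result = new ArrayList<>();
    File[] segments = segmentsDir.listFiles(File::isDirectory);
    if (segments == null) {
      return result; // segmentsDir is missing or unreadable
    }
    Arrays.sort(segments); // stable, name-ordered output
    for (File segment : segments) {
      if (!new File(segment, "parse_text").isDirectory()) {
        result.add(segment);
      }
    }
    return result;
  }
}
```

A workflow could run the parse job for each returned segment and then continue to indexing, instead of failing on the first segment that was already parsed.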
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482371#comment-14482371 ]

Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

In the past, I've experienced a failed fetch task if parsing fails when invoked during fetching. There are various ways to overcome this; as you've said, generate more, smaller fetch lists, so if a parsing fetcher fails then we mitigate against losing large fetch results. You've also noted that simply making a check for the parse directory later on is a workaround of sorts, but it does not prevent interruptions in a typical workflow should a parsing fetcher fail.

This is a Nutch Gotcha which I've been aware of since my early use of Nutch. It's something that's stuck with me and is probably more habit now than anything else, Chris. The crawl script shadows this behavior, hence the reason it fails when attempting to reparse a segment.

The parsing fetcher is disabled by default based on the underlying assumption that Nutch will be invoked as a breadth first crawl. This is also reflected in the settings which ignore internal links but follow external links.

I understand that the goal here is to move towards more of an interactive understanding of Crawldb and Record status, and I am supportive of that.
I hope the above provides some context to Asitang and others.

--
*Lewis*

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482356#comment-14482356 ]

Chris A. Mattmann commented on NUTCH-1854:
------------------------------------------

Lewis, what specific problems can it lead to? I'm interested. Asitang has been trying this out and it's been working on small focused crawls, as long as we have downstream support for e.g., checking for crawl_parse, and then also in the later jobs, making sure that there is (or isn't) a parse_text folder in the segments. How would that not solve the issues?

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481649#comment-14481649 ]

Asitang Mishra commented on NUTCH-1854:
---------------------------------------

What should be the default behavior when we run the crawl script, i.e. ./bin/crawl, with fetcher.parse set to true?

1. It should parse once and put the parsed content into the segment db. Then go ahead and re-parse during the parse cycle.
2. It should parse once and put the parsed content into the segment db. It does not parse during the parse cycle and exits politely.

I have tried a 3rd thing, where I am parsing during the fetch step, but nothing is being written to the DB (it basically solves my problem for developing a runtime UI graph).

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11
[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481651#comment-14481651 ]

Lewis John McGibbney commented on NUTCH-1854:
---------------------------------------------

Parsing should always be set to false. Parsing fetcher can lead to problems.
Lewis

--
*Lewis*

> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.11