[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-05-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547425#comment-14547425
 ] 

Hudson commented on NUTCH-1854:
---

SUCCESS: Integrated in Nutch-trunk #3125 (See 
[https://builds.apache.org/job/Nutch-trunk/3125/])
Re-apply NUTCH-1854 after mistakenly rolled back during NUTCH-1973. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1679911)
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentChecker.java


> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Sebastian Nagel
>  Labels: memex
> Fix For: 1.10
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, 
> NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
>
>
> If you run ./bin/crawl with a parsing fetcher e.g.
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
>   <description>If true, fetcher will parse content. Default is false, which means
>   that a separate parsing step is required after fetching is finished.
>   </description>
> </property>
> we get a horrible message as follows
> Exception in thread "main" java.io.IOException: Segment already parsed!
> We could improve this by making logging more complete and by adding a trigger 
> to the crawl script which would check for crawl_parse for a given segment and 
> then skip parsing if this is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501580#comment-14501580
 ] 

Hudson commented on NUTCH-1854:
---

SUCCESS: Integrated in Nutch-trunk #3070 (See 
[https://builds.apache.org/job/Nutch-trunk/3070/])
NUTCH-1854 bin/crawl fails with a parsing fetcher (snagel: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1674581)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentChecker.java


> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Sebastian Nagel
>  Labels: memex
> Fix For: 1.10
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, 
> NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501463#comment-14501463
 ] 

Chris A. Mattmann commented on NUTCH-1854:
--

awesome work [~asitang] - [~wastl-nagel] all yours to commit. Thanks! Thanks for 
the help [~lewismc]

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: memex
> Fix For: 1.10
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, 
> NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-14 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495687#comment-14495687
 ] 

Asitang Mishra commented on NUTCH-1854:
---

okay done Lewis..

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, 
> NUTCH-1854ver3.patch, NUTCH-1854ver4.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-14 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494778#comment-14494778
 ] 

Lewis John McGibbney commented on NUTCH-1854:
-

[~asitang] can you please use the following template to format your code.
http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml
These patches are grand.

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, 
> NUTCH-1854ver3.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-13 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493655#comment-14493655
 ] 

Sebastian Nagel commented on NUTCH-1854:


+1 Great! Needs formatting. Will commit soon.

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch, 
> NUTCH-1854ver3.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-13 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492286#comment-14492286
 ] 

Sebastian Nagel commented on NUTCH-1854:


Thanks, [~asitang]!
* NUTCH-1771 is committed, can you "rebase" the patch?
* for clarity and brevity I would prefer to implement an explicit check 
"isParsed" in SegmentChecker and call it immediately at the beginning of the 
parse function, e.g.:
{code}
 public void parse(Path segment) throws IOException {
  if (SegmentChecker.isParsed(segment, FileSystem.get(getConf()))) {
   LOG.warn("...");
   return;
  }
{code}
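
A minimal sketch of what such an isParsed check might look like, assuming a segment counts as parsed once it contains a crawl_parse directory (the directory named in the issue description). To stay self-contained it uses java.nio.file on a local path in place of the Hadoop FileSystem from the snippet above; the class name SegmentParsedCheck and the one-argument signature are hypothetical, not the committed SegmentChecker code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SegmentParsedCheck {

  // Hypothetical helper: a segment is treated as parsed if it already
  // contains a crawl_parse subdirectory.
  public static boolean isParsed(Path segment) {
    return Files.isDirectory(segment.resolve("crawl_parse"));
  }

  public static void main(String[] args) throws IOException {
    // Temporary directory standing in for a segment.
    Path segment = Files.createTempDirectory("segment");
    System.out.println("parsed before: " + isParsed(segment));

    // Simulate the output a parsing fetcher would leave behind.
    Files.createDirectory(segment.resolve("crawl_parse"));
    System.out.println("parsed after: " + isParsed(segment));
  }
}
```

Calling the check first in parse lets the method log and return early instead of throwing "Segment already parsed!".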

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-10 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490711#comment-14490711
 ] 

Lewis John McGibbney commented on NUTCH-1854:
-

A nice easy fix which really makes things much better. The existing behavior of 
not having a parsing fetcher out of the box is imho the proper default, and this 
patch works perfectly with that. 

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch, NUTCH-1854ver2.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-09 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487622#comment-14487622
 ] 

Asitang Mishra commented on NUTCH-1854:
---

Sounds logical to add this check to the SegmentChecker (looked at NUTCH-1771), 
but the SegmentChecker is not yet added to NUTCH trunk. 

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-09 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487035#comment-14487035
 ] 

Sebastian Nagel commented on NUTCH-1854:


Definitely: fetcher.store.content=false and fetcher.parse=false are mutually 
exclusive in "normal" crawler work-flows. There are others where this 
combination makes sense, e.g., check availability of URLs, do network load 
tests, etc. I don't think we can make the work-flow (bin/crawl) safe from any 
misconfigurations, at least, when it's about manually editing properties (would 
be different with a config GUI). It's important to provide meaningful messages 
for common errors (cf. NUTCH-1370 if seeds are excluded by URL filters).
+1: skipping already parsed segments does no harm, but
* the message that parsing of a segment is skipped should be a warning - it's 
still a misconfiguration!
* the "finished at ..., elapsed ..." message is misleading if a segment is 
skipped
* possibly add the checking code to SegmentChecker (NUTCH-1771)
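
The first two points could be sketched as follows, assuming a parse method that reports whether it actually ran, so that the timing line is logged only for real work. The names SkipParsedSegment and parse(Path), the local-filesystem check for crawl_parse, and the use of java.util.logging are illustrative assumptions, not Nutch's implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

public class SkipParsedSegment {
  private static final Logger LOG =
      Logger.getLogger(SkipParsedSegment.class.getName());

  // Returns true if parsing ran, false if the segment was skipped.
  public static boolean parse(Path segment) {
    if (Files.isDirectory(segment.resolve("crawl_parse"))) {
      // Warning, not info: a skipped segment still points to a misconfiguration.
      LOG.warning("Segment " + segment
          + " is already parsed, skipping. Is fetcher.parse set to true?");
      return false; // no "finished, elapsed" line for skipped segments
    }
    long start = System.currentTimeMillis();
    // Placeholder: the real parse job would run here.
    long elapsed = System.currentTimeMillis() - start;
    LOG.info("Finished parsing " + segment + ", elapsed: " + elapsed + " ms");
    return true;
  }

  public static void main(String[] args) throws IOException {
    Path segment = Files.createTempDirectory("segment");
    parse(segment); // logs the "finished" line
    Files.createDirectory(segment.resolve("crawl_parse"));
    parse(segment); // logs only the warning
  }
}
```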

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-08 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485299#comment-14485299
 ] 

lufeng commented on NUTCH-1854:
---

If we set "fetcher.store.content=false" and "fetcher.parse=false", then the 
"bin/nutch parse" command throws an exception when it checks whether the input 
content directory exists. I think we only need this parameter because sometimes 
we set "fetcher.parse" to true and don't want to store the content because of a 
slow disk or limited disk space. So I think we can remove the 
"fetcher.store.content" parameter and, if "fetcher.parse=true", simply not 
store the page content.
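
A rough sketch of how this combination could be flagged before a crawl starts, assuming the two settings are read from a plain java.util.Properties object. The class FetcherConfigCheck and its check method are hypothetical, and the fallback of "true" for fetcher.store.content is an assumption; only the "false" default of fetcher.parse is stated in the issue description.

```java
import java.util.Properties;

public class FetcherConfigCheck {

  // Returns a warning for the combination described above, or null if
  // the settings look consistent.
  public static String check(Properties conf) {
    boolean parse =
        Boolean.parseBoolean(conf.getProperty("fetcher.parse", "false"));
    // Assumed default: content is stored unless configured otherwise.
    boolean store =
        Boolean.parseBoolean(conf.getProperty("fetcher.store.content", "true"));
    if (!parse && !store) {
      return "fetcher.parse=false and fetcher.store.content=false: "
          + "a separate parse step will find no content to parse";
    }
    return null;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("fetcher.parse", "false");
    conf.setProperty("fetcher.store.content", "false");
    System.out.println(check(conf));
  }
}
```

Returning a message rather than throwing leaves it to the caller to decide whether the combination is an error or only worth a warning.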

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482452#comment-14482452
 ] 

Chris A. Mattmann commented on NUTCH-1854:
--

ACK!

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482454#comment-14482454
 ] 

Chris A. Mattmann commented on NUTCH-1854:
--

and a triple bewm +1

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482418#comment-14482418
 ] 

Asitang Mishra commented on NUTCH-1854:
---

I agree Lewis.

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482415#comment-14482415
 ] 

Lewis John McGibbney commented on NUTCH-1854:
-

Hi Asitang, this seems more logical to me with minimal interruption to the
existing behavior. We would not need to change any existing logic within
the crawl script out of the box.




-- 
*Lewis*


> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482412#comment-14482412
 ] 

Lewis John McGibbney commented on NUTCH-1854:
-

Hi Chris

I saw this and it looks nice. Looks like it will make it into the codebase
reasonably soon!
The main point I am making is that a failed parsing fetcher, in my own
experience, is a primary factor behind corrupt segments. I therefore make
best efforts to avoid this practice.

It looks like Asitang has a better solution, i.e. the most recent patch
attachment.

Ack and a +1 on top

-- 
*Lewis*


> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-1854ver1.patch
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482379#comment-14482379
 ] 

Chris A. Mattmann commented on NUTCH-1854:
--

Hey Lewis, 2 specific issues to point you to:

1. NUTCH-1771 - and the generic class being implemented that could be reused to 
deal with checking in a workflow oriented fashion if the parse_text exists or 
not, and if not, perhaps then regenerating it, or going through a parse cycle 
real quick on any urls that don't have parse_text data.

2. the reason it fails is that it throws an Exception, as Asitang noted, and we 
can simply get around this exception by catching it, logging the error, and 
then correcting for it downstream in a crawl (lights out) oriented fashion, 
using NUTCH-1771 and some logic to then call the ParseJob for any URLs that are 
missing parse data before e.g. IndexingJob.

And yes thanks for the context. I am all for dealing with #1 and #2 above and 
people like [~asitang] along with [~chongli] are trying to deal with this too 
and we can help shepherd it in.

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482371#comment-14482371
 ] 

Lewis John McGibbney commented on NUTCH-1854:
-

In the past, I've experienced failed fetch tasks if parsing fails when
invoked during fetching.
There are various ways to overcome this, as you've said: generate more,
smaller fetch lists, so if a parsing fetcher fails then we mitigate
against losing large fetch results.

You've also noted that simply making a check for the parse directory later
on is a work around of sorts but it does not prevent interruptions in a
typical workflow should a parsing fetcher fail.

This is a Nutch Gotcha which I've been aware of since my early use of
Nutch. It's something that's stuck with me and is probably more habit now
than anything else Chris. The crawl script shadows this behavior hence the
reason it fails when attempting to reparse a segment. The parsing fetcher
is disabled by default based on the underlying assumption that Nutch will
be invoked as a breadth first crawl. This is also reflected in the settings
which ignore internal links but follow external links.

I understand that the goal here is to move towards more of an interactive
understanding of Crawldb and Record status, and I am supportive of that. I
hope the above provides some context to Asitang and others.




-- 
*Lewis*


> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482356#comment-14482356
 ] 

Chris A. Mattmann commented on NUTCH-1854:
--

Lewis, what specific problems can it lead to? I'm interested. Asitang has been 
trying this out and it's been working on small focused crawls, as long as we 
have downstream support for e.g. checking for crawl_parse, and then also in 
the later jobs, making sure that there is (or isn't) a parse_text folder in the 
segments. How would that not solve the issues?

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481649#comment-14481649
 ] 

Asitang Mishra commented on NUTCH-1854:
---

What should be the default behavior when we run the crawl script, i.e. 
./bin/crawl, with fetcher.parse set to true?
1. It should parse once and put the parsed content into the segment db, then go 
ahead and re-parse during the parse cycle.
2. It should parse once and put the parsed content into the segment db, then 
skip parsing during the parse cycle and exit politely.

I have tried a third thing, where I am parsing during the fetch step but nothing 
is being written to the DB (it basically solves my problem for developing a 
runtime UI graph).

> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
>


[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher

2015-04-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481651#comment-14481651
 ] 

Lewis John McGibbney commented on NUTCH-1854:
-

Parsing should always be set to false. Parsing fetcher can lead to problems.
Lewis




-- 
*Lewis*


> ./bin/crawl fails with a parsing fetcher
> 
>
> Key: NUTCH-1854
> URL: https://issues.apache.org/jira/browse/NUTCH-1854
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
>