[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-07 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483527#comment-14483527 ]

Sebastian Nagel commented on NUTCH-1771:


+1 : will commit soon. Thanks, [~chongli]!

 Solrindex fails if a segment is corrupted or incomplete
 ---

 Key: NUTCH-1771
 URL: https://issues.apache.org/jira/browse/NUTCH-1771
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.8, 1.10
Reporter: Diaa
Priority: Minor
 Fix For: 1.11


 When using solrindex to index multiple segments via -dir segment,
 the indexing fails if one or more segments are corrupted/incomplete 
 (generated but not fetched, for example).
 The failure is simply a java.io exception.
 Deleting the segment fixes the issue.
 The expected behavior should be one of the following:
 * skipping the segment and proceeding with others (while logging)
 * stopping the indexing and logging the failed segment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-03 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394230#comment-14394230 ]

Sebastian Nagel commented on NUTCH-1771:


Again: nice patch.
* SegmentChecker holds the state of a segment in private fields: why not force
the user to pass the segment's Path and FileSystem in the constructor? This would
avoid errors if the object is re-used and the state is not reset (via
setFlags()). We could also provide a reset(path, fs) method. Alternatively,
make the check function static without caching anything.
* To keep SegmentMerger extensible: maybe rename isSegmentValid() to, e.g.,
isIndexable()? We could then add other methods later to check sanity and
status (generated, fetched, parsed).
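
A minimal sketch of the constructor-based variant suggested above, offered as an illustration and not as the code in the patch. The class name SegmentStatus and every method except isIndexable() are hypothetical. java.nio.file stands in for Hadoop's Path and FileSystem so the example runs without a cluster, and the subdirectory names are the ones mentioned elsewhere in this thread.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;

// Hypothetical sketch: the segment location is fixed at construction,
// so there is no cached state that could go stale when the object is reused.
final class SegmentStatus {
  private final Path segment;

  SegmentStatus(Path segment) {
    this.segment = Objects.requireNonNull(segment, "segment");
  }

  // True if the named subdirectory exists inside the segment.
  private boolean has(String subdir) {
    return Files.isDirectory(segment.resolve(subdir));
  }

  boolean isGenerated() {
    return has("crawl_generate");
  }

  boolean isFetched() {
    return has("crawl_fetch");
  }

  boolean isParsed() {
    return has("crawl_parse") && has("parse_data") && has("parse_text");
  }

  // Assumption: "indexable" means fetched and parsed; only the presence
  // of the subdirectories is checked, not their contents.
  boolean isIndexable() {
    return isFetched() && isParsed();
  }

  public static void main(String[] args) {
    for (String arg : args) {
      SegmentStatus status = new SegmentStatus(Paths.get(arg));
      System.out.println(arg
          + " generated=" + status.isGenerated()
          + " fetched=" + status.isFetched()
          + " parsed=" + status.isParsed()
          + " indexable=" + status.isIndexable());
    }
  }
}
```

Because each instance is bound to one segment, checking another segment means creating another object, which is the property the reset(path, fs) alternative would otherwise have to guarantee by hand.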



[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-02 Thread Chong Li (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393949#comment-14393949 ]

Chong Li commented on NUTCH-1771:
-

Hi,
I have submitted a SegmentChecker class following the suggestions from
[~wastl-nagel]: https://github.com/apache/nutch/pull/15
I would really appreciate reviews and suggestions.

Thanks.



[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-01 Thread Chris A. Mattmann (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391905#comment-14391905 ]

Chris A. Mattmann commented on NUTCH-1771:
--

+1 to this patch and to Seb's suggestions. I agree - making code more robust is 
always better.



[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-01 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392138#comment-14392138 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-1771:
---

+1 for this patch and for [~wastl-nagel]'s suggestion. Moving the code to a new
class will make it possible to write a small segment checker tool. If the crawl
process is stopped by a hard reboot, for instance, this tool could help locate
the problematic segment before starting the crawl again.



[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-01 Thread Markus Jelsma (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391334#comment-14391334 ]

Markus Jelsma commented on NUTCH-1771:
--

Hi - I gave this some more thought. You should not even start indexing a
corrupted segment at all. If the fetcher fails for some reason and the segment
is not complete, it must be deleted. Also, indexing must be performed after
updating the DB, and since you cannot update the DB with a corrupted segment,
dealing with it in the indexer makes no sense.

You must delete segments that were corrupted by a fetcher failure (note:
segments are not always corrupt if the fetcher fails due to other
reasons). And you must always delete segments that cannot make it into the DB
when updating.



[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-01 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391647#comment-14391647 ]

Sebastian Nagel commented on NUTCH-1771:


Hi [~chongli], the patch looks clean and extensible, just great. Thanks! What 
about moving the code to a new class in o.a.n.segments? It will be useful (in a 
more generic form) for other tools as well. The log message in case of a 
skipped segment could be a warning.

Instead of deleting invalid segments, it's possible to ignore them. That's the
case if bin/crawl is repeatedly scheduled to run an incremental/continuous
crawl. If some job fails, bin/crawl exits. A potentially incomplete/corrupted
segment is never looked at again, so there's no problem for later runs of
bin/crawl. That's because only CrawlDb (and LinkDb/WebGraph) are used for
persistence in this work-flow; content persists only in Solr/ElasticSearch. It
would even be possible to delete a segment immediately at the end of each
cycle. If segments are kept and used later (reparsed, reindexed, mined for
data, etc.), it's necessary to delete or skip invalid ones. And yes, a tool
which automatically detects invalid segments would definitely be useful!

Making tools more robust by ignoring some segments does no harm. It's the
easier way: making the work-flow detect and delete invalid segments is a bigger
effort. Btw., updatedb and web graph already silently skip segments not
containing required subdirs. LinkDb/invertlinks exits with an exception, same as
IndexingJob. SegmentMerger is special in that it performs only a partial merge,
excluding a subdir from all segments if this subdir is missing in a single
segment.



[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-03-31 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389351#comment-14389351 ]

Sebastian Nagel commented on NUTCH-1771:


From [~chongli] in NUTCH-1978:
{quote}
So my initial idea is to check if the segment folder is valid before putting 
the segment into the hadoop job. If the segment is not valid, we can simply 
just skip that segment. We can check if the segment folder contains exactly 6 
sub directories as there should be. The other approach will be to check all the 
six sub directories and see if they are exactly the six dir that should appear.
{quote}

Ok, this would be possible.
* we should check only the 4 directories required for indexing: crawl_fetch, 
crawl_parse, parse_data, parse_text. The content directory may be missing if 
fetcher.store.content == false.
* if segments are really corrupted we need a more sophisticated check, but
integrity checks should be part of HDFS. Only in local mode may a crashed
generate/fetch/parse leave corrupted segments. However, any check needs to
read the segment to checksum/validate it. That may cost a lot of IO and should
not be done by default.

Unfetched segments (containing only crawl_generate) should be no problem after
NUTCH-1829 (will be in 1.10) if the exit value of the generate job is properly
checked (done by bin/crawl). But agreed: filtering out those segments would
increase usability. This should be done only for the -dir option: calling index
with a single corrupted/incomplete segment should definitely cause an error.
Alternatively, it could be done by an extra SegmentFilter tool, which could then
also check for corrupted segments.
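
The directory check described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the committed fix: the class SegmentDirFilter and its methods are hypothetical names, java.nio.file is used in place of Hadoop's FileSystem so the example runs locally, and only the presence of the four subdirectories is tested, not the integrity of their contents.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: filters a list of segment directories before indexing.
final class SegmentDirFilter {
  // The four subdirectories named above as required for indexing.
  // "content" is deliberately absent: it may be missing when
  // fetcher.store.content is false.
  static final List<String> REQUIRED =
      List.of("crawl_fetch", "crawl_parse", "parse_data", "parse_text");

  // Returns the required subdirectories that are missing from one segment.
  static List<String> missingSubdirs(Path segment) {
    List<String> missing = new ArrayList<>();
    for (String name : REQUIRED) {
      if (!Files.isDirectory(segment.resolve(name))) {
        missing.add(name);
      }
    }
    return missing;
  }

  // Keeps complete segments; logs a warning for each skipped one.
  static List<Path> keepIndexable(List<Path> segments) {
    List<Path> kept = new ArrayList<>();
    for (Path segment : segments) {
      List<String> missing = missingSubdirs(segment);
      if (missing.isEmpty()) {
        kept.add(segment);
      } else {
        System.err.println("WARN skipping segment " + segment + ", missing " + missing);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Path> segments = new ArrayList<>();
    for (String arg : args) {
      segments.add(Paths.get(arg));
    }
    System.out.println(keepIndexable(segments));
  }
}
```

A caller handling a single explicitly named segment would more likely call missingSubdirs() and fail with an error when the result is not empty, in line with the distinction drawn above between -dir and a single segment.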





 Solrindex fails if a segment is corrupted or incomplete
 ---

 Key: NUTCH-1771
 URL: https://issues.apache.org/jira/browse/NUTCH-1771
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.8
Reporter: Diaa
Priority: Minor
 Fix For: 1.11




[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2014-05-13 Thread Markus Jelsma (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996212#comment-13996212 ]

Markus Jelsma commented on NUTCH-1771:
--

I am not sure we can catch that exception as it lives in Hadoop's layer. We can 
only specify a list of directories to load, not much more.

 Solrindex fails if a segment is corrupted or incomplete
 ---

 Key: NUTCH-1771
 URL: https://issues.apache.org/jira/browse/NUTCH-1771
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.8
Reporter: Diaa
Priority: Minor
 Fix For: 1.9

