[jira] Created: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-21 Thread Vishal Shah (JIRA)
Generator exits incorrectly for small fetchlists 
-

 Key: NUTCH-503
 URL: https://issues.apache.org/jira/browse/NUTCH-503
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 0.9.0, 0.8.1, 0.8
 Environment: Fedora Core 2, JDK 1.6
Reporter: Vishal Shah
 Fix For: 0.8.2


   I think I found the reason why the generator returns an empty fetchlist for
small fetch sizes.

   After the first job finishes running, the generator checks the following
condition to decide whether it got an empty list:

if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {

   The third condition is incorrect. In some cases, especially for small
fetchlists, the first partition may be empty while other partition(s) still
contain URLs, yet the Generator concludes that all partitions are empty by
looking only at the first one. The same problem can occur when all URLs in the
fetchlist come from a single host, or from a small number of hosts that all map
to a small number of partitions.
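
   A rough sketch of the kind of check that would work (the helper name is mine,
not necessarily what the attached patch does): probe every partition reader and
treat the fetchlist as empty only if all of them are empty.

// Returns true only if every partition produced by the first job is empty.
// Note: probing with next() consumes one record from each reader it touches.
static boolean isFetchlistEmpty(org.apache.hadoop.io.SequenceFile.Reader[] readers)
    throws java.io.IOException {
  if (readers == null || readers.length == 0) return true;
  for (int i = 0; i < readers.length; i++) {
    if (readers[i].next(new org.apache.hadoop.io.FloatWritable())) {
      return false;  // at least one partition contains an entry
    }
  }
  return true;
}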


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-21 Thread Vishal Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal Shah updated NUTCH-503:
--

Attachment: emptyfetchlist.patch

I've created a patch to fix this issue. Please review and commit it to trunk
if it looks OK.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-21 Thread Vishal Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal Shah updated NUTCH-503:
--

Attachment: emptyfetchlist.patch

Hi,

   The previous patch is missing a header line. I've reattached the patch.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-21 Thread Vishal Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507144
 ] 

Vishal Shah commented on NUTCH-503:
---

Hi Emmanuel,

   Could you please dump the contents of your crawldb with the readdb command
after injecting your URLs? Are those URLs getting injected into the db in the
first place? It could be that your urlfilters are filtering them out, or there
may be some other problem (especially since the third test you did works). It
would be good to know the contents of the crawldb after inject and before
generate in each case.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-29 Thread Vishal Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509056
 ] 

Vishal Shah commented on NUTCH-503:
---

Hi Dogacan,

I don't know how to write a test case to cover this particular bug. Any 
thoughts?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-525) DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment

2007-07-24 Thread Vishal Shah (JIRA)
DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun 
dedup on a segment
-

 Key: NUTCH-525
 URL: https://issues.apache.org/jira/browse/NUTCH-525
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.9.0
 Environment: Fedora OS, JDK 1.6, Hadoop FS
Reporter: Vishal Shah
 Attachments: deleteDups.patch

When trying to rerun dedup on a segment, we get the following Exception:

java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
        at org.apache.lucene.util.BitVector.get(BitVector.java:72)
        at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
        at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)

To reproduce the error, create two segments with identical URLs: fetch, parse,
index, and dedup the two segments, then rerun dedup.

The error comes from the DDRecordReader.next() method:

//skip past deleted documents
while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;

If the last document in the index is deleted, this loop increments doc past the
last document and then calls indexReader.isDeleted(doc) with doc == maxDoc,
which is out of range.

The order of the two conditions should be swapped, so that the bounds check is
evaluated first.
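
A minimal sketch of the corrected skip logic (the helper name is illustrative,
not necessarily how the attached patch is written): checking the bound first
short-circuits the loop before isDeleted() can be called with doc == maxDoc.

// Skip documents already flagged as deleted, without reading past maxDoc.
static int skipDeleted(org.apache.lucene.index.IndexReader indexReader,
                       int doc, int maxDoc) {
  while (doc < maxDoc && indexReader.isDeleted(doc)) {
    doc++;  // advance only while the bounds check still holds
  }
  return doc;
}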

I've attached a patch here.


On a related note, why should we skip past deleted documents at all? The only
time this will happen is when we are rerunning dedup on a segment. If documents
were deleted for no reason other than dedup, shouldn't they be given a chance to
compete again? We could fix this by calling indexReader.undeleteAll() in the
constructor for DDRecordReader. Any thoughts on this?
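
For reference, a hedged sketch of that idea (a standalone helper rather than the
actual DDRecordReader constructor), assuming the Lucene 2.x IndexReader API,
where undeleteAll() restores all deleted documents and the change is committed
when the reader is closed:

// Reopen a segment's Lucene index and clear prior (dedup) deletions so those
// documents can compete again on the next dedup run.
static void undeleteSegmentIndex(String indexPath) throws java.io.IOException {
  org.apache.lucene.index.IndexReader reader =
      org.apache.lucene.index.IndexReader.open(indexPath);
  try {
    reader.undeleteAll();   // restore every document flagged as deleted
  } finally {
    reader.close();         // closing the reader commits the undelete
  }
}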

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-525) DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment

2007-07-24 Thread Vishal Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal Shah updated NUTCH-525:
--

Attachment: deleteDups.patch

Patch for the bug attached here.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-525) DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment

2007-07-24 Thread Vishal Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514910
 ] 

Vishal Shah commented on NUTCH-525:
---

Hi,
 
  I'll add a unit test.

  For the undelete idea, the need could arise in a situation where we are adding
segments incrementally.

  For example, say docs A and B are duplicates and A is selected as the winner.
In the next incremental update, A is refetched, but its status is page_gone (a
404 or something similar). Rerunning dedup should then undelete B, since it is
no longer a duplicate. Or, if there were another duplicate C with a score lower
than B's, then B should emerge as the winner after page A became dead.

Regards,

-vishal.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-525) DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment

2007-07-24 Thread Vishal Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal Shah updated NUTCH-525:
--

Attachment: RededupUnitTest.patch

I have modified the existing JUnit test for DeleteDuplicates to cover this
situation (patch attached).

I ran it first without the bug fix, and the output I got was:
[junit] Running org.apache.nutch.indexer.TestDeleteDuplicates
[junit] Tests run: 4, Failures: 0, Errors: 1, Time elapsed: 18.059 sec

After the bug fix, there are no errors:
[junit] Running org.apache.nutch.indexer.TestDeleteDuplicates
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 20.496 sec

I agree with Dogacan that we should give users the option to undelete.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.