[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-07-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511330
 ] 

Hudson commented on NUTCH-503:
--

Integrated in Nutch-Nightly #145 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/145/])

 Generator exits incorrectly for small fetchlists 
 -

 Key: NUTCH-503
 URL: https://issues.apache.org/jira/browse/NUTCH-503
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: Fedora Core 2, JDK 1.6
Reporter: Vishal Shah
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: emptyfetchlist.patch, emptyfetchlist.patch


I think I found the reason why the generator returns an empty fetchlist
for small fetch sizes.

After the first job finishes running, the generator checks the following
condition to see whether it got an empty list:

  if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {

The third condition is incorrect. In some cases, especially for small
fetchlists, the first partition might be empty while other partitions
contain URLs. The Generator then incorrectly assumes that all partitions
are empty after looking only at the first. The problem can also occur
when all URLs in the fetchlist come from the same host (or from a very
small number of hosts, or from hosts that all map to a small number of
partitions).
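
The difference between checking only the first partition and checking all of them can be sketched as follows. This is only an illustration, not the attached patch: `PartitionReader`, `readerOf`, `isEmptyFirstOnly` and `isEmptyAll` are hypothetical names, and the interface is a simplified stand-in for the Hadoop reader used in the Generator.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class EmptyFetchlistCheck {

    /** Hypothetical stand-in for a per-partition reader (not the Hadoop API). */
    interface PartitionReader {
        /** Advances to the next record; returns false when none is left. */
        boolean next();
    }

    /** The check as reported: only partition 0 is inspected. */
    static boolean isEmptyFirstOnly(PartitionReader[] readers) {
        return readers == null || readers.length == 0 || !readers[0].next();
    }

    /**
     * Sketch of a corrected check: the list is empty only if no partition
     * yields a record. Like the original, it consumes the first record of
     * the reader that answers the question.
     */
    static boolean isEmptyAll(PartitionReader[] readers) {
        if (readers == null) {
            return true;
        }
        for (PartitionReader reader : readers) {
            if (reader.next()) {
                return false;
            }
        }
        return true;
    }

    /** Builds an in-memory reader over a list of URLs (illustration only). */
    static PartitionReader readerOf(List<String> urls) {
        Iterator<String> it = urls.iterator();
        return () -> {
            if (it.hasNext()) {
                it.next();
                return true;
            }
            return false;
        };
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        List<String> one = Arrays.asList("http://lucene.apache.org/");
        // Partition 0 is empty, partition 1 holds one URL.
        System.out.println("first-only check says empty: "
            + isEmptyFirstOnly(new PartitionReader[] {readerOf(none), readerOf(one)}));
        System.out.println("all-partitions check says empty: "
            + isEmptyAll(new PartitionReader[] {readerOf(none), readerOf(one)}));
    }
}
```

With partition 0 empty and partition 1 non-empty, the first-only check reports an empty fetchlist while the all-partitions check does not.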

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-29 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509039
 ] 

Emmanuel Joke commented on NUTCH-503:
-

The results seem good, so I'm wondering whether this patch can be committed?

 Generator exits incorrectly for small fetchlists 
 -

 Key: NUTCH-503
 URL: https://issues.apache.org/jira/browse/NUTCH-503
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: Fedora Core 2, JDK 1.6
Reporter: Vishal Shah
 Fix For: 0.8.2

 Attachments: emptyfetchlist.patch, emptyfetchlist.patch





[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-29 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509059
 ] 

Doğacan Güney commented on NUTCH-503:
-

  I don't know how to write a test case to cover this particular bug. Any 
 thoughts?

Normally, you would update TestGenerator by generating a couple of URLs and then
showing that the first part is empty even though other parts contain URLs (so
Nutch would fail this test case without your patch).

However, this bug only occurs in a distributed setup, and our test cases run
in a single-machine setup by default. Hadoop does have something called
MiniMRCluster which (I think) allows you to run distributed tests, but that class
comes from Hadoop's test jar, which we don't have.

Since your patch is (hopefully :) obviously correct, we can skip writing a unit
test for this one. But we really need some mechanism to run our tests
in a distributed setup.
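
To illustrate why a fetchlist from a single host can leave the first part empty, here is a rough single-process sketch. It assumes URLs are assigned to partitions by hashing the host name modulo the number of partitions. That rule, and all class and method names below, are assumptions made for illustration; this is neither Nutch's partitioner nor a replacement for a real distributed test.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HostPartitionSketch {

    /** Assumed rule: non-negative hash of the host, modulo the partition count. */
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Distributes the URLs into numPartitions in-memory lists. */
    static List<List<String>> partition(List<String> urls, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String url : urls) {
            parts.get(partitionFor(url, numPartitions)).add(url);
        }
        return parts;
    }

    /** Mirrors the reported check: looks at part 0 only. */
    static boolean firstOnlySaysEmpty(List<List<String>> parts) {
        return parts.isEmpty() || parts.get(0).isEmpty();
    }

    /** Looks at every part before declaring the fetchlist empty. */
    static boolean allEmpty(List<List<String>> parts) {
        for (List<String> part : parts) {
            if (!part.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Single-host fetchlists, as in the crawls described in this thread.
        for (String url : Arrays.asList("http://www.boursorama.com/",
                                        "http://lucene.apache.org/")) {
            List<List<String>> parts = partition(Arrays.asList(url), 4);
            System.out.println(url + " -> part " + partitionFor(url, 4)
                + ", first-only says empty: " + firstOnlySaysEmpty(parts)
                + ", all empty: " + allEmpty(parts));
        }
    }
}
```

Under this assumption, a single host always lands in exactly one part, so whenever that part is not part 0 the first-only check reports an empty fetchlist.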




[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread Vishal Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507144
 ] 

Vishal Shah commented on NUTCH-503:
---

Hi Emmanuel,

Can you please dump the contents of your crawldb with the readdb command after
injecting your URLs? Were these URLs injected into the db in the first place?
It could be that your urlfilters are filtering out your URLs, or maybe there is
some other problem (especially since the third test you did works). It would be
good to know the contents of the crawldb after inject and before generate in
each case.





[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507169
 ] 

Doğacan Güney commented on NUTCH-503:
-

Also, how many machines are there in your cluster, and which version of Nutch
are you using?




[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507469
 ] 

Emmanuel Joke commented on NUTCH-503:
-

Sorry, my mistake.

My compiled jar was not correctly included in my classpath. I confirm that it
works with your patch.

Thanks for your help.




[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507535
 ] 

Doğacan Güney commented on NUTCH-503:
-

Nice to hear, Emmanuel.

I believe this is ready to commit, but, Vishal, can you add a test case
for this? (Though I am not sure how we can add one, since this bug only
occurs in distributed setups.)




[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-21 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506922
 ] 

Emmanuel Joke commented on NUTCH-503:
-

I just tried your patch and I'm afraid I still have the same issue.

Actually, I noticed something odd:
- I did a first crawl with only 1 URL (http://www.boursorama.com/), and it didn't
work. I got "Generator: 0 records selected for fetching, exiting".
- I did a second crawl, also with only 1 URL (http://lucene.apache.org/), and it
worked perfectly.
- I did a last test crawling both URLs, and I got results for both sites.

It looks weird.





