[jira] [Created] (NUTCH-1347) fetcher politeness related to map-reduce

2012-05-01 Thread behnam nikbakht (JIRA)
behnam nikbakht created NUTCH-1347:
--

 Summary: fetcher politeness related to map-reduce
 Key: NUTCH-1347
 URL: https://issues.apache.org/jira/browse/NUTCH-1347
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht


When Nutch runs on Hadoop, each fetcher map task, following the map-reduce model, 
works only on its own share of the data: it manages its own fetch queues and knows 
nothing about the queues of the other tasks. The delay between successive requests 
and the maximum number of concurrent requests are therefore enforced per task, on 
that task's queues only. A simple test showed that this is not a good politeness 
mechanism when multiple map tasks are running; for example, two tasks that both hold 
URLs for the same host can each hit that host within the configured delay window.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1347) fetcher politeness related to map-reduce

2012-05-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265728#comment-13265728
 ] 

Julien Nioche commented on NUTCH-1347:
--

Not clear what the issue is. You can group URLs into a map input by host, 
domain or IP and then into each queue based on the same criteria.
BTW why not ask on the mailing list before filing a JIRA? You've opened 
quite a few - which is good - but you don't reply to comments or questions on 
them, which defeats the purpose.
Thanks
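
To make the suggestion concrete: if all URLs of a given host are routed to the same 
fetcher map task, that task's per-host queue can enforce the crawl delay for the whole 
crawl, so per-task politeness is enough. Below is a minimal sketch of such host-based 
partitioning; it is only an illustration of the idea, not Nutch's actual URLPartitioner.

// Minimal sketch: assign every URL of a host to the same fetcher map task.
import java.net.MalformedURLException;
import java.net.URL;

public class HostPartitionSketch {

  /** Return the partition (0 .. numTasks-1) a URL should be assigned to. */
  static int partitionByHost(String url, int numTasks) {
    String host;
    try {
      host = new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      host = url; // fall back to the raw string for unparsable URLs
    }
    return (host.hashCode() & Integer.MAX_VALUE) % numTasks;
  }

  public static void main(String[] args) {
    int tasks = 4;
    // Both URLs share the host, so they land in the same task and that task's
    // per-host queue can enforce the crawl delay on its own.
    System.out.println(partitionByHost("http://example.com/a", tasks));
    System.out.println(partitionByHost("http://example.com/b", tasks));
  }
}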





[jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url

2012-05-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-1343.


Resolution: Invalid

 Crawl sites with hashtags in url
 

 Key: NUTCH-1343
 URL: https://issues.apache.org/jira/browse/NUTCH-1343
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Roberto Gardenier
Priority: Blocker

 Hello,
 I'm currently trying to crawl a site which uses hashtags in the URLs. I don't 
 seem to get any results and I'm hoping I'm just overlooking something.
 Site structure is as follows:
 http://domain.com (landing page)
 http://domain.com/#/page1
 http://domain.com/#/page1/subpage1
 http://domain.com/#/page2
 http://domain.com/#/page2/subpage1
 and so on.
 I've pointed Nutch to http://domain.com as the start URL and placed all kinds 
 of rules in my filter.
 First I thought this would be sufficient:
 +http\://domain\.com\/#
 But then I realised that # is used for comments, so I escaped it:
 +http\://domain\.com\/\#
 Still no results. So I thought I could use the asterisk for it:
 +http\://domain\.com\/*
 Still no luck... So I started trying various regex variations, but without success.
 I noticed the following message in hadoop.log:
 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
 I've researched this setting but I don't know for sure whether it affects my 
 problem in any way. This property is set to false in my configs.
 I don't know if this is even related to the situation above, but maybe it helps.
 Any help is very much appreciated! I've tried Googling the problem but I 
 couldn't find documentation or anyone else with this problem.
 Many thanks in advance. 
 With kind regards,
 Roberto Gardenier
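
A note on why none of these filter rules can help: the part of a URL after '#' (the 
fragment) is never sent to the server, so each of these URLs fetches the same landing 
page. A small Java check illustrates this:

// The fragment ("#/page1") is not part of the HTTP request, so a crawler
// fetching these URLs only ever receives the landing page.
import java.net.URL;

public class FragmentDemo {
  public static void main(String[] args) throws Exception {
    URL u = new URL("http://domain.com/#/page1/subpage1");
    System.out.println(u.getHost()); // domain.com
    System.out.println(u.getPath()); // "/"  -> what is actually requested
    System.out.println(u.getRef());  // "/page1/subpage1" -> client-side only
  }
}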





[jira] [Commented] (NUTCH-1347) fetcher politeness related to map-reduce

2012-05-01 Thread behnam nikbakht (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265750#comment-13265750
 ] 

behnam nikbakht commented on NUTCH-1347:


I do not understand your solution.
When I simply add a logging line to the getFetchItem() method of the 
FetchItemQueue class, I can see impolite requests going to the same host:
try {
  it = queue.remove(0);
  inProgress.add(it);
+ System.out.println(it.url.toString() + " " + System.currentTimeMillis());

We could multiply minCrawlDelay (or crawlDelay) and maxThreads by the number of 
map tasks, but there is no coordination between the tasks, and each task does 
not receive an equal number of URLs from each host.
I also found a bug in the selector reduce task of the generate phase that 
results from this lack of coordination between tasks.
For these problems I use a Redis server, a fast data store for maintaining 
(key, value) pairs. Redis keeps variables such as delay, maxThreads, ... for 
each host, and they can be set dynamically according to the rate of successes 
and blocks for each host.
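
A minimal sketch of what such Redis-based coordination between fetcher tasks could 
look like, using the Jedis client; the key layout and delay handling here are 
illustrative assumptions, not the reporter's actual code:

// Sketch: all fetcher tasks share one Redis instance, so only one task can
// hold the per-host fetch slot at a time, regardless of how many tasks run.
import redis.clients.jedis.Jedis;

public class RedisPolitenessGate {
  private final Jedis jedis;
  private final int crawlDelaySeconds;

  public RedisPolitenessGate(String redisHost, int crawlDelaySeconds) {
    this.jedis = new Jedis(redisHost);
    this.crawlDelaySeconds = crawlDelaySeconds;
  }

  /** Try to acquire the fetch slot for a host; return false if another task fetched recently. */
  public boolean tryAcquire(String host) {
    String key = "fetch:lock:" + host;      // hypothetical key layout
    if (jedis.setnx(key, "1") == 1L) {      // first task to ask wins
      jedis.expire(key, crawlDelaySeconds); // slot frees itself after the crawl delay
      return true;
    }
    return false;
  }
}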





[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url

2012-05-01 Thread Roberto Gardenier (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265751#comment-13265751
 ] 

Roberto Gardenier commented on NUTCH-1343:
--

Markus Jelsma,

I got notified that you have closed my JIRA ticket, changing its resolution 
status to Invalid.
I wonder why you closed my ticket and marked it as invalid, as I did not 
commit any changes or solutions.

With kind regards,
Roberto Gardenier 












[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url

2012-05-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265758#comment-13265758
 ] 

Markus Jelsma commented on NUTCH-1343:
--

Questions should be asked on the mailing list, as you just did. Concrete bugs 
and changes can be filed in Jira. Please check the mailing list for replies to 
your inquiry.





[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url

2012-05-01 Thread Roberto Gardenier (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265759#comment-13265759
 ] 

Roberto Gardenier commented on NUTCH-1343:
--

Thank you for your response. I will check the mailing list for any replies. 
Thank you very much.





[jira] [Updated] (NUTCH-1346) Follow outlinks to ignore external

2012-05-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1346:
-

Attachment: NUTCH-1346-1.6-1.patch

Patch for 1.6!

 Follow outlinks to ignore external
 --

 Key: NUTCH-1346
 URL: https://issues.apache.org/jira/browse/NUTCH-1346
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1346-1.6-1.patch


 The follow outlinks feature already respects the db.ignore.external.links 
 setting. However, this means that external outlinks of fetched pages are not 
 saved in the parse data. There should be a new setting that prevents the 
 outlink follower from going external while still storing external outlinks.
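
A rough sketch of the idea behind the patch; the parameter name and helper are 
hypothetical and not necessarily what NUTCH-1346-1.6-1.patch implements. External 
outlinks are still collected into the parse data, but only internal ones are followed.

// Sketch: keep every outlink for storage, follow only the internal ones.
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class OutlinkFollowSketch {
  static List<String> selectOutlinksToFollow(String fromUrl, List<String> outlinks,
                                             boolean ignoreExternalWhenFollowing) throws Exception {
    String fromHost = new URL(fromUrl).getHost();
    List<String> toFollow = new ArrayList<String>();
    for (String out : outlinks) {
      boolean external = !new URL(out).getHost().equalsIgnoreCase(fromHost);
      if (external && ignoreExternalWhenFollowing) {
        continue; // still present in the stored outlink list, just not followed
      }
      toFollow.add(out);
    }
    return toFollow;
  }
}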





[jira] [Updated] (NUTCH-1348) Solrindexer fails with a java.io.IOException error.

2012-05-01 Thread Christian Johnsson (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Johnsson updated NUTCH-1348:
--

Priority: Major  (was: Minor)

Tried a couple more times and it seems to only happen after 20-25,000 
documents. Below that it seems OK.

 Solrindexer fails with a java.io.IOException error.
 ---

 Key: NUTCH-1348
 URL: https://issues.apache.org/jira/browse/NUTCH-1348
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.5
 Environment: Debian Stable AMD64
Reporter: Christian Johnsson

 I'm unable to reproduce this error reliably, but it happens from time to time 
 when I run the solrindexer.
 I use the same commands as I did with 1.4 and roughly the same configuration, 
 and I haven't changed any Solr settings. 
 I have the same plugins active, just to be able to compare.
 From time to time the solrindexer throws an error. It happens about 1-2 times 
 out of 5, and there is no information about it in the Solr log.
 Not sure if it's a bug, but I thought I might as well report it since I've been 
 running 1.4 since it was released and never came across this error in that 
 version.
 2012-05-01 20:44:14,861 INFO  httpclient.HttpMethodDirector - I/O exception 
 (java.net.SocketException) caught when processing request: Connection reset
 2012-05-01 20:44:14,861 INFO  httpclient.HttpMethodDirector - Retrying request
 2012-05-01 20:44:15,808 INFO  solr.SolrWriter - Indexing 250 documents
 2012-05-01 20:44:36,153 WARN  mapred.LocalJobRunner - job_local_0001
 java.io.IOException
   at 
 org.apache.nutch.indexer.solr.SolrWriter.makeIOException(SolrWriter.java:152)
   at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:126)
   at 
 org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:55)
   at 
 org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
   at 
 org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:195)
   at 
 org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:51)
   at 
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
 Caused by: org.apache.solr.client.solrj.SolrServerException: 
 org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing 
 request can not be repeated.
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
   at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
   at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:124)
   ... 8 more
 Caused by: org.apache.commons.httpclient.ProtocolException: Unbuffered entity 
 enclosing request can not be repeated.
   at 
 org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:487)
   at 
 org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2114)
   at 
 org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
   at 
 org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
   at 
 org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
   at 
 org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
   at 
 org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)
   ... 11 more
 2012-05-01 20:44:37,074 ERROR solr.SolrIndexer - java.io.IOException: Job 
 failed!
 It's running on a single machine, no Hadoop.
 It's indexing around 50,000-80,000 smaller documents. Worked flawlessly in 1.4.
 That's about it :-)





[jira] [Commented] (NUTCH-1348) Solrindexer fails with a java.io.IOException error.

2012-05-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266135#comment-13266135
 ] 

Markus Jelsma commented on NUTCH-1348:
--

This is a socket error but the request is retried so the job shouldn't fail 
completely. Or does it?
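
One possible reading of the trace: the Solr update is sent as an unbuffered (streamed) 
entity-enclosing request, and commons-httpclient refuses to replay such a body after 
the connection reset, so the retry cannot actually resend the documents and the write 
fails. A small sketch with commons-httpclient 3.x showing the difference between a 
repeatable (buffered) and a non-repeatable (streamed) request entity; the URL is just 
a placeholder:

import java.io.ByteArrayInputStream;
import org.apache.commons.httpclient.methods.ByteArrayRequestEntity;
import org.apache.commons.httpclient.methods.InputStreamRequestEntity;
import org.apache.commons.httpclient.methods.PostMethod;

public class RepeatableEntityDemo {
  public static void main(String[] args) {
    byte[] update = "<add>...</add>".getBytes();

    // Buffered body: the client can resend it after a dropped connection.
    PostMethod buffered = new PostMethod("http://localhost:8983/solr/update");
    buffered.setRequestEntity(new ByteArrayRequestEntity(update, "text/xml"));
    System.out.println(buffered.getRequestEntity().isRepeatable()); // true

    // Streamed body with an explicit length: cannot be replayed, so a retry
    // fails with "Unbuffered entity enclosing request can not be repeated."
    PostMethod streamed = new PostMethod("http://localhost:8983/solr/update");
    streamed.setRequestEntity(new InputStreamRequestEntity(
        new ByteArrayInputStream(update), update.length, "text/xml"));
    System.out.println(streamed.getRequestEntity().isRepeatable()); // false
  }
}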





[jira] [Updated] (NUTCH-1339) Default URL normalization rules to remove page anchors completely

2012-05-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1339:
-

Fix Version/s: 1.6

 Default URL normalization rules to remove page anchors completely
 -

 Key: NUTCH-1339
 URL: https://issues.apache.org/jira/browse/NUTCH-1339
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora, 1.6
Reporter: Sebastian Nagel
 Fix For: 1.6

 Attachments: NUTCH-1339-2.patch, NUTCH-1339.patch


 The default rules of URLNormalizerRegex remove the anchor only up to the first
 occurrence of ? or &. The remaining part of the anchor is kept,
 which may cause a large, possibly infinite number of outlinks when the same 
 document is fetched again and again under different URLs,
 see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html
 Parameters in inner-page anchors are a common practice on AJAX web sites.
 Currently, crawling AJAX content is not supported (NUTCH-1323).
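
A minimal illustration of the proposed behaviour, removing the anchor completely 
rather than partially; this only sketches the effect of the rule and is not 
necessarily the exact regex in the attached patches:

// Drop everything from the first '#' onward, so anchors carrying parameters
// cannot survive partially and multiply outlinks.
public class AnchorStripDemo {
  static String stripAnchor(String url) {
    return url.replaceAll("#.*", "");
  }

  public static void main(String[] args) {
    System.out.println(stripAnchor("http://example.com/page#!section?x=1&y=2"));
    // -> http://example.com/page
  }
}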





[jira] [Commented] (NUTCH-1348) Solrindexer fails with a java.io.IOException error.

2012-05-01 Thread Christian Johnsson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266186#comment-13266186
 ] 

Christian Johnsson commented on NUTCH-1348:
---

The indexing job exits with:
(Last 6 lines)
-
Deleting 0 documents
Indexing 250 documents
Deleting 0 documents
Indexing 250 documents
java.io.IOException: Job failed!
-

I have also tried 500 and 1000 for the commit size and it's the same there. 
This one I left at 250, which is the default for 1.5.

I've tried to run a re-index with 1.4 on the last 10 segments, about 600,000 
documents, and it worked flawlessly.
I could try running an index on the same segment a couple of times with 1.5 to 
see whether there is any logic to it or it is totally random.
I can also try to keep it under 25,000 documents to see if it works better then.

This error has never occurred with 1.4; I'm clueless :-)



Build failed in Jenkins: Nutch-nutchgora #242

2012-05-01 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/242/

--
Started by timer
Building remotely on solaris1 in workspace 
https://builds.apache.org/job/Nutch-nutchgora/ws/
hudson.util.IOException2: remote file operation failed: 
https://builds.apache.org/job/Nutch-nutchgora/ws/ at 
hudson.remoting.Channel@3a34ca3d:solaris1
at hudson.FilePath.act(FilePath.java:828)
at hudson.FilePath.act(FilePath.java:814)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1218)
at 
hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:581)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:470)
at hudson.model.Run.run(Run.java:1434)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:239)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:655)
at hudson.FilePath.act(FilePath.java:821)
... 10 more
Caused by: java.lang.LinkageError: duplicate class definition: 
hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at 
hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
at 
hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.init(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:287)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
at java.lang.Thread.run(Thread.java:595)
Retrying after 10 seconds

Build failed in Jenkins: Nutch-trunk #1830

2012-05-01 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/1830/

--
Started by timer
Building remotely on solaris1 in workspace 
https://builds.apache.org/job/Nutch-trunk/ws/
hudson.util.IOException2: remote file operation failed: 
https://builds.apache.org/job/Nutch-trunk/ws/ at 
hudson.remoting.Channel@3a34ca3d:solaris1
at hudson.FilePath.act(FilePath.java:828)
at hudson.FilePath.act(FilePath.java:814)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1218)
at 
hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:581)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:470)
at hudson.model.Run.run(Run.java:1434)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:239)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:655)
at hudson.FilePath.act(FilePath.java:821)
... 10 more
Caused by: java.lang.LinkageError: duplicate class definition: 
hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at 
hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
at 
hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.init(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:287)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
at java.lang.Thread.run(Thread.java:595)
Retrying after 10 seconds