[jira] [Created] (NUTCH-1347) fetcher politeness related to map-reduce
behnam nikbakht created NUTCH-1347:
-----------------------------------

             Summary: fetcher politeness related to map-reduce
                 Key: NUTCH-1347
                 URL: https://issues.apache.org/jira/browse/NUTCH-1347
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.4
            Reporter: behnam nikbakht
              Labels: fetch

When Nutch runs on Hadoop, each map task, following the MapReduce model, works only on its own data. Each fetcher map task therefore manages its own queues and knows nothing about the queues of the other tasks, and it enforces the delay between successive requests and the maximum-concurrent-requests policy only on its own queues. A simple test showed that this is not a good politeness mechanism when multiple map tasks are running.
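To make the reported behaviour concrete, here is a minimal sketch of the per-task check the description refers to (class and field names are illustrative, not Nutch's actual FetchItemQueue code):

    // Illustrative sketch only - not Nutch's FetchItemQueue. Each map task
    // keeps its own per-host queue and consults only its own last-fetch time.
    class HostQueue {
      long crawlDelayMs;        // politeness delay configured for this host
      long endTimeOfLastFetch;  // known only to this map task

      boolean mayFetchNow() {
        // Two map tasks that both hold a queue for the same host each pass
        // this check independently, so the host sees twice the intended rate.
        return System.currentTimeMillis() - endTimeOfLastFetch >= crawlDelayMs;
      }
    }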
[jira] [Commented] (NUTCH-1347) fetcher politeness related to map-reduce
[ https://issues.apache.org/jira/browse/NUTCH-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265728#comment-13265728 ]

Julien Nioche commented on NUTCH-1347:
--------------------------------------

It is not clear what the issue is. You can group URLs into a map input by host, domain or IP, and then into each queue based on the same criteria. BTW, why not ask on the mailing list before filing a JIRA? You've opened quite a few - which is good - but you don't reply to comments or questions on them, which defeats the object. Thanks
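For illustration, a sketch of the grouping Julien describes, written as a plain Hadoop partitioner (hypothetical class, not Nutch's actual URLPartitioner): when all URLs of a host are routed to one task, that task's queue is the only source of requests to the host, so per-queue politeness holds globally again.

    import java.net.MalformedURLException;
    import java.net.URL;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Sketch: route every URL of the same host to the same partition.
    public class HostPartitioner implements Partitioner<Text, Writable> {

      public void configure(JobConf job) {}

      public int getPartition(Text key, Writable value, int numReduceTasks) {
        String host;
        try {
          // Group by host; grouping by domain or IP follows the same pattern.
          host = new URL(key.toString()).getHost().toLowerCase();
        } catch (MalformedURLException e) {
          host = key.toString(); // fall back to the raw key
        }
        return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }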
[jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url
[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1343.
--------------------------------
    Resolution: Invalid

Crawl sites with hashtags in url
--------------------------------

                 Key: NUTCH-1343
                 URL: https://issues.apache.org/jira/browse/NUTCH-1343
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.4
            Reporter: Roberto Gardenier
            Priority: Blocker

Hello, I'm currently trying to crawl a site which uses hashtags in its URLs. I don't seem to get any results, and I'm hoping I'm just overlooking something. The site structure is as follows:

http://domain.com (landing page)
http://domain.com/#/page1
http://domain.com/#/page1/subpage1
http://domain.com/#/page2
http://domain.com/#/page2/subpage1

and so on. I've pointed Nutch to http://domain.com as the start URL, and in my filter I've placed all kinds of rules. First I thought this would be sufficient:

+http\://domain\.com\/#

But then I realised that # is used for comments, so I escaped it:

+http\://domain\.com\/\#

Still no results. So I thought I could use the asterisk for it:

+http\://domain\.com\/*

Still no luck. So I started trying various regexes, but without success. I noticed the following message in hadoop.log:

INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off

I've researched this setting, but I don't know for sure whether it affects my problem in any way; the property is set to false in my configs. I don't know if this is even related to the situation above, but maybe it helps. Any help is very much appreciated! I've tried googling the problem but couldn't find documentation or anyone else with this problem. Many thanks in advance. With kind regards, Roberto Gardenier
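For context, the filter rules may not be the problem at all: the part of a URL after # is a fragment that is never sent to the server, so every such URL fetches the same page. A small sketch, assuming standard java.net.URL parsing:

    import java.net.URL;

    public class FragmentDemo {
      public static void main(String[] args) throws Exception {
        URL u = new URL("http://domain.com/#/page1/subpage1");
        System.out.println(u.getPath()); // "/" - the path actually requested
        System.out.println(u.getRef());  // "/page1/subpage1" - client-side only
      }
    }

The fragment exists only in the browser, where JavaScript turns it into page content, which is why a crawler finds nothing behind such URLs.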
[jira] [Commented] (NUTCH-1347) fetcher politeness related to map-reduce
[ https://issues.apache.org/jira/browse/NUTCH-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265750#comment-13265750 ]

behnam nikbakht commented on NUTCH-1347:
----------------------------------------

I cannot follow your solution. When I simply add a line to the getFetchItem() method of the FetchItemQueue class, I can see impolite requests to the same host:

try {
  it = queue.remove(0);
  inProgress.add(it);
+ System.out.println(it.url.toString() + " " + System.currentTimeMillis());

We could multiply minCrawlDelay (or crawlDelay) and maxThreads by the number of map tasks, but there is no coordination between the tasks, and the tasks do not receive an equal number of URLs per host. I also found a bug in the Selector reduce task of the generate phase that results from this lack of coordination between tasks. For these problems I use a Redis server, a fast data store for maintaining (key, value) pairs. Redis maintains variables such as delay and maxThreads for each host and can set them dynamically according to the rate of success and blocking for each host.
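A hedged sketch of the coordination idea described above (hypothetical helper, not the reporter's actual code; assumes the Jedis client for Redis): every fetcher task checks a shared last-fetch timestamp per host before requesting, so the crawl delay is enforced cluster-wide rather than per map task.

    import redis.clients.jedis.Jedis;

    public class SharedPoliteness {
      private final Jedis jedis = new Jedis("localhost", 6379);

      /** Returns true if this task may fetch from the host now. */
      public boolean tryAcquire(String host, long crawlDelayMs) {
        long now = System.currentTimeMillis();
        String key = "lastFetch:" + host;
        String last = jedis.get(key);
        if (last != null && now - Long.parseLong(last) < crawlDelayMs) {
          return false; // some task fetched this host too recently
        }
        // Note: get-then-set is racy; a real version would need an atomic
        // check-and-set (e.g. a Lua script) to be safe across tasks.
        jedis.set(key, Long.toString(now));
        return true;
      }
    }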
[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url
[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265751#comment-13265751 ]

Roberto Gardenier commented on NUTCH-1343:
------------------------------------------

Markus Jelsma, I got notified that you have closed my JIRA ticket, changing its resolution status to Invalid. I wonder why you closed my ticket and marked it invalid, as I did not commit any changes or solutions? With kind regards, Roberto Gardenier

-----Original Message-----
From: Markus Jelsma (JIRA) [mailto:j...@apache.org]
Sent: Tuesday, 1 May 2012 13:40
To: r.garden...@simgroep.nl
Subject: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url

[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1343.
--------------------------------
    Resolution: Invalid
[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url
[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265758#comment-13265758 ]

Markus Jelsma commented on NUTCH-1343:
--------------------------------------

Questions should be asked on the mailing list, as you just did. Concrete bugs and changes can be filed in Jira. Please check the mailing list for replies to your inquiry.
[jira] [Commented] (NUTCH-1343) Crawl sites with hashtags in url
[ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265759#comment-13265759 ]

Roberto Gardenier commented on NUTCH-1343:
------------------------------------------

Thank you for your response. I will check the mailing list for any possible reactions. Thank you very much.
[jira] [Updated] (NUTCH-1346) Follow outlinks to ignore external
[ https://issues.apache.org/jira/browse/NUTCH-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1346:
---------------------------------
    Attachment: NUTCH-1346-1.6-1.patch

Patch for 1.6!

Follow outlinks to ignore external
----------------------------------

                 Key: NUTCH-1346
                 URL: https://issues.apache.org/jira/browse/NUTCH-1346
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.5
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.6
         Attachments: NUTCH-1346-1.6-1.patch

The follow outlinks feature already respects the db.ignore.external.links setting. However, this means that outlinks of fetched pages that are external are not saved in parse data. There should be a new setting to prevent the outlink follower from going external but still storing external outlinks.
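A minimal sketch of the requested behaviour (helper and variable names are hypothetical, not taken from the attached patch): every outlink is stored, but only same-host links are queued for following when external links are ignored.

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class OutlinkSplitter {
      public static void main(String[] args) throws Exception {
        String fromUrl = "http://example.org/page";
        List<String> outlinks = Arrays.asList(
            "http://example.org/other", "http://elsewhere.com/");
        boolean ignoreExternal = true; // cf. db.ignore.external.links

        List<String> stored = new ArrayList<String>(outlinks); // always kept in parse data
        List<String> toFollow = new ArrayList<String>();
        String fromHost = new URL(fromUrl).getHost();
        for (String link : outlinks) {
          if (!ignoreExternal || new URL(link).getHost().equals(fromHost)) {
            toFollow.add(link); // only internal links are followed
          }
        }
        System.out.println("stored: " + stored + "; followed: " + toFollow);
      }
    }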
[jira] [Updated] (NUTCH-1348) Solrindexer fails with a java.io.IOException error.
[ https://issues.apache.org/jira/browse/NUTCH-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Johnsson updated NUTCH-1348:
--------------------------------------
    Priority: Major (was: Minor)

Tried a couple more times and it seems to happen only after 20-25 000 documents. Below that it seems OK.

Solrindexer fails with a java.io.IOException error.
---------------------------------------------------

                 Key: NUTCH-1348
                 URL: https://issues.apache.org/jira/browse/NUTCH-1348
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.5
         Environment: Debian Stable AMD64
            Reporter: Christian Johnsson

I'm unable to reproduce this error reliably, but it happens from time to time when I run the solrindexer. I use the same commands as I did with 1.4 and about the same configuration, and I haven't changed any Solr settings. I have the same plugins active, just to be able to compare. From time to time the solrindexer throws an error. It happens maybe 1-2 times out of 5, and there is no information about it in the Solr log. I'm not sure it's a bug, but I thought I might as well report it, since I've been running 1.4 since it was released and never came across this error in that version.

2012-05-01 20:44:14,861 INFO httpclient.HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Connection reset
2012-05-01 20:44:14,861 INFO httpclient.HttpMethodDirector - Retrying request
2012-05-01 20:44:15,808 INFO solr.SolrWriter - Indexing 250 documents
2012-05-01 20:44:36,153 WARN mapred.LocalJobRunner - job_local_0001
java.io.IOException
    at org.apache.nutch.indexer.solr.SolrWriter.makeIOException(SolrWriter.java:152)
    at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:126)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:55)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:195)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:51)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:124)
    ... 8 more
Caused by: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
    at org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:487)
    at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2114)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)
    ... 11 more
2012-05-01 20:44:37,074 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

It's running on a single machine, no Hadoop. It's indexing around 50-80 000 smaller documents. Worked flawlessly in 1.4. That's about it :-)
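The root cause visible in the trace: after the "Connection reset", commons-httpclient tries to resend the POST, but a streamed (unbuffered) request body cannot be replayed, hence the ProtocolException. A sketch of the usual remedy, assuming the commons-httpclient 3.x API shown in the trace (URL and body are placeholders):

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.ByteArrayRequestEntity;
    import org.apache.commons.httpclient.methods.PostMethod;

    public class RepeatableUpdate {
      public static void main(String[] args) throws Exception {
        byte[] body = "<add>...</add>".getBytes("UTF-8");
        PostMethod post = new PostMethod("http://localhost:8983/solr/update");
        // A ByteArrayRequestEntity is repeatable, so the client's retry
        // handler can resend the body after a dropped connection.
        post.setRequestEntity(new ByteArrayRequestEntity(body, "text/xml"));
        new HttpClient().executeMethod(post);
        post.releaseConnection();
      }
    }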
[jira] [Commented] (NUTCH-1348) Solrindexer fails with a java.io.IOException error.
[ https://issues.apache.org/jira/browse/NUTCH-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266135#comment-13266135 ]

Markus Jelsma commented on NUTCH-1348:
--------------------------------------

This is a socket error, but the request is retried, so the job shouldn't fail completely. Or does it?
[jira] [Updated] (NUTCH-1339) Default URL normalization rules to remove page anchors completely
[ https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1339:
---------------------------------
    Fix Version/s: 1.6

Default URL normalization rules to remove page anchors completely
------------------------------------------------------------------

                 Key: NUTCH-1339
                 URL: https://issues.apache.org/jira/browse/NUTCH-1339
             Project: Nutch
          Issue Type: Bug
    Affects Versions: nutchgora, 1.6
            Reporter: Sebastian Nagel
             Fix For: 1.6
         Attachments: NUTCH-1339-2.patch, NUTCH-1339.patch

The default rules of URLNormalizerRegex remove the anchor only up to the first occurrence of ? or &. The remaining part of the anchor is kept, which may cause a large, possibly infinite number of outlinks when the same document is fetched again and again under different URLs; see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html. Parameters in inner-page anchors are a common practice on AJAX web sites. Currently, crawling AJAX content is not supported (NUTCH-1323).
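To see the difference concretely, a small sketch with two regex rules (the patterns are assumed to approximate the regex-normalize.xml defaults, not copied from them):

    public class AnchorNormalize {
      public static void main(String[] args) {
        String url = "http://example.org/page#anchor?param=1&x=2";
        // Lenient default: strips the anchor only up to the first ? or &,
        // so the fragment's parameters survive as a fake query string.
        System.out.println(url.replaceAll("#.*?(\\?|&|$)", "$1"));
        // -> http://example.org/page?param=1&x=2 (varies with every anchor)
        // Rule proposed by this issue: drop the whole fragment.
        System.out.println(url.replaceAll("#.*", ""));
        // -> http://example.org/page
      }
    }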
[jira] [Commented] (NUTCH-1348) Solrindexer fails with a java.io.IOException error.
[ https://issues.apache.org/jira/browse/NUTCH-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266186#comment-13266186 ]

Christian Johnsson commented on NUTCH-1348:
-------------------------------------------

The indexing job exits with (last 6 lines):

Deleting 0 documents
Indexing 250 documents
Deleting 0 documents
Indexing 250 documents
java.io.IOException: Job failed!

I have also tried 500 and 1000 for commits and it's the same there. This one I left at 250, which is the default for 1.5. I've tried to run a re-index with 1.4 on the last 10 segments, about 600 000 documents, and it worked flawlessly. I could try to run an index on the same segment a couple of times with 1.5 to see whether there is any logic to it or whether it is totally random. I can also try to keep it under 25 000 documents to see if it works better then. This error has never occurred with 1.4; I'm clueless :-)
Build failed in Jenkins: Nutch-nutchgora #242
See https://builds.apache.org/job/Nutch-nutchgora/242/

------------------------------------------
Started by timer
Building remotely on solaris1 in workspace https://builds.apache.org/job/Nutch-nutchgora/ws/
hudson.util.IOException2: remote file operation failed: https://builds.apache.org/job/Nutch-nutchgora/ws/ at hudson.remoting.Channel@3a34ca3d:solaris1
    at hudson.FilePath.act(FilePath.java:828)
    at hudson.FilePath.act(FilePath.java:814)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1218)
    at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:581)
    at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:470)
    at hudson.model.Run.run(Run.java:1434)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:239)
Caused by: java.io.IOException: Remote call on solaris1 failed
    at hudson.remoting.Channel.call(Channel.java:655)
    at hudson.FilePath.act(FilePath.java:821)
    ... 10 more
Caused by: java.lang.LinkageError: duplicate class definition: hudson/model/Descriptor
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
    at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
    at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
    at java.lang.Class.getDeclaredFields0(Native Method)
    at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
    at java.lang.Class.getDeclaredField(Class.java:1852)
    at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
    at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
    at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:400)
    at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
    at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
    at hudson.remoting.UserRequest.perform(UserRequest.java:98)
    at hudson.remoting.UserRequest.perform(UserRequest.java:48)
    at hudson.remoting.Request$2.run(Request.java:287)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
    at java.util.concurrent.FutureTask.run(FutureTask.java:123)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
    at java.lang.Thread.run(Thread.java:595)
Retrying after 10 seconds
Build failed in Jenkins: Nutch-trunk #1830
See https://builds.apache.org/job/Nutch-trunk/1830/

------------------------------------------
Started by timer
Building remotely on solaris1 in workspace https://builds.apache.org/job/Nutch-trunk/ws/
hudson.util.IOException2: remote file operation failed: https://builds.apache.org/job/Nutch-trunk/ws/ at hudson.remoting.Channel@3a34ca3d:solaris1
    at hudson.FilePath.act(FilePath.java:828)
    at hudson.FilePath.act(FilePath.java:814)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1218)
    at hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:581)
    at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:470)
    at hudson.model.Run.run(Run.java:1434)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:239)
Caused by: java.io.IOException: Remote call on solaris1 failed
    at hudson.remoting.Channel.call(Channel.java:655)
    at hudson.FilePath.act(FilePath.java:821)
    ... 10 more
Caused by: java.lang.LinkageError: duplicate class definition: hudson/model/Descriptor
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
    at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
    at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
    at java.lang.Class.getDeclaredFields0(Native Method)
    at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
    at java.lang.Class.getDeclaredField(Class.java:1852)
    at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
    at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
    at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:400)
    at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
    at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
    at hudson.remoting.UserRequest.perform(UserRequest.java:98)
    at hudson.remoting.UserRequest.perform(UserRequest.java:48)
    at hudson.remoting.Request$2.run(Request.java:287)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
    at java.util.concurrent.FutureTask.run(FutureTask.java:123)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
    at java.lang.Thread.run(Thread.java:595)
Retrying after 10 seconds