[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187716#comment-13187716 ]

Sebastian Nagel commented on NUTCH-1247:

Interestingly, I also found a couple of URLs with an unreasonably high retry counter in the data where NUTCH-1245 was first observed (it was Nutch 1.2).

* All of these URLs failed with some exception (invalid URI or HTTP 403) and not with 404 (not found) or robots denied. Markus, do the URLs which overflow the retry counter in your db also belong to this class?
* In the segments the status of these URLs is fetch_retry (in crawl_fetch): in Fetcher.java the case ProtocolStatus.EXCEPTION inside the switch statement in FetcherThread.run() falls through to the default branch, where the result is collected with STATUS_FETCH_RETRY. CrawlDbReducer calls FetchSchedule.forceRefetch() only for the cases STATUS_FETCH_NOT_MODIFIED or STATUS_FETCH_GONE (here via setPageGoneSchedule). The branch STATUS_FETCH_RETRY does not reset the retry counter. Generator never calls forceRefetch(), nor does it reset the retry counter.

If this analysis is correct, there are two possible patches:
* A (CrawlDbReducer): call setPageGoneSchedule for the case STATUS_FETCH_RETRY
* B (Generator): reset the retry counter to zero when a db_gone URL is generated again

CrawlDatum.retries should be int

Key: NUTCH-1247
URL: https://issues.apache.org/jira/browse/NUTCH-1247
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Fix For: 1.5

CrawlDatum.retries is a byte and goes bad with larger values.

12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
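The negative retry counts in the CrawlDb stats above are plain Java byte wrap-around; a minimal sketch (not Nutch code, names are illustrative):

```java
// CrawlDatum stores retries in a byte, so incrementing past
// Byte.MAX_VALUE (127) wraps to -128, -127, ... which is exactly
// what the "retry -128: 1" lines from CrawlDbReader show.
public class RetryOverflow {
    static byte incrementRetries(byte retries) {
        return (byte) (retries + 1);
    }

    public static void main(String[] args) {
        byte retries = Byte.MAX_VALUE;       // 127
        retries = incrementRetries(retries); // wraps to -128
        System.out.println("retry " + retries);
    }
}
```

Widening the field to an int (or resetting the counter as in patch A/B) avoids the wrap.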
[jira] [Updated] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1247:
Attachment: NUTCH-1247.patch_B
            NUTCH-1247.patch_A
[jira] [Created] (NUTCH-1250) parse-html does not parse links with empty anchor
parse-html does not parse links with empty anchor

Key: NUTCH-1250
URL: https://issues.apache.org/jira/browse/NUTCH-1250
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Andreas Janning

The parse-html plugin does not generate an outlink if the link has no anchor. For example, the following HTML code does not create an outlink:

{code:html}
<a href="example.com"></a>
{code}

The JUnit test TestDOMContentUtils tries to test this, but does not actually exercise the case, since there is a comment inside the a-tag:

{code:title=TestDOMContentUtils.java|borderStyle=solid}
new String("<html><head><title> title </title>"
    + "</head><body>"
    + "<a href=\"g\"><!--no anchor--></a>"
    + "<a href=\"g1\"> <!--whitespace--> </a>"
    + "<a href=\"g2\"> <img src=\"test.gif\" alt='bla bla'> </a>"
    + "</body></html>"),
{code}

When you remove the comment, the test fails:

{code:title=TestDOMContentUtils.java Test fails|borderStyle=solid}
new String("<html><head><title> title </title>"
    + "</head><body>"
    + "<a href=\"g\"></a>" // no anchor
    + "<a href=\"g1\"> <!--whitespace--> </a>"
    + "<a href=\"g2\"> <img src=\"test.gif\" alt='bla bla'> </a>"
    + "</body></html>"),
{code}
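The bug class is easy to reproduce outside Nutch: an extractor that requires non-empty anchor text will skip an empty link. A small self-contained sketch using the JDK's XML DOM on a well-formed stand-in for the HTML above (this is illustrative logic, not the parse-html code):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class EmptyAnchorDemo {
    /** Returns {total links, links with non-empty anchor text}. */
    static int[] countLinks(String markup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(markup.getBytes("UTF-8")));
        NodeList links = doc.getElementsByTagName("a");
        int withText = 0;
        for (int i = 0; i < links.getLength(); i++) {
            // A link extractor gated on non-empty anchor text misses <a href="g"></a>.
            if (!links.item(i).getTextContent().trim().isEmpty()) {
                withText++;
            }
        }
        return new int[] { links.getLength(), withText };
    }

    public static void main(String[] args) throws Exception {
        int[] r = countLinks("<body><a href=\"g\"></a><a href=\"g1\">anchor</a></body>");
        System.out.println(r[0] + " links, " + r[1] + " with anchor text");
    }
}
```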
[jira] [Updated] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment
[ https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Drapkin updated NUTCH-1242:
Attachment: (was: ParseSegment.patch)

Allow disabling of URL Filters in ParseSegment

Key: NUTCH-1242
URL: https://issues.apache.org/jira/browse/NUTCH-1242
Project: Nutch
Issue Type: Improvement
Reporter: Edward Drapkin
Fix For: 1.5
Attachments: ParseSegment.patch, parseoutputformat.patch

Right now, the ParseSegment job does not allow you to disable URL filtering. For reasons that aren't worth explaining, I need to do this, so I enabled this behavior through a boolean configuration value, parse.filter.urls, which defaults to true. I've attached a simple, preliminary patch that enables this behavior with that configuration option. I'm not sure whether it should also be made a command-line option.
[jira] [Updated] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment
[ https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Drapkin updated NUTCH-1242:
Attachment: ParseSegment.patch

Updated patch to add a message to the usage description.
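The parse.filter.urls switch described in the issue would be set in nutch-site.xml; a hypothetical entry (the property name and its default come from the issue description, the description text is illustrative):

```xml
<!-- Hypothetical nutch-site.xml fragment for the patch discussed above. -->
<property>
  <name>parse.filter.urls</name>
  <value>false</value>
  <description>If false, ParseSegment skips URL filtering of outlinks.
  Defaults to true, preserving the current behavior.</description>
</property>
```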
[jira] [Assigned] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment
[ https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1242:
Assignee: Markus Jelsma
[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187874#comment-13187874 ]

Edward Drapkin commented on NUTCH-1201:

Does this still need to be done? It seems pretty easy and I'll volunteer to do it if it needs to be done. I was thinking of breaking all of Fetcher apart into more easily compartmentalized and pluggable units for my own benefit (as right now it's an enormous class that's extremely daunting and hard to change). If this issue still needs work, I think I can break Fetcher apart to allow for pluggable fetchers, item queues, queue feeders and fetcher threads, but I don't want to invest time into reinventing a wheel that you've already invented (and may not have updated JIRA). Let me know!

Allow for different FetcherThread impls

Key: NUTCH-1201
URL: https://issues.apache.org/jira/browse/NUTCH-1201
Project: Nutch
Issue Type: New Feature
Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5

For certain cases we need to modify parts of FetcherThread and make it pluggable. This introduces a new config directive, fetcher.impl, that takes a FQCN; Fetcher.fetch uses that setting to load a class for job.setMapRunnerClass(). This new class has to extend Fetcher and an inner class FetcherThread. This allows for overriding methods in FetcherThread, but also methods in Fetcher itself if required. A follow-up on this issue would be to refactor parts of FetcherThread to make it easier to override small sections instead of copying the entire method body for a small change, which is now the case.
[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187884#comment-13187884 ]

Markus Jelsma commented on NUTCH-1201:

Hi Edward, I've already modified Fetcher to allow for different Fetcher impls via configuration that inherit from Fetcher itself. It works fine and I can override the methods I need. However, it may not be that elegant. There's no code to use other queue impls. I'll cook up a patch tomorrow.
I want to volunteer some time
Hello all,

I've got a bunch of spare time coming up in the next several weeks/months and would like to volunteer to help the project out. I'm already extremely familiar with the internals of Nutch, as I've been hacking at it for our internal use here (at Wolfram Research) for the last ~1.5 years. While there's probably a fair amount of code that I haven't read, I've at least visited and read some of every area of Nutch's core and most of the plugins.

I think I should put that knowledge to good use and contribute back (I've already sent some patches in, but nothing major or really that significant), but I'm not sure what needs to be done or where my time would be best spent. I just subscribed to this list, so if there's a current thread discussing priorities, can someone point me to it in the archives? Barring that, can someone point me in the direction where I should be looking to contribute? My best guess is to just start attacking JIRA tickets...

Thanks, Eddie
[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187904#comment-13187904 ]

Edward Drapkin commented on NUTCH-1201:

I was thinking more of an approach of breaking Fetcher into these components:

interface Fetcher
class FetcherImpl
interface FetcherThread (extends Thread)
class FetcherThreadImpl
interface FetchItemQueue
class FetchItemQueueImpl
interface FetchQueueFeeder
class FetchQueueFeederImpl

where all of the *Impl classes would be the current implementations. I may be over-engineering here (I'm pretty prone to do that), but I think this would open up the potential to heavily profile fetching and optimize under various scenarios, as I have a sneaking suspicion there's a lot more lock contention and thread spinning during fetching than is entirely necessary. It may be beneficial to offer several implementations out of the box for various scenarios: single-threaded fetchers, lightweight queues for short lists and/or small numbers of fetcher threads, heavyweight queues for large workloads, etc.
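A partial Java sketch of two of the interfaces proposed above (hypothetical signatures; the *Impl stands in for the logic currently living in Fetcher's inner classes):

```java
import java.util.ArrayDeque;

// Illustrative decomposition only; method names and signatures are assumptions,
// not the actual Nutch Fetcher API.
interface FetchItemQueue {
    void add(String url);
    String next();       // returns null when empty
    boolean isEmpty();
}

// Default implementation backed by a simple FIFO deque; a "heavyweight"
// variant could swap in a concurrent, per-host partitioned structure.
class FetchItemQueueImpl implements FetchItemQueue {
    private final ArrayDeque<String> items = new ArrayDeque<>();
    public void add(String url) { items.addLast(url); }
    public String next() { return items.pollFirst(); }
    public boolean isEmpty() { return items.isEmpty(); }
}

// The feeder side of the producer-consumer split discussed in the thread.
interface FetchQueueFeeder {
    void feed(FetchItemQueue queue);
}
```

The point of the split is that a Fetcher implementation would only touch the interfaces, so queue, feeder, and thread strategies can vary independently.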
Re: I want to volunteer some time
Hi! Excellent! You may want to check the list of issues for 1.5. There are several issues being worked on from time to time, a number of open issues, and even a few hairy problems. A contribution as a patch or a comment on any issue is always appreciated. You can also create issues to solve problems yourself, as you did with the parser filters issue. Anything is welcome!

Cheers,
Re: I want to volunteer some time
Hi Eddie,

Great to hear that! Just to add to what Markus said, there are also quite a few tasks to do on the NutchGora branch, if that's something you'd be interested in. Outside the tasks on JIRA, there is always a fair bit to do on the Wiki, e.g. how to run in distributed mode etc. Just out of curiosity, could you tell us a bit about what you've been using Nutch for at Wolfram Research?

Thanks for volunteering

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187927#comment-13187927 ]

Andrzej Bialecki commented on NUTCH-1201:

I agree that there are situations where you might want a custom fetcher (e.g. depth-first crawling), and it would be good to come up with some more specific API than just MapRunner. I'm not convinced yet that providing interfaces (or rather abstract classes) for the existing plumbing in Fetcher is a good idea. Let's first figure out whether this code is reusable at all for some other fetching strategies, because if it's not, then providing custom queue impls may offer little value, and perhaps customization should be implemented at a different level.

Re. thread spinning: I haven't yet seen an unequivocal case proving that crawl contention is caused by the thread management in Fetcher. Usually, on closer look, the bottleneck turned out to lie elsewhere (network I/O, remote throttling, DNS lookups, politeness rules, etc.).
[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187950#comment-13187950 ]

Edward Drapkin commented on NUTCH-1201:

You bring up a good point; I was making a pretty blatant assumption that the code is in fact reusable for these other cases. I think at the highest level fetching will always basically be a producer-consumer task, which implies there will always be these components: some queue, something to feed the queue, something to consume from the queue, and something to pull it all together into the Hadoop job. If there's a better way of architecting the code necessary to run a fetching process, it's not something I've seen. The interfaces I suggest reflect this (and use the same names currently in use), and the default implementations would be the existing code, so as not to break backwards compatibility. I do think, though, that Fetcher itself ought to be able to be overridden and customized (hence providing an interface to it), although we should focus on making that something no one wants to do, so it doesn't even need to be discouraged. I envision a situation in which Fetcher basically serves as glue that holds the other three components together, so any logic that needs to change would change in one of the other components. We may wind up in a situation where the only benefit of providing custom queue behavior is in conjunction with providing custom queue feeder and queue consumer behavior; as a matter of fact, I'd fully expect this to frequently be the case.

Perhaps a better overall approach would be to break fetching into a high-level Nutch abstraction, then provide several fetching plugins that can be dropped into place depending on the situation, similar to the way the protocol plugins behave. The fetcher already runs threads outside of the Hadoop framework, so a generic fetcher job that just invoked a fetching plugin wouldn't have to be a regression of any sort. The more I think about it, the more I think this may be the right solution to a modular fetching system: Nutch (eventually) shipping with fetch-depthfirst, fetch-unthreaded, fetch-default, and plugins for any other scenario that may arise would support several cases right out of the box. This approach would probably be the most difficult in terms of man-hours and testing (but hey, I'm volunteering, right?), but I think it's probably the best way to provide modular fetcher functionality. If we decide to break the fetcher into a plugin, then the fetcher only has to conform to a relatively simple interface. I'd think we would provide an abstract class that implements that interface and holds together the other sub-components mentioned above, as a starting point for the various fetcher plugins, but I don't think we would have to require that it be used. We could similarly offer abstract default implementations of the various sub-components, but we'd nowhere force or require them to be used in any capacity.
[jira] [Updated] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException
[ https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arkadi Kosmynin updated NUTCH-1251:
Description:

Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

Key: NUTCH-1251
URL: https://issues.apache.org/jira/browse/NUTCH-1251
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.4
Environment: Any crawl where the number of URLs in Solr exceeds 1024 (the default max number of clauses in a Lucene boolean query).
Reporter: Arkadi Kosmynin
Priority: Critical

Deletion of duplicates fails. This happens because the "get all" query used to get the Solr index size is id:[* TO *], which is a range query. Lucene tries to expand it to a boolean query and gets as many clauses as there are ids in the index. This is too many in a real situation, and it throws an exception. To correct this problem, change the "get all" query (SOLR_GET_ALL_QUERY) to *:*, which is the standard Solr get-all query.

Indexing log extract:

java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
	at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
	... 3 more
Caused by: org.apache.solr.common.SolrException: Internal Server Error
Internal Server Error
request: http://localhost:8081/arch/select?q=id:[* TO *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
	... 5 more
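The proposed fix amounts to swapping the range query for Solr's match-all query; a sketch of the one-line change (constant names from the report, surrounding SolrDeleteDuplicates code omitted):

```java
// Sketch based on the report, not the actual Nutch source layout.
public class SolrConstants {
    // Old value: a range query over ids. Lucene rewrites it into a boolean
    // query with one clause per distinct id, which trips the default
    // maxClauseCount (1024) on any real index.
    public static final String OLD_GET_ALL_QUERY = "id:[* TO *]";

    // Proposed value: Solr's standard match-all query, which is evaluated
    // directly and never expanded clause-by-clause.
    public static final String SOLR_GET_ALL_QUERY = "*:*";
}
```

With *:* the request that failed above would count the whole index without any boolean-query expansion.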
[jira] [Updated] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException
[ https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1251:
Fix Version/s: 1.5
[jira] [Commented] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException
[ https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188095#comment-13188095 ]

Markus Jelsma commented on NUTCH-1251:

Can you provide a patch for trunk?
[jira] [Commented] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException
[ https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188115#comment-13188115 ]

Arkadi Kosmynin commented on NUTCH-1251:

It is a one-line change: file org.apache.nutch.indexer.solr.SolrDeleteDuplicates.java, line 90.
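The one-line change Arkadi describes is the SOLR_GET_ALL_QUERY constant. A minimal sketch of the proposed edit, assuming the constant's shape (the surrounding class and the OLD_ field name here are illustrative, and the exact line number may differ between versions):

```java
// Sketch of the proposed fix in SolrDeleteDuplicates (around line 90).
public class GetAllQueryFix {
    // Before: a range query that Lucene expands to one clause per indexed id,
    // overflowing maxClauseCount on any index larger than 1024 documents.
    static final String OLD_SOLR_GET_ALL_QUERY = "id:[* TO *]";

    // After: Solr's standard match-all query, answered without any expansion.
    static final String SOLR_GET_ALL_QUERY = "*:*";

    public static void main(String[] args) {
        System.out.println(OLD_SOLR_GET_ALL_QUERY + " -> " + SOLR_GET_ALL_QUERY);
    }
}
```

Since *:* is handled natively by Solr's query parser rather than rewritten by Lucene, it works regardless of index size.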
[Nutch Wiki] Trivial Update of AdminGroup by LewisJohnMcgibbney
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The AdminGroup page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/AdminGroup?action=diff&rev1=6&rev2=7

   * JulienNioche
   * MarkusJelsma
   * ElisabethAdler
+  * EdwardDrapkin
Re: I want to volunteer some time
Hi Eddie,

I've added you to the AdminGroup for our wiki, so you will be able to edit whichever areas you are interested in, or which you think can/should be improved. Your introduction sounds really interesting, and as Markus and Julien have said, there are a lot of issues which merit some input; it's great that you are able to contribute. Just a quick side note: as Julien said, we also maintain a Nutchgora branch, which has some unique characteristics you might find interesting.

Best for now,
Lewis

On Tue, Jan 17, 2012 at 9:31 PM, Eddie Drapkin edwa...@wolfram.com wrote:

Alrighty! I checked out the JIRA and sort of attacked an issue I think I can contribute to... I'll look and try to find more as well. I can certainly write documentation if that's a need (when isn't it?); just point me at the areas that need better documentation and I'll do what I can. You mentioned distributed mode, which is something I actually can't really document because it's not something we use: our crawler exists as a single intranet server and probably will for the foreseeable future. Do I need any special account privileges to edit wiki pages (username is EdwardDrapkin)?

We use Nutch here to crawl our various intranet sites to build Lucene indexes for a few search applications that we have (search.wolfram.com, MathWorld, etc.). I've written a rather hefty plugin for it to accommodate some of the custom functionality we need (I'd guess it's ~20,000 lines of code). We have our search broken down by our sites (e.g. reference.wolfram.com is one index and MathWorld is another), which are crawled separately, so a lot of our custom functionality is written in light of that, particularly scoring. Because it's custom code for a single purpose, a lot of the code is also there to curate the data going into the index (custom parsers for a particular site to remove navigation elements, for instance).

The most (only, really) interesting thing that I've done with it is tracking wiki changes outside of the primary crawl database (I keep my own database of page modification times) and creating custom fetch lists, so that our wiki can be crawled nightly, as it's rather massive and hosted on a shared machine that can't support an intensive crawl every night. I've also re-created the Lucene index plugin as part of our plugin, as we don't use Solr but our own search application. I'm working now on creating a comprehensive link graph of all links for a particular crawl configuration, while still only crawling the correct URLs, so that we can experiment with various page scoring algorithms. This is why I wanted to not filter the links in the parse stage: now I can have a crawldb with entries from anywhere on the internet while still only crawling a particular subdomain. I'm not sure what the standard use case is for Nutch, but I think we're probably a bit outside of it, but only a bit.

Thanks,
Eddie

On 1/17/2012 1:22 PM, Julien Nioche wrote:

Hi Eddie,

Great to hear that! Just to add to what Markus said, there are also quite a few tasks to do on the NutchGora branch, if that's something you'd be interested in. Outside the tasks on JIRA, there is always a fair bit to do on the Wiki, e.g. how to run in distributed mode, etc. Just out of curiosity, could you tell us a bit about what you've been using Nutch for at Wolfram Research?

Thanks for volunteering,
Julien

On 17 January 2012 19:15, Markus Jelsma markus.jel...@openindex.io wrote:

Hi! Excellent! You may want to check the list of issues for 1.5. There are several issues being worked on from time to time, a number of open issues, and even a few hairy problems. Contribution as a patch or comment on any issue is always appreciated. You can also create issues to solve problems yourself, as you did with the parser filters issue. Anything is welcome!

Cheers,

Hello all,

I've got a bunch of spare time coming up in the next several weeks/months and would like to volunteer to help the project out. I'm already extremely familiar with the internals of Nutch, as I've been hacking at it for our internal use here (at Wolfram Research) for the last ~1.5 years or so. While there's probably a fair amount of code that I haven't read, I've at least visited and read some of all of the areas of Nutch's core and most of the plugins. I think I should put that knowledge to good use and contribute back (I've already sent some patches in, but nothing major or really even that significant), but I'm not sure what needs to be done or where my time would be best spent. I just subscribed to this list, so if there's a thread discussing priorities that's current and whatnot, can someone point me to it in the archives? Barring that, can someone point me in the direction where I should be looking to contribute? My best guess is to just start attacking JIRA tickets...

Thanks,
Eddie