Multiple collections
Has anyone given any thought to allowing Nutch to create multiple collections, perhaps in separate config directories, and allowing the web interface to access any collection individually?

To put it into context, say I was planning to use Nutch to index a number of sites independently on a single Nutch server, and use separate OpenSearch clients to access the results from each of the individual sites. Can this be done under the one Nutch application currently? Is there a plan to implement this functionality, and if it already exists, how would one do it?

I've looked into how to do it within the source, and I believe it can be done, but if there's another way that you believe this should be implemented (rather than defining 'collections') I'd love to know about it before I put effort into making such a change/patch.

Thanks,
Nathan
Re: How can I get one plugin's root dir
Thanks Dennis! Your method should work. I really hope there will be a direct method, say getPluginRootDir(), in the plugin implementation.

On 1/16/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
You can get the PluginRepository and then from there get the plugin descriptor and its path. From there you can reach resources inside the plugin folder. Change out parse-html with your plugin id.

  Configuration conf = NutchConfiguration.create();
  PluginRepository rep = PluginRepository.get(conf);
  PluginDescriptor desc = rep.getPluginDescriptor("parse-html");
  String path = desc.getPluginPath();
  System.out.println(path);

Dennis Kubes

Scott Green wrote:
> Can someone give an answer? I don't think it is a good idea to put all
> configuration/resources under the "conf" dir.
>
> On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I need to load some resources from my plugin's sub-directory. Is there any
>> available method to get a specified plugin's root directory now?
>> thanks
>>
>> - scott
Re: How can I get one plugin's root dir
Hi,

I want to propose a slightly cleaner plugin directory structure:

xxx-plugin
|-- lib
|-- conf
|-- src
|-- web (only for web plugin)
|-- plugin.xml
`-- build.xml

Take the urlfilter-regex plugin as an example: the configuration file "regex-urlfilter.txt" should be put in the conf/ dir. Does this make sense?

On 1/16/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Scott Green wrote:
> Can someone give an answer? I don't think it is a good idea to put all
> configuration/resources under the "conf" dir.
>
> On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I need to load some resources from my plugin's sub-directory. Is there any
>> available method to get a specified plugin's root directory now?
>> thanks

You need to make sure that this resource is packaged into the plugin jar (just see how it's done in other plugins). Then you should be able to access it through the ClassLoader that loaded this plugin, e.g.

  package a.b.c;

  public class MyPlugin {
    ...
    InputStream is = MyPlugin.class.getResourceAsStream("myResource.txt");
    ...
  }

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: How can I get one plugin's root dir
You can get the PluginRepository and then from there get the plugin descriptor and its path. From there you can reach resources inside the plugin folder. Change out parse-html with your plugin id.

  Configuration conf = NutchConfiguration.create();
  PluginRepository rep = PluginRepository.get(conf);
  PluginDescriptor desc = rep.getPluginDescriptor("parse-html");
  String path = desc.getPluginPath();
  System.out.println(path);

Dennis Kubes

Scott Green wrote:
> Can someone give an answer? I don't think it is a good idea to put all
> configuration/resources under the "conf" dir.
>
> On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I need to load some resources from my plugin's sub-directory. Is there any
>> available method to get a specified plugin's root directory now?
>> thanks
>>
>> - scott
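Once the plugin path is known, a resource under it still has to be located. The following is only a sketch of that step, assuming the path string comes from desc.getPluginPath() as in the reply above. The class PluginResources, its resolve method, and the conf/example.txt file are hypothetical names used for illustration and are not part of Nutch. The sketch uses the JDK alone, so it runs without Nutch on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch.
final class PluginResources {

    // Resolves relativeName below pluginRoot and rejects names that
    // would point outside the plugin directory (e.g. "../other").
    static Path resolve(String pluginRoot, String relativeName) {
        Path root = Paths.get(pluginRoot).toAbsolutePath().normalize();
        Path candidate = root.resolve(relativeName).normalize();
        if (!candidate.startsWith(root)) {
            throw new IllegalArgumentException(
                "Resource escapes plugin root: " + relativeName);
        }
        return candidate;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the directory that getPluginPath() would return.
        Path root = Files.createTempDirectory("parse-html");
        Files.createDirectories(root.resolve("conf"));
        Files.write(root.resolve("conf").resolve("example.txt"),
                    "hello".getBytes(StandardCharsets.UTF_8));

        Path resource = resolve(root.toString(), "conf/example.txt");
        byte[] bytes = Files.readAllBytes(resource);
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The startsWith check is a design choice rather than a requirement: it keeps a misconfigured resource name from reading files that belong to another plugin.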
Re: How can I get one plugin's root dir
Scott Green wrote:
> Can someone give an answer? I don't think it is a good idea to put all
> configuration/resources under the "conf" dir.
>
> On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I need to load some resources from my plugin's sub-directory. Is there any
>> available method to get a specified plugin's root directory now?
>> thanks

You need to make sure that this resource is packaged into the plugin jar (just see how it's done in other plugins). Then you should be able to access it through the ClassLoader that loaded this plugin, e.g.

  package a.b.c;

  public class MyPlugin {
    ...
    InputStream is = MyPlugin.class.getResourceAsStream("myResource.txt");
    ...
  }

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
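The ClassLoader approach above can be written out as a small runnable sketch. The class name ClasspathResource and the helper read are hypothetical, and "myResource.txt" is only the placeholder name from the message. Note that Class.getResourceAsStream resolves a name without a leading slash relative to the package of the class it is called on, and returns null when nothing is found, so the null case is handled explicitly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the null check that the short snippet omits.
final class ClasspathResource {

    // Returns the resource content as UTF-8 text, or null when the
    // class loader of 'owner' cannot find a resource with that name.
    static String read(Class<?> owner, String name) throws IOException {
        try (InputStream in = owner.getResourceAsStream(name)) {
            if (in == null) {
                return null;
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = read(ClasspathResource.class, "myResource.txt");
        System.out.println(text == null ? "resource not found" : text);
    }
}
```

Inside a plugin, the owner class would be one of the plugin's own classes, so that the lookup goes through the class loader that loaded the plugin jar.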
Re: How can I get one plugin's root dir
Can someone give an answer? I don't think it is a good idea to put all configuration/resources under the "conf" dir.

On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I need to load some resources from my plugin's sub-directory. Is there any
> available method to get a specified plugin's root directory now?
> thanks
>
> - scott
[jira] Resolved: (NUTCH-430) integer overflow in HashComparator.compare
[ https://issues.apache.org/jira/browse/NUTCH-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-430.
-----------------------------
    Resolution: Fixed
    Fix Version/s: 0.9.0

committed in revision 495732 with additional whitespace changes.

> integer overflow in HashComparator.compare
> ------------------------------------------
>
>                 Key: NUTCH-430
>                 URL: https://issues.apache.org/jira/browse/NUTCH-430
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-430.patch
>
>
> There's an integer overflow problem in HashComparator which leads to the fetchlist
> not being sorted properly by hash of url. This leads to slower fetching
> speeds if there are many urls from the same host, as they are not evenly
> distributed.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
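The notification does not include the patch itself, so the following is not the HashComparator code. It is a generic illustration, under that assumption, of how an int overflow can break a comparator: comparing two hashes by subtraction wraps around when the operands are far apart, while Integer.compare does not. The class and field names are hypothetical.

```java
import java.util.Comparator;

// Generic illustration of a subtraction-based comparator overflowing.
final class HashCompareDemo {

    // Overflow-prone: a - b wraps around for operands of opposite sign
    // and large magnitude, which flips the sign of the result.
    static final Comparator<Integer> BY_SUBTRACTION = (a, b) -> a - b;

    // Overflow-safe: compares without doing any arithmetic on the values.
    static final Comparator<Integer> BY_COMPARE = (a, b) -> Integer.compare(a, b);

    public static void main(String[] args) {
        int large = Integer.MAX_VALUE;
        int small = -2;

        // large - small exceeds the int range and wraps to a negative
        // number, so the comparator wrongly reports large < small.
        System.out.println("subtraction says large < small: "
            + (BY_SUBTRACTION.compare(large, small) < 0));

        // Integer.compare reports the correct order.
        System.out.println("Integer.compare says large > small: "
            + (BY_COMPARE.compare(large, small) > 0));
    }
}
```

Both lines print true. A sort driven by the first comparator can therefore place entries in the wrong order, which matches the symptom described in the issue.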
[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic
[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464738 ]

Alan Tanaman commented on NUTCH-422:
------------------------------------

Sami,

About your questions - thank you for looking at this plugin. I will be seeing to all of them and will respond over the next week, as I currently have a couple of stressed clients...

Best regards,
Alan

> index-extra plugin creates additional fields in the index, based on
> configurable logic
> -------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: https://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>         Assigned To: Sami Siren
>         Attachments: index-extra-v1.0-bin-java1.5.zip,
>                      index-extra-v1.0-source.zip
>
>
> Extract from the Readme file:
>
> A. Introduction
> The index-extra plugin allows you to configure additional fields that you
> wish to be added to the index, based on one of the following sources:
>   - The parsed text
>   - Meta data fields
>   - Previously created document-to-be-indexed fields
>   - Plain constant string
>   - Java expression combining one or more of the above, and resolving to
>     a string
> A regex can also be applied to any of the above, allowing fields to be
> created based on patterns extracted from the source.
>
> B. Installation
> 1) Binaries only:
>      Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
>      Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>      Enable the plugin by updating the nutch-site.xml file
> 2) Source code:
>      Always refer to the Nutch wiki for detailed instructions on building Nutch. In short:
>      Copy the 'index-extra' folder within index-extra-v1.0-source.zip to NUTCHDIR/src/plugin
>      Update the build.xml in NUTCHDIR/src/plugin to include plugin
>      Update the NUTCHDIR/default.properties file to include plugin
>      run ant to build
>      Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>      Enable the plugin by updating the nutch-site.xml file
>
> C. Known Issues
> 1) For this plugin to work correctly on any document field, it is
>    necessary to run the other index filters first, so that all basic
>    document fields are generated first. To do this, configure the
>    indexingfilter.order property. (Please see patch NUTCH-421 to enable
>    the indexingfilter.order property. If this patch is not applied, the
>    plugin will still work, but will not be able to use document fields
>    created by other index filter plugins.)
> 2) At this stage, field boost cannot be used, as Nutch scoring overrides
>    the field boost with its own document-level boost calculation. This
>    occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.
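The Readme extract says a regex can be applied to a source to create a field from an extracted pattern, but it shows neither the plugin's code nor its configuration format. The sketch below only illustrates that general idea with java.util.regex. The class RegexFieldSketch, the method extract, the sample text and the pattern are all hypothetical and are not taken from the index-extra plugin.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of deriving a field value via a regex;
// this is not the index-extra plugin's implementation.
final class RegexFieldSketch {

    // Returns the first capture group of the first match in 'source',
    // or null when the pattern does not match. The pattern is expected
    // to contain at least one capture group.
    static String extract(String source, Pattern pattern) {
        Matcher matcher = pattern.matcher(source);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Stand-in for parsed text; a real source could also be a
        // metadata field or a previously created document field.
        String parsedText = "Invoice #4711 issued to ACME";
        Pattern invoiceNumber = Pattern.compile("Invoice\\s+#(\\d+)");

        // The extracted value would become the content of the new field.
        System.out.println(extract(parsedText, invoiceNumber)); // prints 4711
    }
}
```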
[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464725 ]

Armel Nene commented on NUTCH-61:
---------------------------------

I was able to apply the patch to Nutch 0.8.1 and have it running successfully. I think this patch should be part of the core code. When crawling a terabyte of data, it is important that only changed data be fetched and parsed. Prior to applying this patch, we ran Nutch in our lab and were confronted with SYSTEM OUT MEMORY messages when trying to crawl as little as 10 GB of data. With this patch, it's true that performance will be slower because of the check for unmodified data, but overall it's worth it.

+5 for this patch.

> Adaptive re-fetch interval. Detecting umodified content
> -------------------------------------------------------
>
>                 Key: NUTCH-61
>                 URL: https://issues.apache.org/jira/browse/NUTCH-61
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>         Attachments: 20050606.diff, 20051230.txt, 20060227.txt,
>                      nutch-61-417287.patch
>
>
> Currently Nutch doesn't automatically adjust its re-fetch period, no matter
> whether individual pages change seldom or frequently. The goal of these changes is
> to extend the current codebase to support various possible adjustments to
> re-fetch times and intervals, and specifically a re-fetch schedule which
> tries to adapt the period between consecutive fetches to the period of
> content changes.
> Also, these patches implement checking whether the content has changed since the last
> fetch; protocol plugins are also changed to make use of this information,
> so that if content is unmodified it doesn't have to be fetched and processed.
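The issue describes a schedule that adapts the re-fetch period to how often content changes, without giving the algorithm. One simple way to get that behaviour is sketched below, purely as an assumption: shorten the interval when a page was found modified, lengthen it when it was not, and clamp the result. The class name, the rates and the bounds are hypothetical values chosen for illustration and are not taken from the attached patches.

```java
// Hypothetical sketch of an adaptive re-fetch interval; the rates and
// bounds are illustrative assumptions, not values from the NUTCH-61 patch.
final class AdaptiveIntervalSketch {

    static final double INC_RATE = 0.4;                   // grow by 40% when unmodified
    static final double DEC_RATE = 0.2;                   // shrink by 20% when modified
    static final long MIN_INTERVAL_SEC = 3600L;           // one hour
    static final long MAX_INTERVAL_SEC = 30L * 24 * 3600; // thirty days

    // Returns the next interval in seconds, given the current interval
    // and whether the last fetch found the content modified.
    static long nextInterval(long currentSec, boolean modified) {
        double factor = modified ? (1.0 - DEC_RATE) : (1.0 + INC_RATE);
        long next = Math.round(currentSec * factor);
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, next));
    }

    public static void main(String[] args) {
        long oneDay = 24L * 3600;
        // A page that changed is revisited sooner, an unchanged one later.
        System.out.println(nextInterval(oneDay, true));
        System.out.println(nextInterval(oneDay, false));
    }
}
```

Clamping keeps frequently changing pages from being fetched in a tight loop and keeps static pages from dropping out of the crawl entirely.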