Multiple collections

2007-01-15 Thread Nathan ter Bogt

Has anyone given any thought to allowing Nutch to create multiple
collections, perhaps in separate config directories, and allowing the
web interface to access any collection individually?

To put it into context, say I was planning to use Nutch to index a
number of sites independently on a single Nutch server, and use
separate OpenSearch clients to access the results from each of the
individual sites. Can this be done under the one Nutch application
currently? Is there a plan to implement this functionality, and if it
already exists, how would one do it?

I've looked into how to do it within the source, and I believe it can
be done, but if there's another way that you believe this should be
implemented (rather than defining 'collections') I'd love to know
about it before I put effort into making such a change/patch.

Thanks,
Nathan


Re: How can I get one plugin's root dir

2007-01-15 Thread Scott Green

Thanks Dennis! Your method should work.

And I would really like to see a direct method, say getPluginRootDir(),
in the plugin implementation.


On 1/16/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:

You can get the PluginRepository and then from there get the plugin
descriptor and its path.  From there you can reach resources inside the
plugin folder.  Replace parse-html with your plugin id.

 Configuration conf = NutchConfiguration.create();
 PluginRepository rep = PluginRepository.get(conf);
 PluginDescriptor desc = rep.getPluginDescriptor("parse-html");
 String path = desc.getPluginPath();
 System.out.println(path);


Dennis Kubes

Scott Green wrote:
> Can someone give an answer? I don't think it is a good idea to put all
> configuration/resources under the "conf" dir.
>
> On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I need to load some resources from my plugin's sub-directory. Is there
>> any available method to get a specified plugin's root directory now?
>> Thanks
>>
>> - scott
>>



Re: How can I get one plugin's root dir

2007-01-15 Thread Scott Green

Hi,

I want to propose a somewhat cleaner plugin directory structure:

xxx-plugin
  `-- lib
  `-- conf
  `-- src
  `-- web (only for web plugin)
  `-- plugin.xml
  `-- build.xml

Take the urlfilter-regex plugin as an example: its configuration file
"regex-urlfilter.txt" should be put in the conf/ dir. Does this make
sense?

On 1/16/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Scott Green wrote:
> Can someone give an answer? I don't think it is a good idea to put all
> configuration/resources under the "conf" dir.
>
> On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I need to load some resources from my plugin's sub-directory. Is there
>> any available method to get a specified plugin's root directory now?
>> Thanks

You need to make sure that this resource is packaged into the plugin jar
(just see how it's done in other plugins). Then you should be able to
access it through the ClassLoader that loaded this plugin, e.g.

package a.b.c;

public class MyPlugin {
...
InputStream is = MyPlugin.class.getResourceAsStream("myResource.txt");
...
}

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: How can I get one plugin's root dir

2007-01-15 Thread Dennis Kubes

You can get the PluginRepository and then from there get the plugin
descriptor and its path.  From there you can reach resources inside the
plugin folder.  Replace parse-html with your plugin id.


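import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginDescriptor;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

Configuration conf = NutchConfiguration.create();
PluginRepository rep = PluginRepository.get(conf);
PluginDescriptor desc = rep.getPluginDescriptor("parse-html"); // your plugin id
String path = desc.getPluginPath(); // root directory of that plugin
System.out.println(path);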
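
As a sketch of the "resources inside the plugin folder" step: once the
path is known, a file under it can be located with the standard
java.nio.file API (Java 7 or later is assumed). PluginResources and
resolve are hypothetical names used for illustration only, not part of
Nutch, and the check against leaving the plugin folder is an added
precaution rather than something this thread asks for.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not a Nutch class): locates a file inside a plugin folder. */
final class PluginResources {
    private PluginResources() {}

    /**
     * @param pluginPath   the plugin root, e.g. the value of PluginDescriptor.getPluginPath()
     * @param relativeName path relative to the plugin root, e.g. "conf/my-settings.txt"
     * @return the absolute, normalized location of the resource
     */
    static Path resolve(String pluginPath, String relativeName) {
        Path root = Paths.get(pluginPath).toAbsolutePath().normalize();
        Path target = root.resolve(relativeName).normalize();
        // Reject names such as "../other-plugin/x.txt" that would leave the plugin folder.
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("Resource is outside the plugin folder: " + relativeName);
        }
        return target;
    }
}
```

For example, PluginResources.resolve(path, "conf/my-settings.txt") would
give the location of a (hypothetical) conf/my-settings.txt file inside
the plugin whose root directory is path.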


Dennis Kubes

Scott Green wrote:

Can someone give an answer? I don't think it is a good idea to put all
configuration/resources under the "conf" dir.

On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:

Hi,

I need to load some resources from my plugin's sub-directory. Is there
any available method to get a specified plugin's root directory now?
Thanks

- scott



Re: How can I get one plugin's root dir

2007-01-15 Thread Andrzej Bialecki

Scott Green wrote:

Can someone give an answer? I don't think it is a good idea to put all
configuration/resources under the "conf" dir.

On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:

Hi,

I need to load some resources from my plugin's sub-directory. Is there
any available method to get a specified plugin's root directory now?
Thanks


You need to make sure that this resource is packaged into the plugin jar 
(just see how it's done in other plugins). Then you should be able to 
access it through the ClassLoader that loaded this plugin, e.g.


package a.b.c;

public class MyPlugin {
...
   InputStream is = MyPlugin.class.getResourceAsStream("myResource.txt");
...
}
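
A fuller version of the same idea might look like the sketch below.
PluginResourceReader and read are hypothetical names, not Nutch classes,
and the sketch assumes Java 7 or later and a UTF-8 text resource. Note
that getResourceAsStream returns null when the resource cannot be found,
so it is worth checking for that before reading.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not a Nutch class): reads a text resource packaged next to a class. */
final class PluginResourceReader {
    private PluginResourceReader() {}

    /**
     * @param owner a class loaded from the plugin jar; its ClassLoader is used for the lookup
     * @param name  resource name; without a leading "/" it is resolved relative to owner's package
     */
    static String read(Class<?> owner, String name) throws IOException {
        try (InputStream in = owner.getResourceAsStream(name)) {
            if (in == null) {
                // The resource was not packaged, or the name is wrong.
                throw new FileNotFoundException("Resource not found: " + name);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

With the MyPlugin class from above, this would be called as
PluginResourceReader.read(MyPlugin.class, "myResource.txt").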

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: How can I get one plugin's root dir

2007-01-15 Thread Scott Green

Can someone give an answer? I don't think it is a good idea to put all
configuration/resources under the "conf" dir.

On 1/15/07, Scott Green <[EMAIL PROTECTED]> wrote:

Hi,

I need to load some resources from my plugin's sub-directory. Is there
any available method to get a specified plugin's root directory now?
Thanks

- scott



[jira] Resolved: (NUTCH-430) integer overflow in HashComparator.compare

2007-01-15 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-430.
--

   Resolution: Fixed
Fix Version/s: 0.9.0

committed in revision 495732 with additional whitespace changes.

> integer overflow in HashComparator.compare
> --
>
> Key: NUTCH-430
> URL: https://issues.apache.org/jira/browse/NUTCH-430
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 0.8.1, 0.9.0
> Reporter: Sami Siren
> Assigned To: Sami Siren
> Fix For: 0.9.0
>
> Attachments: NUTCH-430.patch
>
>
> There's an integer overflow problem in HashComparator which causes the
> fetchlist not to be sorted properly by hash of URL. This leads to slower
> fetching speeds if there are many URLs from the same host, as they are not
> evenly distributed.
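
The issue text does not quote the faulty code, so the following is only
a generic illustration of how an integer overflow can break a
comparator, not the actual HashComparator source. One common cause is
comparing two int hashes by subtraction: when the operands are far
apart the difference wraps around and the sign of the result is wrong.
HashCompareDemo and its method names are hypothetical.

```java
/** Generic illustration only; these names do not come from the Nutch source. */
final class HashCompareDemo {
    private HashCompareDemo() {}

    /** Overflow-prone: h1 - h2 wraps around when the true difference does not fit in an int. */
    static int compareBySubtraction(int h1, int h2) {
        return h1 - h2;
    }

    /** Safe: only compares the operands, never does arithmetic on them. */
    static int compareSafely(int h1, int h2) {
        return h1 < h2 ? -1 : (h1 == h2 ? 0 : 1);
    }
}
```

With h1 = Integer.MIN_VALUE and h2 = 1, the subtraction wraps around to
Integer.MAX_VALUE, so the first method reports h1 as the larger value,
while the second correctly reports it as the smaller one.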

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-15 Thread Alan Tanaman (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464738
 ] 

Alan Tanaman commented on NUTCH-422:


Sami,

About your questions - thank you for looking at this plugin.  I will see
to all of them and respond over the next week, as I currently have
a couple of stressed clients...

Best regards,
Alan


> index-extra plugin creates additional fields in the index, based on 
> configurable logic
> --
>
> Key: NUTCH-422
> URL: https://issues.apache.org/jira/browse/NUTCH-422
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 0.8.1
> Environment: All environments
> Reporter: Alan Tanaman
> Assigned To: Sami Siren
> Attachments: index-extra-v1.0-bin-java1.5.zip,
>              index-extra-v1.0-source.zip
>
>
> Extract from the Readme file:
>
> A.  Introduction
> The index-extra plugin allows you to configure additional fields that you
> wish to be added to the index, based on one of the following sources:
>   - The parsed text
>   - Meta data fields
>   - Previously created document-to-be-indexed fields
>   - Plain constant string
>   - Java expression combining one or more of the above, and resolving to
>     a string
> A regex can also be applied to any of the above, allowing fields to be
> created based on patterns extracted from the source.
>
> B.  Installation
> 1)  Binaries only:
>   - Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip
>     to NUTCHDIR/build
>   - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>   - Enable the plugin by updating the nutch-site.xml file
> 2)  Source code:  Always refer to the Nutch wiki for detailed instructions
>     on building Nutch.  In short:
>   - Copy the 'index-extra' folder within index-extra-v1.0-source.zip to
>     NUTCHDIR/src/plugin
>   - Update the build.xml in NUTCHDIR/src/plugin to include plugin
>   - Update the NUTCHDIR/default.properties file to include plugin
>   - run ant to build
>   - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>   - Enable the plugin by updating the nutch-site.xml file
>
> C.  Known Issues
> 1)  For this plugin to work correctly on any document field, it is necessary
>     to run the other index filters first, so that all basic document fields
>     are generated first.  To do this, configure the indexingfilter.order
>     property.  (Please see patch NUTCH-421 to enable indexingfilter.order
>     property.  If this patch is not applied, the plugin will still work, but
>     will not be able to use document fields created by other index filter
>     plugins.)
> 2)  At this stage, field boost can not be used as Nutch scoring overrides the
>     field boost with its own document-level boost calculation.  This occurs
>     at the end of org.apache.nutch.indexer.Indexer's reduce method.
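
The Readme's remark that a regex can be applied to a source to derive a
field value could be illustrated roughly as below. This is a sketch
using only java.util.regex; RegexFieldSketch and extract are
hypothetical names, and nothing here reflects the plugin's actual code
or its configuration format, which are not shown in this message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; not taken from the index-extra plugin. */
final class RegexFieldSketch {
    private RegexFieldSketch() {}

    /**
     * Returns the value for a new field, taken from the first match of pattern in source:
     * capture group 1 if the pattern defines one, otherwise the whole match.
     * Returns null when the pattern does not match, meaning no field would be added.
     */
    static String extract(String source, Pattern pattern) {
        Matcher m = pattern.matcher(source);
        if (!m.find()) {
            return null;
        }
        return m.groupCount() >= 1 ? m.group(1) : m.group();
    }
}
```

The source string could be any of the sources listed in the
Introduction, such as the parsed text or a meta data field.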





[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-01-15 Thread Armel Nene (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464725
 ] 

Armel Nene commented on NUTCH-61:
-

I was able to apply the patch to Nutch 0.8.1 and have it running successfully.
I think this patch should be part of the core code. When crawling a terabyte
of data, it is important that only changed data be fetched and parsed. Prior to
applying this patch, we ran Nutch in our lab and were confronted with
out-of-memory errors when trying to crawl as little as 10 GB of data. With
this patch, performance is admittedly slower because of the check for
unmodified data, but overall it's worth it.

+5 for this patch.



> Adaptive re-fetch interval. Detecting umodified content
> ---
>
> Key: NUTCH-61
> URL: https://issues.apache.org/jira/browse/NUTCH-61
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Attachments: 20050606.diff, 20051230.txt, 20060227.txt,
>              nutch-61-417287.patch
>
>
> Currently Nutch doesn't automatically adjust its re-fetch period, no matter
> whether individual pages change seldom or frequently. The goal of these changes is
> to extend the current codebase to support various possible adjustments to 
> re-fetch times and intervals, and specifically a re-fetch schedule which 
> tries to adapt the period between consecutive fetches to the period of 
> content changes.
> Also, these patches implement checking if the content has changed since last 
> fetching; protocol plugins are also changed to make use of this information, 
> so that if content is unmodified it doesn't have to be fetched and processed.
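
The idea of adapting the re-fetch period to how often content changes
can be sketched as follows. This is a minimal illustration and not the
code in the attached patches: AdaptiveIntervalSketch, nextInterval and
the two percentage constants are hypothetical, chosen only to show the
principle of shortening the interval after a change and lengthening it
when the content is unmodified.

```java
/** Hypothetical sketch of an adaptive re-fetch interval; not taken from the NUTCH-61 patches. */
final class AdaptiveIntervalSketch {
    private AdaptiveIntervalSketch() {}

    // Assumed tuning values, for illustration only.
    static final int SHRINK_PERCENT = 20; // applied when the page has changed
    static final int GROW_PERCENT = 20;   // applied when the page is unmodified

    /**
     * @param current  current re-fetch interval, in seconds
     * @param modified whether the content changed since the last fetch
     * @param min      lower bound for the interval, in seconds
     * @param max      upper bound for the interval, in seconds
     * @return the interval to use before the next fetch, kept within [min, max]
     */
    static long nextInterval(long current, boolean modified, long min, long max) {
        long next = modified
                ? current * (100 - SHRINK_PERCENT) / 100  // changed: come back sooner
                : current * (100 + GROW_PERCENT) / 100;   // unchanged: wait longer
        return Math.max(min, Math.min(max, next));
    }
}
```

Under these assumed values, a page on a 1000-second interval would move
to 800 seconds after a detected change and to 1200 seconds after an
unmodified fetch.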
