[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-15 Thread Alan Tanaman (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464738 ]

Alan Tanaman commented on NUTCH-422:


Sami,

About your questions - thank you for looking at this plugin.  I will look
into all of them and respond over the next week, as I currently have
a couple of stressed clients...

Best regards,
Alan


> index-extra plugin creates additional fields in the index, based on 
> configurable logic
> --
>
>              Key: NUTCH-422
>              URL: https://issues.apache.org/jira/browse/NUTCH-422
>          Project: Nutch
>       Issue Type: New Feature
>       Components: indexer
> Affects Versions: 0.8.1
>      Environment: All environments
>         Reporter: Alan Tanaman
>      Assigned To: Sami Siren
>      Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip
>
>
> Extract from the Readme file:
> A.  Introduction
> The index-extra plugin allows you to configure additional fields that you 
> wish to be added to the index, based on one of the following sources:
>   - The parsed text
>   - Meta data fields
>   - Previously created document-to-be-indexed fields
>   - Plain constant string
>   - Java expression combining one or more of the above, and resolving to 
> a string
> A regex can also be applied to any of the above, allowing fields to be 
> created based on patterns extracted from the source.
> B.  Installation
> 1)  Binaries only:
>     - Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip
>       to NUTCHDIR/build
>     - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>     - Enable the plugin by updating the nutch-site.xml file
> 2)  Source code:  Always refer to the Nutch wiki for detailed instructions
>     on building Nutch.  In short:
>     - Copy the 'index-extra' folder within index-extra-v1.0-source.zip to
>       NUTCHDIR/src/plugin
>     - Update the build.xml in NUTCHDIR/src/plugin to include the plugin
>     - Update the NUTCHDIR/default.properties file to include the plugin
>     - Run ant to build
>     - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>     - Enable the plugin by updating the nutch-site.xml file
> C.  Known Issues
> 1)  For this plugin to work correctly on any document field, the other
>     index filters must run first, so that all basic document fields have
>     already been generated.  To do this, configure the indexingfilter.order
>     property.  (Please see patch NUTCH-421, which enables the
>     indexingfilter.order property.  If this patch is not applied, the plugin
>     will still work, but will not be able to use document fields created by
>     other index filter plugins.)
> 2)  At this stage, field boost cannot be used, as Nutch scoring overrides
>     the field boost with its own document-level boost calculation.  This
>     occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-02 Thread Alan Tanaman (JIRA)

[ http://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461863 ]

Alan Tanaman commented on NUTCH-422:


Many thanks for your feedback.

Do you have any specifics in mind regarding examples?  I will try to include 
any additional ones that we implement.  I know there are a lot of options, but 
it is a little hard to see from my end what is unclear, as I am so involved 
in the development.  Another point of view on this is welcome. ;)

Regarding query-extra, we are not currently using the Nutch bean, so the need 
has not arisen for us at this point in time, but I can see how that would be 
useful.  I guess you could adapt one of the existing query-* plugins fairly 
easily by having it read the XML configuration file to see what fields are 
potentially available in the index.

As for the boost, I included it because being able to control the boost of a 
single field seems useful, although we don't need it at this very moment.  
The line of code in org.apache.nutch.indexer.Indexer's reduce method could be 
overridden, but I'm not yet sure how that would affect the overall scoring 
(scoring is one of my really weak points).
Perhaps one of the scoring experts could give some guidance on this?






[jira] Updated: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2006-12-28 Thread Alan Tanaman (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-422?page=all ]

Alan Tanaman updated NUTCH-422:
---

Attachment: index-extra-v1.0-bin-java1.5.zip
index-extra-v1.0-source.zip






[jira] Created: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2006-12-28 Thread Alan Tanaman (JIRA)
index-extra plugin creates additional fields in the index, based on 
configurable logic
--

 Key: NUTCH-422
 URL: http://issues.apache.org/jira/browse/NUTCH-422
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.8.1
 Environment: All environments
Reporter: Alan Tanaman


Extract from the Readme file:

A.  Introduction

The index-extra plugin allows you to configure additional fields that you 
wish to be added to the index, based on one of the following sources:
  - The parsed text
  - Meta data fields
  - Previously created document-to-be-indexed fields
  - Plain constant string
  - Java expression combining one or more of the above, and resolving to a 
string
A regex can also be applied to any of the above, allowing fields to be 
created based on patterns extracted from the source.
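
As a rough illustration of the regex option described above, here is a minimal sketch. It is not the plugin's actual code: the class ExtraFieldSketch, the helper extract, the sample URL and the field names reportYear and collection are all hypothetical, and a plain Map stands in for the document-to-be-indexed. The regex is assumed to contain exactly one capture group.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the plugin's field logic; not the index-extra source. */
final class ExtraFieldSketch {

    /**
     * Returns the first capture group of the first match of regex in source,
     * or null when the pattern does not match.
     */
    static String extract(String source, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Stand-in for the document-to-be-indexed: field name -> value.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://intranet.example.com/reports/2006/q4.html");

        // New field derived from a previously created field via a regex.
        String year = extract(doc.get("url"), "/reports/(\\d{4})/");
        if (year != null) {
            doc.put("reportYear", year);
        }

        // New field based on a plain constant string.
        doc.put("collection", "enterprise-docs");

        System.out.println(doc);
    }
}
```

Returning null on a failed match lets the caller skip the field rather than index an empty value; whether the real plugin behaves this way is an assumption.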

B.  Installation

1)  Binaries only:
    - Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip
      to NUTCHDIR/build
    - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
    - Enable the plugin by updating the nutch-site.xml file
2)  Source code:  Always refer to the Nutch wiki for detailed instructions
    on building Nutch.  In short:
    - Copy the 'index-extra' folder within index-extra-v1.0-source.zip to
      NUTCHDIR/src/plugin
    - Update the build.xml in NUTCHDIR/src/plugin to include the plugin
    - Update the NUTCHDIR/default.properties file to include the plugin
    - Run ant to build
    - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
    - Enable the plugin by updating the nutch-site.xml file

C.  Known Issues

1)  For this plugin to work correctly on any document field, the other
    index filters must run first, so that all basic document fields have
    already been generated.  To do this, configure the indexingfilter.order
    property.  (Please see patch NUTCH-421, which enables the
    indexingfilter.order property.  If this patch is not applied, the plugin
    will still work, but will not be able to use document fields created by
    other index filter plugins.)

2)  At this stage, field boost cannot be used, as Nutch scoring overrides
    the field boost with its own document-level boost calculation.  This
    occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.






[jira] Updated: (NUTCH-421) Allow predeterminate running order of index filters

2006-12-27 Thread Alan Tanaman (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-421?page=all ]

Alan Tanaman updated NUTCH-421:
---

Description: 
I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the 
user to state in which order the indexing filters are to be run based on a new
indexingfilter.order property. This is needed when a filter needs to rely on 
previously generated document fields as a source of input to generate further 
fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:


<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Filter ordering may affect the end result when one filter depends on
  fields produced by another.
  </description>
</property>










[jira] Updated: (NUTCH-421) Allow predeterminate running order of index filters

2006-12-27 Thread Alan Tanaman (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-421?page=all ]

Alan Tanaman updated NUTCH-421:
---

Attachment: nutch-421.patch






[jira] Updated: (NUTCH-421) Allow predeterminate running order of index filters

2006-12-27 Thread Alan Tanaman (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-421?page=all ]

Alan Tanaman updated NUTCH-421:
---

Description: 
I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the 
user to state in which order the indexing filters are to be run based on a new
indexingfilter.order property. This is needed when a filter needs to rely on 
previously generated document fields as a source of input to generate further 
fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:


<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Filter ordering may affect the end result when one filter depends on
  fields produced by another.
  </description>
</property>


Patch will be attached to this issue by 29/12/06











[jira] Created: (NUTCH-421) Allow predeterminate running order of index filters

2006-12-27 Thread Alan Tanaman (JIRA)
Allow predeterminate running order of index filters
---

 Key: NUTCH-421
 URL: http://issues.apache.org/jira/browse/NUTCH-421
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.8.1
 Environment: All
Reporter: Alan Tanaman
Priority: Minor


I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the 
user to state in which order the indexing filters are to be run based on a new
indexingfilter.order property. This is needed when a filter needs to rely on 
previously generated document fields as a source of input to generate further 
fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:


<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Filter ordering may affect the end result when one filter depends on
  fields produced by another.
  </description>
</property>
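
The ordering rule in the property description can be sketched roughly as follows. This is not the code from nutch-421.patch: the class FilterOrderSketch and the method resolveOrder are hypothetical, filters are represented by their class names only, and silently skipping names that are not loaded is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of order resolution; not the code from nutch-421.patch. */
final class FilterOrderSketch {

    /**
     * Returns the filter class names in the order they should run.
     *
     * @param available     names of all loaded filters, in system-defined order
     * @param orderProperty whitespace-separated class names, or null/blank
     */
    static List<String> resolveOrder(List<String> available, String orderProperty) {
        if (orderProperty == null || orderProperty.trim().isEmpty()) {
            // Empty property: apply every available filter in system order.
            return new ArrayList<>(available);
        }
        List<String> ordered = new ArrayList<>();
        for (String name : orderProperty.trim().split("\\s+")) {
            // Only named filters are applied; names that are not loaded
            // (or are repeated) are skipped. This is an assumption.
            if (available.contains(name) && !ordered.contains(name)) {
                ordered.add(name);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> available = Arrays.asList(
            "org.apache.nutch.indexer.more.MoreIndexingFilter",
            "org.apache.nutch.indexer.basic.BasicIndexingFilter");
        String order = "org.apache.nutch.indexer.basic.BasicIndexingFilter "
            + "org.apache.nutch.indexer.more.MoreIndexingFilter";
        System.out.println(resolveOrder(available, order));
    }
}
```

A filter that reads fields created by another filter would then be listed after that filter in the property value.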
  








[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2006-11-28 Thread Alan Tanaman (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453932 ] 

Alan Tanaman commented on NUTCH-407:


In our team we feel that this patch would have been beneficial in practical 
terms.  In the context of the enterprise intelligence solution which we are 
gradually porting over to Nutch, the emphasis is on ease of configuration.  We 
try to avoid exposing features such as the regex filter, which, although very 
powerful for a more experienced user, can be confusing to the novice.  
This is because we are primarily focused on the enterprise and less on the WWW.

This is why we preconfigure the db.ignore.external.links property to "true", 
and then only the urls file is used to seed the crawl.

Our ideal is to have a collection of predefined configuration settings for 
specific scenarios -- e.g. Enterprise-XML, Enterprise-Documents, 
Enterprise-Database, Internet-News etc.  We have a script that generates 
multiple crawlers, each one with different sources to be crawled.  Although it 
is possible, it isn't very practical to change the filters for each one 
manually based on the individual user requirements.

I realise this patch is closed, but how about another approach, in which 
FileResponse.java looks at db.ignore.external.links and decides on that basis 
whether to go up the tree?

Obviously, this would also prevent you from crawling outlinks to the WWW 
embedded in documents, but when crawling an enterprise file system, you usually 
don't want to go all over the place anyway.  As I see it, file systems are 
different to the web in that they are inherently hierarchical whereas the web 
is as its name implies, non-hierarchical.  Therefore, when crawling a file 
system, "going up" the tree is just as much an external URI (so to speak) as a 
link to a web site.
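
To make the suggestion concrete, here is a minimal sketch of the decision. It is not the actual FileResponse.java code: the class ParentLinkSketch and its methods are hypothetical, directory paths are assumed to end with '/', and the boolean ignoreExternalLinks stands in for the value of db.ignore.external.links.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the proposal; not the actual FileResponse.java code. */
final class ParentLinkSketch {

    /**
     * Builds the links emitted for a directory listing.
     * ignoreExternalLinks stands in for the db.ignore.external.links setting.
     */
    static List<String> directoryLinks(String dirPath, List<String> children,
                                       boolean ignoreExternalLinks) {
        List<String> links = new ArrayList<>();
        // Treat "going up" like an external link: skip it when externals are ignored.
        if (!ignoreExternalLinks && !"/".equals(dirPath)) {
            links.add("file:" + parentOf(dirPath));
        }
        for (String child : children) {
            links.add("file:" + dirPath + child);
        }
        return links;
    }

    /** Parent of a directory path that ends with '/', e.g. "/a/b/" gives "/a/". */
    static String parentOf(String dirPath) {
        int cut = dirPath.lastIndexOf('/', dirPath.length() - 2);
        return dirPath.substring(0, cut + 1);
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("report.doc", "notes.txt");
        System.out.println(directoryLinks("/data/docs/", children, true));
        System.out.println(directoryLinks("/data/docs/", children, false));
    }
}
```

With the flag set to true, only the children of the seeded directory are emitted, so the crawl stays below the paths listed in the urls file.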

*Ducks for cover*

Alan

> Make Nutch crawling parent directories for file protocol configurable
> -
>
>              Key: NUTCH-407
>              URL: http://issues.apache.org/jira/browse/NUTCH-407
>          Project: Nutch
>       Issue Type: Improvement
> Affects Versions: 0.8
>         Reporter: Thorsten Scherler
>      Assigned To: Andrzej Bialecki
>      Attachments: 407.fix.diff
>
>
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html
> I am looking into fixing some very weird behavior of the file protocol.
> I am using 0.8.
> Researching this topic I found 
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
> and
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> I am on Ubuntu but I have the same problem that nutch is going down the
> tree (including parents) and not up (including children from the root
> url).
> Further, I would vote to make fetching parents optional, controlled by a
> property that states whether I want this not very intuitive "feature".
