[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-15 Thread Alan Tanaman (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464738
 ] 

Alan Tanaman commented on NUTCH-422:


Sami,

About your questions - thank you for looking at this plugin.  I will be
seeing to all of them and will respond over the next week, as I currently
have a couple of stressed clients...

Best regards,
Alan


 index-extra plugin creates additional fields in the index, based on 
 configurable logic
 --

 Key: NUTCH-422
 URL: https://issues.apache.org/jira/browse/NUTCH-422
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.8.1
 Environment: All environments
Reporter: Alan Tanaman
 Assigned To: Sami Siren
 Attachments: index-extra-v1.0-bin-java1.5.zip, 
 index-extra-v1.0-source.zip


 Extract from the Readme file:
 A.  Introduction
 The index-extra plugin allows you to configure additional fields that you 
 wish to be added to the index, based on one of the following sources:
   - The parsed text
   - Meta data fields
   - Previously created document-to-be-indexed fields
   - Plain constant string
   - Java expression combining one or more of the above, and resolving to 
 a string
 A regex can also be applied to any of the above, allowing fields to be 
 created based on patterns extracted from the source.
 B.  Installation
 1)  Binaries only:
       - Copy the 'index-extra' folder within 
         index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
       - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
       - Enable the plugin by updating the nutch-site.xml file
 2)  Source code:  Always refer to the Nutch wiki for detailed 
     instructions on building Nutch.  In short:
       - Copy the 'index-extra' folder within 
         index-extra-v1.0-source.zip to NUTCHDIR/src/plugin
       - Update the build.xml in NUTCHDIR/src/plugin to include plugin
       - Update the NUTCHDIR/default.properties file to include plugin
       - Run ant to build
       - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
       - Enable the plugin by updating the nutch-site.xml file
 C.  Known Issues
 1)  For this plugin to work correctly on any document field, it is 
     necessary to run the other index filters first, so that all basic 
     document fields are generated first.  To do this, configure the 
     indexingfilter.order property.  (Please see patch NUTCH-421 to enable 
     indexingfilter.order property. If this patch is not applied, the plugin 
     will still work, but will not be able to use document fields created by 
     other index filter plugins.)
 2)  At this stage, field boost can not be used as Nutch scoring overrides 
     the field boost with its own document-level boost calculation.  This 
     occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2007-01-02 Thread Alan Tanaman (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461863
 ] 

Alan Tanaman commented on NUTCH-422:


Many thanks for your feedback.

Do you have any specifics in mind regarding examples?  I will try and include 
any additional ones that we implement.  I know there are a lot of options, but 
it is a little hard to see what is unclear from my end -- as I am so involved 
in the development, another point-of-view on this is welcome.
;)

Regarding query-extra, we are not currently using the Nutch bean, so the need 
has not arisen for us at this point in time, but I can see how that would be 
useful.  I guess you could adapt one of the existing query- plugins fairly 
easily by having them read the xml configuration file to see what fields are 
potentially available in the index.

As for the boost, I included that as it seems like a useful thing to be able to 
control the boost of a single field, although we don't need that at this very 
moment.  The line of code in the org.apache.nutch.indexer.Indexer's
reduce method could be overridden, but I'm not yet sure how that would affect 
the overall scoring (scoring is one of my really weak points).
Perhaps one of the scoring experts could give some guidance on this?

 index-extra plugin creates additional fields in the index, based on 
 configurable logic
 --

 Key: NUTCH-422
 URL: http://issues.apache.org/jira/browse/NUTCH-422
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.8.1
 Environment: All environments
Reporter: Alan Tanaman
 Attachments: index-extra-v1.0-bin-java1.5.zip, 
 index-extra-v1.0-source.zip







[jira] Created: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2006-12-28 Thread Alan Tanaman (JIRA)
index-extra plugin creates additional fields in the index, based on 
configurable logic
--

 Key: NUTCH-422
 URL: http://issues.apache.org/jira/browse/NUTCH-422
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.8.1
 Environment: All environments
Reporter: Alan Tanaman


Extract from the Readme file:

A.  Introduction

The index-extra plugin allows you to configure additional fields that you 
wish to be added to the index, based on one of the following sources:
  - The parsed text
  - Meta data fields
  - Previously created document-to-be-indexed fields
  - Plain constant string
  - Java expression combining one or more of the above, and resolving to a 
string
A regex can also be applied to any of the above, allowing fields to be 
created based on patterns extracted from the source.
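
As a rough illustration of the regex option described above, the following Java sketch pulls a pattern out of a source string and stores the result as a new field. It is not taken from the plugin: the class name ExtraFieldSketch, the field name invoiceId, the sample text, and the Map standing in for the document-to-be-indexed are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; not the index-extra plugin's actual code. */
final class ExtraFieldSketch {

    /**
     * Returns the first capture group of the first match of regex in source
     * (or the whole match if the regex has no groups), or null if nothing matches.
     */
    static String extract(String source, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        if (!m.find()) {
            return null;
        }
        return m.groupCount() > 0 ? m.group(1) : m.group();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the parsed text of a fetched page
        String parsedText = "Invoice INV-20070115 issued to ACME Ltd.";

        // Hypothetical stand-in for the document-to-be-indexed
        Map<String, String> doc = new LinkedHashMap<String, String>();

        String value = extract(parsedText, "(INV-\\d+)");
        if (value != null) {
            doc.put("invoiceId", value); // add the extra field only when the pattern matched
        }
        System.out.println(doc);
    }
}
```

The same extract call could be pointed at a metadata value, an existing document field, or a constant, which is the sense in which a regex can be applied to any of the sources listed above.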

B.  Installation

1)  Binaries only:
      - Copy the 'index-extra' folder within 
        index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
      - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
      - Enable the plugin by updating the nutch-site.xml file
2)  Source code:  Always refer to the Nutch wiki for detailed 
    instructions on building Nutch.  In short:
      - Copy the 'index-extra' folder within 
        index-extra-v1.0-source.zip to NUTCHDIR/src/plugin
      - Update the build.xml in NUTCHDIR/src/plugin to include plugin
      - Update the NUTCHDIR/default.properties file to include plugin
      - Run ant to build
      - Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
      - Enable the plugin by updating the nutch-site.xml file

C.  Known Issues

1)  For this plugin to work correctly on any document field, it is 
    necessary to run the other index filters first, so that all basic 
    document fields are generated first.  To do this, configure the 
    indexingfilter.order property.  (Please see patch NUTCH-421 to enable 
    indexingfilter.order property. If this patch is not applied, the plugin 
    will still work, but will not be able to use document fields created by 
    other index filter plugins.)

2)  At this stage, field boost can not be used as Nutch scoring overrides 
    the field boost with its own document-level boost calculation.  This 
    occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.






[jira] Updated: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic

2006-12-28 Thread Alan Tanaman (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-422?page=all ]

Alan Tanaman updated NUTCH-422:
---

Attachment: index-extra-v1.0-bin-java1.5.zip
index-extra-v1.0-source.zip

 index-extra plugin creates additional fields in the index, based on 
 configurable logic
 --

 Key: NUTCH-422
 URL: http://issues.apache.org/jira/browse/NUTCH-422
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.8.1
 Environment: All environments
Reporter: Alan Tanaman
 Attachments: index-extra-v1.0-bin-java1.5.zip, 
 index-extra-v1.0-source.zip







[jira] Created: (NUTCH-421) Allow predeterminate running order of index filters

2006-12-27 Thread Alan Tanaman (JIRA)
Allow predeterminate running order of index filters
---

 Key: NUTCH-421
 URL: http://issues.apache.org/jira/browse/NUTCH-421
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.8.1
 Environment: All
Reporter: Alan Tanaman
Priority: Minor


I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the 
user to state in which order the indexing filters are to be run based on a new
indexingfilter.order property. This is needed when a filter needs to rely on 
previously generated document fields as a source of input to generate further 
fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:

<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>
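
The behaviour that the description above gives for indexingfilter.order can be sketched in Java roughly as follows. This is not the code from the patch: the class FilterOrderSketch and its order method are hypothetical, and the filters are represented by a plain Map keyed by class name rather than by Nutch's plugin classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; not the IndexingFilters patch itself. */
final class FilterOrderSketch {

    /**
     * @param available     loaded filters keyed by class name, in system-defined order
     * @param orderProperty whitespace-separated class names; may be null or empty
     * @return the filters in the order they should be applied
     */
    static <T> List<T> order(Map<String, T> available, String orderProperty) {
        // Empty property: apply every available filter in system-defined order
        if (orderProperty == null || orderProperty.trim().isEmpty()) {
            return new ArrayList<T>(available.values());
        }
        // Non-empty property: apply only the named filters, in the given order
        List<T> ordered = new ArrayList<T>();
        for (String name : orderProperty.trim().split("\\s+")) {
            T filter = available.get(name);
            if (filter != null) { // names that were not loaded are skipped
                ordered.add(filter);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Strings stand in for filter instances in this sketch
        Map<String, String> available = new LinkedHashMap<String, String>();
        available.put("org.apache.nutch.indexer.more.MoreIndexingFilter", "more");
        available.put("org.apache.nutch.indexer.basic.BasicIndexingFilter", "basic");

        System.out.println(order(available,
            "org.apache.nutch.indexer.basic.BasicIndexingFilter "
            + "org.apache.nutch.indexer.more.MoreIndexingFilter"));
    }
}
```

A filter that derives fields from other filters' output would be named last in the property value, so that the fields it reads already exist when it runs.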







[jira] Updated: (NUTCH-421) Allow predeterminate running order of index filters

2006-12-27 Thread Alan Tanaman (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-421?page=all ]

Alan Tanaman updated NUTCH-421:
---

Description: 
I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the 
user to state in which order the indexing filters are to be run based on a new
indexingfilter.order property. This is needed when a filter needs to rely on 
previously generated document fields as a source of input to generate further 
fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:

<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>

Patch will be attached to this issue by 29/12/06





 Allow predeterminate running order of index filters
 ---

 Key: NUTCH-421
 URL: http://issues.apache.org/jira/browse/NUTCH-421
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.8.1
 Environment: All
Reporter: Alan Tanaman
Priority: Minor






[jira] Updated: (NUTCH-421) Allow predeterminate running order of index filters

2006-12-27 Thread Alan Tanaman (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-421?page=all ]

Alan Tanaman updated NUTCH-421:
---

Attachment: nutch-421.patch

 Allow predeterminate running order of index filters
 ---

 Key: NUTCH-421
 URL: http://issues.apache.org/jira/browse/NUTCH-421
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.8.1
 Environment: All
Reporter: Alan Tanaman
Priority: Minor
 Attachments: nutch-421.patch







[jira] Updated: (NUTCH-421) Allow predeterminate running order of index filters

2006-12-27 Thread Alan Tanaman (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-421?page=all ]

Alan Tanaman updated NUTCH-421:
---

Description: 
I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the 
user to state in which order the indexing filters are to be run based on a new
indexingfilter.order property. This is needed when a filter needs to rely on 
previously generated document fields as a source of input to generate further 
fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:

<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>




 Allow predeterminate running order of index filters
 ---

 Key: NUTCH-421
 URL: http://issues.apache.org/jira/browse/NUTCH-421
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.8.1
 Environment: All
Reporter: Alan Tanaman
Priority: Minor
 Attachments: nutch-421.patch







[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2006-11-28 Thread Alan Tanaman (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453932 ] 

Alan Tanaman commented on NUTCH-407:


In our team we feel that this patch would have been beneficial in practical 
terms.  In the context of the enterprise intelligence solution which we are 
gradually porting over to Nutch, the emphasis is on ease of configuration.  We 
try to avoid exposing features such as the regex filter, which, although very 
powerful for a more experienced user, can be confusing to the novice.  
This is because we are primarily focused on the enterprise and less on the WWW.

This is why we preconfigure the db.ignore.external.links property to true, 
and then only the urls file is used to seed the crawl.

Our ideal is to have a collection of predefined configuration settings for 
specific scenarios -- e.g. Enterprise-XML, Enterprise-Documents, 
Enterprise-Database, Internet-News etc.  We have a script that generates 
multiple crawlers, each one with different sources to be crawled, and although 
it is possible, it isn't practical to change the filters for each one 
manually based on the individual user requirements.

I realise this patch is closed, but how about another approach, in which 
FileResponse.java looks at db.ignore.external.links and decides based on this 
whether to go up the tree?

Obviously, this would also prevent you from crawling outlinks to the WWW 
embedded in documents, but when crawling an enterprise file system, you usually 
don't want to go all over the place anyway.  As I see it, file systems are 
different to the web in that they are inherently hierarchical, whereas the web 
is, as its name implies, non-hierarchical.  Therefore, when crawling a file 
system, going up the tree is just as much an external URI (so to speak) as a 
link to a web site.
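
To make the suggestion concrete, here is a minimal Java sketch of the decision, assuming the value of db.ignore.external.links has already been read into a boolean. It is not FileResponse.java: the class ParentLinkSketch, the outlinks method, and the way a directory listing is represented here are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; not Nutch's FileResponse. */
final class ParentLinkSketch {

    /**
     * Builds the links emitted for a directory listing.
     *
     * @param dirUrl              directory URL, assumed to end with "/"
     * @param children            names of the entries inside the directory
     * @param ignoreExternalLinks assumed value of db.ignore.external.links
     */
    static List<String> outlinks(String dirUrl, List<String> children,
                                 boolean ignoreExternalLinks) {
        List<String> links = new ArrayList<String>();
        // Treat the parent directory like an external link: emit it only
        // when external links are not being ignored
        if (!ignoreExternalLinks) {
            links.add(dirUrl + "../");
        }
        for (String child : children) {
            links.add(dirUrl + child);
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("report.doc", "archive/");
        System.out.println(outlinks("file:/data/docs/", children, true));
        System.out.println(outlinks("file:/data/docs/", children, false));
    }
}
```

With the flag set to true the crawl stays at or below the seeded directory, which is the behaviour argued for above.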

*Ducks for cover*

Alan

 Make Nutch crawling parent directories for file protocol configurable
 -

 Key: NUTCH-407
 URL: http://issues.apache.org/jira/browse/NUTCH-407
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Thorsten Scherler
 Assigned To: Andrzej Bialecki 
 Attachments: 407.fix.diff


 http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html
 I am looking into fixing some very weird behavior of the file protocol.
 I am using 0.8.
 Researching this topic I found 
 http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
 and
 http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
 I am on Ubuntu but I have the same problem that nutch is going down the
 tree (including parents) and not up (including children from the root
 url).
 Further I would vote to make the fetch-parents optional and defined per
 a property whether I would like this not very intuitive feature.
