[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic
[ https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464738 ]

Alan Tanaman commented on NUTCH-422:
------------------------------------

Sami,

About your questions - thank you for looking at this plugin. I will be seeing to all of them and will respond over the next week, as I currently have a couple of stressed clients...

Best regards,
Alan

> index-extra plugin creates additional fields in the index, based on configurable logic
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-422
>                 URL: https://issues.apache.org/jira/browse/NUTCH-422
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.8.1
>         Environment: All environments
>            Reporter: Alan Tanaman
>         Assigned To: Sami Siren
>         Attachments: index-extra-v1.0-bin-java1.5.zip, index-extra-v1.0-source.zip
>
>
> Extract from the Readme file:
>
> A. Introduction
>
> The index-extra plugin allows you to configure additional fields that you wish to be added to the index, based on one of the following sources:
>   - The parsed text
>   - Meta data fields
>   - Previously created document-to-be-indexed fields
>   - Plain constant string
>   - Java expression combining one or more of the above, and resolving to a string
> A regex can also be applied to any of the above, allowing fields to be created based on patterns extracted from the source.
>
> B. Installation
>
> 1) Binaries only:
>      Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
>      Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>      Enable the plugin by updating the nutch-site.xml file
>
> 2) Source code: Always refer to the Nutch wiki for detailed instructions on building Nutch. In short:
>      Copy the 'index-extra' folder within index-extra-v1.0-source.zip to NUTCHDIR/src/plugin
>      Update the build.xml in NUTCHDIR/src/plugin to include the plugin
>      Update the NUTCHDIR/default.properties file to include the plugin
>      Run ant to build
>      Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
>      Enable the plugin by updating the nutch-site.xml file
>
> C. Known Issues
>
> 1) For this plugin to work correctly on any document field, it is necessary to run the other index filters first, so that all basic document fields have already been generated. To do this, configure the indexingfilter.order property. (Please see patch NUTCH-421 to enable the indexingfilter.order property. If this patch is not applied, the plugin will still work, but will not be able to use document fields created by other index filter plugins.)
>
> 2) At this stage, field boost cannot be used, as Nutch scoring overrides the field boost with its own document-level boost calculation. This occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic
[ http://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461863 ]

Alan Tanaman commented on NUTCH-422:
------------------------------------

Many thanks for your feedback. Do you have any specifics in mind regarding examples? I will try to include any additional ones that we implement. I know there are a lot of options, but it is a little hard to see what is unclear from my end. As I am so involved in the development, another point of view on this is welcome. ;)

Regarding query-extra, we are not currently using the Nutch bean, so the need has not arisen for us at this point in time, but I can see how that would be useful. I guess you could adapt one of the existing query-* plugins fairly easily by having it read the XML configuration file to see what fields are potentially available in the index.

As for the boost, I included that because being able to control the boost of a single field seems useful, although we don't need it at this very moment. The line of code in org.apache.nutch.indexer.Indexer's reduce method could be overridden, but I'm not yet sure how that would affect the overall scoring (scoring is one of my really weak points). Perhaps one of the scoring experts could give some guidance on this?
[jira] Updated: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic
[ http://issues.apache.org/jira/browse/NUTCH-422?page=all ]

Alan Tanaman updated NUTCH-422:
-------------------------------

    Attachment: index-extra-v1.0-bin-java1.5.zip
                index-extra-v1.0-source.zip
[jira] Created: (NUTCH-422) index-extra plugin creates additional fields in the index, based on configurable logic
index-extra plugin creates additional fields in the index, based on configurable logic
---------------------------------------------------------------------------------------

                 Key: NUTCH-422
                 URL: http://issues.apache.org/jira/browse/NUTCH-422
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
    Affects Versions: 0.8.1
         Environment: All environments
            Reporter: Alan Tanaman


Extract from the Readme file:

A. Introduction

The index-extra plugin allows you to configure additional fields that you wish to be added to the index, based on one of the following sources:
  - The parsed text
  - Meta data fields
  - Previously created document-to-be-indexed fields
  - Plain constant string
  - Java expression combining one or more of the above, and resolving to a string
A regex can also be applied to any of the above, allowing fields to be created based on patterns extracted from the source.

B. Installation

1) Binaries only:
     Copy the 'index-extra' folder within index-extra-v1.0-bin-java1.5.zip to NUTCHDIR/build
     Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
     Enable the plugin by updating the nutch-site.xml file

2) Source code: Always refer to the Nutch wiki for detailed instructions on building Nutch. In short:
     Copy the 'index-extra' folder within index-extra-v1.0-source.zip to NUTCHDIR/src/plugin
     Update the build.xml in NUTCHDIR/src/plugin to include the plugin
     Update the NUTCHDIR/default.properties file to include the plugin
     Run ant to build
     Copy the 'index-extra-conf.xml' file to NUTCHDIR/conf, and configure
     Enable the plugin by updating the nutch-site.xml file

C. Known Issues

1) For this plugin to work correctly on any document field, it is necessary to run the other index filters first, so that all basic document fields have already been generated. To do this, configure the indexingfilter.order property. (Please see patch NUTCH-421 to enable the indexingfilter.order property. If this patch is not applied, the plugin will still work, but will not be able to use document fields created by other index filter plugins.)

2) At this stage, field boost cannot be used, as Nutch scoring overrides the field boost with its own document-level boost calculation. This occurs at the end of org.apache.nutch.indexer.Indexer's reduce method.
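The Readme's introduction says a regex can be applied to any source to create a field from an extracted pattern. The sketch below illustrates only that idea, in plain Java. The class and method names (ExtraFieldSketch, extract) are hypothetical and are not taken from the index-extra source, and the plugin's actual configuration format in index-extra-conf.xml is not shown in this thread.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of index-extra or of Nutch. */
final class ExtraFieldSketch {

    private ExtraFieldSketch() {}

    /**
     * Applies a regex to a source string (parsed text, a metadata value,
     * a previously created field, or a constant) and returns the first
     * capturing group of the first match, if any.
     */
    static Optional<String> extract(String source, String regex) {
        Matcher m = Pattern.compile(regex).matcher(source);
        // Assumption: the regex declares at least one capturing group
        // that marks the value to store in the new field.
        if (m.find() && m.groupCount() >= 1 && m.group(1) != null) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }
}
```

For instance, calling ExtraFieldSketch.extract on the parsed text "Invoice no. INV-2041 issued" with the pattern "(INV-\\d+)" yields the value INV-2041, which an indexing filter could then add to the document as a separate field. When nothing matches, the result is empty and no field would be added.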
[jira] Updated: (NUTCH-421) Allow predeterminate running order of index filters
[ http://issues.apache.org/jira/browse/NUTCH-421?page=all ]

Alan Tanaman updated NUTCH-421:
-------------------------------

    Description: 
I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the user to state in which order the indexing filters are to be run based on a new indexingfilter.order property. This is needed when a filter needs to rely on previously generated document fields as a source of input to generate further fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:

<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter
  org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>

  was:
I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the user to state in which order the indexing filters are to be run based on a new indexingfilter.order property. This is needed when a filter needs to rely on previously generated document fields as a source of input to generate further fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:

<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter
  org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>

Patch will be attached to this issue by 29/12/06
[jira] Updated: (NUTCH-421) Allow predeterminate running order of index filters
[ http://issues.apache.org/jira/browse/NUTCH-421?page=all ]

Alan Tanaman updated NUTCH-421:
-------------------------------

    Attachment: nutch-421.patch
[jira] Updated: (NUTCH-421) Allow predeterminate running order of index filters
[ http://issues.apache.org/jira/browse/NUTCH-421?page=all ]

Alan Tanaman updated NUTCH-421:
-------------------------------

    Description: 
I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the user to state in which order the indexing filters are to be run based on a new indexingfilter.order property. This is needed when a filter needs to rely on previously generated document fields as a source of input to generate further fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:

<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter
  org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>

Patch will be attached to this issue by 29/12/06

  was:
I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the user to state in which order the indexing filters are to be run based on a new indexingfilter.order property. This is needed when a filter needs to rely on previously generated document fields as a source of input to generate further fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:

<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter
  org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>
[jira] Created: (NUTCH-421) Allow predeterminate running order of index filters
Allow predeterminate running order of index filters
---------------------------------------------------

                 Key: NUTCH-421
                 URL: http://issues.apache.org/jira/browse/NUTCH-421
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 0.8.1
         Environment: All
            Reporter: Alan Tanaman
            Priority: Minor


I've tested a patch for org.apache.nutch.indexer.IndexingFilters, allowing the user to state in which order the indexing filters are to be run based on a new indexingfilter.order property. This is needed when a filter needs to rely on previously generated document fields as a source of input to generate further fields.

As suggested elsewhere, I based this on the urlfilter.order functionality:

<property>
  <name>indexingfilter.order</name>
  <value>org.apache.nutch.indexer.basic.BasicIndexingFilter
  org.apache.nutch.indexer.more.MoreIndexingFilter</value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>
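The behaviour described for indexingfilter.order (empty means system-defined order, otherwise only the named filters run, in the given order) can be sketched in plain Java as follows. This is not the code from nutch-421.patch, which is not shown in this thread. FilterOrderSketch and resolveOrder are hypothetical names, and skipping names that are not among the available filters is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the ordering rule; not the NUTCH-421 patch. */
final class FilterOrderSketch {

    private FilterOrderSketch() {}

    /**
     * @param available     class names of the enabled indexing filters,
     *                      in the order the system discovered them
     * @param orderProperty value of indexingfilter.order: class names
     *                      separated by whitespace; may be null or blank
     * @return the class names of the filters to apply, in order
     */
    static List<String> resolveOrder(List<String> available, String orderProperty) {
        if (orderProperty == null || orderProperty.trim().isEmpty()) {
            // Empty property: keep every available filter in system order.
            return new ArrayList<>(available);
        }
        List<String> ordered = new ArrayList<>();
        for (String name : orderProperty.trim().split("\\s+")) {
            // Assumption: unknown or repeated names are silently skipped.
            if (available.contains(name) && !ordered.contains(name)) {
                ordered.add(name);
            }
        }
        return ordered;
    }
}
```

With BasicIndexingFilter and MoreIndexingFilter named in that order, resolveOrder returns exactly those two, Basic first, regardless of the order in which the plugin system found them. A filter that reads fields produced by others would be listed last.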
[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable
[ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453932 ]

Alan Tanaman commented on NUTCH-407:
------------------------------------

In our team we feel that this patch would have been beneficial in practical terms.

In the context of the enterprise intelligence solution which we are gradually porting over to Nutch, the emphasis is on ease of configuration. We try to avoid exposing features such as the regex filter, which, although very powerful for a more experienced user, are perhaps confusing to the novice. This is because we are primarily focused on the enterprise and less on the WWW. This is why we preconfigure the db.ignore.external.links property to "true", and then only the urls file is used to seed the crawl.

Our ideal is to have a collection of predefined configuration settings for specific scenarios (e.g. Enterprise-XML, Enterprise-Documents, Enterprise-Database, Internet-News). We have a script that generates multiple crawlers, each one with different sources to be crawled, and although it is possible, it isn't practical to change the filters for each one manually based on the individual user requirements.

I realise this patch is closed, but how about another approach in which FileResponse.java looks at db.ignore.external.links and decides based on this whether to go up the tree? Obviously, this would also prevent you from crawling outlinks to the WWW embedded in documents, but when crawling an enterprise file system, you usually don't want to go all over the place anyway.

As I see it, file systems are different to the web in that they are inherently hierarchical, whereas the web is, as its name implies, non-hierarchical. Therefore, when crawling a file system, "going up" the tree is just as much an external URI (so to speak) as a link to a web site.

*Ducks for cover*

Alan

> Make Nutch crawling parent directories for file protocol configurable
> ---------------------------------------------------------------------
>
>                 Key: NUTCH-407
>                 URL: http://issues.apache.org/jira/browse/NUTCH-407
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Thorsten Scherler
>         Assigned To: Andrzej Bialecki
>         Attachments: 407.fix.diff
>
>
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html
> I am looking into fixing some very weird behavior of the file protocol.
> I am using 0.8.
> Researching this topic I found
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
> and
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> I am on Ubuntu but I have the same problem that nutch is going down the
> tree (including parents) and not up (including children from the root
> url).
> Further I would vote to make the fetch-parents optional and defined per
> a property whether I would like this not very intuitive "feature".
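The suggestion in the comment above, treating a move up the directory tree as an external link when db.ignore.external.links is true, could be sketched roughly as below. This is only an illustration of the decision rule in plain Java. FileScopeSketch and its methods are hypothetical, and nothing here reflects how FileResponse.java is actually written.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical illustration; not code from Nutch's file protocol plugin. */
final class FileScopeSketch {

    private FileScopeSketch() {}

    /** True if the candidate is the seed directory itself or lies beneath it. */
    static boolean isInsideSeed(String seedDir, String candidate) {
        Path seed = Paths.get(seedDir).normalize();
        Path target = Paths.get(candidate).normalize();
        // Path.startsWith compares whole path elements, so "/data/docs2"
        // is not treated as being inside "/data/docs".
        return target.startsWith(seed);
    }

    /**
     * Assumed rule: with ignoreExternalLinks set (mirroring
     * db.ignore.external.links=true), a parent directory or any other
     * path outside the seed is not followed.
     */
    static boolean shouldFollow(String seedDir, String candidate, boolean ignoreExternalLinks) {
        return !ignoreExternalLinks || isInsideSeed(seedDir, candidate);
    }
}
```

Under this rule, a crawl seeded at /data/docs with the flag set would follow /data/docs/reports/q1.pdf but not the parent /data. With the flag unset, both would be followed, which matches the existing behaviour the issue complains about.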