[jira] [Resolved] (NUTCH-2199) Documentation for Nutch 2.X REST API
[ https://issues.apache.org/jira/browse/NUTCH-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-2199.
Resolution: Fixed

> Documentation for Nutch 2.X REST API
>
> Key: NUTCH-2199
> URL: https://issues.apache.org/jira/browse/NUTCH-2199
> Project: Nutch
> Issue Type: New Feature
> Components: documentation, REST_api
> Affects Versions: 2.3.1
> Reporter: Lewis John McGibbney
> Assignee: Furkan KAMACI
> Priority: Minor
> Fix For: 2.5
>
> The work done on NUTCH-1800 needs to be ported to 2.X branch. This is trivial, I thought I had already done it but obviously not.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2660) Unit tests of plugins parse-js, headings, index-jexl-filter to be executed during build
[ https://issues.apache.org/jira/browse/NUTCH-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655308#comment-16655308 ]

ASF GitHub Bot commented on NUTCH-2660:

jorgelbg commented on issue #397: NUTCH-2660 Plugin tests not executed
URL: https://github.com/apache/nutch/pull/397#issuecomment-431029609

+1, thanks for including the index-jexl-filter as well :)

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Unit tests of plugins parse-js, headings, index-jexl-filter to be executed during build
>
> Key: NUTCH-2660
> URL: https://issues.apache.org/jira/browse/NUTCH-2660
> Project: Nutch
> Issue Type: Improvement
> Components: build, test
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.16
>
> The unit tests of the plugins "parse-js", "headings" and "index-jexl-filter" are not executed during build.
[jira] [Commented] (NUTCH-2663) Improve index-jexl-filter syntax for scripts
[ https://issues.apache.org/jira/browse/NUTCH-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655305#comment-16655305 ]

ASF GitHub Bot commented on NUTCH-2663:

jorgelbg opened a new pull request #400: NUTCH-2663 Improve the JEXL syntax for getting values from the context
URL: https://github.com/apache/nutch/pull/400

* Avoids the use of the array notation when getting values from the document/metadata(s) in the JEXL expression. We go from `doc.lang[0] == 'en'` to `doc.lang == 'en'`, which is easier to understand.
* Log using errors instead of warnings in the `setConf` methods. We throw a `RuntimeException` if something is wrong, so the log should represent the same severity level.
* Some minor changes and additional comments.

> Improve index-jexl-filter syntax for scripts
>
> Key: NUTCH-2663
> URL: https://issues.apache.org/jira/browse/NUTCH-2663
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.16
> Reporter: Jorge Luis Betancourt Gonzalez
> Assignee: Jorge Luis Betancourt Gonzalez
> Priority: Minor
>
> JEXL scripts need to be written using the array syntax to get the actual value (for instance, example extracted from the tests):
> {code}
> doc.lang[0]=='en'
> {code}
> Ideally, this would only be required if the actual value is really an array, and not for single value elements.
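The syntax change described in this pull request can be illustrated with a small sketch. It is written in Python rather than Java/JEXL, and `build_context` is a hypothetical helper that is not part of Nutch. It only mirrors the idea of exposing single-valued fields as bare values, so an expression can test `doc.lang == 'en'` directly, while fields that really hold several values stay lists.

```python
from types import SimpleNamespace


def build_context(fields):
    """Unwrap single-valued fields; keep multi-valued fields as lists.

    `fields` maps a field name to the list of values stored for it.
    Hypothetical helper: it only mirrors the idea behind NUTCH-2663.
    """
    unwrapped = {}
    for name, values in fields.items():
        # A one-element list is exposed as the bare value.
        unwrapped[name] = values[0] if len(values) == 1 else list(values)
    return SimpleNamespace(**unwrapped)


doc = build_context({"lang": ["en"], "anchor": ["home", "start"]})

# The old style needed doc.lang[0] == 'en'; with unwrapping the check is direct.
is_english = doc.lang == "en"
# Fields that really hold several values remain lists.
has_two_anchors = len(doc.anchor) == 2
```

The trade-off of this design is that the type of a field now depends on how many values it holds, so expressions over fields that are only sometimes multi-valued still need care.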
[jira] [Work started] (NUTCH-2663) Improve index-jexl-filter syntax for scripts
[ https://issues.apache.org/jira/browse/NUTCH-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-2663 started by Jorge Luis Betancourt Gonzalez.

> Improve index-jexl-filter syntax for scripts
>
> Key: NUTCH-2663
> URL: https://issues.apache.org/jira/browse/NUTCH-2663
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.16
> Reporter: Jorge Luis Betancourt Gonzalez
> Assignee: Jorge Luis Betancourt Gonzalez
> Priority: Minor
>
> JEXL scripts need to be written using the array syntax to get the actual value (for instance, example extracted from the tests):
> {code}
> doc.lang[0]=='en'
> {code}
> Ideally, this would only be required if the actual value is really an array, and not for single value elements.
[jira] [Created] (NUTCH-2663) Improve index-jexl-filter syntax for scripts
Jorge Luis Betancourt Gonzalez created NUTCH-2663:

Summary: Improve index-jexl-filter syntax for scripts
Key: NUTCH-2663
URL: https://issues.apache.org/jira/browse/NUTCH-2663
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.16
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez

JEXL scripts need to be written using the array syntax to get the actual value (for instance, example extracted from the tests):
{code}
doc.lang[0]=='en'
{code}
Ideally, this would only be required if the actual value is really an array, and not for single value elements.
[jira] [Commented] (NUTCH-2662) index-jexl-filter plugin throws a RuntimeException if its enabled but not configured
[ https://issues.apache.org/jira/browse/NUTCH-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655180#comment-16655180 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2662:

[~yossi] Yes, I know :). At the time, I didn't notice that the actual recommendation (the message shown to the user) is to pick an expression (true/false), which can be done automatically while logging the default value. The second validation, which checks whether the expression is syntactically correct, is still valid.

> index-jexl-filter plugin throws a RuntimeException if its enabled but not configured
>
> Key: NUTCH-2662
> URL: https://issues.apache.org/jira/browse/NUTCH-2662
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.16
> Reporter: Jorge Luis Betancourt Gonzalez
> Assignee: Jorge Luis Betancourt Gonzalez
> Priority: Minor
>
> If the index-jexl-filter plugin is enabled but no configuration is provided in the {{index.jexl.filter}} property the plugin throws a RuntimeException. In the same exception message, we advise to either set true or false to index all/none.
> This is a case where we can just select a sane default and log a warning, but not stop the entire process. I think this is more consistent with how we approach configuration in general: Only fail if there is an actual error in the configuration (i.e parse error on the expression).
[jira] [Commented] (NUTCH-2651) Upgrade to Tika 1.19.1 (from 1.18)
[ https://issues.apache.org/jira/browse/NUTCH-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655133#comment-16655133 ]

Markus Jelsma commented on NUTCH-2651:

+1, and thanks for finding the javax-ws fix; I could immediately apply it to our custom parse plugin.

> Upgrade to Tika 1.19.1 (from 1.18)
>
> Key: NUTCH-2651
> URL: https://issues.apache.org/jira/browse/NUTCH-2651
> Project: Nutch
> Issue Type: Improvement
> Components: parser, protocol
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
> [Tika 1.19.1|https://tika.apache.org/1.19.1/index.html] has been released recently. Among all the other improvements and fixes (including those of [1.19|https://tika.apache.org/1.19/index.html]), it contains one important performance fix (TIKA-2645, cf. NUTCH-2578) affecting the MIME-/Content-Type detector.
[jira] [Commented] (NUTCH-2662) index-jexl-filter plugin throws a RuntimeException if its enabled but not configured
[ https://issues.apache.org/jira/browse/NUTCH-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655082#comment-16655082 ]

Yossi Tamari commented on NUTCH-2662:

[~jorgelbg], this goes against your own review comment on this plugin: [https://github.com/apache/nutch/pull/219#discussion_r136044914]

> index-jexl-filter plugin throws a RuntimeException if its enabled but not configured
>
> Key: NUTCH-2662
> URL: https://issues.apache.org/jira/browse/NUTCH-2662
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.16
> Reporter: Jorge Luis Betancourt Gonzalez
> Assignee: Jorge Luis Betancourt Gonzalez
> Priority: Minor
>
> If the index-jexl-filter plugin is enabled but no configuration is provided in the {{index.jexl.filter}} property the plugin throws a RuntimeException. In the same exception message, we advise to either set true or false to index all/none.
> This is a case where we can just select a sane default and log a warning, but not stop the entire process. I think this is more consistent with how we approach configuration in general: Only fail if there is an actual error in the configuration (i.e parse error on the expression).
[jira] [Created] (NUTCH-2662) index-jexl-filter plugin throws a RuntimeException if its enabled but not configured
Jorge Luis Betancourt Gonzalez created NUTCH-2662:

Summary: index-jexl-filter plugin throws a RuntimeException if its enabled but not configured
Key: NUTCH-2662
URL: https://issues.apache.org/jira/browse/NUTCH-2662
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.16
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez

If the index-jexl-filter plugin is enabled but no configuration is provided in the {{index.jexl.filter}} property the plugin throws a RuntimeException. In the same exception message, we advise to either set true or false to index all/none.
This is a case where we can just select a sane default and log a warning, but not stop the entire process. I think this is more consistent with how we approach configuration in general: Only fail if there is an actual error in the configuration (i.e parse error on the expression).
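The behaviour proposed in this issue (warn and fall back when the property is unset, fail only on a real parse error) could look roughly like the sketch below. It is Python, not the plugin's Java code. `resolve_expression`, the logger name, and the default of `"true"` are assumptions for illustration, since the issue only asks for "a sane default". Python's own parser stands in for a JEXL parser, so it checks Python expression syntax, not JEXL syntax.

```python
import ast
import logging

logger = logging.getLogger("index-jexl-filter-sketch")

# Assumed default (index everything); the issue does not name the default.
DEFAULT_EXPRESSION = "true"


def resolve_expression(conf):
    """Return the filter expression, falling back to a default when unset."""
    expression = (conf.get("index.jexl.filter") or "").strip()
    if not expression:
        # Missing configuration is not treated as an error: warn and continue.
        logger.warning(
            "index.jexl.filter is not set; using default %r", DEFAULT_EXPRESSION
        )
        return DEFAULT_EXPRESSION
    try:
        # Stand-in syntax check using Python's parser, not a JEXL parser.
        ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        # A malformed expression is an actual configuration error, so fail.
        raise RuntimeError(f"Failed to parse expression: {expression}") from exc
    return expression
```

Calling `resolve_expression({})` returns the assumed default and logs a warning, while an incomplete expression such as `doc.lang ==` raises a `RuntimeException`-like error.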
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654864#comment-16654864 ]

ASF GitHub Bot commented on NUTCH-2658:

jorgelbg commented on a change in pull request #398: NUTCH-2658 Add README for the index-links plugin
URL: https://github.com/apache/nutch/pull/398#discussion_r226215078

## File path: src/plugin/index-links/README.md ##
@@ -0,0 +1,53 @@
+indexer-links plugin for Nutch
+==
+
+This plugin provides the feature to index the inlinks and outlinks of a URL
+into an indexing backend.
+
+## Configuration
+
+This plugin provides the following configuration options:
+
+* `index.links.outlinks.host.ignore`: If true, the plugin will ignore outlinks
+that point to the same host as the current URL. By default, all outlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.inlinks.host.ignore`: If true, the plugin will ignore inlinks
+coming from the same host as the current URL. By default, all inlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.hosts.only`: If true, the plugin will index only the host portion of the inlinks/outlinks URLs.
+
+## Fields
+
+For this plugin to work 2 new fields have to be added/configured in your storage backend:
+
+* `inlinks`
+* `outlinks`
+
+If the plugin is enabled these fields have to be added to your storage backend
+configuration.
+
+The specifics of how these fields are configured depends on your specific
+backend. We provide here sane default values for Solr.
+
+The following fields should be added to your backend storage. We provide
+examples of default values for the Solr schema.
+
+* Each outlink/inlink will be stored as a string without any tokenization.
+* The `inlink`/`outlink` fields have to be multivalued, because normally a
+given URL will have multiple inlinks and outlinks.
+
+```
+
+```
+
+The field configuration could look like:
+
+```
+

Review comment:
Yes, I realized last night that the fields are missing from the `conf/schema.xml` file. I'm going to add them there as well.

> Add README file to all plugins in src/plugin
>
> Key: NUTCH-2658
> URL: https://issues.apache.org/jira/browse/NUTCH-2658
> Project: Nutch
> Issue Type: Improvement
> Components: documentation, plugin
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Trivial
>
> Since we've migrated a good portion of our workflow to Github we could consider adding a {{README.md}} file to the root of each plugin in {{src/plugins}}.
> This is a good place to have plugin-specific documentation: which fields the plugin adds to the indexer, which configuration options, etc. Also, since the README.md is rendered by Github automatically, it is a good link to point users to.
> I think that a good example is the {{indexer-cloudsearch}} plugin; on top of that, it's a good source of information to point users to when they ask questions regarding a specific plugin.
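The configuration options described in the README under review can be sketched as a small filter. This is a Python illustration under stated assumptions, not the plugin's Java implementation. `select_links` and its parameter names are hypothetical: `ignore_same_host` stands for `index.links.outlinks.host.ignore` / `index.links.inlinks.host.ignore`, and `hosts_only` stands for `index.links.hosts.only`. The interaction with `db.ignore.internal.links` is not modelled.

```python
from urllib.parse import urlsplit


def select_links(page_url, links, ignore_same_host=False, hosts_only=False):
    """Filter and shape the links that would be indexed for one page."""
    page_host = urlsplit(page_url).hostname
    selected = []
    for link in links:
        host = urlsplit(link).hostname
        if ignore_same_host and host == page_host:
            continue  # skip links that stay on the page's own host
        value = host if hosts_only else link
        if value is not None:
            selected.append(value)  # one entry per link, hence multivalued
    return selected


links = ["https://example.org/a", "https://other.example/b"]
external = select_links("https://example.org/", links, ignore_same_host=True)
hosts = select_links("https://example.org/", links, hosts_only=True)
```

Because a page usually yields several entries, the result is a list, which matches the README's note that the `inlinks`/`outlinks` fields have to be multivalued.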
[jira] [Updated] (NUTCH-2660) Unit tests of plugins parse-js, headings, index-jexl-filter to be executed during build
[ https://issues.apache.org/jira/browse/NUTCH-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2660:
Summary: Unit tests of plugins parse-js, headings, index-jexl-filter to be executed during build (was: Unit tests of not executed)

> Unit tests of plugins parse-js, headings, index-jexl-filter to be executed during build
>
> Key: NUTCH-2660
> URL: https://issues.apache.org/jira/browse/NUTCH-2660
> Project: Nutch
> Issue Type: Improvement
> Components: build, test
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.16
>
> The unit tests of the plugins "parse-js", "headings" and "index-jexl-filter" are not executed during build.
[jira] [Updated] (NUTCH-2660) Unit tests of not executed
[ https://issues.apache.org/jira/browse/NUTCH-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2660:
Summary: Unit tests of not executed (was: Plugin tests not executed)

> Unit tests of not executed
>
> Key: NUTCH-2660
> URL: https://issues.apache.org/jira/browse/NUTCH-2660
> Project: Nutch
> Issue Type: Improvement
> Components: build, test
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.16
>
> The unit tests of the plugins "parse-js", "headings" and "index-jexl-filter" are not executed during build.
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654773#comment-16654773 ]

ASF GitHub Bot commented on NUTCH-2658:

sebastian-nagel commented on a change in pull request #398: NUTCH-2658 Add README for the index-links plugin
URL: https://github.com/apache/nutch/pull/398#discussion_r226201417

## File path: src/plugin/index-links/README.md ##
@@ -0,0 +1,53 @@
+indexer-links plugin for Nutch
+==
+
+This plugin provides the feature to index the inlinks and outlinks of a URL
+into an indexing backend.
+
+## Configuration
+
+This plugin provides the following configuration options:
+
+* `index.links.outlinks.host.ignore`: If true, the plugin will ignore outlinks
+that point to the same host as the current URL. By default, all outlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.inlinks.host.ignore`: If true, the plugin will ignore inlinks
+coming from the same host as the current URL. By default, all inlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.hosts.only`: If true, the plugin will index only the host portion of the inlinks/outlinks URLs.
+
+## Fields
+
+For this plugin to work 2 new fields have to be added/configured in your storage backend:
+
+* `inlinks`
+* `outlinks`
+
+If the plugin is enabled these fields have to be added to your storage backend
+configuration.
+
+The specifics of how these fields are configured depends on your specific
+backend. We provide here sane default values for Solr.
+
+The following fields should be added to your backend storage. We provide
+examples of default values for the Solr schema.
+
+* Each outlink/inlink will be stored as a string without any tokenization.
+* The `inlink`/`outlink` fields have to be multivalued, because normally a
+given URL will have multiple inlinks and outlinks.
+
+```
+
+```
+
+The field configuration could look like:
+
+```
+

Review comment:
The Solr schema ([conf/schema.xml](/apache/nutch/blob/master/conf/schema.xml)) already contains the field definitions for multiple IndexingFilter plugins. Why not add inlinks and outlinks also to the schema?

> Add README file to all plugins in src/plugin
>
> Key: NUTCH-2658
> URL: https://issues.apache.org/jira/browse/NUTCH-2658
> Project: Nutch
> Issue Type: Improvement
> Components: documentation, plugin
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Trivial
>
> Since we've migrated a good portion of our workflow to Github we could consider adding a {{README.md}} file to the root of each plugin in {{src/plugins}}.
> This is a good place to have plugin-specific documentation: which fields the plugin adds to the indexer, which configuration options, etc. Also, since the README.md is rendered by Github automatically, it is a good link to point users to.
> I think that a good example is the {{indexer-cloudsearch}} plugin; on top of that, it's a good source of information to point users to when they ask questions regarding a specific plugin.