[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735160#comment-16735160 ]

Hudson commented on NUTCH-2658:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3593 (See [https://builds.apache.org/job/Nutch-trunk/3593/])
NUTCH-2658 Add README for the index-links plugin (jorge-luis.betancourt: [https://github.com/apache/nutch/commit/f79a5af4fe52ae905d6ce77f891911305d1362a9])
* (add) src/plugin/index-links/README.md

> Add README file to all plugins in src/plugin
>
> Key: NUTCH-2658
> URL: https://issues.apache.org/jira/browse/NUTCH-2658
> Project: Nutch
> Issue Type: Improvement
> Components: documentation, plugin
> Affects Versions: 1.15
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Trivial
> Fix For: 1.16
>
> Since we've migrated a good portion of our workflow to GitHub, we could
> consider adding a {{README.md}} file to the root of each plugin in
> {{src/plugin}}.
> This is a good place to have plugin-specific documentation: which fields the
> plugin adds to the indexer, which configuration options it has, etc. Also, since the
> README.md is rendered by GitHub automatically, it is a good link to point users to.
> I think that a good example is the {{indexer-cloudsearch}} plugin. On top of
> that, it's a good source of information to point users to when they ask questions
> regarding a specific plugin.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735157#comment-16735157 ]

ASF GitHub Bot commented on NUTCH-2658:
---

sebastian-nagel commented on pull request #398: NUTCH-2658 Add README for the index-links plugin
URL: https://github.com/apache/nutch/pull/398

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687801#comment-16687801 ]

Hudson commented on NUTCH-2658:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3589 (See [https://builds.apache.org/job/Nutch-trunk/3589/])
NUTCH-2658 Adding the fields required by the index-links plugin to the (snagel: [https://github.com/apache/nutch/commit/a5df63a3d644e90fb881a0f16c8f29d9320d1de3])
* (edit) conf/schema.xml
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682704#comment-16682704 ]

Hudson commented on NUTCH-2658:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3584 (See [https://builds.apache.org/job/Nutch-trunk/3584/])
NUTCH-2658 Adding the fields required by the index-links plugin to the (betancourt.jorge: [https://github.com/apache/nutch/commit/45098e7964f100fe0fb5dfa6cd370a2d966a50dc])
* (edit) conf/schema.xml
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661724#comment-16661724 ]

ASF GitHub Bot commented on NUTCH-2658:
---

lewismc commented on issue #398: NUTCH-2658 Add README for the index-links plugin
URL: https://github.com/apache/nutch/pull/398#issuecomment-43257

+1
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654864#comment-16654864 ]

ASF GitHub Bot commented on NUTCH-2658:
---

jorgelbg commented on a change in pull request #398: NUTCH-2658 Add README for the index-links plugin
URL: https://github.com/apache/nutch/pull/398#discussion_r226215078

## File path: src/plugin/index-links/README.md ##
@@ -0,0 +1,53 @@
+indexer-links plugin for Nutch
+==
+
+This plugin provides the feature to index the inlinks and outlinks of a URL
+into an indexing backend.
+
+## Configuration
+
+This plugin provides the following configuration options:
+
+* `index.links.outlinks.host.ignore`: If true, the plugin will ignore outlinks
+that point to the same host as the current URL. By default, all outlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.inlinks.host.ignore`: If true, the plugin will ignore inlinks
+coming from the same host as the current URL. By default, all inlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.hosts.only`: If true, the plugin will index only the host portion of the inlinks/outlinks URLs.
+
+## Fields
+
+For this plugin to work 2 new fields have to be added/configured in your storage backend:
+
+* `inlinks`
+* `outlinks`
+
+If the plugin is enabled these fields have to be added to your storage backend
+configuration.
+
+The specifics of how these fields are configured depends on your specific
+backend. We provide here sane default values for Solr.
+
+The following fields should be added to your backend storage. We provide
+examples of default values for the Solr schema.
+
+* Each outlink/inlink will be stored as a string without any tokenization.
+* The `inlink`/`outlink` fields have to be multivalued, because normally a
+given URL will have multiple inlinks and outlinks.
+
+```
+
+```
+
+The field configuration could look like:
+
+```
+

Review comment:
Yes, I realized last night that the fields are missing from the `conf/schema.xml` file. I'm going to add them there as well.
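For reference, the two fields discussed in this review could be declared in a Solr schema roughly as follows. This is a minimal sketch and not the content of the pull request or of `conf/schema.xml`: the field names `inlinks` and `outlinks` and the two requirements (untokenized strings, multivalued) come from the README text quoted in the review, while the type name `string` and the `indexed`/`stored` settings are assumptions.

```xml
<!-- Sketch only. An untokenized string type; the name "string" is an assumption. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- A URL normally has several inlinks and outlinks, so both fields are multivalued. -->
<field name="inlinks" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="outlinks" type="string" indexed="true" stored="true" multiValued="true"/>
```

Other indexing backends would need equivalent definitions in their own mapping or schema format.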
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654773#comment-16654773 ]

ASF GitHub Bot commented on NUTCH-2658:
---

sebastian-nagel commented on a change in pull request #398: NUTCH-2658 Add README for the index-links plugin
URL: https://github.com/apache/nutch/pull/398#discussion_r226201417

## File path: src/plugin/index-links/README.md ##
@@ -0,0 +1,53 @@
+indexer-links plugin for Nutch
+==
+
+This plugin provides the feature to index the inlinks and outlinks of a URL
+into an indexing backend.
+
+## Configuration
+
+This plugin provides the following configuration options:
+
+* `index.links.outlinks.host.ignore`: If true, the plugin will ignore outlinks
+that point to the same host as the current URL. By default, all outlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.inlinks.host.ignore`: If true, the plugin will ignore inlinks
+coming from the same host as the current URL. By default, all inlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.hosts.only`: If true, the plugin will index only the host portion of the inlinks/outlinks URLs.
+
+## Fields
+
+For this plugin to work 2 new fields have to be added/configured in your storage backend:
+
+* `inlinks`
+* `outlinks`
+
+If the plugin is enabled these fields have to be added to your storage backend
+configuration.
+
+The specifics of how these fields are configured depends on your specific
+backend. We provide here sane default values for Solr.
+
+The following fields should be added to your backend storage. We provide
+examples of default values for the Solr schema.
+
+* Each outlink/inlink will be stored as a string without any tokenization.
+* The `inlink`/`outlink` fields have to be multivalued, because normally a
+given URL will have multiple inlinks and outlinks.
+
+```
+
+```
+
+The field configuration could look like:
+
+```
+

Review comment:
The Solr schema ([conf/schema.xml](/apache/nutch/blob/master/conf/schema.xml)) already contains the field definitions for multiple IndexingFilter plugins. Why not add inlinks and outlinks also to the schema?
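The three options described in the README under review could be set in `conf/nutch-site.xml` along these lines. This is a hedged sketch: the property names are taken from the README text quoted in the review, the values are illustrative only, and it assumes the index-links plugin is already enabled through `plugin.includes`.

```xml
<configuration>
  <!-- Per the README, the two host.ignore options below are ignored while this is true (the default). -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>

  <!-- Skip outlinks that point to the same host as the current URL. -->
  <property>
    <name>index.links.outlinks.host.ignore</name>
    <value>true</value>
  </property>

  <!-- Skip inlinks that come from the same host as the current URL. -->
  <property>
    <name>index.links.inlinks.host.ignore</name>
    <value>true</value>
  </property>

  <!-- Index only the host portion of each inlink/outlink URL. -->
  <property>
    <name>index.links.hosts.only</name>
    <value>false</value>
  </property>
</configuration>
```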
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653497#comment-16653497 ]

ASF GitHub Bot commented on NUTCH-2658:
---

jorgelbg opened a new pull request #398: NUTCH-2658 Add README for the index-links plugin
URL: https://github.com/apache/nutch/pull/398

Add a README file for the index-links plugin. At the very least, this solves part of the issue with users knowing what they need to add to their backend (usually Solr).
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653495#comment-16653495 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2658:
---

[~wastl-nagel] exactly what I was thinking. Right now, in order to configure a given plugin you need to look at nutch-default.xml to see what options are available, and read the documentation from there. If it's an indexing plugin you need to check the schema, or in the worst case the actual code, to figure out what fields are going to be added. I think that at least these 2 components should be made more visible to users. The advantage of the README is that it lives right next to the code, so it's easier to "remember" to update it.

[~yossi] I agree that having the documentation on the Wiki as well is very helpful, and the README is not intended to replace that. +1 on generating the Wiki from the README (or something else); this would at least guarantee that it is updated with each release. We can also add a step to the release procedure to check whether any new plugins have been added and whether each has a README.

Of course, there is always the risk that a README contains dummy or useless data, but through PRs we can keep an eye on that.

As a side note, I kind of like how Elasticsearch has its documentation versioned and updated per release. I'm not sure how to integrate this with our Wiki.
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653368#comment-16653368 ]

Yossi Tamari commented on NUTCH-2658:
---

I disagree regarding putting the documentation in the code. This is not helpful for new users and users who are not Java coders. They can't be expected to navigate to src/plugin/indexer-cloudsearch to find the documentation for that plugin. The README.md files are also less likely to appear high in Google results, compared to the Wiki.

The real problem is that the Wiki, and specifically PluginCentral, is not properly maintained. Do you think the README files will be maintained better?

Maybe we can add a build step that will copy the information from the README to the Wiki on release?
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653353#comment-16653353 ]

Sebastian Nagel commented on NUTCH-2658:
---

In general, it is a good idea to bundle the plugin documentation and make it available under a uniform path. At present, the documentation is spread over 4 different places:
- the Wiki, e.g., https://wiki.apache.org/nutch/IndexReplace
- the [API doc|http://nutch.apache.org/apidocs/apidocs-1.15/overview-summary.html] linking to the package.html / package-info.java of the plugin packages. Some plugins provide a usage description there or in the implementing class.
- a few plugins already have a README.md, e.g., [indexer-cloudsearch|https://github.com/apache/nutch/tree/master/src/plugin/indexer-cloudsearch]
- nutch-default.xml for properties

When in doubt, I would opt for moving documentation to the code because the code is versioned while our Wiki is not; that is, it's difficult to link a Nutch version (e.g. 1.14) to the appropriate description. This would also be a good idea for the tutorial.

The drawbacks:
- we really need to maintain the READMEs
- once released we cannot change the documentation.
[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653252#comment-16653252 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2658:
---

I'm thinking of having at least 2 general sections:
* Configuration: Covers all parameters that are included in nutch-default.xml (although this could be a bit of a repetition).
* Fields: Includes information about which fields should be added to your storage backend configuration (if applicable). Including documentation on how to configure the Solr fields would be a nice default, although we support different backends.