[jira] [Resolved] (NUTCH-2199) Documentation for Nutch 2.X REST API

2018-10-18 Thread Lewis John McGibbney (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2199.
-
Resolution: Fixed

> Documentation for Nutch 2.X REST API
> 
>
> Key: NUTCH-2199
> URL: https://issues.apache.org/jira/browse/NUTCH-2199
> Project: Nutch
>  Issue Type: New Feature
>  Components: documentation, REST_api
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Furkan KAMACI
>Priority: Minor
> Fix For: 2.5
>
>
> The work done on NUTCH-1800 needs to be ported to the 2.X branch. This is 
> trivial; I thought I had already done it, but obviously not. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2660) Unit tests of plugins parse-js, headings, index-jexl-filter to be executed during build

2018-10-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655308#comment-16655308
 ] 

ASF GitHub Bot commented on NUTCH-2660:
---

jorgelbg commented on issue #397: NUTCH-2660 Plugin tests not executed
URL: https://github.com/apache/nutch/pull/397#issuecomment-431029609
 
 
   +1 thanks for including the index-jexl-filter as well :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Unit tests of plugins parse-js, headings, index-jexl-filter to be executed 
> during build
> ---
>
> Key: NUTCH-2660
> URL: https://issues.apache.org/jira/browse/NUTCH-2660
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The unit tests of the plugins "parse-js", "headings" and "index-jexl-filter" 
> are not executed during build.





[jira] [Commented] (NUTCH-2663) Improve index-jexl-filter syntax for scripts

2018-10-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655305#comment-16655305
 ] 

ASF GitHub Bot commented on NUTCH-2663:
---

jorgelbg opened a new pull request #400: NUTCH-2663 Improve the JEXL syntax for 
getting values from the context
URL: https://github.com/apache/nutch/pull/400
 
 
   * Avoids the use of the array notation when getting values from the 
document/metadata(s) in the JEXL expression. We go from `doc.lang[0] == 'en'` 
to `doc.lang == 'en'`, which is easier to understand.
   
   * Log using errors instead of warnings in the `setConf` methods. We throw a 
`RuntimeException` if something is wrong, so the log should represent the same 
severity level.
   
   * Some minor changes and additional comments.
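
To illustrate the first point, here is a minimal sketch in Python. The plugin itself is written in Java, and `build_context` and its field layout are hypothetical names used only for illustration; they are not taken from the pull request. The sketch assumes that document fields arrive as lists of values and unwraps a field only when it holds exactly one value, so single-valued fields can be compared directly while multi-valued fields keep the array notation.

```python
from typing import Any, Dict, List


def build_context(fields: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Expose single-valued fields as scalars and keep other fields as lists.

    Hypothetical helper for illustration only; not the plugin's actual API.
    """
    context: Dict[str, Any] = {}
    for name, values in fields.items():
        # Unwrap only when there is exactly one value; otherwise keep a list.
        context[name] = values[0] if len(values) == 1 else list(values)
    return context


doc = build_context({"lang": ["en"], "anchor": ["home", "start"]})
print(doc["lang"] == "en")   # single value: no [0] needed
print(doc["anchor"][0])      # several values: array notation still applies
```

Whether an empty field should stay an empty list, as it does here, or be dropped from the context is a design choice this sketch does not settle.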




> Improve index-jexl-filter syntax for scripts
> 
>
> Key: NUTCH-2663
> URL: https://issues.apache.org/jira/browse/NUTCH-2663
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.16
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> JEXL scripts need to be written using the array syntax to get the actual 
> value (for instance, example extracted from the tests):
> {code}
> doc.lang[0]=='en'
> {code}
> Ideally, this would only be required if the actual value is really an array, 
> and not for single value elements.





[jira] [Work started] (NUTCH-2663) Improve index-jexl-filter syntax for scripts

2018-10-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2663 started by Jorge Luis Betancourt Gonzalez.
-
> Improve index-jexl-filter syntax for scripts
> 
>
> Key: NUTCH-2663
> URL: https://issues.apache.org/jira/browse/NUTCH-2663
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.16
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> JEXL scripts need to be written using the array syntax to get the actual 
> value (for instance, example extracted from the tests):
> {code}
> doc.lang[0]=='en'
> {code}
> Ideally, this would only be required if the actual value is really an array, 
> and not for single value elements.





[jira] [Created] (NUTCH-2663) Improve index-jexl-filter syntax for scripts

2018-10-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2663:
-

 Summary: Improve index-jexl-filter syntax for scripts
 Key: NUTCH-2663
 URL: https://issues.apache.org/jira/browse/NUTCH-2663
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.16
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez


JEXL scripts need to be written using the array syntax to get the actual value 
(for instance, example extracted from the tests):

{code}
doc.lang[0]=='en'
{code}

Ideally, this would only be required if the actual value is really an array, 
and not for single value elements.





[jira] [Commented] (NUTCH-2662) index-jexl-filter plugin throws a RuntimeException if its enabled but not configured

2018-10-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655180#comment-16655180
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2662:
---

[~yossi] Yes, I know :). At the time, I didn't notice that the actual 
recommendation (the message shown to the user) is to pick an expression 
(true/false), which can be done automatically while logging the default value. 
The second validation, which checks that the expression is syntactically 
correct, is still valid.
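
A minimal sketch of that behaviour in Python, under stated assumptions: the plugin is Java, `resolve_filter_expression` is a hypothetical name, the default of `"true"` is only one of the two values the exception message suggests, and Python's own expression parser stands in for the JEXL parser.

```python
import ast
import logging
from typing import Mapping, Optional

LOG = logging.getLogger("index-jexl-filter-sketch")


def resolve_filter_expression(conf: Mapping[str, str],
                              default: str = "true") -> str:
    """Return the configured expression, or a default when none is set.

    Hypothetical helper; the default value is an assumption of this sketch.
    """
    raw: Optional[str] = conf.get("index.jexl.filter")
    if raw is None or not raw.strip():
        # Missing configuration is not fatal: warn and fall back.
        LOG.warning("index.jexl.filter is not set; using default %r", default)
        return default
    expr = raw.strip()
    try:
        # Stand-in for JEXL parsing: Python's expression parser.
        ast.parse(expr, mode="eval")
    except SyntaxError as exc:
        # A malformed expression is a real configuration error, so fail.
        raise RuntimeError(
            f"Invalid index.jexl.filter expression: {expr!r}") from exc
    return expr
```

Only the second branch stops the process, which matches the idea of failing solely on an actual error in the configuration.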

> index-jexl-filter plugin throws a RuntimeException if its enabled but not 
> configured
> 
>
> Key: NUTCH-2662
> URL: https://issues.apache.org/jira/browse/NUTCH-2662
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.16
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> If the index-jexl-filter plugin is enabled but no configuration is provided 
> in the {{index.jexl.filter}} property, the plugin throws a RuntimeException. 
> In the same exception message, we advise setting it to either true or false 
> to index all/none. 
> This is a case where we can just select a sane default and log a warning, but 
> not stop the entire process. I think this is more consistent with how we 
> approach configuration in general: only fail if there is an actual error in 
> the configuration (i.e., a parse error in the expression).





[jira] [Commented] (NUTCH-2651) Upgrade to Tika 1.19.1 (from 1.18)

2018-10-18 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655133#comment-16655133
 ] 

Markus Jelsma commented on NUTCH-2651:
--

+1

Also, thanks for finding the javax-ws fix; I could immediately apply it to our 
custom parse plugin.

> Upgrade to Tika 1.19.1 (from 1.18)
> --
>
> Key: NUTCH-2651
> URL: https://issues.apache.org/jira/browse/NUTCH-2651
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, protocol
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> [Tika 1.19.1|https://tika.apache.org/1.19.1/index.html] has been released 
> recently. Among all the other improvements and fixes (including those of 
> [1.19|https://tika.apache.org/1.19/index.html]), it contains one important 
> performance fix (TIKA-2645, cf. NUTCH-2578) affecting the MIME-/Content-Type 
> detector.





[jira] [Commented] (NUTCH-2662) index-jexl-filter plugin throws a RuntimeException if its enabled but not configured

2018-10-18 Thread Yossi Tamari (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655082#comment-16655082
 ] 

Yossi Tamari commented on NUTCH-2662:
-

[~jorgelbg], this goes against your own review comment on this plugin: 
[https://github.com/apache/nutch/pull/219#discussion_r136044914]

> index-jexl-filter plugin throws a RuntimeException if its enabled but not 
> configured
> 
>
> Key: NUTCH-2662
> URL: https://issues.apache.org/jira/browse/NUTCH-2662
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.16
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> If the index-jexl-filter plugin is enabled but no configuration is provided 
> in the {{index.jexl.filter}} property, the plugin throws a RuntimeException. 
> In the same exception message, we advise setting it to either true or false 
> to index all/none. 
> This is a case where we can just select a sane default and log a warning, but 
> not stop the entire process. I think this is more consistent with how we 
> approach configuration in general: only fail if there is an actual error in 
> the configuration (i.e., a parse error in the expression).





[jira] [Created] (NUTCH-2662) index-jexl-filter plugin throws a RuntimeException if its enabled but not configured

2018-10-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2662:
-

 Summary: index-jexl-filter plugin throws a RuntimeException if its 
enabled but not configured
 Key: NUTCH-2662
 URL: https://issues.apache.org/jira/browse/NUTCH-2662
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.16
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez


If the index-jexl-filter plugin is enabled but no configuration is provided in 
the {{index.jexl.filter}} property, the plugin throws a RuntimeException. In the 
same exception message, we advise setting it to either true or false to index 
all/none. 

This is a case where we can just select a sane default and log a warning, but 
not stop the entire process. I think this is more consistent with how we 
approach configuration in general: only fail if there is an actual error in the 
configuration (i.e., a parse error in the expression).





[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654864#comment-16654864
 ] 

ASF GitHub Bot commented on NUTCH-2658:
---

jorgelbg commented on a change in pull request #398: NUTCH-2658 Add README for 
the index-links plugin
URL: https://github.com/apache/nutch/pull/398#discussion_r226215078
 
 

 ##
 File path: src/plugin/index-links/README.md
 ##
 @@ -0,0 +1,53 @@
+indexer-links plugin for Nutch
+==
+
+This plugin provides the feature to index the inlinks and outlinks of a URL
+into an indexing backend.
+
+## Configuration
+
+This plugin provides the following configuration options:
+
+* `index.links.outlinks.host.ignore`: If true, the plugin will ignore outlinks
+that point to the same host as the current URL. By default, all outlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.inlinks.host.ignore`: If true, the plugin will ignore inlinks
+coming from the same host as the current URL. By default, all inlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.hosts.only`: If true, the plugin will index only the host 
portion of the inlinks/outlinks URLs.
+
+## Fields
+
+For this plugin to work 2 new fields have to be added/configured in your 
storage backend:
+
+* `inlinks`
+* `outlinks`
+
+If the plugin is enabled these fields have to be added to your storage backend
+configuration.
+
+The specifics of how these fields are configured depends on your specific
+backend. We provide here sane default values for Solr.
+
+The following fields should be added to your backend storage. We provide
+examples of default values for the Solr schema.
+
+* Each outlink/inlink will be stored as a string without any tokenization.
+* The `inlink`/`outlink` fields have to be multivalued, because normally a
+given URL will have multiple inlinks and outlinks.
+
+```
+
+```
+
+The field configuration could look like:
+
+```
+
 
 Review comment:
   Yes, I realized last night that the fields are missing from the 
`conf/schema.xml` file. I'm going to add them there as well.
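
   For readers of the README under review, the host-related options can be illustrated with a small Python sketch. It is not the plugin's Java implementation: `select_links` and its parameters are hypothetical names, and the sketch only mirrors the behaviour the README describes for the `*.host.ignore` and `index.links.hosts.only` settings.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def select_links(page_url: str, links: Iterable[str],
                 ignore_same_host: bool = False,
                 hosts_only: bool = False) -> List[str]:
    """Pick the inlinks or outlinks to index for one page (illustrative only)."""
    page_host = urlsplit(page_url).hostname
    selected: List[str] = []
    for link in links:
        host = urlsplit(link).hostname
        # Mirrors the *.host.ignore options: skip links on the page's own host.
        if ignore_same_host and host == page_host:
            continue
        # Mirrors index.links.hosts.only: keep just the host portion.
        selected.append(host if hosts_only and host else link)
    return selected


links = ["http://a.example/x", "http://b.example/y"]
print(select_links("http://a.example/p", links, ignore_same_host=True))
print(select_links("http://a.example/p", links, hosts_only=True))
```

   The sketch does not de-duplicate hosts; whether the plugin does so is not stated in the README text quoted above.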
   




> Add README file to all plugins in src/plugin
> 
>
> Key: NUTCH-2658
> URL: https://issues.apache.org/jira/browse/NUTCH-2658
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Trivial
>
> Since we've migrated a good portion of our workflow to GitHub, we could 
> consider adding a {{README.md}} file to the root of each plugin in 
> {{src/plugin}}. 
> This is a good place to have plugin-specific documentation: which fields the 
> plugin adds to the indexer, which configuration options it has, etc. Also, 
> since the README.md is rendered by GitHub automatically, it is a good link to 
> point users to.
> I think that a good example is the {{indexer-cloudsearch}} plugin. On top of 
> that, it's a good source of information to point users to when they ask 
> questions regarding a specific plugin.





[jira] [Updated] (NUTCH-2660) Unit tests of plugins parse-js, headings, index-jexl-filter to be executed during build

2018-10-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2660:
---
Summary: Unit tests of plugins parse-js, headings, index-jexl-filter to be 
executed during build  (was: Unit tests of not executed)

> Unit tests of plugins parse-js, headings, index-jexl-filter to be executed 
> during build
> ---
>
> Key: NUTCH-2660
> URL: https://issues.apache.org/jira/browse/NUTCH-2660
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The unit tests of the plugins "parse-js", "headings" and "index-jexl-filter" 
> are not executed during build.





[jira] [Updated] (NUTCH-2660) Unit tests of not executed

2018-10-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2660:
---
Summary: Unit tests of not executed  (was: Plugin tests not executed)

> Unit tests of not executed
> --
>
> Key: NUTCH-2660
> URL: https://issues.apache.org/jira/browse/NUTCH-2660
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The unit tests of the plugins "parse-js", "headings" and "index-jexl-filter" 
> are not executed during build.





[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654773#comment-16654773
 ] 

ASF GitHub Bot commented on NUTCH-2658:
---

sebastian-nagel commented on a change in pull request #398: NUTCH-2658 Add 
README for the index-links plugin
URL: https://github.com/apache/nutch/pull/398#discussion_r226201417
 
 

 ##
 File path: src/plugin/index-links/README.md
 ##
 @@ -0,0 +1,53 @@
+indexer-links plugin for Nutch
+==
+
+This plugin provides the feature to index the inlinks and outlinks of a URL
+into an indexing backend.
+
+## Configuration
+
+This plugin provides the following configuration options:
+
+* `index.links.outlinks.host.ignore`: If true, the plugin will ignore outlinks
+that point to the same host as the current URL. By default, all outlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.inlinks.host.ignore`: If true, the plugin will ignore inlinks
+coming from the same host as the current URL. By default, all inlinks are
+indexed. If `db.ignore.internal.links` is `true` (default value) this setting
+is ignored because the internal links are already ignored.
+
+* `index.links.hosts.only`: If true, the plugin will index only the host 
portion of the inlinks/outlinks URLs.
+
+## Fields
+
+For this plugin to work 2 new fields have to be added/configured in your 
storage backend:
+
+* `inlinks`
+* `outlinks`
+
+If the plugin is enabled these fields have to be added to your storage backend
+configuration.
+
+The specifics of how these fields are configured depends on your specific
+backend. We provide here sane default values for Solr.
+
+The following fields should be added to your backend storage. We provide
+examples of default values for the Solr schema.
+
+* Each outlink/inlink will be stored as a string without any tokenization.
+* The `inlink`/`outlink` fields have to be multivalued, because normally a
+given URL will have multiple inlinks and outlinks.
+
+```
+
+```
+
+The field configuration could look like:
+
+```
+
 
 Review comment:
   The Solr schema 
([conf/schema.xml](/apache/nutch/blob/master/conf/schema.xml)) already contains 
the field definitions for multiple IndexingFilter plugins. Why not add inlinks 
and outlinks also to the schema?




> Add README file to all plugins in src/plugin
> 
>
> Key: NUTCH-2658
> URL: https://issues.apache.org/jira/browse/NUTCH-2658
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Trivial
>
> Since we've migrated a good portion of our workflow to GitHub, we could 
> consider adding a {{README.md}} file to the root of each plugin in 
> {{src/plugin}}. 
> This is a good place to have plugin-specific documentation: which fields the 
> plugin adds to the indexer, which configuration options it has, etc. Also, 
> since the README.md is rendered by GitHub automatically, it is a good link to 
> point users to.
> I think that a good example is the {{indexer-cloudsearch}} plugin. On top of 
> that, it's a good source of information to point users to when they ask 
> questions regarding a specific plugin.


