[jira] [Commented] (NUTCH-2661) Move TestOutlinks to the proper path

2018-10-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653876#comment-16653876
 ] 

ASF GitHub Bot commented on NUTCH-2661:
---

sebastian-nagel commented on issue #399: NUTCH-2661 Move the TestOutlinks class 
into the o.a.n.parse path
URL: https://github.com/apache/nutch/pull/399#issuecomment-430710838
 
 
   +1 that's the right location


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Move TestOutlinks to the proper path
> 
>
> Key: NUTCH-2661
> URL: https://issues.apache.org/jira/browse/NUTCH-2661
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Trivial
> Fix For: 1.16
>
>
> Initially, I placed the {{TestOutlinks}} class in the index-links plugin, 
> since that was when I found the bug with the {{hashCode}}. I now realise 
> that this test belongs in the {{test/org/apache/nutch/parse}} directory, 
> all the more so because it does not cover any plugin-specific code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2661) Move TestOutlinks to the proper path

2018-10-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653781#comment-16653781
 ] 

ASF GitHub Bot commented on NUTCH-2661:
---

jorgelbg opened a new pull request #399: NUTCH-2661 Move the TestOutlinks class 
into the o.a.n.parse path
URL: https://github.com/apache/nutch/pull/399
 
 
   This test covers the specific case of the comparison between 2 identical 
`Outlink` instances. Because this is not `index-links` specific, I'm moving the 
test class into the core parse tests.
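
For readers unfamiliar with this kind of test, here is a minimal sketch of what "comparing 2 identical instances" usually checks. The `SimpleOutlink` class and the `distinctCount` helper are hypothetical stand-ins written for illustration; they are not the real `Outlink` class or the actual `TestOutlinks` code from the pull request.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class OutlinkEqualitySketch {

  // Hypothetical stand-in for an outlink (target URL plus anchor text).
  // This is NOT the real Nutch Outlink class.
  static final class SimpleOutlink {
    private final String toUrl;
    private final String anchor;

    SimpleOutlink(String toUrl, String anchor) {
      this.toUrl = toUrl;
      this.anchor = anchor;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (!(o instanceof SimpleOutlink)) return false;
      SimpleOutlink other = (SimpleOutlink) o;
      return Objects.equals(toUrl, other.toUrl)
          && Objects.equals(anchor, other.anchor);
    }

    // hashCode must be derived from the same fields as equals; otherwise
    // equal instances can land in different hash buckets.
    @Override
    public int hashCode() {
      return Objects.hash(toUrl, anchor);
    }
  }

  // Counts how many distinct links remain after adding them to a HashSet.
  static int distinctCount(SimpleOutlink... links) {
    Set<SimpleOutlink> set = new HashSet<>();
    for (SimpleOutlink link : links) {
      set.add(link);
    }
    return set.size();
  }

  public static void main(String[] args) {
    SimpleOutlink a = new SimpleOutlink("http://example.org/", "home");
    SimpleOutlink b = new SimpleOutlink("http://example.org/", "home");
    System.out.println("equal: " + a.equals(b));
    System.out.println("same hash: " + (a.hashCode() == b.hashCode()));
    System.out.println("distinct in set: " + distinctCount(a, b));
  }
}
```

Nothing in this sketch depends on a plugin, which is the same reason given above for keeping such a test with the core parse tests.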









[jira] [Created] (NUTCH-2661) Move TestOutlinks to the proper path

2018-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2661:
-

 Summary: Move TestOutlinks to the proper path
 Key: NUTCH-2661
 URL: https://issues.apache.org/jira/browse/NUTCH-2661
 Project: Nutch
  Issue Type: Improvement
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
 Fix For: 1.16


Initially, I placed the {{TestOutlinks}} class in the index-links plugin, 
since that was when I found the bug with the {{hashCode}}. I now realise 
that this test belongs in the {{test/org/apache/nutch/parse}} directory, 
all the more so because it does not cover any plugin-specific code.





[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653497#comment-16653497
 ] 

ASF GitHub Bot commented on NUTCH-2658:
---

jorgelbg opened a new pull request #398: NUTCH-2658 Add README for the 
index-links plugin
URL: https://github.com/apache/nutch/pull/398
 
 
   Add a README file for the index-links plugin. At the very least, this 
solves part of the issue with users knowing what they need to add to their 
backend (usually Solr).




> Add README file to all plugins in src/plugin
> 
>
> Key: NUTCH-2658
> URL: https://issues.apache.org/jira/browse/NUTCH-2658
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Trivial
>
> Since we've migrated a good portion of our workflow to GitHub, we could 
> consider adding a {{README.md}} file to the root of each plugin in 
> {{src/plugin}}. 
> This is a good place to have plugin-specific documentation: which fields the 
> plugin adds to the indexer, which configuration options it has, etc. Also, 
> since the README.md is rendered by GitHub automatically, it is a good link 
> to point users to.
> I think that a good example is the {{indexer-cloudsearch}} plugin. On top of 
> that, a README is a good source of information to point users to when they 
> ask questions regarding a specific plugin.





[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653495#comment-16653495
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2658:
---

[~wastl-nagel] exactly what I was thinking. Right now in order to configure a 
given plugin you need to look at the nutch-default.xml to see what options are 
available, and read the documentation from there. If it's an indexing plugin 
you need to check the schema, or in the worst case the actual code to figure 
out what fields are going to be added. 

I think that at least these 2 components should be made more visible to 
users. The advantage of the README is that it lives right next to the code, so 
it's easier to "remember" to update it.

[~yossi] I agree that having the documentation on the Wiki as well is very 
helpful, and the README is not intended to replace that.

+1 on generating the wiki from the README (or something else); this would at 
least guarantee that it is updated with each release. 

We can also add a check/step to the release procedure to check if any new 
plugins have been added and if the README is there. Of course, there is always 
the risk that the README contains dummy/not useful data. But through PRs we can 
keep an eye on that.
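
As a rough illustration of the release check mentioned above, the following sketch lists plugin directories that have no `README.md`. The class name `PluginReadmeCheck`, the method `pluginsMissingReadme`, and the default `src/plugin` root are assumptions made for this example; no such step exists in the Nutch build as far as this thread shows.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PluginReadmeCheck {

  // Returns the sorted names of the direct subdirectories of pluginRoot
  // that do not contain a README.md file.
  static List<String> pluginsMissingReadme(Path pluginRoot) throws IOException {
    List<String> missing = new ArrayList<>();
    try (DirectoryStream<Path> dirs =
        Files.newDirectoryStream(pluginRoot, Files::isDirectory)) {
      for (Path dir : dirs) {
        if (!Files.isRegularFile(dir.resolve("README.md"))) {
          missing.add(dir.getFileName().toString());
        }
      }
    }
    Collections.sort(missing);
    return missing;
  }

  public static void main(String[] args) throws IOException {
    // Assumed default location of the plugins; override with the first argument.
    Path root = Paths.get(args.length > 0 ? args[0] : "src/plugin");
    if (!Files.isDirectory(root)) {
      System.out.println("Not a directory: " + root);
      return;
    }
    List<String> missing = pluginsMissingReadme(root);
    for (String name : missing) {
      System.out.println("Missing README.md: " + name);
    }
    if (!missing.isEmpty()) {
      System.exit(1); // a non-zero exit lets a release script fail fast
    }
  }
}
```

A check like this only confirms that the file exists; whether its content is useful would still have to be reviewed in PRs.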

As a side note, I kind of like how Elasticsearch has its documentation 
versioned and updated per release. Not sure how to integrate this with our wiki.






[jira] [Commented] (NUTCH-2660) Plugin tests not executed

2018-10-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653467#comment-16653467
 ] 

ASF GitHub Bot commented on NUTCH-2660:
---

sebastian-nagel opened a new pull request #397: NUTCH-2660 Plugin tests not 
executed
URL: https://github.com/apache/nutch/pull/397
 
 
   - add missing unit test packages to plugin build.xml
   - tests of "headings" plugin depend on "lib-nekohtml"
   - add "protocol-okhttp" to Javadoc API overview
   - add missing test packages to ant "eclipse" target
   




> Plugin tests not executed
> -
>
> Key: NUTCH-2660
> URL: https://issues.apache.org/jira/browse/NUTCH-2660
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The unit tests of the plugins "parse-js", "headings" and "index-jexl-filter" 
> are not executed during build.





[jira] [Created] (NUTCH-2660) Plugin tests not executed

2018-10-17 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2660:
--

 Summary: Plugin tests not executed
 Key: NUTCH-2660
 URL: https://issues.apache.org/jira/browse/NUTCH-2660
 Project: Nutch
  Issue Type: Improvement
  Components: build, test
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


The unit tests of the plugins "parse-js", "headings" and "index-jexl-filter" 
are not executed during build.





[jira] [Commented] (NUTCH-2659) Add missing Apache license headers

2018-10-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653452#comment-16653452
 ] 

ASF GitHub Bot commented on NUTCH-2659:
---

sebastian-nagel opened a new pull request #396: NUTCH-2659 Add missing Apache 
license headers
URL: https://github.com/apache/nutch/pull/396
 
 
   




> Add missing Apache license headers
> --
>
> Key: NUTCH-2659
> URL: https://issues.apache.org/jira/browse/NUTCH-2659
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.16
>
>
> Should add Apache license headers to source files (at least, *.java) - some 
> files lack the license header.





[jira] [Created] (NUTCH-2659) Add missing Apache license headers

2018-10-17 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2659:
--

 Summary: Add missing Apache license headers
 Key: NUTCH-2659
 URL: https://issues.apache.org/jira/browse/NUTCH-2659
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


Should add Apache license headers to source files (at least, *.java) - some 
files lack the license header.
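
To make the idea concrete, here is a small sketch that reports `*.java` files without an Apache license header. It assumes the header contains the standard opening phrase "Licensed to the Apache Software Foundation (ASF)" near the top of the file; the class and method names are hypothetical, and a dedicated tool such as Apache RAT would be the more usual choice for a real audit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LicenseHeaderCheck {

  // Opening phrase of the standard ASF source header (assumed marker).
  static final String MARKER = "Licensed to the Apache Software Foundation (ASF)";

  // Only the first 2048 characters are inspected, since the header is
  // expected at the top of the file.
  static boolean hasApacheHeader(String source) {
    String head = source.substring(0, Math.min(source.length(), 2048));
    return head.contains(MARKER);
  }

  // Walks root recursively and returns the .java files lacking the header.
  static List<Path> javaFilesMissingHeader(Path root) throws IOException {
    List<Path> javaFiles;
    try (Stream<Path> paths = Files.walk(root)) {
      javaFiles = paths
          .filter(p -> Files.isRegularFile(p) && p.toString().endsWith(".java"))
          .sorted()
          .collect(Collectors.toList());
    }
    List<Path> missing = new ArrayList<>();
    for (Path file : javaFiles) {
      String source = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
      if (!hasApacheHeader(source)) {
        missing.add(file);
      }
    }
    return missing;
  }
}
```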





[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread Yossi Tamari (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653368#comment-16653368
 ] 

Yossi Tamari commented on NUTCH-2658:
-

I disagree regarding putting the documentation in the code. This is not helpful 
for new users and users who are not Java coders. They can't be expected to 
navigate to src/plugin/indexer-cloudsearch to find the documentation for that 
plugin.

The README.md files are also less likely to appear high in Google results, 
compared to the Wiki.

The real problem is that the Wiki, and specifically PluginCentral, is not 
properly maintained. Do you think the README files will be maintained better?

Maybe we can add a build step that will copy the information from the README to 
the Wiki on release?






[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653353#comment-16653353
 ] 

Sebastian Nagel commented on NUTCH-2658:


In general, it is a good idea to bundle the plugin documentation and make it 
available under a uniform path. At present, the documentation is spread over 
4 different places:
- the Wiki, e.g., https://wiki.apache.org/nutch/IndexReplace
- the [API 
doc|http://nutch.apache.org/apidocs/apidocs-1.15/overview-summary.html] linking 
to the package.html / package-info.java of the plugin packages. Some plugins 
provide a usage description there or in the implementing class.
- a few plugins already have a README.md, e.g., 
[indexer-cloudsearch|https://github.com/apache/nutch/tree/master/src/plugin/indexer-cloudsearch]
- nutch-default.xml for properties

When in doubt, I would opt for moving documentation to the code because the 
code is versioned while our Wiki is not, or rather, it's difficult to link a 
Nutch version (e.g., 1.14) to the appropriate description. This would also be 
a good idea for the tutorial. The drawback: we really need to maintain the 
READMEs, because once released we cannot change the documentation.






[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653252#comment-16653252
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2658:
---

I'm thinking of having at least 2 general sections:

* Configuration: Covers all parameters that are included in the 
nutch-default.xml (although this could be a bit repetitive)
* Fields: Includes information about which fields should be added to your 
storage backend configuration (if applicable). 

Including documentation on how to configure Solr fields would be a nice default 
configuration, although we support different backends.








Plugin specific documentation

2018-10-17 Thread Jorge Betancourt
Hi all,

I've created an issue [1] with a proposal to improve the documentation for
each plugin that is included with Nutch. I would love to get some feedback
about the general idea.

Best Regards,
Jorge

[1] https://issues.apache.org/jira/browse/NUTCH-2658


[jira] [Created] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2658:
-

 Summary: Add README file to all plugins in src/plugin
 Key: NUTCH-2658
 URL: https://issues.apache.org/jira/browse/NUTCH-2658
 Project: Nutch
  Issue Type: Improvement
  Components: documentation, plugin
Reporter: Jorge Luis Betancourt Gonzalez


Since we've migrated a good portion of our workflow to GitHub, we could consider 
adding a {{README.md}} file to the root of each plugin in {{src/plugin}}. 

This is a good place to have plugin-specific documentation: which fields the 
plugin adds to the indexer, which configuration options it has, etc. Also, 
since the README.md is rendered by GitHub automatically, it is a good link to 
point users to.

I think that a good example is the {{indexer-cloudsearch}} plugin. On top of 
that, a README is a good source of information to point users to when they ask 
questions regarding a specific plugin.


