[jira] [Commented] (NUTCH-1763) Improving comments on the Injector Class

2017-10-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211836#comment-16211836
 ] 

Hudson commented on NUTCH-1763:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3458 (See 
[https://builds.apache.org/job/Nutch-trunk/3458/])
NUTCH-1763 Code comment Injector contributed by Diaa (snagel: 
[https://github.com/apache/nutch/commit/21d56a0c5626553a3bf5058588d9277e6844e00f])
* (edit) src/java/org/apache/nutch/crawl/Injector.java


> Improving comments on the Injector Class
> 
>
> Key: NUTCH-1763
> URL: https://issues.apache.org/jira/browse/NUTCH-1763
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.9
>Reporter: Diaa
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
> Attachments: Injector.java.patch, Injector.java.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> I think the Injector class could use some improvements in the comments.
> I am attaching a few improvements to that and will keep adding as I 
> understand it more.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (NUTCH-1763) Improving comments on the Injector Class

2017-10-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1763.

   Resolution: Fixed
Fix Version/s: 1.14

Adapted patch to recent Injector version and committed to 1.x 
([21d56a0c|https://github.com/apache/nutch/commit/21d56a0c5626553a3bf5058588d9277e6844e00f]).
 Thanks, [~diaa_abdallah]!

> Improving comments on the Injector Class
> 
>
> Key: NUTCH-1763
> URL: https://issues.apache.org/jira/browse/NUTCH-1763
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.9
>Reporter: Diaa
>Priority: Minor
> Fix For: 1.14
>
> Attachments: Injector.java.patch, Injector.java.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> I think the Injector class could use some improvements in the comments.
> I am attaching a few improvements to that and will keep adding as I 
> understand it more.





[jira] [Assigned] (NUTCH-1763) Improving comments on the Injector Class

2017-10-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-1763:
--

Assignee: Sebastian Nagel

> Improving comments on the Injector Class
> 
>
> Key: NUTCH-1763
> URL: https://issues.apache.org/jira/browse/NUTCH-1763
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.9
>Reporter: Diaa
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
> Attachments: Injector.java.patch, Injector.java.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> I think the Injector class could use some improvements in the comments.
> I am attaching a few improvements to that and will keep adding as I 
> understand it more.





[jira] [Resolved] (NUTCH-2435) New configuration allowing to choose whether to store 'parse_text' directory or not.

2017-10-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2435.

Resolution: Fixed

> New configuration allowing to choose whether to store 'parse_text' directory 
> or not.
> 
>
> Key: NUTCH-2435
> URL: https://issues.apache.org/jira/browse/NUTCH-2435
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.13
> Environment: Apache Nutch 1.13
>Reporter: Marcos Bori
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> Whenever a page is parsed, one of the outputs is the directory 'parse_text'.
> It is intended to be used at the indexing phase so the page can be searched 
> from a search engine such as Solr.
> In my special crawling case, I don't need to index the page contents. 
> Therefore, creating and filling the 'parse_text' is not required for me. To 
> optimize performance, I don't want the crawler to store this information to 
> the filesystem. 
> I propose a new parameter "parser.store.text" allowing to choose whether to 
> store 'parse_text' directory or not. Its default value, of course, is "true".
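
A minimal sketch of how such a switch could be read, assuming the configuration behaves like a key/value store. Here `java.util.Properties` stands in for the crawler's real configuration object, and the class and method names are hypothetical; only the key "parser.store.text" and its default of "true" come from the issue.

```java
import java.util.Properties;

/** Hypothetical helper; Properties is only a stand-in for the real configuration. */
class ParseTextToggle {
  // Property name as proposed in the issue.
  static final String KEY = "parser.store.text";

  /** Returns true unless the property is explicitly set to something other than "true". */
  static boolean shouldStoreText(Properties conf) {
    // A missing key falls back to "true", matching the proposed default.
    return Boolean.parseBoolean(conf.getProperty(KEY, "true"));
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(shouldStoreText(conf)); // key not set
    conf.setProperty(KEY, "false");
    System.out.println(shouldStoreText(conf)); // key set to "false"
  }
}
```

With a check like this, the code that writes 'parse_text' could skip the write whenever the method returns false, while existing setups that never set the key keep the current behaviour.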





[jira] [Updated] (NUTCH-2435) New configuration allowing to choose whether to store 'parse_text' directory or not.

2017-10-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2435:
---
Fix Version/s: 1.14

> New configuration allowing to choose whether to store 'parse_text' directory 
> or not.
> 
>
> Key: NUTCH-2435
> URL: https://issues.apache.org/jira/browse/NUTCH-2435
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.13
> Environment: Apache Nutch 1.13
>Reporter: Marcos Bori
> Fix For: 1.14
>
>
> Whenever a page is parsed, one of the outputs is the directory 'parse_text'.
> It is intended to be used at the indexing phase so the page can be searched 
> from a search engine such as Solr.
> In my special crawling case, I don't need to index the page contents. 
> Therefore, creating and filling the 'parse_text' is not required for me. To 
> optimize performance, I don't want the crawler to store this information to 
> the filesystem. 
> I propose a new parameter "parser.store.text" allowing to choose whether to 
> store 'parse_text' directory or not. Its default value, of course, is "true".





[jira] [Assigned] (NUTCH-2435) New configuration allowing to choose whether to store 'parse_text' directory or not.

2017-10-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2435:
--

Assignee: Sebastian Nagel

> New configuration allowing to choose whether to store 'parse_text' directory 
> or not.
> 
>
> Key: NUTCH-2435
> URL: https://issues.apache.org/jira/browse/NUTCH-2435
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.13
> Environment: Apache Nutch 1.13
>Reporter: Marcos Bori
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> Whenever a page is parsed, one of the outputs is the directory 'parse_text'.
> It is intended to be used at the indexing phase so the page can be searched 
> from a search engine such as Solr.
> In my special crawling case, I don't need to index the page contents. 
> Therefore, creating and filling the 'parse_text' is not required for me. To 
> optimize performance, I don't want the crawler to store this information to 
> the filesystem. 
> I propose a new parameter "parser.store.text" allowing to choose whether to 
> store 'parse_text' directory or not. Its default value, of course, is "true".





[jira] [Commented] (NUTCH-2435) New configuration allowing to choose whether to store 'parse_text' directory or not.

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211768#comment-16211768
 ] 

ASF GitHub Bot commented on NUTCH-2435:
---

sebastian-nagel commented on issue #225: NUTCH-2435 - New parameter 
"parser.store.text"
URL: https://github.com/apache/nutch/pull/225#issuecomment-338043379
 
 
   Thanks, @maborec! Everything ok, just busy lately.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> New configuration allowing to choose whether to store 'parse_text' directory 
> or not.
> 
>
> Key: NUTCH-2435
> URL: https://issues.apache.org/jira/browse/NUTCH-2435
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.13
> Environment: Apache Nutch 1.13
>Reporter: Marcos Bori
>
> Whenever a page is parsed, one of the outputs is the directory 'parse_text'.
> It is intended to be used at the indexing phase so the page can be searched 
> from a search engine such as Solr.
> In my special crawling case, I don't need to index the page contents. 
> Therefore, creating and filling the 'parse_text' is not required for me. To 
> optimize performance, I don't want the crawler to store this information to 
> the filesystem. 
> I propose a new parameter "parser.store.text" allowing to choose whether to 
> store 'parse_text' directory or not. Its default value, of course, is "true".





[jira] [Commented] (NUTCH-2435) New configuration allowing to choose whether to store 'parse_text' directory or not.

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211766#comment-16211766
 ] 

ASF GitHub Bot commented on NUTCH-2435:
---

sebastian-nagel closed pull request #225: NUTCH-2435 - New parameter 
"parser.store.text"
URL: https://github.com/apache/nutch/pull/225
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index c406907c5..587386140 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -1372,6 +1372,13 @@ CAUTION: Set the parser.timeout to -1 or a bigger value 
than 30, when using this
   
 
 
+<property>
+  <name>parser.store.text</name>
+  <value>true</value>
+  <description>If true (default value), parser will store parse text (parse_text directory within the segment).</description>
+</property>
+
+
 
> Key: NUTCH-2435
> URL: https://issues.apache.org/jira/browse/NUTCH-2435
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.13
> Environment: Apache Nutch 1.13
>Reporter: Marcos Bori
>
> Whenever a page is parsed, one of the outputs is the directory 'parse_text'.
> It is intended to be used at the indexing phase so the page can be searched 
> from a search engine such as Solr.
> In my special crawling case, I don't need to index the page contents. 
> Therefore, creating and filling the 'parse_text' is not required for me. To 
> optimize performance, I don't want the crawler to store this information to 
> the filesystem. 
> I propose a new parameter "parser.store.text" allowing to choose whether to 
> store 'parse_text' directory or not. Its default value, of course, is "true".





[jira] [Commented] (NUTCH-2407) Memory leak causing Nutch Server to run out of memory

2017-10-19 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211759#comment-16211759
 ] 

Sebastian Nagel commented on NUTCH-2407:


See 
[NUTCH-1746|https://issues.apache.org/jira/browse/NUTCH-1746?focusedCommentId=14004258&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14004258]
 ObjectCache leaking

> Memory leak causing Nutch Server to run out of memory
> -
>
> Key: NUTCH-2407
> URL: https://issues.apache.org/jira/browse/NUTCH-2407
> Project: Nutch
>  Issue Type: Bug
>  Components: nutch server
>Affects Versions: 2.3.1
> Environment: Ubuntu 16.04 64-bit
> Oracle Java 8 64-bit
> Nutch 2.3.1 (standalone deployment)
> MongoDB 3.4
>Reporter: Vyacheslav Pascarel
> Attachments: first.txt, second.txt, started.txt
>
>
> My application is trying to perform continuous crawling using Nutch REST 
> services. The application injects a seed URL and then repeats 
> GENERATE/FETCH/PARSE/UPDATEDB sequence requested number of times (each step 
> in the sequence is executed upon successful completion of the previous step 
> then the whole sequence is repeated again). Here is a brief description of 
> the job:
> * Number of GENERATE/FETCH/PARSE/UPDATEDB cycles per run: 50
> * 'topN' parameter value of GENERATE step in each cycle: 10
> * Seed URL: http://www.cnn.com
> * Regex URL filters for all jobs: 
> ** *"-^.\{1000,\}$"* - exclude very long URLs
> ** *"+."* - include the rest
> To monitor Nutch server I use Java VisualVM that comes with Java SDK. After 
> each run (50 cycles of GENERATE/FETCH/PARSE/UPDATEDB) I perform garbage 
> collection using the mentioned tool and check memory usage. My observation is 
> that Nutch Server leaks ~25MB per run.
> NOTES: I added custom HTTP DELETE services to clean job history in 
> NutchServerPoolExecutor and remove all custom configurations from 
> RAMConfManager after each run. So observed ~25MB memory leak is after job 
> history/configuration cleanup.
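
The two regex URL filter rules listed in the description can be sketched as a first-match filter. This is a minimal illustration rather than the actual Nutch filter code: the class name is hypothetical, the backslashes before the braces in the quoted rule are assumed to be JIRA markup escapes (so the pattern is taken to be `^.{1000,}$`), and the "first matching rule decides, no match means reject" behaviour is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical first-match URL filter: '+' rules accept, '-' rules reject. */
class FirstMatchUrlFilter {
  private static final class Rule {
    final boolean accept;
    final Pattern pattern;

    Rule(boolean accept, Pattern pattern) {
      this.accept = accept;
      this.pattern = pattern;
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  /** Adds a rule written as a sign followed by a regex, e.g. "+." or "-^.{1000,}$". */
  FirstMatchUrlFilter add(String line) {
    if (line.length() < 2 || (line.charAt(0) != '+' && line.charAt(0) != '-')) {
      throw new IllegalArgumentException("Rule must start with '+' or '-': " + line);
    }
    rules.add(new Rule(line.charAt(0) == '+', Pattern.compile(line.substring(1))));
    return this;
  }

  /** The first rule whose pattern is found in the URL decides; no match means reject. */
  boolean accepts(String url) {
    for (Rule rule : rules) {
      if (rule.pattern.matcher(url).find()) {
        return rule.accept;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    FirstMatchUrlFilter filter = new FirstMatchUrlFilter()
        .add("-^.{1000,}$") // exclude very long URLs
        .add("+.");         // include the rest
    System.out.println(filter.accepts("http://www.cnn.com"));
  }
}
```

Rule order matters in this sketch: placing "+." first would accept every non-empty URL before the length rule is ever consulted.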





[jira] [Commented] (NUTCH-2429) Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211476#comment-16211476
 ] 

ASF GitHub Bot commented on NUTCH-2429:
---

HiranChaudhuri commented on a change in pull request #222: NUTCH-2429 Fix 
Plugin System to allow protocol plugins to bundle their URLStreamHandlers
URL: https://github.com/apache/nutch/pull/222#discussion_r145780057
 
 

 ##
 File path: src/plugin/protocol-foo/ivy.xml
 ##
 @@ -0,0 +1,23 @@
+
 
 Review comment:
   Done




> Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
> -
>
> Key: NUTCH-2429
> URL: https://issues.apache.org/jira/browse/NUTCH-2429
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl
>Affects Versions: 1.14
> Environment: Tested on both Nutch 1.13 and 1.14 in Ubuntu Linux with 
> OpenJDK 1.8.
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>
> While trying to use the protocol-smb plugin (which is not part of the Nutch 
> distribution) I realized there are four steps to successfully make use of a 
> protocol plugin:
> 1 - put the artifact into the plugins directory
> 2 - modify Nutch configuration files to allow smb:// urls plus include the 
> plugin to the loaded list
> 3 - extract jcifs.jar and place it on the system classpath
> 4 - run nutch with the correct system property
> While steps 1 and 2 seem obvious, 3 and 4 require knowledge of plugin 
> internals which does not feel right for nutch and plugin users. Even more, 
> the jcifs.jar would exist twice on the classpath and could even cause further 
> problems during runtime.





[jira] [Commented] (NUTCH-2429) Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211477#comment-16211477
 ] 

ASF GitHub Bot commented on NUTCH-2429:
---

HiranChaudhuri commented on a change in pull request #222: NUTCH-2429 Fix 
Plugin System to allow protocol plugins to bundle their URLStreamHandlers
URL: https://github.com/apache/nutch/pull/222#discussion_r145780113
 
 

 ##
 File path: src/plugin/protocol-foo/plugin.xml
 ##
 @@ -0,0 +1,30 @@
+
 
 Review comment:
   Done




> Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
> -
>
> Key: NUTCH-2429
> URL: https://issues.apache.org/jira/browse/NUTCH-2429
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl
>Affects Versions: 1.14
> Environment: Tested on both Nutch 1.13 and 1.14 in Ubuntu Linux with 
> OpenJDK 1.8.
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>
> While trying to use the protocol-smb plugin (which is not part of the Nutch 
> distribution) I realized there are four steps to successfully make use of a 
> protocol plugin:
> 1 - put the artifact into the plugins directory
> 2 - modify Nutch configuration files to allow smb:// urls plus include the 
> plugin to the loaded list
> 3 - extract jcifs.jar and place it on the system classpath
> 4 - run nutch with the correct system property
> While steps 1 and 2 seem obvious, 3 and 4 require knowledge of plugin 
> internals which does not feel right for nutch and plugin users. Even more, 
> the jcifs.jar would exist twice on the classpath and could even cause further 
> problems during runtime.





[jira] [Commented] (NUTCH-2429) Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211475#comment-16211475
 ] 

ASF GitHub Bot commented on NUTCH-2429:
---

HiranChaudhuri commented on a change in pull request #222: NUTCH-2429 Fix 
Plugin System to allow protocol plugins to bundle their URLStreamHandlers
URL: https://github.com/apache/nutch/pull/222#discussion_r145780018
 
 

 ##
 File path: src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java
 ##
 @@ -0,0 +1,85 @@
+package org.apache.nutch.plugin;
+
+import java.lang.ref.WeakReference;
+import java.net.URL;
+import java.net.URLStreamHandler;
+import java.util.ArrayList;
+
+import org.mortbay.log.Log;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class URLStreamHandlerFactory
+implements java.net.URLStreamHandlerFactory {
+  
+  protected static final Logger LOG = LoggerFactory
+  .getLogger(URLStreamHandlerFactory.class);
+  
+  /** The singleton instance. */
+  private static URLStreamHandlerFactory instance;
+  
+  /** Here we register all PluginRepositories. */
+  private ArrayList<WeakReference<PluginRepository>> prs;
+  
+  static {
+instance = new URLStreamHandlerFactory();
+URL.setURLStreamHandlerFactory(instance);
+LOG.info("Registered URLStreamHandlerFactory with the JVM.");
+  }
+  
+  private URLStreamHandlerFactory() {
+LOG.debug("URLStreamHandlerFactory()");
+prs = new ArrayList<>();
+  }
+
+  /** Return the singleton instance of this class. */
+  public static URLStreamHandlerFactory getInstance() {
+LOG.debug("getInstance()");
+return instance;
+  }
+  
+  /** Use this method once a new PluginRepository was created to register it.
+   * 
+   * @param pr The PluginRepository to be registered.
+   */
+  public void registerPluginRepository(PluginRepository pr) {
+LOG.debug("registerPluginRepository(...)");
 
 Review comment:
   That is why it uses the debug level logging. If such logging should not go 
to production, you could as well question all debug messages.
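
For readers following the review, the idea in the diff above (one JVM-wide factory that keeps weak references to registered repositories and asks them for a handler) can be sketched as below. This is an illustration under stated assumptions, not the code from the pull request: `HandlerSource` is a hypothetical stand-in for a plugin repository, the class name is invented, and the sketch does not call `URL.setURLStreamHandlerFactory`, which the JVM allows only once.

```java
import java.io.IOException;
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical factory that delegates to weakly referenced handler sources. */
class DelegatingHandlerFactory implements URLStreamHandlerFactory {

  /** Stand-in for a plugin repository; returns null for unknown protocols. */
  interface HandlerSource {
    URLStreamHandler handlerFor(String protocol);
  }

  // Weak references, so registering a source does not keep it alive forever.
  private final List<WeakReference<HandlerSource>> sources = new ArrayList<>();

  synchronized void register(HandlerSource source) {
    sources.add(new WeakReference<>(source));
  }

  @Override
  public synchronized URLStreamHandler createURLStreamHandler(String protocol) {
    Iterator<WeakReference<HandlerSource>> it = sources.iterator();
    while (it.hasNext()) {
      HandlerSource source = it.next().get();
      if (source == null) {
        it.remove(); // the source was garbage collected, drop the stale entry
        continue;
      }
      URLStreamHandler handler = source.handlerFor(protocol);
      if (handler != null) {
        return handler;
      }
    }
    return null; // null lets the JVM fall back to its built-in handlers
  }

  public static void main(String[] args) {
    final URLStreamHandler smbHandler = new URLStreamHandler() {
      @Override
      protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException("demo handler, no real connection");
      }
    };
    HandlerSource source = p -> "smb".equals(p) ? smbHandler : null;
    DelegatingHandlerFactory factory = new DelegatingHandlerFactory();
    factory.register(source);
    System.out.println(factory.createURLStreamHandler("smb") == smbHandler);
    System.out.println(factory.createURLStreamHandler("http") == null);
  }
}
```

Returning null for unknown protocols is what keeps http, file and the other built-in schemes working once a factory like this is installed.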




> Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
> -
>
> Key: NUTCH-2429
> URL: https://issues.apache.org/jira/browse/NUTCH-2429
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl
>Affects Versions: 1.14
> Environment: Tested on both Nutch 1.13 and 1.14 in Ubuntu Linux with 
> OpenJDK 1.8.
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>
> While trying to use the protocol-smb plugin (which is not part of the Nutch 
> distribution) I realized there are four steps to successfully make use of a 
> protocol plugin:
> 1 - put the artifact into the plugins directory
> 2 - modify Nutch configuration files to allow smb:// urls plus include the 
> plugin to the loaded list
> 3 - extract jcifs.jar and place it on the system classpath
> 4 - run nutch with the correct system property
> While steps 1 and 2 seem obvious, 3 and 4 require knowledge of plugin 
> internals which does not feel right for nutch and plugin users. Even more, 
> the jcifs.jar would exist twice on the classpath and could even cause further 
> problems during runtime.





[jira] [Commented] (NUTCH-2429) Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211474#comment-16211474
 ] 

ASF GitHub Bot commented on NUTCH-2429:
---

HiranChaudhuri commented on a change in pull request #222: NUTCH-2429 Fix 
Plugin System to allow protocol plugins to bundle their URLStreamHandlers
URL: https://github.com/apache/nutch/pull/222#discussion_r145779980
 
 

 ##
 File path: src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java
 ##
 @@ -0,0 +1,85 @@
+package org.apache.nutch.plugin;
+
+import java.lang.ref.WeakReference;
+import java.net.URL;
+import java.net.URLStreamHandler;
+import java.util.ArrayList;
+
+import org.mortbay.log.Log;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class URLStreamHandlerFactory
+implements java.net.URLStreamHandlerFactory {
+  
+  protected static final Logger LOG = LoggerFactory
+  .getLogger(URLStreamHandlerFactory.class);
+  
+  /** The singleton instance. */
+  private static URLStreamHandlerFactory instance;
+  
+  /** Here we register all PluginRepositories. */
+  private ArrayList<WeakReference<PluginRepository>> prs;
+  
+  static {
+instance = new URLStreamHandlerFactory();
+URL.setURLStreamHandlerFactory(instance);
+LOG.info("Registered URLStreamHandlerFactory with the JVM.");
+  }
+  
+  private URLStreamHandlerFactory() {
+LOG.debug("URLStreamHandlerFactory()");
+prs = new ArrayList<>();
+  }
+
+  /** Return the singleton instance of this class. */
+  public static URLStreamHandlerFactory getInstance() {
+LOG.debug("getInstance()");
 
 Review comment:
   That is why it uses the debug level logging. If such logging should not go 
to production, you could as well question all debug messages.




> Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
> -
>
> Key: NUTCH-2429
> URL: https://issues.apache.org/jira/browse/NUTCH-2429
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl
>Affects Versions: 1.14
> Environment: Tested on both Nutch 1.13 and 1.14 in Ubuntu Linux with 
> OpenJDK 1.8.
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>
> While trying to use the protocol-smb plugin (which is not part of the Nutch 
> distribution) I realized there are four steps to successfully make use of a 
> protocol plugin:
> 1 - put the artifact into the plugins directory
> 2 - modify Nutch configuration files to allow smb:// urls plus include the 
> plugin to the loaded list
> 3 - extract jcifs.jar and place it on the system classpath
> 4 - run nutch with the correct system property
> While steps 1 and 2 seem obvious, 3 and 4 require knowledge of plugin 
> internals which does not feel right for nutch and plugin users. Even more, 
> the jcifs.jar would exist twice on the classpath and could even cause further 
> problems during runtime.





[jira] [Commented] (NUTCH-2429) Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211472#comment-16211472
 ] 

ASF GitHub Bot commented on NUTCH-2429:
---

HiranChaudhuri commented on a change in pull request #222: NUTCH-2429 Fix 
Plugin System to allow protocol plugins to bundle their URLStreamHandlers
URL: https://github.com/apache/nutch/pull/222#discussion_r145779941
 
 

 ##
 File path: src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java
 ##
 @@ -0,0 +1,85 @@
+package org.apache.nutch.plugin;
+
+import java.lang.ref.WeakReference;
+import java.net.URL;
+import java.net.URLStreamHandler;
+import java.util.ArrayList;
+
+import org.mortbay.log.Log;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class URLStreamHandlerFactory
+implements java.net.URLStreamHandlerFactory {
+  
+  protected static final Logger LOG = LoggerFactory
+  .getLogger(URLStreamHandlerFactory.class);
+  
+  /** The singleton instance. */
+  private static URLStreamHandlerFactory instance;
+  
+  /** Here we register all PluginRepositories. */
+  private ArrayList<WeakReference<PluginRepository>> prs;
+  
+  static {
+instance = new URLStreamHandlerFactory();
+URL.setURLStreamHandlerFactory(instance);
+LOG.info("Registered URLStreamHandlerFactory with the JVM.");
+  }
+  
+  private URLStreamHandlerFactory() {
+LOG.debug("URLStreamHandlerFactory()");
 
 Review comment:
   That is why it uses the debug level logging. If such logging should not go 
to production, you could as well question all debug messages.




> Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
> -
>
> Key: NUTCH-2429
> URL: https://issues.apache.org/jira/browse/NUTCH-2429
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl
>Affects Versions: 1.14
> Environment: Tested on both Nutch 1.13 and 1.14 in Ubuntu Linux with 
> OpenJDK 1.8.
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>
> While trying to use the protocol-smb plugin (which is not part of the Nutch 
> distribution) I realized there are four steps to successfully make use of a 
> protocol plugin:
> 1 - put the artifact into the plugins directory
> 2 - modify Nutch configuration files to allow smb:// urls plus include the 
> plugin to the loaded list
> 3 - extract jcifs.jar and place it on the system classpath
> 4 - run nutch with the correct system property
> While steps 1 and 2 seem obvious, 3 and 4 require knowledge of plugin 
> internals which does not feel right for nutch and plugin users. Even more, 
> the jcifs.jar would exist twice on the classpath and could even cause further 
> problems during runtime.





[jira] [Commented] (NUTCH-2429) Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211470#comment-16211470
 ] 

ASF GitHub Bot commented on NUTCH-2429:
---

HiranChaudhuri commented on a change in pull request #222: NUTCH-2429 Fix 
Plugin System to allow protocol plugins to bundle their URLStreamHandlers
URL: https://github.com/apache/nutch/pull/222#discussion_r145779721
 
 

 ##
 File path: src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java
 ##
 @@ -0,0 +1,85 @@
+package org.apache.nutch.plugin;
 
 Review comment:
   Done




> Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
> -
>
> Key: NUTCH-2429
> URL: https://issues.apache.org/jira/browse/NUTCH-2429
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl
>Affects Versions: 1.14
> Environment: Tested on both Nutch 1.13 and 1.14 in Ubuntu Linux with 
> OpenJDK 1.8.
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>
> While trying to use the protocol-smb plugin (which is not part of the Nutch 
> distribution) I realized there are four steps to successfully make use of a 
> protocol plugin:
> 1 - put the artifact into the plugins directory
> 2 - modify Nutch configuration files to allow smb:// urls plus include the 
> plugin to the loaded list
> 3 - extract jcifs.jar and place it on the system classpath
> 4 - run nutch with the correct system property
> While steps 1 and 2 seem obvious, 3 and 4 require knowledge of plugin 
> internals which does not feel right for nutch and plugin users. Even more, 
> the jcifs.jar would exist twice on the classpath and could even cause further 
> problems during runtime.





[jira] [Commented] (NUTCH-2429) Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers

2017-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211465#comment-16211465
 ] 

ASF GitHub Bot commented on NUTCH-2429:
---

HiranChaudhuri commented on issue #222: NUTCH-2429 Fix Plugin System to allow 
protocol plugins to bundle their URLStreamHandlers
URL: https://github.com/apache/nutch/pull/222#issuecomment-337990176
 
 
   I think I applied the requested changes. This branch is still showing 
'changes requested'. How does this move on?




> Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
> -
>
> Key: NUTCH-2429
> URL: https://issues.apache.org/jira/browse/NUTCH-2429
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl
>Affects Versions: 1.14
> Environment: Tested on both Nutch 1.13 and 1.14 in Ubuntu Linux with 
> OpenJDK 1.8.
>Reporter: Hiran Chaudhuri
> Fix For: 1.14
>
>
> While trying to use the protocol-smb plugin (which is not part of the Nutch 
> distribution) I realized there are four steps to successfully make use of a 
> protocol plugin:
> 1 - put the artifact into the plugins directory
> 2 - modify Nutch configuration files to allow smb:// urls plus include the 
> plugin to the loaded list
> 3 - extract jcifs.jar and place it on the system classpath
> 4 - run nutch with the correct system property
> While steps 1 and 2 seem obvious, 3 and 4 require knowledge of plugin 
> internals which does not feel right for nutch and plugin users. Even more, 
> the jcifs.jar would exist twice on the classpath and could even cause further 
> problems during runtime.





[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-10-19 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211126#comment-16211126
 ] 

Sebastian Nagel commented on NUTCH-2443:


Keep it simple for now, and open a new issue to work on it systematically? Not 
missing any link means some work: there are many attributes where URLs appear. 
Of course, only  and  are really frequent, see 
https://gist.github.com/sebastian-nagel/ff4379f9e2115d3c922416d520274b86

> Extract links from the video tag with the parse-html plugin
> ---
>
> Key: NUTCH-2443
> URL: https://issues.apache.org/jira/browse/NUTCH-2443
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.13
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.14
>
>
> At the moment the {{parse-html}} extracts links from the tags {{a, area, 
> form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow 
> extracting links to binary files (images) extracting links also from the 
> {{video}} tag should be supported.
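
A table-driven extractor makes the proposed change easy to picture: supporting another tag is one more entry in a map from element name to the attribute that carries the URL. The sketch below is illustrative only. The class name and the table are hypothetical, the attribute chosen for each tag is an assumption (the issue lists tags, not attributes), and the JDK XML parser is used on a well-formed snippet for brevity, whereas real-world HTML would need a tolerant HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical table-driven link extractor for well-formed markup. */
class TagLinkExtractor {
  // Element name -> attribute assumed to hold the URL.
  private static final Map<String, String> LINK_ATTRS = new LinkedHashMap<>();
  static {
    LINK_ATTRS.put("a", "href");
    LINK_ATTRS.put("area", "href");
    LINK_ATTRS.put("form", "action");
    LINK_ATTRS.put("frame", "src");
    LINK_ATTRS.put("iframe", "src");
    LINK_ATTRS.put("script", "src");
    LINK_ATTRS.put("link", "href");
    LINK_ATTRS.put("img", "src");
    LINK_ATTRS.put("video", "src"); // the tag this issue asks to support
  }

  /** Returns the link attribute values in document order. */
  static List<String> extract(String xhtml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Input is not well-formed markup", e);
    }
    List<String> links = new ArrayList<>();
    NodeList all = doc.getElementsByTagName("*"); // every element, in document order
    for (int i = 0; i < all.getLength(); i++) {
      Element el = (Element) all.item(i);
      String attr = LINK_ATTRS.get(el.getTagName().toLowerCase(Locale.ROOT));
      if (attr != null && el.hasAttribute(attr)) {
        links.add(el.getAttribute(attr));
      }
    }
    return links;
  }

  public static void main(String[] args) {
    String page = "<html><body><a href=\"page.html\">x</a>"
        + "<video src=\"clip.mp4\"/></body></html>";
    System.out.println(extract(page));
  }
}
```

Keeping the tags in a single table also leaves room for making the set configurable later, as the description notes is already the case for form.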





[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-10-19 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211065#comment-16211065
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2443:
---

It's not hard to add more tags, but honestly I'm seeing a lot of those tags 
with URL-valued attributes for the first time. The question is: should we have 
them _all_ in the actual implementation? 

> Extract links from the video tag with the parse-html plugin
> ---
>
> Key: NUTCH-2443
> URL: https://issues.apache.org/jira/browse/NUTCH-2443
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.13
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.14
>
>
> At the moment the {{parse-html}} extracts links from the tags {{a, area, 
> form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow 
> extracting links to binary files (images) extracting links also from the 
> {{video}} tag should be supported.


