[jira] [Updated] (NUTCH-2163) Utilize current JVM threads to augment URLClassLoader with newly discovered classes

2015-11-06 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2163:

Summary: Utilize current JVM threads to augment URLClassLoader with newly 
discovered classes  (was: Utilize current JVM threads to augment URLClassLoader 
with newlt discovered classes)

> Utilize current JVM threads to augment URLClassLoader with newly discovered 
> classes
> ---
>
> Key: NUTCH-2163
> URL: https://issues.apache.org/jira/browse/NUTCH-2163
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Reporter: Lewis John McGibbney
>
> I found [this 
> code|https://github.com/apache/nutch/compare/trunk...infolinks:nutch-osgi] a 
> while back and have been thinking about OSGi again for Nutch. 
> Our justification here is that we want to dynamically create 
> [InteractiveSeleniumHandlers|https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java]
>  and inject the code into the .job artifacts, which can then be used in the 
> next round of fetching. 
> The code looks like the following:
> {code}
> +List<URL> nutchConfigurationClasspathURLs = new ArrayList<URL>();
> +
> +// Collect classpath URLs from Hadoop's Configuration class CL
> +URLClassLoader hadoopBundleConfigurationClassLoader = (URLClassLoader) conf.getClassLoader();
> +for (URL hadoopBundleClasspathURL : hadoopBundleConfigurationClassLoader.getURLs()) {
> +  nutchConfigurationClasspathURLs.add(hadoopBundleClasspathURL);
> +}
> +
> +// Append classpath URLs from current thread, which ostensibly include a Nutch job file
> +URLClassLoader tccl = (URLClassLoader) Thread.currentThread().getContextClassLoader();
> +for (URL tcclClasspathURL : tccl.getURLs()) {
> +  nutchConfigurationClasspathURLs.add(tcclClasspathURL);
> +}
> +
> +URLClassLoader nutchConfigurationClassLoader = new URLClassLoader(nutchConfigurationClasspathURLs.toArray(new URL[0]));
> +// Reset the Configuration object's CL to the new one
> +conf.setClassLoader(nutchConfigurationClassLoader);
> {code}
> The Thread.currentThread().getContextClassLoader() call is the secret sauce... 
> however, I wonder what people think of this approach?
> We have, from time to time over the years, discussed OSGi for 
> [Nutch|http://wiki.apache.org/nutch/NutchOSGi] and I spoke with 
> [~bdelacretaz] a good few years ago @ApacheCon, but I don't have the time to 
> implement total OSGi coverage of the Nutch codebase. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2163) Utilize current JVM threads to augment URLClassLoader with newlt discovered classes

2015-11-06 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2163:
---

 Summary: Utilize current JVM threads to augment URLClassLoader 
with newlt discovered classes
 Key: NUTCH-2163
 URL: https://issues.apache.org/jira/browse/NUTCH-2163
 Project: Nutch
  Issue Type: Bug
  Components: util
Reporter: Lewis John McGibbney


I found [this 
code|https://github.com/apache/nutch/compare/trunk...infolinks:nutch-osgi] a 
while back and have been thinking about OSGi again for Nutch. 
Our justification here is that we want to dynamically create 
[InteractiveSeleniumHandlers|https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java]
 and inject the code into the .job artifacts, which can then be used in the next 
round of fetching. 
The code looks like the following:
{code}
+List<URL> nutchConfigurationClasspathURLs = new ArrayList<URL>();
+
+// Collect classpath URLs from Hadoop's Configuration class CL
+URLClassLoader hadoopBundleConfigurationClassLoader = (URLClassLoader) conf.getClassLoader();
+for (URL hadoopBundleClasspathURL : hadoopBundleConfigurationClassLoader.getURLs()) {
+  nutchConfigurationClasspathURLs.add(hadoopBundleClasspathURL);
+}
+
+// Append classpath URLs from current thread, which ostensibly include a Nutch job file
+URLClassLoader tccl = (URLClassLoader) Thread.currentThread().getContextClassLoader();
+for (URL tcclClasspathURL : tccl.getURLs()) {
+  nutchConfigurationClasspathURLs.add(tcclClasspathURL);
+}
+
+URLClassLoader nutchConfigurationClassLoader = new URLClassLoader(nutchConfigurationClasspathURLs.toArray(new URL[0]));
+// Reset the Configuration object's CL to the new one
+conf.setClassLoader(nutchConfigurationClassLoader);
{code}
The Thread.currentThread().getContextClassLoader() call is the secret sauce... 
however, I wonder what people think of this approach?
We have, from time to time over the years, discussed OSGi for 
[Nutch|http://wiki.apache.org/nutch/NutchOSGi] and I spoke with [~bdelacretaz] 
a good few years ago @ApacheCon, but I don't have the time to implement total 
OSGi coverage of the Nutch codebase. 
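
For anyone who wants to try the idea outside of Nutch and Hadoop, here is a
minimal, self-contained sketch of the same URL-merging technique. It is not
the code from the linked branch: the class name CombinedClassLoaderSketch, the
helper names collectUrls and combine, and the two file paths are made up for
illustration. It also swaps the unchecked (URLClassLoader) casts for an
instanceof check, on the assumption that some loaders will not be
URLClassLoaders (on Java 9 and later the application class loader is not one),
and it passes the parent loader explicitly rather than relying on the default.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class CombinedClassLoaderSketch {

  // Gather classpath URLs from every loader that is a URLClassLoader.
  // Loaders of any other type (or null) are skipped rather than cast.
  static List<URL> collectUrls(ClassLoader... loaders) {
    List<URL> urls = new ArrayList<>();
    for (ClassLoader loader : loaders) {
      if (loader instanceof URLClassLoader) {
        for (URL url : ((URLClassLoader) loader).getURLs()) {
          urls.add(url);
        }
      }
    }
    return urls;
  }

  // Build a single loader over the combined URLs, delegating to 'parent'.
  static URLClassLoader combine(ClassLoader parent, ClassLoader... loaders) {
    return new URLClassLoader(collectUrls(loaders).toArray(new URL[0]), parent);
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical stand-ins for the Configuration loader and a Nutch job file.
    URLClassLoader confLoader = new URLClassLoader(
        new URL[] { Paths.get("conf-classes").toUri().toURL() });
    URLClassLoader jobLoader = new URLClassLoader(
        new URL[] { Paths.get("nutch.job").toUri().toURL() });
    ClassLoader tccl = Thread.currentThread().getContextClassLoader();

    URLClassLoader combined = combine(tccl, confLoader, jobLoader, tccl);
    for (URL url : combined.getURLs()) {
      System.out.println(url);
    }
  }
}
```

In the snippet above the result would then be handed to
conf.setClassLoader(...). Whether the thread context class loader contributes
any URLs depends on the JVM and on how the job was launched, so the printed
list will vary.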





[jira] [Commented] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index

2015-11-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994168#comment-14994168
 ] 

Lewis John McGibbney commented on NUTCH-2162:
-

Ack.
I also got it working well with Solr and ES, but an indexing engine is not a
prerequisite, as indicated in the crawl script. Making it mandatory from
within the GUI is backwards IMHO.
I think driving for crawl metrics via a statistics panel would be a good goal
for this web app. An indexing engine may not be required for that either if
we can make data accessible through REST.

On Friday, November 6, 2015, Chris A. Mattmann (JIRA) 



-- 
*Lewis*


> Nutch Webapp Crawl fails as it tries to index
> -
>
> Key: NUTCH-2162
> URL: https://issues.apache.org/jira/browse/NUTCH-2162
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: nutch_webapp.log
>
>
> Right now a crawl task fails on the trunk version of the WebApp due to it 
> attempting to index. No indexer is defined by default so this is a major bug.





The Nutch Webapp

2015-11-06 Thread Mattmann, Chris A (3980)
Hey Everyone,

So I just tried the Nutch Webapp for 1.11. It’s brittle, but works.
I am REALLY happy with it. Great work Fjodor Vershinin and Lewis on
making the application!

Since it’s in Wicket and I know my way around Wicket I’m going to
work in 1.12 and beyond on really improving this and making it
a cool crawl visualization framework. We have learned a lot in Memex
about how to do interactive crawl dashboards. I anticipate flowing
a lot of these lessons learned into the web app. D3, Bokeh and other
cool updates coming! :)

I am also going to start on a wiki page showing people how to use
the web app and get it up and running.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





[jira] [Commented] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index

2015-11-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994029#comment-14994029
 ] 

Chris A. Mattmann commented on NUTCH-2162:
--

So I tried this out. It actually works fine as long as you keep everything at 
the defaults, e.g., if you install Solr on 8983, install the Nutch schema in 
that Solr, and install it into the default collection 1. I have it fully 
working with that config. It's brittle but doesn't require a code update and it 
works. 

One other thing to note - you can't change properties (yet) from the Nutch 
config, so you *must* update http.agent.name to something in your 
runtime/*/conf/nutch-{site|default}.xml file before starting the web services 
REST layer and using the Wicket App.
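
For reference, a minimal sketch of what that property entry could look like
inside the <configuration> element of conf/nutch-site.xml. The value
MyNutchCrawler is only a placeholder; pick a name that identifies your own
crawler.

```xml
<property>
  <name>http.agent.name</name>
  <value>MyNutchCrawler</value>
</property>
```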

One other thing we should think about - Maven - and then Maven WAR overlays 
here once we get a version of Nutch working with Maven.






[jira] [Commented] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index

2015-11-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14993376#comment-14993376
 ] 

Lewis John McGibbney commented on NUTCH-2162:
-

In all honesty, a workaround for this is merely to comment out the following 
line:
https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/webui/client/impl/RemoteCommandsBatchFactory.java#L59
However, the correct solution is to bake in optional logic which allows the 
user to determine whether indexing is required or not. I'll have a crack when I 
can.
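
To make the "optional logic" idea concrete, here is a rough, self-contained
sketch. It does not use the real RemoteCommandsBatchFactory API: the class
CrawlRoundSketch, the method buildRound and the step names are all
hypothetical, and the only point is that the index step is appended when the
user asks for it instead of unconditionally.

```java
import java.util.ArrayList;
import java.util.List;

class CrawlRoundSketch {

  // Build the ordered steps for one crawl round.
  // Step names are illustrative strings, not the web app's command types.
  static List<String> buildRound(boolean indexingEnabled) {
    List<String> steps = new ArrayList<>();
    steps.add("GENERATE");
    steps.add("FETCH");
    steps.add("PARSE");
    steps.add("UPDATEDB");
    if (indexingEnabled) {
      // Only schedule indexing when an indexer has been configured.
      steps.add("INDEX");
    }
    return steps;
  }

  public static void main(String[] args) {
    System.out.println(buildRound(false));
    System.out.println(buildRound(true));
  }
}
```

The flag could come from a checkbox in the GUI or from a configuration
property; either way a crawl with no indexer defined would simply skip the
last step.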



