[jira] [Updated] (NUTCH-2163) Utilize current JVM threads to augment URLClassLoader with newly discovered classes
[ https://issues.apache.org/jira/browse/NUTCH-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2163:
---
    Summary: Utilize current JVM threads to augment URLClassLoader with newly discovered classes  (was: Utilize current JVM threads to augment URLClassLoader with newlt discovered classes)

> Utilize current JVM threads to augment URLClassLoader with newly discovered classes
> ---
>
>                 Key: NUTCH-2163
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2163
>             Project: Nutch
>          Issue Type: Bug
>          Components: util
>            Reporter: Lewis John McGibbney
>
> I found [this code|https://github.com/apache/nutch/compare/trunk...infolinks:nutch-osgi] a while back and have been thinking about OSGi again for Nutch.
> Our justification here is that we want to dynamically create [InteractiveSeleniumHandlers|https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java] and inject the code into the .job artifacts, which can then be used in the next round of fetching.
> The code looks like the following:
> {code}
> +List<URL> nutchConfigurationClasspathURLs = new ArrayList<URL>();
> +
> +// Collect classpath URLs from Hadoop's Configuration class CL
> +URLClassLoader hadoopBundleConfigurationClassLoader = (URLClassLoader) conf.getClassLoader();
> +for (URL hadoopBundleClasspathURL : hadoopBundleConfigurationClassLoader.getURLs()) {
> +    nutchConfigurationClasspathURLs.add(hadoopBundleClasspathURL);
> +}
> +
> +// Append classpath URLs from current thread, which ostensibly include a Nutch job file
> +URLClassLoader tccl = (URLClassLoader) Thread.currentThread().getContextClassLoader();
> +for (URL tcclClasspathURL : tccl.getURLs()) {
> +    nutchConfigurationClasspathURLs.add(tcclClasspathURL);
> +}
> +
> +URLClassLoader nutchConfigurationClassLoader = new URLClassLoader(nutchConfigurationClasspathURLs.toArray(new URL[0]));
> +// Reset the Configuration object's CL to the new one
> +conf.setClassLoader(nutchConfigurationClassLoader);
> {code}
> The Thread.currentThread().getContextClassLoader() call is the secret sauce... however, I just wonder what thoughts are about this approach?
> We have, from time to time over the years, discussed [Nutch|http://wiki.apache.org/nutch/NutchOSGi] and I spoke with [~bdelacretaz] a good few years ago @ApacheCon, but I don't have the time to implement total OSGi coverage of the Nutch codebase.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (NUTCH-2163) Utilize current JVM threads to augment URLClassLoader with newlt discovered classes
Lewis John McGibbney created NUTCH-2163:
---

             Summary: Utilize current JVM threads to augment URLClassLoader with newlt discovered classes
                 Key: NUTCH-2163
                 URL: https://issues.apache.org/jira/browse/NUTCH-2163
             Project: Nutch
          Issue Type: Bug
          Components: util
            Reporter: Lewis John McGibbney

I found [this code|https://github.com/apache/nutch/compare/trunk...infolinks:nutch-osgi] a while back and have been thinking about OSGi again for Nutch.
Our justification here is that we want to dynamically create [InteractiveSeleniumHandlers|https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java] and inject the code into the .job artifacts, which can then be used in the next round of fetching.
The code looks like the following:
{code}
+List<URL> nutchConfigurationClasspathURLs = new ArrayList<URL>();
+
+// Collect classpath URLs from Hadoop's Configuration class CL
+URLClassLoader hadoopBundleConfigurationClassLoader = (URLClassLoader) conf.getClassLoader();
+for (URL hadoopBundleClasspathURL : hadoopBundleConfigurationClassLoader.getURLs()) {
+    nutchConfigurationClasspathURLs.add(hadoopBundleClasspathURL);
+}
+
+// Append classpath URLs from current thread, which ostensibly include a Nutch job file
+URLClassLoader tccl = (URLClassLoader) Thread.currentThread().getContextClassLoader();
+for (URL tcclClasspathURL : tccl.getURLs()) {
+    nutchConfigurationClasspathURLs.add(tcclClasspathURL);
+}
+
+URLClassLoader nutchConfigurationClassLoader = new URLClassLoader(nutchConfigurationClasspathURLs.toArray(new URL[0]));
+// Reset the Configuration object's CL to the new one
+conf.setClassLoader(nutchConfigurationClassLoader);
{code}
The Thread.currentThread().getContextClassLoader() call is the secret sauce... however, I just wonder what thoughts are about this approach?
We have, from time to time over the years, discussed [Nutch|http://wiki.apache.org/nutch/NutchOSGi] and I spoke with [~bdelacretaz] a good few years ago @ApacheCon, but I don't have the time to implement total OSGi coverage of the Nutch codebase.
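One caveat with the snippet above is that both casts to URLClassLoader fail with a ClassCastException whenever a loader is of some other type (on Java 9 and later, for instance, the application class loader is no longer a URLClassLoader). A minimal sketch of a more defensive merge follows. The class and method names (ClassLoaderMergeSketch, collectUrls, merge) are made up for illustration and are not part of Nutch or Hadoop, and the explicit parent argument is an assumption rather than something the original patch does.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Hadoop.
final class ClassLoaderMergeSketch {

    // Copies the loader's URLs into the map, but only if the loader really is a
    // URLClassLoader. Null and other loader types are skipped instead of cast.
    // URLs are keyed by their string form so duplicates collapse without
    // calling URL.equals(), which may perform host name resolution.
    static void collectUrls(ClassLoader loader, Map<String, URL> urls) {
        if (loader instanceof URLClassLoader) {
            for (URL url : ((URLClassLoader) loader).getURLs()) {
                urls.putIfAbsent(url.toExternalForm(), url);
            }
        }
    }

    // Builds a new URLClassLoader whose search path is the URLs of the first
    // loader followed by any new URLs from the second, in encounter order.
    static URLClassLoader merge(ClassLoader first, ClassLoader second, ClassLoader parent) {
        Map<String, URL> urls = new LinkedHashMap<>();
        collectUrls(first, urls);
        collectUrls(second, urls);
        return new URLClassLoader(urls.values().toArray(new URL[0]), parent);
    }
}
```

In the context of the patch, this would presumably be called as merge(conf.getClassLoader(), Thread.currentThread().getContextClassLoader(), someParent) and the result handed to conf.setClassLoader(...). Which parent loader is appropriate depends on how the job is launched, so treat that choice as open.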
[jira] [Commented] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index
[ https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994168#comment-14994168 ]

Lewis John McGibbney commented on NUTCH-2162:
---

Ack. I also got it working well with Solr and ES, but an indexing engine is not a prerequisite, as indicated in the crawl script. Making it mandatory from within the GUI is backwards IMHO.
I think driving for crawl metrics via a statistics panel would be a good goal for this web app. An indexing engine may not be required for that either if we can make the data accessible through REST.

On Friday, November 6, 2015, Chris A. Mattmann (JIRA)

--
*Lewis*

> Nutch Webapp Crawl fails as it tries to index
> ---
>
>                 Key: NUTCH-2162
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2162
>             Project: Nutch
>          Issue Type: Bug
>          Components: web gui
>    Affects Versions: 1.11
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11
>
>         Attachments: nutch_webapp.log
>
>
> Right now a crawl task fails on the trunk version of the WebApp due to it attempting to index. No indexer is defined by default so this is a major bug.
The Nutch Webapp
Hey Everyone,

So I just tried the Nutch Webapp for 1.11. It’s brittle, but works. I am REALLY happy with it. Great work Fjodor Vershinin and Lewis on making the application!

Since it’s in Wicket and I know my way around Wicket, I’m going to work in 1.12 and beyond on really improving this and making it a cool crawl visualization framework. We have learned a lot in Memex about how to do interactive crawl dashboards. I anticipate flowing a lot of these lessons learned into the web app. D3, Bokeh and other cool updates coming! :)

I am also going to start on a wiki page showing people how to use the web app and get it up and running.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
[jira] [Commented] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index
[ https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994029#comment-14994029 ]

Chris A. Mattmann commented on NUTCH-2162:
---

So I tried this out. It actually works fine as long as you keep everything at the defaults, e.g., if you install Solr on port 8983, install the Nutch schema in that Solr, and install it into collection 1 (the default). I have it fully working with that config. It's brittle, but it doesn't require a code update and it works.

One other thing to note - you can't change properties (yet) from the Nutch config, so you *must* update http.agent.name to something in your runtime/*/conf/nutch-{site|default}.xml file before starting the web services REST layer and using the Wicket App.

One other thing we should think about - Maven - and then Maven WAR overlays here once we get a version of Nutch working with Maven.
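For the http.agent.name step described above, a minimal nutch-site.xml fragment might look like the following. This is only a sketch: the agent name value is a placeholder chosen for illustration, and the file would sit under whichever runtime/*/conf/ directory matches your deployment.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder value: replace with a name that identifies your crawler -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```

Since properties cannot yet be changed from the web app, the file would need to be edited before the REST layer is started.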
[jira] [Commented] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index
[ https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14993376#comment-14993376 ]

Lewis John McGibbney commented on NUTCH-2162:
---

In all honesty, a workaround for this is merely to comment out the following line:
https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/webui/client/impl/RemoteCommandsBatchFactory.java#L59
However, the correct solution is to bake in optional logic which allows the user to determine whether indexing is required or not. I'll have a crack when I can.
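The "optional logic" idea could be sketched roughly as follows. This is not the actual RemoteCommandsBatchFactory code: the class name, the method, the plain string step names, and the placement of the index step inside each round are all assumptions made for illustration. The only point is that the index step is appended when the user asks for it and left out otherwise.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real Nutch web UI batch factory.
final class CrawlBatchSketch {

    // Returns the ordered crawl steps for the given number of rounds.
    // The "index" step is added to a round only when indexingEnabled is true.
    static List<String> buildSteps(int rounds, boolean indexingEnabled) {
        List<String> steps = new ArrayList<>();
        steps.add("inject");
        for (int round = 0; round < rounds; round++) {
            steps.add("generate");
            steps.add("fetch");
            steps.add("parse");
            steps.add("updatedb");
            if (indexingEnabled) {
                steps.add("index");
            }
        }
        return steps;
    }
}
```

With a flag like this exposed in the GUI, a user with no indexer configured could run buildSteps(rounds, false) and the crawl would never reach the step that currently fails.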