Running individual test classes from nutch script cont'd
Hi,

OK, this stems from a discussion on the user@ list a while ago [1] and my discovery of NUTCH-672 yesterday. I attached a patch, which fails completely, as I hadn't yet uncovered the things I now know. The original patch submitted for the issue would have been fine for <= Nutch 1.2, but now that the file structure has changed in >= Nutch 1.3, both pre and post build with ant, it is no longer as trivial as it looks.

Basically, the additions to the bin/nutch script would be something similar to this:

      echo "  plugin            load a plugin and run one of its classes main()"
      echo "  junit             runs the given JUnit test"
      echo " or"
      echo "  CLASSNAME         run the class named CLASSNAME"
    --
    elif [ "$COMMAND" = "plugin" ] ; then
      CLASS=org.apache.nutch.plugin.PluginRepository
    elif [ "$COMMAND" = "junit" ] ; then
      CLASSPATH=$CLASSPATH:src/test/
      CLASS='junit.textui.TestRunner'
    else
      CLASS=$COMMAND

This would enable us to execute, for example:

    bin/nutch junit org.apache.nutch.crawl.CrawlDBTestUtil

However, the problem we face is that /lib no longer exists under /branch-1.4. It is instead located under /branch-1.4/runtime/local/lib or, alternatively, in the /lib directory in snapshop.job in deploy mode. I'm therefore getting a class not found error if I try to run it:

    Exception in thread "main" java.lang.NoClassDefFoundError: junit/textui/TestRunner
    Caused by: java.lang.ClassNotFoundException: junit.textui.TestRunner
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    Could not find the main class: junit.textui.TestRunner. Program will exit.

One observation I have: regardless of whether we wish to run JUnit tests in a development or a production environment (i.e. from source or from post-build runtime code), the correct command line options would have to be specified within the source nutch script.

I've been looking at this for a while and haven't really made much progress apart from the above observations. Can anyone shed some light, or even suggest how we could correctly configure a patch for the nutch script?

Thank you

[1] http://www.mail-archive.com/user@nutch.apache.org/msg03207.html

--
*Lewis*
[jira] [Updated] (NUTCH-1057) Make fetcher thread time out configurable
[ https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1057:
    Attachment: NUTCH-1057-1.4-1.patch

Patch for 1.4. There's also a diff for NUTCH-1037 in the config file which hasn't been committed yet.

Make fetcher thread time out configurable
    Key: NUTCH-1057
    URL: https://issues.apache.org/jira/browse/NUTCH-1057
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.4, 2.0
    Attachments: NUTCH-1057-1.4-1.patch

The fetcher sets a timeout value based on half the mapred.task.timeout value. This value is not appropriate in all cases. Add an option (fetcher.thread.timeout.divisor) to configure the divisor used and default it to two.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
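To make the proposal concrete, here is a minimal sketch of the calculation described in the issue. It is not the code from the attached patch: the class and method names are hypothetical, plain java.util.Properties stands in for the Nutch/Hadoop configuration object, and the 600000 ms fallback for mapred.task.timeout is an assumption made for the illustration.

```java
import java.util.Properties;

/** Hypothetical sketch; not taken from NUTCH-1057-1.4-1.patch. */
final class FetcherTimeoutSketch {

    // Property names as given in the issue description.
    static final String TASK_TIMEOUT_KEY = "mapred.task.timeout";
    static final String DIVISOR_KEY = "fetcher.thread.timeout.divisor";

    /** Returns the fetcher thread timeout in milliseconds. */
    static long threadTimeoutMillis(Properties conf) {
        // Assumed fallback of 600000 ms when the task timeout is not set.
        long taskTimeout = Long.parseLong(conf.getProperty(TASK_TIMEOUT_KEY, "600000"));
        // The issue asks for the divisor to default to two.
        int divisor = Integer.parseInt(conf.getProperty(DIVISOR_KEY, "2"));
        if (divisor <= 0) {
            throw new IllegalArgumentException(DIVISOR_KEY + " must be positive");
        }
        return taskTimeout / divisor;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(TASK_TIMEOUT_KEY, "600000");
        System.out.println(threadTimeoutMillis(conf)); // 300000 (default divisor)
        conf.setProperty(DIVISOR_KEY, "4");
        System.out.println(threadTimeoutMillis(conf)); // 150000
    }
}
```

With the divisor left unset, the result matches the current behaviour of halving the task timeout, so existing setups would be unaffected.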
[jira] [Updated] (NUTCH-1043) Add pattern for filtering .js in default url filters
[ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1043:
    Patch Info: [Patch Available]

Add pattern for filtering .js in default url filters
    Key: NUTCH-1043
    URL: https://issues.apache.org/jira/browse/NUTCH-1043
    Project: Nutch
    Issue Type: Task
    Affects Versions: 1.4, 2.0
    Reporter: Julien Nioche
    Priority: Minor
    Fix For: 1.4, 2.0
    Attachments: NUTCH-1043.patch

The Javascript parser is not used by default as it is extremely noisy. However, the default URL filters do not filter out URLs ending in .js, and the default parser (Tika) can't parse them. In a nutshell, we are fetching URLs that we know can't be parsed. I suggest that we add a regex to the default URL filters. If people are interested in fetching and parsing .js files, they can activate the plugin in their conf and remove the regex from the URL filters.
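As an illustration of the kind of rule being suggested, the sketch below rejects URLs ending in .js using java.util.regex. The actual expression in NUTCH-1043.patch is not shown in this message, so the pattern, the class name and the accept method are assumptions, not the Nutch URL filter API.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of a ".js" exclusion rule; not the attached patch. */
final class JsUrlFilterSketch {

    // Assumed pattern: the URL ends in ".js", ignoring case.
    private static final Pattern ENDS_IN_JS =
            Pattern.compile("\\.js$", Pattern.CASE_INSENSITIVE);

    /** Returns true if the URL should be kept, false if it is filtered out. */
    static boolean accept(String url) {
        return !ENDS_IN_JS.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/app.js"));     // false
        System.out.println(accept("http://example.com/index.html")); // true
        System.out.println(accept("http://example.com/data.json"));  // true
    }
}
```

Anchoring on the end of the URL keeps the rule narrow: a path such as data.json is untouched because it does not end in .js.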
[jira] [Updated] (NUTCH-1023) Trivial error in error message for org.apache.nutch.crawl.LinkDbReader
[ https://issues.apache.org/jira/browse/NUTCH-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1023:
    Patch Info: [Patch Available]

Trivial error in error message for org.apache.nutch.crawl.LinkDbReader
    Key: NUTCH-1023
    URL: https://issues.apache.org/jira/browse/NUTCH-1023
    Project: Nutch
    Issue Type: Improvement
    Components: linkdb
    Affects Versions: 1.3
    Reporter: Lewis John McGibbney
    Assignee: Lewis John McGibbney
    Priority: Trivial
    Fix For: 1.4
    Attachments: LinkDbReader-trivial.patch

The following line in the above class has a trivial error in syntax before the -dump parameter. Instead of a curly bracket, it should be consistent with the round bracket.

    126  System.err.println("Usage: LinkDbReader linkdb {-dump out_dir | -url url)");
[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-961:
    Attachment: NUTCH-961-1.4-dombuilder-1.patch

With BP enabled you can get a java.util.EmptyStackException from DOMBuilder. This patch fixes that by adding another check around the peek and pop methods.

http://mail-archives.apache.org/mod_mbox/nutch-user/201107.mbox/%3c201107151523.18511.markus.jel...@openindex.io%3E

There is no answer yet as to why this can occur, but I think checking before pop or peek is good anyway.

Expose Tika's boilerpipe support
    Key: NUTCH-961
    URL: https://issues.apache.org/jira/browse/NUTCH-961
    Project: Nutch
    Issue Type: New Feature
    Components: parser
    Reporter: Markus Jelsma
    Fix For: 1.4, 2.0
    Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961v2.patch

Tika 0.8 comes with the Boilerpipe content handler, which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.
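The guard described in the comment can be sketched roughly as follows. This is not the DOMBuilder code from the patch: the class and helper names are hypothetical, and returning null for an empty stack is an assumption about how a caller might want to handle that case.

```java
import java.util.Stack;

/** Hypothetical sketch of checking a stack before peek/pop. */
final class GuardedStackSketch {

    /** Returns the top element, or null instead of throwing EmptyStackException. */
    static <T> T peekOrNull(Stack<T> stack) {
        return stack.isEmpty() ? null : stack.peek();
    }

    /** Removes and returns the top element, or null if the stack is empty. */
    static <T> T popOrNull(Stack<T> stack) {
        return stack.isEmpty() ? null : stack.pop();
    }

    public static void main(String[] args) {
        Stack<String> elements = new Stack<>();
        System.out.println(popOrNull(elements));  // null, no exception
        elements.push("body");
        System.out.println(peekOrNull(elements)); // body
        System.out.println(popOrNull(elements));  // body
    }
}
```

An unguarded peek() or pop() on an empty java.util.Stack throws java.util.EmptyStackException, which is the failure reported above.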
[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents
[ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-965:
    Patch Info: [Patch Available]

Skip parsing for truncated documents
    Key: NUTCH-965
    URL: https://issues.apache.org/jira/browse/NUTCH-965
    Project: Nutch
    Issue Type: Improvement
    Components: parser
    Reporter: Alexis
    Fix For: 1.4, 2.0
    Attachments: parserJob.patch

The issue you're likely to run into when parsing truncated FLV files is described here:
http://www.mail-archive.com/user@nutch.apache.org/msg01880.html

The parser library gets stuck in an infinite loop when it encounters corrupted data caused by, for example, truncating big binary files at fetch time.
[jira] [Commented] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.
[ https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066649#comment-13066649 ]

Markus Jelsma commented on NUTCH-1044:

Can you provide a patch?

Redirected URLs and possibly all of their outlinked URLs have invalid scores.
    Key: NUTCH-1044
    URL: https://issues.apache.org/jira/browse/NUTCH-1044
    Project: Nutch
    Issue Type: Bug
    Components: fetcher, parser
    Affects Versions: 1.3
    Reporter: Nutch User - 1

1.: http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
2.: http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html

Please note that URLs redirected by meta refresh redirection also have invalid scores. For such URLs a CrawlDatum is created on lines 157-177 of ParseOutputFormat.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup). The new CrawlDatum's score isn't set anywhere after creation, so it is 1.0f, as can be seen on line 122 of CrawlDatum.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).

It's another question whether the redirected URL's score should simply be passed to the new URL, or whether the redirection should be considered a link, in which case the new URL's score would be 'originalScore' / ('numberOfOutlinks' + 1).
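The two options raised at the end of the report can be written out as a small sketch. Neither method is existing Nutch code: the class and method names are hypothetical, and the arithmetic simply restates the formula quoted in the issue.

```java
/** Hypothetical sketch of the two scoring options for a redirect target. */
final class RedirectScoreSketch {

    /** Option 1: pass the redirecting URL's score to the new URL unchanged. */
    static float passThrough(float originalScore) {
        return originalScore;
    }

    /** Option 2: treat the redirect as one more outlink of the page. */
    static float asLink(float originalScore, int numberOfOutlinks) {
        if (numberOfOutlinks < 0) {
            throw new IllegalArgumentException("numberOfOutlinks must not be negative");
        }
        // originalScore / (numberOfOutlinks + 1), as quoted in the issue.
        return originalScore / (numberOfOutlinks + 1);
    }

    public static void main(String[] args) {
        System.out.println(passThrough(1.0f)); // 1.0
        System.out.println(asLink(1.0f, 3));   // 0.25
    }
}
```

Under the second option a page with no outlinks passes its full score to the redirect target, so the two options only differ when the page also has regular outlinks.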
adding details to mvn.template?
Hi,

Quick question: I've been looking at various issues dealt with prior to the Nutch 1.3 release, in particular NUTCH-995.

Please excuse (and correct) my ignorance, but I need to clear this one up so I understand correctly. The purpose the mvn.template file serves is so we can specify exactly who can commit a Nutch maven pom. The pom in turn specifies the build dirs e.g. source dir as well as test dir. Then finally all dependencies we rely on within the project?

Although I am not planning, and I'm aware we don't need to commit a Nutch Maven pom, is there any purpose in me committing my developer id, name and email to the mvn.template file? If so is the template file the only one I would need to provide a patch for?

Thank you

--
*Lewis*
Re: adding details to mvn.template?
> Please excuse (and correct) my ignorance, but I need to clear this one up so
> I understand correctly. The purpose the mvn.template file serves is so we can
> specify exactly who can commit a Nutch maven pom. The pom in turn specifies
> the build dirs e.g. source dir as well as test dir. Then finally all
> dependencies we rely on within the project?

The purpose of mvn.template is to add more details to the pom.xml generated from ivy. This pom file is used mostly for publishing the Nutch jar as an artefact but some people use it to manage the dependencies, although this can be done with Ivy without problems.

> Although I am not planning, and I'm aware we don't need to commit a Nutch
> Maven pom, is there any purpose in me committing my developer id, name and
> email to the mvn.template file? If so is the template file the only one I
> would need to provide a patch for?

Yes, most definitely. Should be the only thing to patch indeed.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
[jira] [Commented] (NUTCH-1019) Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy
[ https://issues.apache.org/jira/browse/NUTCH-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066731#comment-13066731 ]

Lewis John McGibbney commented on NUTCH-1019:

Committed at revision 1147712.

Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy
    Key: NUTCH-1019
    URL: https://issues.apache.org/jira/browse/NUTCH-1019
    Project: Nutch
    Issue Type: Improvement
    Components: documentation
    Affects Versions: 1.4, 2.0
    Reporter: Lewis John McGibbney
    Assignee: Lewis John McGibbney
    Priority: Trivial
    Fix For: 1.4, 2.0
    Attachments: crawl-comment.patch

When updating the wiki documentation for command line options, I noticed that the comment on line 51 of the above class is inaccurate and needs to be updated to reflect changes. Although this is a trivial task, I won't be able to commit until the 2nd week of July. Can I ask someone else, please?
[jira] [Updated] (NUTCH-1059) Remove convdb command from /bin/nutch
[ https://issues.apache.org/jira/browse/NUTCH-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1059:
    Attachment: NUTCH-1059-remove-convdb.patch

The patch simply removes both the command line option and the class which is supposedly called when the command is initiated. This is being removed as the crawldbconv class has been dropped in >= 1.3.

Remove convdb command from /bin/nutch
    Key: NUTCH-1059
    URL: https://issues.apache.org/jira/browse/NUTCH-1059
    Project: Nutch
    Issue Type: Task
    Components: build
    Affects Versions: 1.3
    Reporter: Lewis John McGibbney
    Assignee: Lewis John McGibbney
    Priority: Trivial
    Fix For: 1.4, 2.0
    Attachments: NUTCH-1059-remove-convdb.patch

There is no class shipped with >= Nutch 1.3 for the convdb command, so I'm assuming this command somehow slipped through the net undetected. I will attach a trivial patch simply removing it from the bin/nutch script.
Build failed in Jenkins: Nutch-trunk #1549
See https://builds.apache.org/job/Nutch-trunk/1549/

--
[...truncated 985 lines...]
A  src/plugin/subcollection/src/java/org/apache/nutch/collection
A  src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A  src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A  src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A  src/plugin/subcollection/src/java/org/apache/nutch/indexer
A  src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A  src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A  src/plugin/subcollection/README.txt
A  src/plugin/subcollection/plugin.xml
A  src/plugin/subcollection/build.xml
A  src/plugin/index-more
A  src/plugin/index-more/ivy.xml
A  src/plugin/index-more/src
A  src/plugin/index-more/src/test
A  src/plugin/index-more/src/test/org
A  src/plugin/index-more/src/test/org/apache
A  src/plugin/index-more/src/test/org/apache/nutch
A  src/plugin/index-more/src/test/org/apache/nutch/indexer
A  src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A  src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A  src/plugin/index-more/src/java
A  src/plugin/index-more/src/java/org
A  src/plugin/index-more/src/java/org/apache
A  src/plugin/index-more/src/java/org/apache/nutch
A  src/plugin/index-more/src/java/org/apache/nutch/indexer
A  src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A  src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A  src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A  src/plugin/index-more/plugin.xml
A  src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A  src/plugin/parse-ext
A  src/plugin/parse-ext/ivy.xml
A  src/plugin/parse-ext/src
A  src/plugin/parse-ext/src/test
A  src/plugin/parse-ext/src/test/org
A  src/plugin/parse-ext/src/test/org/apache
A  src/plugin/parse-ext/src/test/org/apache/nutch
A  src/plugin/parse-ext/src/test/org/apache/nutch/parse
A  src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A  src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A  src/plugin/parse-ext/src/java
A  src/plugin/parse-ext/src/java/org
A  src/plugin/parse-ext/src/java/org/apache
A  src/plugin/parse-ext/src/java/org/apache/nutch
A  src/plugin/parse-ext/src/java/org/apache/nutch/parse
A  src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A  src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A  src/plugin/parse-ext/plugin.xml
A  src/plugin/parse-ext/build.xml
A  src/plugin/parse-ext/command
A  src/plugin/urlnormalizer-pass
A  src/plugin/urlnormalizer-pass/ivy.xml
A  src/plugin/urlnormalizer-pass/src
A  src/plugin/urlnormalizer-pass/src/test
A  src/plugin/urlnormalizer-pass/src/test/org
A  src/plugin/urlnormalizer-pass/src/test/org/apache
A  src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A  src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A  src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A  src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A  src/plugin/urlnormalizer-pass/src/java
A  src/plugin/urlnormalizer-pass/src/java/org
A  src/plugin/urlnormalizer-pass/src/java/org/apache
A  src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A  src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A  src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A  src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A  src/plugin/parse-html
A  src/plugin/parse-html/ivy.xml
A  src/plugin/parse-html/lib
A  src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A  src/plugin/parse-html/src
A  src/plugin/parse-html/src/test
A  src/plugin/parse-html/src/test/org
A  src/plugin/parse-html/src/test/org/apache
A  src/plugin/parse-html/src/test/org/apache/nutch
A  src/plugin/parse-html/src/test/org/apache/nutch/parse
A