Running individual test classes from nutch script cont'd

2011-07-17 Thread lewis john mcgibbney
Hi,

OK, this stems from a discussion on the user@ list a while ago [1] and my
discovery of NUTCH-672 yesterday. I attached a patch, which fails
completely, as I hadn't yet uncovered the things I now know.
The original patch submitted for the issue would have been fine for <= Nutch
1.2, but now that the file structure has changed in >= Nutch 1.3, both pre- and
post-build with ant, it is no longer as trivial as it looks. Basically, the
additions to the bin/nutch script would look something like this:

  echo "  plugin            load a plugin and run one of its classes main()"
  echo "  junit             runs the given JUnit test"
  echo " or"
  echo "  CLASSNAME         run the class named CLASSNAME"
--
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "junit" ] ; then
  # make the test classes visible to the JUnit runner
  CLASSPATH=$CLASSPATH:src/test/
  CLASS='junit.textui.TestRunner'
else
  CLASS=$COMMAND
fi

This would enable us to execute, for example, bin/nutch junit
org.apache.nutch.crawl.CrawlDBTestUtil. However, the problem we face is that
/lib no longer exists under /branch-1.4; it is instead located
under /branch-1.4/runtime/local/lib or, in deploy mode, in the /lib directory
in snapshop.job.

I'm therefore getting a class-not-found error when I try to run it:

Exception in thread "main" java.lang.NoClassDefFoundError: junit/textui/TestRunner
Caused by: java.lang.ClassNotFoundException: junit.textui.TestRunner
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
Could not find the main class: junit.textui.TestRunner.  Program will exit.

One observation: regardless of whether we want to run JUnit tests on test
classes in a development or a production environment (i.e. from source or
from the post-build runtime code), the correct command-line options would
have to be specified within the source nutch script.
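
The observation above boils down to getting the JUnit jar and the test classes onto the classpath, whichever layout is in use. As a rough diagnostic sketch (written in Java rather than in the nutch script itself; the class and method names are hypothetical and not part of Nutch), one could check whether a class resolves and build a classpath string from the jars in a lib directory such as runtime/local/lib:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch.
class ClasspathSketch {

    // Returns true if the named class can be found by this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Joins every *.jar directly inside libDir into one classpath string.
    // Returns an empty string if libDir does not exist or is not a directory.
    static String jarClasspath(File libDir) {
        File[] jars = libDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars == null) {
            return "";
        }
        List<String> paths = new ArrayList<>();
        for (File jar : jars) {
            paths.add(jar.getPath());
        }
        Collections.sort(paths);
        return String.join(File.pathSeparator, paths);
    }

    public static void main(String[] args) {
        // Assumption: run from the checkout root after an ant build.
        System.out.println("junit visible: " + isOnClasspath("junit.textui.TestRunner"));
        System.out.println("lib classpath: " + jarClasspath(new File("runtime/local/lib")));
    }
}
```

If the first line prints false while the second lists a junit jar, the script is simply not adding that directory to CLASSPATH before launching the runner.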

I've been looking at this for a while and haven't really made much progress
apart from the above observations. Can anyone shed some light or even
suggest how we could correctly configure a patch for the nutch script?

Thank you

[1] http://www.mail-archive.com/user@nutch.apache.org/msg03207.html

-- 
*Lewis*


[jira] [Updated] (NUTCH-1057) Make fetcher thread time out configurable

2011-07-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1057:
-

Attachment: NUTCH-1057-1.4-1.patch

Patch for 1.4. There's also a diff for NUTCH-1037 in the config file which 
hasn't been committed yet.

 Make fetcher thread time out configurable
 -

 Key: NUTCH-1057
 URL: https://issues.apache.org/jira/browse/NUTCH-1057
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1057-1.4-1.patch


 The fetcher sets a timeout value of half the mapred.task.timeout 
 value. This is not a suitable value for all cases. Add an option 
 (fetcher.thread.timeout.divisor) to configure the divisor used, defaulting 
 to two.
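
 A minimal sketch of the computation described above, assuming plain java.util.Properties instead of the Hadoop configuration and a fallback of 600000 ms for mapred.task.timeout. The class and method names are hypothetical; only the two property names come from the issue, and this is not the code in the attached patch.

```java
import java.util.Properties;

// Hypothetical sketch; not the actual Nutch Fetcher code.
class FetcherTimeoutSketch {

    // Derives the fetcher thread timeout (ms) from the task timeout and a divisor.
    static long threadTimeoutMs(Properties conf) {
        // Assumed fallback of 600000 ms when mapred.task.timeout is not set.
        long taskTimeout = Long.parseLong(conf.getProperty("mapred.task.timeout", "600000"));
        // Default divisor of two, as proposed in the issue.
        int divisor = Integer.parseInt(conf.getProperty("fetcher.thread.timeout.divisor", "2"));
        if (divisor <= 0) {
            throw new IllegalArgumentException("fetcher.thread.timeout.divisor must be positive");
        }
        return taskTimeout / divisor;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("fetcher.thread.timeout.divisor", "4");
        System.out.println("thread timeout (ms): " + threadTimeoutMs(conf));
    }
}
```

 Rejecting a non-positive divisor up front avoids a division by zero when the option is misconfigured.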

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1043) Add pattern for filtering .js in default url filters

2011-07-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1043:
-

Patch Info: [Patch Available]

 Add pattern for filtering .js in default url filters
 

 Key: NUTCH-1043
 URL: https://issues.apache.org/jira/browse/NUTCH-1043
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.4, 2.0
Reporter: Julien Nioche
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1043.patch


 The JavaScript parser is not used by default as it is extremely noisy; 
 however, the default URL filters do not filter out URLs ending in .js and the 
 default parser (Tika) can't parse them. In a nutshell, we are fetching URLs 
 that we know can't be parsed.
 I suggest that we add a regex to the default URL filters. If people are 
 interested in fetching and parsing .js files, they can activate the plugin in 
 their conf and remove the regex from the URL filters.
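
 To illustrate the kind of rule being suggested, here is a small sketch using java.util.regex. The pattern and the class name are assumptions made for illustration; the regex actually proposed is in the attached patch and may differ.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the Nutch URL filter plugin itself.
class JsUrlFilterSketch {

    // Assumed pattern: matches URLs that end in ".js", ignoring case.
    private static final Pattern JS_SUFFIX =
            Pattern.compile("\\.js$", Pattern.CASE_INSENSITIVE);

    // Returns false for URLs the filter would reject, true otherwise.
    static boolean accept(String url) {
        return !JS_SUFFIX.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/index.html"));
        System.out.println(accept("http://example.com/static/app.js"));
    }
}
```

 Anchoring the pattern at the end of the URL keeps look-alike suffixes such as .json or .jsp from being rejected.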





[jira] [Updated] (NUTCH-1023) Trivial error in error message for org.apache.nutch.crawl.LinkDbReader

2011-07-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1023:
-

Patch Info: [Patch Available]

 Trivial error in error message for org.apache.nutch.crawl.LinkDbReader
 --

 Key: NUTCH-1023
 URL: https://issues.apache.org/jira/browse/NUTCH-1023
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 1.3
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.4

 Attachments: LinkDbReader-trivial.patch


 The following line in the above class has a trivial syntax error before 
 the -dump parameter. Instead of a curly bracket, it should use a round 
 bracket, consistent with the closing one.
 126   System.err.println("Usage: LinkDbReader linkdb {-dump out_dir | -url url)");
  





[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2011-07-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:


Attachment: NUTCH-961-1.4-dombuilder-1.patch

With BP enabled you can get a java.util.EmptyStackException from DOMBuilder. 
This is fixed in this patch by adding another check around the peek 'n pop 
methods.

http://mail-archives.apache.org/mod_mbox/nutch-user/201107.mbox/%3c201107151523.18511.markus.jel...@openindex.io%3E

There is no answer yet as to why this can occur, but I think checking before pop 
or peek is good anyway.
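
The guard being described can be sketched as follows, assuming the element stack is a java.util.Stack (the class that throws java.util.EmptyStackException). The helper names are hypothetical and this is not the DOMBuilder code from the patch.

```java
import java.util.Stack;

// Hypothetical sketch of checking before peek/pop; not the patched DOMBuilder.
class GuardedStackSketch {

    // Returns the top element without removing it, or null if the stack is empty.
    static <T> T peekOrNull(Stack<T> stack) {
        return stack.isEmpty() ? null : stack.peek();
    }

    // Removes and returns the top element, or returns null if the stack is empty.
    static <T> T popOrNull(Stack<T> stack) {
        return stack.isEmpty() ? null : stack.pop();
    }

    public static void main(String[] args) {
        Stack<String> elements = new Stack<>();
        // An unguarded elements.pop() here would throw EmptyStackException.
        System.out.println(popOrNull(elements));
        elements.push("body");
        System.out.println(peekOrNull(elements));
    }
}
```

Callers then have to handle the null case explicitly, which is the trade-off for not throwing.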

 Expose Tika's boilerpipe support
 

 Key: NUTCH-961
 URL: https://issues.apache.org/jira/browse/NUTCH-961
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
 Fix For: 1.4, 2.0

 Attachments: BoilerpipeExtractorRepository.java, 
 NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
 NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
 NUTCH-961v2.patch


 Tika 0.8 comes with the Boilerpipe content handler which can be used to 
 extract boilerplate content from HTML pages. We should see how we can expose 
 Boilerpipe in the Nutch configuration.





[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents

2011-07-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-965:


Patch Info: [Patch Available]

 Skip parsing for truncated documents
 

 Key: NUTCH-965
 URL: https://issues.apache.org/jira/browse/NUTCH-965
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Alexis
 Fix For: 1.4, 2.0

 Attachments: parserJob.patch


 The issue you're likely to run into when parsing truncated FLV files is 
 described here:
 http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
 The parser library gets stuck in an infinite loop when it encounters corrupted 
 data caused by, for example, truncating big binary files at fetch time.





[jira] [Commented] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-07-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066649#comment-13066649
 ] 

Markus Jelsma commented on NUTCH-1044:
--

Can you provide a patch?

 Redirected URLs and possibly all of their outlinked URLs have invalid scores.
 -

 Key: NUTCH-1044
 URL: https://issues.apache.org/jira/browse/NUTCH-1044
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, parser
Affects Versions: 1.3
Reporter: Nutch User - 1

 1.: 
 http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
 2.: 
 http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html
 Please note that URLs redirected by meta refresh redirection also have 
 invalid scores. For such URLs a CrawlDatum is created on lines 157-177 of 
 ParseOutputFormat.java 
 (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup).
 The new CrawlDatum's score isn't set anywhere after creation, so it's 
 1.0f, as can be seen on line 122 of CrawlDatum.java 
 (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).
 It's another question whether the redirected URL's score should just be 
 passed to the new URL, or whether the redirection should be considered a link, 
 in which case the new URL's score would be 'originalScore' / ('numberOfOutlinks' 
 + 1).
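
 The two alternatives in the last paragraph can be written out as plain arithmetic. This is only a sketch with hypothetical names; it does not use CrawlDatum or any Nutch scoring API.

```java
// Hypothetical sketch of the two scoring options; not Nutch code.
class RedirectScoreSketch {

    // Option 1: the redirect target simply inherits the original URL's score.
    static float passThrough(float originalScore) {
        return originalScore;
    }

    // Option 2: the redirect counts as one more outlink, so the score is
    // split across numberOfOutlinks + 1 targets.
    static float asExtraLink(float originalScore, int numberOfOutlinks) {
        if (numberOfOutlinks < 0) {
            throw new IllegalArgumentException("numberOfOutlinks must not be negative");
        }
        return originalScore / (numberOfOutlinks + 1);
    }

    public static void main(String[] args) {
        System.out.println(passThrough(1.0f));
        System.out.println(asExtraLink(1.0f, 3));
    }
}
```

 With no outlinks at all, both options give the same result, since the divisor is then 1.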





adding details to mvn.template?

2011-07-17 Thread lewis john mcgibbney
Hi,

Quick question, I've been looking at various issues dealt with prior to
Nutch 1.3 release in particular NUTCH-995.

Please excuse (and correct) my ignorance, but I need to clear this one up so
I understand correctly. The purpose the mvn.template file serves is so we
can specify exactly who can commit a Nutch maven pom. The pom in turn
specifies the build dirs e.g. source dir as well as test dir. Then finally
all dependencies we rely on within the project?

Although I am not planning, and I'm aware we don't need to commit a Nutch
Maven pom, is there any purpose in me committing my developer id, name and
email to the mvn.template file? If so is the template file the only one I
would need to provide a patch for?

Thank you

-- 
*Lewis*


Re: adding details to mvn.template?

2011-07-17 Thread Julien Nioche
> Please excuse (and correct) my ignorance, but I need to clear this one up so
> I understand correctly. The purpose the mvn.template file serves is so we
> can specify exactly who can commit a Nutch maven pom. The pom in turn
> specifies the build dirs e.g. source dir as well as test dir. Then finally
> all dependencies we rely on within the project?


The purpose of mvn.template is to add more details to the pom.xml generated
from ivy. This pom file is used mostly for publishing the Nutch jar as an
artefact but some people use it to manage the dependencies, although this
can be done with Ivy without problems.



> Although I am not planning, and I'm aware we don't need to commit a Nutch
> Maven pom, is there any purpose in me committing my developer id, name and
> email to the mvn.template file? If so is the template file the only one I
> would need to provide a patch for?


Yes, most definitely. That should indeed be the only thing to patch.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] [Commented] (NUTCH-1019) Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy

2011-07-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066731#comment-13066731
 ] 

Lewis John McGibbney commented on NUTCH-1019:
-

Committed at revision 1147712.

 Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy
 -

 Key: NUTCH-1019
 URL: https://issues.apache.org/jira/browse/NUTCH-1019
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.4, 2.0
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.4, 2.0

 Attachments: crawl-comment.patch


 When updating the wiki documentation for command line options, I noticed that 
 the comment on line 51 of the above class is inaccurate and needs to be 
 updated to reflect changes. Although this is a trivial task, I won't be able 
 to commit until the 2nd week of July. Can I ask someone else, please?





[jira] [Updated] (NUTCH-1059) Remove convdb command from /bin/nutch

2011-07-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1059:


Attachment: NUTCH-1059-remove-convdb.patch

The patch simply removes both the command line option and the class which is 
supposedly called when the command is initiated. This is being removed as the 
crawldbconv class has been dropped in >= 1.3.

 Remove convdb command from /bin/nutch
 -

 Key: NUTCH-1059
 URL: https://issues.apache.org/jira/browse/NUTCH-1059
 Project: Nutch
  Issue Type: Task
  Components: build
Affects Versions: 1.3
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1059-remove-convdb.patch


 There is no class shipped with >= Nutch 1.3 for the convdb command, therefore 
 I'm assuming this command somehow slipped through the net undetected. I will 
 attach a trivial patch simply removing it from the bin/nutch script.





Build failed in Jenkins: Nutch-trunk #1549

2011-07-17 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/1549/

--
[...truncated 985 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A