Hi Markus,

On Jenkins all unit tests have passed including plugins:
https://builds.apache.org/job/Nutch-trunk/3536/testReport/
(same on my laptop running Ubuntu 18.04 and on an Ubuntu 16.04 server)

Could be related to the Java version.

% java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.18.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

But let's discuss the test failures in separate threads.

Sebastian

On 06/13/2018 12:45 PM, Markus Jelsma wrote:
> Ah, wrong thread. But it seems some things are not entirely right for the
> 1.15 release just yet.
> Markus
>
> -----Original message-----
>> From:Markus Jelsma <markus.jel...@openindex.io>
>> Sent: Wednesday 13th June 2018 12:44
>> To: dev@nutch.apache.org
>> Subject: RE: Nutch 1.14 issues
>>
>> Hi,
>>
>> I've got some tests failing here on a vanilla master check out.
>>
>> [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.314 sec
>> [junit] Test org.apache.nutch.net.TestURLNormalizers FAILED
>>
>> Jurian had protocol-http's test failing just now, but running ant test on my
>> system with a clean check out didn't run the plugin tests at all. Whatever I
>> do, plugin tests won't run.
>>
>> Markus
>>
>> -----Original message-----
>>> From:Sebastian Nagel <wastl.na...@googlemail.com>
>>> Sent: Tuesday 12th June 2018 16:24
>>> To: dev@nutch.apache.org
>>> Subject: Re: Nutch 1.14 issues
>>>
>>> Hi Arkadi,
>>>
>>> thanks for your feedback and suggestions.
>>> I can understand your frustration but I also want to clarify:
>>>
>>> - Arch is a nice project, for sure. But Arch is GPL licensed
>>>   which makes contributions a one-way route (Nutch -> Arch)
>>>   and keeps me from even looking into the Arch sources. Sorry.
>>>
>>> - Please take the time to split your list of issues into separate
>>>   requests on the mailing list or open separate Jira issues.
>>>   Also take care that the problems are reproducible by sharing
>>>   documents that failed to parse, log snippets, config files, etc.
>>>
>>> - Sorry about NUTCH-2071, I took this mainly as a class path issue
>>>   in the parse-tika plugin (which is solved). Now I understand better
>>>   what your objective is and I'll review and try to fix it
>>>   (in combination with NUTCH-1993). But again: please take the time
>>>   to explain your objectives, ping committers if fixes make no progress,
>>>   etc.
>>>
>>> - Nutch is a community project. There are no "paid" committers. This
>>>   means although some of us are paid to configure/operate/adapt crawlers
>>>   nobody is delegated to fix issues, support Nutch users, etc.
>>>   That's voluntary work.
>>>
>>> - Everybody is welcome to contribute (patches, documentation, support
>>>   on the mailing list, etc.) Because Nutch is a small project this
>>>   will definitely help us.
>>>
>>> Thanks,
>>> Sebastian
>>>
>>> On 06/12/2018 08:46 AM, arkadi.kosmy...@csiro.au wrote:
>>>> Hi guys,
>>>>
>>>> I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to
>>>> Nutch 1.14 and Solr 7.2, and I have come across a few serious issues, of
>>>> which you should be aware:
>>>>
>>>> 1. NUTCH-2071 is still an issue in 1.14, because the returned
>>>>    parseResult is never null. If a parser fails to parse a document, it
>>>>    returns an empty result, but not null. This means that, from a chain
>>>>    of parser candidates, only the first one has a chance to try to parse
>>>>    the document.
>>>>
>>>> 2. Nutch adopted Tika as a general parsing tool, and stopped supporting
>>>>    “legacy” parsing (OO, MS) plugins. I continued using them and hoped to
>>>>    stop supporting them in the next version of Arch I am preparing to be
>>>>    released, but I still can’t do it, because Tika fails to parse too
>>>>    many documents on our site. But, when I reinforce Tika with the legacy
>>>>    parsers, I achieve an almost 100% parsing success rate. This is why
>>>>    NUTCH-2071 is important for Arch.
>>>>    I think you should bring back legacy parsers to Nutch, because the
>>>>    quality of parsing of “real life” data, such as ours, is not great
>>>>    without them.
>>>>
>>>> 3. The lines defining the fall-back (*) plugin in parse-plugins.xml are
>>>>    not effective, because they are ignored as long as there is at least
>>>>    one plugin claiming * in its plugin.xml file. In some cases, Nutch
>>>>    assigns the * capability to plugins that don’t even claim it. For
>>>>    example, I can’t understand why the Arch content blocking plugin
>>>>    gets it.
>>>>
>>>> 4. In earlier versions of Nutch, use of the native libraries really
>>>>    helped. It reduced crawling of our site from a couple of days to 6-7
>>>>    hours. In Nutch 1.14, I don’t notice this. I’ve obtained Hadoop
>>>>    libraries, placed them where they are expected, even inserted an
>>>>    explicit load library call in my code, but I still don’t notice any
>>>>    significant time savings.
>>>>
>>>> 5. The Feed plugin seems to have a major problem. Line 102 in
>>>>    FeedIndexingFilter.java generated a NumberFormatException (which
>>>>    caused the failure of the entire crawling process!) because it was
>>>>    trying to parse a date in string format, not a number. Given that
>>>>    this metadata piece was generated by the feed parser (same plugin),
>>>>    it seems that the plugin is in disagreement with itself.
>>>>
>>>> 6. This is less important, but when Tika fails to parse a document, it
>>>>    generates a scary error message and an ugly stack trace. I think this
>>>>    should be a one line warning, because other parsers may still parse
>>>>    this document successfully.
>>>>
>>>> Hope this helps.
>>>>
>>>> Regards,
>>>>
>>>> Arkadi
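
For reference on item 1 of Arkadi's list (NUTCH-2071): the behaviour he asks for, where an empty (not only a null) result lets the next candidate parser have a try, could be sketched roughly as below. This is not Nutch code. `ParserChainSketch`, `parseWithFallback`, and the plain `Map` standing in for a parse result are hypothetical names used only for illustration. The catch block also shows the one-line warning suggested in item 6.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch, not Nutch code: a Map stands in for a parse result,
// and a Function<String, Map<String, String>> stands in for a parser plugin.
final class ParserChainSketch {

    // Tries each candidate in order. A null or empty result counts as a
    // failure, so the next candidate in the chain still gets a chance.
    static Optional<Map<String, String>> parseWithFallback(
            String content,
            List<Function<String, Map<String, String>>> candidates) {
        for (Function<String, Map<String, String>> parser : candidates) {
            try {
                Map<String, String> result = parser.apply(content);
                if (result != null && !result.isEmpty()) {
                    return Optional.of(result);
                }
            } catch (RuntimeException e) {
                // One-line warning instead of a stack trace: a later
                // candidate may still parse the document.
                System.err.println("WARN parser failed, trying next: " + e);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Three stand-in parsers: one returns an empty result, one throws,
        // and one (the "legacy" parser) succeeds.
        Function<String, Map<String, String>> emptyParser =
                c -> Collections.emptyMap();
        Function<String, Map<String, String>> throwingParser = c -> {
            throw new IllegalStateException("cannot parse");
        };
        Function<String, Map<String, String>> legacyParser =
                c -> Collections.singletonMap("text", c.trim());

        System.out.println(parseWithFallback(
                "  hello  ",
                Arrays.asList(emptyParser, throwingParser, legacyParser)));
    }
}
```

Returning an `Optional` keeps "no parser succeeded" distinct from "a parser returned something empty", which is the distinction the null check alone loses.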
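
For reference on item 5 of Arkadi's list: one way to keep a single badly formatted date from aborting a whole crawl is to parse the metadata value defensively and treat an unparseable value as absent. The sketch below is not the FeedIndexingFilter code. `FeedDateSketch` and `parseTimestamp` are hypothetical names, and the three accepted formats (epoch milliseconds, RFC 1123, ISO-8601) are assumptions, since the thread does not say which format the feed parser actually stored.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical sketch, not the FeedIndexingFilter code. The accepted
// formats are assumptions made for illustration.
final class FeedDateSketch {

    // Returns the timestamp if the value is epoch milliseconds, an RFC 1123
    // date, or an ISO-8601 instant; returns empty instead of throwing.
    static Optional<Instant> parseTimestamp(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        String v = value.trim();
        try {
            return Optional.of(Instant.ofEpochMilli(Long.parseLong(v)));
        } catch (NumberFormatException notANumber) {
            // Not a number: fall through and try the date formats.
        }
        try {
            return Optional.of(OffsetDateTime
                    .parse(v, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant());
        } catch (DateTimeParseException notRfc1123) {
            // Not RFC 1123: fall through and try ISO-8601.
        }
        try {
            return Optional.of(Instant.parse(v));
        } catch (DateTimeParseException notIso8601) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("Tue, 12 Jun 2018 08:46:00 GMT"));
        System.out.println(parseTimestamp("not a date"));
    }
}
```

A caller would skip the field (and perhaps log a warning) when the result is empty, rather than letting the exception propagate.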