Ah, wrong thread. But it seems some things are not entirely right for the 1.15 release just yet.

Markus
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Wednesday 13th June 2018 12:44
> To: dev@nutch.apache.org
> Subject: RE: Nutch 1.14 issues
>
> Hi,
>
> I've got some tests failing here on a vanilla master checkout.
>
> [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.314 sec
> [junit] Test org.apache.nutch.net.TestURLNormalizers FAILED
>
> Jurian had protocol-http's test failing just now, but running ant test on my
> system with a clean checkout didn't run the plugin tests at all. Whatever I
> do, plugin tests won't run.
>
> Markus
>
> -----Original message-----
> > From:Sebastian Nagel <wastl.na...@googlemail.com>
> > Sent: Tuesday 12th June 2018 16:24
> > To: dev@nutch.apache.org
> > Subject: Re: Nutch 1.14 issues
> >
> > Hi Arkadi,
> >
> > thanks for your feedback and suggestions.
> > I can understand your frustration but I also want to clarify:
> >
> > - Arch is a nice project, for sure. But Arch is GPL licensed,
> >   which makes contributions a one-way route (Nutch -> Arch)
> >   and keeps me from even looking into the Arch sources. Sorry.
> >
> > - Please take the time to split your list of issues into separate
> >   requests on the mailing list or open separate Jira issues.
> >   Also take care that the problems are reproducible by sharing
> >   documents that failed to parse, log snippets, config files, etc.
> >
> > - Sorry about NUTCH-2071, I took this mainly as a class path issue
> >   in the parse-tika plugin (which is solved). Now I understand better
> >   what your objective is and I'll review and try to fix it
> >   (in combination with NUTCH-1993). But again: please take the time
> >   to explain your objectives, ping committers if fixes make no progress,
> >   etc.
> >
> > - Nutch is a community project. There are no "paid" committers. This
> >   means that although some of us are paid to configure/operate/adapt crawlers,
> >   nobody is delegated to fix issues, support Nutch users, etc.
> >   That's voluntary work.
> >
> > - Everybody is welcome to contribute (patches, documentation, support
> >   on the mailing list, etc.). Because Nutch is a small project, this
> >   will definitely help us.
> >
> > Thanks,
> > Sebastian
> >
> > On 06/12/2018 08:46 AM, arkadi.kosmy...@csiro.au wrote:
> > > Hi guys,
> > >
> > > I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to
> > > Nutch 1.14 and Solr 7.2, and I have come across a few serious issues, of
> > > which you should be aware:
> > >
> > > 1. NUTCH-2071 is still an issue in 1.14, because the returned
> > > parseResult is never null. If a parser fails to parse a document, it
> > > returns an empty result, but not null. This means that, from a chain of
> > > parser candidates, only the first one has a chance to try to parse the
> > > document.
> > >
> > > 2. Nutch adopted Tika as a general parsing tool, and stopped supporting
> > > “legacy” parsing (OO, MS) plugins. I continued using them and hoped to
> > > stop supporting them in the next version of Arch I am preparing to
> > > release, but I still can’t do it, because Tika fails to parse too many
> > > documents on our site. But, when I reinforce Tika with the legacy
> > > parsers, I achieve an almost 100% parsing success rate. This is why
> > > NUTCH-2071 is important for Arch. I think you should bring back legacy
> > > parsers to Nutch, because the quality of parsing of “real life” data,
> > > such as ours, is not great without them.
> > >
> > > 3. The lines defining the fall-back (*) plugin in parse-plugins.xml are
> > > not effective, because they are ignored as long as there is at least one
> > > plugin claiming * in its plugin.xml file. In some cases, Nutch assigns
> > > the * capability to plugins that don’t even claim it. For example, I
> > > can’t understand why the Arch content blocking plugin gets it.
> > >
> > > 4. In earlier versions of Nutch, use of the native libraries really
> > > helped. It reduced crawling of our site from a couple of days to 6-7
> > > hours. In Nutch 1.14, I don’t notice this. I’ve obtained the Hadoop
> > > libraries, placed them where they are expected, even inserted an
> > > explicit load library call in my code, but I still don’t notice any
> > > significant time savings.
> > >
> > > 5. The Feed plugin seems to have a major problem. Line 102 in
> > > FeedIndexingFilter.java generated a NumberFormatException (which caused
> > > the failure of the entire crawling process!) because it was trying to
> > > parse a date in string format, not a number. Given that this metadata
> > > piece was generated by the feed parser (same plugin), it seems that the
> > > plugin is in disagreement with itself.
> > >
> > > 6. This is less important, but when Tika fails to parse a document, it
> > > generates a scary error message and an ugly stack trace. I think this
> > > should be a one-line warning, because other parsers may still parse this
> > > document successfully.
> > >
> > > Hope this helps.
> > >
> > > Regards,
> > >
> > > Arkadi
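
To illustrate items 1 and 6 in Arkadi's list, here is a minimal sketch of a parser chain that treats an empty result the same as a null one, so later candidates still get a turn, and that logs a one-line warning when a candidate throws. This is not Nutch code: ParserChainSketch, Result and parseWithFallback are hypothetical names made up for the example, and the real ParseUtil/ParseResult classes may well behave differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

class ParserChainSketch {

    /** Hypothetical stand-in for a parse result; NOT Nutch's ParseResult. */
    static final class Result {
        final String text;

        Result(String text) {
            this.text = text;
        }

        boolean isEmpty() {
            return text == null || text.isEmpty();
        }
    }

    /**
     * Tries each candidate parser in order and returns the first usable result.
     * A null result, an empty result and a thrown exception all count as
     * "this parser failed", so the next candidate is tried.
     */
    static Optional<Result> parseWithFallback(
            List<Function<byte[], Result>> parsers, byte[] content) {
        for (Function<byte[], Result> parser : parsers) {
            Result result;
            try {
                result = parser.apply(content);
            } catch (RuntimeException e) {
                // One-line warning instead of a stack trace: another parser may succeed.
                System.err.println("WARN parser failed: " + e.getMessage());
                continue;
            }
            if (result != null && !result.isEmpty()) {
                return Optional.of(result);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Function<byte[], Result>> chain = new ArrayList<>();
        chain.add(c -> new Result(""));  // first parser "fails" with an empty result
        chain.add(c -> new Result(new String(c, StandardCharsets.UTF_8)));

        byte[] doc = "some document".getBytes(StandardCharsets.UTF_8);
        Optional<Result> parsed = parseWithFallback(chain, doc);
        System.out.println(parsed.isPresent() ? parsed.get().text : "no parser succeeded");
    }
}
```

The point of the sketch is only the emptiness check in the loop: if the chain tested for null alone, the first candidate's empty result would end the search, which matches the behaviour described in item 1.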
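
On item 5, a hedged sketch of how a metadata timestamp could be read without letting a NumberFormatException abort the crawl. It assumes the value may arrive either as epoch milliseconds or as a date string; the two string formats tried here (ISO-8601 and RFC 1123) are assumptions, since the thread does not say what the feed parser actually writes. LenientDateSketch and parseTimestamp are hypothetical names, not part of the Feed plugin.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

class LenientDateSketch {

    /**
     * Parses a timestamp given either as epoch milliseconds or as a date string.
     * Returns Optional.empty() instead of throwing when nothing matches, so the
     * caller can skip the field rather than fail the whole job.
     */
    static Optional<Instant> parseTimestamp(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        String value = raw.trim();
        try {
            // Case 1: a plain number, interpreted as epoch milliseconds.
            return Optional.of(Instant.ofEpochMilli(Long.parseLong(value)));
        } catch (NumberFormatException notANumber) {
            // Not a number; fall through to the date-string formats.
        }
        try {
            // Case 2 (assumed format): ISO-8601 instant, e.g. 2018-06-12T08:46:00Z.
            return Optional.of(Instant.parse(value));
        } catch (DateTimeParseException notIso) {
            // Fall through to the next format.
        }
        try {
            // Case 3 (assumed format): RFC 1123, e.g. Tue, 12 Jun 2018 08:46:00 GMT.
            return Optional.of(
                ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant());
        } catch (DateTimeParseException notRfc1123) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("1528793160000"));
        System.out.println(parseTimestamp("Tue, 12 Jun 2018 08:46:00 GMT"));
        System.out.println(parseTimestamp("not a date"));
    }
}
```

Whether a lenient reader like this or a single agreed format between the feed parser and the indexing filter is the better fix is a design decision for the plugin; the sketch only shows that the failure need not be fatal.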