RE: Nutch 1.14 issues

Markus Jelsma Wed, 13 Jun 2018 03:46:11 -0700

Ah, wrong thread. But it seems some things are not entirely right for the 1.15 
release just yet.
Markus


 
 
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Wednesday 13th June 2018 12:44
> To: [email protected]
> Subject: RE: Nutch 1.14 issues
> 
> Hi,
> 
> I've got some tests failing here on a vanilla master checkout.
> 
>     [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.314 sec
>     [junit] Test org.apache.nutch.net.TestURLNormalizers FAILED
> 
> Jurian had protocol-http's test failing just now, but running ant test on my 
> system with a clean checkout didn't run the plugin tests at all. Whatever I 
> do, plugin tests won't run.
> 
> Markus
> 
> 
> 
>  
>  
> -----Original message-----
> > From:Sebastian Nagel <[email protected]>
> > Sent: Tuesday 12th June 2018 16:24
> > To: [email protected]
> > Subject: Re: Nutch 1.14 issues
> > 
> > Hi Arkadi,
> > 
> > thanks for your feedback and suggestions.
> > I can understand your frustration but I also want to clarify:
> > 
> > - Arch is a nice project, for sure. But Arch is GPL licensed,
> >   which makes contributions a one-way route (Nutch -> Arch)
> >   and keeps me from even looking into the Arch sources. Sorry.
> > 
> > - Please take the time to split your list of issues into separate
> >   requests on the mailing list or open separate Jira issues.
> >   Also take care that the problems are reproducible by sharing
> >   documents that failed to parse, log snippets, config files, etc.
> > 
> > - Sorry about NUTCH-2071, I took this mainly as a class path issue
> >   in the parse-tika plugin (which is solved). Now I understand better
> >   what your objective is and I'll review and try to fix it
> >   (in combination with NUTCH-1993). But again: please take the time
> >   to explain your objectives, ping committers if fixes make no progress,
> >   etc.
> > 
> > - Nutch is a community project. There are no "paid" committers. This
> >   means that, although some of us are paid to configure/operate/adapt
> >   crawlers, nobody is delegated to fix issues, support Nutch users, etc.
> >   That's voluntary work.
> > 
> > - Everybody is welcome to contribute (patches, documentation, support
> >   on the mailing list, etc.). Because Nutch is a small project, this
> >   will definitely help us.
> > 
> > 
> > Thanks,
> > Sebastian
> > 
> > 
> > 
> > On 06/12/2018 08:46 AM, [email protected] wrote:
> > > Hi guys,
> > > 
> > >  
> > > 
> > > I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to 
> > > Nutch 1.14 and Solr 7.2,
> > > and I have come across a few serious issues, of which you should be aware:
> > > 
> > >  
> > > 
> > > 1.       NUTCH-2071 is still an issue in 1.14, because the returned 
> > > parseResult is never null.
> > > If a parser fails to parse a document, it returns an empty result, but 
> > > not null. This means that,
> > > from a chain of parser candidates, only the first one has a chance to try 
> > > to parse the document.
> > > 
> > > 2.       Nutch adopted Tika as a general parsing tool, and stopped 
> > > supporting “legacy” parsing (OO,
> > > MS) plugins. I continued using them and hoped to stop supporting them in 
> > > the next version of Arch I
> > > am preparing to be released, but I still can’t do it, because Tika fails 
> > > to parse too many documents
> > > on our site. But, when I reinforce Tika with the legacy parsers, I 
> > > achieve almost 100% parsing
> > > success rate. This is why NUTCH-2071 is important for Arch. I think you 
> > > should bring back legacy
> > > parsers to Nutch, because the quality of parsing of “real life” data, 
> > > such as ours, is not great
> > > without them.
> > > 
> > > 3.       The lines defining fall-back (*) plugin in parse-plugins.xml are 
> > > not effective, because
> > > they are ignored, as long as there is at least one plugin claiming * in 
> > > its plugin.xml file. In some
> > > cases, Nutch assigns * capability to plugins that don’t even claim it. 
> > > For example, I can’t
> > > understand why the Arch content blocking plugin gets it.
> > > 
> > > 4.       In earlier versions of Nutch, use of the native libraries really 
> > > helped. It reduced
> > > crawling of our site from a couple of days to 6-7 hours. In Nutch 1.14, I 
> > > don’t notice this. I’ve
> > > obtained Hadoop libraries, placed them where they are expected, even 
> > > inserted an explicit load
> > > library call in my code, but I still don’t notice any significant time 
> > > savings.
> > > 
> > > 5.       The Feed plugin seems to have a major problem. The line 102 in  
> > > FeedIndexingFilter.java
> > > generated a NumberFormatException (which caused the failure of the entire 
> > > crawling process!) because
> > > it was trying to parse a date in string format, not a number. Given that 
> > > this metadata piece was
> > > generated by the feed parser (same plugin), it seems that the plugin is 
> > > in disagreement with itself.
> > > 
> > > 6.       This is less important, but when Tika fails to parse a document, 
> > > it generates a scary error
> > > message and ugly stack trace. I think this should be a one line warning, 
> > > because other parsers may
> > > still parse this document successfully.
> > > 
> > >  
> > > 
> > > Hope this helps.
> > > 
> > >  
> > > 
> > > Regards,
> > > 
> > >  
> > > 
> > > Arkadi
> > > 
> > 
> > 
>
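
Regarding item 1 in the quoted report (a failed parser returns an empty result rather than null, so later candidates never get a turn), the fallback behaviour being asked for can be sketched roughly as below. This is only an illustration under stated assumptions: Result, Parser and parseWithFallback are hypothetical stand-ins, not Nutch's actual classes or the fix tracked in NUTCH-2071.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class ParserChainSketch {

    /** Hypothetical stand-in for a parse result; NOT Nutch's ParseResult class. */
    public static final class Result {
        private final String text;

        public Result(String text) {
            this.text = text;
        }

        public String text() {
            return text;
        }

        /** A result without any extracted text is treated as a failed parse. */
        public boolean isEmpty() {
            return text == null || text.isEmpty();
        }
    }

    /** Hypothetical stand-in for a parser plugin. */
    public interface Parser {
        Result parse(byte[] content);
    }

    /**
     * Tries each candidate in order and returns the first non-empty result.
     * A null result, an empty result and a runtime exception all count as
     * failure, so the next candidate in the chain still gets a chance.
     */
    public static Optional<Result> parseWithFallback(List<Parser> candidates, byte[] content) {
        for (Parser candidate : candidates) {
            Result result;
            try {
                result = candidate.parse(content);
            } catch (RuntimeException e) {
                continue; // this candidate failed; move on to the next one
            }
            if (result != null && !result.isEmpty()) {
                return Optional.of(result);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Parser failing = content -> new Result(""); // empty, but not null
        Parser working = content -> new Result("parsed " + content.length + " bytes");
        Optional<Result> parsed =
            parseWithFallback(Arrays.asList(failing, working), new byte[] {1, 2, 3});
        System.out.println(parsed.map(Result::text).orElse("no parser succeeded"));
    }
}
```

The point of the sketch is the emptiness check: testing only for null would stop at the first candidate, which is the behaviour described in the report.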

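Regarding item 5 in the quoted report (a NumberFormatException when a date arrives as a string instead of a number), one defensive approach is to try a numeric timestamp first and fall back to date formats, returning nothing instead of throwing. This is a sketch only: FeedDateSketch and parseTimestamp are hypothetical names, and the two fallback formats (ISO-8601 and RFC 1123) are assumptions, since the report does not say which format the feed parser actually emitted.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

public class FeedDateSketch {

    /**
     * Interprets a metadata value as a point in time without ever throwing.
     * Order of attempts: epoch milliseconds, ISO-8601 instant, RFC 1123 date.
     * The two date formats are assumptions made for this sketch.
     */
    public static Optional<Instant> parseTimestamp(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        String value = raw.trim();
        try {
            return Optional.of(Instant.ofEpochMilli(Long.parseLong(value)));
        } catch (NumberFormatException notANumber) {
            // not a numeric timestamp; try date formats next
        }
        try {
            return Optional.of(Instant.parse(value));
        } catch (DateTimeParseException notIso) {
            // not ISO-8601; try RFC 1123 next
        }
        try {
            return Optional.of(
                OffsetDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant());
        } catch (DateTimeParseException notRfc1123) {
            return Optional.empty(); // caller can log a warning and skip the field
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("0"));
        System.out.println(parseTimestamp("Wed, 13 Jun 2018 03:46:11 -0700"));
        System.out.println(parseTimestamp("not a date"));
    }
}
```

With this shape, an unparseable value costs one skipped field rather than the whole crawl.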