Re: Nutch 1.14 issues

Sebastian Nagel Wed, 13 Jun 2018 04:22:22 -0700

Hi Markus,

On Jenkins all unit tests have passed including plugins:
  https://builds.apache.org/job/Nutch-trunk/3536/testReport/


(same on my laptop running Ubuntu 18.04 and on a Ubuntu 16.04 server)

Could be related to the Java version.
% java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.18.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

But let's discuss the test failures in separate threads.

Sebastian


On 06/13/2018 12:45 PM, Markus Jelsma wrote:
> Ah, wrong thread. But it seems some things are not entirely right for the 1.15
> release just yet.
> Markus
> 
>  
>  
> -----Original message-----
>> From:Markus Jelsma <[email protected]>
>> Sent: Wednesday 13th June 2018 12:44
>> To: [email protected]
>> Subject: RE: Nutch 1.14 issues
>>
>> Hi,
>>
>> I've got some tests failing here on a vanilla master check out.
>>
>>     [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.314 sec
>>     [junit] Test org.apache.nutch.net.TestURLNormalizers FAILED
>>
>> Jurian had protocol-http's test failing just now, but running ant test on my
>> system with a clean checkout didn't run the plugin tests at all. Whatever I
>> do, plugin tests won't run.
>>
>> Markus
>>
>>
>>
>>  
>>  
>> -----Original message-----
>>> From:Sebastian Nagel <[email protected]>
>>> Sent: Tuesday 12th June 2018 16:24
>>> To: [email protected]
>>> Subject: Re: Nutch 1.14 issues
>>>
>>> Hi Arkadi,
>>>
>>> thanks for your feedback and suggestions.
>>> I can understand your frustration but I also want to clarify:
>>>
>>> - Arch is a nice project, for sure. But Arch is GPL licensed,
>>>   which makes contributions a one-way route (Nutch -> Arch)
>>>   and keeps me from even looking into the Arch sources. Sorry.
>>>
>>> - Please take the time to split your list of issues into separate
>>>   requests on the mailing list or open separate Jira issues.
>>>   Also take care that the problems are reproducible by sharing
>>>   documents that failed to parse, log snippets, config files, etc.
>>>
>>> - Sorry about NUTCH-2071, I took this mainly as a class path issue
>>>   in the parse-tika plugin (which is solved). Now I understand better
>>>   what your objective is and I'll review and try to fix it
>>>   (in combination with NUTCH-1993). But again: please take the time
>>>   to explain your objectives, ping committers if fixes make no progress,
>>>   etc.
>>>
>>> - Nutch is a community project. There are no "paid" committers. This
>>>   means although some of us are paid to configure/operate/adapt crawlers
>>>   nobody is delegated to fix issues, support Nutch users, etc.
>>>   That's voluntary work.
>>>
>>> - Everybody is welcome to contribute (patches, documentation, support
>>>   on the mailing list, etc.). Because Nutch is a small project, this
>>>   will definitely help us.
>>>
>>>
>>> Thanks,
>>> Sebastian
>>>
>>>
>>>
>>> On 06/12/2018 08:46 AM, [email protected] wrote:
>>>> Hi guys,
>>>>
>>>>  
>>>>
>>>> I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to Nutch 1.14 and Solr 7.2,
>>>> and I have come across a few serious issues, of which you should be aware:
>>>>
>>>>  
>>>>
>>>> 1.       NUTCH-2071 is still an issue in 1.14, because the returned parseResult is never null.
>>>> If a parser fails to parse a document, it returns an empty result, but not null. This means that,
>>>> from a chain of parser candidates, only the first one has a chance to try to parse the document.
>>>>
>>>> 2.       Nutch adopted Tika as a general parsing tool, and stopped supporting “legacy” parsing (OO,
>>>> MS) plugins. I continued using them and hoped to stop supporting them in the next version of Arch,
>>>> which I am preparing for release, but I still can’t do it, because Tika fails to parse too many
>>>> documents on our site. But, when I reinforce Tika with the legacy parsers, I achieve an almost 100%
>>>> parsing success rate. This is why NUTCH-2071 is important for Arch. I think you should bring back
>>>> legacy parsers to Nutch, because the quality of parsing of “real life” data, such as ours, is not
>>>> great without them.
>>>>
>>>> 3.       The lines defining the fall-back (*) plugin in parse-plugins.xml are not effective, because
>>>> they are ignored as long as there is at least one plugin claiming * in its plugin.xml file. In some
>>>> cases, Nutch assigns the * capability to plugins that don’t even claim it. For example, I can’t
>>>> understand why the Arch content blocking plugin gets it.
>>>>
>>>> 4.       In earlier versions of Nutch, use of the native libraries really helped. It reduced
>>>> the crawl time of our site from a couple of days to 6-7 hours. In Nutch 1.14, I don’t notice this.
>>>> I’ve obtained the Hadoop libraries, placed them where they are expected, even inserted an explicit
>>>> load library call in my code, but I still don’t notice any significant time savings.
>>>>
>>>> 5.       The Feed plugin seems to have a major problem. Line 102 in FeedIndexingFilter.java
>>>> threw a NumberFormatException (which caused the failure of the entire crawling process!) because
>>>> it was trying to parse a date in string format, not a number. Given that this metadata piece was
>>>> generated by the feed parser (same plugin), it seems that the plugin is in disagreement with itself.
>>>>
>>>> 6.       This is less important, but when Tika fails to parse a document, it generates a scary error
>>>> message and an ugly stack trace. I think this should be a one-line warning, because other parsers
>>>> may still parse this document successfully.
>>>>
>>>>  
>>>>
>>>> Hope this helps.
>>>>
>>>>  
>>>>
>>>> Regards,
>>>>
>>>>  
>>>>
>>>> Arkadi
>>>>
>>>
>>>
>>
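For reference, the fallback behaviour asked for in item 1 of the quoted report (move on to the next parser candidate when the previous one returns an empty result, not only when it returns null) could be sketched roughly as below. This is only an illustration under assumptions: `ParserChainSketch`, `Parser`, `parseWithFallback` and `demoChain` are hypothetical names, not Nutch classes, and plain strings stand in for real parse results. The catch block also reflects item 6, logging a one-line warning where a full stack trace would otherwise appear.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

final class ParserChainSketch {

    /** Hypothetical parser: a name plus a function from raw content to extracted text. */
    static final class Parser {
        final String name;
        final Function<String, String> parse;

        Parser(String name, Function<String, String> parse) {
            this.name = name;
            this.parse = parse;
        }
    }

    /**
     * Tries each candidate in order and returns the first usable result,
     * formatted as "parserName:text". Null and empty results both count as
     * failures, so later candidates still get their chance.
     */
    static Optional<String> parseWithFallback(List<Parser> candidates, String content) {
        for (Parser p : candidates) {
            String text;
            try {
                text = p.parse.apply(content);
            } catch (RuntimeException e) {
                // One-line warning only; another parser may still succeed.
                System.err.println("WARN " + p.name + " failed: " + e.getMessage());
                continue;
            }
            if (text != null && !text.isEmpty()) {
                return Optional.of(p.name + ":" + text);
            }
        }
        return Optional.empty();
    }

    /** Demo chain: the first parser returns an empty result, the second throws. */
    static List<Parser> demoChain() {
        return Arrays.asList(
            new Parser("first", c -> ""),
            new Parser("second", c -> { throw new IllegalStateException("unsupported format"); }),
            new Parser("third", c -> c.toUpperCase()));
    }

    public static void main(String[] args) {
        System.out.println(parseWithFallback(demoChain(), "hello").orElse("unparsed"));
    }
}
```

With the demo chain, the third candidate ends up producing the result although the first one returned "successfully" with empty output.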

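Regarding item 5 of the quoted report (a NumberFormatException when a date string is parsed as a number), one defensive option would be to accept both representations and skip the value if neither matches, so a single bad metadata field cannot abort the crawl. The sketch below is an assumption-laden illustration, not the code in FeedIndexingFilter.java: `LenientDateSketch` and `toEpochMillis` are hypothetical names, and the accepted string formats (ISO-8601 instants and RFC 1123 dates) are guesses, since the report does not say which format the feed parser emitted.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.OptionalLong;

final class LenientDateSketch {

    /**
     * Converts a metadata value to epoch milliseconds. Accepts a plain number
     * (assumed to be epoch millis), an ISO-8601 instant, or an RFC 1123 date.
     * Returns an empty OptionalLong, never an exception, for anything else.
     */
    static OptionalLong toEpochMillis(String value) {
        if (value == null) {
            return OptionalLong.empty();
        }
        String v = value.trim();
        try {
            return OptionalLong.of(Long.parseLong(v));
        } catch (NumberFormatException ignored) {
            // Not a number; fall through to the date formats.
        }
        try {
            return OptionalLong.of(Instant.parse(v).toEpochMilli());
        } catch (DateTimeParseException ignored) {
            // Not an ISO-8601 instant.
        }
        try {
            return OptionalLong.of(
                OffsetDateTime.parse(v, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant().toEpochMilli());
        } catch (DateTimeParseException ignored) {
            // Not an RFC 1123 date either.
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("1528802640000"));
        System.out.println(toEpochMillis("Thu, 1 Jan 1970 00:00:01 GMT"));
        System.out.println(toEpochMillis("not a date"));
    }
}
```

A caller would then index the field only when a value is present and log a warning otherwise.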