Re: A bug in org.apache.nutch.parse.ParseUtil?

Mattmann, Chris A (3980) Mon, 20 Apr 2015 21:33:59 -0700

Sounds great, Arkadi (isAnySuccess()). Please submit a pull
request and/or patch when you get a chance. This sounds like
a needed change for sure.


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: "arkadi.kosmy...@csiro.au" <arkadi.kosmy...@csiro.au>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Tuesday, April 21, 2015 at 12:20 AM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: RE: A bug in org.apache.nutch.parse.ParseUtil?

>Hi Sebastian,
>
>Yes, I considered parseResult.isSuccess(), but the problem is, it returns
>success only if all parses were successful. So, if the first parser
>succeeds, it will break the loop, else all parsers will be used - I don't
>think this was the idea.
>
>If retaining ParseStatus of failed parses is important, perhaps a similar
>isAnySuccess() function could help.
>
>Regards,
>
>Arkadi
>
>-----Original Message-----
>From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>Sent: Saturday, 18 April 2015 7:37 AM
>To: user@nutch.apache.org
>Subject: Re: A bug in org.apache.nutch.parse.ParseUtil?
>
>Hi Arkadi,
>
>Agreed, that's a bug.
>
>> if ( parseResult != null ) parseResult.filter() ;
>
>parseResult.isSuccess()
>  would do the check without modifying the ParseResult
>
>In case the fall-back parsers also fail, it could be useful to return one
>(the first? the last?) failed ParseResult. Luckily the parser places a
>meaningful error message or minor ParseStatus, which could be used by the
>caller for diagnostics.
>
>Thanks,
>Sebastian
>
>On 04/17/2015 06:31 AM, arkadi.kosmy...@csiro.au wrote:
>> Hi,
>> 
>> From reading the code it is clear that it is designed to allow using
>> several parsers to parse a document in a sequence, until it is
>> successfully parsed. In practice, this does not work because these
>> lines
>> 
>> if (parseResult != null && !parseResult.isEmpty())
>>         return parseResult;
>> 
>> break the loop even if the parsing has failed because parseResult is
>> not empty anyway, it contains a ParseData with ParseStatus.FAILED.
>> This is easy to fix, for example, by adding a line before the two lines
>> mentioned above:
>> 
>> if ( parseResult != null ) parseResult.filter() ;
>> 
>> This will remove failed ParseData objects from the parseResult and
>> leave it empty if the parsing had been unsuccessful. I believe that this
>> fix is important because it allows use of backup parsers as originally
>> designed and thus increases index completeness.
>> 
>> Regards,
>> Arkadi
>> 
>> 
>> 
>
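
The behaviour discussed in this thread can be illustrated with a minimal, self-contained sketch. It does not use the real Nutch classes: SimpleParseResult, Status, parseWithFallback and the Supplier-based parser list are hypothetical stand-ins invented for illustration, and isAnySuccess() is only the method proposed above, not an existing API. Returning the last failed result when every parser fails is one of the two options Sebastian raises (first or last) and is an arbitrary choice here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class ParseFallbackSketch {

    /** Hypothetical stand-in for the outcome of a single parse. */
    enum Status { SUCCESS, FAILED }

    /** Hypothetical, simplified stand-in for a result holding several parses. */
    static class SimpleParseResult {
        final String parserName;
        final List<Status> statuses;

        SimpleParseResult(String parserName, Status... statuses) {
            this.parserName = parserName;
            this.statuses = Arrays.asList(statuses);
        }

        boolean isEmpty() {
            return statuses.isEmpty();
        }

        /** True only if every contained parse succeeded, as described in the thread. */
        boolean isSuccess() {
            return statuses.stream().allMatch(s -> s == Status.SUCCESS);
        }

        /** The proposed check: true if at least one contained parse succeeded. */
        boolean isAnySuccess() {
            return statuses.stream().anyMatch(s -> s == Status.SUCCESS);
        }
    }

    /**
     * Tries each parser in order and returns the first result with at least
     * one successful parse. If none succeeds, returns the last failed result
     * (or null if no parser produced anything) so the caller can still
     * inspect the failure status.
     */
    static SimpleParseResult parseWithFallback(List<Supplier<SimpleParseResult>> parsers) {
        SimpleParseResult lastFailure = null;
        for (Supplier<SimpleParseResult> parser : parsers) {
            SimpleParseResult result = parser.get();
            if (result == null || result.isEmpty()) {
                continue; // nothing produced, try the next parser
            }
            if (result.isAnySuccess()) {
                return result; // stop at the first parser that actually succeeded
            }
            lastFailure = result; // keep the failure for diagnostics
        }
        return lastFailure;
    }

    public static void main(String[] args) {
        List<Supplier<SimpleParseResult>> parsers =
            Arrays.<Supplier<SimpleParseResult>>asList(
                () -> new SimpleParseResult("first", Status.FAILED),
                () -> new SimpleParseResult("second", Status.SUCCESS));

        SimpleParseResult result = parseWithFallback(parsers);
        System.out.println(result.parserName + " anySuccess=" + result.isAnySuccess());
    }
}
```

In this sketch the failed result of the first parser no longer ends the loop, so the second parser gets its turn, and the failed result is not modified along the way.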
