RE: A bug in org.apache.nutch.parse.ParseUtil?

Arkadi.Kosmynin Mon, 20 Apr 2015 21:22:12 -0700

Hi Sebastian,

Yes, I considered parseResult.isSuccess(), but the problem is that it reports 
success only if all parses were successful. So, if the first parser succeeds 
completely, the loop breaks; otherwise all parsers are used. I don't think 
this was the idea.


If retaining ParseStatus of failed parses is important, perhaps a similar 
isAnySuccess() function could help.
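
To illustrate the difference, here is a minimal sketch. SketchParseResult is a 
simplified, hypothetical stand-in that only tracks a success flag per parse; it 
is not the real Nutch ParseResult class, and isAnySuccess() is only the method 
proposed above, not an existing API as far as this thread shows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parse result: one success flag per parse.
// This is NOT org.apache.nutch.parse.ParseResult.
class SketchParseResult {
    private final List<Boolean> statuses = new ArrayList<>();

    // Record the outcome of one parse (true = success, false = failed).
    void put(boolean success) {
        statuses.add(success);
    }

    boolean isEmpty() {
        return statuses.isEmpty();
    }

    // Behaviour as described above: true only if every parse succeeded.
    boolean isSuccess() {
        for (boolean ok : statuses) {
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    // Proposed check: true if at least one parse succeeded.
    // Failed entries stay in the result, so their status is retained.
    boolean isAnySuccess() {
        for (boolean ok : statuses) {
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

With one failed and one successful parse in the result, isSuccess() is false 
while isAnySuccess() is true, so a loop testing isAnySuccess() could stop at 
the first parser that produced anything usable.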

Regards,

Arkadi

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: Saturday, 18 April 2015 7:37 AM
To: user@nutch.apache.org
Subject: Re: A bug in org.apache.nutch.parse.ParseUtil?

Hi Arkadi,

Agreed, that's a bug.

> if ( parseResult != null ) parseResult.filter() ;

parseResult.isSuccess()
  would do the check without modifying the ParseResult

In case the fall-back parsers also fail, it could be useful to return one (the 
first? the last?) failed ParseResult. Luckily, the parser places a meaningful 
error message or minor ParseStatus which could be used by the caller for 
diagnostics.
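
A rough sketch of that idea, assuming the last failure is the one to keep. The 
class FallbackParseSketch, the Outcome type and parseWithFallback() are all 
hypothetical names made up for illustration; parsers are modelled as plain 
functions, and none of this is the actual ParseUtil code.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; not the real Nutch ParseUtil.
final class FallbackParseSketch {

    // Hypothetical outcome of one parser run: a success flag plus a message.
    static final class Outcome {
        final boolean success;
        final String message;

        Outcome(boolean success, String message) {
            this.success = success;
            this.message = message;
        }
    }

    // Try each parser in order. Return the first successful outcome;
    // if none succeeds, return the last failed outcome so the caller
    // can still read its message. Returns null if no parser produced
    // any outcome at all.
    static Outcome parseWithFallback(List<Function<String, Outcome>> parsers,
                                     String content) {
        Outcome lastFailure = null;
        for (Function<String, Outcome> parser : parsers) {
            Outcome outcome = parser.apply(content);
            if (outcome == null) {
                continue; // this parser produced nothing, try the next one
            }
            if (outcome.success) {
                return outcome;
            }
            lastFailure = outcome; // remember it for diagnostics
        }
        return lastFailure;
    }
}
```

Keeping the first failure instead would only need the assignment to be guarded 
by a null check on lastFailure.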

Thanks,
Sebastian

On 04/17/2015 06:31 AM, arkadi.kosmy...@csiro.au wrote:
> Hi,
> 
> From reading the code it is clear that it is designed to allow using 
> several parsers to parse a document in a sequence, until it is 
> successfully parsed. In practice, this does not work because these 
> lines
> 
> if (parseResult != null && !parseResult.isEmpty())
>         return parseResult;
> 
> break the loop even if the parsing has failed, because parseResult is not 
> empty anyway: it contains a ParseData with ParseStatus.FAILED.
> This is easy to fix, for example, by adding a line before the two lines 
> mentioned above:
> 
> if ( parseResult != null ) parseResult.filter() ;
> 
> This will remove failed ParseData objects from the parseResult and leave it 
> empty if the parsing has been unsuccessful. I believe that this fix is 
> important because it allows use of backup parsers as originally designed and 
> thus increases index completeness.
> 
> Regards,
> Arkadi
> 
> 
>
