[Nutch-dev] [jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

Stefan Groschupf (JIRA) Fri, 02 Jun 2006 08:46:50 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ]


Stefan Groschupf commented on NUTCH-290:
----------------------------------------

If a parser throws an exception, it is caught here
(Fetcher, line 261):
        try {
          parse = this.parseUtil.parse(content);
          parseStatus = parse.getData().getStatus();
        } catch (Exception e) {
          parseStatus = new ParseStatus(e);
        }
        if (!parseStatus.isSuccess()) {
          LOG.warning("Error parsing: " + key + ": " + parseStatus);
          parse = parseStatus.getEmptyParse(getConf());
        }

Then we use the empty parse object, and an empty parse contains
no text at all (see getText):
  private static class EmptyParseImpl implements Parse {
    
    private ParseData data = null;
    
    public EmptyParseImpl(ParseStatus status, Configuration conf) {
      data = new ParseData(status, "", new Outlink[0],
                           new Metadata(), new Metadata());
      data.setConf(conf);
    }
    
    public ParseData getData() {
      return data;
    }

    public String getText() {
      return "";
    }
  }
So the problem should be somewhere else.
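
To make the reasoning concrete, here is a minimal, self-contained sketch of the same fallback pattern. It does not use the real Nutch classes: EmptyParseSketch, Parse, Parser, EmptyParse, TextParse and parseOrEmpty are all hypothetical stand-ins invented for illustration. It only shows that, with this control flow, a parser that throws ends up contributing an empty string as text.

```java
// Simplified stand-ins for illustration only; these are NOT the real Nutch classes.
public class EmptyParseSketch {

    /** Hypothetical minimal parse result: extracted text plus a success flag. */
    interface Parse {
        String getText();
        boolean isSuccess();
    }

    /** Hypothetical parser that may throw, e.g. when text extraction is not allowed. */
    interface Parser {
        Parse parse(String content) throws Exception;
    }

    /** Mirrors the idea of EmptyParseImpl: a failed parse carries no text. */
    static final class EmptyParse implements Parse {
        public String getText() { return ""; }
        public boolean isSuccess() { return false; }
    }

    /** A successful parse that simply wraps the extracted text. */
    static final class TextParse implements Parse {
        private final String text;
        TextParse(String text) { this.text = text; }
        public String getText() { return text; }
        public boolean isSuccess() { return true; }
    }

    /** Same shape as the Fetcher snippet: an exception or a failed status yields an empty parse. */
    static Parse parseOrEmpty(Parser parser, String content) {
        Parse parse;
        try {
            parse = parser.parse(content);
        } catch (Exception e) {
            return new EmptyParse();
        }
        if (parse == null || !parse.isSuccess()) {
            return new EmptyParse();
        }
        return parse;
    }

    public static void main(String[] args) {
        Parser throwing = content -> { throw new Exception("text extraction not allowed"); };
        Parser working = content -> new TextParse(content.trim());

        // The throwing parser contributes no text; the working one returns its text.
        System.out.println("[" + parseOrEmpty(throwing, "%PDF...").getText() + "]");
        System.out.println("[" + parseOrEmpty(working, " hello ").getText() + "]");
    }
}
```

If the real code behaves like this sketch, garbage text could only reach the index on a path where the parser reports success, but that is an assumption that would need checking against the actual parse-pdf plugin.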

> parse-pdf: Garbage indexed when text-extraction not allowed
> -----------------------------------------------------------
>
>          Key: NUTCH-290
>          URL: http://issues.apache.org/jira/browse/NUTCH-290
>      Project: Nutch
>         Type: Bug

>   Components: indexer
>     Versions: 0.8-dev
>     Reporter: Stefan Neufeind
>  Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers