[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] 

Stefan Groschupf commented on NUTCH-290:


If a parser throws an exception:
Fetcher, line 261:
try {
  parse = this.parseUtil.parse(content);
  parseStatus = parse.getData().getStatus();
} catch (Exception e) {
  parseStatus = new ParseStatus(e);
}
if (!parseStatus.isSuccess()) {
  LOG.warning("Error parsing: " + key + ": " + parseStatus);
  parse = parseStatus.getEmptyParse(getConf());
}

then we use the empty parse object,
and an empty parse simply contains no text; see getText():
private static class EmptyParseImpl implements Parse {

  private ParseData data = null;

  public EmptyParseImpl(ParseStatus status, Configuration conf) {
    data = new ParseData(status, "", new Outlink[0],
                         new Metadata(), new Metadata());
    data.setConf(conf);
  }

  public ParseData getData() {
    return data;
  }

  public String getText() {
    return "";
  }
}
So the problem should be somewhere else.
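
To make the argument above easy to try out, here is a minimal, runnable sketch of the same control flow. The class and method names (FetcherFallbackSketch, SimpleParse, parseOrEmpty) are hypothetical stand-ins, not Nutch classes, and the sketch catches RuntimeException only because a Supplier cannot throw checked exceptions.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins for Nutch's Parse/ParseStatus; not the real classes.
final class FetcherFallbackSketch {

    /** Minimal parse result: a success flag plus the extracted text. */
    static final class SimpleParse {
        final boolean success;
        final String text;

        SimpleParse(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }

    /** Mirrors the quoted Fetcher logic: any failure ends in an empty parse. */
    static SimpleParse parseOrEmpty(Supplier<SimpleParse> parser) {
        SimpleParse parse;
        try {
            parse = parser.get();
        } catch (RuntimeException e) {
            // The parser threw: treat this as a failed status.
            parse = new SimpleParse(false, null);
        }
        if (!parse.success) {
            // Failed status: replace the result with an empty parse (no text).
            parse = new SimpleParse(false, "");
        }
        return parse;
    }

    public static void main(String[] args) {
        SimpleParse thrown = parseOrEmpty(() -> {
            throw new IllegalStateException("You do not have permission to extract text");
        });
        SimpleParse failed = parseOrEmpty(() -> new SimpleParse(false, "%PDF-1.4 raw bytes"));
        System.out.println("thrown -> '" + thrown.text + "'");
        System.out.println("failed -> '" + failed.text + "'");
    }
}
```

In this model both the thrown exception and the failed status end with empty text, which is why the garbage would have to come from somewhere other than this code path.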

 parse-pdf: Garbage indexed when text-extraction not allowed
 ---

  Key: NUTCH-290
  URL: http://issues.apache.org/jira/browse/NUTCH-290
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Stefan Neufeind
  Attachments: NUTCH-290-canExtractContent.patch

 It seems that garbage (or undecoded text?) is indexed when text-extraction 
 for a PDF is not allowed.
 Example-PDF:
 http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414458 ] 

Stefan Neufeind commented on NUTCH-290:
---

But if one plugin fails in 0.8-dev, isn't the next used? I understand that in 
the default-config the text-parser would be used as the last resort fallback.

Also I'm not sure where the summary-text comes from if I use the patch above to 
prevent generating an exception but return empty parse-data.




[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] 

Stefan Groschupf commented on NUTCH-290:


As far as I understand the code, the next parser is only used if the previous 
parser returns with an unsuccessful parse status. If the parser throws an 
exception, that exception is not caught in ParseUtil at all.
So the PDF parser should throw an exception rather than report an unsuccessful 
status to solve this problem, shouldn't it?
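
The difference between the two failure modes can be sketched as follows. This is a simplified model of the behaviour described in this comment, not the actual ParseUtil code; ParserChainSketch, Result, Parser, parseWithChain and textFor are hypothetical names, and the text fallback parser is assumed to return the raw content unchanged.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the described behaviour; all names are hypothetical.
final class ParserChainSketch {

    /** Result of one parser: success flag plus extracted text. */
    static final class Result {
        final boolean success;
        final String text;

        Result(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }

    interface Parser {
        Result parse(String content);
    }

    /** Tries parsers in order; a failed status moves on to the next parser. */
    static Result parseWithChain(List<Parser> parsers, String content) {
        Result last = new Result(false, "");
        for (Parser p : parsers) {
            last = p.parse(content); // exceptions are NOT caught here
            if (last.success) {
                return last;
            }
        }
        return last;
    }

    /** Caller in the role of the Fetcher: an exception ends in empty text. */
    static String textFor(List<Parser> parsers, String content) {
        try {
            Result r = parseWithChain(parsers, content);
            return r.success ? r.text : "";
        } catch (RuntimeException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        String rawPdf = "%PDF-1.4 ...binary...";
        Parser pdfReportsFailure = c -> new Result(false, "");
        Parser pdfThrows = c -> {
            throw new IllegalStateException("no permission");
        };
        Parser textFallback = c -> new Result(true, c); // passes raw content through

        // Failed status: the chain continues and the raw bytes become the text.
        System.out.println("status: " + textFor(Arrays.asList(pdfReportsFailure, textFallback), rawPdf));
        // Exception: the chain is aborted and the caller ends up with empty text.
        System.out.println("thrown: " + textFor(Arrays.asList(pdfThrows, textFallback), rawPdf));
    }
}
```

Under these assumptions, only the failed-status path reaches the text fallback, which would explain raw PDF bytes showing up as indexed text.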





[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414477 ] 

Stefan Neufeind commented on NUTCH-290:
---

But to my understanding of the plugin, it still extracts as much as possible 
(meta-data) from the PDF. So if indexing is not allowed but this is a PDF, then 
returning empty text as the document-body should be fine, shouldn't it? 
Nothing else except a PDF-plugin will be able to handle PDF correctly in this 
case.

Stefan G., can you point out why I see binary data as the summary for a PDF, 
and whether there is a possible fix for it in the context of this bug?




[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-05-30 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413780 ] 

Stefan Neufeind commented on NUTCH-290:
---

The plugin itself imho works fine now. It no longer throws an exception and, 
if allowed, outputs text correctly.
However, I still get the garbage output from a PDF. Could that be because, 
when no extraction is allowed (empty parsing-text returned), the 
parser still falls back to using the raw text for indexing?

What I did was delete crawl_parse and parse_* from the segments-directory, 
run nutch parse and reindex everything. However, the raw chars in the 
search-output (summary) remain. :-((
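
For readers without the patch at hand, the behaviour described here (check the permission first, return empty text instead of letting the stripper throw) might look roughly like the sketch below. The patch itself is not quoted in this thread, so this is only an assumption about its shape; PdfDocumentSketch and PdfTextGuardSketch are hypothetical stand-ins and not PDFBox or Nutch types.

```java
// Hypothetical stand-in for a PDF library document; not a real PDFBox type.
interface PdfDocumentSketch {
    boolean canExtractContent();

    String extractText();
}

final class PdfTextGuardSketch {

    /** Returns the document text, or an empty string when extraction is forbidden. */
    static String textOrEmpty(PdfDocumentSketch doc) {
        if (!doc.canExtractContent()) {
            // Skip extraction entirely so no permission exception is raised.
            return "";
        }
        return doc.extractText();
    }

    /** Test helper: a fake document with a fixed permission flag and text. */
    static PdfDocumentSketch doc(final boolean allowed, final String text) {
        return new PdfDocumentSketch() {
            public boolean canExtractContent() {
                return allowed;
            }

            public String extractText() {
                if (!allowed) {
                    // Mimics the stripper refusing to extract protected text.
                    throw new IllegalStateException("You do not have permission to extract text");
                }
                return text;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println("allowed: '" + textOrEmpty(doc(true, "hello")) + "'");
        System.out.println("denied:  '" + textOrEmpty(doc(false, "secret")) + "'");
    }
}
```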




[jira] Commented: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed

2006-05-28 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413637 ] 

Stefan Neufeind commented on NUTCH-290:
---

This one here fires in the PDF parser:

} catch (Exception e) { // run time exception
  LOG.warning("General exception in PDF parser: " + e.getMessage());
  e.printStackTrace();
  return new ParseStatus(ParseStatus.FAILED,
      "Can't be handled as pdf document. " + e).getEmptyParse(getConf());
}

The exception is:

060522 001010 General exception in PDF parser: You do not have permission to extract text
java.io.IOException: You do not have permission to extract text
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:189)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)
    at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:120)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:257)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:143)


Could it be that, as a fallback when the document can't be parsed and no 
description is returned, the document itself is used as the description in 
the search-output? If so, this seems to lead to problems for binary files.
