[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ]

Stefan Groschupf commented on NUTCH-290:

If a parser throws an exception, the Fetcher (line 261) handles it like this:

    try {
      parse = this.parseUtil.parse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: " + key + ": " + parseStatus);
      parse = parseStatus.getEmptyParse(getConf());
    }

Then we use the empty parse object, and an empty parse contains no text at all; see getText():

    private static class EmptyParseImpl implements Parse {

      private ParseData data = null;

      public EmptyParseImpl(ParseStatus status, Configuration conf) {
        data = new ParseData(status, "", new Outlink[0],
                             new Metadata(), new Metadata());
        data.setConf(conf);
      }

      public ParseData getData() {
        return data;
      }

      public String getText() {
        return "";
      }
    }

So the problem should be somewhere else.

parse-pdf: Garbage indexed when text-extraction not allowed
---
Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-290-canExtractContent.patch

It seems that garbage (or undecoded text?) is indexed when text-extraction for a PDF is not allowed.
Example-PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
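The error path quoted from the Fetcher in the comment above can be tried in isolation. The following is only a minimal sketch: SketchParse, SketchParser and parseOrEmpty are hypothetical stand-ins invented for illustration, not the real Nutch Parse, Parser or ParseStatus classes. It shows that a parser exception ends in a failed parse whose text is the empty string.

```java
import java.io.IOException;

// Hypothetical, simplified stand-ins; none of these types exist in Nutch.
final class EmptyParseSketch {

    /** Minimal model of a parse result: a success flag plus the extracted text. */
    static final class SketchParse {
        final boolean success;
        final String text;

        SketchParse(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }

    /** Hypothetical parser interface that may throw, as a parse plugin can. */
    interface SketchParser {
        SketchParse parse(byte[] content) throws Exception;
    }

    /** Follows the quoted pattern: an exception or a failed status yields an empty parse. */
    static SketchParse parseOrEmpty(SketchParser parser, byte[] content) {
        SketchParse parse;
        try {
            parse = parser.parse(content);
        } catch (Exception e) {
            // Counterpart of "parseStatus = new ParseStatus(e)" in the quoted code.
            parse = new SketchParse(false, "");
        }
        if (!parse.success) {
            // Counterpart of getEmptyParse(): failed status, no text.
            parse = new SketchParse(false, "");
        }
        return parse;
    }

    public static void main(String[] args) {
        SketchParser denied = content -> {
            throw new IOException("You do not have permission to extract text");
        };
        byte[] pdfHeader = {0x25, 0x50, 0x44, 0x46}; // the bytes of "%PDF"
        SketchParse result = parseOrEmpty(denied, pdfHeader);
        System.out.println("success=" + result.success
                + ", textLength=" + result.text.length());
    }
}
```

Under these assumptions the raw bytes never reach the text field, which is consistent with the conclusion that the garbage has to come from another code path.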
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414458 ]

Stefan Neufeind commented on NUTCH-290:

But if one plugin fails in 0.8-dev, isn't the next one used? As I understand it, in the default config the text parser would be used as the last-resort fallback.

Also, I'm not sure where the summary text comes from if I use the patch above to prevent the exception and return empty parse data instead.
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ]

Stefan Groschupf commented on NUTCH-290:

As far as I understand the code, the next parser is only used if the previous parser returns an unsuccessful parse status. If the parser throws an exception, that exception is not caught in ParseUtil at all.

So to solve this problem the PDF parser should throw an exception rather than report an unsuccessful status, shouldn't it?
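The fallback behaviour described in this comment can be modelled in a few lines. This is a sketch of the behaviour as the comment describes it, not a copy of ParseUtil and not verified against the Nutch source; ParserChainSketch, Result, Parser, TEXT_PARSER and parseWithFallback are all hypothetical names. It shows why a failed status from the PDF parser would hand the raw bytes to a plain-text parser, while an exception would skip the fallback.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a parser chain; not the real ParseUtil.
final class ParserChainSketch {

    static final class Result {
        final boolean success;
        final String text;

        Result(boolean success, String text) {
            this.success = success;
            this.text = text;
        }
    }

    interface Parser {
        Result parse(byte[] content);
    }

    /** Stand-in for a plain-text parser: it decodes whatever bytes it is given. */
    static final Parser TEXT_PARSER =
            content -> new Result(true, new String(content, StandardCharsets.ISO_8859_1));

    /** Tries parsers in order and moves on only after an unsuccessful status. */
    static Result parseWithFallback(List<Parser> parsers, byte[] content) {
        Result last = new Result(false, "");
        for (Parser parser : parsers) {
            last = parser.parse(content); // an exception propagates; no fallback happens
            if (last.success) {
                return last;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        byte[] rawPdf = "%PDF-1.4 binary stream".getBytes(StandardCharsets.ISO_8859_1);
        Parser pdfReportsFailure = content -> new Result(false, "");
        Parser pdfThrows = content -> {
            throw new IllegalStateException("You do not have permission to extract text");
        };

        Result viaFallback =
                parseWithFallback(Arrays.asList(pdfReportsFailure, TEXT_PARSER), rawPdf);
        System.out.println("failed status -> indexed text: " + viaFallback.text);

        try {
            parseWithFallback(Arrays.asList(pdfThrows, TEXT_PARSER), rawPdf);
        } catch (IllegalStateException e) {
            System.out.println("exception -> no fallback: " + e.getMessage());
        }
    }
}
```

If the real chain behaves like this model, the raw PDF bytes in the summary would be explained by the text parser running as the last resort.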
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414477 ]

Stefan Neufeind commented on NUTCH-290:

But as I understand the plugin, it still extracts as much as possible (metadata) from the PDF. So if indexing is not allowed but this is a PDF, then returning empty text as the document body should be fine, shouldn't it? Nothing except a PDF plugin will be able to handle a PDF correctly in this case.

Stefan G., can you point out why I see binary data as the summary for a PDF, and whether there is a possible fix for it in the context of this bug?
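The approach argued for in this comment (keep the metadata, return an empty body when extraction is not permitted) might look roughly like the sketch below. It is an illustration only: PdfDoc, Result and parse are hypothetical types invented here, and the extractionAllowed flag merely stands in for whatever permission check the attached patch performs. It does not use the PDFBox API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical types for illustration; not the parse-pdf plugin and not PDFBox.
final class RestrictedPdfSketch {

    /** Assumed view of an already opened PDF document. */
    static final class PdfDoc {
        final String title;
        final String body;
        final boolean extractionAllowed;

        PdfDoc(String title, String body, boolean extractionAllowed) {
            this.title = title;
            this.body = body;
            this.extractionAllowed = extractionAllowed;
        }
    }

    static final class Result {
        final boolean success;
        final String text;
        final Map<String, String> metadata;

        Result(boolean success, String text, Map<String, String> metadata) {
            this.success = success;
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Always reports success; the body text is empty when extraction is not permitted. */
    static Result parse(PdfDoc doc) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", doc.title);
        String text = doc.extractionAllowed ? doc.body : "";
        return new Result(true, text, metadata);
    }

    public static void main(String[] args) {
        PdfDoc restricted = new PdfDoc("Management en Architectuur", "body text", false);
        Result result = parse(restricted);
        System.out.println("success=" + result.success
                + ", title=" + result.metadata.get("title")
                + ", textLength=" + result.text.length());
    }
}
```

Because the status is successful, a parser chain that falls back only on failure would not pass the raw bytes on to another parser, and the title would still be available for indexing.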
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413780 ]

Stefan Neufeind commented on NUTCH-290:

The plugin itself imho works fine now. It no longer throws an exception and, if extraction is allowed, outputs the text correctly. However, I still get the garbage output from a PDF. Could that be because, when no extraction is allowed (empty parse text returned), the parser still falls back to indexing the raw text?

What I did was delete crawl_parse and parse_* from the segments directory, run nutch parse and reindex everything. However, the raw chars in the search output (summary) remain. :-((
[jira] Commented: (NUTCH-290) parse-pdf: Garbage (?) indexed when text-extraction now allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413637 ]

Stefan Neufeind commented on NUTCH-290:

This catch block fires in the PDF parser:

    } catch (Exception e) { // run time exception
      LOG.warning("General exception in PDF parser: " + e.getMessage());
      e.printStackTrace();
      return new ParseStatus(ParseStatus.FAILED,
              "Can't be handled as pdf document. " + e).getEmptyParse(getConf());
    }

The exception is:

    060522 001010 General exception in PDF parser: You do not have permission to extract text
    java.io.IOException: You do not have permission to extract text
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:189)
        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)
        at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:120)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:257)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:143)

Could it be that, as a fallback when the document can't be parsed and no description is returned, the document itself is used as the description in the search output? If so, this seems to cause problems for binary files.