[ https://issues.apache.org/jira/browse/JCR-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587611#action_12587611 ]
Alexander Klimetschek commented on JCR-1530:
--------------------------------------------

Hmm, IMHO it shouldn't be Jackrabbit's concern to handle such "details", especially as text extraction from binary files is not a mandatory aspect of the JCR API.

What about using Apache Tika? It aims to collect the various extraction libraries and self-built classes of the Apache projects and to build a proper re-usable framework. It recently pushed out its first release. Jukka, you probably know more about it - is it already useful for Jackrabbit? You mentioned in JCR-1290 that this could be a task for Jackrabbit 1.5.

http://incubator.apache.org/tika/

> MsPowerPointTextExtractor does not extract from PPTs with € sign
> ----------------------------------------------------------------
>
>                 Key: JCR-1530
>                 URL: https://issues.apache.org/jira/browse/JCR-1530
>             Project: Jackrabbit
>          Issue Type: Bug
>          Components: jackrabbit-text-extractors
>    Affects Versions: 1.4
>            Reporter: Dirk Feufel
>
> The MsPowerPointTextExtractor class has a problem when reading PPTs that
> contain a € sign. All text following that sign is ignored. Perhaps the POI
> PowerPointExtractor should be used instead of parsing the data by hand. As a
> side effect, this would simplify the code. Extraction could be done as follows:
>
>     public Reader extractText(InputStream stream, String type, String encoding) throws IOException {
>         try {
>             PowerPointExtractor extractor = new PowerPointExtractor(stream);
>             return new StringReader(extractor.getText(true, true));
>         } catch (RuntimeException e) {
>             logger.warn("Failed to extract PowerPoint text content", e);
>             return new StringReader("");
>         } finally {
>             try { stream.close(); } catch (IOException ignored) {}
>         }
>     }
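
The pattern in the quoted snippet (delegate parsing to a library extractor, fall back to an empty reader on a RuntimeException, always close the stream) can be tried out without POI on the classpath. Below is a minimal sketch, assuming a hypothetical TextSource interface that stands in for POI's PowerPointExtractor. The names SafeExtraction, TextSource, TrackingStream and extractToString are made up for illustration and are not Jackrabbit or POI API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class SafeExtraction {

    /** Hypothetical stand-in for a library extractor such as POI's PowerPointExtractor. */
    interface TextSource {
        String getText(InputStream stream) throws IOException;
    }

    /** Demo-only stream that records whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Same control flow as the quoted extractText(): delegate, fall back, close. */
    static Reader extractSafely(InputStream stream, TextSource source) throws IOException {
        try {
            return new StringReader(source.getText(stream));
        } catch (RuntimeException e) {
            // The quoted code logs via logger.warn; stderr keeps this sketch dependency-free.
            System.err.println("Failed to extract text content: " + e);
            return new StringReader("");
        } finally {
            try { stream.close(); } catch (IOException ignored) { }
        }
    }

    /** Convenience wrapper for the demo: drains the reader into a String. */
    static String extractToString(InputStream stream, TextSource source) {
        try (Reader reader = extractSafely(stream, source)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A well-behaved source: text after the euro sign (\u20ac) must survive.
        TrackingStream good = new TrackingStream(new byte[0]);
        String text = extractToString(good, s -> "Price: 5 \u20ac per slide");
        System.out.println("extracted: " + text + " (closed=" + good.closed + ")");

        // A source that fails with a RuntimeException: the result is empty text.
        TrackingStream bad = new TrackingStream(new byte[0]);
        String fallback = extractToString(bad, s -> {
            throw new IllegalStateException("corrupt record");
        });
        System.out.println("fallback length: " + fallback.length() + " (closed=" + bad.closed + ")");
    }
}
```

With POI available, the lambda would presumably be replaced by the call from the quoted proposal, i.e. constructing a PowerPointExtractor from the stream and returning getText(true, true). Note that only RuntimeException is swallowed; an IOException from the extractor still propagates to the caller, as in the quoted code.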