[ https://issues.apache.org/jira/browse/JCR-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting reopened JCR-1878:
--------------------------------

We need the ooxml-schemas dependency in any case if we want to support Microsoft Office 2007 files (see JCR-1887). I think that's a pretty important improvement, that's definitely worth keeping even if it notably increases the standalone jar size.

I'll ping the POI people on whether the ooxml-schemas jar could be trimmed down somehow. Also, in Tika we could perhaps find some ways to reduce the size of the dependencies, as not all of the included functionality is really needed (text extraction is typically just a part of the functionality included in the parser libraries).

Anyway, I'm reopening this issue until we have a solution that satisfies everyone.

> Use Apache Tika for text extraction
> -----------------------------------
>
>                 Key: JCR-1878
>                 URL: https://issues.apache.org/jira/browse/JCR-1878
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-text-extractors
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 1.6.0
>
>
> Once Apache Tika is released with a resolution to TIKA-175 (making Tika
> available to Java 1.4 projects), we should replace our direct parser library
> dependencies with Tika parsers. Ideally we'd just use the Tika
> AutoDetectParser that'll automatically detect the type of a binary and parse
> it accordingly, solving JCR-728.
> I guess we should keep some level of backwards compatibility with existing
> textFilterClasses="..." configurations, perhaps by keeping the existing
> TextExtractor classes as wrappers around respective Tika parsers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.