add icu dependency
------------------

                 Key: TIKA-765
                 URL: https://issues.apache.org/jira/browse/TIKA-765
             Project: Tika
          Issue Type: Improvement
          Components: general
    Affects Versions: 0.10
            Reporter: Robert Muir

Spinoff of TIKA-713.

In PDFBox, reflection is used to detect whether ICU is available on the classpath: if it is, PDFBox can use ICU's BiDi support to properly extract right-to-left text. Otherwise, the text is returned "backwards". This is because the JDK does not provide the functionality needed to do this inverse BiDi reordering / Arabic unshaping.

It would be nice to properly depend on ICU, so that these languages work out of the box... we do this in Apache Solr's Tika integration (contrib/extraction), for example.

Unlike the charset detection code from ICU that Tika "includes", including BiDi support would be trickier, because it uses data files built from Unicode (these change over time and would be a hassle to maintain).

Additionally, as a note: Tika has some forked charset code from ICU... long term it would be great to get those changes into ICU as well.

Finally, as an optimization, it's possible to reduce the icu4j jar size if needed with http://apps.icu-project.org/datacustom/, but maybe as a start we could just depend upon the 'whole' ICU?
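
For illustration, the kind of reflection-based check described above might look roughly like the sketch below. This is not PDFBox's actual code, which is not shown in this issue. The class name IcuDetector and the helper isClassAvailable are hypothetical; the only ICU-specific assumption is that ICU4J's BiDi implementation lives in com.ibm.icu.text.Bidi.

```java
// Hypothetical sketch: probe the classpath for ICU4J without a compile-time dependency.
final class IcuDetector {

    // Fully qualified name of ICU4J's BiDi class (assumed to be the class worth probing for).
    static final String ICU_BIDI_CLASS = "com.ibm.icu.text.Bidi";

    private IcuDetector() {
    }

    // Returns true if the named class can be loaded by this class's class loader.
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate and load the class, do not run its static initializers.
            Class.forName(className, false, IcuDetector.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is absent, or present but not loadable in this environment.
            return false;
        }
    }

    // Convenience wrapper for the ICU BiDi probe.
    static boolean isIcuBidiAvailable() {
        return isClassAvailable(ICU_BIDI_CLASS);
    }

    public static void main(String[] args) {
        if (isIcuBidiAvailable()) {
            System.out.println("ICU found: right-to-left text can be reordered via ICU BiDi.");
        } else {
            System.out.println("ICU not found: right-to-left text may come out \"backwards\".");
        }
    }
}
```

With a declared dependency on icu4j, a probe like this would always succeed, so the fallback branch would only matter to users who explicitly exclude the jar.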