add icu dependency
------------------

                 Key: TIKA-765
                 URL: https://issues.apache.org/jira/browse/TIKA-765
             Project: Tika
          Issue Type: Improvement
          Components: general
    Affects Versions: 0.10
            Reporter: Robert Muir


Spinoff of TIKA-713.

In PDFBox, reflection is used to detect whether ICU is available on the classpath.
If it is, PDFBox can use ICU's BiDi support
to properly extract right-to-left text. Otherwise, the text is returned 
"backwards". This is because the JDK does not
provide the functionality needed to do this inverse BiDi reordering / 
Arabic unshaping.
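
For illustration, a reflection-based check of this kind could look roughly like the
sketch below. This is not PDFBox's actual code: the class name IcuDetection and the
helper isClassAvailable are hypothetical, and the sketch assumes that probing for
ICU4J's com.ibm.icu.text.Bidi class is a reasonable stand-in for "ICU is available".

```java
public class IcuDetection {

    // Assumption: ICU4J's BiDi class is used as the marker for "ICU is present".
    // The class PDFBox actually probes for may differ.
    static final String ICU_BIDI_CLASS = "com.ibm.icu.text.Bidi";

    // Hypothetical helper: returns true if the named class can be loaded,
    // without initializing it and without a compile-time dependency on it.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, IcuDetection.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing (or could not be linked): treat as unavailable.
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassAvailable(ICU_BIDI_CLASS)) {
            System.out.println("ICU found: right-to-left text can be reordered");
        } else {
            System.out.println("ICU not found: right-to-left text may come out backwards");
        }
    }
}
```

With a real dependency on icu4j the check always succeeds, which is the point of
this issue: the fallback path would no longer be hit by default.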

It would be nice to properly depend on ICU, so that these languages work out 
of the box. We do this in Apache Solr's
Tika integration (contrib/extraction), for example.

Unlike the charset detection code from ICU that Tika "includes", BiDi 
support would be trickier to include directly, because it relies on
data files built from Unicode data (these change over time and would be a hassle to 
maintain).

Additionally, as a note: Tika has some forked charset code from ICU. Long term, 
it would be great to get those changes 
into ICU as well.

Finally, as an optimization, it's possible to reduce the icu4j jar size if needed 
with http://apps.icu-project.org/datacustom/,
but maybe as a start we could just depend upon the 'whole' ICU?


        
