it is an optional dependency of PDFBox. If ICU is available, then it
is capable of processing Arabic PDF files.
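
A quick way to confirm whether ICU4J is actually visible to the JVM is to try loading one of its classes by name. This is only a rough sketch: the class name IcuCheck and the helper isClassAvailable are made up for illustration, and I'm assuming that com.ibm.icu.text.Bidi (a class that ships in ICU4J) is a reasonable marker for "the ICU jar is on the classpath".

```java
public class IcuCheck {
    // Returns true if the named class can be found by this class's loader.
    // The class is not initialized; we only care whether it is present.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, IcuCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: com.ibm.icu.text.Bidi is used here only as a marker class.
        boolean icuPresent = isClassAvailable("com.ibm.icu.text.Bidi");
        System.out.println("ICU4J on classpath: " + icuPresent);
    }
}
```

Run it with the same classpath as the webapp; if it reports false, the jar is not being picked up.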

The problem is that Arabic "text" in PDF files is really a sequence of
glyphs (encoded in visual order), so it needs to be reordered and
'unshaped' using functionality that isn't in the JDK.
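
To make the symptom concrete, here is a small sketch of why a search fails against glyph data. The class and method names are hypothetical, and the "extracted" string is my own hand-built example of what visual-order presentation-form glyphs could look like, not real PDFBox output.

```java
public class VisualOrderDemo {
    // Logical order, base letters: kaf, teh, alef, beh (the word "kitab").
    static final String LOGICAL = "\u0643\u062A\u0627\u0628";

    // Hypothetical extractor output: presentation-form glyphs in the order
    // they are drawn left to right: beh isolated, alef final, teh medial,
    // kaf initial.
    static final String VISUAL_GLYPHS = "\uFE8F\uFE8E\uFE98\uFEDB";

    // Plain substring match, standing in for a naive search.
    static boolean matches(String extracted, String query) {
        return extracted.contains(query);
    }

    public static void main(String[] args) {
        // The glyph string shares no code points with the query: prints false.
        System.out.println(matches(VISUAL_GLYPHS, LOGICAL));
        // Properly reordered and unshaped text would match: prints true.
        System.out.println(matches(LOGICAL, LOGICAL));
    }
}
```

The glyph string differs from the query both in order and in code points, which is why the text has to be reordered and unshaped before indexing.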

If the size of the default ICU jar file is the issue here, we can
consider an alternative. The default ICU jar is very large because it
includes everything, but it can be customized to include only what is
needed: http://apps.icu-project.org/datacustom/

We did this in Lucene for the collation contrib, to shrink the jar
about 2MB: http://issues.apache.org/jira/browse/LUCENE-1867

For this use-case, it could be even smaller, as most of the huge size
of ICU comes from large CJK collation tables (needed for collation,
but not for this Arabic PDF extraction).

In reality I don't really like doing this, as it might confuse users
(e.g. people who want collation, too), and ICU is useful for other
things. But if that's what we have to do, we should do it so that
Arabic PDF files will work.

On Mon, Mar 8, 2010 at 11:53 PM, Lance Norskog <goks...@gmail.com> wrote:
> Is this a mistake in the Tika library collection in the Solr trunk?
>
> On Mon, Mar 8, 2010 at 5:15 PM, Robert Muir <rcm...@gmail.com> wrote:
>> I think the problem is that Solr does not include the ICU4J jar, so it
>> won't work with Arabic PDF files.
>>
>> Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your 
>> classpath.
>>
>> On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid  ABID <aeh.a...@gmail.com> wrote:
>>> Hi,
>>> When I post Arabic PDF files to Solr using a web form (to solr/update/extract),
>>> the extracted text has each word displayed in the reverse direction (instead of
>>> right to left).
>>> When I search against this text with (always) reversed keywords, I
>>> get results, but reversed.
>>> This problem doesn't occur when posting an MS Word document.
>>> I think the problem comes from Tika!
>>>
>>> Any clue ?
>>>
>>> --
>>> elsadek
>>> Software Engineer- J2EE / WEB / ESB MULE
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com
