Re: PDF extraction leads to reversed words

2010-03-16 Thread Abdelhamid ABID
Hi again, I just tried version 1.5-dev from Solr trunk. After applying the patch you provided and adding icu4j-3_8_1 to the classpath, the results are pretty good and different from before. Words and text are no longer reversed and are displayed correctly, except for some PDF files' text parts that

Re: PDF extraction leads to reversed words

2010-03-09 Thread Robert Muir
On Tue, Mar 9, 2010 at 9:44 AM, Abdelhamid ABID wrote: > I put ICU4J 4.2 in the lib of Solr, nothing changed, I'm trying now with > ICU4J 3.8 > Hello, what version of Solr are you using? I think you will need to use the trunk version. I created a patch for this issue that you can apply to trunk

Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
I tried a couple of times to get this patch, but the downloads fail; a file size mismatch or some similar error popped up. Is there another link? On 3/9/10, Dominique Bejean wrote: > > Hi, > > The problem comes from PDFBox ( > http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. However > Tik

Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
I put ICU4J 4.2 in the lib of Solr, nothing changed, I'm trying now with ICU4J 3.8 On 3/9/10, Robert Muir wrote: > > I think the problem is that Solr does not include the ICU4J jar, so it > won't work with Arabic PDF files. > > Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your

Re: PDF extraction leads to reversed words

2010-03-09 Thread Robert Muir
On Tue, Mar 9, 2010 at 10:10 AM, Abdelhamid ABID wrote: > Version 3.8 doesn't change anything either! > The patch (https://issues.apache.org/jira/browse/SOLR-1813) can only work on Solr trunk. It will not work with Solr 1.4. Solr 1.4 uses pdfbox-0.7.3.jar, which does not support Arabic. Solr trunk

Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
Version 3.8 doesn't change anything either! On 3/9/10, Robert Muir wrote: > > I think the problem is that Solr does not include the ICU4J jar, so it > won't work with Arabic PDF files. > > Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your > classpath. > > > On Mon, Mar 8, 2010 at 6

Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
I don't know about pdftotext. Is it pluggable into Solr, or do we need to hard-code the extraction step before Solr's turn? On 3/9/10, Dominique Bejean wrote: > > Hi, > > The problem comes from PDFBox ( > http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. However > Tika doesn't yet

Re: PDF extraction leads to reversed words

2010-03-09 Thread Robert Muir
This depends on what version of Solr you are using. The trunk version has a version of Tika that supports this. See SOLR-1813. On Tue, Mar 9, 2010 at 3:59 AM, Dominique Bejean wrote: > Hi, > > The problem comes from PDFBox > (http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. Howev

Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
I'm using version 1.4 of Solr. On 3/9/10, Robert Muir wrote: > > On Tue, Mar 9, 2010 at 9:44 AM, Abdelhamid ABID > wrote: > > I put ICU4J 4.2 in the lib of Solr, nothing changed, I'm trying now with > > ICU4J 3.8 > > > > > Hello, what version of Solr are you using? I think you will need to > use

Re: PDF extraction leads to reversed words

2010-03-09 Thread Robert Muir
Sorry for the link to the wrong JIRA issue; I was looking at another issue. It's here: https://issues.apache.org/jira/browse/SOLR-1813 Again, you will need to apply it to trunk, I think, as that's the only place I have tested it. -- Robert Muir rcm...@gmail.com

Re: PDF extraction leads to reversed words

2010-03-09 Thread Dominique Bejean
Hi, The problem comes from PDFBox (http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. However, Tika doesn't yet use this version of PDFBox. So for PDF text extraction, I don't use Tika but pdftotext. Dominique On 09/03/10 06:00, Robert Muir wrote: it is an optional depe

Re: PDF extraction leads to reversed words

2010-03-08 Thread Robert Muir
It is an optional dependency of PDFBox. If ICU is available, then PDFBox is capable of processing Arabic PDF files. The problem is that Arabic "text" in PDF files is really glyphs (encoded in visual order) and needs to be 'unshaped' with some stuff that isn't in the JDK. If the size of the default IC
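
To illustrate the "visual order" point, here is a rough Python sketch. It is not what PDFBox or ICU4J actually do (ICU implements the full Unicode bidirectional algorithm and proper unshaping). It only assumes a purely right-to-left string stored glyph by glyph, left to right as drawn on the page, using Arabic presentation forms, and it uses the standard library's NFKC normalization as a stand-in for unshaping. The function name `visual_to_logical` is hypothetical.

```python
import unicodedata


def visual_to_logical(visual: str) -> str:
    """Approximate recovery of logical-order text from pure RTL glyph runs.

    Assumption: the input holds only right-to-left glyphs, with no digits,
    Latin text, or combining marks, so a plain reversal is enough.
    """
    # The leftmost glyph on the page is the LAST letter of an RTL word,
    # so reverse the glyph sequence first.
    reordered = visual[::-1]
    # NFKC maps Arabic presentation forms (initial/medial/final/isolated
    # glyph variants) back to their base letters.
    return unicodedata.normalize("NFKC", reordered)


# Glyphs as drawn left to right: beh (final), teh (medial), kaf (initial).
glyphs = "\uFE90\uFE98\uFEDB"
logical = visual_to_logical(glyphs)
```

Without the reversal step, the letters come out in page order, which matches the reversed words described in this thread.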

Re: PDF extraction leads to reversed words

2010-03-08 Thread Lance Norskog
Is this a mistake in the Tika library collection in the Solr trunk? On Mon, Mar 8, 2010 at 5:15 PM, Robert Muir wrote: > I think the problem is that Solr does not include the ICU4J jar, so it > won't work with Arabic PDF files. > > Try putting ICU4J 3.8 (http://site.icu-project.org/download) in y

Re: PDF extraction leads to reversed words

2010-03-08 Thread Robert Muir
I think the problem is that Solr does not include the ICU4J jar, so it won't work with Arabic PDF files. Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your classpath. On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid ABID wrote: > Hi, > Posting arabic pdf files to Solr using a web fo

PDF extraction leads to reversed words

2010-03-08 Thread Abdelhamid ABID
Hi, When I post Arabic PDF files to Solr using a web form (to solr/update/extract), the text is extracted with each word displayed in the reverse direction (instead of right to left). When I search against these texts with (always) reversed keywords, I get results, but reversed. This problem doesn't occur
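
For readers who want to reproduce the report without a web form, the following Python sketch builds the kind of POST request described above. The base URL, the document id, and the helper name `build_extract_request` are assumptions for illustration; `literal.id` and `commit` are parameters of Solr's extracting request handler. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(pdf_bytes: bytes, doc_id: str,
                          base_url: str = "http://localhost:8983/solr") -> Request:
    """Build (but do not send) a POST to the extract handler.

    base_url and doc_id are placeholders; adjust them to the local setup.
    """
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(
        f"{base_url}/update/extract?{params}",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


# Placeholder bytes stand in for a real Arabic PDF file.
req = build_extract_request(b"%PDF-1.4 placeholder", "doc1")
```

Passing `req` to `urllib.request.urlopen` would send it to a running Solr instance; querying the indexed document afterwards shows whether the extracted words are reversed.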