Re: Tess4j API for TIKA OCR parser

2017-03-08 Thread Thejan Wijesinghe
> Sent: Tuesday, March 7, 2017 10:38 AM > To: Thejan Wijesinghe > Cc: dev@tika.apache.org > Subject: Re: Tess4j API for TIKA OCR parser > > Thanks Nick for the reply. > > Thejan, > > I am glad to know your progress. Rewriting the TesseractOCRParser would be > the ulti

RE: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thamme Gowda
apache.org] Sent: Tuesday, March 7, 2017 10:38 AM To: Thejan Wijesinghe Cc: dev@tika.apache.org Subject: Re: Tess4j API for TIKA OCR parser Thanks Nick for the reply. Thejan, I am glad to know your progress. Rewriting the TesseractOCRParser would be the ultimate goal if using Tess4j proves to be b

RE: Tess4j API for TIKA OCR parser

2017-03-07 Thread Allison, Timothy B.
Y and why not give the new tika-eval module a trial to evaluate the differences in output? :) -Original Message- From: Thamme Gowda [mailto:thammego...@apache.org] Sent: Tuesday, March 7, 2017 10:38 AM To: Thejan Wijesinghe Cc: dev@tika.apache.org Subject: Re: Tess4j API for TIKA OCR

RE: Tess4j API for TIKA OCR parser

2017-03-07 Thread Allison, Timothy B.
+1 Same experience, of same vintage. :) -Original Message- From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com] Sent: Tuesday, March 7, 2017 10:34 AM To: dev@tika.apache.org Subject: Re: Tess4j API for TIKA OCR parser Hi Thejan, Before the first version of TesseractOcrParser was

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Luís Filipe Nassif
Hi Thejan, Before the first version of TesseractOcrParser was committed, I tried to use Tess4j; that was 4 years ago. Unfortunately, at that time I ran into some problems like permanent hangs with tesseract/Tess4j and, even worse, JVM crashes because of bugs in native code (pointers to crazy addresses

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thamme Gowda
Thanks Nick for the reply. Thejan, I am glad to know your progress. Rewriting the TesseractOCRParser would be the ultimate goal if using Tess4j proves to be better than the way it is done currently. But, for now, please consider these: + Rename your class to *Tess4jOCRParser*. It is a new parser

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thejan Wijesinghe
Hi Nick, I thought the same thing. I will try to keep the public method signatures unchanged and will send updates on my progress. On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch wrote: > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote: > >> I have already used the Tess4j API to rewrite the TesseractOCRP

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Nick Burch
On Tue, 7 Mar 2017, Thejan Wijesinghe wrote: I have already used the Tess4j API to rewrite the TesseractOCRParser class. Although it successfully extracts content from most of the file types, it fails some particular unit tests in the TesseractOCRParserTest class. I can solve that. However, I want

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thejan Wijesinghe
Hi Thamme, I made minimal changes to the TesseractOCRParser class. I basically changed the private doOCR() method. But the existing unit tests fail even though the content and metadata get extracted. Could you provide me with any guidance on resolving these errors by running the test cases? I

Re: Tess4j API for TIKA OCR parser

2017-03-06 Thread Thejan Wijesinghe
Thamme, I have already used the Tess4j API to rewrite the TesseractOCRParser class. Although it successfully extracts content from most of the file types, it fails some particular unit tests in the TesseractOCRParserTest class. I can solve that. However, I want to know whether I can rewrite the enti

Re: Tess4j API for TIKA OCR parser

2017-03-05 Thread Thamme Gowda
Thejan, Welcome to the world of mysteries. I am unable to explain why you are facing it since I am unable to reproduce it. Try out a few other images; maybe the image you have chosen is corrupt, and maybe there is an exception thrown and silently swallowed in code. I suggest you do this: Please

Tess4j API for TIKA OCR parser

2017-03-04 Thread Thejan Wijesinghe
Hi Thamme, Yes, I am using Ubuntu :) and I had both ImageMagick and Tesseract installed on my system using apt-get. Since I wasn't sure whether this is a problem with the APT software packages, I built both ImageMagick and Tesseract from source. I also double-checked the availability of Tessera