+1 Same experience, of same vintage. :)
-----Original Message-----
From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
Sent: Tuesday, March 7, 2017 10:34 AM
To: dev@tika.apache.org
Subject: Re: Tess4j API for TIKA OCR parser

Hi Thejan,

Before the first version of TesseractOCRParser was committed, I tried to use Tess4j; that was 4 years ago. Unfortunately, at that time I ran into problems such as permanent hangs in tesseract/Tess4j and, even worse, JVM crashes caused by bugs in the native code (pointers to invalid addresses) when processing corrupted images. So I changed strategy and used Runtime.exec to run tesseract out of process, which got rid of those JVM crashes.

That was a long time ago, and those problems may be gone with current tesseract and Tess4j. For now, though, I recommend committing your changes as a new parser instead of changing the default TesseractOCRParser, until the new code has been tested against millions of images from the wild with tika-batch and has proven stable enough to be the default OCR parser of Tika.

Best,
Luis

On Mar 7, 2017 9:58 AM, "Thejan Wijesinghe" <thejan.k.wijesin...@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method
> signatures unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apa...@gagravarr.org> wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already used the Tess4j API to rewrite the TesseractOCRParser
> >> class. Although it successfully extracts content from most of the
> >> file types, it fails some particular unit tests in the
> >> TesseractOCRParserTest class. I can solve that. However, I want to
> >> know whether I can rewrite the entire TesseractOCRParser class from
> >> the ground up. If I do that, there will be many broken links in the
> >> internals of TIKA because, as I witnessed, most of the classes use
> >> the TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way,
> > other callers to the class will be unaffected by your re-write of
> > the internal logic
> >
> > Nick
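
For readers of the archive, the out-of-process strategy described in Luis's message (run tesseract as a child process so that a native crash or hang cannot take the JVM down with it) might look roughly like the sketch below. This is only an illustration, not the code in Tika's TesseractOCRParser. The class name OutOfProcessOcr, its two methods and the timeout parameter are hypothetical, it assumes a tesseract binary on the PATH, and it uses ProcessBuilder in place of a direct Runtime.exec call.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of Tika or Tess4j.
final class OutOfProcessOcr {

    // Builds the command line, assuming `tesseract` is on the PATH.
    // Tesseract writes the recognized text to <outputBase>.txt.
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs the command in a child process. A crash in native code only kills
    // the child, and a hang is bounded by the timeout.
    // Returns true only if the process exits with code 0 before the timeout.
    static boolean run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout and discard both, so the child can never
        // block on a full pipe buffer (Redirect.DISCARD needs Java 9+).
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        Process process = builder.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // Treat a timeout as a hang: kill the child and report failure.
            process.destroyForcibly();
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

A caller would do something like OutOfProcessOcr.run(OutOfProcessOcr.buildCommand(image, outputBase, "eng"), 120) and read outputBase.txt only when it returns true. The trade-off against an in-process binding such as Tess4j is the cost of starting a process per image in exchange for isolation from native crashes.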