Re: Tess4j API for TIKA OCR parser

2017-03-08 Thread Thejan Wijesinghe
Hi everyone!

Luis, it is my pleasure to meet an original creator of a major component of
TIKA. I should say that it is a very creative and reliable workaround. :) I
still have many unclear areas in TIKA parsers. Perhaps you can help me
clarify some of them.

Even now, there must be some unreliability in Tess4j, because I also got
some JVM crashes when I first tried to test it, although I got
through them. I can't say exactly whether this is going to be a perfect
implementation without testing it properly. However, I'll try my best to
make this work. I have created a Jira issue for this [1]. I invite you all
to help me make this a success.

[1] https://issues.apache.org/jira/browse/TIKA-2293

On Tue, Mar 7, 2017 at 9:42 PM, Thamme Gowda <thammego...@apache.org> wrote:

> yes, we can try tika-eval to see the difference. Perfect!
>
> Best,
> TG
>
> On Mar 7, 2017 7:44 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Y and why not give the new tika-eval module a trial to evaluate the
> differences in output?  :)
>
> -Original Message-
> From: Thamme Gowda [mailto:thammego...@apache.org]
> Sent: Tuesday, March 7, 2017 10:38 AM
> To: Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
> Cc: dev@tika.apache.org
> Subject: Re: Tess4j API for TIKA OCR parser
>
> Thanks Nick for the reply.
>
> Thejan,
>
> I am glad to know your progress. Rewriting the TesseractOCRParser would be
> the ultimate goal if using Tess4j proves to be better than the way it is
> done currently.
>
> But, for now, please consider these:
> + Rename your class to *Tess4jOCRParser*. It is a new parser providing the
> same functionality as *TesseractOCRParser*
> + Keep the *TesseractOCRParser* intact. You can use it as your reference to
> understand features of OCR parser to support.
> + Benchmark *TesseractOCRParser* and *Tess4jOCRParser* with respect to
> performance and stability. You can take a set of 100 images and compare
> how much time each of them took. Please share those results here.
>
>
> Based on the benchmark, we can decide whether to replace old one with new
> one. Because TesseractOCRParser is used along with many other parsers like
> JPEG/PDF etc any improvements you make with Tess4jOCRParser will have a
> huge effect!
>
> P.S.
> + Please don't edit any test cases. You may add new ones, though!
> + Could you please create a Jira Issue to track this. Sorry, I must have
> said this early.
>
> Best,
> TG
>
>
> On Tue, Mar 7, 2017 at 4:58 AM, Thejan Wijesinghe <
> thejan.k.wijesin...@gmail.com> wrote:
>
> > Hi Nick,
> >
> > I thought the same thing. I will try to keep the public method
> > signatures unchanged and will send updates on my progress.
> >
> > On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apa...@gagravarr.org> wrote:
> >
> > > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> > >
> > >> I have already use the Tess4j API to rewrite the TesseractOCRParser
> > >> class, Although It successfully extracts content from most of the file
> > >> types, it fails some particular unit tests in the TesseractOCRParserTest
> > >> class. I can solve that. However, I want to know whether I can
> > >> rewrite the entire TesseractOCRParser class from the ground up, but
> > >> if I do that there will be many broken links in the internals of
> > >> TIKA because as I witnessed, most of
> > >> the classes use TesseractOCRParser class indirectly.
> > >>
> > >
> > > If you can, try to keep the public methods unchanged. That way,
> > > other callers to the class will be unaffected by your re-write of
> > > the internal logic
> > >
> > > Nick
> > >
> >
>
>
>


RE: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thamme Gowda
yes, we can try tika-eval to see the difference. Perfect!

Best,
TG

On Mar 7, 2017 7:44 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Y and why not give the new tika-eval module a trial to evaluate the
differences in output?  :)

-Original Message-
From: Thamme Gowda [mailto:thammego...@apache.org]
Sent: Tuesday, March 7, 2017 10:38 AM
To: Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
Cc: dev@tika.apache.org
Subject: Re: Tess4j API for TIKA OCR parser

Thanks Nick for the reply.

Thejan,

I am glad to know your progress. Rewriting the TesseractOCRParser would be
the ultimate goal if using Tess4j proves to be better than the way it is
done currently.

But, for now, please consider these:
+ Rename your class to *Tess4jOCRParser*. It is a new parser providing the
same functionality as *TesseractOCRParser*
+ Keep the *TesseractOCRParser* intact. You can use it as your reference to
understand features of OCR parser to support.
+ Benchmark *TesseractOCRParser* and *Tess4jOCRParser* with respect to
performance and stability. You can take a set of 100 images and compare how
much time each of them took. Please share those results here.


Based on the benchmark, we can decide whether to replace old one with new
one. Because TesseractOCRParser is used along with many other parsers like
JPEG/PDF etc any improvements you make with Tess4jOCRParser will have a
huge effect!

P.S.
+ Please don't edit any test cases. You may add new ones, though!
+ Could you please create a Jira Issue to track this. Sorry, I must have
said this early.

Best,
TG


On Tue, Mar 7, 2017 at 4:58 AM, Thejan Wijesinghe <
thejan.k.wijesin...@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method
> signatures unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apa...@gagravarr.org> wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already use the Tess4j API to rewrite the TesseractOCRParser
> >> class, Although It successfully extracts content from most of the file
> >> types, it fails some particular unit tests in the TesseractOCRParserTest
> >> class. I can solve that. However, I want to know whether I can
> >> rewrite the entire TesseractOCRParser class from the ground up, but
> >> if I do that there will be many broken links in the internals of
> >> TIKA because as I witnessed, most of
> >> the classes use TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way,
> > other callers to the class will be unaffected by your re-write of
> > the internal logic
> >
> > Nick
> >
>


RE: Tess4j API for TIKA OCR parser

2017-03-07 Thread Allison, Timothy B.
Y and why not give the new tika-eval module a trial to evaluate the differences 
in output?  :)

-Original Message-
From: Thamme Gowda [mailto:thammego...@apache.org] 
Sent: Tuesday, March 7, 2017 10:38 AM
To: Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
Cc: dev@tika.apache.org
Subject: Re: Tess4j API for TIKA OCR parser

Thanks Nick for the reply.

Thejan,

I am glad to know your progress. Rewriting the TesseractOCRParser would be the 
ultimate goal if using Tess4j proves to be better than the way it is done 
currently.

But, for now, please consider these:
+ Rename your class to *Tess4jOCRParser*. It is a new parser providing the
same functionality as *TesseractOCRParser*
+ Keep the *TesseractOCRParser* intact. You can use it as your reference to
understand features of OCR parser to support.
+ Benchmark *TesseractOCRParser* and *Tess4jOCRParser* with respect to
performance and stability. You can take a set of 100 images and compare how 
much time each of them took. Please share those results here.


Based on the benchmark, we can decide whether to replace old one with new one. 
Because TesseractOCRParser is used along with many other parsers like JPEG/PDF 
etc any improvements you make with Tess4jOCRParser will have a huge effect!

P.S.
+ Please don't edit any test cases. You may add new ones, though!
+ Could you please create a Jira Issue to track this. Sorry, I must have
said this early.

Best,
TG


On Tue, Mar 7, 2017 at 4:58 AM, Thejan Wijesinghe < 
thejan.k.wijesin...@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method 
> signatures unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apa...@gagravarr.org> wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already use the Tess4j API to rewrite the TesseractOCRParser
> >> class, Although It successfully extracts content from most of the file
> >> types, it fails some particular unit tests in the TesseractOCRParserTest
> >> class. I can solve that. However, I want to know whether I can
> >> rewrite the entire TesseractOCRParser class from the ground up, but
> >> if I do that there will be many broken links in the internals of
> >> TIKA because as I witnessed, most of
> >> the classes use TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, 
> > other callers to the class will be unaffected by your re-write of 
> > the internal logic
> >
> > Nick
> >
>


RE: Tess4j API for TIKA OCR parser

2017-03-07 Thread Allison, Timothy B.
+1

Same experience, of same vintage. :)

-Original Message-
From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com] 
Sent: Tuesday, March 7, 2017 10:34 AM
To: dev@tika.apache.org
Subject: Re: Tess4j API for TIKA OCR parser

Hi Thejan,

Before the first version of TesseractOcrParser was commited I tried to use 
Tess4j, that was 4 years ago. Unfortunatelly that time I run into some problems 
like permanent hangs with tesseract/Tess4j and, even worse, Jvm crashes because 
of bugs into native code (pointers to crazy adresses) when processing corrupted 
images. So I changed the strategy and take the Runtime.exec way to execute 
tesseract out of process to get rid of those Jvm crashes.

That was a long time ago, maybe those problems are gone away with current 
tesseract and Tess4j. But I recommend for now commiting your changes in a new 
parser instead of changing the default TesseractOcrParser, until the new code 
is tested against millions of images from the wild with tika-batch so it can be 
proved it is stable enough to be the default Ocr parser of Tika.

Best,
Luis

Em 7 de mar de 2017 9:58 AM, "Thejan Wijesinghe" < 
thejan.k.wijesin...@gmail.com> escreveu:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method 
> signatures unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apa...@gagravarr.org> wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already use the Tess4j API to rewrite the TesseractOCRParser
> >> class, Although It successfully extracts content from most of the file
> >> types, it fails some particular unit tests in the TesseractOCRParserTest
> >> class. I can solve that. However, I want to know whether I can
> >> rewrite the entire TesseractOCRParser class from the ground up, but
> >> if I do that there will be many broken links in the internals of
> >> TIKA because as I witnessed, most of
> >> the classes use TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, 
> > other callers to the class will be unaffected by your re-write of 
> > the internal logic
> >
> > Nick
> >
>


Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Luís Filipe Nassif
Hi Thejan,

Before the first version of TesseractOcrParser was committed, I tried to use
Tess4j; that was 4 years ago. Unfortunately, at that time I ran into some
problems like permanent hangs with tesseract/Tess4j and, even worse, JVM
crashes because of bugs in native code (pointers to crazy addresses) when
processing corrupted images. So I changed the strategy and took the
Runtime.exec way to execute tesseract out of process to get rid of those
JVM crashes.
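
To illustrate the out-of-process idea, here is a minimal sketch; it is not
the actual TesseractOcrParser code. The class name OutOfProcessRunner, the
Result type and the timeout handling are assumptions made for illustration.
It uses ProcessBuilder (the newer counterpart of Runtime.exec), so a hang or
a native crash in the child process cannot take the JVM down with it.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs an external command in a child process with a timeout. */
final class OutOfProcessRunner {

    /** Outcome of one run: the exit code, or timedOut == true if the child was killed. */
    static final class Result {
        final boolean timedOut;
        final int exitCode;

        Result(boolean timedOut, int exitCode) {
            this.timedOut = timedOut;
            this.exitCode = exitCode;
        }
    }

    static Result run(List<String> command, long timeoutSeconds) {
        try {
            // Send the child's output to a file so it never blocks on a full pipe.
            File log = File.createTempFile("child-", ".log");
            log.deleteOnExit();
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .redirectOutput(log)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // A hung child is killed; the calling JVM keeps running.
                process.destroyForcibly();
                return new Result(true, -1);
            }
            return new Result(false, process.exitValue());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }
}
```

A call such as OutOfProcessRunner.run(Arrays.asList("tesseract", "test.jpg",
"out"), 120) would then run the tesseract command line out of process; the
120-second limit is an arbitrary example value.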

That was a long time ago; maybe those problems have gone away with current
tesseract and Tess4j. But for now I recommend committing your changes in a
new parser instead of changing the default TesseractOcrParser, until the
new code is tested against millions of images from the wild with tika-batch
so it can be proven stable enough to be the default OCR parser of
Tika.

Best,
Luis

Em 7 de mar de 2017 9:58 AM, "Thejan Wijesinghe" <
thejan.k.wijesin...@gmail.com> escreveu:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method signatures
> unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch  wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already use the Tess4j API to rewrite the TesseractOCRParser class,
> >> Although It successfully extracts content from most of the file types, it
> >> fails some particular unit tests in the TesseractOCRParserTest class. I can
> >> solve that. However, I want to know whether I can rewrite the entire
> >> TesseractOCRParser class from the ground up, but if I do that there will be
> >> many broken links in the internals of TIKA because as I witnessed, most of
> >> the classes use TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, other
> > callers to the class will be unaffected by your re-write of the internal
> > logic
> >
> > Nick
> >
>


Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thamme Gowda
Thanks Nick for the reply.

Thejan,

I am glad to know your progress. Rewriting the TesseractOCRParser would be
the ultimate goal if using Tess4j proves to be better than the way it is
done currently.

But, for now, please consider these:
+ Rename your class to *Tess4jOCRParser*. It is a new parser providing the
same functionality as *TesseractOCRParser*
+ Keep the *TesseractOCRParser* intact. You can use it as your reference to
understand features of OCR parser to support.
+ Benchmark *TesseractOCRParser* and *Tess4jOCRParser* with respect to
performance and stability. You can take a set of 100 images and compare how
much time each of them took. Please share those results here.
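
The benchmark suggested above could be sketched roughly as follows. This is
only an illustration under stated assumptions: OcrBenchmark, Summary and the
Function-based OCR hook are hypothetical names, not Tika APIs, and each
parser under test would have to be wrapped in a function from image path to
extracted text.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical harness: times one OCR implementation over a list of image paths. */
final class OcrBenchmark {

    /** Aggregate numbers for one implementation. */
    static final class Summary {
        final String name;
        final int succeeded;
        final int failed;
        final long totalNanos;

        Summary(String name, int succeeded, int failed, long totalNanos) {
            this.name = name;
            this.succeeded = succeeded;
            this.failed = failed;
            this.totalNanos = totalNanos;
        }

        /** Mean wall-clock time per image in milliseconds (0 if no images were run). */
        double meanMillis() {
            int runs = succeeded + failed;
            return runs == 0 ? 0.0 : totalNanos / 1e6 / runs;
        }
    }

    static Summary run(String name, List<String> imagePaths, Function<String, String> ocr) {
        int succeeded = 0;
        int failed = 0;
        long totalNanos = 0;
        for (String path : imagePaths) {
            long start = System.nanoTime();
            try {
                ocr.apply(path);
                succeeded++;
            } catch (RuntimeException e) {
                // Count failures instead of aborting: stability is part of the comparison.
                failed++;
            } finally {
                totalNanos += System.nanoTime() - start;
            }
        }
        return new Summary(name, succeeded, failed, totalNanos);
    }
}
```

Running it once per parser over the same image list gives two summaries to
compare. Note that it only catches Java exceptions; a native crash in an
in-process library would still end the whole benchmark run.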


Based on the benchmark, we can decide whether to replace the old one with the
new one. Because TesseractOCRParser is used along with many other parsers, such
as JPEG and PDF, any improvements you make with Tess4jOCRParser will have a
huge effect!

P.S.
+ Please don't edit any test cases. You may add new ones, though!
+ Could you please create a Jira issue to track this? Sorry, I should have
said this earlier.

Best,
TG


On Tue, Mar 7, 2017 at 4:58 AM, Thejan Wijesinghe <
thejan.k.wijesin...@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method signatures
> unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch  wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already use the Tess4j API to rewrite the TesseractOCRParser class,
> >> Although It successfully extracts content from most of the file types, it
> >> fails some particular unit tests in the TesseractOCRParserTest class. I can
> >> solve that. However, I want to know whether I can rewrite the entire
> >> TesseractOCRParser class from the ground up, but if I do that there will be
> >> many broken links in the internals of TIKA because as I witnessed, most of
> >> the classes use TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, other
> > callers to the class will be unaffected by your re-write of the internal
> > logic
> >
> > Nick
> >
>


Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Thejan Wijesinghe
Hi Nick,

I thought the same thing. I will try to keep the public method signatures
unchanged and will send updates on my progress.

On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch  wrote:

> On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
>
>> I have already use the Tess4j API to rewrite the TesseractOCRParser class,
>> Although It successfully extracts content from most of the file types, it
>> fails some particular unit tests in the TesseractOCRParserTest class. I can
>> solve that. However, I want to know whether I can rewrite the entire
>> TesseractOCRParser class from the ground up, but if I do that there will be
>> many broken links in the internals of TIKA because as I witnessed, most of
>> the classes use TesseractOCRParser class indirectly.
>>
>
> If you can, try to keep the public methods unchanged. That way, other
> callers to the class will be unaffected by your re-write of the internal
> logic
>
> Nick
>


Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Nick Burch

On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:

> I have already use the Tess4j API to rewrite the TesseractOCRParser class,
> Although It successfully extracts content from most of the file types, it
> fails some particular unit tests in the TesseractOCRParserTest class. I can
> solve that. However, I want to know whether I can rewrite the entire
> TesseractOCRParser class from the ground up, but if I do that there will be
> many broken links in the internals of TIKA because as I witnessed, most of
> the classes use TesseractOCRParser class indirectly.


If you can, try to keep the public methods unchanged. That way, other 
callers to the class will be unaffected by your re-write of the internal 
logic
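
As a rough sketch of that idea (a simplified stand-in, not Tika's real
TesseractOCRParser): the public constructor and method keep their
signatures, while the internal logic sits behind a private field that can be
swapped. OcrParserFacade, Backend and extractText are hypothetical names
chosen only for this illustration.

```java
import java.util.Objects;

/** Hypothetical stand-in for a parser whose public surface must not change. */
final class OcrParserFacade {

    /** Internal strategy; callers of the public methods never see this type. */
    interface Backend {
        String recognize(String imagePath);
    }

    private final Backend backend;

    /** Public constructor kept as-is; it picks whatever the current default backend is. */
    public OcrParserFacade() {
        this(path -> ""); // placeholder default; a real one might call the CLI or Tess4j
    }

    /** Package-private: lets a rewrite or a test plug in a different backend. */
    OcrParserFacade(Backend backend) {
        this.backend = Objects.requireNonNull(backend, "backend");
    }

    /** Public entry point whose signature stays fixed across rewrites. */
    public String extractText(String imagePath) {
        return backend.recognize(imagePath).trim();
    }
}
```

Existing callers keep using the public constructor and extractText, so
replacing the backend does not require any change on their side.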


Nick


Re: Tess4j API for TIKA OCR parser

2017-03-06 Thread Thejan Wijesinghe
Thamme,
I have already used the Tess4j API to rewrite the TesseractOCRParser class.
Although it successfully extracts content from most file types, it
fails some particular unit tests in the TesseractOCRParserTest class. I can
solve that. However, I want to know whether I can rewrite the entire
TesseractOCRParser class from the ground up. If I do that, there will be
many broken links in the internals of TIKA because, as I have seen, most of
the classes use the TesseractOCRParser class indirectly.

On Mon, Mar 6, 2017 at 12:37 AM, Thamme Gowda 
wrote:

> Thejan,
>
> Welcome to the world of mysteries. I am unable to explain why you are
> facing it since I am unable to reproduce it.
>
> Try out few other images, may be the image you have chosen is corrupt and
> maybe there is an exception thrown and silently swallowed in code.
>
> I suggest you do this:
> Please use an IDE like IntelliJ/Eclipse and use a debugger to understand
> the call stack inside TesseractOCRParser. It is indeed a nice way to get to
> the internals of Tika :-)
>
>
> Best,
> TG
>
>
> *--*
> *Thamme Gowda*
> TG | @thammegowda 
> ~Sent via somebody's Webmail server!
>
> On Sat, Mar 4, 2017 at 9:04 AM, Thejan Wijesinghe <
> thejan.k.wijesin...@gmail.com> wrote:
>
> >
> > Hi Thamme,
> >
> > Yes. I am using Ubuntu :) and I had ImageMagick and Tesseract both
> > installed in my system using apt-get. Since, I wasn't sure whether this is
> > a problem with the APT software packages, I built both ImageMagick and
> > Tesseract from sources.
> >
> > I also double checked the availability of Tesseract and ImageMagick by
> > typing CLI commands that you suggested and the below commands as well,
> >
> > convert test.jpg -resize 64x64 resized_test.jpg
> >
> > tesseract test.jpg out
> >
> > and they worked.
> >
> > I can't find a exact reason why I am not getting metadata but when I used
> > the AutoDetectParser class instead of the TesseractOCRParser class, I can
> > extract both content and metadata.
> >
> > p.s. I will put updating the wiki OCR page in my TODO list :)
> >
>



-- 

Thejan Wijesinghe

Department of Computer Science and Engineering

University of Moratuwa

+94778097907
