OCR Sensitivity

2011-03-03 Thread Richard
Hi,

I am really new to Tesseract OCR 3.0, which I am using as a static DLL within
a Windows environment, and I have the majority of what I want working, but...

Is there a way to increase the sensitivity of the OCR engine?

For instance, I am passing JPG images that purely have images of
registration plates (ANPR essentially) but the OCR engine reads

1 as I
0 as U
8 as S

I have tried altering the params on TessBaseAPI::Init, but this simply
crashes if set to anything other than OEM_DEFAULT.

I have also set up a char whitelist to limit output to certain chars/digits.

Any help would be appreciated.

-- 
You received this message because you are subscribed to the Google Groups 
tesseract-ocr group.
To post to this group, send email to tesseract-ocr@googlegroups.com.
To unsubscribe from this group, send email to 
tesseract-ocr+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/tesseract-ocr?hl=en.



Re: Especial Characteres

2011-03-03 Thread Sriranga(78yrsold)
Dmitry,
I successfully generated traineddata (Kannada) files from the old
2.xx data files last year. There is a discussion by spohorsky in the forum
on how to do it.
sriranga(78)
♫

On Thu, Mar 3, 2011 at 5:42 PM, Dmitry Silaev daemons2...@gmail.com wrote:

 Manuel,

 It's quite an interesting question although it may seem to be an
 ordinary newbie-like one.

 I was always wondering if 2.xx files can be used with version 3.xx.
 The wiki states that the files in the traineddata file are different
 from the list used prior to 3.00, and will most likely change,
 possibly dramatically in future revisions.

 I have no time to investigate it in the code, so I decided to act
 rather than to think. After some tinkering with all those files I
 slipped the resulting por.traineddata into the Tesseract-based algorithm
 I'm currently working on, and - guess what? - it worked! ))

 I must say it was tested only with a couple of *very simple* images
 and also it absolutely lacks any dictionary-related data. And my test
 images don't contain these specific Portuguese letters with
 diacritics. So in fact this file may perform poorly. Please test and
 report your results. The file is in the attachment.

 It was not difficult at all but also not so straight-forward to make
 this training data file, so probably this process deserves a separate
 article and later I'd like to post it in my blog.

 Warm regards,
 Dmitry Silaev





 On Wed, Mar 2, 2011 at 8:40 PM, manuelfhp manuel...@gmail.com wrote:
  Hello list,
  I can't find a solution for special chars.
 
  I installed Tesseract 3 on my Mac OS X 10.6.
  It is running very well.
 
  But I'm having problems with the charset.
  I need Tesseract working with Brazilian Portuguese (ISO8859-1).
 
  I installed the Portuguese dictionary, but it is not working with special
  chars like  Ç Ã É é   (ISO8859-1).
  Is there any solution?
 
  There is an old dictionary specifically for Brazilian Portuguese in version
  2.0.4. Is it possible to use it in version 3? How?
 
 



Re: Especial Characteres

2011-03-03 Thread Dmitry Silaev
Sriranga,

Thanks for letting me know. You are the first one then, and I have
reinvented the wheel ))
However, an article might still be of use in place of a verbose forum
discussion...
Maybe you'd like to write it then?

Warm regards,
Dmitry Silaev








Re: OCR Sensitivity

2011-03-03 Thread patrickq
The answer lies within your own question! Since you expect only
digits, simply accept these letters as the equivalent digit by
replacing them.

Patrick
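
For readers who want to try this suggestion, here is a minimal sketch in Python. It assumes the OCR result is already available as a plain string; the mapping table is built only from the three misreads reported in this thread, and the function name is made up for illustration.

```python
# Hypothetical mapping, taken from the misreads reported in this thread.
LETTER_TO_DIGIT = {"I": "1", "U": "0", "S": "8"}


def letters_to_digits(text: str) -> str:
    """Replace every letter that has a known digit look-alike."""
    return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in text)


print(letters_to_digits("IU8S"))  # prints 1088
```

This is only safe when every character is expected to be a digit; applied to text that mixes letters and digits it will also rewrite genuine letters.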




Re: OCR Sensitivity

2011-03-03 Thread Richard
Yes, that is possible, and I am scanning the output now to do that,
but it's not always possible to know the format of the plate,
and just changing random chars may (and does) give strange results.
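
One way to limit those strange results, when a plate format is known, is to apply the replacement only at positions where a digit is expected. The sketch below is an illustration under that assumption; the template notation ('9' for a digit slot, 'A' for a letter slot), the example format and the function name are all hypothetical and do not come from Tesseract.

```python
# Misreads reported in this thread, mapped back to the intended digit.
LETTER_TO_DIGIT = {"I": "1", "U": "0", "S": "8"}


def fix_digit_slots(text: str, template: str) -> str:
    """Apply letter-to-digit fixes only where the template expects a digit.

    In the (hypothetical) template, '9' marks a digit slot and 'A' a
    letter slot. Text whose length does not match is returned unchanged.
    """
    if len(text) != len(template):
        return text
    return "".join(
        LETTER_TO_DIGIT.get(ch, ch) if slot == "9" else ch
        for ch, slot in zip(text, template)
    )


print(fix_digit_slots("ABIUSIS", "AA99AAA"))  # prints AB10SIS
```

When the format is unknown, no template matches and the text is left alone, which is the case described above where blind replacement does more harm than good.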




Re: Especial Characteres

2011-03-03 Thread Dmitry Silaev
Sriranga,

Actually I don't understand why one needs to refer to the forum
discussion you've just mentioned above, as I managed to build this
traineddata file without writing a single line of code and even
without a compiler, say Visual C++...

The value I can add is in that any user inexperienced in programming
can make this traineddata file himself ))

Warm regards,
Dmitry Silaev





On Thu, Mar 3, 2011 at 5:08 PM, Sriranga(78yrsold)
withblessi...@gmail.com wrote:
 Dmitry,
 No, I am NOT the first to have invented this; the credit actually goes to
 spohor...@sjm.com, who helped me a great deal, including creating a vcproj
 for the combined traineddata on Windows. I am very thankful to him for the
 help and guidance he has given from time to time. Without his help I would
 not have succeeded in generating a traineddata file out of the old data
 files. All credit should go to Steve. Steve has already explained in detail
 how to do it in the forum discussions that are available.
 -sriranga(78yrs)


Re: Especial Characteres

2011-03-03 Thread Sriranga(78yrsold)
Dmitry,
I fully agree with your points. Newbies like me, who are not programmers,
cannot make a traineddata file without the valuable guidance of people like
you. Being an expert programmer/developer, you succeeded in building the
traineddata file very easily. As such, only newbies need to refer to the
forum discussion to find a solution on any of these points.
With warmest regards,
-sriranga(78yrs)


Re: Especial Characteres

2011-03-03 Thread manuel...@gmail.com
Hi Dmitry,

I just replaced my file with your por.traineddata,
but I'm getting an error:

manuel$ tesseract input.tiff output -l por
actual_tessdata_num_entries_ = TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
Segmentation fault

It seems worthwhile to convert old files from 2.0x to 3, because there
isn't a Brazilian Portuguese for version 3, just Portuguese.
At least the dictionary por.traineddata is working correctly in version 3.
The special chars are being recognized by Tesseract 3.

regards,
Manuel Pardo







Dictionnary issues

2011-03-03 Thread Jonathan
Hi all,

I'm working on a project that involves detecting text in street-level
images. I have already written code that extracts the text
areas from my images.

I work with Tesseract 3.0 and, first of all, I tried running
Tesseract on full images (1080x1920), just to see the results I could
get. Obviously, because of trees, fences, walls, etc., there are a lot
of false recognitions from Tesseract, but some texts are also
recognized well. So, to improve the recognition, I give Tesseract only
the text areas segmented by my code, hoping that recognition will be
good even though the scenes are very difficult.

I know that when the image is too complicated (not enough contrast
between text and background, many shadows...) detection is really
difficult and may not give good results, however even in some
supposedly very simple cases like this one (black text on white
background with a slight blur):
http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-1_0001312_box_0006.png?gda=4bbSHWQq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobhowICgXY9oBdZxkhoGyvOFXq71KIRN2DRDZ98DIdT53NzgFmQudIVZfn2evkHEao
Tesseract recognizes:
http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-1_0001312_box_0006_boxes.png?gda=KSimu2oq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobOWEMBOZDXT0mTiVSy6rk8qwfOToRrNOWJtPPKSAn4D797daDQaep90o7AOpSKHW0

I do not understand this result. I run Tesseract with the
option -l fra for the French language. The word Cloison should
exist in the French dictionary, so I do not understand why
Tesseract recognizes a 0 instead of an o.

Does the dictionary actually play a role in the recognition? Because
it is clear that the 0 and the o have the same shape-based confidence
value, but the dictionary should also help choose o rather than
0, am I wrong?
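
Independently of what Tesseract does internally, the dictionary preference described above can be approximated as a post-processing step. The following sketch is not Tesseract's algorithm; it assumes a hypothetical confusion table and a lowercase word list supplied by the caller, and it simply keeps the variants of an OCR word that appear in that list.

```python
from itertools import product

# Hypothetical confusion table: an OCR character and what it may stand for.
CONFUSABLE = {"0": ("0", "o")}


def dictionary_candidates(word, lexicon):
    """Return every confusable variant of `word` found in `lexicon`.

    `lexicon` is assumed to hold lowercase words.
    """
    options = [CONFUSABLE.get(ch, (ch,)) for ch in word]
    variants = ("".join(chars) for chars in product(*options))
    return [v for v in variants if v.lower() in lexicon]


print(dictionary_candidates("Cl0ison", {"cloison"}))  # prints ['Cloison']
```

The number of variants grows quickly with the number of confusable characters in a word, so the table is best kept small.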

In addition, Tesseract does not seem to take into account the scale
between two adjacent boxes? It recognizes ll for the segmented
quotation mark (see images above) while it recognizes correctly 'i'
just before ll.

I also tried to add lines to the file fra.unicharambigs to correct
false recognition of the 'n' as l'I (line in unicharambigs: 3 l'I 1
n 0) and the 'm' as ITI (line in unicharambigs: 3 ITI 1 m 0), I ran
combine_tessdata to make a new fra.traineddata, but there is no
change.
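
As far as I recall from the training documentation, a unicharambigs entry is tab-separated, with the characters of each side separated by single spaces, and a final flag of 1 for a mandatory substitution or 0 for an optional one. Treat that as an assumption and check it against the documentation for your version. Under that assumption, a small Python helper (the function name is made up) can build the lines so that stray spaces do not creep in:

```python
def ambig_line(wrong, right, mandatory=False):
    """Build one tab-separated unicharambigs entry.

    `wrong` and `right` are sequences of single unichars. The field layout
    follows the format assumed above, not a verified specification.
    """
    fields = [
        str(len(wrong)),
        " ".join(wrong),
        str(len(right)),
        " ".join(right),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


# The two ambiguities mentioned above: l'I -> n and ITI -> m.
print(ambig_line(["l", "'", "I"], ["n"]))
print(ambig_line(["I", "T", "I"], ["m"]))
```

If the documented behaviour for optional (0) entries holds, they only act as a hint when the replacement turns a non-dictionary word into a dictionary word, which could be one reason no change is visible.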

So I tried to help Tesseract by giving it our own segmented text
image. In this case the blur is removed and the recognition gives
better results, as you can see in this image:
http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-1_0001312_box_0006_boxes+(2).png?gda=_sgwA3Iq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobOWEMBOZDXT0mTiVSy6rk8juef4gIssVZMUVd4ovTnHRV4u3aa4iAIyYQIqbG9naPgh6o8ccLBvP6Chud5KMzIQ
or this one too:
http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-1_0001312_box_0005_boxes.png?gda=pgvY8moq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobIvfRTlYBT-BD2NUWBDUNMqwfOToRrNOWJtPPKSAn4D797daDQaep90o7AOpSKHW0

I guess that Faux and plafonds (and maybe even Faux-plafonds)
are present in the basic Tesseract dictionary, since the recognition
is good with the original Tesseract data.
However, if I use a new dictionary I have created from a list of
about 350k French words (using wordlist2dawg to create fra.word-dawg
and rebuilding fra.traineddata) and run Tesseract on the same
image, the recognition is Foux-plufonds. Neither Foux nor plufonds
is in my list, whereas Faux-plafonds, Faux and plafonds all are.
If you have any idea that could help me with this too, I would be very
grateful.

Next, I will try to provide character-image by character-image to
Tesseract to simplify again the recognition, but if you have any other
idea to improve it, I am definitely interested.

Thank you in advance for any help you will be able to provide me,
Jonathan.




Re: Especial Characteres

2011-03-03 Thread Dmitry Silaev
Manuel,

Is the error message generated by version 2.xx? Did you try to run
version 3.xx with my por.traineddata file?
I don't get it - have you succeeded or not?
Please provide us with the image you are trying to recognize.

Warm regards,
Dmitry Silaev








Tesserac 2.0 not working

2011-03-03 Thread Pieter
Hi, I added the code from this site to my project:
http://www.pixel-technology.com/freeware/tessnet2/
It worked well until I installed the Tesseract 3.0 Windows executable.
Now when I run my application it shuts down when it hits this line of
code:
 ocr.Init(@"D:\Projects\AMCDF\Source\Frameworks\Device\AMCDF.Device.GUI\Resources\tessdata\", "eng", false);

This used to work. Any help?
I removed all references to 3.0 from the registry with regedit, and still no luck.
