Hi,


I am using Tesseract 4 (git 10f4998a) to process a file with two columns.
A snippet of the image is linked below.  The problem is that there is a
fuzzy line between the two columns, and the column detector has been
confused by it.  I've ended up with one block covering the first column up
to "The" on the second line, followed by a block that spans both columns,
running from "patient has ..." all the way across to "history of low".


I've looked at the debug views, and it looks to me like the line removal
has failed to remove that fuzzy line down the middle.  The word "good" is
then close enough that the column finder decides to merge the two blocks
on that line.


Looking at the code in linefind.cpp and colfind.cpp, I see lots of
constants for various thresholds, but none of them appear to be
configurable, and I'm not sure which way to go now.  I can see three
options.  Would it be better to work on the line detector in linefind.cpp
and try to get rid of that vertical line?  Would I be better off running a
columnar histogram and doing the column splitting myself?  Or should I
ignore the fact that the line hasn't been removed, and concentrate on
tightening up the column finder so that it can separate these two columns
correctly?  It seems to me that the gap is wide enough that it ought to be
able to separate the columns (it does a pretty good job on the rest of the
document, so it can't be far off).
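
To make the columnar histogram option concrete, here is a minimal sketch in
Python of what I have in mind.  It is not Tesseract code and does not use
any Tesseract API.  It assumes the page has already been binarised into a
list of rows of 0/1 pixels (1 = ink), and the names column_ink_profile,
find_gutter, blank_fill and max_rule_width are all hypothetical.  The idea
is to treat a narrow inked run inside the gutter as a stray rule and bridge
over it before picking the widest blank run.

```python
def column_ink_profile(image):
    """Return the fraction of ink pixels in each pixel column.

    image is a list of equal-length rows; each pixel is 1 (ink) or 0.
    """
    height = len(image)
    width = len(image[0])
    return [sum(image[y][x] for y in range(height)) / height
            for x in range(width)]


def find_gutter(image, blank_fill=0.02, max_rule_width=3):
    """Return (start, end) of the widest interior blank run, end exclusive.

    Inked runs no wider than max_rule_width that have blank columns on
    both sides are assumed to be a vertical rule and treated as blank.
    Returns None if there is no interior blank run.
    """
    profile = column_ink_profile(image)
    blank = [fill <= blank_fill for fill in profile]
    width = len(blank)

    # Pass 1: bridge narrow inked runs that sit between blank columns.
    x = 0
    while x < width:
        if blank[x]:
            x += 1
            continue
        start = x
        while x < width and not blank[x]:
            x += 1
        if start > 0 and x < width and x - start <= max_rule_width:
            for i in range(start, x):
                blank[i] = True

    # Pass 2: pick the widest blank run that does not touch a page edge.
    best = None
    x = 0
    while x < width:
        if not blank[x]:
            x += 1
            continue
        start = x
        while x < width and blank[x]:
            x += 1
        if start > 0 and x < width:
            if best is None or x - start > best[1] - best[0]:
                best = (start, x)
    return best


def make_row(y):
    """Synthetic 30-pixel row: two text columns and a broken rule at x=15."""
    row = [0] * 30
    if y % 2 == 0:                      # text on alternate rows
        for x in range(2, 12):
            row[x] = 1
        for x in range(20, 28):
            row[x] = 1
    if y not in (3, 7):                 # the "fuzzy" rule has gaps
        row[15] = 1
    return row


page = [make_row(y) for y in range(10)]
print(find_gutter(page))                # (12, 20)
```

With max_rule_width=0 the rule is not bridged and the gutter is cut in two,
which is roughly the failure I'm seeing.  The midpoint of the returned run
would then be the x coordinate at which to crop the page into two images
before handing each one to Tesseract.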


Any recommendations would be appreciated.


Thanks,


Ewan.




<https://lh3.googleusercontent.com/-mrxB3T8S4fM/Ws1h25mfleI/AAAAAAAACoc/fJi8OkO6wswexnYDZU2uoofSRBCYmPiVwCLcBGAs/s1600/Screen%2BShot%2B2018-04-10%2Bat%2B6.12.48%2BPM.png>
