I'm testing out OCRing PDF tables using Tesseract OCR. I'm borrowing the concept from here: http://craiget.com/extracting-table-data-from-pdfs-with-ocr/
I've made good progress using the PPM examples on Rosetta Code and am having fun with it. (NOTE: I am starting off with an image resized down to 157x158. The original image is 5100x6601 at 600 dpi.)

Here's an image of my progress: http://imgur.com/a/3fcKK

I'm determining a "line" by checking whether the rolling sum of the previous 10 points is zero. I don't want all black pixels, just the ones that constitute a line.

I'm stuck because this simple approach shrinks the matrix. I think the cause is the infix (\): it returns one result per window of 10, so each result has 9 fewer rows or columns than xb. I'm not yet savvy enough with matrices to figure out what to do from here.

   $ xb
58 157
   $ hlines
58 148
   $ vlines
49 157

My next logical step (assuming the matrices had equal shapes) was to essentially AND them together, so that I had a combined black/white image/matrix of the vertical and horizontal lines. I was then going to attempt to chop up the image as in the Python blog post and feed the pieces to Tesseract.

Any tips on taking it further would be great. Thanks for the help.

You can get the PPM here: https://www.dropbox.com/s/qoi1glkqs0tfezs/small.ppm

require 'files'

readppm=: monad define
  dat=. fread y                                        NB. read from file
  msk=. 1 ,~ (*. 3 >: +/\) (LF&=@}: *. '#'&~:@}.) dat  NB. mark field ends
  't wbyh maxval dat'=. msk <;._2 dat                  NB. parse
  'wbyh maxval'=. 2 1([ {. [: _99&". (LF,' ')&charsub)&.> wbyh;maxval  NB. convert to numeric
  if. (_99 0 +./@e. wbyh,maxval) +. 'P6' -.@-: 2{.t do. _1 return. end.
  (a. i. dat) makeRGB |.wbyh                           NB. convert to basic bitmap format
)

makeRGB=: 0&$: : (($,)~ ,&3)
fillRGB=: makeRGB }:@$
setPixels=: (1&{::@[)`(<"1@(0&{::@[))`]}
getPixels=: <"1@[ { ]

NB. viewmat _50 (+ / % #) \ _50 (+ / % #)\"1 x2

z=:readppm 'c:/temp/small.ppm'

NB. sum R, G and B into a single number per pixel
x2=:+/"1 z

NB. threshold: 1 where the RGB sum is at least 500 (light), 0 where dark
xb =: 500 <: x2

NB. 1 where 10 consecutive pixels along a row are all dark
hlines=:(10 (+/)\"1 xb) = 0
NB. 1 where 10 consecutive pixels down a column are all dark
vlines=:(10 (+/)\ xb) = 0
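
The shape mismatch described above (58 148 and 49 157 against 58 157) might be handled along the following lines. This is only a minimal sketch, not a tested solution for the real image: it uses a small made-up 6x8 matrix in place of xb and a window of 3 instead of 10, and the names n, pad, widen, hfull, vfull and lines are hypothetical. The idea is to pad each window-flag vector with n-1 zeros on both sides and take a rolling OR of width n, which marks every pixel covered by an all-dark window and brings the result back to the original length. Combining the two results with OR (+.) rather than AND is an assumption: OR keeps pixels that lie on either kind of line, whereas AND would keep only the pixels where lines cross.

```j
NB. Made-up stand-in for xb: 1 = light, 0 = dark
xb =: 6 8 $ 1
xb =: 0 (<2;1 2 3 4 5) } xb     NB. horizontal dark run in row 2
xb =: 0 (<0 1 2 3;6) } xb       NB. vertical dark run in column 6

n   =: 3                        NB. window length (10 in the post)
pad =: (<: n) # 0               NB. n-1 zeros

hl =: 0 = n +/\"1 xb            NB. window flags along rows, shape 6 6
vl =: 0 = n +/\ xb              NB. window flags down columns, shape 4 8

NB. widen: mark every position covered by a flagged window,
NB. restoring the original vector length
widen =: 3 : 'n +./\ pad , y , pad'

hfull =: widen"1 hl             NB. shape 6 8
vfull =: widen"1&.|: vl         NB. work on columns via transpose, shape 6 8

lines =: hfull +. vfull         NB. 1 where a pixel is on either kind of line
```

With real data the same definitions would be applied to the thresholded xb with n set to 10. Isolated dark pixels or runs shorter than n stay 0 in lines, which matches the intent of keeping only the pixels that constitute a line.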
