Well, apparently you are correct and it is tied to permissions somehow. I imagine Leptonica and Tesseract need specific permissions for some of their read/write operations. I was able to reproduce the same fopenWriteStream and pixWrite errors with Tesseract from the command line just now, after changing the read/write/execute permissions on an input file. I'll keep drilling down...
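A rough sketch of how the permissions theory could be checked from inside the service process, using only the Python standard library. It rests on an assumption that has not been verified here: that the tessedit_write_images debug image (tessinput.tif) is written relative to the process's current working directory. If that holds, a non-writable working directory would explain the stream errors even though the PDF itself is still produced. The function name and the report keys are hypothetical, not part of Tesseract or Leptonica.

```python
import os
import tempfile


def access_report(input_path, output_base, cwd=None):
    """Report whether the current user can read the input and write the outputs.

    Hypothetical helper for diagnosis only; not a Tesseract or Leptonica API.
    """
    cwd = cwd or os.getcwd()
    # Resolve the output directory the same way a relative path would resolve.
    out_dir = os.path.dirname(os.path.abspath(os.path.join(cwd, output_base)))
    return {
        "input_readable": os.access(input_path, os.R_OK),
        # Creating a file needs write and search (execute) permission on the directory.
        "output_dir_writable": os.access(out_dir, os.W_OK | os.X_OK),
        # Assumption: debug images such as tessinput.tif land in the working directory.
        "cwd_writable": os.access(cwd, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Self-contained demo on a throwaway directory and an empty placeholder file.
    with tempfile.TemporaryDirectory() as tmp:
        page = os.path.join(tmp, "pg_0010.tif")
        open(page, "wb").close()
        print(access_report(page, os.path.join(tmp, "pg_0010"), cwd=tmp))
```

Running this as the same user and from the same working directory as the service, then again from an interactive shell, should show whether the two environments differ in what they are allowed to read and write.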
On Friday, March 29, 2019 at 3:04:19 PM UTC-5, Lucas L. wrote:
>
> OK, I appreciate the suggestion and clarification, but the aptitude
> package manager doesn't seem to have a later version than the one that I
> have now. I suppose I should build it from source, but your own page for
> installing from source suggests using aptitude first.
>
> tesseract-ocr is already the newest version (4.00~git2844-607e8fd8-2).
> 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
>
> Also, how could there be a permissions issue when the PDF is created, just
> not sized correctly? I would expect the PDF to not be created at all if
> that were the case.
>
> On Friday, March 29, 2019 at 2:52:28 PM UTC-5, zdenop wrote:
>>
>> First of all: upgrade to the latest tesseract code. A lot of fixes were
>> implemented in the meantime.
>>
>> Next: "Error in fopenWriteStream" indicates a problem with writing.
>> Check privileges, disk space, etc. Then try another format (jpeg, png) to
>> see if it helps.
>>
>> Zdenko
>>
>> On Fri, 29 Mar 2019 at 17:42, Lucas L. <[email protected]> wrote:
>>
>>> OK, I am running up against another issue, and it's getting weirder.
>>> Since Tesseract does not take PDFs as input, this service does the deed of
>>> breaking a PDF into pages, and then converting each of those pages to an
>>> image format (either lzw-compressed TIFF or uncompressed PPM if that
>>> fails). Somehow, if I run ImageMagick and then Tesseract on these pages
>>> individually from a command line using the same parameters in the service
>>> code, it runs fine processing the TIFF. But when the service runs, I get:
>>>
>>> Error in fopenWriteStream: stream not opened
>>> Error in pixWrite: stream not opened
>>>
>>> And the output pdf has all of the pages and they are not mangled...
>>> however they are shrunk into a tiny corner of the page. I have attached the
>>> resulting file. I feel that it is obvious from the fact that it works when
>>> I run it outside the service that it is a code issue... however I really am
>>> not sure what it could be doing differently from my command line. The pages
>>> come out looking great when I run tesseract on the individual pages
>>> manually. The errors do not appear when I run the command lines manually.
>>>
>>> The command lines and params I am using:
>>>
>>> Convert the input PDF (which is scanned and has no OCR layer) to input
>>> image:
>>> convert -depth 16 -density 300 -colorspace RGB -despeckle -flatten -compress lzw -background white -alpha off "/path/pg_0010.pdf" "/path/pg_0010.tif"
>>>
>>> Process the input image for OCR and output to PDF:
>>> tesseract -l eng "/path/pg_0010.tif" "/path/pg_0010" pdf
>>>
>>> Configuration parameters from /usr/share/tesseract-ocr/4.00/tessdata/configs
>>> tessedit_create_pdf 1
>>> tessedit_pageseg_mode 3
>>> tessedit_write_images true
>>>
>>> On Thursday, March 28, 2019 at 1:31:36 PM UTC-5, Lucas L. wrote:
>>>>
>>>> Environment
>>>>
>>>> - Tesseract 4.0.0-beta.3-249-g607e
>>>> - leptonica-1.76.0
>>>> - Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri
>>>>   Feb 8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>> Current Behavior:
>>>>
>>>> I work at a SaaS firm which provides cloud storage services
>>>> specializing in documents. As a part of our service, we try to create PDFs
>>>> with searchable text layers from scanned documents. When processing PPMs
>>>> which are created by ImageMagick from the original document, Leptonica
>>>> mangles the image before it can be OCR'd properly by Tesseract. This
>>>> results in a PDF unreadable by both human eyes and Tesseract. This only
>>>> seems to happen for some specific documents.
>>>>
>>>> How do I know it's Leptonica, specifically?
>>>>
>>>> I have executed Tesseract with the config values tessedit_write_images
>>>> 1 and tessedit_pageseg_mode 0. From my understanding, the second
>>>> option does not enable OCR at all while processing with Tesseract (which
>>>> speeds up my test cases) and the first option outputs a .tif debug image
>>>> which is apparently what Leptonica feeds to Tesseract after processing.
>>>> That image is also mangled.
>>>>
>>>> Sample data
>>>>
>>>> I have extracted a single page from a PDF -- the process works on a
>>>> page-by-page basis and most of the documents we work with contain highly
>>>> sensitive information, so I had no other option but to do this.
>>>> Regardless, it is good sample data. The "pg_0009.ppm" file is the
>>>> original input fed into Tesseract on the command line which was converted
>>>> from the original scanned document by ImageMagick. The "tessinput.tif"
>>>> file is the image produced by the tessedit_write_images 1 option which is
>>>> supposed to be OCR'd by Tesseract. This particular page caused a seg fault
>>>> in Tesseract, something that doesn't usually happen, and I suspect it is
>>>> because the text is overlapped so many times that the OCR engine has too
>>>> much to handle.
>>>>
>>>> Google Drive since it's too large for an attachment:
>>>> https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing
>>>>
>>>> Expected Behavior:
>>>>
>>>> Leptonica leaves the image mostly intact so that Tesseract can provide
>>>> a proper text layer for the output PDF. Alternatively, a configuration
>>>> option is available to bypass Leptonica.
>>>>
>>>> Any and all help is appreciated with this issue. Thanks for reading.
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/4ac58e80-fd54-49f4-b479-3a33f5ca5388%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.

