First of all: upgrade to the latest tesseract code. A lot of fixes have been implemented in the meantime.
Next: "Error in fopenWriteStream" indicates a problem with writing the file. Check privileges, free disk space, etc. Then try another format (jpeg, png) to see if it helps.

Zdenko

On Fri, 29 Mar 2019 at 17:42, Lucas L. <[email protected]> wrote:

> OK, I am running up against another issue, and it's getting weirder. Since
> Tesseract does not take PDFs as input, this service does the deed of
> breaking a PDF into pages, and then converting each of those pages to an
> image format (either lzw-compressed TIFF or uncompressed PPM if that
> fails). Somehow, if I run ImageMagick and then Tesseract on these pages
> individually from a command line using the same parameters in the service
> code, it runs fine processing the TIFF. But when the service runs, I get:
>
> Error in fopenWriteStream: stream not opened
> Error in pixWrite: stream not opened
>
> And the output pdf has all of the pages and they are not mangled...
> however they are shrunk into a tiny corner of the page. I have attached the
> resulting file. I feel that it is obvious from the fact that it works when
> I run it outside the service that it is a code issue... however I really am
> not sure what it could be doing differently from my command line. The pages
> come out looking great when I run tesseract on the individual pages
> manually. The errors do not appear when I run the command lines manually.
>
> The command lines and params I am using:
>
> Convert the input PDF (which is scanned and has no OCR layer) to input image:
> convert -depth 16 -density 300 -colorspace RGB -despeckle -flatten -compress lzw -background white -alpha off "/path/pg_0010.pdf" "/path/pg_0010.tif"
>
> Process the input image for OCR and output to PDF:
> tesseract -l eng "/path/pg_0010.tif" "/path/pg_0010" pdf
>
> Configuration parameters from /usr/share/tesseract-ocr/4.00/tessdata/configs
> tessedit_create_pdf 1
> tessedit_pageseg_mode 3
> tessedit_write_images true
>
>
> On Thursday, March 28, 2019 at 1:31:36 PM UTC-5, Lucas L.
> wrote:
>>
>> Environment
>>
>> - Tesseract 4.0.0-beta.3-249-g607e
>> - leptonica-1.76.0
>> - Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri
>> Feb 8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Current Behavior:
>>
>> I work at a SaaS firm which provides cloud storage services specializing
>> in documents. As a part of our service, we try to create PDFs with
>> searchable text layers from scanned documents. When processing PPMs which
>> are created by ImageMagick from the original document, Leptonica mangles
>> the image before it can be OCR'd properly by Tesseract. This results in a
>> PDF unreadable by both human eyes and Tesseract. This only seems to happen
>> for some specific documents.
>>
>> How do I know it's Leptonica, specifically?
>>
>> I have executed Tesseract with the config values tessedit_write_images 1
>> and tessedit_pageseg_mode 0. From my understanding, the second option
>> does not enable OCR at all while processing with Tesseract (which speeds up
>> my test cases) and the first option outputs a .tif debug image which is
>> apparently what Leptonica feeds to Tesseract after processing. That image
>> is also mangled.
>>
>> Sample data
>>
>> I have extracted a single page from a PDF -- the process works on a
>> page-by-page basis and most of the documents we work with contain highly
>> sensitive information, so I had no other option but to do this. Regardless,
>> it is good sample data. The "pg_0009.ppm" file is the original input fed
>> into Tesseract on the command line which was converted from the original
>> scanned document by ImageMagick. The "tessinput.tif" file is the image
>> produced by the tessedit_write_images 1 option which is supposed to be
>> OCR'd by Tesseract.
>> This particular page caused a seg fault in Tesseract,
>> something that doesn't usually happen, and I suspect it is because the text
>> is overlapped so many times that the OCR engine has too much to handle.
>>
>> Google Drive since it's too large for an attachment:
>> https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing
>>
>> Expected Behavior:
>>
>> Leptonica leaves the image mostly intact so that Tesseract can provide a
>> proper text layer for the output PDF. Alternatively, a configuration option
>> is available to bypass Leptonica.
>>
>> Any and all help is appreciated with this issue. Thanks for reading.
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/4ac58e80-fd54-49f4-b479-3a33f5ca5388%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
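
Following up on the advice to check privileges and space: one way to do that is to run a small check from inside the service itself, under the same user and the same working directory the service uses. The sketch below is only an illustration. It assumes (this is not confirmed anywhere in this thread) that the tessedit_write_images debug image "tessinput.tif" is written to the current working directory of the process, which for a service may well differ from the directory of an interactive shell. The function name check_output_dir and the 50 MB threshold are hypothetical choices, not anything from Tesseract or Leptonica.

```python
import os
import shutil
import tempfile


def check_output_dir(path, min_free_bytes=50 * 1024 * 1024):
    """Return a list of reasons why writing a file into `path` might fail.

    An empty list means no problem was detected. `min_free_bytes` is an
    arbitrary, hypothetical threshold; adjust it to the size of your images.
    """
    if not os.path.isdir(path):
        return [f"{path!r} is not an existing directory"]

    problems = []

    # Try to really create (and automatically delete) a file. This is more
    # reliable than os.access(), which can disagree with the actual result
    # on some filesystems.
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        problems.append(f"cannot create files in {path!r}: {exc}")

    # Check the free space on the filesystem that holds `path`.
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free in {path!r}")

    return problems


if __name__ == "__main__":
    # Assumption: the working directory and the temp directory are the places
    # worth checking first. Add the directory of the output PDF as well.
    for directory in sorted({os.getcwd(), tempfile.gettempdir()}):
        issues = check_output_dir(directory)
        print(directory, "->", issues if issues else "OK")
```

If the working directory turns out not to be writable for the service user, changing the working directory before launching tesseract, or dropping tessedit_write_images from the config, would be the first things to try.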

