I'm relatively new to JavaScript programming for node.js, and I've been
reading about this for 5 hours and cannot wrap my head around it, so here I
am... I am trying to get the text from a PDF for searching. I need page
numbers, line numbers, and character positions of the results. It appears
that pdf.js cannot keep line breaks at the very least. I wonder if it will
keep multiple sequential spaces, but the line breaks are a dealbreaker, so
I've moved on. Now I'm using pdf-image to convert the pdf document to a png
for each page. Then I want to use tesseract.js to run OCR on the png files
to get the text as it appears in the pdf including line breaks and extra
spaces. The problem is if the pdf document is more than 5-10 pages, then
execution kills my laptop. The process of converting the pdf to png's
consumes over 12GB of RAM and never finishes to even move on the the OCR
which has to be worse. The average number of pages in the pdf documents I
am processing is 300-500, so I have to batch process. The problem I have is
that pdf-image and tesseract.js both use promises for async processing.
It's really the async that's killing my laptop. I just want to get the
number of pages, loop over each page one at a time, convert it to png, then
perform the OCR, then finish some other synchronous processing before
moving on to the next page. The code I have right now that doesn't work is:
import Tesseract from 'tesseract.js';
import pdfimage from 'pdf-image';
var PDFImage = pdfimage.PDFImage;
var pdfImage = new PDFImage(pdfFilePath, { convertOptions: { "-density":
"196" }});
for(var pageIndex = 0; pageIndex < numberOfPages; pageIndex++)
{
pdfImage.convertPage(pageIndex).then(function (pageImage) {
Tesseract.recognize(pageImage, 'eng').then(({ data: { text } }) => {
console.log(text);
//perform other synchronous processing with the text before moving to the
next page...
});
});
}
I can process one page without the for loop in about 200ms, but when I try
to loop everything gets messed up. I'm not sure how to proceed with
processing these promises synchronously. I know promises are supposed to be
more efficient, but sometimes order is important and resources are limited
for unchecked parallel processing... Like file type conversion and OCR for
300-500 page documents.
As a nice-to-have, I would also like to figure out how to load and
initialize tesseract.js once and then just call the recognize method. I
have tried the following code to achieve that, but I think it loads and
initializes then reloads and initializes when it calls the recognize
method. Controlling that behavior may not be possible, but I figured I'd
throw it out there.
(async () =>
{
await Tesseract.load();
await Tesseract.loadLanguage('eng');
await Tesseract.initialize('eng');
});
//then perform the convertPage then recognize as shown in the first code
block above...
Thank you!
--
Job board: http://jobs.nodejs.org/
New group rules:
https://gist.github.com/othiym23/9886289#file-moderation-policy-md
Old group rules:
https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
---
You received this message because you are subscribed to the Google Groups
"nodejs" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/nodejs/6495d8ad-44c7-48fa-9936-e8ec687d2807o%40googlegroups.com.