I would suggest moving that processing into cluster mode or a worker thread, if you can. With that, it can be much faster.
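To make the worker-thread idea a bit more concrete, here is a rough sketch, not a tested solution for your setup. It assumes the per-page work can run inside a worker; `workerSource`, `processPage`, and `processAllPages` are names I made up for illustration, and the string the worker sends back is only a stand-in for the real PNG conversion and OCR, which is not shown. The main thread awaits each page before sending the next one, so only one page is in flight at a time.

```javascript
const { Worker } = require('node:worker_threads');

// Code that runs inside the worker thread. The reply built here is a
// placeholder for the real per-page work (PNG conversion + OCR).
const workerSource = `
  const { parentPort } = require('node:worker_threads');
  parentPort.on('message', (pageIndex) => {
    parentPort.postMessage({ pageIndex, text: 'page ' + pageIndex });
  });
`;

// Hypothetical helper: send one page index to the worker and resolve
// with the worker's reply for that page.
function processPage(worker, pageIndex) {
  return new Promise((resolve, reject) => {
    const onMessage = (result) => {
      worker.off('error', onError); // clean up the unused listener
      resolve(result);
    };
    const onError = (err) => {
      worker.off('message', onMessage);
      reject(err);
    };
    worker.once('message', onMessage);
    worker.once('error', onError);
    worker.postMessage(pageIndex);
  });
}

// Process pages strictly one after another.
async function processAllPages(numberOfPages) {
  const worker = new Worker(workerSource, { eval: true });
  const results = [];
  try {
    for (let pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
      // await pauses the loop until this page is finished
      results.push(await processPage(worker, pageIndex));
    }
  } finally {
    await worker.terminate();
  }
  return results;
}

processAllPages(3).then((results) => console.log(results));
```

The `await` inside the `for` loop is what enforces the ordering. Calling `.then()` inside a plain loop starts every page immediately, which is likely where the memory use comes from.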

On Mon, Jun 8, 2020 at 10:25 AM Matthew Hamilton <buddhatr...@gmail.com> wrote:

> I'm relatively new to JavaScript programming for node.js, and I've been reading about this for 5 hours and cannot wrap my head around it, so here I am...
>
> I am trying to get the text from a PDF for searching. I need page numbers, line numbers, and character positions of the results. It appears that pdf.js cannot keep line breaks at the very least. I wonder if it will keep multiple sequential spaces, but the line breaks are a dealbreaker, so I've moved on.
>
> Now I'm using pdf-image to convert the PDF document to a PNG for each page. Then I want to use tesseract.js to run OCR on the PNG files to get the text as it appears in the PDF, including line breaks and extra spaces. The problem is that if the PDF document is more than 5-10 pages, execution kills my laptop. The process of converting the PDF to PNGs consumes over 12GB of RAM and never finishes to even move on to the OCR, which has to be worse. The average number of pages in the PDF documents I am processing is 300-500, so I have to batch process.
>
> The problem I have is that pdf-image and tesseract.js both use promises for async processing. It's really the async that's killing my laptop. I just want to get the number of pages, loop over each page one at a time, convert it to PNG, then perform the OCR, then finish some other synchronous processing before moving on to the next page.
>
> The code I have right now that doesn't work is:
>
> import Tesseract from 'tesseract.js';
> import pdfimage from 'pdf-image';
>
> var PDFImage = pdfimage.PDFImage;
> var pdfImage = new PDFImage(pdfFilePath, { convertOptions: { "-density": "196" }});
>
> for(var pageIndex = 0; pageIndex < numberOfPages; pageIndex++)
> {
>   pdfImage.convertPage(pageIndex).then(function (pageImage) {
>     Tesseract.recognize(pageImage, 'eng').then(({ data: { text } }) => {
>       console.log(text);
>       //perform other synchronous processing with the text before moving to the next page...
>     });
>   });
> }
>
> I can process one page without the for loop in about 200ms, but when I try to loop everything gets messed up. I'm not sure how to proceed with processing these promises synchronously. I know promises are supposed to be more efficient, but sometimes order is important and resources are limited for unchecked parallel processing... Like file type conversion and OCR for 300-500 page documents.
>
> As a nice-to-have, I would also like to figure out how to load and initialize tesseract.js once and then just call the recognize method. I have tried the following code to achieve that, but I think it loads and initializes then reloads and initializes when it calls the recognize method. Controlling that behavior may not be possible, but I figured I'd throw it out there.
>
> (async () =>
> {
>   await Tesseract.load();
>   await Tesseract.loadLanguage('eng');
>   await Tesseract.initialize('eng');
> });
>
> //then perform the convertPage then recognize as shown in the first code block above...
>
> Thank you!
>
> --
> Job board: http://jobs.nodejs.org/
> New group rules: https://gist.github.com/othiym23/9886289#file-moderation-policy-md
> Old group rules: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> ---
> You received this message because you are subscribed to the Google Groups "nodejs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/nodejs/6495d8ad-44c7-48fa-9936-e8ec687d2807o%40googlegroups.com.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/nodejs/CAJV_9FrEkCmOiqKkpgG9ZZ1GdNHJnNRT727%3DBv7oBKPnmcCqSg%40mail.gmail.com.