I would suggest moving that processing into cluster mode or a worker
thread. That way it can be much faster.
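
A minimal sketch of the worker-thread idea, assuming Node's built-in
worker_threads module. The worker body is only a placeholder for the real
per-page conversion and OCR, and the names runPageInWorker and
processAllPages are made up for illustration. Each page gets its own worker,
and the next one is not started until the previous one has reported back:

```javascript
const { Worker } = require('worker_threads');

// Worker source, inlined so the sketch fits in one file.
// Placeholder body: a real worker would convert the page and run OCR here.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  parentPort.postMessage({
    pageIndex: workerData.pageIndex,
    text: 'text of page ' + workerData.pageIndex,
  });
`;

// Run one page in its own worker thread and resolve with its message.
function runPageInWorker(pageIndex) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { pageIndex },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error('worker exited with code ' + code));
    });
  });
}

// Process pages strictly one after another, so only one worker is alive
// at any time and memory use stays bounded.
async function processAllPages(numberOfPages) {
  const results = [];
  for (let pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
    results.push(await runPageInWorker(pageIndex));
  }
  return results;
}
```

Whether this is actually faster depends on where the time goes; the main
benefit of the sketch is that the heavy work leaves the main thread and
never runs for more than one page at once.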

On Mon, Jun 8, 2020 at 10:25 AM Matthew Hamilton <buddhatr...@gmail.com>
wrote:

> I'm relatively new to JavaScript programming for node.js, and I've been
> reading about this for 5 hours and cannot wrap my head around it, so here I
> am... I am trying to get the text from a PDF for searching. I need page
> numbers, line numbers, and character positions of the results. It appears
> that pdf.js cannot keep line breaks at the very least. I wonder if it will
> keep multiple sequential spaces, but the line breaks are a dealbreaker, so
> I've moved on. Now I'm using pdf-image to convert the pdf document to a png
> for each page. Then I want to use tesseract.js to run OCR on the png files
> to get the text as it appears in the pdf including line breaks and extra
> spaces. The problem is if the pdf document is more than 5-10 pages, then
> execution kills my laptop. The process of converting the pdf to pngs
> consumes over 12GB of RAM and never finishes, so it never even gets to the
> OCR, which has to be worse. The average number of pages in the pdf documents I
> am processing is 300-500, so I have to batch process. The problem I have is
> that pdf-image and tesseract.js both use promises for async processing.
> It's really the async that's killing my laptop. I just want to get the
> number of pages, loop over each page one at a time, convert it to png, then
> perform the OCR, then finish some other synchronous processing before
> moving on to the next page. The code I have right now that doesn't work is:
>
> import Tesseract from 'tesseract.js';
> import pdfimage from 'pdf-image';
>
> var PDFImage = pdfimage.PDFImage;
> var pdfImage = new PDFImage(pdfFilePath, { convertOptions: { "-density":
> "196" }});
>
> for (var pageIndex = 0; pageIndex < numberOfPages; pageIndex++)
> {
>   pdfImage.convertPage(pageIndex).then(function (pageImage) {
>     Tesseract.recognize(pageImage, 'eng').then(({ data: { text } }) => {
>       console.log(text);
>       // perform other synchronous processing with the text before moving
>       // to the next page...
>     });
>   });
> }
>
> I can process one page without the for loop in about 200ms, but when I try
> to loop, everything gets messed up. I'm not sure how to proceed with
> processing these promises synchronously. I know promises are supposed to be
> more efficient, but sometimes order is important and resources are limited
> for unchecked parallel processing... Like file type conversion and OCR for
> 300-500 page documents.
>
> As a nice-to-have, I would also like to figure out how to load and
> initialize tesseract.js once and then just call the recognize method. I
> have tried the following code to achieve that, but I think it loads and
> initializes then reloads and initializes when it calls the recognize
> method. Controlling that behavior may not be possible, but I figured I'd
> throw it out there.
>
> (async () =>
> {
>   await Tesseract.load();
>   await Tesseract.loadLanguage('eng');
>   await Tesseract.initialize('eng');
> });
>
> // then perform the convertPage then recognize as shown in the first code
> // block above...
>
> Thank you!
>
> --
> Job board: http://jobs.nodejs.org/
> New group rules:
> https://gist.github.com/othiym23/9886289#file-moderation-policy-md
> Old group rules:
> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> ---
> You received this message because you are subscribed to the Google Groups
> "nodejs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to nodejs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/nodejs/6495d8ad-44c7-48fa-9936-e8ec687d2807o%40googlegroups.com
> <https://groups.google.com/d/msgid/nodejs/6495d8ad-44c7-48fa-9936-e8ec687d2807o%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
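
For the loop in the quoted message, the .then() calls only schedule work, so
every page is started at once. A minimal sketch of doing one page at a time
with async/await follows. The function name processPagesInOrder and its
parameters are made up for illustration; the conversion and OCR steps are
passed in as functions so the sketch does not depend on either library:

```javascript
// Convert, OCR and handle each page before the next page is started.
// convertPage(pageIndex) and recognize(pageImage) are assumed to return
// promises; handleText(pageIndex, text) is ordinary synchronous code.
async function processPagesInOrder(numberOfPages, convertPage, recognize, handleText) {
  for (let pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
    // await pauses this loop until the promise settles, so at most one
    // conversion and one OCR run are in flight at any time.
    const pageImage = await convertPage(pageIndex);
    const text = await recognize(pageImage);
    handleText(pageIndex, text);
  }
}
```

With the code from the question, convertPage would presumably be
(i) => pdfImage.convertPage(i) and recognize would be
(img) => Tesseract.recognize(img, 'eng').then(({ data: { text } }) => text).
I have not run it against those libraries, so treat it as a starting point.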

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/nodejs/CAJV_9FrEkCmOiqKkpgG9ZZ1GdNHJnNRT727%3DBv7oBKPnmcCqSg%40mail.gmail.com.
