[tesseract-ocr] Re: The text is not recognized from png

amrapalli karan Tue, 07 Apr 2020 21:42:54 -0700

Thanks for the post. I tried to use deskew in R, but it throws an error 
during installation. I have a workaround that gives somewhat similar 
results: the magick package has image_deskew, but that didn't seem to work, 
so I rotate the image manually instead. The output now contains a '|' and 
'CATHODEFULL', and I am not sure why. Is there any way around this?


Code:
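library(magick)  # image_read_pdf() also needs pdftools; image_ocr() needs tesseract
image <- image_read_pdf('https://www.hindustancopper.com/Upload/Reports/0-637189269505122500-AnnualReport.pdf')
text <- image %>%
  image_rotate(3) %>%  # manual 3-degree rotation used in place of image_deskew()
  image_ocr()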
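
The code above rotates the page but still runs Tesseract with its default 
page segmentation mode, so the PSM 6 suggestion from the reply quoted below 
is not applied yet. A minimal sketch of combining the two, assuming the 
magick and tesseract packages are installed: ocr_page is a hypothetical 
helper name, the 3-degree angle is simply the value from my code, and 
whether this recovers the missing 'CATHODE FULL' text on this particular 
page is untested.

```r
# Hypothetical helper: rotate a page image by a fixed angle, then OCR it
# with Tesseract's page segmentation mode set to 6 (single uniform block of text).
ocr_page <- function(img, angle = 3, psm = 6) {
  # Manual rotation, as in the workaround above (angle is an assumption per scan)
  straight <- magick::image_rotate(img, angle)

  # Write the rotated page to a temporary PNG so tesseract::ocr() can read it
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp))
  magick::image_write(straight, path = tmp, format = "png")

  # Build an English engine with the requested page segmentation mode
  engine <- tesseract::tesseract(
    language = "eng",
    options = list(tessedit_pageseg_mode = psm)
  )
  tesseract::ocr(tmp, engine = engine)
}
```

It could be called on a single page, for example 
cat(ocr_page(image_read_pdf("AnnualReport.pdf", pages = 1))), where the 
local file name is only a placeholder.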



On Tuesday, April 7, 2020 at 6:06:55 PM UTC+5:30, Lakshay Saini wrote:
>
> Hi,
>
> 1. Deskew the image to get straight text lines.
> 2. Use tesseract's PSM 6 mode, which treats the page as a single uniform 
> block of text. It helps read the page row by row, which can be very 
> useful in table extraction.
>
> The Tesseract engine can give great results, depending on the quality of 
> the image provided to it. It cannot give you 100% accuracy all the time, 
> although if the image quality is great, 100% is possible. :)
>
> I have attached the results after deskewing the image. Kindly look into 
> the same. I have done the same in python.
>
> On Tuesday, April 7, 2020 at 11:08:25 AM UTC+5:30, amrapalli karan wrote:
>>
>> I have this .pdf file which I am able to read only partially. I am using 
>> R language to fetch the data from the pdf file which is uploaded in the 
>> form of an image.
>>
>> The expected output is:
>>
>> CONTINUOUS CAST COPPER WIRE ROD 11 MM 44*1*567*CATHODE FULL **434122*
>> CONTINUOUS CAST COPPER WIRE ROD NS 439678
>> CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc
>>
>> The actual output which I am getting:
>>
>> CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567 
>> CONTINUOUS CAST COPPER WIRE ROD NS 439678
>> CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.
>>
>> The highlighted part of the text is missing when I am extracting the data. A 
>> part of the code that I am using in R is :
>>
>> library(pdftools)   # provides pdf_convert()
>> library(tesseract)  # provides ocr()
>> pdf_convert(event_url, 
>>             pages = 1, 
>>             dpi = 850, 
>>             filenames = "page1.png")
>> # what does the data look like
>> text <- ocr("page1.png")
>> cat(text)
>>
>> What changes should I make that would help me fetch the complete data? 
>> Thanks in advance
>>
>>
