Thanks for the post. I tried to use deskew in R, but it throws an error
during installation. I do have a workaround that gave somewhat similar
results. The magick package has image_deskew, but that didn't seem to work
for me. The output contains a '|' and 'CATHODEFULL', and I am not sure why.
Is there any way around this?
Code:
library(magick)  # image_read_pdf() also needs the pdftools package installed
image <- image_read_pdf('https://www.hindustancopper.com/Upload/Reports/0-637189269505122500-AnnualReport.pdf')
text <- image %>%
  image_rotate(3) %>%
  image_ocr()
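
For comparison, here is a rough sketch of how the two suggestions quoted below (deskew first, then PSM 6) might look in R. It is untested against this particular PDF. The helper name ocr_deskewed and the threshold of 40 are my own placeholders, and I am assuming the tesseract R package accepts the tessedit_pageseg_mode parameter through its options list.

```r
library(magick)     # image_deskew(), image_write()
library(tesseract)  # tesseract(), ocr()

# Hypothetical helper: straighten one page image, then OCR it with PSM 6
# ("assume a single uniform block of text").
ocr_deskewed <- function(img, threshold = 40) {
  # threshold = 40 is only a starting point (assumption); tune it per scan
  straight <- image_deskew(img, threshold = threshold)

  # write the deskewed page to a temporary PNG so ocr() can read it by path
  png_path <- tempfile(fileext = ".png")
  on.exit(unlink(png_path))
  image_write(straight, path = png_path, format = "png")

  # assumption: the page segmentation mode is set via this engine option
  engine <- tesseract(language = "eng",
                      options = list(tessedit_pageseg_mode = 6))
  ocr(png_path, engine = engine)
}

# Usage sketch (pdf_url stands in for the report's address):
# page <- image_read_pdf(pdf_url, pages = 1, density = 300)
# cat(ocr_deskewed(page))
```

If image_deskew still leaves the page tilted, trying a different threshold may be worth it before falling back to a fixed image_rotate angle.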
On Tuesday, April 7, 2020 at 6:06:55 PM UTC+5:30, Lakshay Saini wrote:
>
> Hi,
>
> 1. Deskew the image to get straight text lines.
> 2. Use tesseract's PSM 6 mode (assume a single uniform block of text).
> This mode reads the page horizontally, line by line, which can be very
> useful in table extraction.
>
> The Tesseract engine can give great results, depending on the quality of
> the image provided to it. It cannot give you 100% accuracy all the time,
> although if the image quality is great, 100% is possible. :)
>
> I have attached the results after deskewing the image. Kindly look into
> the same. I have done the same in python.
>
> On Tuesday, April 7, 2020 at 11:08:25 AM UTC+5:30, amrapalli karan wrote:
>>
>> I have a PDF file that I am able to read only partially. I am using R to
>> extract the data from the PDF, which was uploaded in the form of an
>> image.
>>
>> The expected output is:
>>
>> CONTINUOUS CAST COPPER WIRE ROD 11 MM 44*1*567*CATHODE FULL **434122*
>> CONTINUOUS CAST COPPER WIRE ROD NS 439678
>> CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc
>>
>> The actual output which I am getting:
>>
>> CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567
>> CONTINUOUS CAST COPPER WIRE ROD NS 439678
>> CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.
>>
>> The highlighted part of the text is missing when I extract the data.
>> Part of the code that I am using in R is:
>>
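>> library(pdftools)   # pdf_convert()
>> library(tesseract)  # ocr()
>>
>> # render page 1 of the PDF to a PNG image
>> pdf_convert(event_url,
>>             pages = 1,
>>             dpi = 850,
>>             filenames = "page1.png")
>> # what does the data look like
>> text <- ocr("page1.png")
>> cat(text)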
>>
>> What changes should I make that would help me fetch the complete data?
>> Thanks in advance
>>
>>