[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf

Christian (Jira) Tue, 01 Dec 2020 04:02:04 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241476#comment-17241476
 ]


Christian  commented on PDFBOX-5029:
------------------------------------

Hi Tilman, in your "sorted" files there are spaces between words, but the word 
order within each sentence is reversed. The text also does not follow the 
column order of the pdf file: it jumps from "first line-first column to first 
line-second column to first line-third column" and so on. 
In addition, some vowel signs are positioned incorrectly on top of consonants: 
the same vowel+consonant combination is sometimes right and sometimes wrong. 
The same applies to the order of some "consonant clusters" - I'm not sure this 
is the correct way to describe it, but it would be like the word "the" 
rendered as "hte", if that makes sense.
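
To make the column problem concrete, here is a minimal Python sketch. The 
page layout and the labels (c1-l1 = column 1, line 1) are made up for 
illustration and are not taken from test.pdf or from PDFBox output.

```python
# Hypothetical page with three columns, each listing its lines top to bottom.
columns = [
    ["c1-l1", "c1-l2"],
    ["c2-l1", "c2-l2"],
    ["c3-l1", "c3-l2"],
]


def column_order(cols):
    """Expected reading order: finish one column before starting the next."""
    return [line for col in cols for line in col]


def row_order(cols):
    """Order described for the sorted files: line 1 of every column, then line 2."""
    return [line for row in zip(*cols) for line in row]


print(column_order(columns))
print(row_order(columns))
```

The second ordering is what I mean by jumping across columns line by line.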

The "not sorted" files are even worse: spaces are missing, the word order is 
reversed, and the letters in each word are backward as well. There is no 
"column issue" in this case. I summarize it with two examples:

Ex - sorted files: "the cat is red" --> "red is cat the" + the column issue.
Ex - not sorted files: "the cat is red" --> "dersitaceht" (no column issue)
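
The two examples above can be reproduced with a short Python sketch. The 
function names are hypothetical, and this only mimics the symptoms on an 
English sentence; it says nothing about what PDFBox does internally.

```python
def sorted_symptom(sentence):
    """Keep the spaces but reverse the order of the words."""
    return " ".join(reversed(sentence.split()))


def not_sorted_symptom(sentence):
    """Drop the spaces, then reverse every character in the line."""
    return sentence.replace(" ", "")[::-1]


print(sorted_symptom("the cat is red"))      # red is cat the
print(not_sorted_symptom("the cat is red"))  # dersitaceht
```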

In terms of "accuracy", my original utf-8 file attached above has no column 
issue, and the words are in the right order within each sentence. We also 
noticed that the first word of each line in the first pdf column is missing. 
I guess this does not make things easier. 

Ex - test_scraped.utf8 file: "the cat is red" --> "catisred" (no column issue + 
missing first word)
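
The same kind of sketch for the test_scraped.utf8 symptom, again with a 
hypothetical function name and only as an illustration of the pattern:

```python
def scraped_symptom(sentence):
    """Drop the first word of the line and join the remaining words without spaces."""
    words = sentence.split()
    return "".join(words[1:])


print(scraped_symptom("the cat is red"))  # catisred
```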

Thanks again for your help.

> Tika - Issues extracting Arabic script from pdf
> -----------------------------------------------
>
>                 Key: PDFBOX-5029
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5029
>             Project: PDFBox
>          Issue Type: Bug
>         Environment: Windows - Anaconda / Spyder
>            Reporter: Christian 
>            Priority: Major
>         Attachments: PDFBOX-5029-not-sorted-2.0.21.txt, 
> PDFBOX-5029-not-sorted-trunk.txt, PDFBOX-5029-sorted-2.0.21.txt, 
> PDFBOX-5029-sorted-trunk.txt, extracting_text_asian_pdf.py, test.pdf, 
> test_scraped.utf8
>
>
> I'm working on building a corpus of Uygur texts, and some of the content 
> comes from pdf files. I wrote a short python script to scrape text from pdf 
> using tika-python. The text is written in Arabic script, and the output looks 
> good, but there is one major problem: many spaces between words are missing, 
> and I really do not know how to address this issue. I am attaching a pdf 
> file, the script to scrape its text, and the output (test_scraped.utf8). 
> Thanks in advance for your help.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
