[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260786#comment-17260786 ]

Christian commented on PDFBOX-5029:

Hi Tilman, first of all, Happy New Year. I have been very busy over the past weeks and have only now come back to the issue of scraping PDF files using Tika. I tried all the possible combinations: the only way to get the correct text is to copy and paste the PDF content into a txt file and then run the script on that file. If I go through Word, there are still mistakes. In any case, this won't solve the issue, because I want to extract the text from the original PDF.

> Tika - Issues extracting Arabic script from pdf
> ---
>
> Key: PDFBOX-5029
> URL: https://issues.apache.org/jira/browse/PDFBOX-5029
> Project: PDFBox
> Issue Type: Bug
> Environment: Windows - Anaconda / Spyder
> Reporter: Christian
> Priority: Major
> Attachments: PDFBOX-5029-not-sorted-2.0.21.txt,
> PDFBOX-5029-not-sorted-trunk.txt, PDFBOX-5029-sorted-2.0.21.txt,
> PDFBOX-5029-sorted-trunk.txt, extracting_text_asian_pdf.py, test.pdf,
> test_scraped.utf8
>
>
> I'm working on building a corpus of Uygur texts and some of the content is
> coming from pdf files. I wrote a short python script to scrape text from pdf
> using tika-python. The texts are written in Arabic script, and the output
> looks good, but there is one major problem: there are many missing spaces
> between words and I really do not know how to address this issue. I am
> attaching a pdf file, the script to scrape its text and the output
> (test_scraped.utf8). Thanks in advance for your help.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242621#comment-17242621 ]

Tilman Hausherr commented on PDFBOX-5029:

Could you please tell me which segment of the PDF is affected, for example a single line? Then please copy and paste that segment into Word, convert it to PDF, and see whether it happens again; if it does, please attach that PDF.

(This will be difficult to solve. I really wonder why the Tika output is so different, because Tika uses our stripper class.)
[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241476#comment-17241476 ]

Christian commented on PDFBOX-5029:

Hi Tilman,

in your "sorted" files there are spaces between words, but the word order within each sentence is reversed. The text also does not follow the column order of the PDF file: it jumps from first line/first column to first line/second column to first line/third column, and so on. In addition, there is a problem with the positioning of some vowel signs on top of consonants: sometimes it is correct and sometimes it is wrong, even for the same vowel+consonant combination. The same goes for the order of some "consonant clusters". I'm not sure this is the correct way to describe it, but it would be like the word "the" rendered as "hte", if that makes sense.

The "not sorted" files are even worse: spaces are missing, the word order is reversed, and the letters within each word are reversed as well. There is no "column issue" in this case.

To summarize with two examples:

Ex - sorted files: "the cat is red" --> "red is cat the" + the column issue.
Ex - not sorted files: "the cat is red" --> "dersitaceht" (no column issue)

In terms of "accuracy", my original UTF-8 file attached above has no column issue and the words are in the right order within the sentences. We also noticed that the first word of each line in the first PDF column is missing. This does not make things easier, I guess.

Ex - test_scraped.utf8 file: "the cat is red" --> "catisred" (no column issue + missing first word)

Thanks again for your help.
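The three English stand-in examples above can be reproduced with a few string transformations. This is only an illustrative sketch of the observed output patterns, not of what PDFBox or Tika do internally; the function names are made up for this example.

```python
def sorted_pattern(sentence: str) -> str:
    """'sorted' files: words intact and space-separated, but in reverse order."""
    return " ".join(reversed(sentence.split()))


def not_sorted_pattern(sentence: str) -> str:
    """'not sorted' files: spaces dropped and every character reversed."""
    return "".join(sentence.split())[::-1]


def tika_pattern(sentence: str) -> str:
    """test_scraped.utf8: first word of the line missing, no spaces between the rest."""
    return "".join(sentence.split()[1:])


sample = "the cat is red"
print(sorted_pattern(sample))      # red is cat the
print(not_sorted_pattern(sample))  # dersitaceht
print(tika_pattern(sample))        # catisred
```

Having the patterns as functions makes it easy to check whether a given extracted line matches one of them when the expected text is known, for example from a native speaker's transcription.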
[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240465#comment-17240465 ]

Tilman Hausherr commented on PDFBOX-5029:

No, I did not use the script; I don't have Python installed. I do have Tika, but I wanted to test with the latest PDFBox version because IMHO it can only be a PDFBox problem, if it is one.
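The thread does not say how the sorted and not-sorted attachments were produced. One plausible way, assuming the PDFBox 2.0.x command-line app and its ExtractText tool with the -sort option, is sketched below. The jar file name and the output file names are assumptions for illustration, and the trunk version uses a different command-line syntax that is not covered here.

```python
import subprocess
from typing import List


def extract_text_command(jar: str, pdf: str, out: str, sort: bool) -> List[str]:
    """Build a PDFBox 2.0.x ExtractText command line (assumed invocation)."""
    cmd = ["java", "-jar", jar, "ExtractText", "-encoding", "UTF-8"]
    if sort:
        # -sort asks the text stripper to order text by position on the page
        cmd.append("-sort")
    cmd += [pdf, out]
    return cmd


def run_extraction(jar: str, pdf: str, out: str, sort: bool) -> None:
    """Run the extraction; requires Java and the PDFBox app jar to be present."""
    subprocess.run(extract_text_command(jar, pdf, out, sort), check=True)
```

Calling run_extraction("pdfbox-app-2.0.21.jar", "test.pdf", "sorted.txt", True) and then again with sort=False would give one sorted and one not-sorted output to compare side by side.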
[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240363#comment-17240363 ]

Christian commented on PDFBOX-5029:

Also, what is the difference between the sorted and not-sorted files you attached? Did you use my script to extract the text? Thanks again.
[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240362#comment-17240362 ]

Christian commented on PDFBOX-5029:

Thanks Tilman, will do. Tomorrow I will be in touch with a native speaker, and I will provide you with the exact lines and missing spaces.
[jira] [Commented] (PDFBOX-5029) Tika - Issues extracting Arabic script from pdf
[ https://issues.apache.org/jira/browse/PDFBOX-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239942#comment-17239942 ]

Tilman Hausherr commented on PDFBOX-5029:

I attached 4 files. IMHO the spaces are there. Can you point to a line that is missing one or several spaces, and then tell me where this line is in the PDF?
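To narrow down candidate lines before asking a native speaker, a rough heuristic is to flag lines in the extracted output that contain an unusually long run of characters without a space. The sketch below is only a starting point: the function name and the threshold of 15 characters are arbitrary assumptions, and combining vowel signs count toward the length.

```python
from typing import List, Tuple


def suspicious_lines(text: str, max_token_len: int = 15) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs whose longest token exceeds max_token_len."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Longest whitespace-delimited token on this line (0 for blank lines)
        longest = max((len(tok) for tok in line.split()), default=0)
        if longest > max_token_len:
            hits.append((lineno, line))
    return hits
```

The function could be applied to the contents of test_scraped.utf8 read with encoding="utf-8"; the reported line numbers refer to the extracted text, so each hit still has to be located in the PDF by hand.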