[ https://issues.apache.org/jira/browse/PDFBOX-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ernesto De Santis updated PDFBOX-534:
-------------------------------------

    Attachment: kvfs.txt
                kvfs.pdf

kvfs.pdf is part of my thesis work. It was created with the pdflatex command.
kvfs.txt is the body text that PDFBox extracted from it.

> PDF file created with LaTeX is bad parsed
> -----------------------------------------
>
>                 Key: PDFBOX-534
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-534
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: Linux/Ubuntu 9
>            Reporter: Ernesto De Santis
>         Attachments: kvfs.pdf, kvfs.txt
>
>
> I'm getting unexpected behavior when parsing a PDF file.
> I'm trying to get the clean body text of a file, and I get a lot of aXX
> strings, where each X is a digit. They appear to be the character codes of
> the real characters, but I'm not sure.
> My code is very simple:
> String[] args = {"/home/ernesto/tesis/documento/kvfs.pdf"};
> ExtractText.main(args);
> The output I get is:
> a73a109a112a108a101a109a101a110a116a97a110a100a111 a97a99a99a101a115a111 a97
> a115a105a115a116a101a109a97a115 a100a101
> a97a114a99a104a105a118a111a115 a118a105a114a116a117a97a108a101a115
> a112a97a114a97 a108a97 a104a101a114a114a97a109a105a101a110a116a97
> a100a101 a98a250a115a113a117a101a100a97 a75a110a101a111a98a97a115a101
> and more ......
> The PDF file was generated by the pdflatex command on Ubuntu 9.
> The PDF properties are:
> producer: pdfTeX-1.40.3
> format: PDF-1.4
> security: NO
> optimized: NO
> paper: A4, vertical (210 x 297 mm)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
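
One way to check the guess in the report, that each aNNN token is the decimal character code of the real character, is to decode the extracted text by hand. The sketch below is only a diagnostic aid, not part of PDFBox and not a fix for the parser; the class name GlyphNameDecoder and its decode method are hypothetical, and it assumes every token is the letter "a" followed by one to three decimal digits that can be read directly as a Unicode code point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): turns "a73a109a112" into "Imp".
class GlyphNameDecoder {
    // One token: the letter 'a' followed by 1 to 3 decimal digits (assumption).
    private static final Pattern TOKEN = Pattern.compile("a(\\d{1,3})");

    static String decode(String extracted) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(extracted);
        int last = 0;
        while (m.find()) {
            // Copy anything between tokens (spaces, line breaks) unchanged.
            out.append(extracted, last, m.start());
            // Treat the digits as a decimal code point (assumption).
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        // Copy whatever follows the last token.
        out.append(extracted.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // First two words of the output quoted in the report.
        String sample = "a73a109a112a108a101a109a101a110a116a97a110a100a111 "
                      + "a97a99a99a101a115a111";
        System.out.println(decode(sample)); // prints "Implementando acceso"
    }
}
```

Decoded this way, the first words of the quoted output read "Implementando acceso", which supports the reporter's guess that the numbers are character codes, at least for this file.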