[ https://issues.apache.org/jira/browse/PDFBOX-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13023591#comment-13023591 ]
Ernesto De Santis commented on PDFBOX-534:
------------------------------------------

I'm sorry for the late response.
It's working for me. Thanks!

> PDF file created with LaTeX is bad parsed
> -----------------------------------------
>
>                 Key: PDFBOX-534
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-534
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: Linux/Ubuntu 9
>            Reporter: Ernesto De Santis
>             Fix For: 1.2.0
>
>         Attachments: amapn19_03.pdf, amapn19_03.txt, kvfs-PDFKit.txt,
>                      kvfs.pdf, kvfs.txt, kvfs_r944875.txt
>
>
> I'm getting unexpected behavior when parsing a PDF file.
> I'm trying to get the clean body text of a file, and I get a lot of aXX
> strings, where each X is a number. It appears to be the char code of the real
> character, but I don't really know.
> My code is very simple:
> String[] args = {"/home/ernesto/tesis/documento/kvfs.pdf"};
> ExtractText.main(args);
> I used the PDFBox 0.8.0-incubator version, built on 20/9/2009.
> The output I get is:
> a73a109a112a108a101a109a101a110a116a97a110a100a111 a97a99a99a101a115a111 a97
> a115a105a115a116a101a109a97a115 a100a101
> a97a114a99a104a105a118a111a115 a118a105a114a116a117a97a108a101a115
> a112a97a114a97 a108a97 a104a101a114a114a97a109a105a101a110a116a97
> a100a101 a98a250a115a113a117a101a100a97 a75a110a101a111a98a97a115a101
> and more ......
> The PDF file was generated by the pdflatex command on Ubuntu 9.
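
The reporter's guess about char codes can be checked by hand. The following is a minimal sketch, not part of PDFBox and not the fix that went into 1.2.0: the class name `GlyphCodeDecoder` and its `decode` method are hypothetical, and the assumption that each `aNNN` token is the letter `a` followed by a decimal character code is taken only from the description above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, written only to test the "aXX is a char code" guess.
public class GlyphCodeDecoder {
    // One token such as "a73": the letter 'a' followed by decimal digits.
    private static final Pattern TOKEN = Pattern.compile("a(\\d+)");

    /** Replaces every aNNN token with the character whose code is NNN. */
    public static String decode(String garbled) {
        Matcher m = TOKEN.matcher(garbled);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy anything between tokens (for example spaces) unchanged.
            out.append(garbled, last, m.start());
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(garbled.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // First three words of the output quoted in the report.
        String sample = "a73a109a112a108a101a109a101a110a116a97a110a100a111 "
                      + "a97a99a99a101a115a111 a97";
        System.out.println(decode(sample)); // prints "Implementando acceso a"
    }
}
```

Decoding the start of the quoted output this way gives readable Spanish words, which is consistent with the guess that the numbers are character codes. It is only a way to inspect the symptom, not a substitute for correct extraction.
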
> The pdf properties are:
> producer: pdfTeX-1.40.3
> format: PDF-1.4
> security: NO
> optimized: NO
> paper: A4, vertical (210 x 297 mm)
> When I run the PDFBox test, I get this by the console:
> 0 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: d
> INFO [main]: unsupported/disabled operation: d
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: J
> INFO [main]: unsupported/disabled operation: J
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: m
> INFO [main]: unsupported/disabled operation: m
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: l
> INFO [main]: unsupported/disabled operation: l
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: S
> INFO [main]: unsupported/disabled operation: S
> 272 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: re
> INFO [main]: unsupported/disabled operation: re
> 272 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: f
> INFO [main]: unsupported/disabled operation: f
> 1274 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: rg
> INFO [main]: unsupported/disabled operation: rg
> 1275 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: RG
> INFO [main]: unsupported/disabled operation: RG
> 1536 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled operation: f*
> INFO [main]: unsupported/disabled operation: f*

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira