[ https://issues.apache.org/jira/browse/PDFBOX-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-779.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.4.0
         Assignee: Andreas Lehmkühler

Works fine with the current trunk version. I attached the resulting text.

> All English characters and some Chinese words are separated by a space
> ----------------------------------------------------------------------
>
>                 Key: PDFBOX-779
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-779
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1, 1.3.1
>         Environment: x86_64 GNU/Linux
>                      java 1.6.0_20
>                      pdfbox 1.2.1
>                      fontbox 1.2.1
>            Reporter: Jingxuan Yu
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.4.0
>
>         Attachments: IKAnalyzer.pdf, IKAnalyzer.txt, PDFBOX779-IKAnalyzer.txt
>
>
> See the PDF document and the text document extracted by ExtractText.
> The file's info:
> $ pdfinfo IKAnalyzer.pdf
> Title:          IKAnalyzer中文分词器V3.0使用手册
> Keywords:       IK Analyzer 中文分词器 Lucene
> Author:         林良益、卓诗垚
> Creator:        WPS Office 个人版
> Producer:       PDFlib 7.0.3 (C++/Win32)
> CreationDate:   Sun Dec  6 22:07:26 2009
> Tagged:         no
> Pages:          15
> Encrypted:      no
> Page size:      595.3 x 841.9 pts (A4)
> File size:      441273 bytes
> Optimized:      no
> PDF version:    1.5
> $ pdffonts IKAnalyzer.pdf
> name                                     type              emb sub uni object ID
> ---------------------------------------- ----------------- --- --- --- ---------
> INUZMH+NSimSun-Identity-H                CID TrueType      yes yes yes      7  0
> MGIXAY+MicrosoftYaHei-Identity-H         CID TrueType      yes yes yes      8  0
> CFLOPA+SimSun-Identity-H                 CID TrueType      yes yes yes      6  0
> GHNZKZ+TimesNewRomanPS-BoldMT-Identity-H CID TrueType      yes yes yes     19  0
> UNEBHT+Cambria-Bold-Identity-H           CID TrueType      yes yes yes     20  0
> UQKWWP+Wingdings-Regular-Identity-H      CID TrueType      yes yes yes     33  0
> NKFTTO+MicrosoftYaHei-Identity-H         CID TrueType      yes yes yes     40  0
> OOJXDG+CourierNewPSMT-Identity-H         CID TrueType      yes yes yes     51  0
> WHLDYI+CourierNewPS-ItalicMT-Identity-H  CID TrueType      yes yes yes     58  0
> TXIHGB+Cambria-Identity-H                CID TrueType      yes yes yes    100  0
> CRJWMD+TimesNewRomanPSMT-Identity-H      CID TrueType      yes yes yes    108  0

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.