[jira] [Created] (PDFBOX-1304) Text extraction meets "Could not parse predefined CMAP" and returns just a small part of the content containing garbage chars.

2012-05-06 Thread Huan LI (JIRA)
Huan LI created PDFBOX-1304:
---

 Summary: Text extraction meets "Could not parse predefined CMAP" 
and returns just a small part of the content containing garbage chars.
 Key: PDFBOX-1304
 URL: https://issues.apache.org/jira/browse/PDFBOX-1304
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.6.0
 Environment: Win7 32bits
Reporter: Huan LI


I'm using pdfbox-1.6.0 to extract text from a Chinese PDF file (see the 
attachment "fj.pdf").
 
The extraction code looks like this:
[code]
stripper = new PDFTextStripper(encoding);
txt = stripper.getText(_pdfDoc);
[/code] 
When getText() runs, the console prints:
[console]
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUO1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
[/console]
After getText() returns, txt contains only a small part of the PDF content 
(much of it is missing) plus some garbage characters such as 
"犖犑狌犣犎犗犝犔犻犺犅" (see attachment "fj.txt").
 
I've heard that "org.apache.pdfbox.cos.COSString.java" had some errors in 
pdfbox-0.7.3. Has COSString.java been corrected in 1.6.0?
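
The console output above repeats the same error many times, which hides how few distinct CMap names are actually involved. As a triage aid, here is a minimal sketch that counts failures per CMap name from a pasted log. It uses only the Java standard library and is not part of PDFBox; the class name CMapFailureSummary and the method summarize are hypothetical names chosen for illustration, and the pattern assumes the message wording shown in the log.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the log; not a PDFBox class.
public class CMapFailureSummary {
    // Assumes the message wording from the console output and captures the quoted CMap name.
    private static final Pattern FAILURE =
            Pattern.compile("Could not parse predefined CMAP file for '([^']+)'");

    /** Counts failures per CMap name, keeping the order in which names first appear. */
    public static Map<String, Integer> summarize(String log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = FAILURE.matcher(log);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample; paste the full console output here instead.
        String log = String.join("\n",
                "Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'",
                "Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'",
                "Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'");
        summarize(log).forEach((name, n) -> System.out.println(name + ": " + n));
    }
}
```

Run against the full log in this report, the same counting gives four distinct names across 13 errors: Founder-PKU2-UCS2 (6), Founder-PKUE1-UCS2 (3), Founder-PKUO1-UCS2 (1) and Founder-PKUF1-UCS2 (3).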


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (PDFBOX-1304) Text extraction meets "Could not parse predefined CMAP" and returns just a small part of the content containing garbage chars.

2012-05-06 Thread Huan LI (JIRA)

 [ https://issues.apache.org/jira/browse/PDFBOX-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Huan LI updated PDFBOX-1304:


Attachment: fj.txt, fj.pdf






Re: 1.7 release?

2012-05-06 Thread Andreas Lehmkuehler

Hi,

On 04.05.2012 15:46, Timo Boehme wrote:
> On 03.05.2012 21:04, Michael McCandless wrote:
>> Any guesstimates for a 1.7.0 release?
>>
>> It's been a long time (9 months) since 1.6.0... and I count ~203
>> commits since 1.6.0.
>
> There was already some discussion about it (see "Re: Next release(s)?" dating
> from 2012-04-10) and it is clear that a new version (probably 1.7.0) should be
> released soon.

IMHO there are some things which should be done first: integrate Maruan's latest 
patch (PDFBOX-1000), improve the TTF-Parser (PDFBOX-490) 

> However I think we will wait until the project lead is back online.

I guess you are addressing me as PMC Chair. I'm afraid there is a
misunderstanding I'd like to clarify.

There is no concept of leadership within the ASF. An Apache project is led by 
the PMC [1]. The PMC Chair [2] is just the speaker of the project and acts as 
interface to the board of the foundation. All PMC members [3] including the 
chair are equal and each of them has one vote.

> Kind regards,
> Timo


BR
Andreas Lehmkühler

[1] http://www.apache.org/foundation/how-it-works.html#pmc
[2] http://www.apache.org/foundation/how-it-works.html#pmc-chair
[3] http://www.apache.org/foundation/how-it-works.html#pmc-members


Re: 1.7 release?

2012-05-06 Thread Maruan Sahyoun

Before integrating the current work at PDFBOX-1000 I would prefer to:

- make sure the lexer is using the new IO classes
- move some parts to the (new) SimpleParser; e.g. some keywords are already 
handled in the lexer, which is more than the lexer should do IMO
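
To illustrate the separation meant in the second point, here is a minimal, self-contained sketch. It is not the PDFBOX-1000 code: every name in it (LexerParserSketch, Token, lex, parse) is hypothetical, and the tokenizing is deliberately simplistic. The lexer only classifies tokens, and the parser is the only place that knows what the keyword "obj" means.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; unrelated to the actual PDFBox classes.
public class LexerParserSketch {
    public enum TokenType { KEYWORD, NUMBER }

    public static final class Token {
        public final TokenType type;
        public final String text;
        public Token(TokenType type, String text) { this.type = type; this.text = text; }
    }

    /** Lexer: splits on whitespace and classifies each token, nothing more. */
    public static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) continue; // empty input yields one empty string
            TokenType type = word.matches("\\d+") ? TokenType.NUMBER : TokenType.KEYWORD;
            tokens.add(new Token(type, word));
        }
        return tokens;
    }

    /** Parser: interprets keywords; here only the object header "<num> <gen> obj". */
    public static String parse(List<Token> tokens) {
        if (tokens.size() == 3
                && tokens.get(0).type == TokenType.NUMBER
                && tokens.get(1).type == TokenType.NUMBER
                && tokens.get(2).type == TokenType.KEYWORD
                && tokens.get(2).text.equals("obj")) {
            return "object " + tokens.get(0).text + " generation " + tokens.get(1).text;
        }
        return "unrecognized";
    }

    public static void main(String[] args) {
        System.out.println(parse(lex("12 0 obj")));
    }
}
```

With this split, adding or changing keyword handling touches only the parser, while the lexer stays a plain tokenizer.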

regards

Maruan

On 06.05.2012 at 16:46, Andreas Lehmkuehler wrote:

> Hi,
> 
> On 04.05.2012 15:46, Timo Boehme wrote:
>> On 03.05.2012 21:04, Michael McCandless wrote:
>>> Any guesstimates for a 1.7.0 release?
>>> 
>>> It's been a long time (9 months) since 1.6.0... and I count ~203
>>> commits since 1.6.0.
>> 
>> There was already some discussion about it (see "Re: Next release(s)?" dating
>> from 2012-04-10) and it is clear that a new version (probably 1.7.0) should 
>> be
>> released soon.
> IMHO there are some things which should be done first: integrate Maruan's 
> latest patch (PDFBOX-1000), improve the TTF-Parser (PDFBOX-490) 
> 
>> However I think we will wait until the project lead is back online.
> I guess you are addressing me as PMC Chair. I'm afraid there is a
> misunderstanding I'd like to clarify.
> 
> There is no concept of leadership within the ASF. An Apache project is led by 
> the PMC [1]. The PMC Chair [2] is just the speaker of the project and acts as 
> interface to the board of the foundation. All PMC members [3] including the 
> chair are equal and each of them has one vote.
> 
>> Kind regards,
>> Timo
> 
> BR
> Andreas Lehmkühler
> 
> [1] http://www.apache.org/foundation/how-it-works.html#pmc
> [2] http://www.apache.org/foundation/how-it-works.html#pmc-chair
> [3] http://www.apache.org/foundation/how-it-works.html#pmc-members