bug with lucene version 3.0.1
Hello developers,

first of all, I want to thank you for this project! I like using it in combination with Lucene. But I think I've found a small bug in PDFBox version 1.0.0: the class org.apache.pdfbox.searchengine.lucene.IndexFiles has to be changed, because the constructor of the StandardAnalyzer class has changed in the new version of Lucene (see http://lucene.apache.org/java/3_0_1/api/all/index.html).

I hope this helps a little.

Greetz
Thomas Boy
Relation between COS and PD model
Hi,

I wonder what the intended relation between the COS and the PD model is. Is the COS model the actual data model and the PD classes are only views on this data? Is the PD layer supposed to cache data (for performance issues)? If data is changed via methods in the PD layer, how are these changes propagated to other PD objects? Do we need an observer mechanism for data changes?

--
Johannes Koch
Fraunhofer Institute for Applied Information Technology FIT
Web Compliance Center
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628
Fax: +49-2241-142065
bug with lucene version 3.0.1
Hello developers,

please read my first mail. I think some more changes are necessary, but you will see.

Greetz
Thomas Boy
Re: Relation between COS and PD model
Hi,

Sent: Wed, 03 Mar 2010
From: Johannes Koch johannes.k...@fit.fraunhofer.de

> Hi, I wonder what the intended relation between the COS and the PD model is. Is the COS model the actual data model and the PD classes are only views on this data?

AFAIU yes.

> Is the PD layer supposed to cache data (for performance issues)?

There may be some cases where data is cached, but that wasn't intended.

> If data is changed via methods in the PD layer, how are these changes propagated to other PD objects? Do we need an observer mechanism for data changes?

I don't think so. Data changes should be written to the COS model by the PD layer.

At [1] you will find a brief description of the relation between the COS and the PD model.

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/userguide/index.html
Re: Re: Relation between COS and PD model
Hi,

Sent: Wed, 03 Mar 2010
From: Johannes Koch johannes.k...@fit.fraunhofer.de

> Hi Andreas,
>
> Andreas Lehmkühler wrote:
>>> Is the PD layer supposed to cache data (for performance issues)?
>> There may be some cases where data will be cached, but it wasn't intended.
>>> If data is changed via methods in the PD layer, how are these changes propagated to other PD objects? Do we need an observer mechanism for data changes?
>> I don't think so. Data changes should be written to the COS model by the PD layer.
>
> How will caching PD objects synchronize their cached PD objects with underlying COS data changed by other PD objects?

I don't remember a concrete example, but I'm sure that there are a few. But I think the solution is obvious: you just have to reinitialize your cached value when calling the corresponding setter.

BR
Andreas Lehmkühler
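A minimal sketch of the setter-based refresh described in this reply, assuming a PD-style wrapper that caches one value read from a COS-style dictionary. SketchDictionary and SketchFont are hypothetical stand-ins invented for illustration; they are not PDFBox classes and do not reproduce the PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

/** Small demo: the setter writes through and refreshes the cached value. */
class CacheRefreshDemo {
    public static void main(String[] args) {
        SketchDictionary dictionary = new SketchDictionary();
        dictionary.setString("Encoding", "WinAnsiEncoding");

        SketchFont font = new SketchFont(dictionary);
        System.out.println(font.getEncoding()); // first read fills the cache

        font.setEncoding("MacRomanEncoding");   // setter updates dictionary and cache
        System.out.println(font.getEncoding());
        System.out.println(dictionary.getString("Encoding"));
    }
}

/** Hypothetical stand-in for a COS dictionary (assumption, not the PDFBox class). */
class SketchDictionary {
    private final Map<String, String> items = new HashMap<>();

    String getString(String key) {
        return items.get(key);
    }

    void setString(String key, String value) {
        items.put(key, value);
    }
}

/** Hypothetical PD-style wrapper that caches a single value from the dictionary. */
class SketchFont {
    private final SketchDictionary dictionary;
    private String cachedEncoding; // null means "not read from the dictionary yet"

    SketchFont(SketchDictionary dictionary) {
        this.dictionary = dictionary;
    }

    String getEncoding() {
        if (cachedEncoding == null) {
            cachedEncoding = dictionary.getString("Encoding");
        }
        return cachedEncoding;
    }

    void setEncoding(String encoding) {
        // Write the change to the underlying data model first ...
        dictionary.setString("Encoding", encoding);
        // ... then reinitialize the cached value so later reads agree with it.
        cachedEncoding = encoding;
    }
}
```

This only keeps a wrapper consistent with changes made through its own setter; changes made elsewhere are not covered by it.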
Re: Re: Relation between COS and PD model
You can use COSDocument.

2010/3/3 Andreas Lehmkühler andr...@lehmi.de:
> From: Johannes Koch johannes.k...@fit.fraunhofer.de
>> How will caching PD objects synchronize their cached PD objects with underlying COS data changed by other PD objects?
>
> I don't remember a concrete example, but I'm sure that there are a few. But I think the solution is obvious. You just have to reinitialize your cached value when calling the corresponding setter.
>
> BR
> Andreas Lehmkühler

--
nisen(English Name)/倪森(Chinese Name)
Blog: http://nisen.javaeye.com
Re: Re: Relation between COS and PD model
Hi,

2010/3/3 Andreas Lehmkühler andr...@lehmi.de:
> From: Johannes Koch johannes.k...@fit.fraunhofer.de
>> How will caching PD objects synchronize their cached PD objects with underlying COS data changed by other PD objects?
>
> I don't remember a concrete example, but I'm sure that there are a few. But I think the solution is obvious. You just have to reinitialize your cached value when calling the corresponding setter.

See PDFont.get/setEncoding for a good example of this.

The problem that I believe Johannes is referring to is that there's currently no way for the PD object to know when the underlying COS object (typically a dictionary) is changed, which makes all the current caching solutions a bit brittle. This is also why I was opposed to the earlier idea of extending the current COSObjectable mechanism and would in fact prefer to avoid it as much as possible.

PS. I've been trying (see PDFBOX-626) to reduce the memory impact of the full COS object hierarchy that we keep in memory for all PDF documents, but it looks like there are no more big improvements to be made without some radical design changes. One thing I've been considering is making the PD model the canonical data layer and using COS objects only during parsing and serialization. This should give us dramatic memory improvements for text extraction and rendering use cases, but may be troublesome for all use cases where existing PDF documents are being modified. Perhaps we should consider creating an optimized read-only version of PDFBox in addition to the fully featured version we now have.

BR,

Jukka Zitting
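The brittleness described in this message can be illustrated with a small sketch, assuming two PD-style wrappers that share one underlying dictionary. CachingView is a hypothetical class and a plain java.util.Map stands in for the COS dictionary; neither is taken from PDFBox.

```java
import java.util.HashMap;
import java.util.Map;

/** Demo: a wrapper's cache goes stale when another wrapper changes the shared data. */
class StaleCacheDemo {
    public static void main(String[] args) {
        // A plain map stands in for the shared COS dictionary (assumption).
        Map<String, String> cosDictionary = new HashMap<>();
        cosDictionary.put("Encoding", "WinAnsiEncoding");

        CachingView first = new CachingView(cosDictionary);
        CachingView second = new CachingView(cosDictionary);

        first.getEncoding();                    // first view caches the current value
        second.setEncoding("MacRomanEncoding"); // change goes through the other view

        System.out.println("dictionary: " + cosDictionary.get("Encoding"));
        System.out.println("first view: " + first.getEncoding());
    }
}

/** Hypothetical PD-style view that caches a value and refreshes it only in its own setter. */
class CachingView {
    private final Map<String, String> dictionary;
    private String cachedEncoding; // null means "not read yet"

    CachingView(Map<String, String> dictionary) {
        this.dictionary = dictionary;
    }

    String getEncoding() {
        if (cachedEncoding == null) {
            cachedEncoding = dictionary.get("Encoding");
        }
        return cachedEncoding;
    }

    void setEncoding(String encoding) {
        dictionary.put("Encoding", encoding);
        cachedEncoding = encoding; // only this view learns about the change
    }
}
```

In this sketch the dictionary holds MacRomanEncoding after the second view's setter runs, while the first view still returns the WinAnsiEncoding value it cached earlier, because nothing notifies it of the change.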
Re: Re: Relation between COS and PD model
Oh my God, I wrote that in Chinese. Sorry.

I think you can use cos.COSDocument for caching.

To cache in memory: new COSDocument(new pdfbox.io.RandomAccessBuffer())

If you use the PD layer, you can use PDDocument.load( new RandomAccessBuffer())

The default is the file system, in your temp dir or a dir that you set. In PDFBox it is called scratchFile; I think it is the cache mechanism.

--
nisen(English Name)/倪森(Chinese Name)
Blog: http://nisen.javaeye.com
[jira] Commented: (PDFBOX-595) extracted text contains character names instead of the characters themselves
[ https://issues.apache.org/jira/browse/PDFBOX-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12840824#action_12840824 ]

Andreas Lehmkühler commented on PDFBOX-595:
---

After applying PDFBOX-592 it works like a charm. What environment do you use? I'm on Ubuntu Linux 32-bit with Java 1.6.0_15.

> extracted text contains character names instead of the characters themselves
>
> Key: PDFBOX-595
> URL: https://issues.apache.org/jira/browse/PDFBOX-595
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.8.0-incubator
> Reporter: Godmar Back
> Attachments: scansoftbugtestcase.pdf
>
> PDF files created by ScanSoft PDFCreate! aren't properly extracted. For instance, this code:
> BT 0 0 0 rg 0 Tc 0 Tw /F1 12 Tf 1 0 0 1 42 739 Tm [(T)hi)s)1(i)s)1(a)6(t)(e)(s))2(P)(D)F)4(c)(r))(a)(t)(e)(d)0(w)i)t)(h)0(S)c)(a)(ns)of))2(P)(D)F)C)r))(a)(t)(e)(!))s)1(pr))nt)(e)r)7(dr))ve)(r)7(f)r)m)2(a)6(.ht)(m)l)2(f)) 1 0 0 1 438 739 Tm [(l)e)6(vi)e)(w)e)(d)0(i)n)0(F)i)r))(f)x)0(on) 1 0 0 1 42 725 Tm [(W)(i)ndow)s)1(X)P)(.)0(T)oda)y)0(i)s)1(01/)(07/)(2010.) ET
> is extracted as:
> This is a test PDF created with Scansoft PDFCreate!'s printer driver from a .html file viewed in Firefox on Windows XP. Today is zero1slashzero7slashtwozero1zero.
> pdftotext (Poppler) extracts:
> This is a test PDF created with Scansoft PDFCreate!'s printer driver from a .html file viewed in Firefox on Windows XP. Today is 01/07/2010.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
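To make the reported symptom concrete, here is a sketch of a glyph-name-to-character lookup. It is an illustration only: GlyphNames and its small table are hypothetical, the fallback to the raw name is an assumption about how output such as "zero1slashzero7" could arise, and it is neither a diagnosis of PDFBOX-595 nor the PDFBox implementation. The names used (zero, one, two, seven, slash, ...) are the standard glyph names for these characters.

```java
import java.util.HashMap;
import java.util.Map;

/** Demo: decode a sequence of glyph names into text. */
class GlyphNameDemo {
    public static void main(String[] args) {
        String[] names = {"zero", "one", "slash", "zero", "seven", "slash",
                          "two", "zero", "one", "zero"};
        System.out.println(GlyphNames.decode(names));
    }
}

/** Hypothetical helper with a tiny name table; a real glyph list is far larger. */
class GlyphNames {
    private static final Map<String, String> NAME_TO_TEXT = new HashMap<>();

    static {
        String[] digits = {"zero", "one", "two", "three", "four",
                           "five", "six", "seven", "eight", "nine"};
        for (int i = 0; i < digits.length; i++) {
            NAME_TO_TEXT.put(digits[i], Integer.toString(i));
        }
        NAME_TO_TEXT.put("slash", "/");
        NAME_TO_TEXT.put("period", ".");
    }

    /** Maps each glyph name to its character; unknown names are emitted unchanged. */
    static String decode(String[] glyphNames) {
        StringBuilder text = new StringBuilder();
        for (String name : glyphNames) {
            String mapped = NAME_TO_TEXT.get(name);
            // Assumption: falling back to the raw name is what would produce
            // character names in the extracted text when a lookup misses.
            text.append(mapped != null ? mapped : name);
        }
        return text.toString();
    }
}
```

With the table above, the ten names in the demo decode to 01/07/2010; a name missing from the table would appear verbatim in the output.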
Re: [gsoc2010] PDF GUI discuss frame
Vote result:

* PackageNameChoices: org.apache.pdflens (3): nisen, Mel, Adam
* ProjectNameChoices: PDFLens (3): nisen, Mel, Adam
* PluginSystemChoices: no comment
* GUI_API_Choices: Pivot (3): nisen, Mel, Adam

We will go with these choices.

2010/2/27 Martinez, Mel - 1004 - MITLL m.marti...@ll.mit.edu:
> I'll ditto Adam's votes:
>
> * PackageName: org.apache.pdflens (+1)
> * ProjectName: PDFLens(+1), OpenPDF(-1)
> * PluginSystem: no comment
> * GUI_API: Pivot (+1)
>
> -Mel
>
> -Original Message-
> From: a...@swmc.com [mailto:a...@swmc.com]
> Sent: Friday, February 26, 2010 11:51 AM
> To: dev@pdfbox.apache.org
> Subject: Re: [gsoc2010] PDF GUI discuss frame
>
> * PackageName: org.apache.pdflens (+1)
> * ProjectName: PDFLens(+1), OpenPDF(-1)
> * PluginSystem: no comment
> * GUI_API: Pivot (+1)
>
> --Adam
>
> From: nisen nisen...@gmail.com
> To: dev@pdfbox.apache.org
> Date: 02/25/2010 19:44
> Subject: [gsoc2010] PDF GUI discuss frame
>
> The mail thread and its replies have become very long, so we need a new discussion. I wrote a wiki page listing the choices for this project. We are not only choosing, but also leaving a reference for newcomers.
>
> The choice frame:
> * PackageNameChoices:
> * ProjectNameChoices:
> * PluginSystemChoices:
> * GUI_API_Choices:
>
> You can get more info and the latest version at: http://code.google.com/p/pdfboxgui/wiki/Choices
> You can add your choices there directly. If you prefer mail, reply to this message and I will tidy the results up into the wiki.
>
> You can vote like me:
> * PackageNameChoices: org.apache.pdflens (+1)
> * ProjectNameChoices: PDFLens(+1), OpenPDF(-1)
> * PluginSystemChoices: OSGi (+1), JPF (0)
> * GUI_API_Choices: Pivot (+1)
>
> You can vote like Mel (taken from his mail):
> * PackageNameChoices: org.apache.pdflens (+1) maybe
> * ProjectNameChoices: PDFLens(+1), PDFDbg(-1)
> * PluginSystemChoices:
> * GUI_API_Choices: Pivot (+1)
>
> -Original Message-
> From: Todd Volkert [mailto:tvolk...@gmail.com]
> Sent: Thursday, February 25, 2010 9:29 AM
> To: dev@pdfbox.apache.org
> Subject: Re: [idea] PdfReader In Google Summer of Code 2010

--
nisen(English Name)/倪森(Chinese Name)
Blog: http://nisen.javaeye.com
Re: Reopen PDFBOX-483?
Hello again,

I got distracted from this issue by other work and have returned to it today. Here are the experiments I've performed:

1) view PDF with PdfReader (it renders correctly)
2) print PDF to HP LaserJet 4P (it renders with many lines and text omitted)
3) comment out W/W* in PageDrawer.properties
4) restart PdfReader and repeat #1 and #2 (with the same result)

I archived all my code changes and retrieved the latest sources from svn, so I should be running the latest and greatest code.

If you go to PDFBOX-490 (https://issues.apache.org/jira/browse/PDFBOX-490), you'll find the attached file filled.pdf that manifests this error, but I've been seeing this with a lot of different PDFs: display looks good, print looks bad. I can attach another file to PDFBOX-483 (https://issues.apache.org/jira/browse/PDFBOX-483) if you'd like.

Can you point me to where I should look to see why text appears on screen, but not on paper?

Thanks in advance,
steve
[jira] Commented: (PDFBOX-595) extracted text contains character names instead of the characters themselves
[ https://issues.apache.org/jira/browse/PDFBOX-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841028#action_12841028 ]

Godmar Back commented on PDFBOX-595:

My report was against 0.8.0-incubator. If the problem is fixed in 1.0.1-SNAPSHOT, close the report and push PDFBOX-592 along.
[jira] Commented: (PDFBOX-595) extracted text contains character names instead of the characters themselves
[ https://issues.apache.org/jira/browse/PDFBOX-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841087#action_12841087 ]

Andreas Lehmkühler commented on PDFBOX-595:
---

This issue probably depends on the environment used. There is a user who reported a similar problem [1] which depends on the JDK used. There, 64-bit versions of the JDK seem to be problematic. What do you use? Can you confirm that behaviour?

[1] http://markmail.org/message/xdpdd7wu5mdzxcnl