bug with lucene version 3.0.1

2010-03-03 Thread thomas . boy
Hello developers,
first of all, I want to thank you for this project! I like using it
in combination with Lucene.
But I think I've found a small bug in PDFBox version 1.0.0.
The class org.apache.pdfbox.searchengine.lucene.IndexFiles
has to be changed, because the constructor of the StandardAnalyzer class
has changed in the new version of Lucene
(see http://lucene.apache.org/java/3_0_1/api/all/index.html).
I hope this helps a little.

Greetz Thomas Boy






Relation between COS and PD model

2010-03-03 Thread Johannes Koch

Hi,

I wonder what the intended relation between the COS and the PD model is.

Is the COS model the actual data model and the PD classes are only views 
on this data?


Is the PD layer supposed to cache data (for performance issues)?

If data is changed via methods in the PD layer, how are these changes 
propagated to other PD objects? Do we need an observer mechanism for 
data changes?


--
Johannes Koch
Fraunhofer Institute for Applied Information Technology FIT
Web Compliance Center
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628 Fax: +49-2241-142065


bug with lucene version 3.0.1

2010-03-03 Thread thomas . boy
Hello developers,
please see my first mail.
I think some more changes are necessary,
but you will see.

Greetz Thomas Boy








Re: Relation between COS and PD model

2010-03-03 Thread Andreas Lehmkühler
Hi,

Sent: Wed, 03 Mar 2010
From: Johannes Koch johannes.k...@fit.fraunhofer.de

 Hi,
 
 I wonder what the intended relation between the COS and the PD model is.
 
 Is the COS model the actual data model and the PD classes are only views 
 on this data?
AFAIU, yes.

 Is the PD layer supposed to cache data (for performance issues)?
There may be some cases where data will be cached, but it wasn't intended.

 If data is changed via methods in the PD layer, how are these changes 
 propagated to other PD objects? Do we need an observer mechanism for 
 data changes?
I don't think so. Data changes should be written to the COS model by the PD 
layer.

At [1] you will find a brief description of the relation between the COS and 
the PD model.

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/userguide/index.html
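
The "PD classes are views on the COS data" relation described in this reply can be sketched with plain Java collections. This is only a minimal illustration under assumptions, not PDFBox code: `CosDict` and `PdInfoView` are hypothetical stand-ins for a COS dictionary and a PD wrapper class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a COS dictionary: the actual data model.
class CosDict {
    private final Map<String, String> items = new HashMap<>();

    String get(String key) { return items.get(key); }

    void set(String key, String value) { items.put(key, value); }
}

// Hypothetical stand-in for a PD class: a typed view that stores no data itself.
class PdInfoView {
    private final CosDict dict;

    PdInfoView(CosDict dict) { this.dict = dict; }

    // Every read goes straight to the underlying dictionary.
    String getTitle() { return dict.get("Title"); }

    // Every write goes straight to the underlying dictionary.
    void setTitle(String title) { dict.set("Title", title); }
}

class PdViewDemo {
    public static void main(String[] args) {
        CosDict shared = new CosDict();
        PdInfoView first = new PdInfoView(shared);
        PdInfoView second = new PdInfoView(shared);

        first.setTitle("Annual report");
        // The second view sees the change without any observer mechanism,
        // because neither view keeps its own copy of the value.
        System.out.println(second.getTitle());
    }
}
```

As long as the wrappers hold no state of their own, two PD-style objects over the same dictionary cannot disagree, which is why no observer mechanism is needed in this simple case.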


Re: Re: Relation between COS and PD model

2010-03-03 Thread Andreas Lehmkühler
Hi,

Sent: Wed, 03 Mar 2010
From: Johannes Koch johannes.k...@fit.fraunhofer.de

 Hi Andreas,
 
 Andreas Lehmkühler wrote:
  Sent: Wed, 03 Mar 2010
  From: Johannes Koch johannes.k...@fit.fraunhofer.de
  Is the PD layer supposed to cache data (for performance issues)?
  There may be some cases where data will be cached, but it wasn't
  intended.
  
  If data is changed via methods in the PD layer, how are these changes 
  propagated to other PD objects? Do we need an observer mechanism for 
  data changes?
  I don't think so. Data changes should be written to the COS model by the PD
  layer.
 
 How will caching PD objects synchronize their cached PD objects with 
 underlying COS data changed by other PD objects?
I don't remember a concrete example, but I'm sure that there are a few. But I
think the solution is obvious. You just have to reinitialize your cached value
when calling the corresponding setter.

BR
Andreas Lehmkühler
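
The "reinitialize your cached value when calling the corresponding setter" idea can be sketched as follows. This is a minimal sketch under assumptions, not the PDFBox implementation: `CachingFontView` is a hypothetical wrapper, and a plain `Map` stands in for the COS dictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical PD-style wrapper that caches a value read from COS-style data.
class CachingFontView {
    private final Map<String, String> cosDict; // stand-in for a COS dictionary
    private String cachedEncoding;             // null means "not loaded yet"

    CachingFontView(Map<String, String> cosDict) { this.cosDict = cosDict; }

    String getEncoding() {
        if (cachedEncoding == null) {
            // Pretend this lookup is expensive, which is why it is cached.
            cachedEncoding = cosDict.get("Encoding");
        }
        return cachedEncoding;
    }

    void setEncoding(String encoding) {
        cosDict.put("Encoding", encoding); // write through to the data model
        cachedEncoding = encoding;         // reinitialize the cached value
    }
}

class CachingSetterDemo {
    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("Encoding", "WinAnsiEncoding");
        CachingFontView font = new CachingFontView(dict);

        System.out.println(font.getEncoding()); // first call fills the cache
        font.setEncoding("MacRomanEncoding");
        System.out.println(font.getEncoding()); // cache and dictionary agree
    }
}
```

This keeps the cache consistent only for changes that go through this particular wrapper's setter.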


Re: Re: Relation between COS and PD model

2010-03-03 Thread nisen
You can use COSDocument.

-- 
nisen(English Name)/倪森(Chinese Name)
Blog: http://nisen.javaeye.com


Re: Re: Relation between COS and PD model

2010-03-03 Thread Jukka Zitting
Hi,

2010/3/3 Andreas Lehmkühler andr...@lehmi.de:
 From: Johannes Koch johannes.k...@fit.fraunhofer.de
 How will caching PD objects synchronize their cached PD objects with
 underlying COS data changed by other PD objects?
 I don't remember a concrete example, but I'm sure that there are a few. But I
 think the solution is obvious. You just have to reinitialize your cached value
 when calling the corresponding setter.

See PDFont.get/setEncoding for a good example of this.

The problem that I believe Johannes is referring to is that there's
currently no way for the PD object to know when the underlying COS
object (typically a dictionary) is changed, which makes all the
current caching solutions a bit brittle. This is also why I was
opposed to the earlier idea of extending the current COSObjectable
mechanism and would in fact prefer to avoid it as much as possible.
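
To make the brittleness concrete, here is a small sketch of a cached value going stale when the underlying data is changed without the wrapper knowing. It is an illustration under assumptions only: `CachedEncodingView` is hypothetical, and a plain `Map` stands in for the COS dictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper with a cached field and no way to learn about
// changes made directly to the underlying dictionary.
class CachedEncodingView {
    private final Map<String, String> cosDict;
    private String cached;

    CachedEncodingView(Map<String, String> cosDict) { this.cosDict = cosDict; }

    String getEncoding() {
        if (cached == null) {
            cached = cosDict.get("Encoding");
        }
        return cached;
    }
}

class StaleCacheDemo {
    // Returns true when the wrapper's answer no longer matches the dictionary.
    static boolean isStale(CachedEncodingView view, Map<String, String> dict) {
        return !view.getEncoding().equals(dict.get("Encoding"));
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("Encoding", "WinAnsiEncoding");
        CachedEncodingView view = new CachedEncodingView(dict);
        view.getEncoding(); // fills the cache

        // Another component edits the dictionary directly, bypassing the wrapper.
        dict.put("Encoding", "MacRomanEncoding");

        System.out.println("stale: " + isStale(view, dict));
    }
}
```
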

PS. I've been trying (see PDFBOX-626) to reduce the memory impact of
the full COS object hierarchy that we keep in memory for all PDF
documents, but it looks like there are no more big improvements to be
made without some radical design changes. One thing I've been
considering is making the PD model the canonical data layer and using
COS objects only during parsing and serialization. This should give us
dramatic memory improvements for text extraction and rendering use
cases, but may be troublesome for all use cases where existing PDF
documents are being modified. Perhaps we should consider creating an
optimized read-only version of PDFBox in addition to the fully
featured version we now have.

BR,

Jukka Zitting


Re: Re: Relation between COS and PD model

2010-03-03 Thread nisen
Oh my God, I wrote that in Chinese. Sorry.

I think you can use cos.COSDocument as the cache.

To cache in memory:
new COSDocument(new pdfbox.io.RandomAccessBuffer())
If you use the PD layer, you can use PDDocument.load( new RandomAccessBuffer())

The default is the file system, in your temp dir or a dir you set. It is called
scratchFile in PDFBox; I think it is a cache mechanism.


-- 
nisen(English Name)/倪森(Chinese Name)
Blog: http://nisen.javaeye.com


[jira] Commented: (PDFBOX-595) extracted text contains character names instead of the characters themselves

2010-03-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840824#action_12840824
 ] 

Andreas Lehmkühler commented on PDFBOX-595:
---

After applying PDFBOX-592 it works like a charm. What environment do you use? 
I'm on Ubuntu Linux 32-bit with Java 1.6.0_15.

 extracted text contains character names instead of the characters themselves
 

 Key: PDFBOX-595
 URL: https://issues.apache.org/jira/browse/PDFBOX-595
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 0.8.0-incubator
Reporter: Godmar Back
 Attachments: scansoftbugtestcase.pdf


 PDF files created by ScanSoft PDFCreate! aren't properly extracted.
 For instance, this code:
 BT
 0 0 0 rg
 0 Tc 0 Tw /F1 12 Tf
 1 0 0 1 42 739  Tm
 [(T)hi)s)1(i)s)1(a)6(t)(e)(s))2(P)(D)F)4(c)(r))(a)(t)(e)(d)0(w)i)t)(h)0(S)c)(a)(ns)of))2(P)(D)F)C)r))(a)(t)(e)(!))s)1(pr))nt)(e)r)7(dr))ve)(r)7(f)r)m)2(a)6(.ht)(m)l)2(f))
 1 0 0 1 438 739  Tm
 [(l)e)6(vi)e)(w)e)(d)0(i)n)0(F)i)r))(f)x)0(on)
 1 0 0 1 42 725  Tm
 [(W)(i)ndow)s)1(X)P)(.)0(T)oda)y)0(i)s)1(01/)(07/)(2010.)
 ET
 is extracted as:
 This is a test PDF created with Scansoft PDFCreate!'s printer driver from a 
 .html file viewed in Firefox on
 Windows XP. Today is zero1slashzero7slashtwozero1zero.
 pdftotext (Poppler) extracts:
 This is a test PDF created with Scansoft PDFCreate!'s printer driver from a 
 .html file viewed in Firefox on Windows XP. Today is 01/07/2010.
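
 The symptom above (`zero1slashzero7slash...`) is what the issue title describes: glyph names are emitted where the characters themselves were expected. As a rough illustration only, and not PDFBox's actual fix, a lookup from a few standard glyph names to characters could look like the sketch below. `GlyphNames` and its small table are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps a few standard glyph names to their characters.
class GlyphNames {
    private static final Map<String, String> NAME_TO_CHAR = new HashMap<>();

    static {
        String[] digits = {"zero", "one", "two", "three", "four",
                           "five", "six", "seven", "eight", "nine"};
        for (int i = 0; i < digits.length; i++) {
            NAME_TO_CHAR.put(digits[i], String.valueOf(i));
        }
        NAME_TO_CHAR.put("slash", "/");
        NAME_TO_CHAR.put("period", ".");
        NAME_TO_CHAR.put("space", " ");
    }

    // Returns the character for a known glyph name, or the input unchanged.
    static String resolve(String glyphName) {
        return NAME_TO_CHAR.getOrDefault(glyphName, glyphName);
    }

    public static void main(String[] args) {
        String[] glyphs = {"zero", "one", "slash", "zero", "seven"};
        StringBuilder text = new StringBuilder();
        for (String glyph : glyphs) {
            text.append(resolve(glyph));
        }
        System.out.println(text); // 01/07
    }
}
```

 Unknown names are passed through unchanged, so a missing table entry shows up in the output as the glyph name itself, much like the extraction result quoted in this report.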

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [gsoc2010] PDF GUI discuss frame

2010-03-03 Thread nisen
Vote result:
   * PackageNameChoices: org.apache.pdflens (3): nisen, Mel, Adam
   * ProjectNameChoices: PDFLens (3): nisen, Mel, Adam
   * PluginSystemChoices: no comment
   * GUI_API_Choices: Pivot (3): nisen, Mel, Adam

We will go with these choices.


2010/2/27 Martinez, Mel - 1004 - MITLL m.marti...@ll.mit.edu:
 I'll ditto Adam's votes:

 * PackageName:  org.apache.pdflens (+1)
 * ProjectName:  PDFLens(+1), OpenPDF(-1)
 * PluginSystem: no comment
 * GUI_API:      Pivot (+1)

 -Mel

 -Original Message-
 From: a...@swmc.com [mailto:a...@swmc.com]
 Sent: Friday, February 26, 2010 11:51 AM
 To: dev@pdfbox.apache.org
 Subject: Re: [gsoc2010] PDF GUI discuss frame

 * PackageName:  org.apache.pdflens (+1)
 * ProjectName:  PDFLens(+1), OpenPDF(-1)
 * PluginSystem: no comment
 * GUI_API:      Pivot (+1)

 --Adam




 From:
 nisen nisen...@gmail.com
 To:
 dev@pdfbox.apache.org
 Date:
 02/25/2010 19:44
 Subject:
 [gsoc2010] PDF GUI discuss frame



 This mail thread and its replies have become very long, so we need a new
 discussion. I wrote a wiki page for this project's choices. We not only
 choose, but also give newcomers some reference.

 The choice frame:
    * PackageNameChoices:
    * ProjectNameChoices:
    * PluginSystemChoices:
    * GUI_API_Choices:
 You can find more info and the latest version at:
 http://code.google.com/p/pdfboxgui/wiki/Choices
 You can add to it there directly; if you prefer mail, reply to this message
 and I will tidy it up into the wiki.

 You can vote like me:
    * PackageNameChoices: org.apache.pdflens (+1)
    * ProjectNameChoices: PDFLens (+1), OpenPDF (-1)
    * PluginSystemChoices: OSGi (+1), JPF (0)
    * GUI_API_Choices: Pivot (+1)

 You can vote like Mel (info taken from this mail):
    * PackageNameChoices: org.apache.pdflens (+1) maybe
    * ProjectNameChoices: PDFLens (+1), PDFDbg (-1)
    * PluginSystemChoices:
    * GUI_API_Choices: Pivot (+1)


 -Original Message-
 From: Todd Volkert [mailto:tvolk...@gmail.com]
 Sent: Thursday, February 25, 2010 9:29 AM
 To: dev@pdfbox.apache.org
 Subject: Re: [idea] PdfReader In Google Summer of Code 2010


 --
 nisen(English Name)/倪森(Chinese Name)
 Blog: http://nisen.javaeye.com



-- 
nisen(English Name)/倪森(Chinese Name)
Blog: http://nisen.javaeye.com


Re: Reopen PDFBOX-483?

2010-03-03 Thread steve poling

Hello again,

I got distracted from this issue by other work, and I've returned to it 
today. Here are the experiments I've performed:


1) view PDF with PdfReader (it renders correctly)
2) print PDF to HP LaserJet 4P (it renders with many lines and text omitted)
3) comment-out W/W* in PageDrawer.properties
4) restart PdfReader and repeat #1 and #2 (with same result)

I archived all my code changes and retrieved the latest sources from 
svn. So, I should be running the latest and greatest code.


If you go to PDFBOX-490 
(https://issues.apache.org/jira/browse/PDFBOX-490), you'll find the attached 
file filled.pdf, which manifests this error, but I've been seeing this 
with a lot of different PDFs: display looks good, print looks bad. I can 
attach another file to PDFBOX-483 
(https://issues.apache.org/jira/browse/PDFBOX-483) if you'd like.


Can you point me to where I should look to see why text appears on 
screen, but not on paper?


Thanks in advance,

steve



[jira] Commented: (PDFBOX-595) extracted text contains character names instead of the characters themselves

2010-03-03 Thread Godmar Back (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841028#action_12841028
 ] 

Godmar Back commented on PDFBOX-595:


My report was against 0.8.0-incubator. If the problem is fixed in 
1.0.1-SNAPSHOT, close the report and push PDFBOX-592 along.



[jira] Commented: (PDFBOX-595) extracted text contains character names instead of the characters themselves

2010-03-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841087#action_12841087
 ] 

Andreas Lehmkühler commented on PDFBOX-595:
---

This issue probably depends on the environment used. A user reported a 
similar problem [1] that depends on the JDK in use. The 64-bit versions of 
the JDK seem to be problematic. What do you use? Can you confirm that 
behaviour?

[1] http://markmail.org/message/xdpdd7wu5mdzxcnl
