Re: PDFBox january 2010 report due

2010-01-15 Thread Jukka Zitting
Hi,

On Thu, Jan 14, 2010 at 11:05 AM, Jeremias Maerki
 wrote:
> Sounds good to me. I have nothing to add. Thanks for writing this up!

+1

BR,

Jukka Zitting


Re: PDFBox january 2010 report due

2010-01-15 Thread Philipp Koch
hi,
thanks for writing up the report!
one side note:
shouldn't we maybe add (additionally) the various performance
improvements provided by mel martinez?

regards,
philipp

On Fri, Jan 15, 2010 at 1:38 PM, Jukka Zitting  wrote:
> Hi,
>
> On Thu, Jan 14, 2010 at 11:05 AM, Jeremias Maerki
>  wrote:
>> Sounds good to me. I have nothing to add. Thanks for writing this up!
>
> +1
>
> BR,
>
> Jukka Zitting
>


[jira] Commented: (PDFBOX-90) Support explicit retrieval of page labels

2010-01-15 Thread Johannes Koch (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800695#action_12800695
 ] 

Johannes Koch commented on PDFBOX-90:
-

I'd love to see PDNameTreeNode in the trunk, so that I can use it for 
PDStructureTreeRoot.

> Support explicit retrieval of page labels
> -
>
> Key: PDFBOX-90
> URL: https://issues.apache.org/jira/browse/PDFBOX-90
> Project: PDFBox
>  Issue Type: New Feature
>  Components: PDModel
> Attachments: pdfbox-90.patch
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1283254
> Originally submitted by cvonsee on 2005-09-06 11:26.
> Please include methods in PDPage (or elsewhere) to
> allow explicit retrieval of page label information for
> the current page and for all pages.  Retrieved
> information should include everything that is available
> from the PDF, including page numbering style, label
> prefix and page number for current page.
> Thanks! Keep up the good work!
> Chris von See

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Re: PDFBox january 2010 report due

2010-01-15 Thread Andreas Lehmkühler
Hi,

> hi,
> thanks for writing up the report!
> one side note:
> shouldn't we maybe add (additionally) the various performance
> improvements provided by mel martinez?
Yeah, of course. When I wrote my report they weren't filed yet. 

What about the following addition to the PDFBox section:

"We received a lot of promising patches from different contributors in
recent months. E.g., there were several patches from Mel Martinez, including
notable performance improvements for the parser, which Jukka already added
to the codebase. As some of the other contributions include substantial
changes to the codebase, we have to ask the contributors to sign a CLA first."
 
> regards,
> philipp

BR
Andreas Lehmkühler




Re: Reopen PDFBOX-483?

2010-01-15 Thread Andreas Lehmkühler
Hi,


Sent: Wed, 13 Jan 2010 From: steve poling

> Andreas, I am using the trunk version (like before). And I have 
> commented out the W/W* lines from the PageDrawer.properties file (like 
> before). But I am seeing the same phenomena observed for PDFBOX-483.
I won't reopen PDFBOX-483 without knowing more details on your issue. It could
be related, but probably it's just a side effect which only appears if clipping
is activated/deactivated.
Is it possible to provide us with a sample document?

> I am doing things a little differently. I am running PrintPDF.java, and I 
> have made a kludged hack to get past the PDFBOX-490 issue, but when I 
> print, I get just the form's lines and the AcroField text like before. 
> I've stepped through the code at PDFTrueTypeFont.java where my kludge 
> occurs and nobody's throwing exceptions or anything. (I temporarily 
> hard-coded a stream from the TTF file in place of the embedded stream.)
Hmmm, I've had a similar experience in another case [1]. Everything worked
fine when displaying the PDF using PDFReader, but when I tried to print it,
some of the content got lost. Perhaps you're running into the same issue, or
perhaps your hack is the problem?

> I recall that when we opened PDFBOX-490, I had commented out W/W* and I 
> was getting some text, just the wrong typeface. Where should I look to 
> see whether the clipping or whatnot is responsible for the text going away 
> when printed? I first saw this a while back, and text was rendering OK 
> to the screen, just not to the HP LaserJet.
During my investigations concerning [1] I've learned that displaying and
printing can be quite different things. Both use the same PDFBox rendering
code, but with different Graphics2D implementations (the JDK uses different
classes for displaying and printing). Furthermore, the result may depend on
the printer driver, OS and JDK in use.
What environment are you using?
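
One way to compare the two paths, as a minimal sketch: log which Graphics2D
implementation the drawing code receives and which clip is active, then check
what the clip lets through. The class name ClipProbe and both helper methods
are hypothetical and not part of PDFBox. The sketch uses only java.awt and an
off-screen BufferedImage, so it demonstrates clipping behaviour in general
rather than a real print job.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

final class ClipProbe {

    /** Reports the concrete Graphics2D class and its current clip bounds. */
    static String describe(Graphics2D g) {
        Rectangle clip = g.getClipBounds(); // may be null when no clip is set
        return g.getClass().getName() + " clip=" + (clip == null ? "none" : clip.toString());
    }

    /** Paints a white square image, then fills it black with the given clip active. */
    static BufferedImage paintWithClip(int size, Rectangle clip) {
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, size, size);
            g.setClip(clip);               // only pixels inside the clip can change now
            g.setColor(Color.BLACK);
            g.fillRect(0, 0, size, size);
        } finally {
            g.dispose();
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage img = paintWithClip(10, new Rectangle(0, 0, 5, 5));
        Graphics2D g = img.createGraphics();
        System.out.println(describe(g));
        g.dispose();
        System.out.println("inside clip painted: " + (img.getRGB(2, 2) == Color.BLACK.getRGB()));
        System.out.println("outside clip untouched: " + (img.getRGB(8, 8) == Color.WHITE.getRGB()));
    }
}
```

Calling a helper like describe(...) on the Graphics2D used for on-screen
painting and on the one handed to java.awt.print.Printable.print would show
whether the two contexts differ in class or clip.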

> smiles and cheers,
> 
> steve

BR
Andreas Lehmkühler


[1] https://issues.apache.org/jira/browse/PDFBOX-552


Re: Reopen PDFBOX-483?

2010-01-15 Thread steve poling

Andreas,

You said:
> I won't reopen PDFBOX-483 without knowing more details on your issue. It could
> be related, but probably it's just a side effect which only appears if clipping
> is activated/deactivated.
>
> Is it possible to provide us with a sample document?


That's only fair. You've got the document. I'm using the PDF that I 
attached to PDFBOX-490. (Sorry I didn't say so earlier.)


If memory serves, when I opened PDFBOX-490, I got the current behavior 
until I commented out W/W* whereupon I started getting text output, but 
in the wrong typeface.


Is there any place I should look in the code to follow up on your 
clipping activated/deactivated conjecture?




> Hmmm, I've had a similar experience in another case [1]. Everything worked
> fine when displaying the PDF using PDFReader, but when I tried to print it,
> some of the content got lost. Perhaps you're running into the same issue, or
> perhaps your hack is the problem?


The possibility that my hack is at fault weighs heavily on my mind. I'll 
drop back to the current trunk version and repeat my experiment. (I'm 
not seeing any NullPointerException as described in PDFBOX-552.)





> During my investigations concerning [1] I've learned that displaying and
> printing can be quite different things. Both use the same PDFBox rendering
> code, but with different Graphics2D implementations (the JDK uses different
> classes for displaying and printing). Furthermore, the result may depend on
> the printer driver, OS and JDK in use.
> What environment are you using?


I feel you're onto something about the distinction between printer vs 
display draw code.


I'm running Windows XP, JDK 1.6.0_17, and printing to the HP LaserJet 4.




[jira] Created: (PDFBOX-604) Various text extraction performance improvements

2010-01-15 Thread Jukka Zitting (JIRA)
Various text extraction performance improvements


 Key: PDFBOX-604
 URL: https://issues.apache.org/jira/browse/PDFBOX-604
 Project: PDFBox
  Issue Type: Improvement
  Components: Text extraction
Reporter: Jukka Zitting


Even after Mel's recent patches I've found a number of small performance 
bottlenecks that we could get rid of.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] performance issue flood

2010-01-15 Thread Jukka Zitting
Hi,

On Fri, Jan 15, 2010 at 8:30 AM, Philipp Koch  wrote:
> thanks a lot for your performance optimization contributions.

+1 Good stuff!

> what factor of (overall) speedup is to be expected?

I ran some simple tests and it looks like PDF loading (PDDocument.load
on a File) is now about 20% faster and text extraction
(PDFTextStripper.writeText on an already loaded PDDocument and a dummy
writer) about 30% faster than before Mel's patches and my additional
improvements.
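
A comparison like this can be reproduced with a small timing harness. The
following is a minimal sketch, not the test that was actually run: the class
name SpeedupSketch, its methods and the placeholder workload are all
hypothetical, and "N% faster" is assumed here to mean an N% reduction in
elapsed time. The PDFBox calls are left out so that the sketch runs with the
JDK alone; in practice the Runnable would wrap the load or text extraction
step.

```java
import java.util.concurrent.TimeUnit;

final class SpeedupSketch {

    /** Runs the task untimed for warm-up, then returns the mean time per timed run in nanoseconds. */
    static long meanNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();                    // let the JIT settle before measuring
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / runs;
    }

    /** Percentage reduction in elapsed time going from beforeNanos to afterNanos. */
    static double percentFaster(long beforeNanos, long afterNanos) {
        return 100.0 * (beforeNanos - afterNanos) / beforeNanos;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): stands in for the operation being measured.
        Runnable workload = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        long mean = meanNanos(workload, 5, 20);
        System.out.println("mean per run (microseconds): " + TimeUnit.NANOSECONDS.toMicros(mean));
        System.out.println("1000 ns -> 800 ns is " + percentFaster(1000, 800) + "% faster");
    }
}
```

Measuring the old and new builds with the same input file and the same
warm-up count keeps the two means comparable.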

PDFBox is still quite a bit slower than I'd hope, but this is already
a pretty good improvement.

PS. Mel, if you come up with other improvements, it'll be easier for
us to review and apply the changes if you submit them as patches
instead of full copies of the modified files. To create a patch, use
"svn diff" in your checkout.

BR,

Jukka Zitting