from:"\"Maruan Sahyoun\""

Re: [VOTE] Release Apache PDFBox 2.0.29

2023-06-30 Thread Maruan Sahyoun

+1
Maruan

> On 30.06.2023 at 18:16, Andreas Lehmkühler wrote:
> 
> Hi,
> 
> is there anybody else who is able to spend some cycles on looking into this 
> release? There is at least one vote missing and about 24 hours to go ...
> 
> Thanks in advance
> 
> Andreas
> 
>> On 28.06.23 at 18:54, Andreas Lehmkühler wrote:
>> Hi,
>> 
>> a candidate for the PDFBox 2.0.29 release is available at:
>> 
>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.29/
>> 
>> The release candidate is a zip archive of the sources in:
>> 
>> https://svn.apache.org/repos/asf/pdfbox/tags/2.0.29/
>> 
>> The SHA-512 checksum of the archive is 
>> d33146e9c9a74de57e9a24a1bbf1967a145f6b4883814533b003115ff0c65930a4a4bac427be3af18b07ce08a7afa08bf19d1dbc7b0a79c788bb02429de38d77.
>> 
>> Please vote on releasing this package as Apache PDFBox 2.0.29.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 PDFBox PMC votes are cast.
>> 
>> [ ] +1 Release this package as Apache PDFBox 2.0.29
>> [ ] -1 Do not release this package because...
>> 
>> 
>> Here is my +1
>> 
>> Andreas
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
>> For additional commands, e-mail: dev-h...@pdfbox.apache.org
>> 
> 
> 
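
The SHA-512 checksum quoted in the vote mail can be checked locally before casting a vote. A minimal sketch in Java, assuming the archive has already been downloaded; the class name Sha512Check and its two command-line arguments are made up for illustration and are not part of PDFBox or of the release tooling:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha512Check {

    /** Streams the input through SHA-512 and returns the digest as lowercase hex. */
    public static String sha512Hex(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** args[0]: path to the downloaded archive, args[1]: checksum from the vote mail. */
    public static void main(String[] args) throws Exception {
        // Drop a trailing full stop in case the checksum was copied with it.
        String expected = args[1].trim().toLowerCase().replaceAll("\\.$", "");
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha512Hex(in);
            System.out.println(actual.equals(expected)
                    ? "checksum OK"
                    : "checksum MISMATCH, got " + actual);
        }
    }
}
```

Run it with the archive path and the checksum string from the mail as arguments. A matching checksum only shows that the download is intact; it says nothing about whether the sources build.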


Re: Apache PDFBox Board Report July 2023 due

2023-07-11 Thread Maruan Sahyoun

+1
Maruan

> On 11.07.2023 at 08:16, Andreas Lehmkühler wrote:
> 
> Hi,
> 
> find attached a quick draft of the board report we're expected to submit this 
> month. It's based upon the report wizard template which can be found at [1]
> 
> Any comments or additions are appreciated ...
> 
> 
> ## Description:
> The mission of PDFBox is the creation and maintenance of software related to a 
> Java library for working with PDF documents
> 
> ## Project Status:
> Current project status: Ongoing with moderate activity
> Issues for the board: There are no issues requiring board attention at this 
> time
> 
> ## Membership Data:
> Apache PDFBox was founded 2009-10-21 (14 years ago)
> There are currently 21 committers and 21 PMC members in this project.
> The Committer-to-PMC ratio is 1:1.
> 
> Community changes, past quarter:
> - No new PMC members. Last addition was Matthäus Mayer on 2017-10-16.
> - No new committers. Last addition was Joerg O. Henne on 2017-10-09.
> 
> ## Project Activity:
> Recent releases:
> 
> 2.0.29 was released on 2023-07-01.
> 2.0.28 was released on 2023-04-13.
> 2.0.27 was released on 2022-09-29.
> 
> ## Community Health:
> - there is a steady stream of contributions, bug reports and questions on the 
> mailing lists
> - there are a lot of refactorings, improvements and bugfixes
> - 2.0.29 was released a few days ago
> - the new release consists of small improvements and bug fixes. Two of the 
> latter fix two regressions introduced/revealed in the former 2.0.28 release
> - a vote for the first beta version of PDFBox 3.0.0 is ongoing
> 
> 


Re: [VOTE] Release Apache PDFBox 3.0.0-beta1

2023-07-11 Thread Maruan Sahyoun

+1
Maruan 

> On 11.07.2023 at 07:56, Andreas Lehmkühler wrote:
> 
> Hi,
> 
> a candidate for the PDFBox 3.0.0-beta1 release is available at:
> 
> https://dist.apache.org/repos/dist/dev/pdfbox/3.0.0-beta1/
> 
> The release candidate is a zip archive of the sources in:
> 
> https://svn.apache.org/repos/asf/pdfbox/tags/3.0.0-beta1/
> 
> The SHA-512 checksum of the archive is 
> 07a697c6d31854a74eb0452b792644da33fe5e0f3954040465498869059d8a47b11285e6c1472ab8f7c0be76373b86cfd0d1d5963fc1ed9c08ffbad1aadc5651.
> 
> Please vote on releasing this package as Apache PDFBox 3.0.0-beta1.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 PDFBox PMC votes are cast.
> 
> [ ] +1 Release this package as Apache PDFBox 3.0.0-beta1
> [ ] -1 Do not release this package because...
> 
> 
> Here is my +1
> 
> Andreas
> 
> 


Re: [VOTE] Release Apache PDFBox 3.0.2

2024-03-11 Thread Maruan Sahyoun

+1
Maruan 

> On 11.03.2024 at 20:24, Andreas Lehmkühler wrote:
> 
> Hi,


Re: [VOTE] Release Apache PDFBox 2.0.31

2024-03-21 Thread Maruan Sahyoun

+1
Maruan 

> On 21.03.2024 at 18:53, Andreas Lehmkühler wrote:
> 
> Hi,
> 
> a candidate for the PDFBox 2.0.31 release is available at:
> 
> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.31/
> 
> The release candidate is a zip archive of the sources in:
> 
> https://svn.apache.org/repos/asf/pdfbox/tags/2.0.31/
> 
> The SHA-512 checksum of the archive is 
> c231ccebf918b8aa0dc80d3162fc88ff4ab78d586bcead0ef0cc44a6cab4f6d455112497ad866901e3948a6c76320d19487c3be7e7c1e66c5e2733de82fe3f09.
> 
> Please vote on releasing this package as Apache PDFBox 2.0.31.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 PDFBox PMC votes are cast.
> 
> [ ] +1 Release this package as Apache PDFBox 2.0.31
> [ ] -1 Do not release this package because...
> 
> 
> Here is my +1
> 
> Andreas
> 
> 


Re: [VOTE] Release Apache PDFBox 2.0.31

2024-03-21 Thread Maruan Sahyoun

IMHO this is not a show stopper


> On 22.03.2024 at 06:54, Andreas Lehmkühler wrote:
> 
> 
> 
>> On 21.03.24 at 20:07, Tim Allison wrote:
>> In the parent pom.xml in the zip file, there's a "release" submodule
>> specified. However, there's no release directory in the src zip that would
>> match: https://svn.apache.org/repos/asf/pdfbox/tags/2.0.31/release/
>> Is that expected?
> Hmmm, of course not. Thanks for the pointer.
> 
> I've rearranged the structure in [1] and never realized that the empty 
> "release" subproject won't show up in the sources-zip. Obviously nobody tried 
> to build one of the last releases from the sources-zip.
> 
> However, I'm going to look into this.
> 
> Is this a showstopper, shall I cancel the release? Or do we just live with 
> another/the last release with that issue?
> 
> 
> [1] https://issues.apache.org/jira/browse/PDFBOX-5699
> 
> 
>>> On Thu, Mar 21, 2024 at 1:53 PM Andreas Lehmkühler wrote:
>>> Hi,
>>> 
>>> a candidate for the PDFBox 2.0.31 release is available at:
>>> 
>>>  https://dist.apache.org/repos/dist/dev/pdfbox/2.0.31/
>>> 
>>> The release candidate is a zip archive of the sources in:
>>> 
>>>  https://svn.apache.org/repos/asf/pdfbox/tags/2.0.31/
>>> 
>>> The SHA-512 checksum of the archive is
>>> 
>>> c231ccebf918b8aa0dc80d3162fc88ff4ab78d586bcead0ef0cc44a6cab4f6d455112497ad866901e3948a6c76320d19487c3be7e7c1e66c5e2733de82fe3f09.
>>> 
>>> Please vote on releasing this package as Apache PDFBox 2.0.31.
>>> The vote is open for the next 72 hours and passes if a majority of at
>>> least three +1 PDFBox PMC votes are cast.
>>> 
>>>  [ ] +1 Release this package as Apache PDFBox 2.0.31
>>>  [ ] -1 Do not release this package because...
>>> 
>>> 
>>> Here is my +1
>>> 
>>> Andreas
>>> 
>>> 
>>> 
> 
> 


[DISCUSS] PDFBox and Exception handling

2014-02-13 Thread Maruan Sahyoun

Hi,

what do you think of having exception handling in pdfbox where people could 
define their own handlers? Something similar to

https://camel.apache.org/exception-clause.html

The benefit would be that we could pass the context e.g. during PDF parsing and 
the handler could return something which is then taken as the input. In 
addition to that maybe we can think about having some additional types of 
exceptions instead of mostly IOException to support that.  

BR
Maruan Sahyoun

Re: [DISCUSS] GSoC Participation

2014-02-13 Thread Maruan Sahyoun

There were several ideas floating around without a real consensus. From my 
perspective PDFbox would benefit most if missing pieces could be implemented:

- shading types as Tilman suggested
- signature algorithms
- support for different character sets during PDF generation
- PDF optimization e.g. remove duplicate resources when merging PDFs or no 
longer needed ones during splitting a PDF
- PoC work could also be feasible of how to implement different PDF levels and 
standards in a similar, extendable manner.

I’d rather see us completing PDF core features than adding new functionality 
like table recognition, high level PDF creation API or OCR interface although 
these would be very beneficial functionalities.

In addition working on the documentation might be something although not for a 
‚core‘ developer. 

One question which needs to be answered is who would act as a mentor?

BR
Maruan Sahyoun


On 13.02.2014 at 09:19, Andreas Lehmkühler wrote:

> Hi,
> 
> for those who are still interested in GSoC, [1] has some information on how
> to participate. According to the mail it may be too late, but I would give
> it a try. I've shared a private link, only available to PMC-members, as some
> of the information seems to be private.
> 
> BR
> Andreas Lehmkühler
> 
> [1]
> https://mail-search.apache.org/pmc/private-arch/pdfbox-private/201401.mbox/%3c22705bfe-be29-492a-be12-0749317fb...@apache.org%3E
> 
> 
>> On 11 February 2014 at 23:52, John Hewson wrote:
>> 
>> 
>> The ideas are supposed to be starting points for students to make their own
>> proposal, so give them some ideas for expanding/reducing the scope and they
>> can choose themselves.
>> 
>> -- John
>> 
>> On 11 Feb 2014, at 13:47, Tilman Hausherr  wrote:
>> 
>>> It's unclear what the "size" of a participation must be. What I'd like to
>>> have is someone to implement shading types 6 and 7, and I think it would be
>>> 1-2 weeks of work. This would be perfect for a math student, or a computer
>>> science student who is specializing in graphics. My own math is from school
>>> 30 years ago and we never did Bézier curves, tensor-products and Bernstein
>>> polynomials so I can't do it without learning the math first.
>>> 
>>> Tilman
>>

Re: [DISCUSS] PDFBox and Exception handling

2014-02-13 Thread Maruan Sahyoun

Hi John,

currently pdfbox mostly throws IOExceptions where the user of the lib is not 
able to do something about it. 

Some of these exceptions could occur because a file was not found etc. So 
that’s ok. Others might occur because objects are not at a certain position. 
There are workarounds for some of these in pdfbox e.g. if %%EOF is not the 
last entry in a PDF. Thus users are dependent on us putting in the workarounds 
to handle such situations. 

Now let’s assume there is a situation where an object is not at a certain 
location, or a specific string is missing …. what if we throw an exception 
where one could register a handler. We pass some kind of context e.g. lexer, 
file position, token …. and the user can handle the exception and „enrich“ the 
content or pass the correct information. The exception is then resolved and the 
process can continue.

In addition to that we are able to extend from a strictly conformant parsing to 
a relaxed parsing by using the same mechanism thus having the workarounds not 
in the ‚core‘ parser.
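
The idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: ParseContext, ParseErrorHandler and ToyParser are hypothetical names invented for this sketch, none of them exist in PDFBox, and the "parser" merely checks for the %%EOF marker mentioned above.

```java
import java.io.IOException;
import java.util.Optional;

public class HandlerSketch {

    /** Hypothetical context handed to a handler: what was expected and what was found. */
    public static final class ParseContext {
        public final int offset;      // start of the offending line in the trimmed input
        public final String expected;
        public final String found;

        ParseContext(int offset, String expected, String found) {
            this.offset = offset;
            this.expected = expected;
            this.found = found;
        }
    }

    /** Hypothetical callback: return a replacement token to continue, or empty to give up. */
    public interface ParseErrorHandler {
        Optional<String> recover(ParseContext context);
    }

    /** Toy parser that only checks the end-of-file marker. */
    public static final class ToyParser {
        private final ParseErrorHandler handler;

        public ToyParser(ParseErrorHandler handler) {
            this.handler = handler;
        }

        public String readEofMarker(String content) throws IOException {
            String trimmed = content.trim();
            int start = trimmed.lastIndexOf('\n') + 1;
            String lastLine = trimmed.substring(start);
            if (lastLine.equals("%%EOF")) {
                return lastLine;
            }
            // Non-conformant input: let the registered handler decide how to go on.
            ParseContext context = new ParseContext(start, "%%EOF", lastLine);
            return handler.recover(context).orElseThrow(() -> new IOException(
                    "expected %%EOF at offset " + start + " but found '" + lastLine + "'"));
        }
    }

    public static void main(String[] args) throws IOException {
        String broken = "startxref\n1234\n";   // the %%EOF marker is missing

        // Relaxed parsing: the handler supplies the expected token.
        ToyParser lenient = new ToyParser(ctx -> Optional.of(ctx.expected));
        System.out.println(lenient.readEofMarker(broken));

        // Strict parsing: the handler declines, so the IOException reaches the caller.
        ToyParser strict = new ToyParser(ctx -> Optional.empty());
        try {
            strict.readEofMarker(broken);
        } catch (IOException e) {
            System.out.println("strict: " + e.getMessage());
        }
    }
}
```

The same parser code serves both modes; only the registered handler differs, which is the point of keeping workarounds out of the core.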

BR
Maruan Sahyoun

On 13.02.2014 at 09:44, John Hewson wrote:

> I'm not sure I understand what you mean, the Camel examples are very complex 
> indeed. A quick concrete example of what you're after would help greatly.
> 
> -- John
> 
>> On 13 Feb 2014, at 00:20, Maruan Sahyoun  wrote:
>> 
>> Hi,
>> 
>> what do you think of having an exception handling in pdfbox where people 
>> could define their own handlers. Something similar to
>> 
>> https://camel.apache.org/exception-clause.html
>> 
>> The benefit would be that we could pass the context e.g. during PDF parsing 
>> and the handler could return something which is then taken as the input. In 
>> addition to that maybe we can think about having some additional types of 
>> exceptions instead of mostly IOException to support that.  
>> 
>> BR
>> Maruan Sahyoun
>>

Re: [DISCUSS] PDFBox and Exception handling

2014-02-13 Thread Maruan Sahyoun

John

On 13.02.2014 at 18:50, John Hewson wrote:

> Maruan,
> 
>> Now let’s assume there is a situation where an object is not at a certain 
>> location, or a specific string is missing …. what if we throw an exception 
>> where one could register a handler. We pass some kind of context e.g. lexer, 
>> file position, token …. and the user can handle the exception and „enrich“ 
>> the content or pass the correct information.
> 
> The idea sounds reasonable in theory, but the more I reflect on it the more I 
> think that we should assume that the user is making use of PDFBox because 
> they don’t want to have to parse the PDF file themselves. I can’t think of an 
> example where the knowledge of how to correct some invalid PDF wouldn’t be 
> better off existing within PDFBox itself, rather than in user code.

Of course they don’t want to parse it themselves. They can expect that PDFBox 
can handle a valid PDF file. But in case a file is invalid for whatever reason 
the only options are to either wait until we include a workaround or put it in 
themselves. The idea is to have an entry point. What’s the benefit of an 
exception when one can’t do anything about it? And if you don’t want to write 
your handler you are not forced to do so. 
 
> 
> From a technical standpoint, exposing the internal parser context to the user 
> seems particularly problematic: the internal implementation details which are 
> part of the context now become part of PDFBox’s public API which needs to be 
> kept stable between major releases. How is the user to resolve a non-trivial 
> exception and allow parsing to continue in a manner which leaves the 
> internals of the parser in a consistent state? If we don’t know how users are 
> resolving exceptions out in the real world, how can we be sure that changes 
> we make to the parser later won’t break their code?

One can only assume that a documented API is stable. As long as this is the 
case why should it break their code. Of course if a different file is causing a 
similar exception which will be dealt with by the exception handler and the 
code is not able to deal with it ...

> 
>> In addition to that we are able to extend from a strictly conformant parsing 
>> to a relaxed parsing by using the same mechanism thus having the workarounds 
>> not in the ‚core‘ parser.
> 
> 
> My suggestion would be to either subclass the core parser or pass it a 
> “conformance level” argument, e.g. PDF_1_5 or PDF_X. I don’t think any 
> external error handling/recovery mechanism is going to work in practice, 
> especially if that means generating thousands of exceptions when given a bad 
> content stream.
> 

It’s not about supporting different standards - that’s a different thing 
(currently PDFBox doesn’t have a concept of applying standards or versions - 
functions are either available or not, regardless of when they became part of 
the PDF spec). It’s about having a core which handles conformant files and an 
extension which handles workarounds for nonconformant files. Currently that’s 
all within the code - sometimes marked, sometimes not - which makes it 
difficult to rewrite the parser. As you already found out sometimes a fix was 
made to handle a single occurrence of a file and the file itself might no 
longer exist.


> -- John
> 
> On 13 Feb 2014, at 03:24, Maruan Sahyoun  wrote:
> 
>> Hi John,
>> 
>> currently pdfbox mostly throws IOExceptions where the user of the lib is not 
>> able to do something about it. 
>> 
>> Some of these exceptions could occur because a file was not found etc. So 
>> that’s ok. Others might occur because objects are not at a certain position. 
>> There are workarounds for some of these in pdfbox e.g. if %%EOF is not the 
>> last entry in a PDF. Thus users are dependent on us putting in the 
>> workarounds to handle such situations. 
>> 
>> Now let’s assume there is a situation where an object is not at a certain 
>> location, or a specific string is missing …. what if we throw an exception 
>> where one could register a handler. We pass some kind of context e.g. lexer, 
>> file position, token …. and the user can handle the exception and „enrich“ 
>> the content or pass the correct information. The exception is then resolved 
>> and the process can continue.
>> 
>> In addition to that we are able to extend from a strictly conformant parsing 
>> to a relaxed parsing by using the same mechanism thus having the workarounds 
>> not in the ‚core‘ parser.
>> 
>> BR
>> Maruan Sahyoun
>> 
>> On 13.02.2014 at 09:44, John Hewson wrote:
>> 
>>> I'm not sure I understand what you mean, the Camel examples are very 
>>> complex indeed.

Re: [DISCUSS] PDFBox and Exception handling

2014-02-14 Thread Maruan Sahyoun

hat would mean that every improvement or 
bugfix which changes the result breaks the contract. E.g. lets say that we 
extract additional text, or we no longer extract text that should not have been 
extracted or we render a PDF differently …. 

So I do get and understand your point - I don’t share your view though. 


> 
>> It’s not about supporting different standards […] It’s about having a core 
>> which handles conformant files and an extension which handles workarounds 
>> for nonconformant files. 
> 
> A commonly used approach to parsing programming languages is to have a core 
> language which is small, easily parsed and with an AST which is easy to 
> manipulate. On top of that is another parser which handles all of the 
> syntactic sugar of the language, transforming a complex concrete AST into a 
> simple core AST. Perhaps PDFBox could take a similar approach with 
> ConformingParser having a NonConformingParser subclass which is capable of 
> pre-processing bad PDF files before they reach the core parser. The actual 
> implementation may be more subtle than this, perhaps with some back-and-forth 
> between the conforming and non-conforming parsers, so that when the 
> conforming parser encounters an error it can call a protected method which in 
> ConformingParser would throw an error but in NonConformingParser would 
> perform a recovery, as you proposed. But by using protected methods we avoid 
> the maintainability problem caused by making the error recovery mechanism 
> public.
> 
> What do you think?

This is a good and valid approach, but doesn’t address the intention I had.
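
For comparison, John's subclassing suggestion could look roughly like the sketch below. It is an illustration only: the two class names are taken from his mail, but the methods parseVersion and recoverHeader, and the choice of the %PDF- header as the thing being checked, are assumptions made for this example and are not PDFBox code.

```java
import java.io.IOException;

public class ParserSketch {

    /** Strict parser: accepts only input that starts with the %PDF- header. */
    static class ConformingParser {

        String parseVersion(String firstLine) throws IOException {
            if (!firstLine.startsWith("%PDF-")) {
                // Hand over to the recovery hook; the strict parser's hook always throws.
                firstLine = recoverHeader(firstLine);
            }
            return firstLine.substring("%PDF-".length()).trim();
        }

        /** Recovery hook. Protected, so it never becomes part of the public API. */
        protected String recoverHeader(String firstLine) throws IOException {
            throw new IOException("header does not start with %PDF-: " + firstLine);
        }
    }

    /** Relaxed parser: skips garbage in front of the header if a header can be found. */
    static class NonConformingParser extends ConformingParser {

        @Override
        protected String recoverHeader(String firstLine) throws IOException {
            int index = firstLine.indexOf("%PDF-");
            if (index < 0) {
                return super.recoverHeader(firstLine); // nothing to recover from
            }
            return firstLine.substring(index);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(new NonConformingParser().parseVersion("junk%PDF-1.4"));
        try {
            new ConformingParser().parseVersion("junk%PDF-1.4");
        } catch (IOException e) {
            System.out.println("strict: " + e.getMessage());
        }
    }
}
```

Here the workaround lives in a subclass shipped with the library, so users still depend on the project to add each recovery. That is the difference from the handler proposal.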


> 
> -- John
> 
> On 13 Feb 2014, at 10:57, Maruan Sahyoun  wrote:
> 
>> John
>> 
>> On 13.02.2014 at 18:50, John Hewson wrote:
>> 
>>> Maruan,
>>> 
>>>> Now let’s assume there is a situation where an object is not at a certain 
>>>> location, or a specific string is missing …. what if we throw an exception 
>>>> where one could register a handler. We pass some kind of context e.g. 
>>>> lexer, file position, token …. and the user can handle the exception and 
>>>> „enrich“ the content or pass the correct information.
>>> 
>>> The idea sounds reasonable in theory, but the more I reflect on it the more 
>>> I think that we should assume that the user is making use of PDFBox because 
>>> they don’t want to have to parse the PDF file themselves. I can’t think of 
>>> an example where the knowledge of how to correct some invalid PDF wouldn’t 
>>> be better off existing within PDFBox itself, rather than in user code.
>> 
>> Of course they don’t want to parse it themselves. They can expect that 
>> PDFBox can handle a valid PDF file. But in case a file is invalid for 
>> whatever reason the only options are to either wait until we include a 
>> workaround or put it in themselves. The idea is to have an entry point. 
>> What’s the benefit of an exception when one can’t do anything about it? And 
>> if you don’t want to write your handler you are not forced to do so. 
>> 
>>> 
>>> From a technical standpoint, exposing the internal parser context to the 
>>> user seems particularly problematic: the internal implementation details 
>>> which are part of the context now become part of PDFBox’s public API which 
>>> needs to be kept stable between major releases. How is the user to resolve 
>>> a non-trivial exception and allow parsing to continue in a manner which 
>>> leaves the internals of the parser in a consistent state? If we don’t know 
>>> how users are resolving exceptions out in the real world, how can we be 
>>> sure that changes we make to the parser later won’t break their code?
>> 
>> One can only assume that a documented API is stable. As long as this is the 
>> case why should it break their code. Of course if a different file is 
>> causing a similar exception which will be dealt with by the exception 
>> handler and the code is not able to deal with it ...
>> 
>>> 
>>>> In addition to that we are able to extend from a strictly conformant 
>>>> parsing to a relaxed parsing by using the same mechanism thus having the 
>>>> workarounds not in the ‚core‘ parser.
>>> 
>>> 
>>> My suggestion would be to either subclass the core parser or pass it a 
>>> “conformance level” argument, e.g. PDF_1_5 or PDF_X. I don’t think any 
>>> external error handling/recovery mechanism is going to work in practice, 
>>> especially if that means generating thousands of exceptions when given a 
>>> bad content stream.

Re: [DISCUSS] PDFBox and Exception handling

2014-02-16 Thread Maruan Sahyoun

Hi Fred,

unfortunately the attachment didn't make it through due to restrictions of the 
mailing list - could you make it available somewhere on a public site?

BR

Maruan Sahyoun

> On 16.02.2014 at 01:04, Fred Hansen wrote:
> 
> 
> Just in case you're not tired of exceptions, I've written the attached. It 
> concludes that the right-thing-to-do is to examine individually each throw 
> statement.
>

Re: [DISCUSS] PDFBox and Exception handling

2014-02-16 Thread Maruan Sahyoun

Hi Fred,

thank you for putting down your thoughts, very helpful.

BR
Maruan Sahyoun

> On 16.02.2014 at 23:36, Fred Hansen wrote:
> 
> I've converted the attachment to a web page:
> http://physpics.com/Java/Notes/ExceptionHandling.php
> 
> 
> 
> 
>> ____________
>> From: Maruan Sahyoun 
>> To: "dev@pdfbox.apache.org"  
>> Sent: Sunday, February 16, 2014 3:36 AM
>> Subject: Re: [DISCUSS] PDFBox and Exception handling
>> 
>> 
>> Hi Fred,
>> 
>> unfortunately the attachment didn't make it through due to restrictions of 
>> the mailing list - could you make it available somewhere on a public site?
>> 
>> BR
>> 
>> Maruan Sahyoun
>> 
>> 
>>> On 16.02.2014 at 01:04, Fred Hansen wrote:
>>> 
>>> 
>>> Just in case you're not tired of exceptions, I've written the attached. It 
>>> concludes that the right-thing-to-do is to examine individually each throw 
>>> statement.
>>

PDFBox and GitHub

2014-02-17 Thread Maruan Sahyoun

Hi,

according to Infra there is a better GitHub integration available as an 
opt-in feature

https://blogs.apache.org/infra/entry/improved_integration_between_apache_and

Shall we use it?

Maruan Sahyoun

pdfbox.io - which should I use

2014-02-18 Thread Maruan Sahyoun

Hi,

there are currently a number of different options to use as a base for a 
potential new parser/lexer. The ones currently in use are

BaseParser: 
import org.apache.pdfbox.io.PushBackInputStream;
import org.apache.pdfbox.io.RandomAccess;

PDFParser (additional):
import org.apache.pdfbox.io.RandomAccess;

NonSequentialParser:
import org.apache.pdfbox.io.PushBackInputStream;
import org.apache.pdfbox.io.RandomAccess;
import org.apache.pdfbox.io.RandomAccessBuffer;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;

There are some additional Classes/Interfaces in the io package e.g. 
RandomAccessBufferedFileInputStream implementing RandomAccessRead

Any preferences, ideas of consolidating this? 

Currently I’m using RandomAccessBufferedFileInputStream with some additional 
implementations of RandomAccessRead to support reading from a ByteArray for 
testing purposes.

BR

Maruan Sahyoun

Re: PDFBox and GitHub

2014-02-18 Thread Maruan Sahyoun

Hi,

On 18.02.2014 at 13:00, Andreas Lehmkühler wrote:

> Hi,
> 
>> On 17 February 2014 at 09:16, Maruan Sahyoun wrote:
>> 
>> 
>> Hi,
>> 
>> according to Infra there is a better GitHub integration available as an 
>> opt-in feature
>> 
>> https://blogs.apache.org/infra/entry/improved_integration_between_apache_and
>> 
>> Shall we use it?
> I'm not sure if I got the point. Is your idea to do the switch from svn to git
> or to use those
> opt in features with our readonly git mirror (is that possible)?

If I understood correctly that’s possible with the current setup. So I’m not 
proposing to switch to git from svn as part of that question.

> 
>> Maruan Sahyoun
> 
> BR
> Andreas Lehmkühler

BR
Maruan

Re: pdfbox.io - which should I use

2014-02-18 Thread Maruan Sahyoun

Yes, we could use RandomAccessRead as a base and subclasses to wrap NIO and 
others. 

Then the parsers would use RandomAccessRead

WDYT

Maruan Sahyoun

> On 18.02.2014 at 21:42, John Hewson wrote:
> 
> The streams used by BaseParser and PDFParser are sequential, so you can 
> ignore them.
> Use of PushBackInputStream in the non-sequential parser seems a little odd. 
> 
> We might want to think about getting rid of the classes in 
> org.apache.pdfbox.io and replacing
> them with classes from java.nio.channels. It looks like the PDFBox classes 
> pre-date NIO.
> With NIO we could use memory mapped files, which for large PDF files will 
> perform better
> than an InputStream.
> 
> -- John
> 
>> On 18 Feb 2014, at 03:53, Maruan Sahyoun  wrote:
>> 
>> Hi,
>> 
>> there are currently a number of different options to use as a base for a 
>> potential new parser/lexer. The ones currently in use are
>> 
>> BaseParser: 
>> import org.apache.pdfbox.io.PushBackInputStream;
>> import org.apache.pdfbox.io.RandomAccess;
>> 
>> PDFParser (additional):
>> import org.apache.pdfbox.io.RandomAccess;
>> 
>> NonSequentialParser:
>> import org.apache.pdfbox.io.PushBackInputStream;
>> import org.apache.pdfbox.io.RandomAccess;
>> import org.apache.pdfbox.io.RandomAccessBuffer;
>> import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
>> 
>> There are some additional Classes/Interfaces in the io package e.g. 
>> RandomAccessBufferedFileInputStream implementing RandomAccessRead
>> 
>> Any preferences, ideas of consolidating this? 
>> 
>> Currently I’m using RandomAccessBufferedFileInputStream with some additional 
>> implementations of RandomAccessRead to support reading from a ByteArray for 
>> testing purposes.
>> 
>> BR
>> 
>> Maruan Sahyoun
>

Re: pdfbox.io - which should I use

2014-02-18 Thread Maruan Sahyoun

Hi John,

I'd think that we would still need pdfbox.io as e.g. SeekableByteChannel doesn't 
give us an easy way of reading a single char (needed for parsing) but that 
would be a small wrapper so we don't need to handle that inside parsers. The 
reason is that data is read as a ByteBuffer which is a chunk of data.
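
Such a wrapper could be as small as the following sketch. ChannelByteReader is a hypothetical name chosen for this example; it does not implement PDFBox's RandomAccessRead interface, it only shows how single-byte reads can be layered on top of the chunked ByteBuffer reads of a SeekableByteChannel (Java 7 or later).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

public class ChannelByteReader {

    private final SeekableByteChannel channel;
    private final ByteBuffer buffer;

    public ChannelByteReader(SeekableByteChannel channel, int bufferSize) {
        this.channel = channel;
        this.buffer = ByteBuffer.allocate(bufferSize);
        this.buffer.limit(0); // start with an empty buffer so the first read() refills it
    }

    /** Returns the next byte as a value from 0 to 255, or -1 at the end of the channel. */
    public int read() throws IOException {
        if (!buffer.hasRemaining()) {
            buffer.clear();                      // make the whole buffer writable
            int count = channel.read(buffer);    // fetch the next chunk
            buffer.flip();                       // switch the buffer to reading
            if (count <= 0) {
                return -1;
            }
        }
        return buffer.get() & 0xFF;
    }

    /** Moves to an absolute position and discards whatever is still buffered. */
    public void seek(long position) throws IOException {
        channel.position(position);
        buffer.limit(0);
    }
}
```

A channel for a file can be obtained with java.nio.file.Files.newByteChannel(path); the parser then calls read() and seek() and never sees a ByteBuffer.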

Maruan Sahyoun

> On 19.02.2014 at 04:45, John Hewson wrote:
> 
> RandomAccessRead looks like it could be replaced with 
> java.nio.channels.SeekableByteChannel as implemented by 
> java.nio.channels.FileChannel.
> 
> -- John
> 
>> On 18 Feb 2014, at 12:50, Maruan Sahyoun  wrote:
>> 
>> Yes, we could use RandomAccessRead as a base and subclasses to wrap NIO and 
>> others. 
>> 
>> Then the parsers would use RandomAccessRead
>> 
>> WDYT
>> 
>> Maruan Sahyoun
>> 
>>> On 18.02.2014 at 21:42, John Hewson wrote:
>>> 
>>> The streams used by BaseParser and PDFParser are sequential, so you can 
>>> ignore them.
>>> Use of PushBackInputStream in the non-sequential parser seems a little odd. 
>>> 
>>> We might want to think about getting rid of the classes in 
>>> org.apache.pdfbox.io and replacing
>>> them with classes from java.nio.channels. It looks like the PDFBox classes 
>>> pre-date NIO.
>>> With NIO we could use memory mapped files, which for large PDF files will 
>>> perform better
>>> than an InputStream.
>>> 
>>> -- John
>>> 
>>>> On 18 Feb 2014, at 03:53, Maruan Sahyoun  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> there are currently a number of different options to use as a base for a 
>>>> potential new parser/lexer. The ones currently in use are
>>>> 
>>>> BaseParser: 
>>>> import org.apache.pdfbox.io.PushBackInputStream;
>>>> import org.apache.pdfbox.io.RandomAccess;
>>>> 
>>>> PDFParser (additional):
>>>> import org.apache.pdfbox.io.RandomAccess;
>>>> 
>>>> NonSequentialParser:
>>>> import org.apache.pdfbox.io.PushBackInputStream;
>>>> import org.apache.pdfbox.io.RandomAccess;
>>>> import org.apache.pdfbox.io.RandomAccessBuffer;
>>>> import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
>>>> 
>>>> There are some additional Classes/Interfaces in the io package e.g. 
>>>> RandomAccessBufferedFileInputStream implementing RandomAccessRead
>>>> 
>>>> Any preferences, ideas of consolidating this? 
>>>> 
>>>> Currently I’m using RandomAccessBufferedFileInputStream with some 
>>>> additional implementations of RandomAccessRead to support reading from a 
>>>> ByteArray for testing purposes.
>>>> 
>>>> BR
>>>> 
>>>> Maruan Sahyoun
>

Re: pdfbox.io - which should I use

2014-02-18 Thread Maruan Sahyoun

Hi John,

forgot that - SeekableByteChannel is Java 1.7

BR
Maruan Sahyoun
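
The memory-mapped files John mentions further down do not need SeekableByteChannel: FileChannel.map and MappedByteBuffer have been part of the JDK since Java 1.4. A minimal sketch, assuming a hypothetical helper class MappedRead that is not part of PDFBox (try-with-resources is used for brevity and itself needs Java 7):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {

    /**
     * Returns the byte at the given offset as a value from 0 to 255, read
     * through a read-only memory mapping of the whole file.
     * A single mapping cannot exceed Integer.MAX_VALUE bytes, so larger
     * files would have to be mapped in several windows.
     */
    public static int byteAt(File file, int offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer map =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return map.get(offset) & 0xFF; // absolute get, no position bookkeeping
        }
    }
}
```

Random access by absolute offset is what a non-sequential parser needs when it jumps to the positions listed in the cross-reference table.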

On 19.02.2014 at 04:45, John Hewson wrote:

> RandomAccessRead looks like it could be replaced with 
> java.nio.channels.SeekableByteChannel as implemented by 
> java.nio.channels.FileChannel.
> 
> -- John
> 
> On 18 Feb 2014, at 12:50, Maruan Sahyoun  wrote:
> 
>> Yes, we could use RandomAccessRead as a base and subclasses to wrap NIO and 
>> others. 
>> 
>> Then the parsers would use RandomAccessRead
>> 
>> WDYT
>> 
>> Maruan Sahyoun
>> 
>>> On 18.02.2014 at 21:42, John Hewson wrote:
>>> 
>>> The streams used by BaseParser and PDFParser are sequential, so you can 
>>> ignore them.
>>> Use of PushBackInputStream in the non-sequential parser seems a little odd. 
>>> 
>>> We might want to think about getting rid of the classes in 
>>> org.apache.pdfbox.io and replacing
>>> them with classes from java.nio.channels. It looks like the PDFBox classes 
>>> pre-date NIO.
>>> With NIO we could use memory mapped files, which for large PDF files will 
>>> perform better
>>> than an InputStream.
>>> 
>>> -- John
>>> 
>>>> On 18 Feb 2014, at 03:53, Maruan Sahyoun  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> there are currently a number of different options to use as a base for a 
>>>> potential new parser/lexer. The ones currently in use are
>>>> 
>>>> BaseParser: 
>>>> import org.apache.pdfbox.io.PushBackInputStream;
>>>> import org.apache.pdfbox.io.RandomAccess;
>>>> 
>>>> PDFParser (additional):
>>>> import org.apache.pdfbox.io.RandomAccess;
>>>> 
>>>> NonSequentialParser:
>>>> import org.apache.pdfbox.io.PushBackInputStream;
>>>> import org.apache.pdfbox.io.RandomAccess;
>>>> import org.apache.pdfbox.io.RandomAccessBuffer;
>>>> import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
>>>> 
>>>> There are some additional Classes/Interfaces in the io package e.g. 
>>>> RandomAccessBufferedFileInputStream implementing RandomAccessRead
>>>> 
>>>> Any preferences, ideas of consolidating this? 
>>>> 
>>>> Currently I’m using RandomAccessBufferedFileInputStream with some 
>>>> additional implementations of RandomAccessRead to support reading from a 
>>>> byte array (for testing purposes).
>>>> 
>>>> BR
>>>> 
>>>> Maruan Sahyoun
>>> 
>
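
A minimal sketch of the idea discussed in this thread: one small read-only
random access interface for the parsers, with one implementation backed by a
byte array (handy for tests) and one wrapping an NIO SeekableByteChannel
(Java 7+, e.g. a FileChannel). The names SeekableSource, ByteArraySource and
ChannelSource are hypothetical and are not the PDFBox RandomAccessRead API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

/** Hypothetical stand-in for a read-only random access interface. */
interface SeekableSource {
    /** Returns the next byte as 0..255, or -1 at the end of the data. */
    int read() throws IOException;

    /** Moves the read position to an absolute offset. */
    void seek(long position) throws IOException;

    /** Total number of bytes available. */
    long length() throws IOException;
}

/** Backs the interface with a byte array, e.g. for unit tests. */
final class ByteArraySource implements SeekableSource {
    private final byte[] data;
    private int pos;

    ByteArraySource(byte[] data) {
        this.data = data.clone(); // defensive copy
    }

    @Override
    public int read() {
        return pos < data.length ? (data[pos++] & 0xFF) : -1;
    }

    @Override
    public void seek(long position) {
        if (position < 0 || position > data.length) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        pos = (int) position;
    }

    @Override
    public long length() {
        return data.length;
    }
}

/** Backs the interface with any SeekableByteChannel, e.g. a FileChannel. */
final class ChannelSource implements SeekableSource {
    private final SeekableByteChannel channel;
    private final ByteBuffer one = ByteBuffer.allocate(1);

    ChannelSource(SeekableByteChannel channel) {
        this.channel = channel;
    }

    @Override
    public int read() throws IOException {
        one.clear();
        // A read that delivers no byte is treated as end of data; this is
        // adequate for blocking file channels, which is all the sketch targets.
        if (channel.read(one) < 1) {
            return -1;
        }
        return one.get(0) & 0xFF;
    }

    @Override
    public void seek(long position) throws IOException {
        channel.position(position);
    }

    @Override
    public long length() throws IOException {
        return channel.size();
    }
}
```

Reading one byte per channel call is slow; a real implementation would buffer
or, as suggested above, use a memory-mapped file for large PDFs.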

Re: Color Space Refactoring

2014-02-20 Thread Maruan Sahyoun

Hi John,

that's no doubt a great enhancement and a huge step forward.

BR

Maruan Sahyoun

> On 20.02.2014 at 11:17, John Hewson wrote:
> 
> Hi All
> 
> I have just committed a significant refactoring of color spaces to trunk. The 
> main purpose of the change is to encapsulate all color space handling code 
> within PDColorSpace and its subclasses. Until now there was color handling 
> code in many different places, including separate code for each image format. 
> Due to the close link between images, color, and performance it has been 
> necessary to rewrite much of the image reading code.
> 
> Here's a summary of the changes:
> 
> - PDCcitt has been removed, its reading capability has moved to 
> CCITTFaxFilter and writing capability has moved to CCITTFactory.
> 
> - PDJpeg has been removed. JPEG reading is now done by new code in DCTFilter 
> which correctly handles CMYK/YCCK color. This fixes various files where 
> images appeared like negatives. JPEG writing is done by new code in 
> JPEGFactory.
> 
> - cleaned up JBIG2Filter
> 
> - cleaned up JPXFilter, in particular calling decode() caused the stream 
> dictionary to be updated, which was unsafe. I've also added a special 
> JPXColorSpace which wraps the embedded AWT color space of a JPX 
> BufferedImage, this replaces the need for the awkward mapping of ColorSpace 
> to PDColorSpace.
> 
> - Added better error messages for missing JAI plugins (JPX, JBIG2). A special 
> exception, MissingImageReaderException is now thrown.
> 
> - PDXObjectForm has been renamed to PDFormXObject to match the PDF spec.
> - PDXObjectImage has been renamed in the same manner.
> - PDInlinedImage has been renamed to PDInlineImage for the same reason.
> - CCITTFaxDecodeFilter has been renamed to CCITTFaxFilter for consistency 
> with the other filters.
> 
> - ImageParameters has been removed, it was used to represent inline image 
> parameters which are now simply members of PDInlineImage.
> 
> - added PDColor which represents a color value, including patterns, it is 
> immutable for ease of use.
> 
> - removed PDColorState which was a container for both a color and a color 
> space, in almost every case it was used to represent a color and so has been 
> replaced by PDColor and occasionally PDColorSpace.
> 
> - moved most of the functionality of PDXObject into its subclasses
> 
> - rewrote almost all color handling code in all PDColorSpace subclasses, 
> including fixing the calculations for L*a*b*, DeviceN, and indexed color 
> spaces. 
> 
> - all color spaces now implement a toRGB(float[]) function for color 
> conversion, so external consumers of color spaces no longer have to know 
> about internals such as tint transforms.
> 
> - image color conversion is now performed in one operation, using 
> ColorConvertOp, rather than pixel-by-pixel, this speeds up ICC transforms by 
> many orders of magnitude. Color spaces now expose a special method 
> toImageRGB(Raster) for this purpose. This fixes some known performance issues 
> with certain files.
> 
> - updated Type1, Axial, Radial, and Gouraud shading contexts to call the new 
> toRGB functions. This is an interim measure, for better performance the color 
> conversion should instead be done using toImageRGB after the entire gradient 
> is drawn to the raster.
> 
> - creation of AWT Paint has been moved inside color spaces, hiding the 
> details from the caller. It is no longer possible to get an AWT Color from a 
> color space, only a Paint may be obtained.
> 
> - removed PDColorSpaceFactory and moved its functionality into PDColorSpace.
> 
> - moved some of the new shading and tiling pattern code to PDPattern so that 
> toPaint() is encapsulated in the color space.
> 
> - new PDImage interface which is implemented by both PDInlineImage and 
> PDImageXObject
> 
> - Image XObject image reading, masking and stencilling code has been 
> rewritten, resulting in the removal of CompositeImage.
> 
> - new SampledImageReader performs image reading for all formats, including 
> JPEG and CCITT. The format itself is simply a filter, as is the case in the 
> PDF spec. New image reading handles decode arrays, interpolation, and 
> conversion of all image types to efficient 8bpp rasters. This replaces 
> PDPixelMap as well as reading code from PDJpeg and PDCcitt. Handling of decode 
> arrays fixes various issues where images were inverted, especially inline 
> images in Type 3 fonts.
> 
> - removed SetNonStrokingICCBasedColor, SetNonStrokingIndexed, 
> SetNonStrokingPattern, SetNonStrokingSeparation, SetStrokingICCBasedColor, 
> SetStrokingIndexed, SetStrokingPattern, SetStrokingSeparation, and replaced 
> them with SetColor.
> 
> There will no doubt be some regressions, please post a comment on PDFBOX-1893 
> to let me know.
> 
> Thanks
> 
> -- John
> 
>
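
A minimal sketch of the "one operation instead of pixel-by-pixel" point in the
summary above, using only the JDK's java.awt.image.ColorConvertOp. The class
and method names (BulkColorConversion, toColorSpace) are hypothetical; this is
not the code committed to PDFBox, only an illustration of the technique.

```java
import java.awt.color.ColorSpace;
import java.awt.image.BufferedImage;
import java.awt.image.ColorConvertOp;

final class BulkColorConversion {
    private BulkColorConversion() {
    }

    /**
     * Converts a whole image to the target color space in one operation,
     * instead of converting each pixel individually.
     */
    static BufferedImage toColorSpace(BufferedImage source, ColorSpace target) {
        ColorConvertOp op = new ColorConvertOp(target, null); // no rendering hints
        // Passing null as destination lets the op allocate a compatible image.
        return op.filter(source, null);
    }
}
```

For example, toColorSpace(image, ColorSpace.getInstance(ColorSpace.CS_GRAY))
returns a gray version of an RGB image with the same dimensions.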

Re: covert to Image is very slow

2014-02-25 Thread Maruan Sahyoun

Antonio,

in addition to John’s comment:

is the 4 to 5 secs for the pure conversion (page.convertToImage) or the 
complete run? Could you time the portions?

BR 
Maruan Sahyoun

On 25.02.2014 at 19:20, John Hewson wrote:

> Antonio
> 
> For complex pages or pages with many images 4-5 seconds is to be expected.
> If the page in question is very simple there may be something PDFBox can fix
> to speed things up. If so, open an issue on the PDFBox JIRA and attach the PDF
> file via More > Attach Files.
> 
> Before doing so, please try the latest 2.0.0 trunk snapshot, we have recently 
> made a number of performance improvements.
> 
> Some general speed tips: use TYPE_INT_RGB or TYPE_INT_ARGB buffers,
> not *_BGR and try rendering at a lower resolution, if possible.
> 
> -- John
> 
> On 25 Feb 2014, at 06:15, Antonio González  wrote:
> 
>> Hi
>> 
>> When I convert a PDF file to an image it is very slow: 4 or 5 secs.
>> 
>> my code is
>> 
>> 
>> String fichero = "C:\\guiaalfresco.pdf";
>> PDDocument pdfDocument = null;
>> try {
>>     File file = new File(fichero);
>>     pdfDocument = PDDocument.load(file);
>>     List pages = pdfDocument.getDocumentCatalog().getAllPages();
>>     if (pages.size() > 0) {
>>         // Get the first page of the PDF
>>         PDPage page = (PDPage) pages.get(0);
>>         // Convert the PDF page to an image
>>         BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_BGR, 200);
>>         pdfDocument.close();
>>         File outputfile = new File("c:\\saved.png");
>>         BufferedImage imagen = resizeImage(image, 200);
>>         ImageIO.write(imagen, "png", outputfile);
>>     }
>> } catch (IOException e) {
>>     e.printStackTrace();
>> }
>
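
A minimal sketch of how the portions could be timed, as asked above. StageTimer
is a hypothetical helper, not part of PDFBox; it only relies on
System.nanoTime() and accumulates the elapsed time per named stage.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class StageTimer {
    // Insertion-ordered so the report lists stages in the order they first ran.
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs the work and adds its elapsed time to the named stage. */
    void time(String stage, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Elapsed nanoseconds per stage, read-only. */
    Map<String, Long> results() {
        return Collections.unmodifiableMap(nanosByStage);
    }

    /** One line per stage with the elapsed time in milliseconds. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByStage.entrySet()) {
            sb.append(String.format("%-10s %8.1f ms%n", e.getKey(), e.getValue() / 1_000_000.0));
        }
        return sb.toString();
    }
}
```

In the code above one would wrap the load, the convertToImage call, the resize
and the ImageIO.write call in separate time(...) calls. Calls that throw
IOException would need to be wrapped, e.g. in an UncheckedIOException, because
Runnable cannot throw checked exceptions.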

Re: Remove AWT Fonts

2014-03-04 Thread Maruan Sahyoun

Hi John,

what about just using the platform fonts? If not, LaTeX uses the URW++ 
fonts, which were made available under the http://www.latex-project.org/lppl 
license (the same fonts are used by Ghostscript). We could check whether that 
license is compatible with ours.

BR
Maruan Sahyoun

On 03.03.2014 at 21:20, John Hewson wrote:

> Hi All
> 
> I wanted to bring PDFBOX-1959 to the attention of the mailing list. PDFBox is 
> ready to leave AWT font rendering behind as the JDK's rendering has proven to 
> be buggy and we now have our own renderers for all font types in 2.0.0.
> 
> Before we can do this we need to ship a set of standard 14 fonts with PDFBox 
> as currently the system fonts are being used via AWT. We also need to provide 
> a mechanism for the user to supply their own external fonts for cases where 
> embedded fonts are missing. 
> 
> The main question is, what fonts should we ship? Some of the "free" fonts 
> I've seen render very poorly, any suggestions? Furthermore, are there fonts 
> under more restrictive licenses which we could ship? Apache does allow for 
> such files to be part of a project under certain conditions.
> 
> Also: Adobe has some font packs, e.g. Japanese, which we could point users 
> towards.
> 
> Cheers
> 
> -- John

Re: Remove AWT Fonts

2014-03-04 Thread Maruan Sahyoun

Hi John,

what I was having in mind is something similar to Apache FOP’s auto detect 
feature for fonts.

doc: https://xmlgraphics.apache.org/fop/1.1/fonts.html
code: 
http://svn.apache.org/viewvc/xmlgraphics/fop/trunk/src/java/org/apache/fop/fonts/autodetect/

For inclusion, these are some additional candidates:

https://fedorahosted.org/liberation-fonts/ (SIL licensed 
http://scripts.sil.org/cms/scripts/page.php?item_id=OFL-FAQ_web&_sc=1#68092c0f)
http://dejavu-fonts.org/ (http://dejavu-fonts.org/wiki/License)
Croscore fonts https://fedoraproject.org/wiki/I18N/Liberation_vs_Croscore_fonts


I’d think that avoiding bundling a set of fonts, and instead using OS fonts 
and/or allowing people to use their own, will help us in the long run: if the 
quality is not in line with the ones used by Adobe Reader, there will be 
additional questions/issues/bug reports we are not able to resolve.

BR

Maruan Sahyoun

On 04.03.2014 at 19:34, John Hewson wrote:

> Hi Maruan
> 
> Java provides access to platform fonts via AWT and does not reveal the paths 
> to the fonts
> which it finds, so it is not practical to use platform fonts without using 
> AWT. There have also
> been a number of problems with some unix platforms which lack some of the 
> standard 14
> fonts or which ship with poor quality substitutes. Ideally, PDFBox should 
> produce the same
> result irrespective of which platform it is running on, much like Adobe 
> Reader (excluding any
> missing embedded fonts, of course).
> 
> I’ve had poor experiences in the past with the Nimbus family of fonts from 
> URW++ but there
> are numerous factors (kerning, hinting, metrics, TTF vs Type 1) which may 
> have changed since
> then. We should check out how well these fonts compare with the standard 14 
> used by Adobe,
> in particular whether or not the metrics actually match (I know that it is 
> claimed that they do).
> 
> -- John
> 
> On 4 Mar 2014, at 05:48, Maruan Sahyoun  wrote:
> 
>> Hi John,
>> 
>> what about just using the platform fonts? If not, LaTeX uses the URW++ 
>> fonts, which were made available under the http://www.latex-project.org/lppl 
>> license (the same fonts are used by Ghostscript). We could check whether that 
>> license is compatible with ours.
>> 
>> BR
>> Maruan Sahyoun
>> 
>> On 03.03.2014 at 21:20, John Hewson wrote:
>> 
>>> Hi All
>>> 
>>> I wanted to bring PDFBOX-1959 to the attention of the mailing list. PDFBox 
>>> is ready to leave AWT font rendering behind as the JDK's rendering has 
>>> proven to be buggy and we now have our own renderers for all font types in 
>>> 2.0.0.
>>> 
>>> Before we can do this we need to ship a set of standard 14 fonts with 
>>> PDFBox as currently the system fonts are being used via AWT. We also need 
>>> to provide a mechanism for the user to supply their own external fonts for 
>>> cases where embedded fonts are missing. 
>>> 
>>> The main question is, what fonts should we ship? Some of the "free" fonts 
>>> I've seen render very poorly, any suggestions? Furthermore, are there fonts 
>>> under more restrictive licenses which we could ship? Apache does allow for 
>>> such files to be part of a project under certain conditions.
>>> 
>>> Also: Adobe has some font packs, e.g. Japanese, which we could point users 
>>> towards.
>>> 
>>> Cheers
>>> 
>>> -- John
>> 
>
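
A minimal sketch of directory-based font detection in the spirit of the
auto-detect feature mentioned above: since AWT does not reveal font file
paths, the idea is to scan known font directories directly. FontFileScanner
and the choice of file extensions are assumptions made for illustration; this
is neither FOP's nor PDFBox's actual code, and the directories to scan (for
example /usr/share/fonts or C:\Windows\Fonts) would have to be chosen per
platform.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

final class FontFileScanner {
    private FontFileScanner() {
    }

    /** Decides by file extension only; the file content is not inspected. */
    static boolean looksLikeFont(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".otf") || name.endsWith(".pfb");
    }

    /** Recursively collects candidate font files; missing directories are skipped. */
    static List<Path> scan(List<Path> directories) {
        List<Path> found = new ArrayList<>();
        for (Path dir : directories) {
            if (!Files.isDirectory(dir)) {
                continue;
            }
            try (Stream<Path> files = Files.walk(dir)) {
                files.filter(Files::isRegularFile)
                     .filter(FontFileScanner::looksLikeFont)
                     .sorted()
                     .forEach(found::add);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return found;
    }
}
```

A user-supplied font directory could simply be added to the list passed to
scan(...), which would cover the "let users register their own fonts" case.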

Re: Remove AWT Fonts

2014-03-04 Thread Maruan Sahyoun

John,

I don’t understand why we have to ship fonts. We didn’t ship fonts until now 
but were dependent on platform fonts through AWT. So the situation won’t 
change. 

For legal reasons we won’t be able to use the fonts Adobe uses, and I doubt that 
there are open source fonts which provide the same results (rendering quality, 
number of glyphs …). So I think a mechanism to use platform fonts and to let 
users register new ones, similar to our current font aliases, is a better and 
more reliable option. 

BR
Maruan Sahyoun

On 04.03.2014 at 21:28, John Hewson wrote:

> Maruan
> 
>> what I was having in mind is something similar to Apache FOP’s auto detect 
>> feature for fonts.
> 
> Yeah, this looks good, we could use this for finding missing embedded fonts.
> 
>> For inclusion these are some additional candidates
>> 
>> https://fedorahosted.org/liberation-fonts/ (SIL licensed 
>> http://scripts.sil.org/cms/scripts/page.php?item_id=OFL-FAQ_web&_sc=1#68092c0f)
>> http://dejavu-fonts.org/ (http://dejavu-fonts.org/wiki/License)
>> Croscore fonts 
>> https://fedoraproject.org/wiki/I18N/Liberation_vs_Croscore_fonts
> 
> Great, I’ll take a look.
> 
>> I’d think that avoiding bundling a set of fonts, and instead using OS fonts 
>> and/or allowing people to use their own, will help us in the long run: if the 
>> quality is not in line with the ones used by Adobe Reader, there will be 
>> additional questions/issues/bug reports we are not able to resolve.
> 
> We still need to ship a set of standard 14 fonts to solve the problems with 
> platforms which don’t
> have these fonts or have poor quality substitutes. The ideal solution is to 
> bundle our own high
> quality fonts and not depend on proprietary, platform-specific fonts. If we 
> can’t do this for some
> reason (e.g. quality), then we can reluctantly make use of platform fonts.
> 
> -- John
> 
> On 4 Mar 2014, at 11:45, Maruan Sahyoun  wrote:
> 
>> Hi John,
>> 
>> what I was having in mind is something similar to Apache FOP’s auto detect 
>> feature for fonts.
>> 
>> doc: https://xmlgraphics.apache.org/fop/1.1/fonts.html
>> code: 
>> http://svn.apache.org/viewvc/xmlgraphics/fop/trunk/src/java/org/apache/fop/fonts/autodetect/
>> 
>> For inclusion, these are some additional candidates:
>> 
>> https://fedorahosted.org/liberation-fonts/ (SIL licensed 
>> http://scripts.sil.org/cms/scripts/page.php?item_id=OFL-FAQ_web&_sc=1#68092c0f)
>> http://dejavu-fonts.org/ (http://dejavu-fonts.org/wiki/License)
>> Croscore fonts 
>> https://fedoraproject.org/wiki/I18N/Liberation_vs_Croscore_fonts
>> 
>> 
>> I’d think that avoiding bundling a set of fonts, and instead using OS fonts 
>> and/or allowing people to use their own, will help us in the long run: if the 
>> quality is not in line with the ones used by Adobe Reader, there will be 
>> additional questions/issues/bug reports we are not able to resolve.
>> 
>> BR
>> 
>> Maruan Sahyoun
>> 
>> On 04.03.2014 at 19:34, John Hewson wrote:
>> 
>>> Hi Maruan
>>> 
>>> Java provides access to platform fonts via AWT and does not reveal the 
>>> paths to the fonts
>>> which it finds, so it is not practical to use platform fonts without using 
>>> AWT. There have also
>>> been a number of problems with some unix platforms which lack some of the 
>>> standard 14
>>> fonts or which ship with poor quality substitutes. Ideally, PDFBox should 
>>> produce the same
>>> result irrespective of which platform it is running on, much like Adobe 
>>> Reader (excluding any
>>> missing embedded fonts, of course).
>>> 
>>> I’ve had poor experiences in the past with the Nimbus family of fonts from 
>>> URW++ but there
>>> are numerous factors (kerning, hinting, metrics, TTF vs Type 1) which may 
>>> have changed since
>>> then. We should check out how well these fonts compare with the standard 14 
>>> used by Adobe,
>>> in particular whether or not the metrics actually match (I know that it is 
>>> claimed that they do).
>>> 
>>> -- John
>>> 
>>> On 4 Mar 2014, at 05:48, Maruan Sahyoun  wrote:
>>> 
>>>> Hi John,
>>>> 
>>>> what about just using the platform fonts? If not, LaTeX uses the URW++ 
>>>> fonts, which were made available under the 
>>>> http://www.latex-project.org/lppl license (the same fonts are used by 
>>>> Ghostscript). We could check whether that license is compatible with ours.
>>>> 
>>>> BR
>>>> Maru

Re: IOException when merging PDF after increasing pushBackSize

2014-03-05 Thread Maruan Sahyoun

Hi James,

a) the file didn’t make it to the mailing list because of restrictions. Could 
you upload it to a public location?
b) try opening the document with PDDocument.loadNonSeq() in a simple test case 
- will it give errors?

BR
Maruan Sahyoun

On 05.03.2014 at 15:21, James Carter wrote:

> When attempting to merge the attached PDF with several other documents, PDFBox 
> throws the following exception: Could not push back 328764 bytes in order to 
> reparse stream. Try increasing push back buffer using system property 
> org.apache.pdfbox.baseParser.pushBackSize
> 
> The discussion on the JIRA ticket (PDFBOX-1920) mentioned that the PDF is not 
> well formed. Upon increasing the pushBackSize, the following error is seen:
> 
> Exception in thread "main" java.io.IOException: expected='endstream' 
> actual='' org.apache.pdfbox.io.PushBackInputStream@45cb0cdc
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:609)
> at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
> at 
> org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:196)
> at com.acme.MergePDF.runSmartService(MergePDF.java:52)
> at com.acme.MergePDF.main(MergePDF.java:68)
> 
> Is this reasonably something that PDFBox could handle, or does the ill formed 
> nature of the PDF leave this outside of what PDFBox would support?
> 
> Thanks,
> James
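
A minimal sketch of setting the push back size programmatically. The property
name is taken from the exception message quoted above; the helper class
PushBackSizeConfig is hypothetical. Whether PDFBox reads the property only
once is an assumption here, so the safe course is to set it before any PDFBox
class is used, or to pass it on the command line with -D instead.

```java
final class PushBackSizeConfig {
    /** Property name as quoted in the exception message. */
    static final String PROPERTY = "org.apache.pdfbox.baseParser.pushBackSize";

    private PushBackSizeConfig() {
    }

    /** Sets the system property; call this before loading any document. */
    static void setPushBackSize(int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("push back size must be positive: " + bytes);
        }
        System.setProperty(PROPERTY, Integer.toString(bytes));
    }

    /** Reads the property back, or returns the fallback if it is unset or not a number. */
    static int currentPushBackSize(int fallback) {
        return Integer.getInteger(PROPERTY, fallback);
    }
}
```

As the thread shows, a larger buffer only gets past the first error; it does
not repair a stream whose endstream marker is missing.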

PDFBox Documentation - Rendering

2014-03-10 Thread Maruan Sahyoun

Hi,

I’m currently enhancing the documentation for PDFBox with some more samples, 
code snippets etc. 

For the developer section, would it be possible for someone - maybe John or 
Tilman, as they are most familiar with the rendering code - to write up a small 
introductory article about how rendering works in PDFBox? A quick overview 
would be enough.

BR
Maruan Sahyoun

[DISCUSS] PDFBox and support for PDF versions, PDF standards

2014-03-10 Thread Maruan Sahyoun

Hi,

as I’m currently looking at the parsing part of PDFBox, one question came to 
mind: should there be more formal support for PDF versions and PDF standards 
such as PDF/A, PDF/UA …?

As of today PDFBox has no formal support for specific PDF versions in a way 
that a specific version can be enforced, validated ... The PDFBox PDF/A 
validation does a good job for PDF/A 1b but it can not be easily extended to 
other standards.

Do you think that there is a need for a more formal support of such standards 
and versions? This would influence some of the design decisions for the parser 
and affect the base objects.

BR
Maruan Sahyoun

Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

2014-03-10 Thread Maruan Sahyoun

Hi John,

it’s not only about PDF versions but about PDF versions and standards.

The base syntax has not changed. But the elements described by the base have.

BR
Maruan Sahyoun

On 10.03.2014 at 09:20, John Hewson wrote:

> Hi Maruan
> 
>> As of today PDFBox has no formal support for specific PDF versions in a way 
>> that a specific version can be enforced, validated ...
> 
> Perhaps that is because there is not much demand for this? Nowadays everyone 
> has instant access to the latest version of Adobe Reader so checking that a 
> PDF can be opened with a specific version of Adobe Reader is not that useful 
> anymore. There might be some niche cases, but I can’t think what they would 
> be. For cases where it’s important that a PDF file is valid then a format 
> such as PDF/A or PDF/X must be used instead as “vanilla” PDF is ambiguous.
> 
>> The PDFBox PDF/A validation does a good job for PDF/A 1b but it can not be 
>> easily extended to other standards.
> 
> Yes, PDF/A is carefully validated because it is for archival purposes, unlike 
> regular PDF files.
> 
>> Do you think that there is a need for a more formal support of such 
>> standards and versions? This would influence some of the design decisions for 
>> the parser and affect the base objects.
> 
> 
> I can’t think of a reason why someone would want to parse a specific PDF 
> version, so my answer is no, I don’t think there is such a need. Has the 
> syntax of PDF even changed that much over the different versions?
> 
> — John
>

Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

2014-03-10 Thread Maruan Sahyoun

I think we are talking about two different things here: the parsing process to 
get the tokens, and the parsing process to follow the PDF file layout and to 
form and follow the higher-level structures such as Xref. Tokens didn’t change. 
File layout and higher-level structures did, e.g. Linearization or Xref 
Streams. Depending on the PDF standard, some are permitted and some are not. 

BR
Maruan

On 10.03.2014 at 10:06, John Hewson wrote:

>> The base syntax has not changed. But the elements described by the base have.
> 
> 
> If the syntax hasn’t changed then there can’t be anything in the parser which 
> is version-specific.
> 
> -- John
> 
> On 10 Mar 2014, at 01:43, Maruan Sahyoun  wrote:
> 
>> Hi John,
>> 
>> it’s not about PDF versions but PDF versions and standards.
>> 
>> The base syntax has not changed. But the elements described by the base have.
>> 
>> BR
>> Maruan Sahyoun
>> 
>> On 10.03.2014 at 09:20, John Hewson wrote:
>> 
>>> Hi Maruan
>>> 
>>>> As of today PDFBox has no formal support for specific PDF versions in a 
>>>> way that a specific version can be enforced, validated ...
>>> 
>>> Perhaps that is because there is not much demand for this? Nowadays 
>>> everyone has instant access to the latest version of Adobe Reader so 
>>> checking that a PDF can be opened with a specific version of Adobe Reader 
>>> is not that useful anymore. There might be some niche cases, but I can’t 
>>> think what they would be. For cases where it’s important that a PDF file is 
>>> valid then a format such as PDF/A or PDF/X must be used instead as 
>>> “vanilla” PDF is ambiguous.
>>> 
>>>> The PDFBox PDF/A validation does a good job for PDF/A 1b but it can not be 
>>>> easily extended to other standards.
>>> 
>>> Yes, PDF/A is carefully validated because it is for archival purposes, 
>>> unlike regular PDF files.
>>> 
>>>> Do you think that there is a need for a more formal support of such 
>>>> standards and versions? This would influence some of the design decisions 
>>>> for the parser and affect the base objects.
>>> 
>>> 
>>> I can’t think of a reason why someone would want to parse a specific PDF 
>>> version, so my answer is no, I don’t think there is such a need. Has the 
>>> syntax of PDF even changed that much over the different versions?
>>> 
>>> — John
>>> 
>> 
>

Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

2014-03-10 Thread Maruan Sahyoun

OK - wasn’t precise enough - token types didn’t change but there are newer 
tokens introduced. 

As the syntax has changed, do we need version and standards support in the 
parsing phase then? The other way would be to parse what’s in there and do 
validation etc. purely on the parsing result (COS model, PD model). We need to 
do that anyway.

What about writing?

BR
Maruan Sahyoun

On 10.03.2014 at 11:43, John Hewson wrote:

>>> If the syntax hasn’t changed then there can’t be anything in the parser 
>>> which is version-specific.
>> 
>> I think we are talking about two different things here. The parsing process 
>> to get the tokens and the parsing process to follow the PDF file layout and 
>> to form and follow the higher level structures such as Xref.
> 
> Yes, there are two phases, tokenizing and parsing; sometimes both are called 
> parsing.
> 
>> Tokens didn’t change. File layout and higher-level structures did, e.g. 
>> Linearization or Xref Streams. Depending on the PDF standard, some are 
>> permitted and some are not. 
> 
> That’s not right. The tokens have changed: “xref” is a keyword and therefore 
> a token. Also, as I said originally, the syntax has changed, because what you 
> call “higher level structures” is actually the syntax.
> 
> -- John
> 
> On 10 Mar 2014, at 02:32, Maruan Sahyoun  wrote:
> 
>> I think we are talking about two different things here: the parsing process 
>> to get the tokens, and the parsing process to follow the PDF file layout and 
>> to form and follow the higher-level structures such as Xref. Tokens didn’t 
>> change. File layout and higher-level structures did, e.g. Linearization or 
>> Xref Streams. Depending on the PDF standard, some are permitted and some are not. 
>> 
>> BR
>> Maruan
>> 
>> On 10.03.2014 at 10:06, John Hewson wrote:
>> 
>>>> The base syntax has not changed. But the elements described by the base 
>>>> have.
>>> 
>>> 
>>> If the syntax hasn’t changed then there can’t be anything in the parser 
>>> which is version-specific.
>>> 
>>> -- John
>>> 
>>> On 10 Mar 2014, at 01:43, Maruan Sahyoun  wrote:
>>> 
>>>> Hi John,
>>>> 
>>>> it’s not about PDF versions but PDF versions and standards.
>>>> 
>>>> The base syntax has not changed. But the elements described by the base 
>>>> have.
>>>> 
>>>> BR
>>>> Maruan Sahyoun
>>>> 
>>>> On 10.03.2014 at 09:20, John Hewson wrote:
>>>> 
>>>>> Hi Maruan
>>>>> 
>>>>>> As of today PDFBox has no formal support for specific PDF versions in a 
>>>>>> way that a specific version can be enforced, validated ...
>>>>> 
>>>>> Perhaps that is because there is not much demand for this? Nowadays 
>>>>> everyone has instant access to the latest version of Adobe Reader so 
>>>>> checking that a PDF can be opened with a specific version of Adobe Reader 
>>>>> is not that useful anymore. There might be some niche cases, but I can’t 
>>>>> think what they would be. For cases where it’s important that a PDF file 
>>>>> is valid then a format such as PDF/A or PDF/X must be used instead as 
>>>>> “vanilla” PDF is ambiguous.
>>>>> 
>>>>>> The PDFBox PDF/A validation does a good job for PDF/A 1b but it can not 
>>>>>> be easily extended to other standards.
>>>>> 
>>>>> Yes, PDF/A is carefully validated because it is for archival purposes, 
>>>>> unlike regular PDF files.
>>>>> 
>>>>>> Do you think that there is a need for a more formal support of such 
>>>>>> standards and versions? This would influence some of the design decisions 
>>>>>> for the parser and affect the base objects.
>>>>> 
>>>>> 
>>>>> I can’t think of a reason why someone would want to parse a specific PDF 
>>>>> version, so my answer is no, I don’t think there is such a need. Has the 
>>>>> syntax of PDF even changed that much over the different versions?
>>>>> 
>>>>> — John
>>>>> 
>>>> 
>>> 
>> 
>
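
A minimal sketch of the first of the two phases discussed above (tokenizing,
as opposed to parsing the file layout). TinyPdfTokenizer is hypothetical and
deliberately incomplete: it only knows numbers, names, dictionary delimiters
and keywords such as obj, endobj or xref, and it ignores strings, arrays,
comments and stream data. It is not how the PDFBox parsers are written.

```java
import java.util.ArrayList;
import java.util.List;

final class TinyPdfTokenizer {
    enum Kind { NUMBER, NAME, DICT_OPEN, DICT_CLOSE, KEYWORD }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + ":" + text;
        }
    }

    private TinyPdfTokenizer() {
    }

    /** Splits the input into tokens; no structure is checked at this stage. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++;
            } else if (input.startsWith("<<", i)) {
                tokens.add(new Token(Kind.DICT_OPEN, "<<"));
                i += 2;
            } else if (input.startsWith(">>", i)) {
                tokens.add(new Token(Kind.DICT_CLOSE, ">>"));
                i += 2;
            } else {
                int start = i;
                i++; // always consume at least one character (may be '/')
                while (i < input.length() && !isBoundary(input.charAt(i))) {
                    i++;
                }
                String text = input.substring(start, i);
                tokens.add(new Token(classify(text), text));
            }
        }
        return tokens;
    }

    private static boolean isBoundary(char c) {
        return Character.isWhitespace(c) || c == '/' || c == '<' || c == '>';
    }

    private static Kind classify(String text) {
        if (text.startsWith("/")) {
            return Kind.NAME;
        }
        if (text.matches("[+-]?\\d+(\\.\\d+)?")) {
            return Kind.NUMBER;
        }
        return Kind.KEYWORD; // e.g. obj, endobj, xref
    }
}
```

The second phase would consume this token list and build objects, the xref
table and so on; that is where version- or standard-specific rules about the
file layout would come into play, if they are wanted at all.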

Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

2014-03-11 Thread Maruan Sahyoun


> 
>> OK - wasn’t precise enough - token types didn’t change but there are newer 
>> tokens introduced. 
> 
> Yes.
> 
>> As the syntax has changed do we need version and standards support in the 
>> parsing phase then?
> 
> I don’t think so, no. I don’t know what the use-case would be. You’d have to 
> go back and read all seven versions of the PDF Reference and make sure that 
> the parser implements the correct handling for each version, that’s an awful 
> lot of work.

OK - so the parser should concentrate on getting the parsing done according to 
the spec (which is mostly the case with NonSequentialParser today). We would 
also have a strict and a relaxed way of parsing for files where the base syntax 
is not correct: standards-compliant parsing needs to catch such errors (which 
we don’t do in core today, only in the PDF/A project), while relaxed parsing 
would ignore them if they can be corrected. 

> 
>> Other way would be to parse what’s in there and do validation etc. purely on 
>> the parsing result (COS model, PD model). Need to do that anyway.
> 
> Yes, I prefer this approach, you can always write a tool which inspects a 
> PDDocument and determines whether or not it uses features available in a 
> given PDF version. It seems better to do this as a separate feature than to 
> try and build it into the parser or the PD model directly.

Fine for me - that would be something like a 'profile' per standard which could 
be used for validation as well as writing.

To get that completed we need to revisit the PD model as not all features of 
PDF are reflected in the matching PD model. That could be done when 
implementing the profiles.

> 
>> What about writing?
> 
> Yes, we want versions for writing, because a user may want to generate e.g a 
> PDF 1.6 file. This is going to be even more important in the near future 
> because the PDF 2.0 standard is supposed to be introduced in 2014.

There are some base features missing in writing a PDF today but I think Andreas 
has something in the works. The 'profile' mentioned above could be used for 
writing too, e.g. to check whether PD model keys are permitted for a certain 
standard/version or not.

> 
> -- John

Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

2014-03-11 Thread Maruan Sahyoun


> Great. One more thing...
> 
>> To get that completed we need to revisit the PD model as not all features of 
>> PDF are reflected in the matching PD model. That could be done when 
>> implementing the profiles.
> 
> All the PD classes provide access to the underlying COS model, so there’s no 
> need to expose low-level details in the PD model.

Yes, I know. Working on the PD model would make the 'profile' easier to build 
and understand, but thinking about it, as one can work on the COS level, that’s 
the one which needs to be checked. WDYT?

Maruan


> 
> -- John
> 
> On 11 Mar 2014, at 00:24, Maruan Sahyoun  wrote:
> 
>> 
>>> 
>>>> OK - wasn’t precise enough - token types didn’t change but there are newer 
>>>> tokens introduced. 
>>> 
>>> Yes.
>>> 
>>>> As the syntax has changed do we need version and standards support in the 
>>>> parsing phase then?
>>> 
>>> I don’t think so, no. I don’t know what the use-case would be. You’d have 
>>> to go back and read all seven versions of the PDF Reference and make sure 
>>> that the parser implements the correct handling for each version, that’s an 
>>> awful lot of work.
>> 
>> OK - so the parser should concentrate on getting the parsing done according 
>> to the spec (which is mostly the case with NonSequentialParser today) and we 
>> also have a way that there is some standards/relaxed way of parsing for 
>> files where the base syntax is not correct as we need to catch such 
>> circumstances for standards compliant parsing (which we don’t have in core 
>> but in the PDF/A project) but would ignore such errors if they can be 
>> corrected for relaxed parsing. 
>> 
>>> 
>>>> Other way would be to parse what’s in there and do validation etc. purely 
>>>> on the parsing result (COS model, PD model). Need to do that anyway.
>>> 
>>> Yes, I prefer this approach, you can always write a tool which inspects a 
>>> PDDocument and determines whether or not it uses features available in a 
>>> given PDF version. It seems better to do this as a separate feature than to 
>>> try and build it into the parser or the PD model directly.
>> 
>> Fine for me - that would be something like a 'profile' per standard which 
>> could be used for validation as well as writing.
>> 
>> To get that completed we need to revisit the PD model as not all features of 
>> PDF are reflected in the matching PD model. That could be done when 
>> implementing the profiles.
>> 
>>> 
>>>> What about writing?
>>> 
>>> Yes, we want versions for writing, because a user may want to generate e.g 
>>> a PDF 1.6 file. This is going to be even more important in the near future 
>>> because the PDF 2.0 standard is supposed to be introduced in 2014.
>> 
>> There are some base features missing in writing a PDF today but I think 
>> Andreas has something in the works. The ‚profile‘ mentioned above could be 
>> used for writing too e.g. to check if PD model keys are permitted for a 
>> certain standard/version or not.
>> 
>>> 
>>> -- John
>> 
>
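
The ‚profile‘ idea from this thread could look roughly like the sketch below: a table recording the PDF version that introduced a key, checked against the keys a document uses. All class and method names are hypothetical and not PDFBox API. Of the two table entries, UserUnit (PDF 1.6) comes from a later thread on this list; Metadata (PDF 1.4) is added for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not PDFBox API: a per-version "profile" of permitted keys.
final class VersionProfileSketch {
    // key name -> version that introduced it, encoded as major * 10 + minor
    private final Map<String, Integer> introducedIn = new LinkedHashMap<>();

    VersionProfileSketch introduce(String key, int major, int minor) {
        introducedIn.put(key, major * 10 + minor);
        return this;
    }

    // Returns the used keys that are newer than the target version.
    // Keys unknown to the profile are not reported.
    List<String> violations(Collection<String> usedKeys, int major, int minor) {
        int target = major * 10 + minor;
        List<String> result = new ArrayList<>();
        for (String key : usedKeys) {
            Integer since = introducedIn.get(key);
            if (since != null && since > target) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        VersionProfileSketch profile = new VersionProfileSketch()
                .introduce("Metadata", 1, 4)
                .introduce("UserUnit", 1, 6);
        // Check a page dictionary that uses both keys against PDF 1.5.
        System.out.println(profile.violations(Arrays.asList("Metadata", "UserUnit"), 1, 5));
    }
}
```

The same table could serve both directions discussed here: reporting violations when validating a parsed document, and rejecting keys before writing a file for a given target version.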

Re: Need JBIG2 test image

2014-03-12 Thread Maruan Sahyoun

Hi Tilman,

I can make one up tomorrow if no one else is faster. Will be done from scratch 
with no real world data in it.

BR

Maruan Sahyoun

On 12.03.2014 at 18:43, Tilman Hausherr wrote:

> No, the file would of course be public.
> 
> I can still have a look about whether PDFBOX can now handle these files, 
> however I suspect that this would bring you in trouble with the law even if I 
> promise you all you want.
> 
> PDFBOX does support JBIG2, you need the levigo plugin.
> 
> Tilman
> 
> On 12.03.2014 18:33, Alin Mazilu wrote:
>> I have scanned police accident reports that have people's names, addresses
>> and phone numbers in them. I had a problem printing these files with pdfbox
>> and I had to improvise by using a command prompt print utility as a
>> Process. I could maybe give you one if you agree not to release it to the
>> public.
>> 
>> Alin
>> 
>> 
>> On Wed, Mar 12, 2014 at 1:19 PM, Tilman Hausherr 
>> wrote:
>> 
>>> Hello all,
>>> 
>>> I'd need a PDF with JBIG2 encoding that can be distributed. So it should
>>> not have anything on it that is copyrighted, i.e. artwork or a real text.
>>> Just some random lines or a lorem ipsum text. The image should be black &
>>> white, i.e. not have other elements in it that have a color like a
>>> watermark. Some unserviced Xerox copiers might produce such images, or some
>>> software from Adobe, IRIS etc. If you have such a file, send it to me,
>>> tilman at snafu dot de, not to the list.
>>> 
>>> I want to use this PDF for a unit test that checks whether the PDF is
>>> decoded with the JBIG2 plugin. A fail would be an empty image. This way we
>>> check that the JBIG2 plugin is properly attached.
>>> 
>>> Tilman
>>> 
>>> 
>
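
The check described above ("a fail would be an empty image") could be sketched as follows. This is not the actual PDFBox unit test; it assumes that "empty" means every pixel has the same value, and the class and method names are made up for the example.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox: detects an image without any content.
final class BlankImageCheck {
    // Returns true if all pixels have the same RGB value as the top-left pixel.
    static boolean isBlank(BufferedImage image) {
        int first = image.getRGB(0, 0);
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                if (image.getRGB(x, y) != first) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A 1-bit black & white image, as a JBIG2 decoder would typically deliver.
        BufferedImage image = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_BINARY);
        System.out.println(isBlank(image)); // nothing drawn yet
        image.setRGB(3, 3, 0xFFFFFFFF);     // one white pixel on the black background
        System.out.println(isBlank(image));
    }
}
```

In a unit test the image would come from decoding the test PDF, and the assertion would be that isBlank returns false when the JBIG2 plugin is attached.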

Re: Problem With MergeUtility

2014-03-13 Thread Maruan Sahyoun

Hi,

not a direct answer to your question but could you try PDDocument.loadNonSeq 
instead?

BR
Maruan Sahyoun

> On 13.03.2014 at 16:16, Alin Mazilu wrote:
> 
> Hello guys,
> 
> 
> Has anyone had any problem with this? Any idea why it happens? What would
> be a good value for pushBackSize so this does not happen? Thanks!
> 
> 
> Partial stack trace:
> 
> 
> org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72940
> bytes in order to reparse stream. Try increasing push back buffer using
> system property org.apache.pdfbox.baseParser.pushBackSize
> 
> 
> 
>at
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
> 
> 
> 
>at
> org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
> 
> 
> 
>at
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
> 
> 
> 
>at
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
> 
> 
> 
>at
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
> 
> 
> 
>at
> org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:186)
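
For the pushBackSize question, a minimal sketch of setting the system property named in the exception message. It only shows how to set the value; the assumption is that the property has to be in place before PDFBox creates its parser, and the helper name and the chosen size are made up. The alternative suggested in this thread is PDDocument.loadNonSeq, which is not shown here.

```java
// Hypothetical helper; only the property name is taken from the exception message.
final class PushBackSizeSketch {
    static final String PROP = "org.apache.pdfbox.baseParser.pushBackSize";

    // Raises the property to at least `needed` bytes and keeps a larger existing value.
    static int ensureAtLeast(int needed) {
        int current = Integer.getInteger(PROP, 0);
        int chosen = Math.max(current, needed);
        System.setProperty(PROP, Integer.toString(chosen));
        return chosen;
    }

    public static void main(String[] args) {
        // The stack trace reports 72940 bytes; 131072 leaves some headroom.
        System.out.println(ensureAtLeast(131072));
    }
}
```

The same value can be passed on the command line with -Dorg.apache.pdfbox.baseParser.pushBackSize=131072 instead of setting it in code.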

Re: Problem With MergeUtility

2014-03-13 Thread Maruan Sahyoun

this issue is logged at PDFBOX-1964 with a potential patch attached.


BR 
Maruan Sahyoun

On 13.03.2014 at 17:52, Timo Boehme wrote:

> Hi,
> 
> as far as I remember PDFMergerUtility is one of the last utilities not 
> supporting loadNonSeq currently.
> 
> As a workaround get the source of PDFMergerUtility, change PDDocument.load to 
> PDDocument.loadNonSeq (you may provide null as buffer parameter).
> 
> 
> Best,
> Timo
> 
> 
> On 13.03.2014 16:46, Alin Mazilu wrote:
>> Where? Here's the code that causes that:
>> 
>> PDFMergerUtility util = new PDFMergerUtility();
>> 
>> for (File file : set) {
>>     try {
>>         if (file.exists()) {
>>             util.addSource(file);
>>         }
>>     } catch (Exception e) {
>>         // log e
>>     }
>> }
>> util.setDestinationFileName(...);
>> 
>> util.mergeDocuments();
>> 
>> 
>> On Thu, Mar 13, 2014 at 11:27 AM, Maruan Sahyoun 
>> wrote:
>> 
>>> Hi,
>>> 
>>> not a direct answer to your question but could you try
>>> PDDocument.loadNonSeq instead?
>>> 
>>> BR
>>> Maruan Sahyoun
>>> 
>>>> On 13.03.2014 at 16:16, Alin Mazilu wrote:
>>>> 
>>>> Hello guys,
>>>> 
>>>> 
>>>> Has anyone had any problem with this? Any idea why it happens? What would
>>>> be a good value for pushBackSize so this does not happen? Thanks!
>>>> 
>>>> 
>>>> Partial stack trace:
>>>> 
>>>> 
>>>> org.apache.pdfbox.exceptions.WrappedIOException: Could not push back
>>> 72940
>>>> bytes in order to reparse stream. Try increasing push back buffer using
>>>> system property org.apache.pdfbox.baseParser.pushBackSize
>>>> 
>>>> 
>>>> 
>>>>at
>>>> 
>>> org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
>>>> 
>>>> 
>>>> 
>>>>at
>>>> org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
>>>> 
>>>> 
>>>> 
>>>>at
>>>> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
>>>> 
>>>> 
>>>> 
>>>>at
>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
>>>> 
>>>> 
>>>> 
>>>>at
>>>> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
>>>> 
>>>> 
>>>> 
>>>>at
>>>> 
>>> org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:186)
>>> 
>> 
> 
> 
> -- 
> 
> Timo Boehme
> OntoChem GmbH
> H.-Damerow-Str. 4
> 06120 Halle/Saale
> T: +49 345 4780474
> F: +49 345 4780471
> timo.boe...@ontochem.com
> 
> _
> 
> OntoChem GmbH
> Managing Director: Dr. Lutz Weber
> Registered office: Halle / Saale
> Court of registration: Stendal
> Registration number: HRB 215461
> _
>

Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun

Hi,

in general I think that this is a valid change. As I understand rendering in 
PDF, Form, Text, Image and Pattern each maintain their own matrix to map to user 
space, which is then transformed by the CTM to device space, so handling them 
specifically is fine and in line with the spec. I’d suggest that we make sure 
that the different ‚spaces‘ are defined properly within the code and refer to 
the PDF spec so that the code is easier to read, if this is not already the 
case. With so many changes it’s a good opportunity to enhance the documentation 
within the source code. Some of the old code enjoys very little documentation.  

I wouldn’t remove processStream and processSubStream, though, but deprecate them 
and remove them in the next major release so as to keep the changes to a 
minimum. There are a number of very important changes in 2.0. The easier we can 
get people to use that version without too many changes to their own code, the 
better.

For 2.0 removing the deprecated stuff of 1.x is fine. Removing not deprecated 
stuff should be avoided if possible. 

For the rendering what might have been missed is taking the UserUnit entry in 
the page dictionary into account which might change the default user space. 
This was introduced in PDF 1.6. A good opportunity to read that entry and make 
sure that we handle it appropriately.
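
As a small illustration of what UserUnit changes: the entry gives the size of a default user space unit in multiples of 1/72 inch, with a default of 1.0. The sketch below uses hypothetical helper names and is not PDFBox code. It converts a page width to inches and builds the scale that would go into an initial matrix.

```java
import java.awt.geom.AffineTransform;

// Hypothetical helper, not PDFBox API.
final class UserUnitSketch {
    static final double UNITS_PER_INCH = 72.0;

    // Physical size in inches of a length given in default user space units.
    static double toInches(double userSpaceUnits, double userUnit) {
        return userSpaceUnits * userUnit / UNITS_PER_INCH;
    }

    // Uniform scale that accounts for UserUnit in an initial matrix.
    static AffineTransform userUnitScale(double userUnit) {
        return AffineTransform.getScaleInstance(userUnit, userUnit);
    }

    public static void main(String[] args) {
        double width = 612; // media box width in user space units
        System.out.println(toInches(width, 1.0)); // without UserUnit
        System.out.println(toInches(width, 2.0)); // with UserUnit 2
    }
}
```

With a width of 612 units the page is 8.5 inches wide by default and 17 inches wide with a UserUnit of 2, so a renderer that ignores the entry would draw such a page at half size.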

BR
Maruan Sahyoun

On 18.03.2014 at 20:46, John Hewson wrote:

> Hi All
> 
> I’m still working on getting Tiling Patterns to render correctly, and need to 
> make some
> changes to core PDFBox functionality in order to proceed. My problem is that 
> tiling
> patterns are defined in their parent stream’s initial coordinate space, 
> rather than the
> coordinate space defined by the CTM. However, in PDFBox there is no way to 
> access
> the parent stream, so I can’t find out what its initial matrix is. The 
> manner in which the
> initial coordinate space is determined is different for pages, forms, and 
> patterns
> 
> What this means is that the parent stream’s initial coordinate space needs to 
> be passed
> to processStream and processSubStream in PDFStreamEngine. This will 
> necessarily be
> a breaking change, and it will affect all downstream subclasses of 
> PDFStreamEngine.
> 
> Because this has to be a breaking change, I propose that we go all the way 
> and make
> the new API bulletproof, 1) so that we won’t have to introduce breaking 
> changes in the
> future if we encounter similar issues, 2) so that the caller of the method 
> can’t pass the
> wrong data in the parameters. We would remove the two generic methods:
> 
> public void processStream(PDResources resources, COSStream cosStream, 
> PDRectangle drawingSize, int rotation)
> public void processSubStream(PDResources resources, COSStream cosStream)
> 
> and replace them with four specific methods:
> 
> public void processPage(PDPage page)
> public void processForm(PDFormXObject form)
> public void processTilingPattern(PDTilingPattern pattern)
> public void processType3Font(PDType3Font font)
> 
> This would mean that the various “process” methods have access to their 
> parent
> stream, and can read any of its public fields in the future without 
> introducing breaking
> changes by altering the method’s parameters.
> 
> What do you think?
> 
> -- John
>

Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun

John,

On 19.03.2014 at 18:15, John Hewson wrote:

> Maruan
> 
>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>> maintain their own matrix to map to user space which is then transformed by 
>> the CTM to device space so handling them specifically is fine and inline 
>> with the spec.
> 
> No, that’s not right, what I said was:
> 
>>> My problem is that tiling patterns are defined in their parent stream’s 
>>> initial coordinate space, rather than the
>>> coordinate space defined by the CTM.
> 
> So patterns should *not* be using the CTM, which is what I’m trying to 
> achieve.
> 

I think you misunderstood what I wrote - patterns have their own matrix - so I 
think we are on the same page here. IMHO according to the spec CTM transforms 
from user space to device space. So it’s pattern space -> user space -> device 
space.


>> I’d suggest that we make sure that the different ‚spaces‘ are defined 
>> properly within the code and refer to the PDF spec so that the code is 
>> easier to read if this is not already the case. With so many changes it’s a 
>> good opportunity to enhance the documentation within the source code. Some 
>> of the old code enjoys very little documentation.
> 
> 
> I disagree, in general I don’t think that references to the PDF spec are a 
> good form of documentation (there are some exceptions). References to the 
> spec are meaningless to the reader unless they take the time to look them up 
> in a 700 page PDF document. I would argue that by just linking back to the 
> spec, we have *failed* to document PDFBox, not succeeded.
> 
> References to the PDF spec have another major flaw: they go out-of-date. For 
> example a Pattern Colour Space will always be called “Pattern Colour Space” 
> in future versions of the PDF spec but it may not be described in paragraph 
> 8.6.6.2 or on page 156. The existing code contains many references to the PDF 
> 1.6 and 1.7 specs as well as the ISO PDF32000 spec, which means that I need 
> three 700 page PDF files open at all times in order to look up PDFBox 
> references. With the new version of the PDF spec due this year, this 
> situation is going to get worse.
> 

Didn’t mean to only reference to the spec but to use the same terms as 
described by the spec. Adding references to the spec is an add-on not a 
replacement.

> I agree that some of the existing code needs more documentation, and I often 
> add documentation to old files which I’m working on. However, my approach is 
> to just paste in a sentence or two from the PDF spec (fair use). That way the 
> reader does not ever need to look at the PDF spec. Because we use the same 
> terminology in PDFBox as in the spec, if someone really wants to look 
> something up, it’s as simple as Ctrl+F, no reference needed, and it’s 
> guaranteed not to go out-of-date.
> 
>> I wouldn’t remove processStream and processSubStream but deprecate them and 
>> remove them in the next major release though as to keep the changes to a 
>> minimum.
> 
> This isn’t possible, as I said it "will necessarily be a breaking change”. 
> This is because in 2.0 PDFStreamEngine needs to know the parent of each 
> stream, but processStream and processSubStream do not provide this 
> information. That’s why I’m discussing this on the mailing list.

I don’t understand why this shouldn’t be possible. It’s more effort, agreed, 
but beneficial.

> 
>> For the rendering what might have been missed is taking the UserUnit entry 
>> in the page dictionary into account which might change the default user 
>> space. This was introduced in PDF 1.6. A good opportunity to read that entry 
>> and make sure that we handle it appropriately.
> 
> Yes, I have this as a “todo” in my working copy, however, if we put the 
> UserUnit in the matrix then we should also put the page Rotation into the 
> matrix, but that’s a significant change.
> 
> -- John
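
The two readings discussed above differ only in which matrix the pattern matrix is combined with. The sketch below follows John's description (pattern space is mapped into the parent stream's initial coordinate space, not into the space given by the current CTM) and shows how far the results drift apart once the content stream has modified the CTM. The matrices are made up for the example and the class is not PDFBox code.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Sketch with made-up matrices; it illustrates the composition order only.
final class PatternSpaceSketch {
    // Maps a pattern-space point to device space: the pattern matrix is applied
    // first, then the matrix of the space the pattern is defined in.
    static Point2D patternToDevice(AffineTransform base,
                                   AffineTransform patternMatrix,
                                   Point2D point) {
        AffineTransform combined = new AffineTransform(base);
        combined.concatenate(patternMatrix); // patternMatrix acts on the point first
        return combined.transform(point, null);
    }

    public static void main(String[] args) {
        AffineTransform parentInitial = AffineTransform.getScaleInstance(2, 2);
        AffineTransform patternMatrix = AffineTransform.getTranslateInstance(10, 20);

        // The CTM after the content stream has translated by (100, 0).
        AffineTransform ctm = new AffineTransform(parentInitial);
        ctm.translate(100, 0);

        Point2D p = new Point2D.Double(1, 1);
        System.out.println("parent's initial matrix: " + patternToDevice(parentInitial, patternMatrix, p));
        System.out.println("current CTM:             " + patternToDevice(ctm, patternMatrix, p));
    }
}
```

With these numbers the point lands at (22, 42) when the parent's initial matrix is used and at (222, 42) when the current CTM is used, which is why the engine needs access to the parent stream.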

Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun

as an added note - initially you suggested

public void processTilingPattern(PDTilingPattern pattern) 

but as Patterns in general have their own matrix I think it applies to all 
patterns, that’s why I wrote „… Form, Text, Image and Pattern maintain …“

BR
Maruan

On 19.03.2014 at 18:31, Maruan Sahyoun wrote:

> John,
> 
> On 19.03.2014 at 18:15, John Hewson wrote:
> 
>> Maruan
>> 
>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>> maintain their own matrix to map to user space which is then transformed by 
>>> the CTM to device space so handling them specifically is fine and inline 
>>> with the spec.
>> 
>> No, that’s not right, what I said was:
>> 
>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>> initial coordinate space, rather than the
>>>> coordinate space defined by the CTM.
>> 
>> So patterns should *not* be using the CTM, which is what I’m trying to 
>> achieve.
>> 
> 
> I think you misunderstood what I wrote - patterns have their own matrix - so 
> I think we are on the same page here. IMHO according to the spec CTM 
> transforms from user space to device space. So it’s pattern space -> user 
> space -> device space.
> 
> 
>>> I’d suggest that we make sure that the different ‚spaces‘ are defined 
>>> properly within the code and refer to the PDF spec so that the code is 
>>> easier to read if this is not already the case. With so many changes it’s a 
>>> good opportunity to enhance the documentation within the source code. Some 
>>> of the old code enjoys very little documentation.
>> 
>> 
>> I disagree, in general I don’t think that references to the PDF spec are a 
>> good form of documentation (there are some exceptions). References to the 
>> spec are meaningless to the reader unless they take the time to look them up 
>> in a 700 page PDF document. I would argue that by just linking back to the 
>> spec, we have *failed* to document PDFBox, not succeeded.
>> 
>> References to the PDF spec have another major flaw: they go out-of-date. For 
>> example a Pattern Colour Space will always be called “Pattern Colour Space” 
>> in future versions of the PDF spec but it may not be described in paragraph 
>> 8.6.6.2 or on page 156. The existing code contains many references to the 
>> PDF 1.6 and 1.7 specs as well as the ISO PDF32000 spec, which means that I 
>> need three 700 page PDF files open at all times in order to look up PDFBox 
>> references. With the new version of the PDF spec due this year, this 
>> situation is going to get worse.
>> 
> 
> Didn’t mean to only reference to the spec but to use the same terms as 
> described by the spec. Adding references to the spec is an add-on not a 
> replacement.
> 
>> I agree that some of the existing code needs more documentation, and I often 
>> add documentation to old files which I’m working on. However, my approach is 
>> to just paste in a sentence or two from the PDF spec (fair use). That way 
>> the reader does not ever need to look at the PDF spec. Because we use the 
>> same terminology in PDFBox as in the spec, if someone really wants to look 
>> something up, it’s as simple as Ctrl+F, no reference needed, and it’s 
>> guaranteed not to go out-of-date.
>> 
>>> I wouldn’t remove processStream and processSubStream but deprecate them and 
>>> remove them in the next major release though as to keep the changes to a 
>>> minimum.
>> 
>> This isn’t possible, as I said it "will necessarily be a breaking change”. 
>> This is because in 2.0 PDFStreamEngine needs to know the parent of each 
>> stream, but processStream and processSubStream do not provide this 
>> information. That’s why I’m discussing this on the mailing list.
> 
> I don’t understand why this shouldn’t be possible. It’s more effort, 
> agreed, but beneficial.
> 
>> 
>>> For the rendering what might have been missed is taking the UserUnit entry 
>>> in the page dictionary into account which might change the default user 
>>> space. This was introduced in PDF 1.6. A good opportunity to read that 
>>> entry and make sure that we handle it appropriately.
>> 
>> Yes, I have this as a “todo” in my working copy, however, if we put the 
>> UserUnit in the matrix then we should also put the page Rotation into the 
>> matrix, but that’s a significant change.
>> 
>> -- John
>

Re: Removing processStream and processSubStream

2014-03-19 Thread Maruan Sahyoun

John

On 19.03.2014 at 19:10, John Hewson wrote:

> Maruan,
> 
>>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>>> maintain their own matrix to map to user space which is then transformed 
>>>> by the CTM to device space so handling them specifically is fine and 
>>>> inline with the spec.
>>> 
>>> No, that’s not right, what I said was:
>>> 
>>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>>> initial coordinate space, rather than the
>>>>> coordinate space defined by the CTM.
>>> 
>>> So patterns should *not* be using the CTM, which is what I’m trying to 
>>> achieve.
>>> 
>> 
>> I think you misunderstood what I wrote - patterns have their own matrix - so 
>> I think we are on the same page here. IMHO according to the spec CTM 
>> transforms from user space to device space. So it’s pattern space -> user 
>> space -> device space.
> 
> Nope, as I said, that’s what PDFBox currently does and it’s wrong. As you say 
> the CTM transforms from user space to device space, but it’s not the only way 
> to do so, and it is not used by patterns.

As the processing is defined in the spec this is a good reference so no need to 
discuss that further. Of course different people might come to different 
conclusions by reading and interpreting the spec. 

> 
>> Didn’t mean to only reference to the spec but to use the same terms as 
>> described by the spec. Adding references to the spec is an add-on not a 
>> replacement.
> 
> I don’t see what value this adds, given that the references will just go 
> out-of-date when the next spec is released. We already use the same 
> terminology as the PDF spec, so Ctrl+F can be used for quick look-ups that 
> won’t go out-of-date.

You are not required to add the information.

> 
>>> This isn’t possible, as I said it "will necessarily be a breaking change”. 
>>> This is because in 2.0 PDFStreamEngine needs to know the parent of each 
>>> stream, but processStream and processSubStream do not provide this 
>>> information. That’s why I’m discussing this on the mailing list.
>> 
>> I don’t understand why this shouldn’t be possible. It’s more effort, 
>> agreed, but beneficial.
> 
> 
> What’s not to understand? PDFStreamEngine *needs* to know the parent of each 
> stream, and the old methods don’t provide this, passing a null parent will 
> not work because we need that information later in order to correctly process 
> the stream. If we allowed a null parent to be passed, the result would be 
> silently broken rendering - there’s no value in providing a 
> backwards-compatible API if it can only produce broken results.

Won’t get to the same conclusion here (as I think we won’t get on the other 
topics above).

> 
> -- John
> 
> On 19 Mar 2014, at 10:31, Maruan Sahyoun  wrote:
> 
>> John,
>> 
>> On 19.03.2014 at 18:15, John Hewson wrote:
>> 
>>> Maruan
>>> 
>>>> From how I understand the rendering in PDF Form, Text, Image and Pattern 
>>>> maintain their own matrix to map to user space which is then transformed 
>>>> by the CTM to device space so handling them specifically is fine and 
>>>> inline with the spec.
>>> 
>>> No, that’s not right, what I said was:
>>> 
>>>>> My problem is that tiling patterns are defined in their parent stream’s 
>>>>> initial coordinate space, rather than the
>>>>> coordinate space defined by the CTM.
>>> 
>>> So patterns should *not* be using the CTM, which is what I’m trying to 
>>> achieve.
>>> 
>> 
>> I think you misunderstood what I wrote - patterns have their own matrix - so 
>> I think we are on the same page here. IMHO according to the spec CTM 
>> transforms from user space to device space. So it’s pattern space -> user 
>> space -> device space.
>> 
>> 
>>>> I’d suggest that we make sure that the different ‚spaces‘ are defined 
>>>> properly within the code and refer to the PDF spec so that the code is 
>>>> easier to read if this is not already the case. With so many changes it’s 
>>>> a good opportunity to enhance the documentation within the source code. 
>>>> Some of the old code enjoys very little documentation.
>>> 
>>> 
>>> I disagree, in general I don’t think that references to the PDF spec are a 
>>> good form of documentation (there are some exceptions). References to the 
>>> spe

Re: Apache PDFBox April 2014 board report due

2014-04-01 Thread Maruan Sahyoun

Hi Andreas,

+1 with the additions from John and Tilman

BR
Maruan

On 30.03.2014 at 16:29, Andreas Lehmkuehler wrote:

> Hi,
> 
> find attached a quick draft of the board report we're expected to submit this
> month.
> 
> @John, @Tilman
> Please add something about the GSoC status.
> 
> 
> Any further comments, objections or additions?
> 
> 
> 
> 
> The Apache PDFBox library is an open source Java tool for working with PDF
> documents.
> 
> 
> General Comments
> 
> 
> There are no issues that require Board attention.
> 
> 
> Community
> 
> There is a steady stream of contributions and bug reports from the community.
> 
> John Hewson and Tilman Hausherr were added as committers and PMC members to 
> our ranks in February 2014.
> 
> Eric Leleu stepped back and went emeritus per his own request in March 2014.
> 
> 452 (429 last report) subscribers on the user@ list
> 157 (164 last report) subscribers on the dev@ list
> 
> Releases
> 
> 
> Version 1.8.4 was released on 31st of January 2014
> 
> 1.8.4 is an incremental bugfix release based on PDFBox 1.8.x.
> 
> GSoC
> 
> 
> TODO
> 
> Development:
> 
> 
> Most likely the next bugfix version 1.8.5 will be released in the second 
> quarter.
> 
> The work on our next major release is an ongoing effort. The main topics are:
> 
> - switch to java 1.6
> - modularization
> - replace/enhance the parser
> - refactor the underlying COS model
> - code cleanup
> - enhance rendering
> 
> 
> 
> BR
> Andreas Lehmkühler

Re: Apache PDFBox April 2014 board report due

2014-04-02 Thread Maruan Sahyoun

Hi,

to unsubscribe please follow the information at 
http://pdfbox.apache.org/mailinglists.html

BR
Maruan Sahyoun

On 02.04.2014 at 10:02, Somnath Jadhav wrote:

> Hello ,
> 
> How can I unsubscribe from these alerts?
> 
> I no longer need them and I can't see any option to
> unsubscribe. Please help.
> 
> Regards,
> Somnath Jadhav,
> +91-9270153230
> www.somnathjadhav.com
> 
> 
> On 2 April 2014 12:58, Timo Boehme  wrote:
> 
>> +1 with the GSoC additions.
>> 
>> 
>> Best,
>> Timo
>> 
>> 
>> 
>> On 30.03.2014 16:29, Andreas Lehmkuehler wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> find attached a quick draft of the board report we're expected to submit
>>> this
>>> month.
>>> 
>>> @John, @Tilman
>>> Please add something about the GSoC status.
>>> 
>>> 
>>> Any further comments, objections or additions?
>>> 
>>> 
>>> 
>>> 
>>> The Apache PDFBox library is an open source Java tool for working with PDF
>>> documents.
>>> 
>>> 
>>> General Comments
>>> 
>>> 
>>> There are no issues that require Board attention.
>>> 
>>> 
>>> Community
>>> 
>>> There is a steady stream of contributions and bug reports from the
>>> community.
>>> 
>>> John Hewson and Tilman Hausherr were added as committers and PMC members
>>> to our ranks in February 2014.
>>> 
>>> Eric Leleu stepped back and went emeritus per his own request in March
>>> 2014.
>>> 
>>> 452 (429 last report) subscribers on the user@ list
>>> 157 (164 last report) subscribers on the dev@ list
>>> 
>>> Releases
>>> 
>>> 
>>> Version 1.8.4 was released on 31st of January 2014
>>> 
>>> 1.8.4 is an incremental bugfix release based on PDFBox 1.8.x.
>>> 
>>> GSoC
>>> 
>>> 
>>> TODO
>>> 
>>> Development:
>>> 
>>> 
>>> Most likely the next bugfix version 1.8.5 will be released in the second
>>> quarter.
>>> 
>>> The work on our next major release is an ongoing effort. The main topics
>>> are:
>>> 
>>> - switch to java 1.6
>>> - modularization
>>> - replace/enhance the parser
>>> - refactor the underlying COS model
>>> - code cleanup
>>> - enhance rendering
>>> 
>>> 
>>> 
>>> BR
>>> Andreas Lehmkühler
>>> 
>> 
>> 
>> --
>> 
>> Timo Boehme
>> OntoChem GmbH
>> H.-Damerow-Str. 4
>> 06120 Halle/Saale
>> T: +49 345 4780474
>> F: +49 345 4780471
>> timo.boe...@ontochem.com
>> 
>> _
>> 
>> OntoChem GmbH
>> Managing Director: Dr. Lutz Weber
>> Registered office: Halle / Saale
>> Court of registration: Stendal
>> Registration number: HRB 215461
>> _
>> 
>>

xmpbox vs. jempbox - which is the one moving forward

2014-04-09 Thread Maruan Sahyoun

Hi,

did we make a decision about whether xmpbox or jempbox is the one to use for XMP 
metadata moving forward? There is a discussion in PDFBOX-1187 about cutting the 
dependency on jempbox, and preflight uses xmpbox.

BR
Maruan

Re: possible memory leak PDFBox 2.0.0

2014-04-10 Thread Maruan Sahyoun

Hi Joseph,

the attachments didn’t make it to the mailing list. Could you upload them to a 
public location? Is the behavior reproducible with any PDF or only with some? 
Could you upload a sample PDF too?

BR
Maruan Sahyoun

On 10.04.2014 at 13:50, Joseph Siddal wrote:

> Hi,
> 
> I've found a memory leak that is caused when doing high volumes of printing.
> 
> The code that reproduces the bug is attached. The code just continuously 
> sends the same printjob to the default printer. The pdf I'm using is 
> available here. The memory leak is evident after 6mins of running the code. 
> The sun.print.CustomMediaTray has 2 static ArrayList fields which are 
> continuously growing in size going from size 29000 to 1+ after 6 minutes 
> and continuing to climb.
> 
> This is using OSX Mavericks, JDK 1.8.0.
> 
> Any help would be appreciated.
> 
> Regards
> Joseph

Re: New PDFBox bugfix release 1.8.5

2014-04-18 Thread Maruan Sahyoun

Hi,

I'm currently on a trip so won't be able to fix it today.

BR

Maruan

> On 18.04.2014 at 16:58, Tilman Hausherr wrote:
> 
> Now only Maruans issue is open. I'm currently fixing more javadoc stuff for 
> 1.8 and 2.0 and will comment when done. This will be finished in 15 min.
> 
> After that, two possibilities IMO:
> - if you can also work on it tomorrow, just wait for Maruan
> - if you can only work on it today, then set the issue to resolved after I'm 
> done and Maruan can open a new issue.
> 
> Tilman
> 
> On 18.04.2014 16:13, Andreas Lehmkuehler wrote:
>> Hi,
>> 
>> On 18.04.2014 15:52, Tilman Hausherr wrote:
>>> On 18.04.2014 15:36, Andreas Lehmkuehler wrote:
>>>> Hi,
>>>> 
>>>> it's time to cut a new bugfix release as there are a lot of fixes
>>> 
>>> Yes!
>>> 
>>>> WDYT?
>>>> Is there anything we should wait for? Any fix only available in the trunk
>>>> which should be merged into the branch as well? What about the 4 open 
>>>> issues
>>>> [1] marked with fix for 1.8.5?
>>> 
>>> PDFBOX-1946 : person 
>>> didn't
>>> answer => set to resolve
>> +1
>> 
>>> PDFBOX-1977 : LZW bug has
>>> been resolved. However the test is still not perfect. Don't really know 
>>> what to
>>> do, I don't have the time to create a "perfect" test, i.e. that would 1. 
>>> include
>>> the case that failed, 2. have both types of tests, deterministic and
>>> non-deterministic. A possible solution would be to change the title to the 
>>> bug
>>> only, then create a new issue re: the test for 2.0 only. WDYT?
>> Sounds reasonable. Will you do that?
>> 
>>> PDFBOX-2026: IMO the bug 
>>> has
>>> been fixed. However the user didn't answer. I will set to resolve.
>> +1
>> 
>>> PDFBOX-1897 : I'll let 
>>> Maruan
>>> resolve that one
>>> 
>>> Tilman
>>> 
 
>>>> BR
>>>> Andreas Lehmkühler
>>>> 
>>>> [1] http://s.apache.org/VwQ
>> 
>> Thanks for the fast reply
>> 
>> BR
>> Andreas Lehmkühler
>

Re: New PDFBox bugfix release 1.8.5

2014-04-22 Thread Maruan Sahyoun

Hi there,

there is an issue with my local copy of pdfbox atm - need a little more time to 
resolve the issue.

BR
Maruan Sahyoun

On 18.04.2014 at 17:03, Maruan Sahyoun wrote:

> Hi,
> 
> I'm currently on a trip so won't be able to fix it today.
> 
> BR
> 
> Maruan
> 
>> On 18.04.2014 at 16:58, Tilman Hausherr wrote:
>> 
>> Now only Maruans issue is open. I'm currently fixing more javadoc stuff for 
>> 1.8 and 2.0 and will comment when done. This will be finished in 15 min.
>> 
>> After that, two possibilities IMO:
>> - if you can also work on it tomorrow, just wait for Maruan
>> - if you can only work on it today, then set the issue to resolved after I'm 
>> done and Maruan can open a new issue.
>> 
>> Tilman
>> 
>> On 18.04.2014 16:13, Andreas Lehmkuehler wrote:
>>> Hi,
>>> 
>>> On 18.04.2014 15:52, Tilman Hausherr wrote:
>>>> On 18.04.2014 15:36, Andreas Lehmkuehler wrote:
>>>>> Hi,
>>>>> 
>>>>> it's time to cut a new bugfix release as there are a lot of fixes
>>>> 
>>>> Yes!
>>>> 
>>>>> WDYT?
>>>>> Is there anything we should wait for? Any fix only available in the trunk
>>>>> which should be merged into the branch as well? What about the 4 open 
>>>>> issues
>>>>> [1] marked with fix for 1.8.5?
>>>> 
>>>> PDFBOX-1946 <https://issues.apache.org/jira/browse/PDFBOX-1946>: person 
>>>> didn't
>>>> answer => set to resolve
>>> +1
>>> 
>>>> PDFBOX-1977 <https://issues.apache.org/jira/browse/PDFBOX-1977>: LZW bug 
>>>> has
>>>> been resolved. However the test is still not perfect. Don't really know 
>>>> what to
>>>> do, I don't have the time to create a "perfect" test, i.e. that would 1. 
>>>> include
>>>> the case that failed, 2. have both types of tests, deterministic and
>>>> non-deterministic. A possible solution would be to change the title to the 
>>>> bug
>>>> only, then create a new issue re: the test for 2.0 only. WDYT?
>>> Sounds reasonable. Will you do that?
>>> 
>>>> PDFBOX-2026 <https://issues.apache.org/jira/browse/PDFBOX-2026>: IMO the 
>>>> bug has
>>>> been fixed. However the user didn't answer. I will set to resolve.
>>> +1
>>> 
>>>> PDFBOX-1897 <https://issues.apache.org/jira/browse/PDFBOX-1897>: I'll let 
>>>> Maruan
>>>> resolve that one
>>>> 
>>>> Tilman
>>>> 
>>>>> 
>>>>> BR
>>>>> Andreas Lehmkühler
>>>>> 
>>>>> [1] http://s.apache.org/VwQ
>>> 
>>> Thanks for the fast reply
>>> 
>>> BR
>>> Andreas Lehmkühler
>>

Re: New PDFBox bugfix release 1.8.5

2014-04-25 Thread Maruan Sahyoun

Hi Andreas,

will commit them later today.

BR
Maruan Sahyoun

On 24.04.2014 at 11:52, Andreas Lehmkühler wrote:

> Hi,
> 
> I'm planning to cut the release at the beginning of the next week.
> 
> Any objections?
> 
> @Maruan
> What about your pending javadoc changes? Do you need more time or help? As we
> are not in a hurry, it wouldn't be a problem to postpone the release process 
> for
> another week or two.
> 
> BR
> Andreas Lehmkühler
> 
>> On 18 April 2014 at 15:36, Andreas Lehmkuehler wrote:
>> 
>> 
>> Hi,
>> 
>> it's time to cut a new bugfix release as there are a lot of fixes
>> in our queue. Additionally I already announced a possible new release in the
>> second quarter and people are already asking for it. ;-)
>> 
>> WDYT?
>> Is there anything we should wait for? Any fix only available in the trunk
>> which
>> should be merged into the branch as well? What about the 4 open issues [1]
>> marked with fix for 1.8.5?
>> 
>> BR
>> Andreas Lehmkühler
>> 
>> [1] http://s.apache.org/VwQ

Re: xmpbox vs. jempbox - which is the one moving forward

2014-04-25 Thread Maruan Sahyoun

Hi

On 25.04.2014 at 12:38, Andreas Lehmkühler wrote:

> Hi,
> 
> 
>> On 9 April 2014 at 15:10, Maruan Sahyoun wrote:
>> 
>> 
>> Hi,
>> 
>> did we make a decision about whether xmpbox or jempbox is the one to use for XMP
>> metadata moving forward? There is a discussion in PDFBOX-1187 about cutting
>> the dependency to jempbox and preflight uses xmpbox.
>> 
> Thanks for bringing this up again.
> 
> How about the following scenario:
> 
> We could alter PDMetadata as follows:
> 
> - remove the import/exportXMPMetadata methods
> - provide new methods get/setMetadatastream to provide an Input/Outputstream 
> to
> be used with your favourite XMPMetadata implementation

+1 for being independent.

E.g. Adobe has a Java XMP lib under BSD license 
http://www.adobe.com/devnet/xmp/library/eula-xmp-library-java.html 

> 
> Pros:
> 
> - this would remove a dependency from pdfbox that is not needed in many cases
> - users can choose which library to use for handling XMP metadata; even a
> third-party lib could be used
> 
> Cons:
> 
> - we still have to maintain 2 XMP-libs

I think we should remove one of the XMP metadata libs, which we can do 
independently of the above decision.


> 
> WDYT?
> 
> 
> BR
> Andreas Lehmkühler

Re: New PDFBox bugfix release 1.8.5

2014-04-25 Thread Maruan Sahyoun

Hi Andreas,

I’ve committed the changes. Fingers crossed that I did that correctly this time.

BR
Maruan

On 24.04.2014 at 11:52, Andreas Lehmkühler wrote:

> Hi,
> 
> I'm planning to cut the release at the beginning of the next week.
> 
> Any objections?
> 
> @Maruan
> What about your pending javadoc changes? Do you need more time or help? As we
> are not in a hurry, it wouldn't be a problem to postpone the release process 
> for
> another week or two.
> 
> BR
> Andreas Lehmkühler
> 
>> On 18 April 2014 at 15:36, Andreas Lehmkuehler wrote:
>> 
>> 
>> Hi,
>> 
>> it's time to cut a new bugfix release as there are a lot of fixes
>> in our queue. Additionally I already announced a possible new release in the
>> second quarter and people are already asking for it. ;-)
>> 
>> WDYT?
>> Is there anything we should wait for? Any fix only available in the trunk
>> which
>> should be merged into the branch as well? What about the 4 open issues [1]
>> marked with fix for 1.8.5?
>> 
>> BR
>> Andreas Lehmkühler
>> 
>> [1] http://s.apache.org/VwQ

Re: New PDFBox bugfix release 1.8.5

2014-04-26 Thread Maruan Sahyoun

Yes, already monitored it :-) 

thanks for the patience.

BR
Maruan

> On 26.04.2014 at 10:29, Andreas Lehmkuehler wrote:
> 
> Hi Maruan,
> 
> [1] everything works. Thanks!
> 
> Looks like we are done here and I'm going to cut the release on Monday or 
> Tuesday evening (UTC+2)
> 
> BR
> Andreas Lehmkühler
> 
> [1] https://builds.apache.org/job/PDFBox%201.8.x/122/
> 
> 
> On 26.04.2014 00:07, Maruan Sahyoun wrote:
>> Hi Andreas,
>> 
>> I’ve committed the changes. Fingers crossed that I did that correctly this 
>> time.
>> 
>> BR
>> Maruan
>> 
>>> On 24.04.2014 at 11:52, Andreas Lehmkühler wrote:
>>> 
>>> Hi,
>>> 
>>> I'm planning to cut the release at the beginning of the next week.
>>> 
>>> Any objections?
>>> 
>>> @Maruan
>>> What about your pending javadoc changes? Do you need more time or help? As 
>>> we
>>> are not in a hurry, it wouldn't be a problem to postpone the release 
>>> process for
>>> another week or two.
>>> 
>>> BR
>>> Andreas Lehmkühler
>>> 
>>>> On 18 April 2014 at 15:36, Andreas Lehmkuehler wrote:
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> it's time to cut a new bugfix release as there are a lot of fixes
>>>> in our queue. Additionally I already announced a possible new release in 
>>>> the
>>>> second quarter and people are already asking for it. ;-)
>>>> 
>>>> WDYT?
>>>> Is there anything we should wait for? Any fix only available in the trunk
>>>> which
>>>> should be merged into the branch as well? What about the 4 open issues [1]
>>>> marked with fix for 1.8.5?
>>>> 
>>>> BR
>>>> Andreas Lehmkühler
>>>> 
>>>> [1] http://s.apache.org/VwQ
>

Re: [VOTE] Release Apache PDFBox 1.8.5

2014-04-29 Thread Maruan Sahyoun

+1 - thanks for preparing the release.

I’ll update the docs on the website as soon as the release is out.

BR
Maruan Sahyoun

On 28.04.2014 at 19:57, Andreas Lehmkuehler wrote:

> Hi,
> 
> a candidate for the PDFBox 1.8.5 release is available at:
> 
>     http://people.apache.org/~lehmi/pdfbox/1.8.5/
> 
> The release candidate is a zip archive of the sources in:
> 
>     http://svn.apache.org/repos/asf/pdfbox/tags/1.8.5/
> 
> The SHA1 checksum of the archive is fc01acc1e2575ff1f40e44e949a862fcae076029.
> 
> Please vote on releasing this package as Apache PDFBox 1.8.5.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 PDFBox PMC votes are cast.
> 
>     [ ] +1 Release this package as Apache PDFBox 1.8.5
>     [ ] -1 Do not release this package because...
> 
> 
> Here is my +1
> 
> BR
> Andreas Lehmkühler
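
Comparing a downloaded release candidate against the SHA1 checksum quoted in a vote mail can be scripted. The following is a minimal sketch using only the JDK; the class name ChecksumCheck and the command-line arguments are assumptions made for illustration and are not part of the release procedure described in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of PDFBox.
final class ChecksumCheck {

    // Hex-encode a digest: lower case, two characters per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Compute the SHA-1 digest of everything readable from the stream.
    static String sha1Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            return toHex(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the downloaded archive
        // args[1]: expected checksum copied from the vote mail
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1])
                    ? "checksum OK"
                    : "MISMATCH: " + actual);
        }
    }
}
```

The archive is streamed in chunks rather than read into memory at once, so the check also works for large source archives.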

1.8.5 and Website

2014-05-02 Thread Maruan Sahyoun

Hi,

I’ve updated the PDFBox API docs to reflect 1.8.5 on the website.

BR
Maruan

On 02.05.2014 at 09:27, Andreas Lehmkühler wrote:

> Hi,
> 
> due to the newest PDFBox 1.8.5 release I've closed all 1.8.5 related issues
> in a bulk operation. I've disabled the email notification to avoid an email
> flood.
> I've also added the all new version 1.8.6 for our next bugfix release ...
> 
> I'll update the download page once the mirrors copied the version from our
> repository.
> 
> BR
> Andreas Lehmkühler

Re: [VOTE] Release Apache PDFBox 1.8.5

2014-05-04 Thread Maruan Sahyoun

same for me

BR - Maruan

On 04.05.2014 at 12:37, Andreas Lehmkuehler wrote:

> Hi,
> 
> On 28.04.2014 21:20, John Hewson wrote:
>> +1
> 
> Is it just me, or did others on the list get this mail with a delay of 6 days
> too? According to the mail header, the issue was on the sender's side.
> 
> As we got enough votes for the release and John didn't veto releasing 1.8.5,
> everything is fine.
> 
>> 
>> -- John
> 
> BR
> Andreas Lehmkühler
> 
>> 
>> On 28 Apr 2014, at 10:57, Andreas Lehmkuehler  wrote:
>> 
>>> Hi,
>>> 
>>> a candidate for the PDFBox 1.8.5 release is available at:
>>> 
>>>     http://people.apache.org/~lehmi/pdfbox/1.8.5/
>>> 
>>> The release candidate is a zip archive of the sources in:
>>> 
>>>     http://svn.apache.org/repos/asf/pdfbox/tags/1.8.5/
>>> 
>>> The SHA1 checksum of the archive is
>>> fc01acc1e2575ff1f40e44e949a862fcae076029.
>>> 
>>> Please vote on releasing this package as Apache PDFBox 1.8.5.
>>> The vote is open for the next 72 hours and passes if a majority of at
>>> least three +1 PDFBox PMC votes are cast.
>>> 
>>>     [ ] +1 Release this package as Apache PDFBox 1.8.5
>>>     [ ] -1 Do not release this package because...
>>> 
>>> 
>>> Here is my +1
>>> 
>>> BR
>>> Andreas Lehmkühler
>> 
>> 
>

Enhancements to PDFBox

2014-05-29 Thread Maruan Sahyoun

Hi,

for a current project I need to work on enhancing PDFBox for

# splitting files (e.g. remove no longer needed resources)
# merging files (e.g. avoid duplicating resources)
# page handling (adding/removing individual pages with resource handling)
# enhancements to forms handling (pre fill XFA forms - partially done, 
enhancing AP generation)

Is someone else working on something similar?

BR

Maruan

Re: Enhancements to PDFBox

2014-05-29 Thread Maruan Sahyoun

Hi,

On 29.05.2014 at 13:57, Andreas Lehmkuehler wrote:

> On 29.05.2014 09:39, Maruan Sahyoun wrote:
>> Hi,
>> 
>> for a current project I need to work on enhancing PDFBox for
>> 
>> # splitting files (e.g. remove no longer needed resources)
> I had a quick look some time ago hoping that it would be easy to just remove 
> unneeded stuff but it isn't (maybe I didn't get it yet). In most cases 
> resources are deleted in combination with the page they belong to. The bigger 
> issue is annotations referring to pages. Those pages including their 
> resources aren't removed when the pages are removed because of the reference 
> in the annotation directory.
>> # merging files (e.g. avoid duplicating resources)
> That only makes sense if the PDFs to be merged use similar resources.
> 
>> # page handling (adding/removing individual pages with resource handling)
> This should be a side product of #1 and #2
> 
>> # enhancements to forms handling (pre fill XFA forms - partially done, 
>> enhancing AP generation)
> This seems to be an important feature not only for you. So it would be nice 
> if someone could improve that.
> 

I already have filling of XFA forms working, with some limitations (PDXFA’s COS
has to be an array, the dataset entry must be present, ...). I could commit it
if someone is interested in its current state, but I had planned to remove some
limitations first. I’m not totally sure whether it should be part of PDXFA or a
separate filler tool, as it will introduce a dependency on XML handling.
Preferences?

>> Is someone else working on something similar?
> My recent todo list is already quite long and maybe #1 and #2 are on it, but 
> I'm afraid on a lower position. But I'm happy to help if someone wants to 
> implement some of those features.

I will be working on #1 and #2 (at least to a degree which is needed for the 
project). If we could get some ideas together and you could help me - based on 
your past experience and knowledge of the code base - to get this started this 
would be great. 

> 
> 
>> BR
>> 
>> Maruan
> 
> BR
> Andreas Lehmkühler
>

Re: Enhancements to PDFBox

2014-05-29 Thread Maruan Sahyoun


On 29.05.2014 at 14:31, Andreas Lehmkuehler wrote:

> On 29.05.2014 14:20, Maruan Sahyoun wrote:
>> Hi,
>> 
>> On 29.05.2014 at 13:57, Andreas Lehmkuehler wrote:
>> 
>>> On 29.05.2014 09:39, Maruan Sahyoun wrote:
>>>> Hi,
>>>> 
>>>> for a current project I need to work on enhancing PDFBox for
>>>> 
>>>> # splitting files (e.g. remove no longer needed resources)
>>> I had a quick look some time ago hoping that it would be easy to just 
>>> remove unneeded stuff but it isn't (maybe I didn't get it yet). In most 
>>> cases resources are deleted in combination with the page they belong to. 
>>> The bigger issue is annotations referring to pages. Those pages including 
>>> their resources aren't removed when the pages are removed because of the 
>>> reference in the annotation directory.
>>>> # merging files (e.g. avoid duplicating resources)
>>> That only makes sense if the PDFs to be merged use similar resources.
>>> 
>>>> # page handling (adding/removing individual pages with resource handling)
>>> This should be a side product of #1 and #2
>>> 
>>>> # enhancements to forms handling (pre fill XFA forms - partially done, 
>>>> enhancing AP generation)
>>> This seems to be an important feature not only for you. So it would be nice 
>>> if someone could improve that.
>>> 
>> 
>> I already have filling of XFA forms working, with some limitations (PDXFA’s COS 
>> has to be an array, dataset entry must be present … ). Could put it in if 
>> someone is interested in the current stage but planned to remove some 
>> limitations first. I’m not totally sure if that should be part of PDXFA or a 
>> Filler tool as this will introduce some dependency on XML handling.
>> Preferences?
> Hmm, maybe it would be a good idea to put that stuff in a separate module, so 
> that it could be added/discarded on demand.

OK - will do.

> 
>>>> Is someone else working on something similar?
>>> My recent todo list is already quite long and maybe #1 and #2 are on it, but 
>>> I'm afraid on a lower position. But I'm happy to help if someone wants to 
>>> implement some of those features.
>> 
>> I will be working on #1 and #2 (at least to a degree which is needed for the 
>> project). If we could get some ideas together and you could help me - based 
>> on your past experience and knowledge of the code base - to get this started 
>> this would be great.
> Yes, of course.
> 
>>>> BR
>>>> 
>>>> Maruan
>>> 
> 
> BR
> Andreas Lehmkühler

Re: Enhancements to PDFBox

2014-05-29 Thread Maruan Sahyoun

Hi Simon,

thanks for the pointer - very useful.

BR
Maruan

On 29.05.2014 at 12:06, Simon Steiner wrote:

> Hi,
> 
> I worked on merging fonts in pdfs in fop using pdfbox
> https://issues.apache.org/jira/browse/FOP-2302
> 
> Thanks
> 
> -Original Message-
> From: Maruan Sahyoun [mailto:sahy...@fileaffairs.de] 
> Sent: 29 May 2014 08:40
> To: dev@pdfbox.apache.org
> Subject: Enhancements to PDFBox
> 
> Hi,
> 
> for a current project I need to work on enhancing PDFBox for
> 
> # splitting files (e.g. remove no longer needed resources)
> # merging files (e.g. avoid duplicating resources)
> # page handling (adding/removing individual pages with resource handling)
> # enhancements to forms handling (pre fill XFA forms - partially done,
> enhancing AP generation)
> 
> Is someone else working on something similar?
> 
> BR
> 
> Maruan
>

Re: Enhancements to PDFBox

2014-05-29 Thread Maruan Sahyoun

On 29.05.2014 at 18:51, John Hewson wrote:

>> # splitting files (e.g. remove no longer needed resources)
> 
> Each page has its own Resources dictionary, so it shouldn't be too difficult. 
> One thing to watch out for is the "page tree" which allows pages to 
> inherit resources from each other, this is handled as PDPageNode but it's 
> kind of messy.

thanks for the hint. Splitting and merging is somewhat similar as splitting is 
typically done by creating a new document and importing the needed pages into 
the newly created document. Using the current code this might lead to duplicate 
resources. 
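
The inheritance John mentions matters when a page is imported into a new document: a page without its own resources dictionary takes the one of its nearest ancestor in the page tree, so the resolved dictionary has to travel with the page. A minimal sketch of that lookup follows; Node and PageTreeSketch are hypothetical stand-ins written for this note, not PDFBox classes, and resources are reduced to a plain name-to-value map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a page tree; not the PDFBox API.
final class PageTreeSketch {

    static final class Node {
        final Node parent;                   // null for the root of the page tree
        final Map<String, String> resources; // null: no resources entry on this node

        Node(Node parent, Map<String, String> resources) {
            this.parent = parent;
            this.resources = resources;
        }
    }

    // Walk up the tree until a node carrying its own resources is found.
    static Map<String, String> effectiveResources(Node page) {
        for (Node n = page; n != null; n = n.parent) {
            if (n.resources != null) {
                return n.resources;
            }
        }
        return Collections.emptyMap();
    }

    // Produce a parentless copy of the page that carries a private copy of
    // the resolved resources, as needed when the page moves to a new document.
    static Node detach(Node page) {
        return new Node(null, new HashMap<>(effectiveResources(page)));
    }
}
```

Copying the resolved map for every detached page is also where duplicate resources come from when several pages share one inherited dictionary.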

> 
>> # merging files (e.g. avoid duplicating resources)
> 
> Sounds like the files are pretty similar, is this actually an overlay? Or are 
> you wanting to insert entire pages?

it’s merging individual files together inserting entire pages. Although the 
files are created individually they share some common elements like company 
logos or fonts. 
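
One possible way to avoid importing the same logo or font stream twice is to key already imported streams by a digest of their bytes. The sketch below only illustrates that idea under the assumption that shared elements are byte-identical across the input files; ResourceDeduper is a hypothetical name, nothing here is existing PDFBox code, and streams that encode the same image differently or fonts that were subsetted differently would not be detected.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: assigns one id per distinct stream content.
final class ResourceDeduper {

    private final Map<String, Integer> idByDigest = new HashMap<>();

    // Returns the id of a previously registered stream with identical bytes,
    // or registers the content under a new id.
    int register(byte[] content) {
        String key = digest(content);
        Integer known = idByDigest.get(key);
        if (known != null) {
            return known;
        }
        int id = idByDigest.size() + 1;
        idByDigest.put(key, id);
        return id;
    }

    // Number of distinct contents seen so far.
    int uniqueCount() {
        return idByDigest.size();
    }

    private static String digest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(md.digest(content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

During a merge, the returned id would decide whether a stream is copied into the target document or replaced by a reference to the copy that is already there.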

> 
> I imagine you probably want to implement both these features at the COS level 
> rather than the PD level, as it's pretty low-level processing.
> 

It will involve a lot of COS processing. I haven’t decided yet if it will sit 
on top of COS or PD. Typically we do encourage people to use PD so I tend to 
start from there and dig down internally as needed. WDYT?

> -- John
> 
>> On 29 May 2014, at 00:39, Maruan Sahyoun  wrote:
>> 
>> Hi,
>> 
>> for a current project I need to work on enhancing PDFBox for
>> 
>> # splitting files (e.g. remove no longer needed resources)
>> # merging files (e.g. avoid duplicating resources)
>> # page handling (adding/removing individual pages with resource handling)
>> # enhancements to forms handling (pre fill XFA forms - partially done, 
>> enhancing AP generation)
>> 
>> Is someone else working on something similar?
>> 
>> BR
>> 
>> Maruan

Re: Idea: stable 2.0 versions

2014-06-01 Thread Maruan Sahyoun

Hi

On 01.06.2014 at 15:03, Andreas Lehmkuehler wrote:

> Hi,
> 
> On 30.05.2014 23:13, John Hewson wrote:
>> I think the risk of creating the impression that 2.0 is stable is too high. 
>> The real problem
>> is that 2.0 has been too long in development, there were frustrated users 
>> asking a year
>> ago about when it would be released.
> The biggest issue is, that we can't name a version stable without an official 
> release.
> 
>> Perhaps it’s time to push for a release of 2.0 and aim for a more frequent 
>> release cycle
>> after that, to avoid repeating the situation where the stable and trunk 
>> versions are
>> years apart?
> +1, it's time to go for release, not tomorrow or next week, but we should 
> start to do some planning.
> 
>> What is holding back 2.0? What features are we *really* holding out on? Can 
>> we put
>> together a roadmap - our users often ask for one...
> I already had a starting discussion with Maruan two weeks ago at a f2f 
> meeting.
> 
> I'd like to add those changes which include API changes so that we don't have
> to wait until the next major release, at least those changes which are not
> that big, such as
> 
> - solving the jempbox/xmpbox issue

could handle that

> - update bouncy castle
> - split the pdfbox module in at least 2 modules (core and rendering)

would break into pdfbox-core (parsing and COS), pdfbox-pd (PD model) and 
pdfbox-rendering.

> 
> There are some changes/improvements/bugfixes I'd like to solve as well:
> 
> - PDFBOX-922: unicode support

one of the most important missing basic features affecting forms handling, 
updating a pdf with non ISO chars …..

> - PDFBOX-62: almost done
> - improve the parser concerning broken XRef-tables
> - complete the recent font-improvements
> 
> There are some other more or less easy to solve candidates
> 
> - enhance type safety
> - remove dependencies
> - 
> 
> There are some other things on our ideas list which should be postponed
> 
> - enhanced parser (could maybe be done without big refactorings, so that we 
> don't have to wait until the next major release)

+1 to postpone it (haven’t got any feedback on the lexer yet). At least it could 
be done without affecting the PD model.

> - refactoring of COS-level object

+1 to postpone it as this should be done together with the parsing

> - 
> 
> There is one important thing we have to do before releasing 2.0, an upgrade 
> guide including updated docs.

I could handle that. I would need some input about major changes as a starting 
point as I didn’t follow all breaking changes.

> 
> We should contact press@ in preparation of the release to phrase a press 
> release.
> 
> 
> IMHO, it could be realistic to do a release in the summer, maybe in August.
> 
>> — John

WRT a roadmap I’d think it would be very good to come up with one but that 
would mean to agree on a set of features/changes upfront for a specific 
release. Don’t know if that is doable. E.g. a lot of the new/improved 
functionality is around rendering which is a very important functionality as 
this is a very common use case. On the other hand that hasn’t been on the ideas 
page.

> 
> BR
> Andreas Lehmkühler
>> 
>> On 30 May 2014, at 14:01, Tilman Hausherr  wrote:
>> 
>>> I suggest that we come up with a concept of designating "stable versions" 
>>> (or "tested versions") for the trunk and put them on the homepage. A stable 
>>> version is one with no or only minor regressions, and/or a version that 
>>> committers have found to be "good". This would be for users of the 2.0 
>>> version who don't want to read every discussion, and also as a hint for 
>>> unhappy 1.8 users.
>>> 
>>> I suspect that other open source projects do also have rules to designate 
>>> stable versions, but I didn't look at them.
>>> 
>>> Proposed rules:
>>> - any committer can designate any version that is older than 24 hours as 
>>> stable
>>> - any committer can veto any version as unstable
>>> - any version that has only positive votes is mentioned on
>>>  https://pdfbox.apache.org/downloads.html#scm
>>> - there should be up to three versions there
>>> 
>>> Tilman
>>> 
>> 
>> 
>

Re: Idea: stable 2.0 versions

2014-06-01 Thread Maruan Sahyoun

Hi

On 01.06.2014 at 18:51, Tilman Hausherr wrote:

> On 01.06.2014 15:46, Maruan Sahyoun wrote:
>>>
>>> There is one important thing we have to do before releasing 2.0, an
>>> upgrade guide including updated docs.
>> could handle that. Would need some input about major changes as a starting 
>> point as I didn’t follow all breaking changes.
>> 
> 
> Here are the ones I know about:
> 
> old => new
> 
> PDXObjectForm => PDFormXObject
> PDXObjectImage => PDImageXObject
> PDPage.convertToImage() => PDFRenderer(PDDocument).renderImage()
> PDXObjectImage.getRGBImage() => PDImageXObject.getImage()
> 
>  => PDFPrinter(PDDocument, ).print(PDDocument,PrinterJob, …)

AFAIK this was PDDocument.print()

Build issues

2014-06-01 Thread Maruan Sahyoun

Hi,

sorry for all the noise - my mistake(s)

Maruan Sahyoun

Re: Idea: stable 2.0 versions

2014-06-02 Thread Maruan Sahyoun

Hi,

Maruan Sahyoun

On 02.06.2014 at 08:59, John Hewson wrote:

>> On 1 Jun 2014, at 06:03, Andreas Lehmkuehler  wrote:
>> 
>> Hi,
>> 
>> On 30.05.2014 23:13, John Hewson wrote:
>>> I think the risk of creating the impression that 2.0 is stable is too high. 
>>> The real problem
>>> is that 2.0 has been too long in development, there were frustrated users 
>>> asking a year
>>> ago about when it would be released.
>> The biggest issue is, that we can't name a version stable without an 
>> official release.
> 
> Seems like there could be some "release candidates" at some point soon... not 
> quite yet.
> 
>> 
>>> Perhaps it’s time to push for a release of 2.0 and aim for a more frequent 
>>> release cycle
>>> after that, to avoid repeating the situation where the stable and trunk 
>>> versions are
>>> years apart?
>> +1, it's time to go for release, not tomorrow or next week, but we should 
>> start to do some planning.
>> 
>>> What is holding back 2.0? What features are we *really* holding out on? Can 
>>> we put
>>> together a roadmap - our users often ask for one...
>> I already had a starting discussion with Maruan two weeks ago at a f2f 
>> meeting.
>> 
>> I'd like to add those changes which include API changes so that we don't
>> have to wait until the next major release, at least those changes which are
>> not that big, such as
>> 
>> - solving the jempbox/xmpbox issue
>> - update bouncy castle
>> - split the pdfbox module in at least 2 modules (core and rendering)
> 
> Splitting the rendering code into a module isn't really a feature... is there 
> a higher-level goal? If so, is it achievable for a 2.0 release in the near 
> future?

There are requests for PDFBox on Android where most of awt is not available.

> 
>> 
>> There are some changes/improvements/bugfixes I'd like to solve as well:
>> 
>> - PDFBOX-922: unicode support
>> - PDFBOX-62: almost done
>> - improve the parser concerning broken XRef-tables
>> - complete the recent font-improvements
> 
> Yes, finally removing AWT fonts will be a huge improvement.
> 
>> There are some other more or less easy to solve candidates
>> 
>> - enhance type safety
>> - remove dependencies
>> - 
>> 
>> There are some other things on our ideas list which should be postponed
>> 
>> - enhanced parser (could maybe be done without big refactorings, so that we 
>> don't have to wait until the next major release)
>> - refactoring of COS-level object
>> - 
>> 
>> There is one important thing we have to do before releasing 2.0, an upgrade 
>> guide including updated docs.
>> 
>> We should contact press@ in preparation of the release to phrase a press 
>> release.
>> 
>> 
>> IMHO, it could be realistic to do a release in the summer, maybe in August.
>> 
>>> -- John
>> 
>> BR
>> Andreas Lehmkühler
>>> 
>>>> On 30 May 2014, at 14:01, Tilman Hausherr  wrote:
>>>> 
>>>> I suggest that we come up with a concept of designating "stable versions" 
>>>> (or "tested versions") for the trunk and put them on the homepage. A 
>>>> stable version is one with no or only minor regressions, and/or a version 
>>>> that committers have found to be "good". This would be for users of the 
>>>> 2.0 version who don't want to read every discussion, and also as a hint 
>>>> for unhappy 1.8 users.
>>>> 
>>>> I suspect that other open source projects do also have rules to designate 
>>>> stable versions, but I didn't look at them.
>>>> 
>>>> Proposed rules:
>>>> - any committer can designate any version that is older than 24 hours as 
>>>> stable
>>>> - any committer can veto any version as unstable
>>>> - any version that has only positive votes is mentioned on
>>>> https://pdfbox.apache.org/downloads.html#scm
>>>> - there should be up to three versions there
>>>> 
>>>> Tilman
>>

Re: Idea: stable 2.0 versions

2014-06-02 Thread Maruan Sahyoun

Hi

On 02.06.2014 at 17:59, John Hewson wrote:

>> On 2 Jun 2014, at 00:24, Maruan Sahyoun  wrote:
> 
>> 
>> Hi,
>> 
>> Maruan Sahyoun
>> 
>> On 02.06.2014 at 08:59, John Hewson wrote:
>> 
>>>> On 1 Jun 2014, at 06:03, Andreas Lehmkuehler  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> On 30.05.2014 23:13, John Hewson wrote:
>>>>> I think the risk of creating the impression that 2.0 is stable is too 
>>>>> high. The real problem
>>>>> is that 2.0 has been too long in development, there were frustrated users 
>>>>> asking a year
>>>>> ago about when it would be released.
>>>> The biggest issue is, that we can't name a version stable without an 
>>>> official release.
>>> 
>>> Seems like there could be some "release candidates" at some point soon... 
>>> not quite yet.
>>> 
>>>> 
>>>>> Perhaps it’s time to push for a release of 2.0 and aim for a more 
>>>>> frequent release cycle
>>>>> after that, to avoid repeating the situation where the stable and trunk 
>>>>> versions are
>>>>> years apart?
>>>> +1, it's time to go for release, not tomorrow or next week, but we should 
>>>> start to do some planning.
>>>> 
>>>>> What is holding back 2.0? What features are we *really* holding out on? 
>>>>> Can we put
>>>>> together a roadmap - our users often ask for one...
>>>> I already had a starting discussion with Maruan two weeks ago at a f2f 
>>>> meeting.
>>>> 
>>>> I'd like to add those changes which include API changes so that we don't
>>>> have to wait until the next major release, at least those changes which
>>>> are not that big, such as
>>>> 
>>>> - solving the jempbox/xmpbox issue
>>>> - update bouncy castle
>>>> - split the pdfbox module in at least 2 modules (core and rendering)
>>> 
>>> Splitting the rendering code into a module isn't really a feature... is 
>>> there a higher-level goal? If so, is it achievable for a 2.0 release in the 
>>> near future?
>> 
>> There are requests for PDFBox on Android where most of awt is not available.
> 
> So the ultimate goal is to have an Android release for 2.0, who's going to do 
> this? AWT is very deeply integrated into PD (e.g. colour spaces, images) and 
> also FontBox (paths). I think a workable plan for removing it is much harder 
> than it looks.

I don’t think, and didn’t mean to say, that an Android release should be done
for 2.0. I only wanted to explain why rendering might go into its own module,
as per Andreas' input.

> 
>> 
>>> 
>>>> 
>>>> There are some changes/improvements/bugfixes I'd like to solve as well:
>>>> 
>>>> - PDFBOX-922: unicode support
>>>> - PDFBOX-62: almost done
>>>> - improve the parser concerning broken XRef-tables
> 
> I'm thinking of taking a look at XRefs.
> 
>>>> - complete the recent font-improvements
>>> 
>>> Yes, finally removing AWT fonts will be a huge improvement.
>>> 
>>>> There are some other more or less easy to solve candidates
>>>> 
>>>> - enhance type safety
>>>> - remove dependencies
>>>> - 
>>>> 
>>>> There are some other things on our ideas list which should be postponed
>>>> 
>>>> - enhanced parser (could maybe be done without big refactorings, so that we 
>>>> don't have to wait until the next major release)
> 
> Yeah, let's just make sure the public API is nice and tight, then we can 
> refactor the internals at will later.
> 
>>>> - refactoring of COS-level object
>>>> - 
>>>> 
>>>> There is one important thing we have to do before releasing 2.0, an 
>>>> upgrade guide including updated docs.
>>>> 
>>>> We should contact press@ in preparation of the release to phrase a press 
>>>> release.
>>>> 
>>>> 
>>>> IMHO, it could be realistic to do a release in the summer, maybe in August.
>>>> 
>>>>> -- John
>>>> 
>>>> BR
>>>> Andreas Lehmkühler
>>>>> 
>>>>>> On 30 May 2014, at 14:01, Tilman Hausherr  wrote:
>>>>>> 
>>>>>> I suggest that we come up with a concept of designating "stable 
>>>>>> versions" (or "tested versions") for the trunk and put them on the 
>>>>>> homepage. A stable version is one with no or only minor regressions, 
>>>>>> and/or a version that committers have found to be "good". This would be 
>>>>>> for users of the 2.0 version who don't want to read every discussion, 
>>>>>> and also as a hint for unhappy 1.8 users.
>>>>>> 
>>>>>> I suspect that other open source projects do also have rules to 
>>>>>> designate stable versions, but I didn't look at them.
>>>>>> 
>>>>>> Proposed rules:
>>>>>> - any committer can designate any version that is older than 24 hours as 
>>>>>> stable
>>>>>> - any committer can veto any version as unstable
>>>>>> - any version that has only positive votes is mentioned on
>>>>>> https://pdfbox.apache.org/downloads.html#scm
>>>>>> - there should be up to three versions there
>>>>>> 
>>>>>> Tilman
>>

Re: Changing font tag for BaseFont

2014-06-05 Thread Maruan Sahyoun

Hi,

why do you need to change that tag? IKOTCH+ is used as a prefix to the font name
because your font is subsetted, i.e. not all glyphs of the font have been written
into the PDF file. This is in line with the specification.
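
For reference, a subset tag is six uppercase letters followed by a plus sign in front of the font's PostScript name. Reading it does not need anything PDFBox-specific once the BaseFont name is available as a string. The helper below is a minimal sketch; the class name SubsetTag is made up for this example and is not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for BaseFont names such as "IKOTCH+UnicodeDoc".
final class SubsetTag {

    // Six uppercase letters, a plus sign, then the actual font name.
    private static final Pattern SUBSET = Pattern.compile("^([A-Z]{6})\\+(.+)$");

    // True if the name carries a subset tag.
    static boolean isSubset(String baseFont) {
        return SUBSET.matcher(baseFont).matches();
    }

    // Returns the tag ("IKOTCH"), or null if the font is not subsetted.
    static String tag(String baseFont) {
        Matcher m = SUBSET.matcher(baseFont);
        return m.matches() ? m.group(1) : null;
    }

    // Returns the font name without the tag; names without a tag are returned unchanged.
    static String stripTag(String baseFont) {
        Matcher m = SUBSET.matcher(baseFont);
        return m.matches() ? m.group(2) : baseFont;
    }
}
```

With the name from the question, stripTag gives back UnicodeDoc and tag gives back IKOTCH.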

As usage questions are discussed on the users mailing list may I ask you to use 
that in the future?

BR

Maruan Sahyoun

On 05.06.2014 at 09:12, Robert Strauch wrote:

> Hello,
> 
> I have a PDF which embeds a TrueType font called UnicodeDoc. Within the PDF I 
> can see the following:
> 
> /BaseFont /IKOTCH+UnicodeDoc
> 
> Is it possible using PDFBox to change the tag value IKOTCH and if so how? I 
> know that this value may be different for other documents. However I just 
> need access to this tag but I cannot find the appropriate way.
> 
> Sincerely,
> Robert

Re: PDFBox 1.8.6 release

2014-06-11 Thread Maruan Sahyoun

Hi,

do you think that https://issues.apache.org/jira/browse/PDFBOX-1512
(TextPositionComparator is not compatible with Java 7) should be handled for
this release? Although I haven’t received feedback on it, I could move forward
with implementing it to reflect how other PDF readers handle positions. But I
wouldn’t be able to start working on it before the week after next.
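
As background: the sort implementation used for objects since Java 7 can throw an IllegalArgumentException ("Comparison method violates its general contract!") when a comparator is inconsistent. A frequent cause is a tolerance-based comparison, which is not transitive. Whether that is the exact cause in PDFBOX-1512 is an assumption here; the sketch below only demonstrates the general problem with plain numbers, and the comparator names are invented for the example.

```java
import java.util.Comparator;

// Illustration only; this is not the PDFBox TextPositionComparator.
final class ComparatorContract {

    // Tolerance-based comparator: values closer than 1.0 count as equal.
    // This breaks the contract because "equal" is not transitive.
    static final Comparator<Double> FUZZY =
            (a, b) -> Math.abs(a - b) < 1.0 ? 0 : Double.compare(a, b);

    // Contract-safe alternative: snap each value to a bucket first,
    // then compare the buckets.
    static final Comparator<Double> BUCKETED =
            Comparator.comparingLong((Double v) -> (long) Math.floor(v));

    // Checks one rule of the Comparator contract:
    // compare(x, y) == 0 implies sgn(compare(x, z)) == sgn(compare(y, z)).
    static boolean consistent(Comparator<Double> c, double x, double y, double z) {
        if (c.compare(x, y) != 0) {
            return true; // the rule only applies when x and y compare as equal
        }
        return Integer.signum(c.compare(x, z)) == Integer.signum(c.compare(y, z));
    }
}
```

For the values 0.0, 0.6 and 1.2 the fuzzy comparator treats the first two as equal and the last two as equal, yet orders the first before the last, so the rule fails. The bucketed comparator passes the same check, at the cost of a hard boundary between buckets.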

BR

Maruan

On 11.06.2014 at 18:02, Tilman Hausherr wrote:

> Sure... Could you make a decision on
> PDFBOX-239 ? And are there 
> any other issues that are to be fixed for 1.8.6?
> 
> Tilman
> 
> 
> 
> 
> 
> On 11.06.2014 08:04, Andreas Lehmkuehler wrote:
>> On 28.05.2014 15:10, Andreas Lehmkühler wrote:
>>> Hi,
>>> 
>>> there are already a number of solved issues mostly due
>>> to the hard work of Tilman and I'm thinking about a new
>>> bugfix release. How about a new one in 2 or 3 weeks
>>> from now?
>>> 
>>> WDYT?
>> How about next week, let's say Wednesday the 18th?
>> 
>>> BR
>>> Andreas Lehmkühler
>> 
>> BR
>> Andreas Lehmkühler
>> 
>

Re: [VOTE] Release Apache PDFBox 1.8.6

2014-06-19 Thread Maruan Sahyoun

+1 - thanks for the preparation

Maruan Sahyoun

On 19.06.2014 at 14:28, Andreas Lehmkuehler wrote:

> Hi,
> 
> a candidate for the PDFBox 1.8.6 release is available at:
> 
>     http://people.apache.org/~lehmi/pdfbox/1.8.6/
> 
> The release candidate is a zip archive of the sources in:
> 
>     http://svn.apache.org/repos/asf/pdfbox/tags/1.8.6/
> 
> The SHA1 checksum of the archive is 543c49ebe34a443654a0c3c264f36acc07983cc6.
> 
> Please vote on releasing this package as Apache PDFBox 1.8.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 PDFBox PMC votes are cast.
> 
>     [ ] +1 Release this package as Apache PDFBox 1.8.6
>     [ ] -1 Do not release this package because...
> 
> 
> Here is my +1
> 
> BR
> Andreas Lehmkühler
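
For anyone checking the candidate before voting, here is a minimal sketch of
how the SHA1 checksum quoted in the vote mail could be compared against a
downloaded archive. The class name ChecksumCheck and its command-line
arguments are assumptions made for illustration; only JDK classes are used.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumCheck {

    /** Returns the lowercase hex SHA-1 digest of the given bytes. */
    public static String sha1Hex(byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: ChecksumCheck <archive> <expected-sha1>");
            return;
        }
        // Reads the whole archive into memory, which is fine for a source zip
        String actual = sha1Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equalsIgnoreCase(args[1])
                ? "checksum OK" : "MISMATCH: " + actual);
    }
}
```

It would be run with the local path of the downloaded zip and the checksum
copied from the mail as the two arguments.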

PDFBox and XMP - retire jempbox

2014-06-19 Thread Maruan Sahyoun

Hi,

we currently have two libraries handling XMP metadata: jempbox and xmpbox.

Part of PDFBOX-1187/PDFBOX-2197 was to remove the direct dependency on jempbox, 
as XMP metadata can now be generated by any library and added as a stream. 
This will be available in PDFBox 2.0.0.

I would like to propose to now retire jempbox as xmpbox

# is closer to the spec (naming conventions)
# is used for PDF/A validation, where we cannot remove the dependency on XMP 
handling because checking metadata is necessary for PDF/A compliance. 

In case there is functionality in jempbox that is missing in xmpbox, it could 
be added at a later stage upon request.

WDYT? 

BR
Maruan

Release Apache PDFBox 1.8.6 - API docs

2014-06-20 Thread Maruan Sahyoun

the apidocs for 1.8.6 are available at 
http://pdfbox.staging.apache.org/docs/1.8.6/javadocs/

upon release they will be put into production.

BR

Maruan Sahyoun

Am 19.06.2014 um 14:28 schrieb Andreas Lehmkuehler :

> Hi,
> 
> a candidate for the PDFBox 1.8.6 release is available at:
> 
>    http://people.apache.org/~lehmi/pdfbox/1.8.6/
> 
> The release candidate is a zip archive of the sources in:
> 
>    http://svn.apache.org/repos/asf/pdfbox/tags/1.8.6/
> 
> The SHA1 checksum of the archive is 543c49ebe34a443654a0c3c264f36acc07983cc6.
> 
> Please vote on releasing this package as Apache PDFBox 1.8.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 PDFBox PMC votes are cast.
> 
>    [ ] +1 Release this package as Apache PDFBox 1.8.6
>    [ ] -1 Do not release this package because...
> 
> 
> Here is my +1
> 
> BR
> Andreas Lehmkühler

Re: TIKA-1300

2014-06-27 Thread Maruan Sahyoun

thanks for the pointer - very useful information.

BR
Maruan

Am 27.06.2014 um 08:18 schrieb Tilman Hausherr :

> Please look at TIKA-1300, 
> it's about the PDFBox sequential parser vs. the non-sequential parser
> 
>

Re: Apache PDFBox July 2014 board report due

2014-06-30 Thread Maruan Sahyoun

+1 - thx for taking care of this.

Maruan


Am 28.06.2014 um 12:15 schrieb Andreas Lehmkuehler :

> Hi,
> 
> find attached a quick draft of the board report we're expected to submit this
> month.
> 
> @John, @Tilman
> Please add something about the GSoC status.
> 
> 
> Any further comments, objections or additions?
> 
> 
> 
> 
> The Apache PDFBox library is an open source Java tool for working with PDF
> documents.
> 
> 
> General Comments
> 
> 
> There are no issues that require Board attention.
> 
> Community
> -
> 
> There is a steady stream of contributions and bug reports from the community.
> 
> 451 (452 last report) subscribers on the user@ list
> 153 (157 last report) subscribers on the dev@ list
> 
> Maruan gave a presentation about PDFBox at the PDF Days Europe 2014 in 
> Cologne.
> We got some positive feedback and a couple of people showed some interest in our
> project/community.
> 
> Releases
> 
> 
> Version 1.8.5 was released on 2nd of May 2014
> Version 1.8.6 was released on 22nd of June 2014
> 
> Both are incremental bugfix releases based on PDFBox 1.8.x.
> 
> GSoC
> 
> 
> TODO John & Tilman
> 
> Development:
> 
> 
> The work on our next major release is an ongoing effort. The main topics are:
> 
> - switch to java 1.6
> - modularization
> - replace/enhance the parser
> - code cleanup
> - enhance rendering
> 
> We are targeting the late summer as a rough release date for the next major 
> release.
> 
> 
> 
> BR
> Andreas Lehmkühler

Re: Regression Testing

2014-07-03 Thread Maruan Sahyoun

Hi John,

thanks for bringing this up. This is a very important topic which was also 
discussed at the PDFDays in Germany.

# Tests #
In addition to rendering we shall be covering metadata and text extraction as 
well as PDF/A validation. 

# Testfiles # 
Recently there were a number of test sets made available which we can use. 
http://digitalcorpora.org/corpora/files , 
https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
For PDF/A validation there is the Isartor test suite 
http://www.pdfa.org/2011/08/download-isartor-test-suite/. Some restrictions 
apply there.
In addition we can put additional files into our own repository as you 
suggested.
So there is no shortage of test files. 

TIKA-1300/TIKA-1302 has a discussion around the same topic together with some 
development for an infrastructure (VM, Jenkins …). IMHO we should join forces 
with them.

BR

Maruan


Am 04.07.2014 um 02:16 schrieb John Hewson :

> Hi All
> 
> I’ve been thinking about regression testing recently and how we can improve
> our tests for rendering. There are currently two problems:
> 
> 1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
>    (I suspect that AWT fonts are a big part of this, so the problem might get
>    a lot better soon once we render all fonts ourselves).
> 
> 2) Most PDF test files we have are not under an Apache-friendly license, so
>    we can’t put the test files into the trunk SVN.
> 
> It seems that some of you have your own collections of test PDF files which
> you are running regression tests on: that’s great but it would be much better
> if we had a central repository of test files and sample renderings.
> 
> I’d like to suggest the following solutions to the above issues:
> 
> 1) We should choose a “blessed” JDK which will be used to perform the
>    renderings; this should be whatever is a convenient and sensible default
>    for committers. (My preference would be for Oracle’s JDK 7 because JDK 6
>    is deprecated and has known rendering bugs). We should make sure that
>    Jenkins runs tests using the “blessed” JDK.
> 
>    The regression test can then check to see if it is running on the
>    “blessed” JDK and if not then the tests can be skipped and we can warn
>    the user.
> 
> 2) We should create a new “regression” branch in SVN which contains only PDF
>    files for testing and PNG images which contain known-good renderings
>    created using the “blessed” JDK. This branch would not be part of the
>    source of PDFBox but will still allow us to version control the test PDFs
>    (it also simplifies the workflow for adding new test PDFs and new
>    known-good renderings: simply do an "svn add").
> 
>    As far as copyright and licensing is concerned we can put any PDF files
>    which are available publicly on the web into this branch without too much
>    worry.
> 
> What does everybody think?
> 
> -- John
>
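
The "blessed" JDK check John describes could look roughly like the sketch
below. The vendor string and the version prefix are assumptions (they match
what Oracle's JDK 7 reports in the java.vendor and java.version system
properties), and the class name BlessedJdk is made up for illustration.

```java
public class BlessedJdk {

    // Assumption: the "blessed" JDK is identified by vendor and version prefix
    static final String BLESSED_VENDOR = "Oracle Corporation";
    static final String BLESSED_VERSION_PREFIX = "1.7.";

    /** True if the given vendor/version pair matches the blessed JDK. */
    public static boolean isBlessed(String vendor, String version) {
        return vendor != null && version != null
                && vendor.equals(BLESSED_VENDOR)
                && version.startsWith(BLESSED_VERSION_PREFIX);
    }

    /** Checks the JVM this code is currently running on. */
    public static boolean runningOnBlessedJdk() {
        return isBlessed(System.getProperty("java.vendor"),
                         System.getProperty("java.version"));
    }

    public static void main(String[] args) {
        if (!runningOnBlessedJdk()) {
            // Skip instead of failing, and tell the user why
            System.err.println("WARNING: skipping rendering regression tests on "
                    + System.getProperty("java.vendor") + " "
                    + System.getProperty("java.version"));
            return;
        }
        System.out.println("Running rendering regression tests ...");
    }
}
```

A test runner could call runningOnBlessedJdk() once in its setup and mark the
rendering tests as skipped rather than failed when it returns false.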

Re: Regression Testing

2014-07-05 Thread Maruan Sahyoun


> Hi Tilman
> 
> Thanks for your thoughts, I think that your concerns are already covered by 
> my original proposal, I’ll try to explain why and how:
> 
>> Of course I agree with the need for regression tests, however it isn't easy: 
>> besides the problems of the different JDKs (I use JDK7 Windows 64 bit), 
>> there is the problem that some enhancements create slight changes in 
>> rendering that are not errors, i.e. both the "before" and the "after" files 
>> look OK by themselves. This has happened when we changed the text rendering 
>> recently, and has happened again when the clipping was improved. The cause 
>> are probably slight changes in color or in boundaries.
> 
> If a rendering has changed then the regression test should fail. When a 
> failure occurs the developer needs to manually inspect the differences (we 
> could generate a visual diff which highlights what changed to make this 
> easier) and if ok then they can replace the known-good PNG with the ones just 
> rendered. Indeed this will be the basic workflow for working with regression 
> tests.
> 

I think this is the only way to handle that situation. The same applies to 
text extraction etc. - if an improvement changes the results, the “base” needs 
to be reset by adding the new image, text etc. as the validation source.

A basic testbed could also run against other JDKs - e.g. without validating 
against the known-good files - so we pick up potential issues early. Should be 
easy with Jenkins and treated as a hint.  
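
To make the comparison step more concrete, here is a minimal sketch of a
pixel-by-pixel check of a fresh rendering against the known-good image, which
also produces the kind of visual diff John mentions. The class RenderingDiff
and its Result type are hypothetical names; an exact per-pixel comparison is
an assumption, a real test might want a tolerance.

```java
import java.awt.image.BufferedImage;

public class RenderingDiff {

    /** Outcome of comparing a fresh rendering with the known-good image. */
    public static final class Result {
        public final int differingPixels;
        /** Changed pixels are red, unchanged pixels are copied from the known-good image. */
        public final BufferedImage diffImage;

        Result(int differingPixels, BufferedImage diffImage) {
            this.differingPixels = differingPixels;
            this.diffImage = diffImage;
        }
    }

    public static Result compare(BufferedImage expected, BufferedImage actual) {
        int w = expected.getWidth();
        int h = expected.getHeight();
        if (w != actual.getWidth() || h != actual.getHeight()) {
            throw new IllegalArgumentException("image sizes differ");
        }
        BufferedImage diff = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        int count = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int e = expected.getRGB(x, y);
                if (e != actual.getRGB(x, y)) {
                    diff.setRGB(x, y, 0xFF0000); // highlight the change in red
                    count++;
                } else {
                    diff.setRGB(x, y, e);
                }
            }
        }
        return new Result(count, diff);
    }
}
```

If differingPixels is non-zero the test fails and the diff image can be
written out for manual inspection; once the change is accepted the new
rendering replaces the known-good PNG.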


>> Copyrights is a problem: I'm testing mostly with JIRA attachments that I've 
>> downloaded over the years. While uploading such files to JIRA might count as 
>> fair use, I doubt that this would still be true if they are included in a 
>> distribution. Instead, they should be stored somewhere on Apache servers 
>> where only committers and build software ("Travis", "Jenkins", ...) can 
>> access them. The public PDFs that Maruan mentions don't possibly have all 
>> the problem cases that we solved before. However I have started working with 
>> these files and there are at least 5 recent issues that deal with them.
> 
> The PDFs won’t be in a distribution. They will just happen to be stored in an 
> SVN repo but not our source code repo, in the same way that the website is 
> stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t 
> distinguish between JIRA and SVN, both are publicly available via HTTP, so 
> using SVN will simply be a continuation of what we’re already doing with JIRA.
> 
> The crucial factor is that we’re only storing publicly available PDFs,  
> because we have the right to do so, just like Google’s cache, and like we 
> currently do with JIRA.
> 
> Additionally, the PDFs need to be version controlled otherwise we won’t be 
> able to reliably recreate previous builds, so storing the files on a web 
> server won’t be practical. Also committers will frequently be updating the 
> renderings as bugs are fixed and we’ll need to version-control the rendered 
> PNG files for the same reason. Finally, having committers-only files doesn’t 
> fit well with the Apache goal of open development and would be unnecessary 
> anyway given that all the PDFs are to be taken from public sources only.
> 
> In summary, I’m proposing that we just keep doing what we’re currently doing 
> with JIRA but we move it into its own SVN repo along with some pre-rendered 
> PNGs.

In addition if we put in workarounds to handle nonconforming PDFs there should 
be a unit test added to make sure that we don’t break that e.g. when rewriting 
the parser. 

> 
>> Re preflight: the default mode should be to have the Isartor tests on. 
>> Individuals could still disable them locally, but the central build software 
>> should always use them.
> 
> Yes - does anybody know why this isn’t the default?
> 

No.

+1 for enabling it per default


> -- John

PDFBox and documentation

2014-07-05 Thread Maruan Sahyoun

Hi,

I have the infrastructure for enhancing our documentation nearly sorted (needed 
to learn a little more about the possibilities of the Apache CMS). Now, what do 
you think the expectation would be for documenting how to use PDFBox for 
different use cases: code snippets or runnable examples?

BR
Maruan

Re: PDFBox and documentation

2014-07-05 Thread Maruan Sahyoun

that should be doable with some newer additions to the Apache CMS which allow 
pulling from svn and/or git. I will try something on that basis. If it works we 
can enhance the example package.

BR
Maruan

Am 05.07.2014 um 18:45 schrieb John Hewson :

> I'm for runnable examples in trunk on SVN, otherwise we'll end up with code 
> that doesn't actually run. Some snippets from these examples could be put on 
> the website but they should always link back to the example file in SVN 
> viewvc - there's nothing more frustrating for a new user than incomplete 
> examples, or having to copy and paste snippets together to recreate an 
> example file.
> 
> Looking at the examples we have currently on SVN the coding conventions used 
> are starting to look a bit dated, certainly far behind more recently written 
> code.
> 
> -- John
> 
>> On 5 Jul 2014, at 04:46, Maruan Sahyoun  wrote:
>> 
>> Hi,
>> 
>> I have the infrastructure for enhancing our documentation nearly sorted 
>> (needed to learn a little more about the possibilities of the Apache CMS). 
>> Now WDYT would be the expectation for documenting how to use PDFBox for 
>> different use cases - code snippets or runnable examples?
>> 
>> BR
>> Maruan

Re: Paid PDFBox support

2014-07-07 Thread Maruan Sahyoun

the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages 3 
times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could you 
attach a sample pdf to PDFBOX-1533 to verify that your issue has the same cause 
or verify it for yourself?

We are using PDFBox for merging documents ourselves successfully. Obviously 
this file would need some special treatment. 

BR
Maruan

Am 07.07.2014 um 11:31 schrieb Aleksander Blomskøld :

> Hi,
> 
> We're using PDFBox for PDF validation and PDF merging in a backend
> invoicing system. It's working pretty well for most of the time, but right
> now we're having some unhappy customers because of
> https://issues.apache.org/jira/browse/PDFBOX-1533.
> 
> As it's important for us to have this fixed pretty soon, we're wondering if
> anyone of you would be willing to fix this issue for pay. If so, please
> contact me so we can work out the details.
> 
> 
> Regards,
> 
> Aleksander Blomskøld

Re: Paid PDFBox support

2014-07-08 Thread Maruan Sahyoun

of course it’s possible to put in a workaround, either in PDFBox itself or in 
the merging application. Even better might be to check why this (at least 
misleading) information was created in the first place. Do you think you 
could influence that?

BR
Maruan

Am 08.07.2014 um 11:01 schrieb Aleksander Blomskøld :

> Yes, it's the same issue. The files attached actually come from the
> company I'm working for.
> 
> 
> On Mon, Jul 7, 2014 at 11:05 PM, Maruan Sahyoun 
> wrote:
> 
>> the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages
>> 3 times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could
>> you attach a sample pdf to PDFBOX-1533 to verify that your issue has the
>> same cause or verify it for yourself?
>> 
>> We are using PDFBox for merging documents ourselves successfully.
>> Obviously this file would need some special treatment.
>> 
>> BR
>> Maruan
>> 
>> Am 07.07.2014 um 11:31 schrieb Aleksander Blomskøld :
>> 
>>> Hi,
>>> 
>>> We're using PDFBox for PDF validation and PDF merging in a backend
>>> invoicing system. It's working pretty well for most of the time, but
>> right
>>> now we're having some unhappy customers because of
>>> https://issues.apache.org/jira/browse/PDFBOX-1533.
>>> 
>>> As it's important for us to have this fixed pretty soon, we're wondering
>> if
>>> anyone of you would be willing to fix this issue for pay. If so, please
>>> contact me so we can work out the details.
>>> 
>>> 
>>> Regards,
>>> 
>>> Aleksander Blomskøld
>> 
>>

Re: Paid PDFBox support

2014-07-08 Thread Maruan Sahyoun

what we could do is put the workaround into PDFBox and log a message. OTOH 
you might have more control over handling such a situation if you deal with it 
yourself by putting in a check and a workaround. See my comment at PDFBOX-1533. 
WDYT?
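
As an illustration of such a check and workaround, here is a minimal sketch
that spots repeated references in a /Kids array and drops the repeats while
keeping the original order. It works on plain strings such as "3 0 R" and does
not use any PDFBox classes; the class PageTreeKids and its methods are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class PageTreeKids {

    /** True if the same object reference occurs more than once in the list. */
    public static boolean hasRepeatedKids(List<String> kids) {
        return new LinkedHashSet<String>(kids).size() != kids.size();
    }

    /** Returns the kids with repeats dropped, keeping first-occurrence order. */
    public static List<String> withoutRepeats(List<String> kids) {
        return new ArrayList<String>(new LinkedHashSet<String>(kids));
    }

    public static void main(String[] args) {
        // The /Kids array from part1.pdf as quoted in this thread
        List<String> kids = Arrays.asList("3 0 R", "3 0 R", "3 0 R");
        if (hasRepeatedKids(kids)) {
            System.err.println("WARNING: page tree contains repeated kids: " + kids);
        }
        System.out.println(withoutRepeats(kids));
    }
}
```

Doing the check in the application gives the choice between logging and
continuing with the de-duplicated list, or rejecting the file altogether.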

BR
Maruan

Am 08.07.2014 um 15:02 schrieb Aleksander Blomskøld :

> Our biggest problem now is that we haven't been able to detect when the
> issue occurs before our customer does. I guess a possible (but not
> optimal) workaround for us would be to check whether the PDF files have
> this issue (getAllPages.size() is not the same as getNumPages()), and then
> raise an exception so we can contact the senders manually.
> 
> 
> Aleksander
> 
> On Tue, Jul 8, 2014 at 11:05 AM, Maruan Sahyoun 
> wrote:
> 
>> of course it’s possible to put in a workaround - might it be in PDFBox
>> itself or in the merging application. Even better might be to check why
>> this - at least misleading information - might have been created. Would you
>> think you could influence that?
>> 
>> BR
>> Maruan
>> 
>> Am 08.07.2014 um 11:01 schrieb Aleksander Blomskøld :
>> 
>>> Yes, it's the same issue. The files attached actually come from the
>>> company I'm working for.
>>> 
>>> 
>>> On Mon, Jul 7, 2014 at 11:05 PM, Maruan Sahyoun 
>>> wrote:
>>> 
>>>> the issue is because part1.pdf in PDFBOX-1533 references the same 2
>> pages
>>>> 3 times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could
>>>> you attach a sample pdf to PDFBOX-1533 to verify that your issue has the
>>>> same cause or verify it for yourself?
>>>> 
>>>> We are using PDFBox for merging documents ourselves successfully.
>>>> Obviously this file would need some special treatment.
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>> Am 07.07.2014 um 11:31 schrieb Aleksander Blomskøld >> :
>>>> 
>>>>> Hi,
>>>>> 
>>>>> We're using PDFBox for PDF validation and PDF merging in a backend
>>>>> invoicing system. It's working pretty well for most of the time, but
>>>> right
>>>>> now we're having some unhappy customers because of
>>>>> https://issues.apache.org/jira/browse/PDFBOX-1533.
>>>>> 
>>>>> As it's important for us to have this fixed pretty soon, we're
>> wondering
>>>> if
>>>>> anyone of you would be willing to fix this issue for pay. If so, please
>>>>> contact me so we can work out the details.
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Aleksander Blomskøld
>>>> 
>>>> 
>> 
>>

Re: Paid PDFBox support

2014-07-08 Thread Maruan Sahyoun

yes - in PDFBOX-1533 I added a description for a workaround I plan to put in. 
WDYT?

BR
Maruan

Am 08.07.2014 um 19:49 schrieb John Hewson :

> In Adobe Acrobat this file has only two pages, so as noted the root of the 
> page tree is invalid:
> 
> /Kids [3 0 R, 3 0 R, 3 0 R]
> 
> Acrobat is ignoring these extra pages, so the fix for PDFBox should be to 
> ignore repeated objects in the page tree.
> 
> -- John
> 
> On 7 Jul 2014, at 14:05, Maruan Sahyoun  wrote:
> 
>> the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages 3 
>> times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could you 
>> attach a sample pdf to PDFBOX-1533 to verify that your issue has the same 
>> cause or verify it for yourself?
>> 
>> We are using PDFBox for merging documents ourselves successfully. Obviously 
>> this file would need some special treatment. 
>> 
>> BR
>> Maruan
>> 
>> Am 07.07.2014 um 11:31 schrieb Aleksander Blomskøld :
>> 
>>> Hi,
>>> 
>>> We're using PDFBox for PDF validation and PDF merging in a backend
>>> invoicing system. It's working pretty well for most of the time, but right
>>> now we're having some unhappy customers because of
>>> https://issues.apache.org/jira/browse/PDFBOX-1533.
>>> 
>>> As it's important for us to have this fixed pretty soon, we're wondering if
>>> anyone of you would be willing to fix this issue for pay. If so, please
>>> contact me so we can work out the details.
>>> 
>>> 
>>> Regards,
>>> 
>>> Aleksander Blomskøld
>> 
>

Re: Paid PDFBox support

2014-07-08 Thread Maruan Sahyoun

wouldn’t say it’s PERFECTLY valid but going to handle it anyway (as Adobe 
Reader/Acrobat does).

BR
Maruan

Am 08.07.2014 um 19:53 schrieb Martin Schröder :

> 2014-07-08 19:49 GMT+02:00 John Hewson :
>> In Adobe Acrobat this file has only two pages, so as noted the root of the 
>> page tree is invalid:
>> 
>> /Kids [3 0 R, 3 0 R, 3 0 R]
> 
> This is IMHO perfectly valid.
> 
> Has anybody tried preflighting the pdf with Acrobat?
> 
> Best
>   Martin

Re: Paid PDFBox support

2014-07-08 Thread Maruan Sahyoun

thx

Maruan

Am 08.07.2014 um 20:33 schrieb John Hewson :

> Looks good. I modified getAllKids() so that it returns the same output as 
> your workaround, rather than applying the workaround to the output. It’s now 
> in the 1.8.7 and 2.0 trunks.
> 
> -- John
> 
> On 8 Jul 2014, at 10:53, Maruan Sahyoun  wrote:
> 
>> yes - in PDFBOX-1533 I added a description for a workaround I plan to put 
>> in. WDYT?
>> 
>> BR
>> Maruan
>> 
>> Am 08.07.2014 um 19:49 schrieb John Hewson :
>> 
>>> In Adobe Acrobat this file has only two pages, so as noted the root of the 
>>> page tree is invalid:
>>> 
>>> /Kids [3 0 R, 3 0 R, 3 0 R]
>>> 
>>> Acrobat is ignoring these extra pages, so the fix for PDFBox should be to 
>>> ignore repeated objects in the page tree.
>>> 
>>> -- John
>>> 
>>> On 7 Jul 2014, at 14:05, Maruan Sahyoun  wrote:
>>> 
>>>> the issue is because part1.pdf in PDFBOX-1533 references the same 2 pages 
>>>> 3 times within the document catalog (/Kids [3 0 R, 3 0 R, 3 0 R]). Could 
>>>> you attach a sample pdf to PDFBOX-1533 to verify that your issue has the 
>>>> same cause or verify it for yourself?
>>>> 
>>>> We are using PDFBox for merging documents ourselves successfully. 
>>>> Obviously this file would need some special treatment. 
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>> Am 07.07.2014 um 11:31 schrieb Aleksander Blomskøld :
>>>> 
>>>>> Hi,
>>>>> 
>>>>> We're using PDFBox for PDF validation and PDF merging in a backend
>>>>> invoicing system. It's working pretty well for most of the time, but right
>>>>> now we're having some unhappy customers because of
>>>>> https://issues.apache.org/jira/browse/PDFBOX-1533.
>>>>> 
>>>>> As it's important for us to have this fixed pretty soon, we're wondering 
>>>>> if
>>>>> anyone of you would be willing to fix this issue for pay. If so, please
>>>>> contact me so we can work out the details.
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Aleksander Blomskøld
>>>> 
>>> 
>> 
>

Re: Subversion integration with JIRA

2014-07-22 Thread Maruan Sahyoun

+1 

Maruan

Am 22.07.2014 um 19:53 schrieb Andreas Lehmkuehler :

> Hi,
> 
> our infra guys provide an integration of subversion with JIRA tickets. All 
> subversion commits will be automatically added as a comment to the 
> corresponding JIRA ticket as long as the ticket number is used within the svn 
> commit comment.
> 
> See http://www.apache.org/dev/svngit2jira.html for any further details.
> 
> Should we ask infra to enable that feature for PDFBox?
> 
> WDYT?
> 
> 
> BR
> Andreas Lehmkühler
>

Re: Subversion integration with JIRA

2014-07-23 Thread Maruan Sahyoun

according to the sample provided in http://www.apache.org/dev/svngit2jira.html 
the commit will be shown in the comments.
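
Since the integration depends on the ticket number appearing in the svn commit
comment, here is a small sketch of how ticket keys could be picked out of a
commit message. This is not how svngit2jira itself is implemented; the pattern
and the class name TicketKeys are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TicketKeys {

    // Assumption: a key is an upper-case project name, a hyphen and a number
    private static final Pattern KEY = Pattern.compile("\\b[A-Z][A-Z0-9]*-\\d+\\b");

    /** Returns all ticket keys found in the commit message, in order of appearance. */
    public static List<String> find(String commitMessage) {
        List<String> keys = new ArrayList<String>();
        Matcher m = KEY.matcher(commitMessage);
        while (m.find()) {
            keys.add(m.group());
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(find("PDFBOX-1533: ignore repeated kids in the page tree"));
    }
}
```

A commit comment without a key yields an empty list, in which case nothing
would be linked to a ticket.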

Maruan

Am 23.07.2014 um 08:33 schrieb Thomas Chojecki :

> Am 2014-07-23 07:57, schrieb Tilman Hausherr:
>> Lets try it. TIKA has something similar, see e.g. here:
>> https://issues.apache.org/jira/browse/TIKA-1325
>> Tilman
> 
> Looks like they mishandle the hudson to do something that jira already 
> support in a similar way. I think the solution from infra is the better one. 
> So the code changes will be shown only in the sourcecode section of a ticket. 
> :-)
> 
> The feature to link source code with an issue is imo a must-have.
> 
> +1
> 
>> Am 22.07.2014 19:53, schrieb Andreas Lehmkuehler:
>>> Hi,
>>> our infra guys provide an integration of subversion with JIRA tickets. All 
>>> subversion commits will be automatically added as comment  to the 
>>> corresponding JIRA ticket as long as the ticket number is used within the 
>>> svn commit comment.
>>> See http://www.apache.org/dev/svngit2jira.html for any further details.
>>> Should we ask infra to enable that feature for PDFBox?
>>> WDYT?
>>> BR
>>> Andreas Lehmkühler

Re: Custom TextStripper / PDGraphicsState Not Reading Color

2014-07-29 Thread Maruan Sahyoun

+1 for removing the .properties file if the new mechanism is easier to 
understand and handle. The discussion so far doesn’t provide proof of that or 
any information about it.

What would a replacement look like?

OTOH if it’s a documentation issue we could also add some more information to 
the javadocs to explain the dependencies. 

We could add a register/unregister method to allow adding/removing custom 
operator handling or provide a service discovery mechanism. This way we still 
have the old flexibility.
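
To give the discussion something concrete, here is a rough sketch of what a
code-based replacement with register/unregister could look like. All names
(Engine, TextEngine, OperatorHandler) are hypothetical and this is not the
PDFStreamEngine API; the idea is only that a subclass registers its operators
in its constructor instead of reading them from a .properties file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OperatorRegistrySketch {

    /** Hypothetical handler for one content stream operator such as "Tj". */
    public interface OperatorHandler {
        void process(String operator, List<String> operands);
    }

    /** Hypothetical base engine that only keeps the operator mapping. */
    public static class Engine {
        private final Map<String, OperatorHandler> handlers =
                new HashMap<String, OperatorHandler>();

        public final void register(String operator, OperatorHandler handler) {
            handlers.put(operator, handler);
        }

        public final void unregister(String operator) {
            handlers.remove(operator);
        }

        /** Returns false if no handler is registered for the operator. */
        public boolean process(String operator, List<String> operands) {
            OperatorHandler handler = handlers.get(operator);
            if (handler == null) {
                return false; // unknown operators are simply ignored
            }
            handler.process(operator, operands);
            return true;
        }
    }

    /** Hypothetical subclass that configures its own operator set in code. */
    public static class TextEngine extends Engine {
        public final List<String> shownText = new ArrayList<String>();

        public TextEngine() {
            // Only the show-text operator is handled in this sketch
            register("Tj", (operator, operands) -> shownText.addAll(operands));
        }
    }
}
```

A user who needs custom behaviour would subclass TextEngine and call register
or unregister in the constructor, so the mapping stays visible to the
compiler and the IDE.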

BR
Maruan

Am 29.07.2014 um 21:48 schrieb John Hewson :

> Right but we need to address the confusion and complexity that has been 
> caused by .properties files which made PDFBOX-2246 so tricky to figure out.
> 
> Lets remove this wart!
> 
> -- John
> 
> On 29 Jul 2014, at 10:44, Tilman Hausherr  wrote:
> 
>> Hi,
>> 
>> At this time, the problem I see and wanted to solve (PDFBOX-2246) exists 
>> regardless whether we use a properties file or initialize directly in the 
>> code.
>> 
>> Tilman
>> 
>> 
>> Am 29.07.2014 19:41, schrieb John Hewson:
>>> On 29 Jul 2014, at 03:44, Andreas Lehmkühler  wrote:
>>> 
>>>> Hi,
>>>> 
>>>> it's not a black and white issue (comments inline)
>>>> 
>>>>> John Hewson  hat am 29. Juli 2014 um 07:44 geschrieben:
>>>>> 
>>>>> 
>>>>> Yes, really I should have said subclasses of PDFStreamEngine -  that's 
>>>>> where
>>>>> the .properties file originates. I'd propose replacing the properties
>>>>> mechanism with a simple method containing the mapping which can be 
>>>>> overridden
>>>>> in subclasses. Ultimately, users expect to be able to subclass the 
>>>>> behaviour
>>>>> of a class by just subclassing the class.
>>>> PDFStreamEngine doesn't configure any operator set itself. The subclasses 
>>>> are
>>>> supposed to configure their own set of operators depending on the 
>>>> particular
>>>> usecase. E.g. to extend the text extraction one has to subclass 
>>>> PDFTextStripper
>>>> and so on.
>>> It’s PDFStreamEngine which implements the .property mechanism though, via 
>>> the
>>> PDFStreamEngine(Properties properties) constructor.
>>> 
>>>> E.g. to extend the text extraction one has to subclass PDFTextStripper and 
>>>> so on.
>>> That’s true, but it’s only half the story, don’t forget that the 
>>> .properties files need
>>> to be copied and pasted elsewhere and modified along with overriding which 
>>> .property
>>> file is passed in the constructor if you want to truly override the class’ 
>>> behaviour.
>>> 
>>>>> We've seen a number of incidents of confusion on the mailing list due to 
>>>>> the
>>>>> current design.
>>>> IMHO, most of the confusion is based on the lack of knowledge of the pdf 
>>>> spec.
>>>> One can't understand how pdfbox works under the hood by simply looking at 
>>>> the
>>>> code. One has to understand the pdf spec as well, at least the base 
>>>> concepts.
>>> I’m specifically talking about confusion surrounding how to override 
>>> operators, and
>>> .properties files, this has come up before. This entire thread has been 
>>> caused by
>>> PDFBox’s design and *not* the PDF spec.
>>> 
>>>>> I'd say that to the modern Java developer having non-code runtime binding 
>>>>> has
>>>>> become an anti-pattern, resulting in brittle code which can't easily be
>>>>> navigated in an IDE and which resists automated analysis and exhibits 
>>>>> runtime
>>>>> failures despite compiling ok. This is one of those cases where the 
>>>>> collective
>>>>> wisdom has just evolved over the years.
>>>> It depends on the given usecase. All solutions have advantages and
>>>> disadvantages. E.g. if someone wants to configure the PDFTextStripper 
>>>> without
>>>> recompiling the code, it is quite handy to keep the configuration in a text
>>>> file.
>>> Has anybody *ever* wanted to change the operators which PDFTextStripper is
>>> processing without recompiling the code? These are internal implementation
>>> details that shouldn’t be exposed in the first place - it’s not a 
>>> “configuration” at
>>> all, especially as 99% of possible changes would just break PDFTextStripper.
>>> 
>>>> In this case I'm neither pro or con a text based config, but I tend to 
>>>> agree
>>>> with John to have the different configurations in some method within the
>>>> subclasses of PDFStreamEngine.
>>> As above, this isn’t “configuration” at all, it lacks even a basic use 
>>> case. I don’t
>>> see any pros which aren’t fabricated for the sake of argument, but the cons 
>>> are
>>> causing us significant problems right here, right now.
>>> 
>>>> BR
>>>> Andreas Lehmkühler
>>>> 
>>>>> -- John
>>>>> 
>>>>>> On 28 Jul 2014, at 13:42, Tilman Hausherr  wrote:
>>>>>> 
>>>>>> I disagree - one doesn't *have* to pass a property file to 
>>>>>> PDFTextStripper
>>>>>> and PageDrawer. The properties file for PDFTextStripper is optional. The
>>>>>> property parameter was already there before it became an apache project.
>>>>>> 
>>>>>> 
>>>>>> Tilman
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Am 28.07.2014 22:08, schrie

Re: Custom TextStripper / PDGraphicsState Not Reading Color

2014-07-30 Thread Maruan Sahyoun

thx for the hint.

Maruan Sahyoun

> Am 30.07.2014 um 12:33 schrieb Andreas Lehmkühler :
> 
> 
> 
>> Maruan Sahyoun  hat am 30. Juli 2014 um 08:12
>> geschrieben:
>> 
>> 
>> +1 for removing the .properties file if the new mechanism is easier to
>> understand and handle. The discussion doesn’t provide that proof or some
>> information about that.
>> 
>> How would a replacement look like?
>> 
>> OTOH if it’s a documentation issue we could also add some more information to
>> the javadocs to explain the dependencies.
>> 
>> We could add a register/unregister method to allow to add/remove custom
>> operator handling or provide a service discovery mechanism. This way we still
>> have the old flexibility.
> There is already the method registerOperatorProcessor in PDFStreamEngine to
> register operators. In most cases it's called when processing the property 
> file.
> In the case of preflight (see PreflightStreamEngine) those register calls are
> done directly within the constructor. There isn't any unregister method.
> 
> BR
> Andreas Lehmkühler
> 
>> 
>> BR
>> Maruan
>> 
>>> Am 29.07.2014 um 21:48 schrieb John Hewson :
>>> 
>>> Right but we need to address the confusion and complexity that has been
>>> caused by .properties files which made PDFBOX-2246 so tricky to figure out.
>>> 
>>> Lets remove this wart!
>>> 
>>> -- John
>>> 
>>>> On 29 Jul 2014, at 10:44, Tilman Hausherr  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> At this time, the problem I see and wanted to solve (PDFBOX-2246) exists
>>>> regardless whether we use a properties file or initialize directly in the
>>>> code.
>>>> 
>>>> Tilman
>>>> 
>>>> 
>>>> Am 29.07.2014 19:41, schrieb John Hewson:
>>>>> On 29 Jul 2014, at 03:44, Andreas Lehmkühler  wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> it's not a black and white issue (comments inline)
>>>>>> 
>>>>>>> John Hewson  hat am 29. Juli 2014 um 07:44
>>>>>>> geschrieben:
>>>>>>> 
>>>>>>> 
>>>>>>> Yes, really I should have said subclasses of PDFStreamEngine -  that's
>>>>>>> where
>>>>>>> the .properties file originates. I'd propose replacing the properties
>>>>>>> mechanism with a simple method containing the mapping which can be
>>>>>>> overridden
>>>>>>> in subclasses. Ultimately, users expect to be able to subclass the
>>>>>>> behaviour
>>>>>>> of a class by just subclassing the class.
>>>>>> PDFStreamEngine doesn't configure any operator set itself. The subclasses
>>>>>> are
>>>>>> supposed to configure their own set of operators depending on the
>>>>>> particular
>>>>>> usecase. E.g. to extend the text extraction one has to subclass
>>>>>> PDFTextStripper
>>>>>> and so on.
>>>>> It’s PDFStreamEngine which implements the .property mechanism though, via
>>>>> the
>>>>> PDFStreamEngine(Properties properties) constructor.
>>>>> 
>>>>>> E.g. to extend the text extraction one has to subclass PDFTextStripper
>>>>>> and so on.
>>>>> That’s true, but it’s only half the story, don’t forget that the
>>>>> .properties files need
>>>>> to be copied and pasted elsewhere and modified along with overriding which
>>>>> .property
>>>>> file is passed in the constructor if you want to truly override the class’
>>>>> behaviour.
>>>>> 
>>>>>>> We've seen a number of incidents of confusion on the mailing list due to
>>>>>>> the
>>>>>>> current design.
>>>>>> IMHO, most of the confusion is based on the lack of knowledge of the pdf
>>>>>> spec.
>>>>>> One can't understand how pdfbox works under the hood by simply looking at
>>>>>> the
>>>>>> code. One has to understand the pdf spec as well, at least the base
>>>>>> concepts.
>>>>> I’m specifically talking about confusion surrounding how to override
>>>>> operators, and
>>>>> .properties fi

Re: Apache PDFBox Board Report January 2022 due

2022-01-10 Thread Maruan Sahyoun

+1
Maruan 

> Am 09.01.2022 um 14:20 schrieb Andreas Lehmkuehler :
> 
> Hi,
> 
> find attached a quick draft of the board report we're expected to submit this
> month. It's based upon the report wizard template which can be found at [1]
> 
> Any comments or additions are appreciated ...
> 
> 
> 
> ## Description:
> The mission of PDFBox is the creation and maintenance of software related to
> a Java library for working with PDF documents
> 
> ## Issues:
> There are no issues requiring board attention at this time.
> 
> ## Membership Data:
> Apache PDFBox was founded 2009-10-21 (12 years ago)
> There are currently 21 committers and 21 PMC members in this project.
> The Committer-to-PMC ratio is 1:1.
> 
> Community changes, past quarter:
> - No new PMC members. Last addition was Matthäus Mayer on 2017-10-16.
> - No new committers. Last addition was Joerg O. Henne on 2017-10-09.
> 
> ## Project Activity:
> Recent releases:
> 
> - 2.0.25 was released on 2021-12-16.
> - 3.0.0-alpha2 was released on 2021-09-10.
> - 2.0.24 was released on 2021-06-10.
> 
> ## Community Health:
> - there is a steady stream of contributions, bug reports and questions on the
>  mailing lists
> - there are a lot of refactorings, improvements and bugfixes
> - we are working on finalizing 3.0.0 and released another alpha version
> - PDFBox isn't affected by the log4j2 vulnerability as we are using Commons
>  Logging and don't ship any logging library
> - Maruan activated GitHub CodeQL scans for our codebase
> 
> 
> 
> Andreas
> 
> [1] https://reporter.apache.org/wizard/?pdfbox

Re: [VOTE] Release Apache PDFBox JBIG2 ImageIO 3.0.4

2022-02-28 Thread Maruan Sahyoun

+1
Maruan 

> On 26.02.2022 at 16:39, Andreas Lehmkuehler wrote:
> 
> Hi,
> 
> a candidate for the Apache PDFBox JBIG2 ImageIO 3.0.4 release is available at:
> 
>    https://dist.apache.org/repos/dist/dev/pdfbox/jbig2-imageio/3.0.4/
> 
> The release candidate is a zip archive of the sources in:
> 
>    https://github.com/apache/pdfbox-jbig2/tree/3.0.4/
> 
> The SHA-512 checksum of the archive is 
> 382acb53e0bb56595f7eb8c382369a48a000ced22ff4d101ec89316c749b5afd344c6303a3e6c75b12e949f1efe688e18bd1b8b0b5deb449a581b1c97c35e672.
> 
> Please vote on releasing this package as Apache PDFBox JBIG2 ImageIO 3.0.4.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 PDFBox PMC votes are cast.
> 
>    [ ] +1 Release this package as Apache PDFBox JBIG2 ImageIO 3.0.4
>    [ ] -1 Do not release this package because...
> 
> Here is my +1
> 
> Andreas

Re: timeout on Jempbox xmp media management schema's getHistory

2022-09-07 Thread Maruan Sahyoun

What about opening a ticket and attaching the XMP in question?

BR
Maruan 

> On 07.09.2022 at 23:19, Tim Allison wrote:
> 
> All,
>  This issue is ringing a bell.  I'm sorry if there's an open issue or
> you/we've decided long ago that this is not an issue.
>   One of the timeouts in the most recent run was caused by Jempbox's
> handling of the history in the media management schema.  There are 32000
> elements in the history. :(
>   On the Tika side, we limit history elements to 1024.  However, Jempbox
> still has to load the full list, and on this XMP, it takes a long, long
> time (I was never patient enough to let it finish).
>   If I do enough subclassing and limit getEventSequenceList to 1024, all
> is good. [0]
>   Is it worth spending time to fix the underlying performance issue in
> Jempbox, or is this type of hack on the Tika side the best option?
> 
>   Best,
> 
> Tim
> 
> [0] // cap the loop at 1024 entries, but never read past the end of the list
> int length = Math.min(items.getLength(), 1024);
> for (int i = 0; i < length; ++i) {
>     Element li = (Element) items.item(i);
>     retval.add(new ResourceEvent(li));
> }
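
The capping idea in [0] can be tried in isolation with nothing but the JDK's DOM classes. The sketch below is an illustration under assumptions: CappedHistory, firstEvents and buildSample are made-up names, the sample document is synthetic, and plain Element objects stand in for Jempbox's ResourceEvent. It only shows the bounded loop; it does not address the cost of loading the full list that the message mentions.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Jempbox or Tika.
class CappedHistory {
    static final int MAX_EVENTS = 1024;

    // Returns at most 'limit' elements from the node list, in document order.
    static List<Element> firstEvents(NodeList items, int limit) {
        int length = Math.min(items.getLength(), limit);
        List<Element> result = new ArrayList<>(length);
        for (int i = 0; i < length; i++) {
            result.add((Element) items.item(i));
        }
        return result;
    }

    // Builds a synthetic document with 'count' <li> children for experiments.
    static Document buildSample(int count) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element seq = doc.createElement("Seq");
            doc.appendChild(seq);
            for (int i = 0; i < count; i++) {
                seq.appendChild(doc.createElement("li"));
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        NodeList items = buildSample(2000).getElementsByTagName("li");
        System.out.println(firstEvents(items, MAX_EVENTS).size());
    }
}
```

Using Math.min keeps the loop safe for short histories as well, where a fixed bound of 1024 would run past the end of the list.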


Re: Minimum Java version for PDFBox 3.x

2023-03-18 Thread Maruan Sahyoun

I'd second a move to 11 for 3.x, as this will enable us to use newer 
functions during the lifetime of 3.x without another major release.

BR
Maruan

> On 18.03.2023 at 10:13, Tilman Hausherr wrote:
> 
> You may have a point with some of your arguments, but not this one:
>> Public updates for Java 8 have stopped in March 2022, now one year ago
> 
> My latest jdk8 is from January 17th of this year. (Amazon Corretto)
> 
> About the difficulty of finding contributors - this has always been difficult. 
> That's because PDF isn't "sexy" at all.
> 
> Coincidentally, our CI builds are now failing because Jenkins / Hudson has 
> moved to jdk11.
> 
> Tilman

Re: Minimum Java version for PDFBox 3.x

2023-03-18 Thread Maruan Sahyoun

Fine, so let's target that for 4.x.

> On 18.03.2023 at 16:51, Andreas Lehmkuehler wrote:
> 
> On 18.03.23 at 10:49, Maruan Sahyoun wrote:
>> I'd second a move to 11 for 3.x, as this will enable us to use newer 
>> functions during the lifetime of 3.x without another major release.
> I'd like to do so for the next major version 4.0.x. Hopefully it won't take 
> us as much time to release that version as it took us to release 3.0.x.
> 
> BTW 3.0.x will be the last version supporting preflight and maybe it is a 
> good idea to stick with Java 8 compatibility.
> 
> Andreas
>> BR
>> Maruan
>>>> On 18.03.2023 at 10:13, Tilman Hausherr wrote:
>>> 
>>> You may have a point with some of your arguments, but not this one:
>>>> Public updates for Java 8 have stopped in March 2022, now one year ago
>>> 
>>> My latest jdk8 is from January 17th of this year. (Amazon Corretto)
>>> 
>>> About the difficulty of finding contributors - this has always been difficult. 
>>> That's because PDF isn't "sexy" at all.
>>> 
>>> Coincidentally, our CI builds are now failing because Jenkins / Hudson 
>>> has moved to jdk11.
>>> 
>>> Tilman

How can we help

2010-03-07 Thread Maruan Sahyoun

Dear PDFBox developers,

we are a small German-based consulting/implementation company working in the 
area of electronic documents. PDF is a key technology in our projects. We are 
an Adobe partner for their server products (Adobe LiveCycle) and have worked 
in the past with libraries like iText, pdflib and pdfnet.sdk in addition to 
PDFBox. 

We would like to commit some resources to help develop PDFBox further. Which 
areas should we look into?

With kind regards


Maruan Sahyoun

FileAffairs GmbH
Kaiserswerther Str. 115
40880 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial register: Amtsgericht Düsseldorf, HRB 53837
VAT ID: DE248275827

Re: pdfbox develpment

2010-03-08 Thread Maruan Sahyoun

Hi,

we were looking to start fixing some of the open issues but can instead develop 
some small tutorials for common tasks like text extraction, forms handling and 
highlighting.

WDYT

Kind regards

Maruan Sahyoun

On 09.03.2010 at 07:58, Andreas Lehmkuehler wrote:

> Hi,
> 
> Michael Müller wrote:
>> Daniel,
>> Yes, I found some activity on the lists. But the project site lists
>> neither developers nor committers. Just missing documentation? ;-)
>> Great to hear this project is alive.
>> I have big problems using it, due to missing or vague docs.
>> E.g. setTextMatrix:
>> public void setTextMatrix(double a, double b, double c, double d, double
>> e, double f)
>> What are a, b, c, d, e, f? I figured out that e and f are coordinates. It
>> would be much better to name these x and y or to enhance the documentation.
> These values correspond to the naming used in the PDF reference for a matrix.
> 
>> Maybe enhancing the documentation is an entry point for me to support the
>> project? Or do any docs exist besides the published Javadocs?
> Be our guest, a good and complete documentation is always useful, especially
> for beginners.
> 
> BR
> Andreas Lehmkühler
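
For readers with the same question about a, b, c, d, e, f: in the PDF reference a transformation matrix is written as the six numbers [a b c d e f], and a point (x, y) is mapped to x' = a*x + c*y + e and y' = b*x + d*y + f. So e and f are the translation, while a, b, c, d carry scaling, rotation and skew. The small sketch below just evaluates that formula; TextMatrixSketch and transform are names made up for this illustration and are not part of PDFBox.

```java
// Hypothetical helper that applies a PDF-style matrix [a b c d e f] to a point.
class TextMatrixSketch {

    // x' = a*x + c*y + e,  y' = b*x + d*y + f
    static double[] transform(double a, double b, double c, double d,
                              double e, double f, double x, double y) {
        return new double[] { a * x + c * y + e, b * x + d * y + f };
    }

    public static void main(String[] args) {
        // Pure translation: a = d = 1, b = c = 0, so the origin moves to (e, f).
        double[] moved = transform(1, 0, 0, 1, 100, 700, 0, 0);
        System.out.println(moved[0] + ", " + moved[1]);

        // Uniform scale by 2 followed by a translation of (10, 20).
        double[] scaled = transform(2, 0, 0, 2, 10, 20, 3, 4);
        System.out.println(scaled[0] + ", " + scaled[1]);
    }
}
```

Passing (1, 0, 0, 1, x, y) therefore only positions the text, which is why e and f look like plain coordinates in simple cases.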

Re: Reopen PDFBOX-483?

2010-03-08 Thread Maruan Sahyoun

Hi Andreas,

I can do a test on our Windows test server (Windows 2003, 32bit) and let you 
know the results around lunch time (German time) if that helps.

Maruan Sahyoun

On 09.03.2010 at 08:11, Andreas Lehmkuehler wrote:

> Hi,
> 
> steve poling wrote:
>> Andreas Lehmkuehler wrote:
>>>>> If you go to PDFBOX-490 
>>>>> <https://issues.apache.org/jira/browse/PDFBOX-490>, you'll find attached 
>>>>> file filled.pdf that manifests this error, but I've been seeing this with 
>>>>> a lot of different PDFs: display looks good, print looks bad. I can 
>>>>> attach another file to PDFBOX-483 
>>>>> <https://issues.apache.org/jira/browse/PDFBOX-483> if you'd like.
>>>> I've tried that pdf and it works like a charm except for some misplaced 
>>>> characters. I'm using ubuntu linux, java 1.6.0_15 32bit and a HP Laserjet 
>>>> 2550N.
>>> I've made another test on my MacBook (Mac OS X 10.6, jdk 1.6.0_17 64bit, 
>>> same printer) and it works well too.
>> I'd like to know if anyone has repeated the experiment on any Windows-based 
>> platform, since Ubuntu and OSX are both Unix-like. If someone else can 
>> reproduce the failure on Windows, I'll start trusting my sanity again.
> I've been a software developer for many years and sometimes it leads to
> insanity, but we all have to do our best not to end up in the programmers'
> nuthouse ;-))
> 
> I'll see if I can find some time to run that test on my rarely used windows 
> box.
> 
> BR
> Andreas Lehmkühler
>

Re: Reopen PDFBOX-483?

2010-03-09 Thread Maruan Sahyoun

Hi,

please find enclosed the result of the printing test conducted on Windows 2003 
Server SP2 32 bit, Java 1.5, using a fresh build from trunk. The test was done 
using the Adobe PDF printer driver as well as Apple and HP PostScript printers, 
with similar results.

kind regards

Maruan Sahyoun





On 09.03.2010 at 09:49, Andreas Lehmkühler wrote:

> Hi,
> 
> Subject: Re: Reopen PDFBOX-483?
> Sent: Tue, 09 Mar 2010
> From: Maruan Sahyoun
> 
>> Hi Andreas,
>> 
>> I can do a test on our Windows test server (Windows 2003, 32bit) and let you
>> know the results around lunch time (German time) if that helps
> Yeah, that would be great.
> 
> BR
> Andreas Lehmkühler
> 
>> Maruan Sahyoun
>> 
>> On 09.03.2010 at 08:11, Andreas Lehmkuehler wrote:
>> 
>>> Hi,
>>> 
>>> steve poling wrote:
>>>> Andreas Lehmkuehler wrote:
>>>>>>> If you go to PDFBOX-490
>>>>>>> <https://issues.apache.org/jira/browse/PDFBOX-490>, you'll find attached
>>>>>>> file filled.pdf that manifests this error, but I've been seeing this with a
>>>>>>> lot of different PDFs: display looks good, print looks bad. I can attach
>>>>>>> another file to PDFBOX-483
>>>>>>> <https://issues.apache.org/jira/browse/PDFBOX-483> if you'd like.
>>>>>> I've tried that pdf and it works like a charm except for some misplaced
>>>>>> characters. I'm using ubuntu linux, java 1.6.0_15 32bit and a HP Laserjet
>>>>>> 2550N.
>>>>> I've made another test on my MacBook (Mac OS X 10.6, jdk 1.6.0_17 64bit,
>>>>> same printer) and it works well too.
>>>> I'd like to know if anyone has repeated the experiment on any
>>>> Windows-based platform, since Ubuntu and OSX are both Unix-like. If
>>>> someone else can reproduce the failure on Windows, I'll start trusting my
>>>> sanity again.
>>> I've been a software developer for many years and sometimes it leads to
>>> insanity, but we all have to do our best not to end up in the programmers'
>>> nuthouse ;-))
>>> 
>>> I'll see if I can find some time to run that test on my rarely used
>>> windows box.
>>> 
>>> BR
>>> Andreas Lehmkühler
>>> 
>> 
>> 
> 
> --- end of original message ---
>

Re: Reopen PDFBOX-483?

2010-03-09 Thread Maruan Sahyoun

Hi Andreas,

yes, the results are similar BUT most of the text and some of the lines are 
missing. Converting to Image output using PDFToImage provides a different and 
much better result where all text and lines are included and only some 
misplacement occurs. Is there a way to submit the attachment so you can see for 
yourself?

Maruan Sahyoun

On 09.03.2010 at 13:38, Andreas Lehmkühler wrote:

> Hi,
> 
> Subject: Re: Reopen PDFBOX-483?
> Sent: Tue, 09 Mar 2010
> From: Maruan Sahyoun
> 
>> Hi,
>> 
>> please find enclosed the result of the printing test conducted on Windows
>> 2003 Server SP2 32 bit, Java 1.5, using a fresh build from trunk. The test
>> was done using the Adobe PDF printer driver as well as Apple and HP
>> PostScript printers, with similar results.
> Thanks for testing. Your attachments didn't make it due to some restrictions 
> of the mailing list.
> Probably it would be sufficient to describe the results. Let me guess, they 
> are all similar. All
> contain text, some characters are misplaced and a wrong font is used.
> 
> BR
> Andreas Lehmkühler
>

Re: pdfbox develpment

2010-03-09 Thread Maruan Sahyoun

Hi ,

I started with the documentation of some tools and opened an issue in JIRA for 
that (PDFBOX-653). Please let me know if that workflow is OK for you or if I 
should use a different approach. 

Kind regards
 
Maruan Sahyoun

On 09.03.2010 at 09:37, Andreas Lehmkühler wrote:

> Hi,
> 
> Subject: Re: pdfbox develpment
> Sent: Tue, 09 Mar 2010
> From: Maruan Sahyoun
> 
>> Hi,
>> 
>> we were looking to start fixing some of the open issues but can instead
>> develop some small tutorials for common tasks like text extraction, forms
>> handling and highlighting.
>> 
>> WDYT
> Sounds good to me. Some of the command line utilities are already described 
> at [1] and
> some other documentation can be found at [2], so that will be a good point to 
> start.
> IMHO, the following command line tools should be described anyway:
> 
> - PDFSplit, PDFMerger, Overlay
> - PDFReader
> - PDFDebugger
> 
> These can be found here [3]. Probably we should describe some/all of the 
> examples
> which can be found here [4]. The sources for the documentation itself can be 
> found here [5]
> 
> BR
> Andreas Lehmkühler
> 
> [1] http://pdfbox.apache.org/commandlineutilities/index.html
> [2] http://pdfbox.apache.org/userguide/index.html
> [3] http://svn.apache.org/viewvc/pdfbox/trunk/src/main/java/org/apache/pdfbox/
> [4] 
> http://svn.apache.org/viewvc/pdfbox/trunk/src/main/java/org/apache/pdfbox/examples/
> [5] http://svn.apache.org/viewvc/pdfbox/trunk/src/site/
>> 
>> Kind regards
>> 
>> Maruan Sahyoun
>> 
>> On 09.03.2010 at 07:58, Andreas Lehmkuehler wrote:
>> 
>>> Hi,
>>> 
>>> Michael Müller wrote:
>>>> Daniel,
>>>> Yes, I found some activity on the lists. But the project site lists
>>>> neither developers nor committers. Just missing documentation? ;-)
>>>> Great to hear this project is alive.
>>>> I have big problems using it, due to missing or vague docs.
>>>> E.g. setTextMatrix:
>>>> public void setTextMatrix(double a, double b, double c, double d, double
>>>> e, double f)
>>>> What are a, b, c, d, e, f? I figured out that e and f are coordinates. It
>>>> would be much better to name these x and y or to enhance the documentation.
>>> These values correspond to the naming used in the PDF reference for a
>>> matrix.
>>> 
>>>> Maybe enhancing the documentation is an entry point for me to support the
>>>> project? Or do any docs exist besides the published Javadocs?
>>> Be our guest, a good and complete documentation is always useful,
>>> especially for beginners.
>>> 
>>> BR
>>> Andreas Lehmkühler
>> 
>> 
> 
> --- end of original message ---
>

Re: Reopen PDFBOX-483?

2010-03-09 Thread Maruan Sahyoun

Hi,

FYI - using PDFReader the PDF is displayed OK, but when printed the same results 
are produced as with PrintPDF. The printed output contains the variable data 
only (and some lines); boilerplate text is not printed.

Maruan Sahyoun

On 09.03.2010 at 13:58, Andreas Lehmkühler wrote:

> Hi,
> 
> Subject: Re: Reopen PDFBOX-483?
> Sent: Tue, 09 Mar 2010
> From: Maruan Sahyoun
> 
>> Hi ,
>> 
>> please find enclosed the text extracted from the printed PDF. Extraction was
>> done using Adobe Acrobat 8.
>> 
>> X0X0X0 X0X0X05
>> X0X0X0 X0X0X05
>> X0X0X0 X0X0X05 
>> X0X0X05 MM/DD/ X0X2 
>> X0X2 
>> X0X0X0X X0X0X0X
>> X0X0X05 X0X0X05
>> X0X0X05 X0X0X05
>> X0X0X0X X0X0X05 
>> X0X0X05 
>> 
>> X0X0 
>> X0X0 
>> X05 X0X0X05 
>> MM/DD/ X0X0X05 X0X0 
>> 
> Hmm, that's odd. I'll run my own tests later when I'm at home. So it seems 
> to be a Windows-only issue after all. I'll also file an issue in JIRA.
> 
> Thanks for the tests!
> 
> BR
> Andreas Lehmkühler
> 
>> 
>> Maruan Sahyoun
>> 
>> On 09.03.2010 at 13:45, Maruan Sahyoun wrote:
>> 
>>> Hi Andreas,
>>> 
>>> yes, the results are similar BUT most of the text and some of the lines
>>> are missing. Converting to image output using PDFToImage provides a
>>> different and much better result where all text and lines are included and
>>> only some misplacement occurs. Is there a way to submit the attachment so
>>> you can see for yourself?
>>> 
>>> Maruan Sahyoun
>>> 
>>> On 09.03.2010 at 13:38, Andreas Lehmkühler wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Subject: Re: Reopen PDFBOX-483?
>>>> Sent: Tue, 09 Mar 2010
>>>> From: Maruan Sahyoun
>>>> 
>>>>> Hi,
>>>>> 
>>>>> please find enclosed the result of the printing test conducted on Windows
>>>>> 2003 Server SP2 32 bit, Java 1.5, using a fresh build from trunk. The test
>>>>> was done using the Adobe PDF printer driver as well as Apple and HP
>>>>> PostScript printers, with similar results.
>>>> Thanks for testing. Your attachments didn't make it due to some
>>>> restrictions of the mailing list.
>>>> Probably it would be sufficient to describe the results. Let me guess,
>>>> they are all similar. All
>>>> contain text, some characters are misplaced and a wrong font is used.
>>>> 
>>>> BR
>>>> Andreas Lehmkühler
>>>> 
>>> 
>> 
>> 
> 
> --- end of original message ---
>

Re: Reopen PDFBOX-483?

2010-03-09 Thread Maruan Sahyoun

Hi ,

please find enclosed the text extracted from the printed PDF. Extraction was 
done using Adobe Acrobat 8.

X0X0X0 X0X0X05 
X0X0X0 X0X0X05 
X0X0X0 X0X0X05 
X0X0X05 MM/DD/ X0X2 
X0X2 
X0X0X0X X0X0X0X 
X0X0X05 X0X0X05 
X0X0X05 X0X0X05 
X0X0X0X X0X0X05 
X0X0X05 

X0X0 
X0X0 
X05 X0X0X05 
MM/DD/ X0X0X05 X0X0 


Maruan Sahyoun




On 09.03.2010 at 13:45, Maruan Sahyoun wrote:

> Hi Andreas,
> 
> yes, the results are similar BUT most of the text and some of the lines are 
> missing. Converting to Image output using PDFToImage provides a different and 
> much better result where all text and lines are included and only some 
> misplacement occurs. Is there a way to submit the attachment so you can see 
> for yourself?
> 
> Maruan Sahyoun
> 
> On 09.03.2010 at 13:38, Andreas Lehmkühler wrote:
> 
>> Hi,
>> 
>> Subject: Re: Reopen PDFBOX-483?
>> Sent: Tue, 09 Mar 2010
>> From: Maruan Sahyoun
>> 
>>> Hi,
>>> 
>>> please find enclosed the result of the printing test conducted on Windows
>>> 2003 Server SP2 32 bit, Java 1.5, using a fresh build from trunk. The test
>>> was done using the Adobe PDF printer driver as well as Apple and HP
>>> PostScript printers, with similar results.
>> Thanks for testing. Your attachments didn't make it due to some restrictions 
>> of the mailing list.
>> Probably it would be sufficient to describe the results. Let me guess, they 
>> are all similar. All
>> contain text, some characters are misplaced and a wrong font is used.
>> 
>> BR
>> Andreas Lehmkühler
>> 
>

Re: Reopen PDFBOX-483?

2010-03-10 Thread Maruan Sahyoun

Hi,

I did some initial "debugging" and it seems that the content of the form fields 
(date part) is being printed, but the form template itself, which is held in 
Pages:Kids:Resources:XObject, is not printed. Unfortunately, as I'm still 
learning the PDFBox code, I can't provide more help at this point.

Kind regards

Maruan



On 09.03.2010 at 21:01, Andreas Lehmkuehler wrote:

> Hi,
> 
> steve poling wrote:
>> Andreas Lehmkuehler wrote:
>>>>> If you go to PDFBOX-490, you'll find attached 
>>>>> file filled.pdf that manifests this error, but I've been seeing this with 
>>>>> a lot of different PDFs: display looks good, print looks bad. I can 
>>>>> attach another file to PDFBOX-483 if you'd like.
>>>> I've tried that pdf and it works like a charm except for some misplaced 
>>>> characters. I'm using ubuntu linux, java 1.6.0_15 32bit and a HP Laserjet 
>>>> 2550N.
>>> I've made another test on my MacBook (Mac OS X 10.6, jdk 1.6.0_17 64bit, 
>>> same printer) and it works well too.
>> I'd like to know if anyone has repeated the experiment on any Windows-based 
>> platform, since Ubuntu and OSX are both Unix-like. If someone else can 
>> reproduce the failure on Windows, I'll start trusting my sanity again.
> Good news Steve, you're obviously not insane. ;-) Maruan confirmed your issue
> on Windows 2003 and I've tested it on my WinXP with jdk 1.6.0_13 with the same
> result. The print looks bad. I have no explanation yet, except that it seems
> to be Windows only. For now I don't have a clue where to look. Perhaps I will
> have an idea in a few days ...
> 
> BR
> Andreas Lehmkühler
>

PDFBox documentation

2010-03-11 Thread Maruan Sahyoun

I've finished documenting the command line tools available within PDFBox at 
the same level of detail as the already documented ones. I do think 
that this needs to be enhanced at a later stage.

I'm now looking into documenting some common tasks. For that I would like to 
restructure the content which is available under 'Developers Guide' a bit.

Proposed structure

Index
Building PDFBox
Tutorials
Cookbook
FAQ 
Redistribute PDFBox
Fonts
.NET Version

Tutorials will contain the current content of Bookmarks, File References, 
Highlighting, Metadata and Text Extraction, possibly to be enhanced at a later 
stage.
Cookbook will document the examples available with PDFBox, and new ones may be 
added at a later stage.

WDYT

Kind regards 

Maruan Sahyoun

Re: PDFBox documentation

2010-03-11 Thread Maruan Sahyoun

I think it would be very good to enhance the font section possibly covering the 
different font types available and how they are supported (or not)  in PDFBox. 
So your input is very welcome ;-)

Maruan Sahyoun

On 12.03.2010 at 05:09, nisen wrote:

> +1
> Maybe I can contribute the Chinese version and some ideas about Fonts.
> 
> 2010/3/12 Philipp Koch :
>> +1
>> 
>> regards,
>> philipp
>> 
>> On Thu, Mar 11, 2010 at 9:25 PM, Andreas Lehmkuehler  
>> wrote:
>>> +1
>>> 
>>> Maruan Sahyoun wrote:
>>>> 
>>>> I've finished documenting the command line tools available within PDFBox
>>>> using the same depth as was available for the already documented ones. I do
>>>> think that this needs to be enhanced at a later stage.
>>>> 
>>>> I'm now looking into documenting some common tasks. For that I would like
>>>> to restructure the content which is available under 'Developers Guide' a
>>>> bit.
>>>> 
>>>> Proposed structure
>>>> 
>>>> Index
>>>> Building PDFBox
>>>> Tutorials
>>>> Cookbook
>>>> FAQ
>>>> Redistribute PDFBox
>>>> Fonts
>>>> .NET Version
>>>> 
>>>> Tutorials will contain the current content of Bookmarks, File References,
>>>> Highlighting, Metadata and Text Extraction. To possibly be enhanced at a
>>>> later stage.
>>>> Cookbook will document the examples available with PDFBox and possibly add
>>>> some new at a later stage.
>>>> 
>>>> WDYT
>>>> 
>>>> Kind regards
>>>> Maruan Sahyoun
>>>> 
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> nisen(English Name)/倪森(Chinese Name)
> Blog: http://nisen.javaeye.com

GSoC

2010-03-12 Thread Maruan Sahyoun

Is PDFBox participating in GSoC 2010?

Maruan Sahyoun
