RE: [fw-general] Extracting data out of PDF with Zend_Pdf?

2007-08-31 Thread Alexander Veremyev
Hi Markus,

Yes, it looks like it's done :)

Historically, the PDF Info structure has been used for storing Title,
Author, Subject, etc.
Since PDF 1.4, Metadata streams may also be used for this.

Metadata streams may contain much more info.

I saw somewhere in the documentation (I don't remember exactly where)
that these structures may not be synchronized. There was an algorithm for
choosing the more up-to-date information based on some timestamps.
But it's a really good idea to keep them synchronized.
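
As a rough illustration of the difference between the two sources: the Metadata stream is an XMP packet (RDF/XML), so the title and author have to be pulled out of the XML string, whereas the Info structure is a flat set of entries. The sketch below assumes the packet stores them in the usual Dublin Core elements (dc:title, dc:creator) in element form. The helper name xmpFirstValue() and the sample packet are made up for the example and are not part of Zend_Pdf.

```php
<?php
// Hypothetical helper, not part of Zend_Pdf: returns the first rdf:li value
// stored under a Dublin Core element (e.g. "title", "creator") of an XMP
// packet, or null if the packet is not well-formed or the field is missing.
function xmpFirstValue(string $xmp, string $dcName): ?string
{
    $doc = new DOMDocument();
    if (!@$doc->loadXML($xmp)) {
        return null; // not well-formed XML
    }

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
    $xpath->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');

    $nodes = $xpath->query('//dc:' . $dcName . '//rdf:li');
    if ($nodes === false || $nodes->length === 0) {
        return null; // field not present in this packet
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up sample packet. With Zend_Pdf the string would come from
// $pdf->getMetadata().
$xmp = <<<'XML'
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Sample Title</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
XML;

$title  = xmpFirstValue($xmp, 'title');
$author = xmpFirstValue($xmp, 'creator');
```

A field that is absent from the packet simply comes back as null, which matches the observation that some PDFs carry a Title and others don't.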


With best regards,
   Alexander Veremyev.

> -----Original Message-----
> From: Markus Fischer [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 31, 2007 2:22 AM
> To: Alexander Veremyev
> Cc: Zend Framework General
> Subject: Re: [fw-general] Extracting data out of PDF with Zend_Pdf?
>
> Hey!
>
> This is great, I just saw your commit and tested it. I saw that
> the API has changed:
>
> * $oPdf->properties is now a property, not a method anymore
> * $oPdf->getMetaData() returns an XML RDF sequence
>
> I tested it with quite a few PDFs and it worked very well. I
> also noticed that the amount of information in the
> properties can vary; some have a Title, others don't.
>
> Is there a difference in practice between the distilled
> information in the properties property and the RDF data?
>
> thank you!
> - Markus
 


RE: [fw-general] Extracting data out of PDF with Zend_Pdf?

2007-08-31 Thread Alexander Veremyev
Zend_Pdf preloads the PDF object reference tables and pages. Both
operations take considerable time and memory.

I think page loading may be omitted in some cases, and that may save a
lot of resources, but it should be tested. Could I ask you to do this?
:)  (It looks like you have a good set of real-world PDF examples.)
Please comment out line 294 of the library/Zend/Pdf.php file (current SVN
version):
---
//$this->_loadPages($this->_trailer->Root->Pages);
---
Note: the $pdf->pages array will be empty.
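
For comparing the two variants (with and without that line), something like the following could be used to record time and memory per file. This is only a sketch: measure() is a made-up helper, and the workload in the example is a stand-in so that the snippet runs without Zend Framework.

```php
<?php
// Hypothetical helper: runs $task once and reports the wall-clock time and
// how much PHP's peak memory usage grew while it ran.
function measure(callable $task): array
{
    $peakBefore = memory_get_peak_usage(true);
    $start      = microtime(true);

    $result = $task();

    return array(
        'result'      => $result,
        'seconds'     => microtime(true) - $start,
        'peak_growth' => memory_get_peak_usage(true) - $peakBefore,
    );
}

// Stand-in workload so the sketch runs on its own. For the real test the
// closure would be: function () use ($file) { return Zend_Pdf::load($file); }
$stats = measure(function () {
    return count(range(1, 100000));
});

printf("%.3f s, peak memory grew by %d bytes\n", $stats['seconds'], $stats['peak_growth']);
```

Since the peak counter covers the whole process, the growth reads 0 if an earlier step already used more memory, so it is probably best to measure each file in a fresh PHP process.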


With best regards,
   Alexander Veremyev.



Re: [fw-general] Extracting data out of PDF with Zend_Pdf?

2007-08-31 Thread Markus Fischer

Hi,

Alexander Veremyev wrote:
> Zend_Pdf preloads the PDF object reference tables and pages. Both
> operations take considerable time and memory.
>
> I think page loading may be omitted in some cases, and that may save a
> lot of resources, but it should be tested. Could I ask you to do this?
> :)  (It looks like you have a good set of real-world PDF examples.)
> Please comment out line 294 of the library/Zend/Pdf.php file (current SVN
> version):
> ---
> //$this->_loadPages($this->_trailer->Root->Pages);
> ---
> Note: the $pdf->pages array will be empty.


I tested it and it worked quite well. It's much faster, of course, and
memory consumption is much lower.

But the number of pages (currently I get this information only with
count($pdf->pages)) is one of the important pieces of metadata I
need to know about a PDF. Is there a chance to get the number of
pages without parsing the complete PDF into memory?

thanks,
- Markus


RE: [fw-general] Extracting data out of PDF with Zend_Pdf?

2007-08-31 Thread Alexander Veremyev
Hi Markus,

Many thanks for testing!

It looks like an info-only PDF loading mode would be a good feature to
have.

The number of document pages is currently calculated dynamically. The page
structure is usually a tree with pages at the leaves, so it's necessary to
load each tree element to check whether it's a page node or a pages
aggregation node. That triggers loading of the complete page data.

It's also possible to get the number of pages without actually processing
the tree. The root node of the page tree contains the number of leaves
under it (== number of pages).
It can be retrieved in the context of a Zend_Pdf object with the following
expression:
---
$this->_trailer->Root->Pages->Count->value
---
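
To make the difference between the two approaches concrete, here is a toy model of a page tree. The nested-array layout and both function names are invented for illustration and are not how Zend_Pdf stores objects internally. The only thing taken from the PDF format is that every "Pages" node carries a Count of the leaf pages below it.

```php
<?php
// Toy page tree: "Pages" nodes have Kids and a Count entry, "Page" nodes
// are leaves. The array layout is made up for this example.
$root = array(
    'Type'  => 'Pages',
    'Count' => 3,
    'Kids'  => array(
        array('Type' => 'Page'),
        array(
            'Type'  => 'Pages',
            'Count' => 2,
            'Kids'  => array(
                array('Type' => 'Page'),
                array('Type' => 'Page'),
            ),
        ),
    ),
);

// Slow way: visit every node and count the leaves. In a real PDF this is
// the step that forces every page object to be loaded.
function countPagesByWalking(array $node): int
{
    if ($node['Type'] === 'Page') {
        return 1;
    }
    $total = 0;
    foreach ($node['Kids'] as $kid) {
        $total += countPagesByWalking($kid);
    }
    return $total;
}

// Fast way: trust the Count entry of the root node, no traversal needed.
function countPagesFromRoot(array $root): int
{
    return $root['Count'];
}

$walked = countPagesByWalking($root);
$direct = countPagesFromRoot($root);
```

In a well-formed file both values agree. The second one only needs the root node, which is why it would fit an info-only loading mode.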

I am thinking about what the best API would be for retrieving the page
count this way...


With best regards,
   Alexander Veremyev.




Re: [fw-general] Extracting data out of PDF with Zend_Pdf?

2007-08-30 Thread Markus Fischer

I just discovered another need ... however, I think this won't be easy
to implement.

Currently the complete PDF needs to be parsed into memory, even if all I
want from a PDF is the metadata.

Would it be possible to implement a smart way to extract the metadata
without parsing everything into memory ... ?

Some PDF files I tested needed more than 128M of memory to be parsed,
even though all I need is Title and Author ... and besides memory, it
also takes quite some time.

thanks,
- Markus



RE: [fw-general] Extracting data out of PDF with Zend_Pdf?

2007-08-27 Thread Alexander Veremyev
Hi Markus,

PDF properties processing is planned
(http://framework.zend.com/issues/browse/ZF-294), but not done yet.

It's not the first request for this feature, and the implementation is
relatively simple. I think it should be done in the near future.


With best regards,
   Alexander Veremyev.

> -----Original Message-----
> From: Markus Fischer [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 26, 2007 10:37 PM
> To: Zend Framework General
> Subject: [fw-general] Extracting data out of PDF with Zend_Pdf?
>
> Hi,
>
> is extracting metadata from a PDF supported? The
> information I'm seeking is
> * title
> * number of pages
> * author
>
> (of course, only as long as the information is contained in the PDF).
>
> I've gone through quite a few PDFs where Adobe Reader shows
> me title and author information, but from Zend_Pdf I get nothing back.
>
> Following the documentation, I thought I could get this
> information from the properties() method, e.g.
>
> $oPdf = Zend_Pdf::load($sFile);
> var_dump( $oPdf->properties() );
>
> But the returned array was empty in all cases.
>
> I know I can get the number of pages by counting the pages
> property, but what about the other information?
>
> If it's not possible with Zend_Pdf, although off-topic, what
> other possibilities are out there? fpdf? Or some Unix
> commands (I'm on Linux)?
>
> thanks,
> - Markus
>
> ps: I was using 1.0.1


Re: [fw-general] Extracting data out of PDF with Zend_Pdf?

2007-08-27 Thread Markus Fischer

Hi Alexander,

thank you for answering so quickly. I'll search JIRA next time.

I'm not new to PHP, but the PDF spec is quite complex, and so is the PDF
implementation ... unfortunately I don't have enough time to dig into it;
I'd love to help and come up with a patch.

So I hope it will get implemented soon; this would really be great.

thanks,
- Markus



RE: [fw-general] Extracting data out of PDF with Zend_Pdf?

2007-08-27 Thread Alexander Veremyev
Hi Markus,

Thanks for offering to help!

I mentioned the JIRA issue only to indicate that the feature had already
been requested, which increases its chances of being done soon :)
Actually, I am going to take a look at it and decide on plans for it
tomorrow.

With best regards,
   Alexander Veremyev.
