RE: [fw-general] Extracting data out of PDF with Zend_Pdf?
Hi Markus,

Yes, it looks like it's done :)

Historically, the PDF Info structure has been used for storing Title, Author, Subject, etc. Since PDF 1.4, Metadata streams may also be used for this, and they may contain much more information.

I saw somewhere in the documentation (I don't remember exactly where) that these structures may not be synchronized. There was an algorithm for choosing the more current information based on some timestamps. But it's a really good idea to keep them synchronized.

With best regards,
Alexander Veremyev.

-----Original Message-----
From: Markus Fischer [mailto:[EMAIL PROTECTED]
Sent: Friday, August 31, 2007 2:22 AM
To: Alexander Veremyev
Cc: Zend Framework General
Subject: Re: [fw-general] Extracting data out of PDF with Zend_Pdf?

Hey!

This is great, I just saw your commit and tested it. I saw that the API has changed:

* $oPdf->properties is now a property, not a method anymore
* $oPdf->getMetaData() returns an XML RDF sequence

I tested it with quite a few PDFs and it worked very well. I also realized that the amount of information in the properties can vary; some have a Title, others don't.

Is there a difference in practice between the distilled information available through the properties property and the RDF data?

thank you!
- Markus
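To make the two sources discussed above concrete, here is a minimal sketch of reading Title and Author out of an XMP packet (the XML RDF data) and combining them with an Info-style array. The function names `extractXmpFields()` and `mergeDocumentInfo()` are hypothetical and not part of Zend_Pdf. The sketch assumes the packet uses the standard `rdf:` and `dc:` namespaces, and its "Info first, XMP as fallback" policy is an arbitrary choice for illustration, not the timestamp-based algorithm mentioned above.

```php
<?php
// Hypothetical helper (not part of Zend_Pdf): pull Title and Author out of an
// XMP packet. Assumes well-formed XML with the standard rdf: and dc: namespaces.
function extractXmpFields(string $xmp): array
{
    $fields = ['Title' => null, 'Author' => null];

    // Suppress libxml warnings for broken packets and restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmp);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($xml === false) {
        return $fields; // unparsable packet: report nothing rather than guess
    }

    $xml->registerXPathNamespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
    $xml->registerXPathNamespace('dc', 'http://purl.org/dc/elements/1.1/');

    // dc:title is usually an rdf:Alt of language alternatives; take the first.
    $titles = $xml->xpath('//dc:title//rdf:li');
    if ($titles) {
        $fields['Title'] = trim((string) $titles[0]);
    }

    // dc:creator is usually an rdf:Seq of names; join them.
    $creators = $xml->xpath('//dc:creator//rdf:li');
    if ($creators) {
        $names = array_map(fn($node) => trim((string) $node), $creators);
        $fields['Author'] = implode(', ', $names);
    }

    return $fields;
}

// Hypothetical policy: prefer the Info value, fall back to XMP when it is missing.
function mergeDocumentInfo(array $info, array $xmpFields): array
{
    $merged = [];
    foreach (['Title', 'Author'] as $key) {
        $fromInfo = isset($info[$key]) && $info[$key] !== '' ? $info[$key] : null;
        $merged[$key] = $fromInfo ?? $xmpFields[$key] ?? null;
    }
    return $merged;
}

// Usage sketch with the API described in this thread (needs Zend Framework):
// $pdf  = Zend_Pdf::load($sFile);
// $data = mergeDocumentInfo($pdf->properties, extractXmpFields((string) $pdf->getMetaData()));
```

Since the two structures may be out of sync, a real application would probably want to compare modification timestamps before trusting either value; that part is left out here.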
RE: [fw-general] Extracting data out of PDF with Zend_Pdf?
Zend_Pdf preloads the PDF object reference tables and pages. Both operations take considerable time and memory. I think page loading may be omitted in some cases, which may save a lot of resources, but it should be tested.

Could I ask you to do this? :) (It looks like you have a good set of real-world PDF examples.)

Please comment out line 294 of the library/Zend/Pdf.php file (current SVN version):
---
//$this->_loadPages($this->_trailer->Root->Pages);
---

Note: the $pdf->pages array will be empty.

With best regards,
Alexander Veremyev.

-----Original Message-----
From: Markus Fischer [mailto:[EMAIL PROTECTED]
Sent: Friday, August 31, 2007 3:14 AM
To: Alexander Veremyev
Cc: Zend Framework General
Subject: Re: [fw-general] Extracting data out of PDF with Zend_Pdf?

I just discovered another need ... however, I think this won't be easily implemented.

Currently the complete PDF needs to be parsed into memory, even if all I want from a PDF is the metadata information. Would it be possible to implement a smart way to extract metadata information without parsing everything into memory ... ?

Some PDF files I tested needed more than 128M of memory to be parsed, even though all I need is Title and Author ... and besides memory, it also takes quite some time.

thanks,
- Markus
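For the comparison requested above (loading with and without the page-loading line), a small timing helper may be handy. This is only a sketch: `measure()` is a hypothetical name, and it reports memory growth across the call rather than peak usage, so a load that temporarily allocates more than it retains will look cheaper than it really is.

```php
<?php
// Hypothetical helper: run a callable and report elapsed time and memory growth.
function measure(callable $task): array
{
    $memoryBefore = memory_get_usage();
    $start = microtime(true);

    // Keep the result referenced so the memory it occupies is counted below.
    $result = $task();

    return [
        'seconds' => microtime(true) - $start,
        'bytes'   => memory_get_usage() - $memoryBefore,
        'result'  => $result,
    ];
}

// Usage sketch (needs Zend Framework on the include path; $sFile is a placeholder):
// $stats = measure(function () use ($sFile) { return Zend_Pdf::load($sFile); });
// printf("%.2f s, %.1f MB\n", $stats['seconds'], $stats['bytes'] / 1048576);
```

Running it once against the stock Pdf.php and once with the line commented out, on the same set of files, gives a rough before/after picture.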
Re: [fw-general] Extracting data out of PDF with Zend_Pdf?
Hi,

Alexander Veremyev wrote:
> Zend_Pdf preloads the PDF object reference tables and pages. Both operations take considerable time and memory. I think page loading may be omitted in some cases, which may save a lot of resources, but it should be tested.
>
> Could I ask you to do this? :) (It looks like you have a good set of real-world PDF examples.)
>
> Please comment out line 294 of the library/Zend/Pdf.php file (current SVN version):
> ---
> //$this->_loadPages($this->_trailer->Root->Pages);
> ---
>
> Note: the $pdf->pages array will be empty.

I tested it and it worked quite well. It's much faster, of course, and memory consumption is more conservative.

But the number of pages (currently I get this information only with count($pdf->pages)) is one of the important pieces of metadata I need to know about a PDF. Is there a chance to get the number of pages without parsing the complete PDF into memory?

thanks,
- Markus
RE: [fw-general] Extracting data out of PDF with Zend_Pdf?
Hi Markus,

Many thanks for the testing! It looks like an info-only PDF loading mode would be a good feature to have.

The number of document pages is currently calculated dynamically. The pages structure is usually a tree with pages at the leaves, so it's necessary to load each tree element to check whether it's a page node or a pages aggregation node. That triggers loading of the complete page data.

It's also possible to get the number of pages without actually processing the tree. The root pages tree node contains the number of leaves under it (== number of pages). It can be retrieved in the context of a Zend_Pdf object with the following expression:
---
$this->_trailer->Root->Pages->Count->value
---

I am thinking about what the best API would be for retrieving the number of pages this way...

With best regards,
Alexander Veremyev.

-----Original Message-----
From: Markus Fischer [mailto:[EMAIL PROTECTED]
Sent: Friday, August 31, 2007 9:51 PM
To: Alexander Veremyev
Cc: Zend Framework General
Subject: Re: [fw-general] Extracting data out of PDF with Zend_Pdf?

I tested it and it worked quite well. It's much faster, of course, and memory consumption is more conservative.

But the number of pages (currently I get this information only with count($pdf->pages)) is one of the important pieces of metadata I need to know about a PDF. Is there a chance to get the number of pages without parsing the complete PDF into memory?

thanks,
- Markus
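The difference between the two ways of counting pages can be illustrated with a toy model. In this sketch plain PHP arrays stand in for PDF dictionaries; the function names are hypothetical and nothing here touches Zend_Pdf internals. Reading the root `Count` assumes the file's page tree is consistent, which a damaged or hand-edited PDF may not guarantee.

```php
<?php
// Toy page tree: a 'Pages' node aggregates 'Kids' and caches the number of
// leaf pages below it in 'Count'; a 'Page' node is a leaf.
$pageTree = [
    'Type'  => 'Pages',
    'Count' => 3,
    'Kids'  => [
        ['Type' => 'Page'],
        [
            'Type'  => 'Pages',
            'Count' => 2,
            'Kids'  => [['Type' => 'Page'], ['Type' => 'Page']],
        ],
    ],
];

// Visits every node, which in a real parser means loading every page object.
function countPagesByWalking(array $node): int
{
    if ($node['Type'] === 'Page') {
        return 1;
    }
    $total = 0;
    foreach ($node['Kids'] as $kid) {
        $total += countPagesByWalking($kid);
    }
    return $total;
}

// Reads the cached leaf count from the root node only; no traversal needed.
function countPagesFromRoot(array $root): int
{
    return (int) $root['Count'];
}

printf("walking: %d, root Count: %d\n", countPagesByWalking($pageTree), countPagesFromRoot($pageTree));
```

The second function corresponds in spirit to the `Root->Pages->Count` expression quoted above: it answers the question from a single node instead of the whole tree.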
Re: [fw-general] Extracting data out of PDF with Zend_Pdf?
I just discovered another need ... however, I think this won't be easily implemented.

Currently the complete PDF needs to be parsed into memory, even if all I want from a PDF is the metadata information. Would it be possible to implement a smart way to extract metadata information without parsing everything into memory ... ?

Some PDF files I tested needed more than 128M of memory to be parsed, even though all I need is Title and Author ... and besides memory, it also takes quite some time.

thanks,
- Markus

Markus Fischer wrote:
> Hey!
>
> This is great, I just saw your commit and tested it. I saw that the API has changed:
>
> * $oPdf->properties is now a property, not a method anymore
> * $oPdf->getMetaData() returns an XML RDF sequence
>
> I tested it with quite a few PDFs and it worked very well. I also realized that the amount of information in the properties can vary; some have a Title, others don't.
>
> Is there a difference in practice between the distilled information available through the properties property and the RDF data?
>
> thank you!
> - Markus
RE: [fw-general] Extracting data out of PDF with Zend_Pdf?
Hi Markus,

PDF properties processing is planned (http://framework.zend.com/issues/browse/ZF-294), but not done yet.

It's not the first request for the feature, and the implementation is relatively simple. I think it should be done in the near future.

With best regards,
Alexander Veremyev.

-----Original Message-----
From: Markus Fischer [mailto:[EMAIL PROTECTED]
Sent: Sunday, August 26, 2007 10:37 PM
To: Zend Framework General
Subject: [fw-general] Extracting data out of PDF with Zend_Pdf?

Hi,

is it supported to extract metadata information from a PDF? The information I'm seeking is

* title
* number of pages
* author

(of course, as long as the information is contained in the PDF).

I've gone through quite a few PDFs where Adobe Reader shows me title and author information, but from Zend_Pdf I get nothing back. Following the documentation, I thought I could get this information from the properties() method, e.g.

$oPdf = Zend_Pdf::load($sFile);
var_dump( $oPdf->properties() );

But the returned array was empty in all cases. I know I can get the number of pages by counting the pages property, but what about the other information?

If it's not possible with Zend_Pdf, although off-topic, what other possibilities are out there? fpdf? Or some Unix commands (I'm on Linux)?

thanks,
- Markus

ps: I was using 1.0.1
Re: [fw-general] Extracting data out of PDF with Zend_Pdf?
Hi Alexander,

thank you for answering so quickly. I'll search JIRA next time.

I'm not new to PHP, but the PDF spec is quite complex and so is the PDF implementation ... unfortunately I don't have enough time to dig into it, though I'd love to help and come up with a patch.

So I hope it will get implemented soon; this would really be great.

thanks,
- Markus

Alexander Veremyev wrote:
> Hi Markus,
>
> PDF properties processing is planned (http://framework.zend.com/issues/browse/ZF-294), but not done yet.
>
> It's not the first request for the feature, and the implementation is relatively simple. I think it should be done in the near future.
>
> With best regards,
> Alexander Veremyev.
RE: [fw-general] Extracting data out of PDF with Zend_Pdf?
Hi Markus,

Thanks for the offered help!

I mentioned the JIRA issue only to indicate that the feature had already been requested, which increases its chances of being done soon :)

Actually, I am going to take a look into it and determine plans for it tomorrow.

With best regards,
Alexander Veremyev.

-----Original Message-----
From: Markus Fischer [mailto:[EMAIL PROTECTED]
Sent: Monday, August 27, 2007 11:54 PM
To: Alexander Veremyev
Cc: Zend Framework General
Subject: Re: [fw-general] Extracting data out of PDF with Zend_Pdf?

Hi Alexander,

thank you for answering so quickly. I'll search JIRA next time.

I'm not new to PHP, but the PDF spec is quite complex and so is the PDF implementation ... unfortunately I don't have enough time to dig into it, though I'd love to help and come up with a patch.

So I hope it will get implemented soon; this would really be great.

thanks,
- Markus