Re: [Podofo-users] Example of text removal or image extraction
Hi,

I have done something of that sort: I filtered a few text instructions out of the stream, and it can be done in the manner below. Suggestions to improve it are more than welcome, particularly regarding the fact that I have to append spaces to strings.

    char * buffer = new char[ pageStreamLength * N ]; // This is bad
    PdfOutputDevice pdfOutputDevice( buffer, pageStreamLength * N );

    while( tokenizer.ReadNext( t, keyword, variant ) )
    {
        bool skip = false;

        // Stack parameters up
        if ( t != ePdfContentsType_Keyword )
        {
            stack.push( variant );
            parameters++;
            continue;
        }

        // Process the operator if we're interested in it
        string k(keyword);

        // Handle operators
        if( k == ... ){
            skip = true;
        }
        [..]

        //
        // Filter it out
        if( skip )
        {
            while( ... ) stack.pop();
            continue;
        }

        list< PdfVariant > l; //TODO: copy party
        while( parameters > 0 )
        {
            l.push_front( stack.top() );
            stack.pop();
            parameters--;
        }

        // Write operands
        for_each( l.begin(), l.end(), [ & ]( const PdfVariant & variant ) {
            variant.Write( &pdfOutputDevice, ePdfWriteMode_Clean);
            pdfOutputDevice.Write( " ", 1 );
        } );

        // Write operator
        k += " ";
        pdfOutputDevice.Write( k.c_str(), k.size() );
    }

On 27/11/2019 08:06, Jacob Pedersen wrote:
> Hi
>
> I think this is the way to go, rather than copying images over.
>
> Removing text could be done by either:
>
> 1. Replace chars in text draw commands with spaces. I guess it would be the simplest and fastest approach if possible.
> 2. Remove text draw commands (basically anything between and including BT and ET).
>
> Fonts can be left, since the PDF will just be converted into TIFF, so it must just be readable. Size does not matter since it will exist for a fraction of a second in memory.
>
> I have not found any examples of how to replace or edit the contents stream. I can iterate through it using the tokenizer, but where to go from there?
>
> Thanks for more pointers 😊
>
> -Original message-
> From: zyx
> Sent: 26 November 2019 07:51
> To: podofo-users@lists.sourceforge.net
> Subject: Re: [Podofo-users] Example of text removal or image extraction
>
> On Mon, 2019-11-25 at 18:04 +, Jacob Pedersen wrote:
> > Basically I just need to make an identical PDF without the text.
>
> Hi,
> it would be much harder, but maybe you can change the podofotxtextract tool and, instead of extracting only the text (see TextExtractor::ExtractText there), remove the text from the corresponding content streams, preserving everything but the text-related things. You might also be able to remove any font objects, because they can make up a significant part of the PDF file. Removing such objects, with all their references in the document, can be tricky. The same goes for editing the content streams.
> Bye, zyx

___
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
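For reference, option 2 from this thread (dropping everything between and including BT and ET) can be sketched without any PoDoFo calls. This is only a minimal sketch under stated assumptions: `Token` and `removeTextObjects` are hypothetical names, and the tokens are assumed to have been produced already by something like the tokenizer loop in the thread. Operands are buffered until their operator arrives and are then written with a trailing space each, as in the loop posted in the thread.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one tokenizer result:
// either an operand (isKeyword == false) or an operator keyword.
struct Token {
    bool isKeyword;
    std::string text;
};

// Rebuilds the content stream while dropping every text object,
// i.e. everything from BT up to and including the matching ET.
std::string removeTextObjects(const std::vector<Token>& tokens) {
    std::string out;
    std::vector<std::string> operands;  // operands waiting for their operator
    bool inText = false;

    for (const Token& tok : tokens) {
        if (!tok.isKeyword) {
            operands.push_back(tok.text);
            continue;
        }
        if (tok.text == "BT") inText = true;
        const bool skip = inText;        // BT, ET and everything between
        if (tok.text == "ET") inText = false;

        if (!skip) {
            for (const std::string& op : operands) {
                out += op;
                out += ' ';
            }
            out += tok.text;
            out += ' ';
        }
        operands.clear();                // operands never outlive their operator
    }
    return out;
}
```

Feeding it the tokens of `1 0 0 1 0 0 cm BT /F1 12 Tf (Hi) Tj ET /Im0 Do` keeps only the `cm` and `Do` instructions. Note that this also drops any graphics-state operators that happen to sit inside a text object, which may or may not matter for a PDF that is only going to be rasterised.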
[Podofo-users] String bounding box
Hi all,

I'd like to compute the bounding box of a given string that is part of a text drawing instruction, such as TJ or similar. I've started writing a draft for it but I've stopped halfway through, as I am unsure whether PoDoFo implements such a thing off the shelf. Does it?

While writing a very rudimentary implementation of the "Text Space Details" computation as outlined in the PDF Reference, two questions came to my mind:

1. I would need to know the horizontal displacement of a glyph and I am using GetGlyphWidth(). If that is correct, what about the vertical displacement?
2. What is the correct way to determine, given the PdfString, the GlyphId(s) contained within it?

Thanks,
Pietro.
Re: [Podofo-users] Memory leaks in PdfImage::LoadFromTiffHandle
I am not aware of such a construct being available in C++, and a quick grep of the source code does not turn up any custom code that provides equivalent functionality. Is it hidden somewhere?

On 13/11/2019 08:53, zyx wrote:
> On Tue, 2019-11-12 at 20:52 +0100, Michal Sudolsky wrote:
> > Patch attached.
>
> Hi,
> the "try {} __finally {}" would do the job as well, right?
>
> > +PdfMemoryInputStream stream( data.data(), data.size() );
>
> Until this patch the code was buildable with ancient compilers, which do not support vector::data(). Search for
>
>     // MSC before VC11 has no data member, same as BorlandC
>
> Bye, zyx
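On the question above: `__finally` is a compiler extension rather than standard C++, and the usual portable equivalent is a destructor that performs the cleanup (RAII). A minimal sketch follows, assuming a C++11 compiler (so it would not suit the ancient compilers mentioned in the thread). `ScopeGuard`, `loadAndFail` and `cleanupRunsOnThrow` are hypothetical names and not part of PoDoFo.

```cpp
#include <stdexcept>
#include <utility>

// Minimal scope guard: runs the stored callable when the guard goes out
// of scope, whether by normal return or because an exception propagates.
template <typename F>
class ScopeGuard {
public:
    explicit ScopeGuard(F f) : m_f(std::move(f)) {}
    ~ScopeGuard() { m_f(); }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    F m_f;
};

// Hypothetical loader that fails midway; setting 'released' stands in
// for freeing a buffer that would otherwise leak.
void loadAndFail(bool& released) {
    auto cleanup = [&released] { released = true; };
    ScopeGuard<decltype(cleanup)> guard(cleanup);
    throw std::runtime_error("decode failed");
}

// Returns true if the cleanup ran even though loadAndFail() threw.
bool cleanupRunsOnThrow() {
    bool released = false;
    try {
        loadAndFail(released);
    } catch (const std::runtime_error&) {
        // the guard's destructor has already run during stack unwinding
    }
    return released;
}
```

For pre-C++11 compilers the same idea works with a small hand-written class whose destructor frees the specific resource, since only lambdas and `= delete` are new here.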
Re: [Podofo-users] PoDoFo fonts cmaps "override"
Hi zyx,

I think I am one commit behind the server's trunk. I can't report any bug, as the mechanics are not yet clear enough to me to understand how I am supposed to do what I want to do.

I will try the suggestion Clayton gave me (and thanks A LOT for that) and report back if I find any issue. It will take some time though, as I am juggling many things. Please let me know if there is any specific testing you'd like me to carry out; maybe I can be helpful in that way.

P.

On 05/11/2019 07:18, zyx wrote:
> On Sun, 2019-11-03 at 22:40 +, Pietro Paolini wrote:
> > In PoDoFo terminology it seems that I'd need to create my own PdfEncoding subclass and assign it to a font somehow, but I haven't found any example. Is this possible at all ?
>
> Hi,
> what is the PoDoFo version you use, please? I recall a bug about CMap generation in the past. Maybe try svn trunk to see whether it's fixed.
> Bye, zyx
[Podofo-users] PoDoFo fonts cmaps "override"
Hi everybody,

From time to time I come across PDFs whose fonts are broken and that I cannot paste from. I think this has to do with their cmaps, and I am wondering if there is an example somewhere (if this is possible at all) where a document's font is amended, that is, its cmap is altered and the changes saved, so that subsequent programs opening the document won't be affected by the problem.

Sometimes it can be quite bizarre, as only a handful of characters are not correctly mapped into Unicode.

In PoDoFo terminology it seems that I'd need to create my own PdfEncoding subclass and assign it to a font somehow, but I haven't found any example. Is this possible at all?

Thanks,
Pietro
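To make "altering the cmap" a little more concrete: the mapping that copy and paste relies on normally lives in the font's ToUnicode CMap stream, whose entries look like `<01> <0041>` inside a `beginbfchar` / `endbfchar` block. The sketch below only generates such a block from a table of corrections; it does not touch PoDoFo at all. `makeBfcharSection` is a hypothetical name, and single-byte character codes and BMP code points are assumed. How the resulting stream would be attached to the font through PoDoFo is exactly the open question of this post.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Builds one bfchar block of a ToUnicode CMap from pairs of
// (character code -> UTF-16 code unit).
// Assumes single-byte codes and code points inside the BMP.
std::string makeBfcharSection(const std::map<std::uint8_t, std::uint16_t>& mapping) {
    std::string out = std::to_string(mapping.size()) + " beginbfchar\n";
    char line[32];
    for (const auto& kv : mapping) {
        std::snprintf(line, sizeof line, "<%02X> <%04X>\n",
                      static_cast<unsigned>(kv.first),
                      static_cast<unsigned>(kv.second));
        out += line;
    }
    out += "endbfchar\n";
    return out;
}
```

A complete ToUnicode stream also needs the surrounding CMap header, a codespace range and the closing operators, and as far as I recall a single bfchar block is limited to 100 entries, so larger tables would have to be split.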
Re: [Podofo-users] PoDoFo features
On 29/10/2019 13:31, Alistiar wrote:
> Hello,
>
> I was looking at your .pdf search tool (library) that allows data extraction from .pdf documents and I'd like to ask about its features.
>
> My intention is to use your library (APIs) from C++, and my requirement is the following: to search for and count keywords in multiple .pdf files at the same time, as well as to count all words. I'd also like to ask whether it's possible to make an exception for prepositions and conjunctions, so that they are not part of the final word count. I suppose this function won't be available directly in any API, so my question is whether it is possible to extend the APIs, or to write my own function/method, to eliminate particular words/sentences or to extend PoDoFo's functionality in any way.
>
> If it is possible to use your library for the operations mentioned above, could you please provide a short C++ code example (keyword search and count), just to help me get a better understanding of how the library (APIs) is used?

I am not a maintainer and I haven't used this library very extensively, but I've found the tools/ folder a very good place to start looking. It seems that the tools/podofotxtextract folder could be a good starting place for you.

> Just to check: I suppose that the library is fully compatible with Windows 10 and that it is fully supported (using APIs and such) in C++, as it was written in C++ (I've read that on your website, but I just want to make sure that I haven't missed anything)?
>
> Also, is your software free even for commercial use?
>
> Have a nice day. Thank you!
> Mark
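The counting part of this request does not need anything from PoDoFo once the text has been extracted (for instance by code adapted from tools/podofotxtextract). A minimal sketch, assuming the extracted text is already available as a plain ASCII string; `countWords` is a hypothetical helper, not a PoDoFo function.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <string>

// Counts words in already-extracted text. Words are runs of ASCII letters
// and digits, compared case-insensitively; anything listed in stopWords
// (prepositions, conjunctions, ...) is left out of the result.
std::map<std::string, int> countWords(const std::string& text,
                                      const std::set<std::string>& stopWords) {
    std::map<std::string, int> counts;
    std::string word;

    // Stores the word collected so far, unless it is empty or a stop word.
    auto flush = [&] {
        if (!word.empty() && stopWords.count(word) == 0) ++counts[word];
        word.clear();
    };

    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            word += static_cast<char>(std::tolower(c));
        } else {
            flush();
        }
    }
    flush();  // the text may end in the middle of a word
    return counts;
}
```

The count for a single keyword is then a lookup in the returned map, and handling several PDF files means calling the function once per file and adding up the maps. Non-ASCII text would need a proper Unicode-aware tokenizer instead of `std::isalnum`.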
[Podofo-users] GetFont()
Hi all,

before digging into the source code I'd prefer to have a steer from somebody who has more experience with the library than me.

I have a small document which uses in its stream a font named '/F2', but that font can't be looked up by the attached code, which simply compiles a main that takes two parameters: doc file and font name.

    ./main test.pdf F1

It isn't able to look the font up even though it exists, as the line

    PdfFont * font = document.GetFont( pdfObject );

returns NULL. I inspected the PDF with i7j and I can find the font under the page's '/Resources' -> '/Font' dictionary, therefore it is there indeed. Am I using the library incorrectly or doing anything silly?

If the patch fixes the problem I am more than happy to test it. I have been unable to send the PDF as an attachment; it is fairly small (300 KB) but the moderation system won't let it through nonetheless.

Thanks,
Pietro.

    // Header names were stripped by the list archive; these three are
    // what the code below needs.
    #include <cstdlib>
    #include <iostream>
    #include <podofo/podofo.h>

    using namespace std;
    using namespace PoDoFo;

    int main(int argn, char **argv)
    {
        // Doc, fontName
        try {
            PdfMemDocument document;
            document.Load( argv[1] );
            PdfPage * page = document.GetPage( 0 );
            PdfName fontName( argv[2] );

            // Initialize function for string 'rendering'
            PdfObject * pdfObject = page->GetFromResources( PdfName( "Font" ), fontName );
            if( !pdfObject ) {
                cerr << "No font name '" << fontName.GetName() << "' found";
                exit( 255 );
            }

            PdfFont * font = document.GetFont( pdfObject );
            if ( !font ) {
                cerr << "Could not look up font " << fontName.GetName() << endl;
                exit( 255 );
            }
            cerr << "Font found " << font->GetIdentifier().GetName() << endl;
        } catch( PdfError & e ) {
            cerr << e.GetError() << std::endl;
            e.PrintErrorMsg();
        }
        return 0;
    }
Re: [Podofo-users] PoDoFo browser
Hi all,

Problem solved: I needed to call GetReference() to actually load the reference.

On 10/09/2019 08:32, zyx wrote:
> On Mon, 2019-09-09 at 11:02 +0100, Pietro Paolini wrote:
> > What the correct approach to do what I want with PoDoFo ?
>
> Hi,
> I would try something like this while loop:
> https://sourceforge.net/p/podofo/code/HEAD/tree/podofo/trunk/tools/podofoimgextract/ImageExtractor.cpp#l60
> Bye, zyx
Re: [Podofo-users] PoDoFo browser
Hi,

thanks for that, but I get stuck at the same point nonetheless:

    auto it = page->GetResources()->GetDictionary().begin();
    while ( it != page->GetResources()->GetDictionary().end() ) {
        std::cout << it->first.GetName() << ","
                  << it->second->Reference().ObjectNumber() << " "
                  << it->second->Reference().GenerationNumber() << std::endl;
        ++it;
    }

    Font,0 0
    ProcSet,0 0
    XObject,0 0
    ExtGState,0 0
    Font,0 0
    ProcSet,0 0
    XObjec

Names are OK but references are all zeros ...

Thanks,
P.

On 10/09/2019 08:32, zyx wrote:
> On Mon, 2019-09-09 at 11:02 +0100, Pietro Paolini wrote:
> > What the correct approach to do what I want with PoDoFo ?
>
> Hi,
> I would try something like this while loop:
> https://sourceforge.net/p/podofo/code/HEAD/tree/podofo/trunk/tools/podofoimgextract/ImageExtractor.cpp#l60
> Bye, zyx
Re: [Podofo-users] PoDoFo browser
Hi Matthew,

Thanks a lot for that. If it has been unmaintained for so long, I had better not spend too much time on it; it isn't essential for me to get it to work and I'd overwhelm you with questions. However, if that can help you in maintaining the project, flag it up and I will spend some time on it.

I am very keen on getting the "main" lib to do what I want, and I am getting stuck, or "lost in a glass of water" as we say in Italy. What I want to achieve is reasonably simple: I want to list all XObjects and their subtype for a given page. I am banging my head here (sorry about the formatting, I've been wrestling with Thunderbird to get it right):

    // .. for all pages ..
    PoDoFo::PdfPage *page = document.GetPage(i);
    PoDoFo::PdfObject *xobject = page->GetResources()
        ->GetDictionary().GetKey(PoDoFo::PdfName("XObject"));
    if (xobject == NULL)
        continue;

    auto it = xobject->GetDictionary().begin();
    while (it != xobject->GetDictionary().end()) {
        std::cout << it->first.GetName() << ","
                  << it->second->Reference().ObjectNumber() << " "
                  << it->second->Reference().GenerationNumber() << std::endl;
        std::cout << document.GetObjects()
                         ->GetObject( it->second->Reference())->GetDataTypeString()  // CRASH!
                  << std::endl;
        ++it;
    }

It crashes all the time, as the object cannot be loaded for some reason. That became obvious when I checked the ObjectNumber()/GenerationNumber() pair, which seems to be 0,0 all the time.

What is the correct approach to do what I want with PoDoFo?

Thanks again,
Pietro

On 06/09/2019 20:59, Matthew Brincke wrote:
> Hello Pietro, hello all,
>
> On 06 September 2019 at 19:33 Pietro Paolini wrote:
> > Hi all,
> >
> > I am following the instruction to compile the PoDoFo browser from the page:
> > http://podofo.sourceforge.net/download.html
>
> the PoDoFoBrowser hasn't been changed/maintained for the whole decade (from 2011, of course) so it's "normal" that it can break on newer systems.
>
> > However at the moment of the checkout I get an error with externals.
> >
> >     U   trunk
> >     svn: warning: W205011: Error handling externals definition for 'trunk/externals/required_libpodofo':
> >     svn: warning: W170013: Unable to connect to a repository at URL 'http://podofo.svn.sourceforge.net/svnroot/podofo/podofo/tags/RELEASE_0_8_4'
> >     Checked out revision 1998.
> >     svn: E205011: Failure occurred processing one or more externals definitions
>
> That URL is outdated. Please use the svn checkout option --ignore-externals for your "main" checkout and then manually check out the tag using the current URL
> https://svn.code.sf.net/p/podofo/code/podofo/tags/RELEASE_0_8_4
> but note well that it's very outdated and likely to be very broken compared to current PoDoFo. PoDoFoBrowser also has probably never been tested since Qt 4.3 was released.
>
> > Is there anything I am doing wrong ?
>
> Except trying to get something long unmaintained, not really (please see above for a workaround, but the whole browser thing hasn't been tested for long, so it's very likely to break).
>
> > Thanks,
> > Pietro
>
> Best regards, mabri
[Podofo-users] PoDoFo browser
Hi all,

I am following the instruction to compile the PoDoFo browser from the page:

http://podofo.sourceforge.net/download.html

However, at the moment of the checkout I get an error with externals.

    U   trunk
    svn: warning: W205011: Error handling externals definition for 'trunk/externals/required_libpodofo':
    svn: warning: W170013: Unable to connect to a repository at URL 'http://podofo.svn.sourceforge.net/svnroot/podofo/podofo/tags/RELEASE_0_8_4'
    Checked out revision 1998.
    svn: E205011: Failure occurred processing one or more externals definitions

Is there anything I am doing wrong?

Thanks,
Pietro
Re: [Podofo-users] Podofo rendering
Hi Matthew,

I resorted to the shared-library solution as it worked well for me; at least it builds. I haven't tried to run anything yet.

For the sake of my personal understanding:

On 13/08/2019 13:40, Matthew Brincke wrote:
> Then podofo was automatically configured with JPEG support and you'll need to link to the libjpeg it found in your project too (when using a static libpodofo build). When using a shared library build and not doing make install, using LD_LIBRARY_PATH (on GNU/Linux) or changing the dynamic linker config would be required to have your program find the libpodofo shared library. For podofo-built programs this is required only when moving the library to a non-standard location, or accessing it through a different non-standard path (like in a sandbox), because the build process embeds a run-path in them.

This is the CMakeLists.txt for my microscopic project:

    ADD_EXECUTABLE(main main.cpp)
    TARGET_LINK_LIBRARIES(main ${PODOFO_INSTALL_TOP}/lib/libpodofo.a)

I thought this would do the work. Do I need to manually specify all the dependent libraries, something along the lines of

    TARGET_LINK_LIBRARIES(main /path/to/libjpeg{libtiff..}

It's been a while since I programmed in C/C++, so forgive me some slip-ups. I am also totally new to CMake.

Thanks,
Pietro
Re: [Podofo-users] Podofo rendering
Hi Matthew,

Thanks a lot for your answer. I am not bothered at all and I am more than happy to stay on trunk; I just intended to flag this up.

I've also found that the example "pdfcontentsgraph" is not built as part of the main build.

Thanks,
P.

On 12/08/2019 23:36, Matthew Brincke wrote:
> Hello Pietro, hello all,
>
> On 12 August 2019 at 23:24 Pietro Paolini wrote:
> > Hi all,
> >
> > I've stumbled upon this library only a few days ago and I noticed that the downloadable version from
> > http://sourceforge.net/projects/podofo/files/podofo/0.9.6/podofo-0.9.6.tar.gz/download
> > does not compile, at least on my system, while what I get right off trunk does compile.
>
> in the meantime some issues were fixed which could have made it difficult to compile podofo; a number of security/crash issues were also fixed, so it is definitely recommended to use only current svn trunk. On Tuesday, August 13, in the afternoon, I'll very probably commit (I'm a full committer) a fix for issue #58 (plus some debug code to keep such issues from being as difficult to debug as this one was for me). I'd just like to run some further tests with it, which are likely to pass; could you please hold out until then?
>
> > I am exploring the possibility of using the library to inspect PDFs, making some analysis on them and saving the PDF result of some processing. A good example could be hidden text. It seems to be able to parse the input PDF and to "translate" it into a "PdfVecObjects" but I have not found - within the time I had available for it, I should mention - a way to render the PDFs on screen.
>
> PoDoFo is suitable for analyzing PDF documents, modifying them (text editing is still very limited; mostly only adding is supported so far) and creating them, but what it doesn't do at all (because it's outside its scope) is rendering. There are other libraries for that; a popular free (open source) one is libpoppler (homepage URL https://poppler.freedesktop.org/ ). Except for some commenting functionality it does exclusively rendering (AFAIK), so IMHO it's a good complement to PoDoFo, which you'd use for the analysis and transformation (sorry, colour space support is still rather limited, I hope that'll change before 1.0 ;-) ).
>
> > Is there an example somewhere for it ?
>
> For rendering, please see the libpoppler homepage; for creation, there are examples in the examples directory under podofo trunk; for analysis, please see the tools directory's sub-directories there (podofopdfinfo could be a good start).
>
> > Thanks,
> > Pietro.
>
> I hope this helps.
>
> Best regards, Matthew
Re: [Podofo-users] Podofo rendering
Hi Matthew,

I spoke too early and I have another question, if you don't mind me hammering.

I've written a very simple program to check the PdfTokenizer; however, it fails to link as some external dependencies seem to be missing.

    int main(int argn, char **argv)
    {
        [...]
        PoDoFo::PdfMemDocument doc (filename.c_str());
        // Page indices are 0-based
        for (int i = 0; i < doc.GetPageCount(); i++) {
            PoDoFo::EPdfContentsType t;
            const char * text;
            PoDoFo::PdfVariant variant;
            bool readToken;
            PoDoFo::PdfContentsTokenizer tokenizer( doc.GetPage(i));
            while ( ( readToken = tokenizer.ReadNext(t, text, variant) ) ) {
                [..]
            }
        }
    }

I get many errors, among them:

    [..]/lib/libpodofo.a(PdfFiltersPrivate.cpp.o): In function `PoDoFo::PdfDCTFilter::EndDecodeImpl()':
    PdfFiltersPrivate.cpp:(.text+0x2146): undefined reference to `jpeg_read_header'
    PdfFiltersPrivate.cpp:(.text+0x215f): undefined reference to `jpeg_destroy_decompress'
    PdfFiltersPrivate.cpp:(.text+0x21bd): undefined reference to `jpeg_start_decompress'
    PdfFiltersPrivate.cpp:(.text+0x22b5): undefined reference to `jpeg_read_scanlines'
    PdfFiltersPrivate.cpp:(.text+0x24fd): undefined reference to `jpeg_destroy_decompress'

I think they all come from the same root problem. I do have libjpeg-dev (Debian system) installed locally though, and I wonder what the problem may be.

This is the CMakeLists.txt for my microscopic project:

    ADD_EXECUTABLE(main main.cpp)
    TARGET_LINK_LIBRARIES(main ${PODOFO_INSTALL_TOP}/lib/libpodofo.a)
    INCLUDE_DIRECTORIES(${PODOFO_INSTALL_TOP}/include)

Best Regards,
Pietro.
[Podofo-users] Podofo rendering
Hi all,

I've stumbled upon this library only a few days ago and I noticed that the downloadable version from

http://sourceforge.net/projects/podofo/files/podofo/0.9.6/podofo-0.9.6.tar.gz/download

does not compile, at least on my system, while what I get right off trunk does compile.

I am exploring the possibility of using the library to inspect PDFs, making some analysis on them and saving the PDF result of some processing. A good example could be hidden text. It seems to be able to parse the input PDF and to "translate" it into a "PdfVecObjects", but I have not found (within the time I had available for it, I should mention) a way to render the PDFs on screen.

Is there an example somewhere for it?

Thanks,
Pietro.