Re: Extract Embedded files from pdf using pdfbox in .NET application

Andreas Lehmkuehler Sat, 22 Jun 2013 00:08:41 -0700

Hi,


On 20.06.2013 12:48, Ramesh Shrestha wrote:

Thanks,

As per your suggestion, using the annotation I was able to extract the name
of the embedded file. However, the contents of that file could not be
extracted. Please refer to the code below.

var originalDocument = PDDocument.load(_PdfFile);
var originalCatalog = originalDocument.getDocumentCatalog();
java.util.List sourceDocumentPages = originalCatalog.getAllPages();
var newDocument = new PDDocument();

// number of pages in pdf file = 2
int[] PageNumbers = { 1, 2 };

foreach (var pageNumber in PageNumbers)
{
    // Page numbers are 1-based, but PDPages are contained in a zero-based array:
    int pageIndex = pageNumber - 1;
    PDPage pdpage = new PDPage();
    try
    {
        pdpage = (PDPage)sourceDocumentPages.get(pageIndex);
        List anno = pdpage.getAnnotations();
        if (anno.size() > 0)
        {
            PDAnnotationFileAttachment pafa = (PDAnnotationFileAttachment)anno.get(0);
            //FILENAME = GETCONTENTS()
            string filename = pafa.getContents();
            PDFileSpecification fs = pafa.getFile();
        }
    }
    catch (Exception)
    { }
}

Can you help me one more time to extract and dump the embedded file in the
specified location?


You already mentioned some sample code yourself. [1] demonstrates how to do 
that.

[1]http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
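
A note on the "dump to a specified location" step: once the bytes of the attachment are available as a stream, writing them out needs only the Java standard library. The sketch below shows that step in isolation. It assumes the stream comes from the embedded file object of the file specification (as far as I recall, the linked 1.8-era example casts to PDComplexFileSpecification and calls getEmbeddedFile()); the class name EmbeddedFileDump, the method dump, and the fallback name attachment.bin are made up for illustration, and no PDFBox call appears in the code itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class EmbeddedFileDump {

    // Writes the stream to targetDir/<leaf name> and returns the path written.
    // Only the last path segment of the stored name is used, so a stored name
    // such as "../../x.doc" cannot end up outside targetDir.
    static Path dump(InputStream content, String storedName, Path targetDir) throws IOException {
        String normalized = storedName.replace('\\', '/');
        String leaf = normalized.substring(normalized.lastIndexOf('/') + 1);
        if (leaf.isEmpty() || leaf.equals(".") || leaf.equals("..")) {
            leaf = "attachment.bin"; // hypothetical fallback name
        }
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(leaf);
        Files.copy(content, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; with PDFBox the stream would be obtained from the
        // embedded file of the attachment's file specification (assumption).
        InputStream fake = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Path out = dump(fake, "report.doc", Files.createTempDirectory("pdf-attachments"));
        System.out.println("Wrote " + out);
    }
}
```

From .NET the same idea applies: read the bytes from the embedded file object and write them with whatever file API is convenient there.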

On Thu, Jun 20, 2013 at 2:46 PM, Ramesh Shrestha <[email protected]> wrote:


Even after trying the annotation approach, I am not able to extract the
embedded/attached doc file located in the page of the PDF.

On Tue, Jun 11, 2013 at 5:29 PM, Andreas Lehmkuehler <[email protected]> wrote:

On 11.06.2013 07:06, Ramesh Shrestha wrote:

Thanks,

The Java example link I provided should have been:

http://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java

But your suggestion WORKS.

Now I am able to extract the attached file located in the *attachments tab*,
but I *haven't been able to extract the attached file located in the page*.
I am getting a null efTree in this case.

          PDDocumentNameDictionary namesDictionary =
              new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
          PDEmbeddedFilesNameTreeNode efTree = namesDictionary.getEmbeddedFiles();

So I am now working on it.

Embedded files are always document related. If an embedded file is
referenced on a single page, a file attachment annotation is used. Try
something like this to get all annotations of a single page:

List annotations = page.getAnnotations();

The one you are looking for has to be an instance of the class
org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationFileAttachment.
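
Since the list returned for a page is untyped and may hold other annotation kinds (links, for instance), casting the first element directly can fail with a ClassCastException. A minimal sketch of the instance check, in plain Java so it runs without PDFBox: LinkAnnotation and FileAttachmentAnnotation are hypothetical stand-ins for the real annotation classes, and ofType is an illustrative helper, not a PDFBox method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AnnotationFilter {

    // Hypothetical stand-ins for annotation classes, so the sketch needs no PDFBox.
    static class LinkAnnotation {}

    static class FileAttachmentAnnotation {
        final String name;
        FileAttachmentAnnotation(String name) { this.name = name; }
    }

    // Returns only those elements of an untyped list that are instances of the given type.
    static <T> List<T> ofType(List<?> items, Class<T> type) {
        List<T> matches = new ArrayList<T>();
        for (Object item : items) {
            if (type.isInstance(item)) {
                matches.add(type.cast(item));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<?> annotations = Arrays.asList(
                new LinkAnnotation(), new FileAttachmentAnnotation("report.doc"));
        // Only the file attachment stand-in survives the filter.
        for (FileAttachmentAnnotation a : ofType(annotations, FileAttachmentAnnotation.class)) {
            System.out.println(a.name);
        }
    }
}
```

With the real classes, the same loop would test each element against PDAnnotationFileAttachment before casting, instead of assuming the attachment is at index 0.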

On Mon, Jun 10, 2013 at 7:38 PM, Andreas Lehmkuehler <[email protected]> wrote:

Hi,


On 10.06.2013 11:22, Ramesh Shrestha wrote:

   Hi,



I am developing a .NET application using PDFBox to extract metadata,
content, and attached files from PDFs.

I was able to extract metadata and content, but I am stuck on extracting
attached/embedded files.

I have a PDF with an embedded/attached doc file and want to retrieve that
file. I have gone through the Java example:

http://www.docjar.com/html/api/org/apache/pdfbox/examples/pdmodel/EmbeddedFiles.java.html

But while trying to use it in .NET, I got "non generic type 'java.util.Map'
cannot be used with type arguments" in the following code snippet:

java.util.Map<String, COSObjectable> names = efTree.getNames();

So I would be grateful if anybody could help me extract the file from the PDF.

I'm not a .NET expert and don't know what may cause that issue. But maybe
it is a good idea to just omit the generics and try something like this:

java.util.Map names = efTree.getNames();
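
One consequence of dropping the type arguments is that keys and values come back as plain objects and have to be cast explicitly. The following is only a sketch in plain Java of walking such an untyped map; RawMapWalk and describe are illustrative names, and the string value stands in for the file specification object that the real name tree would hold.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RawMapWalk {

    // Collects "key -> value" lines from a map whose type parameters are unknown.
    @SuppressWarnings("rawtypes")
    static List<String> describe(Map names) {
        List<String> lines = new ArrayList<String>();
        for (Object o : names.entrySet()) {
            Map.Entry entry = (Map.Entry) o;      // entries of a raw map are plain Objects
            String key = (String) entry.getKey(); // explicit cast replaces the generic type
            Object value = entry.getValue();      // with PDFBox this would be cast to the file specification type (assumption)
            lines.add(key + " -> " + value);
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Object> names = new LinkedHashMap<String, Object>();
        names.put("report.doc", "file specification placeholder");
        for (String line : describe(names)) {
            System.out.println(line);
        }
    }
}
```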

   Thanks in advance.

HTH
Andreas Lehmkühler

BR
Andreas Lehmkühler


BR
Andreas Lehmkühler
