Yes. You asked, "Is there a way to list all the objects within the pdf?" That is what Tilman meant when he said "Or look at the PDF in PDFDebugger." PDFDebugger is a utility included in the PDFBox download (or it may be a separate download).
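Since you already opened the file in a text editor, here is a rough sketch of the same idea in plain Java, for a quick first look before using PDFDebugger. It is not PDFDebugger and does not use the PDFBox API. It only scans the raw bytes for "N G obj" headers and counts a few names that hint at the cases Tilman listed (/AcroForm, /XFA, /Widget). The class and method names are made up for illustration. Objects packed inside compressed object streams (/ObjStm) will not show up, and binary stream data can produce false hits, so treat the output as a hint only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a crude raw-byte scan, not a PDF parser.
final class PdfObjectScan {
    // "12 0 obj" opens indirect object 12, generation 0.
    private static final Pattern OBJ =
            Pattern.compile("(?<![0-9])(\\d+)\\s+(\\d+)\\s+obj\\b");

    // Names that hint at form fields, XFA data, widgets, or compressed objects.
    private static final String[] MARKERS = {"/AcroForm", "/XFA", "/Widget", "/ObjStm"};

    // Returns "objectNumber generation" for every object header found.
    static List<String> listObjects(String raw) {
        List<String> ids = new ArrayList<>();
        Matcher m = OBJ.matcher(raw);
        while (m.find()) {
            ids.add(m.group(1) + " " + m.group(2));
        }
        return ids;
    }

    // Counts plain-text occurrences of each marker name.
    static Map<String, Integer> countMarkers(String raw) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String marker : MARKERS) {
            int n = 0;
            for (int i = raw.indexOf(marker); i >= 0;
                     i = raw.indexOf(marker, i + marker.length())) {
                n++;
            }
            counts.put(marker, n);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfObjectScan file.pdf");
            return;
        }
        // ISO-8859-1 maps every byte to one char, so binary data survives decoding.
        String raw = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.ISO_8859_1);
        System.out.println("objects: " + listObjects(raw));
        System.out.println("markers: " + countMarkers(raw));
    }
}
```

If /Widget shows up without /AcroForm, that would point toward the "widget annotations without acroform" case. If none of the markers appear and /ObjStm is present, the interesting objects are probably compressed, and PDFDebugger is the better tool.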
On Thu, Nov 29, 2018 at 3:27 PM Nicolas Paris <nicolas.pa...@riseup.net> wrote:

> On Thu, Nov 29, 2018 at 08:56:59PM +0100, Tilman Hausherr wrote:
> > On 29.11.2018 at 09:49, Nicolas Paris wrote:
> > > Hi
> > >
> > > > It could be an XFA forms pdf... then you'd have to analyze the XML content.
> > > I opened the pdf in a text editor, and I can say the boxes are in a
> > > stream xml entity, in binary format. (By removing some binary, I have
> > > been able to remove the boxes.)
> > > Does it exclude the XFA form pdf nature ?
> >
> > Sorry, "nature" looks like a bad translation, and sadly I don't know what
> > you meant... please write that part in french, which I understand too.
>
> I meant, "do the above informations prove it is *not* a XFA form ?". I
> mean, the boxes aren't in xml but in the binary part.
>
> > PDFBox doesn't have an API for the XFA form.
> >
> > You can also upload the PDF to a sharehoster (no mail attachments). Or look
> > at the PDF in PDFDebugger.
>
> I cannot share any copy of the pdf. Thanks for that proposition that
> would help a lot.
>
> > > > It could be ordinary text, then the text stripper would do the job.
> > > The regular textstripper does not extract them. Does it exclude the text
> > > nature ?
> >
> > Same problem with "nature". PDFBox cannot extract XFA forms. It can detect
> > glyphs that are used for forms, e.g. squares.
>
> I meant, "if the built-in pdfbox text stripper does not extract the
> check-boxes, does it prove that they are not ordinary text."
>
> How could I determine the kind of checkbox I have ? Is there a way to
> list all the objects within the pdf ?
>
> > > On Thu, Nov 29, 2018 at 08:04:51AM +0100, Tilman Hausherr wrote:
> > > > It could be an XFA forms pdf... then you'd have to analyze the XML content.
> > > >
> > > > It could be widgets annotations without acroform, then you'd have to analyse
> > > > these.
> > > >
> > > > It could be ordinary text, then the text stripper would do the job.
> > > >
> > > > It could be vector graphics, then it gets really difficult.
> > > >
> > > > Tilman
> > > >
> > > > On 28.11.2018 at 23:05, Nicolas Paris wrote:
> > > > > Hi
> > > > >
> > > > > I have several pdf created with PDFCreator 2.0.1.0 and I want to extract
> > > > > the content as text, including the checkboxes values in it.
> > > > >
> > > > > The pdf looks like a regular form pdf with checkboxes. However it is not
> > > > > a acro form based pdf, and the regular pdfbox code I use in this case
> > > > > does not apply : the acroform is null !
> > > > >
> > > > > I wonder how I can iterate on those checkboxes (or visually equivalent)
> > > > > objects or symbols.
> > > > >
> > > > > If someone can give me a starter to list all objects in that pdf, that
> > > > > might be helpful to begin with.
> > > > >
> > > > > Thanks by advance,
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > > > For additional commands, e-mail: users-h...@pdfbox.apache.org
>
> --
> nicolas