> Date: Wed, 19 May 2010 08:52:00 +0200
> From: i...@1t3xt.info
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] iText Read Chuncks of PDF into java
>
> crimeunit wrote:
>> Dear all,
>>
>> Does somebody else know maybe that I can use another library where I can
>> specially read out the links of content (to another pdf file) into a pdf?
>
> Reading out the "links" is a completely different question.
>
> Links (anchors, hyperlinks, external go to actions,...) are not part of
> the page content stream; they are stored in Link annotations and very
> easy to retrieve.
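To make the Link annotation point concrete: a URI link usually carries its target in an action dictionary that looks like /A << /S /URI /URI (http://...) >>. Below is a rough sketch, in plain Java with no PDF library, of a grep-style scan for those entries. The class name PdfLinkGrep and the method extractUris are made up for illustration; this is not iText API. It assumes the annotation dictionaries sit uncompressed in the file (not inside object streams), that the target is a literal string rather than a hex string, and that the string has no unescaped nested parentheses. A PDF that breaks any of those assumptions needs a real parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText: a grep-like scan of raw PDF bytes.
class PdfLinkGrep {
    // Matches "/URI (target)" where target is a PDF literal string.
    // Backslash escapes are allowed inside; unescaped nested parentheses are not handled.
    private static final Pattern URI_ENTRY =
            Pattern.compile("/URI\\s*\\(((?:\\\\.|[^\\\\)])*)\\)");

    static List<String> extractUris(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary sections do not break decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<String> uris = new ArrayList<>();
        Matcher m = URI_ENTRY.matcher(raw);
        while (m.find()) {
            // Undo only the simplest escapes: \( \) and \\ (octal escapes are left as-is).
            uris.add(m.group(1).replaceAll("\\\\([()\\\\])", "$1"));
        }
        return uris;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfLinkGrep file.pdf");
            return;
        }
        for (String uri : extractUris(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.println(uri);
        }
    }
}
```

Links to another PDF file (remote go-to actions) keep the target in a file specification rather than a /URI entry, so this sketch will not find them. If it prints nothing for a file you know has links, the dictionaries are probably compressed, and a library that parses the annotations properly is the better route.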
We just discussed this specific issue in another thread. The question ultimately became, from my perspective: do you need to write a custom piece of code to get links, or can you stand on the shoulders of giants, avoid reinventing the wheel, and solve the problem with cliches and command-line tools such as

    cat xxx.pdf | grep http

or better

    cat xxx.pdf | convert_to_form_suited_for_manipulation | grep $unambiguous_link_thingy > all_links

> Your problem is that you are not using the correct terminology,
> therefore it is impossible for anybody to answer your question.
>

This of course is a very common problem when just starting out, and it makes keyword searches hard. A lot of your time is spent here, but this is hardly unique to iText. A command-line tool to dump a PDF in human-readable form (LOL) with the right jargon could make this easier ("I dumped the PDF and all the wazoodalle dictionary entries were blank"). This is why I usually talk around ill-posed questions, time and interest permitting.

> I interpreted your question as a request to do something that is
> impossible: you want to extract structure from a PDF that isn't
> structured (a PDF that isn't tagged).
>
> You won't find any tool that can do that.
>

If you can convert the PDF to text or pixels or any other form that may capture structure according to some external pattern, you may be able to use existing text tools or, if it is worth the effort, OCR tools on the pixels. My recurring complaint is that the FDA accepts, or has in the past accepted, scanned PDF files as documentation of clinical trial results for approved drugs (look for example at dr...@fda various doc packages). This makes automated use of this voluminous data impossible; I tried OCR but it did not work too well. Many people who file government documents don't like automated data processing, which does make this format a good choice. Calling this "Accessdata" is almost comical; perhaps "accesspictures" LOL.
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/
http://www.accessdata.fda.gov/drugsatfda_docs/nda/2004/125104s000_Natalizumab_Pharmr_P1.pdf

I did note that the labels seem to be selectable, so presumably you could get data out of the cave drawings.