Re: [iText-questions] iText Read Chuncks of PDF into java

Mike Marchywka Wed, 19 May 2010 04:18:56 -0700

> Date: Wed, 19 May 2010 08:52:00 +0200
> From: i...@1t3xt.info
> To: itext-questions@lists.sourceforge.net
> Subject: Re: [iText-questions] iText Read Chuncks of PDF into java
>
> crimeunit wrote:
>> Dear all,
>>
>> Does somebody else know maybe that I can use another library where I can
>> specially read out the links of content (to another pdf file) into a pdf?
>
> Reading out the "links" is a completely different question.
>
> Links (anchors, hyperlinks, external go to actions,...) are not part of
> the page content stream; they are stored in Link annotations and very
> easy to retrieve.


We just discussed this specific issue in another thread. The question
ultimately became, from my perspective: do you need to write
a custom piece of code to get links, or can you stand on the shoulders
of giants, avoid reinventing the wheel, and solve the problem
with cliches and command line tools such as

cat xxx.pdf | grep http
or, better,
cat xxx.pdf | convert_to_form_suited_for_manipulation | grep $unambiguous_link_thingy > all_links
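
A rough sketch of the first pipeline, for what it is worth. The function
name extract_links and the URL pattern are my own assumptions, not anything
from iText or the PDF spec, and it only finds URIs that sit in the file as
plain, uncompressed strings. Links stored inside compressed object streams,
or "go to remote" actions that name a file rather than a URL, will be missed.

```shell
# Hypothetical helper: list the distinct http/https URLs that appear as
# plain text inside a PDF file. Assumes grep supports -a and -o (GNU and
# BSD grep both do).
extract_links() {
    # -a : treat the binary PDF as text
    # -o : print only the matching part of each line
    # -E : extended regex; stop at ')' or '>' or whitespace, which is where
    #      a URI string inside a PDF dictionary normally ends
    grep -a -o -E 'https?://[^)>[:space:]]+' "$1" | sort -u
}

# Example use (xxx.pdf is a placeholder name):
#   extract_links xxx.pdf > all_links
```

If that comes back empty on a file that clearly has links, the objects are
probably compressed, and a tool that rewrites the PDF with uncompressed
objects (qpdf's QDF mode is one candidate) would have to play the role of
convert_to_form_suited_for_manipulation before the grep.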

>
> Your problem is that you are not using the correct terminology,
> therefore it is impossible for anybody to answer your question.
>

This of course is a very common problem when just starting out,
and it makes it hard to do keyword searches. A lot of your
time is spent here, but this is hardly unique to iText.
A command line tool to dump a PDF in human-readable form (LOL)
with the right jargon could make this easier ("I dumped the
PDF and all the wazoodalle dictionary entries were blank").

This is why I usually talk around ill-posed questions, time
and interest permitting.


> I interpreted your question as a request to do something that is
> impossible: you want to extract structure from a PDF that isn't
> structured (a PDF that isn't tagged).
>
> You won't find any tool that can do that.
>
If you can convert the PDF to text or pixels or any other
form that may capture structure according to
some external pattern, you may be able to use
existing text tools or, if this is worth enough effort,
OCR tools on pixels.
My recurring complaint is that the FDA does, or
has in the past, accepted scanned PDF files for documentation
of clinical trial results of approved drugs (look for example
at dr...@fda various doc packages). This makes
automated use of this voluminous data impossible;
I tried OCR but it didn't work too well.
Many people who file government documents don't like
automated data processing, which does make this format
a good choice for them. Calling this "Accessdata" is almost comical;
perhaps "accesspictures" LOL.

http://www.accessdata.fda.gov/scripts/cder/drugsatfda/
http://www.accessdata.fda.gov/drugsatfda_docs/nda/2004/125104s000_Natalizumab_Pharmr_P1.pdf

I did note the labels seem to be selectable, and presumably you could get
data out of the cave drawings.