Can Nutch (with its out-of-the-box PDFBox plugin) crawl PDF files where each page
is a link (e.g. the URL appends &PGN=pageNumber to go to a specific page)? In
the browser, each page of the PDF file is loaded on demand. However, when the
content is fetched from the URL programmatically, it looks like not all of the
pages are fetched. Even when the PDF is saved from the browser (with Save As),
not all pages are saved. Acrobat Reader is able to open only one page and
gives errors (cannot find link) for the other pages. Examining the PDF file
with Notepad, I did find some tags like GoToR for each page, indicating the
destination (in binary form though) for the page.
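For reference, this is roughly what my fetch-and-parse test does. It is only a
minimal sketch: the URL is made up, and it uses the Apache PDFBox classes
(PDDocument, PDFTextStripper), which may differ in package name from the
PDFBox version bundled with the Nutch plugin.

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfFetchCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; the real site appends &PGN=pageNumber per page.
            String url = "http://www.example.com/report.pdf?id=123&PGN=1";

            InputStream in = new URL(url).openStream();
            PDDocument doc = PDDocument.load(in);
            try {
                // Report how many pages actually came back in the fetched stream.
                System.out.println("Pages fetched: " + doc.getNumberOfPages());

                // Extract whatever text is present in those pages.
                PDFTextStripper stripper = new PDFTextStripper();
                System.out.println(stripper.getText(doc));
            } finally {
                doc.close();
                in.close();
            }
        }
    }

When I run something like this, the page count and extracted text only cover
the one page, which matches what Acrobat Reader shows after Save As.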
Any idea how to extract everything from the PDF?
Thanks
Jason