Can Nutch (with its out-of-the-box PDFBox plugin) crawl PDF files where each page
is a link (e.g. the URL appends &PGN=pageNumber to go to a specific page)? In
the browser, each page of the PDF file is loaded on demand. However, when the
content is fetched from the URL programmatically, it looks like not all of the
pages are fetched. Even when the PDF is saved from the browser (with Save As),
not all pages are saved. Acrobat Reader is able to open only one page and
gives errors (cannot find link) for the other pages. Examining the PDF file
with Notepad, I did find some tags like GoToR for each page, indicating the
destination (in binary form though) for the page.
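For reference, this is roughly what my fetch-and-parse test does. It is only a
minimal sketch: the URL is made up, and it uses the Apache PDFBox classes
(PDDocument, PDFTextStripper), which may differ in package name from the
PDFBox version bundled with the Nutch plugin.

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfFetchCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; the real site appends &PGN=pageNumber per page.
            String url = "http://www.example.com/report.pdf?id=123&PGN=1";

            InputStream in = new URL(url).openStream();
            PDDocument doc = PDDocument.load(in);
            try {
                // Report how many pages actually came back in the fetched stream.
                System.out.println("Pages fetched: " + doc.getNumberOfPages());

                // Extract whatever text is present in those pages.
                PDFTextStripper stripper = new PDFTextStripper();
                System.out.println(stripper.getText(doc));
            } finally {
                doc.close();
                in.close();
            }
        }
    }

When I run something like this, the page count and extracted text only cover
the one page, which matches what Acrobat Reader shows after Save As.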
Any idea how to extract everything from the PDF?
Thanks
Jason