PDF indexing

Sylvain Roche Wed, 09 May 2001 07:48:27 -0700
Hi

I've been working on an indexing engine for a while. Everything works fine
with static or dynamic HTML. I would now like to be able to retrieve
information from PDF files. I've found several APIs for writing dynamic PDF
documents, but no simple one for parsing a document and extracting its text
content.

For now, my indexing engine works this way:
1) download a page from a starting URL
2) parse the content of the document to extract the headers and meta tags
3) analyse all the HTML tags of the page (links, colors, simple forms with
no user input)
4) generate a list of the URLs referenced in the current page, which are queued
5) extract the text content, eliminate the negligible words (stop words), and
store the result in a database
6) continue with the next URL in the queue
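
To make the loop above concrete, here is a rough sketch in Java. It is not the
actual engine: the class name CrawlSketch, the methods extractLinks,
extractTerms and crawl, the tiny stop-word list, and the in-memory "fetch"
function are all hypothetical stand-ins chosen for illustration. Steps 2 and 3
(headers, meta tags, full tag analysis) are left out, link extraction is a
naive regex rather than a real HTML parser, and a map stands in for the
database.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlSketch {
    // Tiny illustrative stop-word list (assumption); a real engine would load a fuller one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in", "is"));

    // Naive matcher: only finds double-quoted href values.
    static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
    static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Step 4: collect the URLs referenced in the page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Step 5: strip tags, lowercase, split into words, drop stop words.
    static List<String> extractTerms(String html) {
        String text = TAG.matcher(html).replaceAll(" ").toLowerCase(Locale.ROOT);
        List<String> terms = new ArrayList<>();
        for (String word : text.split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                terms.add(word);
            }
        }
        return terms;
    }

    // Steps 1, 4, 5, 6: fetch, queue unseen links, index the text, move on.
    // 'fetch' is a placeholder for the real download step; it returns null on failure.
    static Map<String, List<String>> crawl(String startUrl,
                                           Function<String, String> fetch,
                                           int maxPages) {
        Map<String, List<String>> index = new LinkedHashMap<>(); // stands in for the database
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty() && index.size() < maxPages) {
            String url = queue.remove();
            String html = fetch.apply(url);
            if (html == null) {
                continue; // page could not be fetched; skip it
            }
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // add() is false for URLs already queued or visited
                    queue.add(link);
                }
            }
            index.put(url, extractTerms(html));
        }
        return index;
    }

    public static void main(String[] args) {
        // Two in-memory pages linking to each other, instead of real downloads.
        Map<String, String> site = new HashMap<>();
        site.put("a", "<a href=\"b\">The cat</a>");
        site.put("b", "<p>Dogs and cats</p><a href=\"a\">back</a>");
        System.out.println(crawl("a", site::get, 10));
    }
}
```

Only the text-extraction step depends on the document format here, so in a
layout like this a PDF would need its own replacement for extractTerms (and
possibly extractLinks), while the queue handling could stay the same.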

My concern is that a PDF document is typically used to store a much larger
amount of data than an HTML page, so I expect this scheme to take a very long
time on a PDF. Does anyone have experience with such a search engine, and am I
heading in the right direction?

Regards
Sylvain

===========================================================================
To unsubscribe: mailto [EMAIL PROTECTED] with body: "signoff JSP-INTEREST".
For digest: mailto [EMAIL PROTECTED] with body: "set JSP-INTEREST DIGEST".
Some relevant FAQs on JSP/Servlets can be found at:

 http://java.sun.com/products/jsp/faq.html
 http://www.esperanto.org.nz/jsp/jspfaq.html
 http://www.jguru.com/jguru/faq/faqpage.jsp?name=JSP
 http://www.jguru.com/jguru/faq/faqpage.jsp?name=Servlets