Hi,

I've been working on an indexing engine for a while. Everything works fine with static or dynamic HTML. I would now like to be able to retrieve information from PDF files. I've found several APIs for writing dynamic PDF documents, but no simple one for parsing a document and extracting its text content.

For now, my indexing engine works this way:

1) download a page, starting from an initial URL
2) parse the content of the document to extract the headers and meta tags
3) analyse all the HTML tags of the page (links, colors, simple forms with no user input)
4) generate a list of the URLs referenced in the current page, which are queued
5) extract the text content, eliminate the negligible words, and store the result in a database
6) continue with the next URL in the queue

My concern is that a PDF document will typically be used to store a much larger amount of data than an HTML page, so I expect this scheme to take a very long time with a PDF.

Does anyone have experience with such a search engine, and am I heading in the right direction?

Regards,
Sylvain
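
For reference, the six steps above could be sketched roughly as follows in Java. This is only an illustration under stated assumptions, not the engine described in the post: the names CrawlSketch, Doc, fetcher and extractors are hypothetical, the document body is held as a String for brevity (a real PDF would arrive as bytes), steps 2 and 3 are reduced to stripping tags, and the index is an in-memory map rather than a database. The one idea it tries to show is that text extraction can be looked up by content type, so a PDF extractor could be plugged in without touching the queue logic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlSketch {

    // Hypothetical document shape: a content type plus the body as text.
    static final class Doc {
        final String contentType;
        final String body;

        Doc(String contentType, String body) {
            this.contentType = contentType;
            this.body = body;
        }
    }

    // Deliberately naive patterns; a real engine would use an HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Step 4: collect the URLs referenced by an HTML page, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Stand-in for steps 2 and 3: replace every tag with a space.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }

    // Step 5: lower-case, split on non-alphanumerics, drop the stop words.
    static List<String> indexableWords(String text, Set<String> stopWords) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty() && !stopWords.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }

    // Steps 1 to 6: breadth-first crawl with one text extractor per content type.
    static Map<String, List<String>> crawl(
            String startUrl,
            Function<String, Doc> fetcher,
            Map<String, Function<String, String>> extractors,
            Set<String> stopWords) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            Doc doc = fetcher.apply(url);          // step 1
            if (doc == null) {
                continue;                          // unreachable URL: skip it
            }
            if ("text/html".equals(doc.contentType)) {
                for (String link : extractLinks(doc.body)) {
                    if (seen.add(link)) {          // queue each URL only once
                        queue.add(link);
                    }
                }
            }
            Function<String, String> extractor = extractors.get(doc.contentType);
            if (extractor != null) {               // unknown types are not indexed
                index.put(url, indexableWords(extractor.apply(doc.body), stopWords));
            }
        }
        return index;
    }
}
```

With this shape, the cost concern about large PDFs is confined to whatever function is registered under "application/pdf"; the queue and the stop-word filter behave the same for every document type.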
