> - Fetch and index some pages (containing Word and PDF documents) on a
>   daily basis.
> - Extract all pages that contain some provided keywords after fetching
>   the pages.
> - Create a bulletin from the fetched pages; the bulletin will be in PDF
>   format and categorized by keyword.
> - Provide offline search capability (over the indexed pages; it should
>   also allow users to browse the pages offline).
>
> Can you let me know whether any of the Lucene-based projects can help me
> with these requirements, especially the offline browsing feature?

Yes, the UpLib system from PARC does this. It supports Word, PowerPoint,
PDF, Web pages, email, images, etc., as input documents. It caches all
documents given to it in their original format, but also allows access
to them as HTML or PDF. It Lucene-indexes both the full content text of
each document and its metadata, and contains a number of document
analysis engines for calculating "indirect" metadata from the document.

It includes several offline browsers, including a Web-browser tool and a
Java client (I tend to use the Java rich client), for searching,
reading, and annotating the gathered pages. Screenshots are at
<http://uplib.parc.com/uplib/screenshots.html>.

We're currently in beta test of our first public release (it's been used
internally at PARC for over four years now); to be added to the
beta-test list, just create an account on the blog at
<http://uplib.parc.com/blog/>.

Bill
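
For readers who want to see the shape of the "extract pages by keyword and categorize them" requirement, here is a minimal sketch in plain Python. It is not UpLib's or Lucene's API: the tiny in-memory inverted index stands in for a real Lucene index, and the function names (`tokenize`, `build_index`, `categorize`) and the sample pages are hypothetical, chosen only for illustration. It handles single-word keywords only.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase the text and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    # Map each token to the set of page ids whose text contains it.
    # (Hypothetical stand-in for a real full-text index.)
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def categorize(index, keywords):
    # Group matching page ids under each keyword; keywords with no
    # matching page map to an empty list.
    return {kw: sorted(index.get(kw.lower(), set())) for kw in keywords}


# Made-up sample data standing in for the text of fetched pages.
pages = {
    "page1": "Lucene indexes the full text of each document.",
    "page2": "The bulletin is produced as a PDF.",
    "page3": "Offline search over PDF and Word documents.",
}

index = build_index(pages)
bulletin = categorize(index, ["pdf", "lucene", "email"])
print(bulletin)
```

Each keyword's page list could then feed one section of the bulletin. A real deployment would replace the dictionary with a persistent index and add text extraction for Word and PDF input, which this sketch leaves out.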