Hi, this sounds like a job for Nutch (one of the Lucene family projects).

On Sun, Apr 27, 2008 at 8:26 PM, Legolas wood <[EMAIL PROTECTED]> wrote:
> Hi
> Thank you for reading my post.
> I have to design a system with the following requirements, I think
> Lucene or one of the projects which are based on Lucene can help me as a
> base to continue on.
> Here is the requirements:
>
> - Fetch and index some pages (containing word and pdf documents) on
> daily basis.

Nutch can do this for you. It uses plug-ins to parse the various content
types. PDF should be no problem. For MS Office family documents it uses POI,
I think. This means that it is not 100% reliable, but I guess you can write
your own plug-in and use a commercial third-party utility to process Word
files.

>
> - Extract all pages that contain some provided keywords after fetching
> the pages.

I think you will need to look more closely at the Nutch tools. As far as I
know, Nutch can drop all pages containing specific words. I am not sure
whether there is any tool that does the opposite.

>
> - Create some bulletin from fetched pages, bulletin will be in pdf
> format and are categorized based on keywords.

Can you be more specific? I think you will need to implement this yourself.
(Maybe Carrot2 could be helpful.)

>
> - provide offline search capability (on pages that it indexed and also
> it should allows the users to browse the pages offline)
>

If I am not mistaken, Nutch can cache documents for you, but they are
converted into plain text. I am really not sure on this point, though.

>
> Can you let me know whether any of Lucene based projects can help me
> with this requirements?
> Specially with offline browsing feature?
>
>
> Thanks.

--
http://blog.lukas-vlcek.com/
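
P.S. If it turns out that no existing tool keeps (rather than drops) pages by keyword, that step is small enough to write yourself once you have the extracted plain text. Here is a minimal sketch of the idea in plain Python. It does not use any Nutch or Lucene API; the `Page` class and both function names are made up for illustration, and it only handles single-word keywords. The resulting groups could also serve as the categories for the bulletin.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical container for one fetched page (not a Nutch type)."""
    url: str
    text: str  # plain text already extracted from HTML, PDF or Word


def matching_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    words = set(re.findall(r"\w+", text.lower()))
    return [k for k in keywords if k.lower() in words]


def group_by_keyword(pages, keywords):
    """Map each keyword to the URLs of the pages that contain it.

    Pages that match no keyword do not appear in any group, which is
    the "keep only matching pages" behaviour asked about above.
    """
    groups = {k: [] for k in keywords}
    for page in pages:
        for k in matching_keywords(page.text, keywords):
            groups[k].append(page.url)
    return groups
```

You would call `group_by_keyword(pages, ["lucene", "nutch"])` after each daily fetch and hand the result to whatever generates the PDF bulletin.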