See inlined comments below.

> We have had requests from some clients who would like the ability to
> "index" PDF files, now and possibly other text files in the future. The
> PDF files live on a server and are in a structured environment. I would
> like to somehow index the content inside the PDF and be able to run
> searches on that information from a web-form. The result MUST BE a text
> snippet (that being some text prior to the searched word and after the
> searched word). Does this make sense? And can Lucene do this?

Lucene indexes text documents, so you will need to convert your PDF to a text document. PDFBox (http://www.pdfbox.org/) can do that. PDFBox also provides a summary of the document, which is just the first x characters. If you want a smarter summary, such as the text surrounding the searched word, you would need to create that yourself.

> If the product can do this, how is the best way to get rolling on a
> project of this nature? Purchase an example book, or are there simple
> examples one can pick up on? Does Lucene have a large learning curve? or
> reasonably quick?

There are tutorials available on the website, and I would recommend the "Lucene in Action" book. There is a learning curve for Lucene, but it sounds like your requirements are pretty basic, so it shouldn't be that hard.

> If all the above will work, what kind of license does this require? I
> have not been able to find a link to that yet on the jakarta site.

Lucene is released under the Apache License, Version 2.0:
http://www.apache.org/licenses/LICENSE-2.0

Ben
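
For illustration only, here is a minimal sketch of the "text before and after the searched word" idea, assuming the PDF text has already been extracted into a plain String. It uses only the Java standard library; the class name SnippetExtractor and the method snippet are hypothetical and are not part of Lucene or PDFBox.

```java
// Hypothetical helper for illustration; not part of Lucene or PDFBox.
class SnippetExtractor {

    // Returns up to `context` characters before and after the first
    // case-insensitive occurrence of `term` in `text`.
    // Returns an empty string if the term is not found or the input is invalid.
    static String snippet(String text, String term, int context) {
        if (text == null || term == null || term.isEmpty() || context < 0) {
            return "";
        }

        // Scan for the first case-insensitive match without copying the text.
        int hit = -1;
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return "";
        }

        // Clamp the context window to the bounds of the text.
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), hit + term.length() + context);

        // Mark truncation on either side with an ellipsis.
        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("...");
        }
        sb.append(text, start, end);
        if (end < text.length()) {
            sb.append("...");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(snippet(text, "FOX", 6));
    }
}
```

This cuts on character counts, so a snippet may start or end in the middle of a word; widening the window to the nearest whitespace would be a natural refinement.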