RE: Investigating Lucene For Project
Also, there is a book called "Lucene in Action" that was released recently. It is a great introduction to Lucene and has sections dedicated to indexing different text document types (txt, html, pdf, doc, rtf). FYI, I am in no way related to the book or its authors, so this is a genuine recommendation. It will help you quickly learn what Lucene is and what it can do. It also has lots of pointers to other projects that use Lucene or expand upon its functionality.

Thanks,
Kevin

-----Original Message-----
From: Ben Litchfield
Sent: Tuesday, March 01, 2005 3:08 PM
To: Lucene Users List
Subject: Re: Investigating Lucene For Project
Re: Investigating Lucene For Project
See inlined comments below.

> We have had requests from some clients who would like the ability to
> "index" PDF files, now and possibly other text files in the future. The
> PDF files live on a server and are in a structured environment. I would
> like to somehow index the content inside the PDF and be able to run
> searches on that information from a web-form. The result MUST BE a text
> snippet (that being some text prior to the searched word and after the
> searched word). Does this make sense? And can Lucene do this?

Lucene indexes text documents, so you will need to convert your PDF to a text document. PDFBox (http://www.pdfbox.org/) can do that. PDFBox also provides a summary of the document, which is just the first x characters of its text. If you want a smarter summary, such as the text surrounding the matched word, you will need to create that yourself.

> If the product can do this, how is the best way to get rolling on a
> project of this nature? Purchase an example book, or are there simple
> examples one can pick up on? Does Lucene have a large learning curve? or
> reasonably quick?

There are tutorials available on the website, and I would recommend the "Lucene in Action" book. There is a learning curve for Lucene, but it sounds like your requirements are pretty basic, so it shouldn't be that hard.

> If all the above will work, what kind of license does this require? I
> have not been able to find a link to that yet on the jakarta site.

http://www.apache.org/licenses/LICENSE-2.0

Ben
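For the snippet requirement (some text before and after the searched word), here is a minimal sketch of what "creating that yourself" might look like once the PDF has already been converted to plain text. It uses only the Java standard library. The class name `SnippetSketch`, the method `snippet`, and the `context` parameter are made up for illustration; none of them come from Lucene or PDFBox.

```java
/** Illustrative helper; not part of Lucene or PDFBox. */
final class SnippetSketch {

    /** Index of the first case-insensitive occurrence of term in text, or -1. */
    static int indexOfIgnoreCase(String text, String term) {
        if (term.isEmpty()) {
            return -1; // treat an empty search term as "no match"
        }
        for (int i = 0; i + term.length() <= text.length(); i++) {
            // regionMatches compares in place, so offsets stay valid for text
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns up to `context` characters on each side of the first match.
     * If the term is not found, falls back to the first 2 * context
     * characters, similar to a plain "first x characters" summary.
     */
    static String snippet(String text, String term, int context) {
        int hit = indexOfIgnoreCase(text, term);
        int start;
        int end;
        if (hit < 0) {
            start = 0;
            end = Math.min(text.length(), 2 * context);
        } else {
            start = Math.max(0, hit - context);
            end = Math.min(text.length(), hit + term.length() + context);
        }
        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("..."); // text was cut off on the left
        }
        sb.append(text, start, end);
        if (end < text.length()) {
            sb.append("..."); // text was cut off on the right
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "The PDF files live on a server and are in a structured environment.";
        System.out.println(snippet(text, "server", 12));
    }
}
```

This only handles a single search word and cuts on character counts, so it can split a word at either edge. For real use you would probably want to snap to word boundaries and handle multi-word queries.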
Investigating Lucene For Project
I am looking for a solution to a problem I am having. We have a web-based asset management solution where we manage customers' assets. We have had requests from some clients who would like the ability to "index" PDF files now, and possibly other text files in the future. The PDF files live on a server and are in a structured environment. I would like to somehow index the content inside the PDFs and be able to run searches on that information from a web form. The result MUST BE a text snippet (that is, some text before and after the searched word). Does this make sense? And can Lucene do this?

If the product can do this, what is the best way to get rolling on a project of this nature? Should I purchase a book of examples, or are there simple examples one can pick up on? Does Lucene have a steep learning curve, or is it reasonably quick to learn?

If all of the above will work, what kind of license does this require? I have not been able to find a link to that yet on the Jakarta site.

I sincerely appreciate any input on this.

Sincerely,
Scott