Also there is a book called "Lucene in Action" that was released
recently. It is a great introduction to Lucene and has sections
dedicated to indexing different text document types (txt, html, pdf,
doc, rtf). FYI I am in no way related to the book or the authors so this
is a real recommendation. It will help you quickly learn what Lucene is
and can do. It has lots of pointers to other projects that use Lucene or
expand upon it's functionality.

Thanks,
Kevin 

-----Original Message-----
From: Ben Litchfield [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 01, 2005 3:08 PM
To: Lucene Users List
Subject: Re: Investingating Lucene For Project 


See inlined comments below.

> We have had requests from some clients who would like the ability to
> "index"  PDF files, now and possibly other text files in the future.
The
> PDF files live on a server and are in a structured environment. I
would
> like to somehow index the content inside the PDF and be able to run
> searches on that information from a web-form. The result MUST BE a
text
> snippet (that being some text prior to the searched word and after the
> searched word).  Does this make sense? And can Lucene do this?


Lucene indexes text documents, so you will need to convert your PDF to a
text document.  PDFBox (http://www.pdfbox.org/) can do that, PDFBox
provides a summary of the document, which is just the first x number of
characters.  If you wanted a smarter summary you would need to create
that
yourself.

> If the product can do this, how is the best way to get rolling on a
> project of this nature? Purchase an example book, or are there simple
> examples one can pick up on? Does Lucene have a large learning curve?
or
> reasonably quick?

There are tutorials available on the website, and I would recommend
the "Lucene in Action" book.  There is a learning curve for lucene, but
it
sounds like your requirements are pretty basic so it shouldn't be that
hard.



> If all the above will work, what kind of license does this require? I
> have not been able to find a link to that yet on the jakarta site.

http://www.apache.org/licenses/LICENSE-2.0

Ben

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to