Re: indexing and searching different file formats

Andrew Libby Wed, 13 Feb 2002 19:39:15 -0800


Pradeep,
    Currently Lucene does not provide the ability to convert documents
to text for indexing.  There is talk of adding this to the goals of the
project, along with crawlers to traverse web, local disk, FTP, and RDBMS
sources of data.

The problem with indexing irrespective of file type is that each document
format contains embedded information that must be stripped out (or ignored)
so that the text can be retrieved for indexing.  An extreme example is
PDF, which has a considerably complicated document format.
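
To make the idea concrete, here is a rough sketch of per-format text
extraction, chosen by file extension.  None of this is Lucene API: the
class name, the registry, and the helper methods are all hypothetical,
the content is passed in as a String for simplicity (real formats such
as PDF are binary and need a proper parser), and the HTML handling is a
crude tag stripper that ignores entities, scripts, and comments.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class ExtractorSketch {
    // Hypothetical registry: file extension -> function that turns raw
    // document content into plain text suitable for indexing.
    private static final Map<String, UnaryOperator<String>> EXTRACTORS = new HashMap<>();

    static {
        EXTRACTORS.put("txt", raw -> raw.trim());
        EXTRACTORS.put("html", ExtractorSketch::stripTags);
        EXTRACTORS.put("htm", ExtractorSketch::stripTags);
    }

    // Crude HTML handling: replace each tag with a space, then collapse whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Lower-cased text after the last dot, or "" when there is no dot.
    static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Pick the extractor for this file type; fail loudly on unknown formats.
    static String extract(String fileName, String raw) {
        UnaryOperator<String> extractor = EXTRACTORS.get(extension(fileName));
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for " + fileName);
        }
        return extractor.apply(raw);
    }

    public static void main(String[] args) {
        System.out.println(extract("page.html", "<p>Hello <b>world</b></p>")); // prints "Hello world"
        System.out.println(extract("notes.txt", "  plain text  "));            // prints "plain text"
    }
}
```

The string that comes back is what you would hand to Lucene as the
content of a document field; supporting another format then means
registering one more extractor rather than changing the indexing code.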

On the contributions page there are some pointers that may provide information
about processing the types of documents you're interested in.

http://jakarta.apache.org/lucene/docs/contributions.html

If you have not already done so, look at the FAQs and the getting started
guide; they are very informative:

http://www.lucene.com/cgi-bin/faq/faqmanager.cgi
http://jakarta.apache.org/lucene/docs/gettingstarted.html
http://www.jguru.com/faq/Lucene

Good luck!

Andy

On Wed, Feb 13, 2002 at 09:24:33PM +0530, Pradeep Kumar K wrote:
> Hi Lucene friends!
> 
>    How the files of different format can be indexed and searched? ( As I 
> know lucene is having HTML indexer and searcher, which comes along with 
> it and also XML indexer, but is there any way to index files  
> irrespective of the file type)
> Any suggestions will be greatly appreciated..
> 
> Thanks in advance.
> Pradeep
> 
> 
> --------------------------------------------------------------
> Robosoft Technologies, Mangalore, India
> 
> 
> 
> --
> To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> 

-- 
--------------------------------------------------
Andrew Libby
CommNav, Inc
[EMAIL PROTECTED]
