Hasan Diwan <hasan.di...@gmail.com> wrote:

> On 27 October 2010 18:16, Troy Wical <t...@wical.com> wrote:
> > Depends on what you're trying to index, I suppose. Maildir or mbox? For some
> > time now, off and on, I have been working to index an ezmlm mailing list
> > archive. In the end, I went with Swish-E and have made quite a bit of
> > progress. I am short of my complete goal though. The issue is that the
> > search results do not include the subject, and there is
> > currently no excerpt or phrase highlighting. My problem is that the flat text
> > email files I am working with have no XML or anything else to help the indexer
> > create fields from. I've not yet figured out how to convert the emails to
> > XML.
>
> Neither Maildir nor mbox -- IMAP/POP doesn't care. Basically, I want to
> build the index based on the contents of (my) gmail box. I can
> retrieve the messages using IMAP, just need to figure out the
> structure of the index.

UpLib has email indexing via Lucene, and an email crawler for IMAP that can easily be adapted to Gmail (I've done it for my Gmail accounts). It also does message threading, and it splits off attachments and indexes them as separate documents (while still keeping them as attachments to the email). It does a few other things too, like providing an IMAP server for access to the indexed mailboxes, as well as the standard Web interface.

It does not have excerpt or phrase highlighting in the way that you suggest, though. No need to convert the emails to XML (though UpLib does do that internally to create a rendering of each email message).

Bill
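
On the question of index structure, here is a minimal sketch of how one raw message could be split into field sets without going through XML. It is not UpLib's actual schema: the function name `message_to_documents` and every field name in it are hypothetical, chosen only for illustration. It is written in Python with the standard `email` package to keep it short, and it assumes the raw bytes have already been retrieved (for example over IMAP). Each resulting dict could then be mapped onto one Lucene document, with the attachment entries indexed separately but pointing back at their parent message.

```python
from email import message_from_bytes, policy


def message_to_documents(raw_bytes):
    """Split one raw RFC 822 message into a list of field dicts.

    The first dict describes the message itself; each further dict
    describes one attachment. All field names are illustrative only.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    message_id = str(msg.get("Message-ID", "")).strip()

    # Prefer a plain-text body, fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    docs = [{
        "type": "message",
        "message_id": message_id,
        "subject": str(msg.get("Subject", "")),
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "date": str(msg.get("Date", "")),
        # In-Reply-To is kept so that threading can be rebuilt later.
        "in_reply_to": str(msg.get("In-Reply-To", "")).strip(),
        "body": body,
    }]

    for part in msg.iter_attachments():
        # Nested messages and binary payloads carry no directly indexable
        # text here; a real indexer would run a text extractor on them.
        content = "" if part.is_multipart() else part.get_content()
        docs.append({
            "type": "attachment",
            "parent_message_id": message_id,
            "filename": part.get_filename() or "",
            "content_type": part.get_content_type(),
            "text": content if isinstance(content, str) else "",
        })
    return docs
```

Keeping the subject as its own field is what lets search results show it, and storing the parent Message-ID on each attachment entry is one way to keep attachments searchable on their own while still tied to the email they came with.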