Scott Smith wrote:
I'm building an application which has to provide "real-time" searching
of emails as they come in.  I have a number of search strings that I
need to apply against each email as it comes in and then do something
with the email based on which search string(s) get a hit.
My initial thought was to create a lucene index of the emails received
in the last N seconds (where N is around 5 since I don't have to be
quite real-time) in a memory directory, do my searches and then delete
the index and create a new index for emails received in the next 5
seconds.   I'm a little concerned because the number of search strings
will probably grow over time and so there is a bit of a scalability
issue, though I'm not sure there's any way around that other than doing
parallel processing on different machines.

What exactly do you need to do?  That is, what
will the "search strings" look like?  And what
do you need to do with the mail based on the
result of search?  There's a substantial literature
on routing incoming emails and zoning them to
send them off to appropriate customer support
reps and even auto-generate reply templates.
It's really just a classification problem.

If you just need boolean search over a set of
strings, then a reasonable solution is to just
use a regular expression.  That'll even let you
find where the strings match in the text at the
same time.  It's also easy to parallelize across
machines/threads.  Regexes are also built into
Java now and are likely to be more easily
understood by other developers you may be
working with now or in the future.
If you need a lot of efficiency, you can
unfold the regex into a deterministic finite automaton,
but that may be a lot more than you want to take
on for this project.
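
As a rough sketch of the regex approach, assuming the search
strings are plain literals rather than patterns: the class and
method names below (EmailStringMatcher, findHits) are made up for
illustration, and only the standard java.util.regex API is used.
All the strings are joined into one case-insensitive alternation,
so each email is scanned once and every hit comes back with its
offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: matches a fixed set of literal search strings. */
final class EmailStringMatcher {
    private final Pattern pattern;

    EmailStringMatcher(List<String> searchStrings) {
        if (searchStrings.isEmpty()) {
            throw new IllegalArgumentException("need at least one search string");
        }
        StringBuilder alternation = new StringBuilder();
        for (String s : searchStrings) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Pattern.quote treats the string as a literal, so characters
            // such as '.' or '(' in a search string are not regex syntax.
            alternation.append(Pattern.quote(s));
        }
        this.pattern = Pattern.compile(alternation.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Maps each matched string (lower-cased) to the start offsets of its hits. */
    Map<String, List<Integer>> findHits(String emailText) {
        Map<String, List<Integer>> hits = new TreeMap<>();
        Matcher m = pattern.matcher(emailText);
        while (m.find()) {
            String key = m.group().toLowerCase(Locale.ROOT);
            hits.computeIfAbsent(key, k -> new ArrayList<>()).add(m.start());
        }
        return hits;
    }
}
```

For example, new EmailStringMatcher(List.of("refund", "invoice"))
applied to "Please send the Invoice and a refund." reports
"invoice" at offset 16 and "refund" at offset 30.  Note that an
alternation only reports non-overlapping matches, so if one search
string is a substring of another you'd want one Pattern per string
instead.  A matcher like this has no shared mutable state, which is
what makes it easy to run across threads.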

If you need the more complex logic of Lucene's
boolean searches or the weighting provided by TF/IDF,
then you need to think about which corpus will
supply the IDF weightings for the terms, since
those weights feed into the overall scores.
Ideally, it'll be small enough to fit in a
memory-backed directory (in Lucene's terminology).
And the new emails would go in a separate
in-memory index.

If you need to do something like route the emails
to different departments, you can turn the
problem around and use the incoming email to
create a query, which you match against one
pseudo-document per department, generated by
concatenating example documents that should go
to that department.
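
To make the turned-around version concrete, here's a minimal
sketch in plain Java, without Lucene.  Everything named in it
(TfIdfRouter, scores, route) is hypothetical, and the weighting is
one simple TF-IDF plus cosine variant with idf = log(N/df); it is
not meant to reproduce Lucene's scoring or any published system.
Tokenization is a crude split on non-alphanumerics with no
stemming.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical router: one TF-IDF vector per department pseudo-document. */
final class TfIdfRouter {
    private final Map<String, Map<String, Double>> deptVectors = new LinkedHashMap<>();
    private final Map<String, Double> idf = new HashMap<>();

    /** pseudoDocs maps a department name to its concatenated example emails. */
    TfIdfRouter(Map<String, String> pseudoDocs) {
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map.Entry<String, String> e : pseudoDocs.entrySet()) {
            Map<String, Integer> tf = termCounts(e.getValue());
            counts.put(e.getKey(), tf);
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        // A term found in every department gets idf 0 and is ignored.
        int n = pseudoDocs.size();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            idf.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            deptVectors.put(e.getKey(), weigh(e.getValue()));
        }
    }

    private static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Builds a unit-length tf*idf vector, dropping unseen and zero-idf terms. */
    private Map<String, Double> weigh(Map<String, Integer> counts) {
        Map<String, Double> vec = new HashMap<>();
        double sumSq = 0.0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            Double w = idf.get(e.getKey());
            if (w == null || w == 0.0) {
                continue;
            }
            double v = e.getValue() * w;
            vec.put(e.getKey(), v);
            sumSq += v * v;
        }
        double norm = Math.sqrt(sumSq);
        if (norm > 0.0) {
            vec.replaceAll((term, v) -> v / norm);
        }
        return vec;
    }

    /** Cosine similarity between the email (as a query) and each department. */
    Map<String, Double> scores(String email) {
        Map<String, Double> query = weigh(termCounts(email));
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Double>> d : deptVectors.entrySet()) {
            double dot = 0.0;
            for (Map.Entry<String, Double> q : query.entrySet()) {
                dot += q.getValue() * d.getValue().getOrDefault(q.getKey(), 0.0);
            }
            result.put(d.getKey(), dot);
        }
        return result;
    }

    /** Best-scoring department, or null if the email shares no weighted terms. */
    String route(String email) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : scores(email).entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

The point of the design is that the set of departments is small
and fixed, so their vectors are built once, and each incoming
email costs only one pass over its own terms per department.  In
practice you'd probably want a score threshold below which the
email goes to a person rather than to the top-scoring department.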

We did this at Bell Labs using simple TF-IDF
weighting (this was, alas, before the days of
Lucene).  Because we were in the speech lab, we
did it for routing telephone calls based on
speech recognition output:

http://acl.ldc.upenn.edu/J/J99/J99-3003.pdf

- Bob Carpenter
  Alias-i
