Re: stream of events never to know when it ends? how to index such things & search

Erick Erickson Wed, 18 Feb 2009 08:38:42 -0800

You could always sort by EVENTID; that way at least
you'd have all the events for a particular ID together
in your results. You'd then have to post-filter the results to
determine whether all the necessary descriptions were
present. But, as you pointed out, you may have a lot of
records to sort through, so I don't think this is a very
good idea...
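
For what it's worth, the post-filtering step could look roughly like
the sketch below. It is plain Java, not Lucene API: the Hit class and
the idsMatchingAll method are made-up names, and a substring test
stands in for whatever term matching your analyzer really does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class EventPostFilter {
    /** Hypothetical hit: one event line, i.e. one document in the current index. */
    static final class Hit {
        final String eventId;
        final String description;
        Hit(String eventId, String description) {
            this.eventId = eventId;
            this.description = description;
        }
    }

    /** Returns the EVENTIDs (sorted) whose hits together contain every required term. */
    static List<String> idsMatchingAll(List<Hit> hits, List<String> requiredTerms) {
        Set<String> required = new HashSet<>(requiredTerms);
        // TreeMap keeps the EVENTIDs in sorted order, like sorting the hits by EVENTID.
        Map<String, Set<String>> foundPerId = new TreeMap<>();
        for (Hit hit : hits) {
            Set<String> found = foundPerId.computeIfAbsent(hit.eventId, k -> new HashSet<>());
            for (String term : required) {
                // Assumption: a substring test stands in for real term matching.
                if (hit.description.contains(term)) {
                    found.add(term);
                }
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : foundPerId.entrySet()) {
            if (entry.getValue().containsAll(required)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Note that every hit has to be visited, which is exactly the cost
problem with this approach.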

How many events are we talking about here and what
kind of lag between an event and being able to search it
can you tolerate? I guess what I'm really asking is whether
it's possible to recreate your index "often enough" to
satisfy your users. If so, you can index multiple
descriptions in a single document, something like

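Document doc = new Document();
// Lucene 2.4-style Field API; Store.NO = searchable, original text not stored
doc.add(new Field("EVENTDESCRIPTION", "STARTING EVENT", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("EVENTDESCRIPTION", "XYZ", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("EVENTDESCRIPTION", "ABC", Field.Store.NO, Field.Index.ANALYZED));
// assumption: EVENTID is stored so it can be read back from a hit
doc.add(new Field("EVENTID", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);  // writer is an already-open IndexWriter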

You'd have to gather all the descriptions related
to each EVENTID before you were able to index the doc.....
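
The gathering step might be sketched as below. This is plain Java
with no Lucene calls; EventGrouper and groupByEventId are made-up
names, and each signal is assumed to arrive as an {EVENTID,
EVENTDESCRIPTION} pair.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EventGrouper {
    /**
     * Groups signals by EVENTID, keeping descriptions in arrival order.
     * Each signal is a two-element array: {EVENTID, EVENTDESCRIPTION}.
     */
    static Map<String, List<String>> groupByEventId(List<String[]> signals) {
        // LinkedHashMap keeps the IDs in the order they were first seen.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] signal : signals) {
            grouped.computeIfAbsent(signal[0], k -> new ArrayList<>()).add(signal[1]);
        }
        return grouped;
    }
}
```

Each map entry would then become one document: one EVENTID field plus
one EVENTDESCRIPTION field per description in the list.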

By manipulating the PositionIncrementGap you could also
keep phrase and span queries from matching across different
EVENTDESCRIPTIONs. E.g. if you didn't want STARTING and ABC to
match because they come from different descriptions, you could use
SpanQueries or the proximity operator with a slop smaller than the
gap. But going into details depends upon whether you can rebuild
your index, so we'll defer that part...
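
To give a feel for why the gap matters, here is a toy model in plain
Java. It is not Lucene code and only approximates how positions and
slop behave: the class, both methods and the whitespace tokenizing
are assumptions made for illustration. The real hook is the
analyzer's getPositionIncrementGap(fieldName).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PositionGapSketch {
    /**
     * Assigns token positions across the values of one multi-valued field.
     * Simplified model: tokens are split on whitespace, and each value
     * after the first is pushed 'gap' extra positions away.
     */
    static Map<String, List<Integer>> positions(List<String> values, int gap) {
        Map<String, List<Integer>> result = new HashMap<>();
        int position = 0;
        boolean firstValue = true;
        for (String value : values) {
            if (!firstValue) {
                position += gap;
            }
            firstValue = false;
            for (String token : value.split("\\s+")) {
                result.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
                position++;
            }
        }
        return result;
    }

    /** True if some occurrence of a and some occurrence of b have at most 'slop' positions between them. */
    static boolean near(Map<String, List<Integer>> positions, String a, String b, int slop) {
        for (int pa : positions.getOrDefault(a, Collections.emptyList())) {
            for (int pb : positions.getOrDefault(b, Collections.emptyList())) {
                if (Math.abs(pa - pb) - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

In this model, with a gap of 100 and a slop of 5, STARTING and EVENT
are near each other but STARTING and ABC are not; with a gap of 0 both
pairs are.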

You could also think about updating the document when new events
were added, but since an update is really a delete/add under the
covers you'd have to either gather enough information from what I
assume is your log or store enough information with the document to
recreate it.

How big is your index currently and what kind of throughput do you
require?

Best
Erick

On Wed, Feb 18, 2009 at 10:20 AM, Christian Brennsteiner <
[email protected]> wrote:

> dear lucene community,
>
> i am playing around with lucene right now. and have come to very bad
> problem.
>
> given environment:
>
> a signal source gives signals with eventids and eventdescriptions
>
> for example EVENTID=1 and EVENTDESCRIPTION="STARTING EVENT"
>
> those events can be running very long (e.g. one month) during this
> period we will receive for example
>
> EVENTID=1 and EVENTDESCRIPTION="EXECUTING XYZ"
> 10 minutes later
> EVENTID=1 and EVENTDESCRIPTION="EXECUTING YZA"
> 10 minutes later
> EVENTID=1 and EVENTDESCRIPTION="PASSED MILESTONE1"
> 10 minutes later
> EVENTID=1 and EVENTDESCRIPTION="EXECUTING ZAB"
>
> after e.g. 1 week we receive
> EVENTID=1 and EVENTDESCRIPTION="STOPING EVENT"
>
> what i want:
> i want to be able to search e.g. which eventids are connected to "XYZ"
> AND "ZAB" AND have already passed "MILESTONE1"
>
> so my current try is to index all events by full indexing (without
> storing) eventdescriptions AND stemming e.g. EXECUTING
>
> then searching for "+XYZ +ZAB +MILESTONE1"
> --> result no document since those are all separate documents
> when i search
>  "XYZ ZAB MILESTONE1"
> i am getting 3 times EVENTID 3
> --> this is bad since when i get 1000000 of such events how do i rank them?
>
> CONCLUSION:
> my biggest problem is that my lucene document given to the index
> currently is not in a final state BUT i have to index and search it
> also while it is in progress.
> as a result of this the ranking as i do it now has no real value since
> the ranking is just based on a "line of a whole event"
>
> QUESTION:
> is there a solution within lucene to combine search results? e.g. merge
> them OR
> is there a better workaround how i would do such updates to the index
> without storing the original document inside the index (since this
> consumes so much space)? e.g. extracting the keywords that were stored
> for the item?
>
> any hints appreciated.
>
> regards chris
>
>
> ----------
> Christian Brennsteiner
> Salzburg / Austria / Europe
>
