A design for large-scale alerting was implemented in Streaming Expressions
and is described here:

https://joelsolr.blogspot.com/2017/01/deploying-solrs-new-parallel-executor.html

In this design you would store each alert in Solr as a topic expression.
Then a single daemon can run all the topics or it can be parallelized.
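
As a rough illustration of the design above (a sketch under stated
assumptions, not code from the blog post), the Python below builds one Solr
document per alert, holding a topic expression in an expr_s field, plus a
single daemon expression that runs every stored alert through executor().
The collection name "alerts", the daemon id "alertRunner", and reading the
stored alerts with search() over the /export handler are assumptions made
for illustration; "email_alerts" and "doc_collection" are borrowed from
Dan's pseudocode further down the thread. Queries containing double quotes
are rejected because escaping is not handled here.

```python
# Hypothetical names: "alerts" (collection holding stored alerts) and
# "alertRunner" (daemon id). "email_alerts" and "doc_collection" come from
# the pseudocode quoted later in this thread.

def topic_expression(alert_id, query,
                     checkpoints="email_alerts",
                     collection="doc_collection",
                     fields="id,title,abstract"):
    """Build the topic() expression text for a single alert."""
    if '"' in query or '"' in alert_id:
        # Quoting/escaping rules are deliberately out of scope for this sketch.
        raise ValueError("double quotes are not handled in this sketch")
    return (f'topic({checkpoints}, {collection}, q="{query}", '
            f'fl="{fields}", id="{alert_id}", initialCheckpoint=0)')


def alert_document(alert_id, query):
    """One Solr document per alert; expr_s holds the expression to run."""
    return {"id": alert_id, "expr_s": topic_expression(alert_id, query)}


def runner_expression(alerts="alerts", threads=10, run_interval_ms=1000):
    """A single daemon that reads every stored alert and executes it."""
    stored = (f'search({alerts}, q="*:*", fl="id,expr_s", '
              f'sort="id asc", qt="/export")')
    return (f'daemon(id="alertRunner", runInterval="{run_interval_ms}", '
            f'executor(threads={threads}, {stored}))')


if __name__ == "__main__":
    print(alert_document("alert-42", "title:solr"))
    print(runner_expression())
```

The threads value controls how many stored topics run concurrently inside
the one daemon; each topic keeps its own checkpoint under its alert id.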



Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Sep 7, 2021 at 6:32 AM Charlie Hull <[email protected]>
wrote:

> Hi Dan,
>
> Yuval's and my suggestions both rely on the same underlying code (Luwak,
> now called Lucene Monitor). This lets you store a set of Lucene queries
> and run them against every new document.
>
> The Lucene Monitor allows for very high-performance matching (I know of
> situations with around 1m stored queries, monitoring 1m new documents a
> day running on a few tens of nodes) and it does this with some clever
> optimisations: effectively it builds an index of your stored queries,
> and turns each new document into a query across this index (I know it
> sounds confusing!). It's a 'reverse search'. Check out the original
> Luwak project as it's got links to several presentations and blogs
> showing how others have implemented these systems.
>
> The bit you'll have to build is the Solr layer and then the code that
> uses this to generate alerts - and Solcolator and
> https://github.com/o19s/solr-monitor are two examples of how to do the
> first part, which you can build on. The facility to do a reverse search
> is not built into Solr - yet, unlike Elasticsearch's Percolator.
>
> Best
>
> Charlie
>
> On 07/09/2021 10:24, Dan Rosher wrote:
> > Thanks Eric, Charlie and Yuval for all the feedback and suggestions.
> >
> > Eric: Yes, I thought the monitoring might be a bit of a pain, esp. with
> > millions of them. I'll have to check out the topic code, but I wondered
> > if I can look at the checkpoint collection for uniqueIds that haven't
> > been updated for a 'while', which might suggest the daemon had
> > stopped/died, rather than checking each daemon individually?
> >
> > I was also wondering whether it's possible, or a useful enhancement, to
> > look at the replica index version (as opposed to _version_) for the
> > topic streaming expression, to skip queries where the replica index
> > version is the same as what we might store in the checkpoint collection?
> > For collections that update infrequently I think this might be useful.
> >
> > Charlie: It was for email alerts, so a user stores a query for collection
> > docs to match against, and then the system emails matches to the user. Do
> > you think solr-monitor can be used for this purpose?
> >
> > Yuval: I like the idea of using the UpdateProcessor, at least there's no
> > need for daemons or monitoring of them, but would this scale for millions
> > of email queries?
> >
> > Many thanks again to all.
> >
> > Kind regards,
> > Dan
> >
> >
> >
> >
> > On Mon, 6 Sept 2021 at 18:47, Yuval Paz <[email protected]> wrote:
> >
> >> My team and I are building upon this Solcolator:
> >> https://github.com/SOLR4189/solcolator
> >>
> >> Currently the processor is built for Solr 6.5.1. We are working on
> >> updating our Solr, and I hope to release a complete version of our
> >> Solcolator as open source then (it will be for version 8.6.x).
> >>
> >> We make it an update processor: either the last element, replacing the
> >> usual processor that indexes the document, or the second-to-last
> >> processor in the chain, which also allows monitoring atomic updates
> >> (which is relatively costly).
> >>
> >> By making it an update processor we don't rely on the streaming daemon,
> >> which we found unsatisfying as we wish to allow users to define their
> >> own monitors over the index.
> >>
> >> On Mon, Sep 6, 2021, 8:25 PM Charlie Hull <[email protected]> wrote:
> >>
> >>> Are you trying to monitor a stream of emails for certain patterns? In
> >>> which case you might look at the Lucene Monitor
> >>> https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html
> >>> https://issues.apache.org/jira/browse/LUCENE-8766, which was originally
> >>> Luwak - at my previous company Flax we helped build several large-scale
> >>> monitoring systems with this https://github.com/flaxsearch/luwak . It's
> >>> not officially surfaced in Solr yet although my colleague Scott Stults
> >>> has been working on some ideas: https://github.com/o19s/solr-monitor
> >>>
> >>> best
> >>> Charlie
> >>>
> >>> On 06/09/2021 14:32, Dan Rosher wrote:
> >>>> Hi,
> >>>>
> >>>> I was wondering if anyone had tried email alerts with streaming
> >>>> expressions, and what their experience was if attempting this with say
> >>>> 12 million emails / day? Traditionally this might have been done with a
> >>>> database cursor iterator daily.
> >>>>
> >>>> I was thinking of something like the following pseudocode expression,
> >>>> with 'kafka' as a custom push expression:
> >>>>
> >>>> daemon(id="alertId",
> >>>>          runInterval="1000",
> >>>>          kafka(
> >>>>           kafka_topic,
> >>>>           alertId,
> >>>>           topic(email_alerts,
> >>>>             doc_collection,
> >>>>             q="email query",
> >>>>             fl="id, title, abstract",
> >>>>             id="alertId",
> >>>>             initialCheckpoint=0)
> >>>>           )
> >>>> )
> >>>>
> >>>> If you have done something like this, where would you typically run
> >>>> the daemon: on replicas away from the replicas serving web queries?
> >>>>
> >>>> Many thanks in advance for any advice / suggestions,
> >>>>
> >>>> Dan
> >>>>
> >>> --
> >>> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> >>> <www.o19s.com>
> >>> Founding member of The Search Network <https://thesearchnetwork.com/>
> >>> and co-author of Searching the Enterprise
> >>> <https://opensourceconnections.com/about-us/books-resources/>
> >>> tel/fax: +44 (0)8700 118334
> >>> mobile: +44 (0)7767 825828
> >>>
> >>> OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
> >>> Amtsgericht Charlottenburg | HRB 230712 B
> >>> Geschäftsführer: John M. Woodell | David E. Pugh
> >>> Finanzamt: Berlin Finanzamt für Körperschaften II
> >>>
>
