There was a design implemented in Streaming Expressions for large-scale
alerting, described here:

https://joelsolr.blogspot.com/2017/01/deploying-solrs-new-parallel-executor.html

In this design you would store each alert in Solr as a topic expression.
Then a single daemon can run all the topics, or the work can be parallelized.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Sep 7, 2021 at 6:32 AM Charlie Hull <[email protected]> wrote:

> Hi Dan,
>
> Yuval's and my suggestions both rely on the same underlying code (Luwak,
> now called Lucene Monitor). This lets you store a set of Lucene queries
> and run them against every new document.
>
> The Lucene Monitor allows for very high-performance matching (I know of
> situations with around 1m stored queries, monitoring 1m new documents a
> day running on a few tens of nodes) and it does this with some clever
> optimisations: effectively it builds an index of your stored queries,
> and turns each new document into a query across this index (I know it
> sounds confusing!). It's a 'reverse search'. Check out the original
> Luwak project as it's got links to several presentations and blogs
> showing how others have implemented these systems.
>
> The bit you'll have to build is the Solr layer and then the code that
> uses this to generate alerts - and Solcolator and
> https://github.com/o19s/solr-monitor are two examples of how to do the
> first part, which you can build on. The facility to do a reverse search
> is not built into Solr yet, unlike Elasticsearch's Percolator.
>
> Best
>
> Charlie
>
> On 07/09/2021 10:24, Dan Rosher wrote:
> > Thanks Eric, Charlie and Yuval for all the feedback and suggestions.
> >
> > Eric: Yes, I thought the monitoring might be a bit of a pain, esp. with
> > millions of them. I'll have to check out the topic code, but I wondered if
> > I can look at the checkpoint collections for uniqueIds that haven't been
> > updated for a 'while', which might suggest the daemon had stopped/died,
> > rather than checking each daemon individually?
> >
> > I was also wondering whether it's possible, or a useful enhancement, to
> > look at the replica index version (as opposed to _version_) for the topic
> > streaming expression, to skip queries where the replica index version is
> > the same as what we might store in the checkpoint collection? For
> > collections that update infrequently I think this might be useful.
> >
> > Charlie: It was for email alerts, so a user stores a query for collection
> > docs to match against, and then the system emails matches to the user. Do
> > you think solr-monitor can be used for this purpose?
> >
> > Yuval: I like the idea of using the UpdateProcessor, at least there's no
> > need for daemons or monitoring of them, but would this scale for millions
> > of email queries though?
> >
> > Many thanks again to all.
> >
> > Kind regards,
> > Dan
> >
> > On Mon, 6 Sept 2021 at 18:47, Yuval Paz <[email protected]> wrote:
> >
> >> My team and I are building upon this solcolator:
> >> https://github.com/SOLR4189/solcolator
> >>
> >> Currently the processor is built for Solr 6.5.1. We are working on
> >> updating our Solr and I hope to release a complete version of our
> >> Solcolator as open source then (it will be for version 8.6.x).
> >>
> >> Making it an update processor (either make it the last element and
> >> replace the usual processor that indexes the document, or by using it as
> >> the one from last processor in the collection, and so allow monitoring
> >> also atomic updates [which is relatively costly]).
> >>
> >> By making it an update processor we don't rely on the streaming daemon,
> >> which we found unsatisfying as we wish to allow users to define their own
> >> monitors over the index.
> >>
> >> On Mon, Sep 6, 2021, 8:25 PM Charlie Hull <[email protected]> wrote:
> >>
> >>> Are you trying to monitor a stream of emails for certain patterns?
> >>> In which case you might look at the Lucene Monitor
> >>> https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html
> >>> https://issues.apache.org/jira/browse/LUCENE-8766, which was originally
> >>> Luwak - at my previous company Flax we helped build several large-scale
> >>> monitoring systems with this https://github.com/flaxsearch/luwak . It's
> >>> not officially surfaced in Solr yet although my colleague Scott Stults
> >>> has been working on some ideas: https://github.com/o19s/solr-monitor
> >>>
> >>> best
> >>> Charlie
> >>>
> >>> On 06/09/2021 14:32, Dan Rosher wrote:
> >>>> Hi,
> >>>>
> >>>> I was wondering if anyone had tried email alerts with streaming
> >>>> expressions, and what their experience was if attempting this with say
> >>>> 12 million emails / day? Traditionally this might have been done with a
> >>>> database cursor iterator daily.
> >>>>
> >>>> I was thinking of something like the following pseudocode expression,
> >>>> with 'kafka' as a custom push expression:
> >>>>
> >>>> daemon(id="alertId",
> >>>>        runInterval="1000",
> >>>>        kafka(
> >>>>          kafka_topic,
> >>>>          alertId,
> >>>>          topic(email_alerts,
> >>>>                doc_collection,
> >>>>                q="email query",
> >>>>                fl="id, title, abstract",
> >>>>                id="alertId",
> >>>>                initialCheckpoint=0)
> >>>>        )
> >>>> )
> >>>>
> >>>> If you have done something like this, 'where' would you typically run
> >>>> the daemon: on replicas away from replicas running web queries?
> >>>>
> >>>> Many thanks in advance for any advice / suggestions,
> >>>>
> >>>> Dan
> >>>>
> >>> --
> >>> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> >>> <www.o19s.com>
> >>> Founding member of The Search Network <https://thesearchnetwork.com/>
> >>> and co-author of Searching the Enterprise
> >>> <https://opensourceconnections.com/about-us/books-resources/>
> >>> tel/fax: +44 (0)8700 118334
> >>> mobile: +44 (0)7767 825828
> >>>
> >>> OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
> >>> Amtsgericht Charlottenburg | HRB 230712 B
> >>> Geschäftsführer: John M. Woodell | David E. Pugh
> >>> Finanzamt: Berlin Finanzamt für Körperschaften II
> >>>
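
For anyone skimming the thread, the 'reverse search' idea Charlie describes (build an index of the stored queries, then turn each new document into a lookup against that index) can be illustrated with a toy sketch in Python. This is only an illustration under simplifying assumptions, not the Lucene Monitor API or its actual implementation: the names ToyMonitor, register and match are hypothetical, queries are treated as a plain AND of whitespace-separated terms, and none of the real optimisations are modelled.

```python
from collections import defaultdict


class ToyMonitor:
    """Toy reverse search (hypothetical, not Lucene Monitor's API)."""

    def __init__(self):
        # Index of stored queries: term -> ids of queries that require it.
        self._term_to_queries = defaultdict(set)
        # Query id -> full set of terms the query requires.
        self._queries = {}

    def register(self, query_id, query_text):
        """Store a query; assumed semantics: every term must be present."""
        terms = set(query_text.lower().split())
        self._queries[query_id] = terms
        for term in terms:
            self._term_to_queries[term].add(query_id)

    def match(self, document_text):
        """Return the sorted ids of stored queries matched by the document."""
        doc_terms = set(document_text.lower().split())
        # Step 1: the document acts as a query over the query index, so only
        # stored queries sharing at least one term become candidates.
        candidates = set()
        for term in doc_terms:
            candidates |= self._term_to_queries.get(term, set())
        # Step 2: verify each candidate fully against the document.
        return sorted(q for q in candidates if self._queries[q] <= doc_terms)


monitor = ToyMonitor()
monitor.register("alert-1", "solr streaming")
monitor.register("alert-2", "lucene monitor")
monitor.register("alert-3", "kafka")

matches = monitor.match("Lucene Monitor was originally called Luwak")
print(matches)
```

The point of the two steps is that most stored queries are never evaluated for a given document: only those sharing a term with it are checked, which is the intuition behind matching large numbers of stored alerts against each incoming document.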
