I resend this message that was accidentally sent to Sergiu only. ---------- Forwarded message ---------- From: Guillaume Delhumeau <[email protected]> Date: 2018-07-02 16:06 GMT+02:00 Subject: Re: [xwiki-devs] [Brainstorm] Notifications filters capabilities and performances. To: Sergiu Dumitriu <[email protected]>
Hi and thanks you for your answers. 2018-06-29 5:41 GMT+02:00 Sergiu Dumitriu <[email protected]>: > The two main problems that I see are that you're mixing two separate > things: > > 1. Tags which are stored in one place, and events which are stored in a > different place > 2. Again tags, and document locations. They may seem similar, but they > are completely different as they are implemented. That's why saying "not > in this wiki, except for this tag" cannot ever be implemented in a sane > way since the exception is on a different plane. > > The logical approach would be to also store the tags in the events, but > this is not a good thing to do. Also, a tag could be added to a page AFTER the event has been triggered. In this case, if you still want to fetch these pages, it won't work. > Where do we stop? For future-proofing > filters, we would have to store the entire document in the event, in > case someone wants to add a filter on "documents with a specific type of > object attached", or "documents with a specific value for a specific > property of a specific object", or "documents with attachments ending in > .extension". > > On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote: > > For the tags filter, I can also: > > > > * perform a query to fetch all documents that correspond to the given > tags > > (it doesn't need to be permanent, but it would be better to cache it) > > * add a HQL filter on these pages (OR document IN :list_of_pages). > > > > It's a little variation of solution A. It's ugly but it could do the job. > > It's the sanest proposal so far, but still impossible. > Apparently it's the way it's done on Activity Stream: https://github.com/xwiki/xwiki-platform/blob/553afe7287dcc4ce8b588276e639bf 283352e805/xwiki-platform-core/xwiki-platform-activitystream/xwiki-platform- activitystream-ui/src/main/resources/Main/Activity.xml#L1008 > > Since "not in LOCATION except WITH TAG" doesn't work because it mixes > two types of information in the same query fragment, how about making > this a one-dimensional query as "only WITH TAG". And since this > information is in a different place, pre-querying it is needed. However, > if there are thousands or tens/hundreds of thousands of documents with a > tag, won't the query crash anyway? > Sure, it won't scale. > > > 2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau < > > [email protected]>: > > > >> Hi developers. > >> > >> I am trying to add a new filter to the notifications to be able to > follow > >> pages > >> that are marked with a given tag. And it leads me to some questions > about > >> the > >> technical implementation of the notifications. > >> > >> To remind the context: notifications are computed on top of the events > >> recorded > >> by the event stream (a.k.a activity stream). We take events from the > event > >> stream SQL table, we apply some transformations on them, and we display > >> them to > >> the user. > >> > >> Then we have implemented the ability to filter on these events: for > >> example > >> "don't show events concerning the document A nor the wiki B". Filters > are > >> implemented with 2 distinct ways: > >> > >> 1/ SQL injections: each filter can add SQL elements in the query we > make > >> to > >> fetch the events from the event stream table. We made this > mechanism > >> so we > >> can let the database do a lot of the filtering process. After all, > >> it's its > >> job and it's supposed to perform well. To be precise, Clement has > >> even > >> created an Abstract Syntax Tree (AST) so it's easier to inject some > >> content > >> in the query and it creates an abstraction over the SQL language so > >> we can > >> even consider to change the storage of the event stream someday. > >> > >> The bad thing is that some complex filtering are difficult to write > >> with > >> the SQL language (event with the AST) or even impossible. > >> > >> 2/ Post-filtering: after the events have been fetched from the > database, > >> each > >> filter can still decide to keep or filter them. This is useful for > >> complex filtering that cannot be expressed with the SQL language. > It > >> is > >> also needed by the real-time notification email sender, because it > >> takes > >> the events immediately when they occurs without fetching them in > the > >> database (so SQL filters are bypassed). > >> > >> The bad thing is that some events are loaded in the memory to > finally > >> be > >> rejected, and these filters can perform costly operations such as > >> loading > >> documents. > >> > >> Until now, this double mechanism was working quite well, with each > >> mechanism > >> filling the lacks of the other. > >> > >> However, we still have technical limitations in our design: > >> > >> 1/ Users who have a lot of filter preferences can end up with a giant > >> SQL > >> query that is almost impossible to perform by the database. > Actually > >> we had > >> a user complaining about an OutOfMemory problem in the HQL to SQL > >> translator ! > >> > >> 2/ I cannot implement the tag filter ! > >> > >> The tag filter is supposed to show events that concern pages that hold a > >> given > >> tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I > >> don't > >> want to receive notifications about wiki A except for pages marked with > >> the tag > >> T". > >> > >> And it is not working. First because it is difficult to write a SQL > query > >> for > >> that. It requires to make a join clause with the document and the object > >> tables, > >> which our SQL injection mechanism does not support. Even if it were > >> possible, > >> creating a SQL join with the document table will de-facto filter events > >> that do > >> not concern any pages or pages that do not have any objects: so many > other > >> filters will be broken. I also don't consider creating a SQL subquery, I > >> think > >> the whole query would became too big. So I decided to just not inject > any > >> SQL > >> code for this filter and only implement the post-filtering mechanism. > >> > >> But the other filter "EXCLUDE WIKI A" generates a SQL injection such as > >> "WIKI <> 'WIKI A'" so the events concerning the wiki A are not fetched > >> from the > >> database. Consequence: the tag filter never see the events that it is > >> supposed > >> to keep. It would be actually possible to by-pass the first SQL > injections > >> by > >> injecting something like "OR 1=1". But doing something like this is like > >> dropping the all SQL injections mechanism. > >> > >> I see some solutions for this problem: > >> > >> A/ For each tag, create a permanent list of pages that hold it. So I > can > >> inject "OR document IN (that_list)". I think this is heavy. > > Definitely not. You'll have to maintain this list in parallel to the > canonical storage for tags, the xwikidoc table itself, with all the > headaches that data duplication brings. > I agree. > > >> B/ Drop the SQL injection mechanism and only rely on the > post-filtering > >> mechanism. It would require to load from the database A LOT of > events, > >> but maybe we could cache this. > > Nope, filtering millions of events by inspecting them one by one is a > performance nightmare. > > >> C/ Don't drop the SQL injection mechanism completely but use it as > >> little as > >> possible (for example, do not use it for LOCATION filtering). Seems > >> hard to > >> determine when a filter should use this feature or not. > > I think location filtering should still happen in the query itself, > since it is a lot faster. > > >> D/ Don't implement the "tags" filter, since it is the root of the > issue. > >> But > >> it is like sweeping dirt under the carpet! > > Well, not implementing something is always the fastest, least intrusive > solution to any problem... > > Humor aside, I think that in this case it is the best approach. > Thanks for saying it :) > > >> Since we have the OutOfMemory problem with the SQL injections becoming > too > >> huge, > >> I am more in favor of solution B or C. But I'm not sure for now, since I > >> do not > >> know how much it would impact the performances and the scalability of > the > >> whole > >> notifications feature. > >> > >> This is a complex topic, but I hope this message will inspire you some > >> suggestions or things I have not seen with my own eyes. > >> > >> Thanks for your help, > >> > >> Guillaume > >> > > > > > > > Other options: > > E/ Drop SQL altogether, move the event stream into a nosql database, and > do stream filtering; kind of like the B proposal > That's unfortunately out of the scope right now. > > F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where > a subscriber is created for every user and his filters, and matching > events are placed in a queue until they are all sent (consumed) in a > notification. Obviously, this only works for email notifications, not > for browsing past events in a webpage. > Indeed it cannot replace Activity Stream. However, it was my first design attempt: http://design.xwiki.org/xwiki/bin/view/Proposal/ NotificationCenterforAppsImplementation#HIteration1 But then we decided to rely on what we had (the event stream) and to kill 2 birds with the same stones (having notifications + replacing activity stream). Thanks, Guillaume > > -- > Sergiu Dumitriu > http://purl.org/net/sergiu/ > -- Guillaume Delhumeau ([email protected]) Research & Development Engineer at XWiki SAS Committer on the XWiki.org project -- Guillaume Delhumeau ([email protected]) Research & Development Engineer at XWiki SAS Committer on the XWiki.org project

