I resend this message that was accidentally sent to Sergiu only.

---------- Forwarded message ----------
From: Guillaume Delhumeau <[email protected]>
Date: 2018-07-02 16:06 GMT+02:00
Subject: Re: [xwiki-devs] [Brainstorm] Notifications filters capabilities
and performances.
To: Sergiu Dumitriu <[email protected]>


Hi and thanks you for your answers.

2018-06-29 5:41 GMT+02:00 Sergiu Dumitriu <[email protected]>:

> The two main problems that I see are that you're mixing two separate
> things:
>
> 1. Tags which are stored in one place, and events which are stored in a
> different place
> 2. Again tags, and document locations. They may seem similar, but they
> are completely different as they are implemented. That's why saying "not
> in this wiki, except for this tag" cannot ever be implemented in a sane
> way since the exception is on a different plane.
>
> The logical approach would be to also store the tags in the events, but
> this is not a good thing to do.


Also, a tag could be added to a page AFTER the event has been triggered. In
this case, if you still want to fetch these pages, it won't work.



> Where do we stop? For future-proofing
> filters, we would have to store the entire document in the event, in
> case someone wants to add a filter on "documents with a specific type of
> object attached", or "documents with a specific value for a specific
> property of a specific object", or "documents with attachments ending in
> .extension".
>
> On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote:
> > For the tags filter, I can also:
> >
> > * perform a query to fetch all documents that correspond to the given
> tags
> > (it doesn't need to be permanent, but it would be better to cache it)
> > * add a HQL filter on these pages (OR document IN :list_of_pages).
> >
> > It's a little variation of solution A. It's ugly but it could do the job.
>
> It's the sanest proposal so far, but still impossible.
>

Apparently it's the way it's done on Activity Stream:
https://github.com/xwiki/xwiki-platform/blob/553afe7287dcc4ce8b588276e639bf
283352e805/xwiki-platform-core/xwiki-platform-activitystream/xwiki-platform-
activitystream-ui/src/main/resources/Main/Activity.xml#L1008


>
> Since "not in LOCATION except WITH TAG" doesn't work because it mixes
> two types of information in the same query fragment, how about making
> this a one-dimensional query as "only WITH TAG". And since this
> information is in a different place, pre-querying it is needed. However,
> if there are thousands or tens/hundreds of thousands of documents with a
> tag, won't the query crash anyway?
>

Sure, it won't scale.


>
> > 2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <
> > [email protected]>:
> >
> >> Hi developers.
> >>
> >> I am trying to add a new filter to the notifications to be able to
> follow
> >> pages
> >> that are marked with a given tag. And it leads me to some questions
> about
> >> the
> >> technical implementation of the notifications.
> >>
> >> To remind the context: notifications are computed on top of the events
> >> recorded
> >> by the event stream (a.k.a activity stream). We take events from the
> event
> >> stream SQL table, we apply some transformations on them, and we display
> >> them to
> >> the user.
> >>
> >> Then we have implemented the ability to filter on these events: for
> >> example
> >> "don't show events concerning the document A nor the wiki B". Filters
> are
> >> implemented with 2 distinct ways:
> >>
> >>   1/ SQL injections: each filter can add SQL elements in the query we
> make
> >> to
> >>      fetch the events from the event stream table. We made this
> mechanism
> >> so we
> >>      can let the database do a lot of the filtering process. After all,
> >> it's its
> >>      job and it's supposed to perform well. To be precise, Clement has
> >> even
> >>      created an Abstract Syntax Tree (AST) so it's easier to inject some
> >> content
> >>      in the query and it creates an abstraction over the SQL language so
> >> we can
> >>      even consider to change the storage of the event stream someday.
> >>
> >>      The bad thing is that some complex filtering are difficult to write
> >> with
> >>      the SQL language (event with the AST) or even impossible.
> >>
> >>   2/ Post-filtering: after the events have been fetched from the
> database,
> >> each
> >>      filter can still decide to keep or filter them. This is useful for
> >>      complex filtering that cannot be expressed with the SQL language.
> It
> >> is
> >>      also needed by the real-time notification email sender, because it
> >> takes
> >>      the events immediately when they occurs without fetching them in
> the
> >>      database (so SQL filters are bypassed).
> >>
> >>      The bad thing is that some events are loaded in the memory to
> finally
> >> be
> >>      rejected, and these filters can perform costly operations such as
> >> loading
> >>      documents.
> >>
> >> Until now, this double mechanism was working quite well, with each
> >> mechanism
> >> filling the lacks of the other.
> >>
> >> However, we still have technical limitations in our design:
> >>
> >>   1/ Users who have a lot of filter preferences can end up with a giant
> >> SQL
> >>      query that is almost impossible to perform by the database.
> Actually
> >> we had
> >>      a user complaining about an OutOfMemory problem in the HQL to SQL
> >>      translator !
> >>
> >>   2/ I cannot implement the tag filter !
> >>
> >> The tag filter is supposed to show events that concern pages that hold a
> >> given
> >> tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
> >> don't
> >> want to receive notifications about wiki A except for pages marked with
> >> the tag
> >> T".
> >>
> >> And it is not working. First because it is difficult to write a SQL
> query
> >> for
> >> that. It requires to make a join clause with the document and the object
> >> tables,
> >> which our SQL injection mechanism does not support. Even if it were
> >> possible,
> >> creating a SQL join with the document table will de-facto filter events
> >> that do
> >> not concern any pages or pages that do not have any objects: so many
> other
> >> filters will be broken. I also don't consider creating a SQL subquery, I
> >> think
> >> the whole query would became too big. So I decided to just not inject
> any
> >> SQL
> >> code for this filter and only implement the post-filtering mechanism.
> >>
> >> But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
> >> "WIKI <> 'WIKI A'" so the events concerning the wiki A are not fetched
> >> from the
> >> database. Consequence: the tag filter never see the events that it is
> >> supposed
> >> to keep. It would be actually possible to by-pass the first SQL
> injections
> >> by
> >> injecting something like "OR 1=1". But doing something like this is like
> >> dropping the all SQL injections mechanism.
> >>
> >> I see some solutions for this problem:
> >>
> >>   A/ For each tag, create a permanent list of pages that hold it. So I
> can
> >>      inject "OR document IN (that_list)". I think this is heavy.
>
> Definitely not. You'll have to maintain this list in parallel to the
> canonical storage for tags, the xwikidoc table itself, with all the
> headaches that data duplication brings.
>

I agree.


>
> >>   B/ Drop the SQL injection mechanism and only rely on the
> post-filtering
> >>      mechanism. It would require to load from the database A LOT of
> events,
> >>      but maybe we could cache this.
>
> Nope, filtering millions of events by inspecting them one by one is a
> performance nightmare.
>
> >>   C/ Don't drop the SQL injection mechanism completely but use it as
> >> little as
> >>      possible (for example, do not use it for LOCATION filtering). Seems
> >> hard to
> >>      determine when a filter should use this feature or not.
>
> I think location filtering should still happen in the query itself,
> since it is a lot faster.
>
> >>   D/ Don't implement the "tags" filter, since it is the root of the
> issue.
> >> But
> >>      it is like sweeping dirt under the carpet!
>
> Well, not implementing something is always the fastest, least intrusive
> solution to any problem...
>
> Humor aside, I think that in this case it is the best approach.
>

Thanks for saying it :)


>
> >> Since we have the OutOfMemory problem with the SQL injections becoming
> too
> >> huge,
> >> I am more in favor of solution B or C. But I'm not sure for now, since I
> >> do not
> >> know how much it would impact the performances and the scalability of
> the
> >> whole
> >> notifications feature.
> >>
> >> This is a complex topic, but I hope this message will inspire you some
> >> suggestions or things I have not seen with my own eyes.
> >>
> >> Thanks for your help,
> >>
> >> Guillaume
> >>
> >
> >
> >
> Other options:
>
> E/ Drop SQL altogether, move the event stream into a nosql database, and
> do stream filtering; kind of like the B proposal
>

That's unfortunately out of the scope right now.


>
> F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where
> a subscriber is created for every user and his filters, and matching
> events are placed in a queue until they are all sent (consumed) in a
> notification. Obviously, this only works for email notifications, not
> for browsing past events in a webpage.
>

Indeed it cannot replace Activity Stream. However, it was my first design
attempt:
http://design.xwiki.org/xwiki/bin/view/Proposal/
NotificationCenterforAppsImplementation#HIteration1

But then we decided to rely on what we had (the event stream) and to kill 2
birds with the same stones (having notifications + replacing activity
stream).

Thanks,

Guillaume




>
> --
> Sergiu Dumitriu
> http://purl.org/net/sergiu/
>



-- 
Guillaume Delhumeau ([email protected])
Research & Development Engineer at XWiki SAS
Committer on the XWiki.org project



-- 
Guillaume Delhumeau ([email protected])
Research & Development Engineer at XWiki SAS
Committer on the XWiki.org project

Reply via email to