Re: Continuous queries and duplicates

Piotr Romański Mon, 17 Dec 2018 09:55:17 -0800

Hi all, sorry for answering so late.

I would like to use SqlQuery because I can leverage indexes there.


As was already mentioned earlier, the partition update counter is
exposed through CacheQueryEntryEvent. Initially, I thought that the
partition update counter was something that is persisted together with the
data, but I'm guessing now that it is only a part of the notification
mechanism.

I imagined that I would be able to implement my own deduplication by having
3 stages on the client side:

1. Keep processing initial query results and store their keys in memory.
2. When the initial query is over, process listener entries, but first check
   whether they have already been delivered in the first stage.
3. When we are sure that we are already processing notifications for commits
   executed after the initial query was done, process listener entries
   without any additional checks (so the key set from stage 1 can be removed
   from memory).

The problem is that I have no way to tell when I can move from stage 2 to 3.
Another problem is that we need to stash listener entries while still
processing initial query results, which causes excessive memory pressure on
our client.
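
The three stages above can be sketched in plain Java as follows. This is only
an illustration under stated assumptions: the class StagedDeduplicator and all
of its method names are hypothetical and are not part of the Ignite API, keys
are assumed to identify immutable entries, and noMoreDuplicates() stands for
the very signal that is missing today (nothing in Ignite tells the client when
to call it).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;

/** Hypothetical client-side helper; not an Ignite class. */
final class StagedDeduplicator<K, V> {
    private enum Stage { INITIAL_QUERY, FILTERING, PASS_THROUGH }

    /** A listener entry received while the initial query is still running. */
    private static final class Pending<K, V> {
        final K key;
        final V val;
        Pending(K key, V val) { this.key = key; this.val = val; }
    }

    private final BiConsumer<K, V> sink;
    private Set<K> seenKeys = new HashSet<>();
    private List<Pending<K, V>> stash = new ArrayList<>();
    private Stage stage = Stage.INITIAL_QUERY;

    StagedDeduplicator(BiConsumer<K, V> sink) { this.sink = sink; }

    /** Stage 1: deliver an initial query row and remember its key. */
    synchronized void onInitialResult(K key, V val) {
        if (stage != Stage.INITIAL_QUERY)
            throw new IllegalStateException("initial query already finished");
        seenKeys.add(key);
        sink.accept(key, val);
    }

    /** Called from the local listener for every notification. */
    synchronized void onListenerEvent(K key, V val) {
        switch (stage) {
            case INITIAL_QUERY:
                // Stash: this is the memory pressure mentioned above.
                stash.add(new Pending<>(key, val));
                break;
            case FILTERING:
                // Stage 2: drop entries already delivered in stage 1.
                if (!seenKeys.contains(key))
                    sink.accept(key, val);
                break;
            default:
                // Stage 3: no checks at all.
                sink.accept(key, val);
        }
    }

    /** Moves from stage 1 to stage 2 and replays the stashed entries. */
    synchronized void initialQueryDone() {
        if (stage != Stage.INITIAL_QUERY)
            return;
        stage = Stage.FILTERING;
        for (Pending<K, V> p : stash)
            if (!seenKeys.contains(p.key))
                sink.accept(p.key, p.val);
        stash = new ArrayList<>();
    }

    /** Hypothetical signal (unavailable today): moves to stage 3 and frees the key set. */
    synchronized void noMoreDuplicates() {
        if (stage == Stage.INITIAL_QUERY)
            initialQueryDone();
        stage = Stage.PASS_THROUGH;
        seenKeys = Collections.emptySet();
    }
}
```

Without a trigger for noMoreDuplicates(), the sketch stays in the filtering
stage forever and the key set is never released, which is the memory problem
described above.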

In my case, values are immutable: I never change them, I just add a new
entry for each newer version. Does it mean that I won't have any duplicates
between the initial query and listener entries when using continuous
queries on caches supporting MVCC?

After reading the related thread (
http://apache-ignite-developers.2346864.n4.nabble.com/Continuous-queries-and-MVCC-td33972.html)
I'm now concerned about the ordering. My case assumes that there are groups
of entries which belong to a business aggregate object and I would like to
make sure that if I commit two records in two serial transactions then I
receive the notifications in the same order. Those entries will have different
keys so based on what you said ("we'd better to leave things as is and
guarantee only per-key ordering"), it would seem that the order is not
guaranteed. But do you think it would be possible to guarantee order when
those entries share the same affinity key and they belong to the same
partition?

Piotr

On Fri, 14 Dec 2018, 19:31, Denis Magda <dma...@apache.org> wrote:

> Vladimir,
>
> Thanks for referring to the MVCC and Continuous Queries discussion, I knew
> that I saw us discussing a solution to the duplication problem. Let me copy
> and paste it in here for others:
>
> > 2) *Initial query*. We implemented it so that user can get some initial
> > data snapshot and then start receiving events. Without MVCC we have no
> > guarantees of visibility. E.g. if key is updated from V1 to V2, it is
> > possible to see V2 in initial query and in event. With MVCC it is now
> > technically possible to query data on certain snapshot and then receive
> > only events that happened after this snapshot, so that we never see V2
> > twice. Do you think this feature will be interesting for our users?
>
>
> Am I right that this would be a generic solution, whether you use a Scan or
> SQL query as the initial one? Have we planned it for the transactional SQL
> GA or is it out of scope for now?
>
> --
> Denis
>
> On Thu, Dec 13, 2018 at 12:40 PM Vladimir Ozerov <voze...@gridgain.com>
> wrote:
>
> > [1]
> > http://apache-ignite-developers.2346864.n4.nabble.com/Continuous-queries-and-MVCC-td33972.html
> >
> > On Thu, Dec 13, 2018 at 11:38 PM Vladimir Ozerov <voze...@gridgain.com>
> > wrote:
> >
> > > Denis,
> > >
> > > Not really. They are used to ensure that ordering of notifications is
> > > consistent with ordering of updates, so that when a key K is updated to
> > > V1, then V2, then V3, you never observe V1 -> V3 -> V2. It also solves
> > > the duplicate notification problem in case of node failures, when the
> > > same update is delivered twice.
> > >
> > > However, partition counters are unable to solve the duplicates problem
> > > in general. Essentially, the question is how to get a consistent view of
> > > some data plus all notifications which happened afterwards. There are
> > > only two ways to achieve this: either lock entries during the initial
> > > query, or take a kind of consistent data snapshot. The former was never
> > > implemented in Ignite: our Scan and SQL queries do not use locking. The
> > > latter is achievable in theory with MVCC. I raised that question earlier
> > > [1] (see p.2), and we came to the conclusion that it might be a good
> > > feature for the product. It is not implemented that way for MVCC now,
> > > but most probably it is not extraordinarily difficult to implement.
> > >
> > > Vladimir.
> > >
> > > [1]
> > > http://apache-ignite-developers.2346864.n4.nabble.com/Continuous-queries-and-MVCC-td33972.html#a33998
> > >
> > > On Thu, Dec 13, 2018 at 11:17 PM Denis Magda <dma...@apache.org> wrote:
> > >
> > >> Vladimir,
> > >>
> > >> The partition counter is supposed to be used internally to solve the
> > >> duplication issue. Does it sound like the right approach then?
> > >>
> > >> What would be an approach for SQL queries? Not sure the partition
> > >> counter is applicable.
> > >>
> > >> --
> > >> Denis
> > >>
> > >> On Thu, Dec 13, 2018 at 11:16 AM Vladimir Ozerov <voze...@gridgain.com>
> > >> wrote:
> > >>
> > >> > Partition counter is an internal implementation detail, which has no
> > >> > sensible meaning to end users. It should not be exposed through the
> > >> > public API.
> > >> >
> > >> > On Thu, Dec 13, 2018 at 10:14 PM Denis Magda <dma...@apache.org> wrote:
> > >> >
> > >> > > Hello Piotr,
> > >> > >
> > >> > > That's a known problem and I thought a JIRA ticket already existed.
> > >> > > However, I failed to locate it. A ticket for the improvement should be
> > >> > > created as a result of this conversation.
> > >> > >
> > >> > > Speaking of the initial query type, I would differentiate between
> > >> > > ScanQueries and SqlQueries. For the former, it sounds reasonable to
> > >> > > apply the partitionCounter logic. As for the latter, Vladimir Ozerov,
> > >> > > will it be addressed as part of MVCC/Transactional SQL activities?
> > >> > >
> > >> > > Btw, Piotr what's your initial query type?
> > >> > >
> > >> > > --
> > >> > > Denis
> > >> > >
> > >> > > On Thu, Dec 13, 2018 at 3:28 AM Piotr Romański
> > >> > > <piotr.roman...@gmail.com> wrote:
> > >> > >
> > >> > > > Hi, as suggested by Ilya here:
> > >> > > > http://apache-ignite-users.70518.x6.nabble.com/Continuous-queries-and-duplicates-td25314.html
> > >> > > > I'm resending it to the developers list.
> > >> > > >
> > >> > > > From that thread we know that there might be duplicates between
> > >> > > > initial query results and listener entries received as part of a
> > >> > > > continuous query. That means that users need to manually dedupe data.
> > >> > > >
> > >> > > > In my opinion, manual deduplication may in some use cases lead to
> > >> > > > memory problems on the client side. In order to remove duplicated
> > >> > > > notifications which we are receiving in the local listener, we need
> > >> > > > to keep all initial query results in memory (or at least their unique
> > >> > > > ids). Unfortunately, there is no way (is there?) to find a point in
> > >> > > > time when we can be sure that no dups will arrive anymore. That would
> > >> > > > mean that we need to keep that data indefinitely and use it every time
> > >> > > > a new notification arrives. In case of multiple continuous queries run
> > >> > > > from a single JVM, this might eventually become a memory or
> > >> > > > performance problem. I can see the following possible improvements to
> > >> > > > Ignite:
> > >> > > >
> > >> > > > 1. The deduplication between the initial query and incoming
> > >> > > > notifications could be done fully in Ignite. As far as I know there is
> > >> > > > already the updateCounter and partition id for all the objects so it
> > >> > > > could be used internally.
> > >> > > >
> > >> > > > 2. Add a guarantee that notifications arriving in the local listener
> > >> > > > after the query() method returns are not duplicates. This kind of
> > >> > > > functionality would require a specific synchronization inside Ignite.
> > >> > > > It would also mean that the query() method cannot return before all
> > >> > > > potential duplicates are processed by a local listener, which looks
> > >> > > > wrong.
> > >> > > >
> > >> > > > 3. Notify users that starting from a given notification they can be
> > >> > > > sure they will not receive any duplicates anymore. This could be an
> > >> > > > additional boolean flag in the CacheQueryEntryEvent.
> > >> > > >
> > >> > > > 4. CacheQueryEntryEvent already exposes the partitionUpdateCounter.
> > >> > > > Unfortunately we don't have this information for initial query
> > >> > > > results. If we had, a client could manually deduplicate notifications
> > >> > > > and get rid of initial query results for a given partition after newer
> > >> > > > notifications arrive. Also it would be very convenient to expose the
> > >> > > > partition id as well, but for now we can figure it out using the
> > >> > > > affinity service. The assumption here is that notifications are
> > >> > > > ordered by partitionUpdateCounter (is it true?).
> > >> > > >
> > >> > > > Please correct me if I'm missing anything.
> > >> > > >
> > >> > > > What do you think?
> > >> > > >
> > >> > > > Piotr
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
