Re: Batch DML queries design discussion

Vladimir Ozerov Fri, 09 Dec 2016 00:36:45 -0800

I already expressed my concern - this is a counterintuitive approach, because
without happens-before guarantees a pure streaming model can be applied only to
independent chunks of data. It means that the mentioned ETL use case is not
feasible - ETL always depends on implicit or explicit links between tables,
and hence streaming is not applicable here. And my question still stands -
what product, except possibly Ignite, does this kind of JDBC streaming? Any
example?
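
To make the happens-before concern concrete, here is a minimal sketch. It does
not use the Ignite API: BufferedLoaderSketch is a hypothetical stand-in for any
loader that buffers writes and applies them later, and it only assumes that
reads go straight to the store.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a streamer: writes are buffered and applied only on flush. */
public class BufferedLoaderSketch {
    private final Map<Integer, String> store = new HashMap<>();                 // the "table"
    private final List<Map.Entry<Integer, String>> buffer = new ArrayList<>();  // pending writes
    private final int flushSize;

    public BufferedLoaderSketch(int flushSize) {
        this.flushSize = flushSize;
    }

    /** "INSERT": returns immediately; the row is only buffered until the buffer fills. */
    public void insert(int key, String val) {
        buffer.add(new AbstractMap.SimpleEntry<>(key, val));
        if (buffer.size() >= flushSize)
            flush();
    }

    /** "SELECT": reads the store only, so rows still in the buffer are invisible. */
    public String select(int key) {
        return store.get(key);
    }

    /** Applies all buffered writes to the store. */
    public void flush() {
        for (Map.Entry<Integer, String> e : buffer)
            store.put(e.getKey(), e.getValue());
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferedLoaderSketch loader = new BufferedLoaderSketch(3);
        loader.insert(1, "a");
        System.out.println(loader.select(1)); // prints "null": the insert is not visible yet
        loader.flush();
        System.out.println(loader.select(1)); // prints "a"
    }
}
```

A statement that depends on an earlier one (a lookup against a table loaded a
moment ago, for instance) sees stale data in this model unless the caller
flushes explicitly.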


Another problem is that connection-wide property doesn't fit well in JDBC

On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <[email protected]>
wrote:

> Gents,
>
> As Sergi suggested, batching and streaming are very different semantically.
>
> To use standard JDBC batching, all we need to do is convert it to a
> cache.putAll() method, as semantically a putAll(...) call is identical to a
> JDBC batch. Of course, if we see an UPDATE with a WHERE clause in between,
> then we may have to break a batch into several chunks and execute the
> update in between. The DataStreamer should not be used here.
>
> I believe that for streaming we need to add a special JDBC/ODBC connection
> flag. Whenever this flag is set to true, we should allow only INSERT
> or single-UPDATE operations and use DataStreamer API internally. All
> operations other than INSERT or single-UPDATE should be prohibited.
>
> I think this design is semantically clear. Any objections?
>
> D.
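
The chunking described above (runs of inserts become one putAll each, and an
UPDATE in between acts as a barrier) could look roughly like the sketch below.
Op and plan() are hypothetical names, not part of the JDBC driver, and
duplicate keys inside one chunk are simply overwritten here, which a real
INSERT would not allow.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchChunkingSketch {
    /** Hypothetical batch entry: either a key-value insert or an arbitrary UPDATE statement. */
    public static final class Op {
        final Integer key;
        final String val;
        final String updateSql;

        private Op(Integer key, String val, String updateSql) {
            this.key = key;
            this.val = val;
            this.updateSql = updateSql;
        }

        public static Op insert(int key, String val) { return new Op(key, val, null); }

        public static Op update(String sql) { return new Op(null, null, sql); }

        boolean isInsert() { return updateSql == null; }
    }

    /**
     * Splits a batch into ordered steps: a Map for each run of consecutive inserts
     * (a candidate for one putAll call) and a String for each UPDATE in between.
     */
    public static List<Object> plan(List<Op> batch) {
        List<Object> steps = new ArrayList<>();
        Map<Integer, String> chunk = new LinkedHashMap<>();
        for (Op op : batch) {
            if (op.isInsert()) {
                chunk.put(op.key, op.val);
            } else {
                // An UPDATE closes the current chunk so that it observes all earlier inserts.
                if (!chunk.isEmpty()) {
                    steps.add(chunk);
                    chunk = new LinkedHashMap<>();
                }
                steps.add(op.updateSql);
            }
        }
        if (!chunk.isEmpty())
            steps.add(chunk);
        return steps;
    }
}
```

Each Map step would then be handed to cache.putAll(), and each String step
executed as a regular statement, in order.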
>
> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <[email protected]>
> wrote:
>
> > If we use Streamer, then we always have `happens-before` broken. This is
> > ok, because Streamer is for data loading, not for usual operating.
> >
> > We are not inventing any bicycles, just separating concerns: Batching and
> > Streaming.
> >
> > My point here is that they should not depend on each other at all:
> > Batching can work with or without Streaming, as well as Streaming can work
> > with or without Batching.
> >
> > Your proposal is a set of non-obvious rules for them to work. I see no
> > reasons for these complications.
> >
> > Sergi
> >
> >
> > 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <[email protected]>:
> >
> > > Sergi,
> > >
> > > If a user calls a single *execute()* operation, then most likely it is not
> > > batching. We should not rely on a strange case where the user performs
> > > batching without using the standard and well-adopted JDBC batching API. The
> > > main problem with the streamer is that it is async and hence breaks
> > > happens-before guarantees in a single thread: a SELECT after an INSERT might
> > > not return the inserted value.
> > >
> > > Honestly, I do not really understand why we are trying to re-invent a
> > > bicycle here. There is a standard API - let's just use it and make it
> > > flexible enough to take advantage of IgniteDataStreamer if needed.
> > >
> > > Is there any use case which is not covered by this solution? Or let me
> > > ask from the opposite side - are there any well-known JDBC drivers which
> > > perform batching/streaming from non-batched update statements?
> > >
> > > Vladimir.
> > >
> > > On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <[email protected]>
> > > wrote:
> > >
> > > > Vladimir,
> > > >
> > > > I see no reason to forbid Streamer usage from non-batched statement
> > > > execution.
> > > > It is common that users already have their ETL tools and you can't be
> > > > sure if they use batching or not.
> > > >
> > > > Alex,
> > > >
> > > > I guess we have to decide on Streaming first and then we will discuss
> > > > Batching separately, ok? Because this decision may become important
> > > > for batching implementation.
> > > >
> > > > Sergi
> > > >
> > > > 2016-12-08 15:31 GMT+03:00 Andrey Gura <[email protected]>:
> > > >
> > > > > Alex,
> > > > >
> > > > > In most cases JdbcQueryTask should be executed locally on the client
> > > > > node started by the JDBC driver.
> > > > >
> > > > > JdbcQueryTask.QueryResult res =
> > > > >     loc ? qryTask.call() :
> > > > >         ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
> > > > >
> > > > > Is it valid behavior after introducing DML functionality?
> > > > >
> > > > > In cases when a user wants to execute a query on a specific node, he
> > > > > should fully understand what he wants and what can go wrong.
> > > > >
> > > > >
> > > > > On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko
> > > > > <[email protected]> wrote:
> > > > > > Sergi,
> > > > > >
> > > > > > JDBC batching might work quite differently from driver to driver.
> > > > > > Say, MySQL happily rewrites queries as I had suggested in the
> > > > > > beginning of this thread (it's not the only strategy, but one of
> > > > > > the possible options) - and, BTW, I would like to hear at least an
> > > > > > opinion about it.
> > > > > >
> > > > > > On your first approach, section before streamer: you suggest that
> > > > > > we send a single statement and multiple param sets as a single
> > > > > > query task, am I right? (Just to make sure that I got you
> > > > > > properly.) If so, do you also mean that the API (namely
> > > > > > JdbcQueryTask) between server and client should also change? Or
> > > > > > should new API means be added to facilitate batching tasks?
> > > > > >
> > > > > > - Alex
> > > > > >
> > > > > > 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <[email protected]>:
> > > > > >> Guys,
> > > > > >>
> > > > > > >> I discussed this feature with Dmitriy and we came to the conclusion
> > > > > > >> that batching in JDBC and Data Streaming in Ignite have different
> > > > > > >> semantics and performance characteristics. Thus they are independent
> > > > > > >> features (they may work together or separately, but this is
> > > > > > >> another story).
> > > > > >>
> > > > > >> Let me explain.
> > > > > >>
> > > > > >> This is how JDBC batching works:
> > > > > >> - Add N sets of parameters to a prepared statement.
> > > > > >> - Manually execute prepared statement.
> > > > > >> - Repeat until all the data is loaded.
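
For reference, the three steps above map onto the standard java.sql API roughly
as in this sketch. The table and column names and the load() helper are
hypothetical; only addBatch() and executeBatch() come from JDBC itself.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class JdbcBatchSketch {
    /**
     * Loads the given names in batches of batchSize.
     * Returns the number of executeBatch() calls, i.e. explicit sends to the server.
     */
    public static int load(Connection conn, List<String> names, int batchSize) throws SQLException {
        int executions = 0;
        // Hypothetical table and columns, for illustration only.
        String sql = "INSERT INTO person (id, name) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (int i = 0; i < names.size(); i++) {
                ps.setInt(1, i);
                ps.setString(2, names.get(i));
                ps.addBatch();                  // step 1: add a set of parameters
                if (++pending == batchSize) {
                    ps.executeBatch();          // step 2: the caller decides when to send
                    executions++;
                    pending = 0;
                }
            }
            if (pending > 0) {                  // step 3: send whatever is left
                ps.executeBatch();
                executions++;
            }
        }
        return executions;
    }
}
```

The point to notice is that nothing is sent until the caller invokes
executeBatch().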
> > > > > >>
> > > > > >>
> > > > > >> This is how data streamer works:
> > > > > >> - Keep adding data.
> > > > > > >> - Streamer will buffer and load buffered per-node batches when they
> > > > > > >>   are big enough.
> > > > > >> - Close streamer to make sure that everything is loaded.
> > > > > >>
> > > > > > >> As you can see we have a difference in semantics of when we send
> > > > > > >> data: if in our JDBC we allow sending batches to nodes without
> > > > > > >> calling `execute` (and probably we will need to make `execute` a
> > > > > > >> no-op here), then we are violating the semantics of JDBC; if we
> > > > > > >> disallow this behavior, then this batching will underperform.
> > > > > >>
> > > > > > >> Thus I suggest keeping these features (JDBC Batching and JDBC
> > > > > > >> Streaming) as separate features.
> > > > > >>
> > > > > > >> As I already said they can work together: Batching will batch
> > > > > > >> parameters and on `execute` they will go to the Streamer in one shot
> > > > > > >> and Streamer will deal with the rest.
> > > > > >>
> > > > > >> Sergi
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <
> [email protected]
> > >:
> > > > > >>
> > > > > >>> Hi Alex,
> > > > > >>>
> > > > > > >>> To my understanding there are two possible approaches to batching
> > > > > > >>> in the JDBC layer:
> > > > > >>>
> > > > > > >>> 1) Rely on the default batching API. Specifically
> > > > > > >>> *PreparedStatement.addBatch()* [1]
> > > > > > >>> and others. This is a nice and clear API, users are used to it, and
> > > > > > >>> its adoption will minimize user code changes when migrating from
> > > > > > >>> other JDBC sources. We simply copy updates locally and then execute
> > > > > > >>> them all at once with only a single network hop to servers.
> > > > > > >>> *IgniteDataStreamer* can be used underneath.
> > > > > >>>
> > > > > >>> 2) Or we can have separate connection flag which will move all
> > > > > >>> INSERT/UPDATE/DELETE statements through streamer.
> > > > > >>>
> > > > > > >>> I prefer the first approach.
> > > > > >>>
> > > > > > >>> Also we need to keep in mind that data streamer has poor
> > > > > > >>> performance when adding single key-value pairs due to high overhead
> > > > > > >>> on concurrency and other bookkeeping. Instead, it is better to
> > > > > > >>> pre-batch key-value pairs before giving them to streamer.
> > > > > >>>
> > > > > >>> Vladimir.
> > > > > >>>
> > > > > >>> [1]
> > > > > >>> https://docs.oracle.com/javase/8/docs/api/java/sql/
> > > > > PreparedStatement.html#
> > > > > >>> addBatch--
> > > > > >>>
> > > > > >>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <
> > > > > >>> [email protected]> wrote:
> > > > > >>>
> > > > > >>> > Hello Igniters,
> > > > > >>> >
> > > > > > >>> > One of the major improvements to DML has to be support of batch
> > > > > > >>> > statements. I'd like to discuss its implementation. The suggested
> > > > > > >>> > approach is to rewrite the given query, turning it from a few
> > > > > > >>> > INSERTs into a single statement and processing arguments
> > > > > > >>> > accordingly. I suggest this because the whole point of batching
> > > > > > >>> > is to make as few interactions with the cluster as possible and
> > > > > > >>> > to make operations as condensed as possible, and in the case of
> > > > > > >>> > Ignite it means that we should send as few JdbcQueryTasks as
> > > > > > >>> > possible. And, as long as a query task holds a single query and
> > > > > > >>> > its arguments, this approach will not require any changes to the
> > > > > > >>> > current design and won't break any backward compatibility - all
> > > > > > >>> > the dirty work of rewriting will be done by the JDBC driver.
> > > > > > >>> > Without rewriting, we could introduce some new query task for
> > > > > > >>> > batch operations, but that would make it impossible to send such
> > > > > > >>> > requests from newer clients to older servers (say, servers of
> > > > > > >>> > version 1.8.0, which do not know about batching, let alone older
> > > > > > >>> > versions).
> > > > > > >>> > I'd like to hear comments and suggestions from the community.
> > > > > > >>> > Thanks!
> > > > > >>> >
> > > > > >>> > - Alex
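
A rough sketch of the rewriting idea above: N parameter sets for a single-row
INSERT are folded into one multi-row INSERT with a flattened argument list, so
that one query task can carry the whole batch. The rewrite() helper and the
Rewritten holder are hypothetical names, and the sketch assumes the statement
is a plain INSERT ... VALUES with positional parameters only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InsertRewriteSketch {
    /** Result of rewriting: one SQL string plus the flattened argument list. */
    public static final class Rewritten {
        public final String sql;
        public final List<Object> args;

        Rewritten(String sql, List<Object> args) {
            this.sql = sql;
            this.args = args;
        }
    }

    /**
     * Folds several parameter sets into one multi-row INSERT.
     * insertPrefix is everything up to and including VALUES,
     * e.g. "INSERT INTO person (id, name) VALUES".
     */
    public static Rewritten rewrite(String insertPrefix, List<Object[]> paramSets) {
        if (paramSets.isEmpty())
            throw new IllegalArgumentException("empty batch");
        int cols = paramSets.get(0).length;
        StringBuilder sql = new StringBuilder(insertPrefix);
        List<Object> args = new ArrayList<>();
        for (int i = 0; i < paramSets.size(); i++) {
            Object[] row = paramSets.get(i);
            if (row.length != cols)
                throw new IllegalArgumentException("parameter set " + i + " has a different length");
            // One "(?, ?, ...)" group per parameter set, comma-separated.
            sql.append(i == 0 ? " (" : ", (");
            for (int c = 0; c < cols; c++)
                sql.append(c == 0 ? "?" : ", ?");
            sql.append(')');
            args.addAll(Arrays.asList(row));
        }
        return new Rewritten(sql.toString(), args);
    }
}
```

With two parameter sets of two values each, the result is
"INSERT INTO person (id, name) VALUES (?, ?), (?, ?)" plus four arguments in
row order.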
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
>



-- 
Vladimir Ozerov
Senior Software Architect
GridGain Systems
www.gridgain.com
*+7 (960) 283 98 40*
