Given how messed up MySQL is, in particular, it seems to me that adding a config that can do a look-back (either ID- or timestamp-based) would be useful, and pretty straightforward to add. This would at least help prevent data loss, at the expense of duplicate rows. If people *really* care about not losing data, they can configure a second connector to do bulk loads on a more periodic basis. Thoughts?
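To make the proposal concrete, here is a minimal sketch of what an ID-based look-back might compute (hypothetical function name; the connector has no such config today): instead of polling `WHERE id > last_seen`, subtract a configured window so rows that committed late with smaller IDs are re-scanned, trading duplicates for completeness.

```python
def lookback_lower_bound(last_seen_id, lookback):
    """Compute the lower bound for the next incremental poll.

    Instead of polling WHERE id > last_seen_id, re-scan the last
    `lookback` IDs so rows that committed late (with smaller IDs)
    are still picked up, at the cost of re-emitting duplicates.
    """
    return max(last_seen_id - lookback, 0)

# If the poller last saw ID 25607888 and lookback=100, the next query
# would effectively be:
#   SELECT * FROM funding_instructions WHERE id > 25607788
print(lookback_lower_bound(25607888, 100))  # -> 25607788
```

A timestamp-based look-back would be the same idea with `NOW() - interval` substituted for the ID arithmetic.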
On Wed, Feb 24, 2016 at 12:41 AM, Chris Riccomini <[email protected]> wrote:

> Hey all,
>
> Some more details on my question. My concern about using even IDs on
> immutable tables as a way to replicate data seems to be confirmed, at least
> for InnoDB if not configured properly. This page:
>
> http://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html
>
> describes how auto-increment IDs can get out of order. I have replicated
> this issue on my local machine (see example below the fold) with all three
> innodb_autoinc_lock_mode settings (0, 1, and 2). They all show the same
> results.
>
> As far as I can tell, at least with MySQL+InnoDB, the JDBC connector
> cannot be configured such that it isn't at risk of dropping data on
> incremental DB polls (whether you use incrementing, timestamp, or
> timestamp+incrementing).
>
> I'm curious if anyone has any thoughts on this. Again, the link I showed
> in the above email suggests that this is a problem with Oracle as well. Are
> people seeing data loss with the JDBC driver on high-write-load tables? I
> would be surprised if they weren't.
>
> Several solutions covered in that PDF are of interest:
>
> 1. The look-back-based approach (NOW() - 1 minute, or MAX(id) - 100). This
> will lead to significant duplication in the Kafka topic. Even with log
> compaction enabled, this will be annoying to users that are reading the
> feed in realtime.
>
> 2. Some kind of pre-query assignment. Essentially, a field that's set to
> NULL on insert. Every update sets the field back to NULL. An async process
> periodically wakes up and sets the field to a monotonically increasing ID.
> In the PDF, they suggest the SCN for Oracle. MySQL doesn't easily expose
> this, but I think UNIX_TIMESTAMP() would suffice as long as the updates
> don't happen too frequently.
>
> For (2), one idea would be to add the ability in the JDBC connector to add
> a pre-query, which could be executed just before the SELECT.
> In this mode, you could execute an UPDATE my_table SET scn =
> UNIX_TIMESTAMP() WHERE scn IS NULL. As far as I can tell, with MySQL, this
> would involve using a trigger to set the `scn` field back to NULL whenever
> an update on the row occurs.
>
> Cheers,
> Chris
>
> ----------
>
> #### Start with an empty table
>
> mysql> select * from funding_instructions;
> Empty set (0.00 sec)
>
> #### Create a transaction, and insert a row with state=2, but don't commit
> (this is done in another terminal).
>
> mysql> begin;
> Query OK, 0 rows affected (0.00 sec)
>
> mysql> insert into funding_instructions (state) VALUES (2);
> Query OK, 1 row affected, 4 warnings (0.00 sec)
>
> #### Outside of the transaction in the main terminal, insert a second row
> with state=3
>
> mysql> insert into funding_instructions (state) VALUES (3);
> Query OK, 1 row affected, 4 warnings (0.00 sec)
>
> #### Now you see only the second row that was inserted. Note the ID.
>
> mysql> select * from funding_instructions;
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | 25607888 |     3 |    0 |      0 | XXX      |  NULL |     NULL |           0 |           0 |       0 |          0 | NULL |                        NULL |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> 1 row in set (0.00 sec)
>
> #### Now commit the first transaction.
>
> mysql> commit;
> Query OK, 0 rows affected (0.00 sec)
>
> #### Now look at your results. Note that the smaller ID now appears.
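The pre-query idea in (2) can be sketched end to end. This uses Python's sqlite3 as a stand-in for MySQL (so `?` placeholders and an explicit stamp value replace `UNIX_TIMESTAMP()`, and the reset-on-update trigger is omitted); table and column names are illustrative, not anything the connector defines.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT, scn INTEGER)"
)

# Rows land with scn = NULL; in MySQL a trigger would also reset scn to
# NULL on every update so changed rows get re-stamped and re-polled.
db.execute("INSERT INTO my_table (payload) VALUES ('a')")
db.execute("INSERT INTO my_table (payload) VALUES ('b')")

def poll(db, last_scn, now):
    # Pre-query: stamp every unstamped row with a monotonically
    # increasing value (UNIX_TIMESTAMP() in the real MySQL version)...
    db.execute("UPDATE my_table SET scn = ? WHERE scn IS NULL", (now,))
    # ...then the usual incremental SELECT against the stamped column.
    return db.execute(
        "SELECT id, payload, scn FROM my_table WHERE scn > ? ORDER BY scn, id",
        (last_scn,),
    ).fetchall()

rows = poll(db, last_scn=0, now=1000)
print(rows)  # -> [(1, 'a', 1000), (2, 'b', 1000)]
```

Because the stamp is assigned after commit (by the async pre-query) rather than at insert time, it cannot go backwards the way auto-increment IDs do in the session below.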
> This breaks strict ID-based replication, even if the table is append-only.
>
> mysql> select * from funding_instructions;
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | 25607887 |     2 |    0 |      0 | XXX      |  NULL |     NULL |           0 |           0 |       0 |          0 | NULL |                        NULL |
> | 25607888 |     3 |    0 |      0 | XXX      |  NULL |     NULL |           0 |           0 |       0 |          0 | NULL |                        NULL |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
>
> On Tue, Feb 23, 2016 at 4:26 PM, Chris Riccomini <[email protected]>
> wrote:
>
>> Hey all,
>>
>> I was reviewing the Kafka Connect JDBC driver, and I had a question. Is
>> it possible to use the JDBC driver with a look-back configured? The reason
>> that I ask is that there are some known issues with using a modified
>> timestamp:
>>
>> Slide 14 here explains one with Oracle:
>>
>> https://qconsf.com/sf2007/dl/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf?path=/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf
>>
>> There is also some SCN-related discussion here:
>>
>> https://github.com/linkedin/databus/wiki/Databus-for-MySQL
>>
>> though that is more specific to MySQL.
>>
>> I am concerned that using insert IDs might not even be good enough
>> (assuming my tables were immutable, which they're not), since I believe
>> some DB storage systems might have the same issue. I think InnoDB's
>> primary-key auto-increment commit order is even tunable based on config.
>>
>> I would rather get some duplicates than lose data if at all possible.
>> Can I configure the JDBC driver to subtract some number from the offset
>> to prevent (or drastically reduce) lost data?
>>
>> Cheers,
>> Chris
