Hey all,

Some more details on my question. My concern about using insert IDs on
immutable tables as a way to replicate data seems to be confirmed, at least
for InnoDB, regardless of how it's configured. This page:

http://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html

describes how auto-increment IDs can be assigned out of commit order. I have
reproduced this issue on my local machine (see example below the fold) with
all three innodb_autoinc_lock_mode settings (0, 1, and 2). They all show the
same result.

As far as I can tell, at least with MySQL+InnoDB, there is no way to
configure the JDBC connector such that it isn't at risk of dropping data on
incremental DB polls (whether you use incrementing, timestamp, or
timestamp+incrementing mode).
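To make the failure mode concrete, here is a minimal sketch (in Python, with hypothetical names; this is not the connector's actual code) of how an incrementing-mode poller misses a row whose auto-increment ID was assigned before, but committed after, the poll that advanced the offset:

```python
# Simulates incrementing-mode polling against a table where auto-increment
# IDs are assigned at INSERT time, but rows only become visible at COMMIT.

def poll(table, last_seen_id):
    """Return committed rows with id > last_seen_id, mimicking the query
    SELECT * FROM t WHERE id > ? ORDER BY id ASC."""
    rows = [r for r in table if r["committed"] and r["id"] > last_seen_id]
    rows.sort(key=lambda r: r["id"])
    return rows

# Two inserts: id 1 is assigned first, but its transaction is still open.
table = [
    {"id": 1, "state": 2, "committed": False},  # open transaction
    {"id": 2, "state": 3, "committed": True},   # committed immediately
]

# Poll 1 sees only id 2, so the stored offset advances to 2.
batch = poll(table, last_seen_id=0)
offset = max(r["id"] for r in batch)

# The first transaction now commits...
table[0]["committed"] = True

# ...but the next poll starts above offset 2, so id 1 is never emitted.
missed = poll(table, last_seen_id=offset)
```

The row with id 1 is silently lost: it never appears in any poll's result set, even though it was eventually committed.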

I'm curious if anyone has any thoughts on this. Again, the link in my
previous email (quoted below) suggests that this is a problem with Oracle as
well. Are people seeing data loss with the JDBC driver on high-write-load
tables? I would be surprised if they weren't.

Several solutions covered in that PDF are of interest:

1. The look-back-based approach (NOW() - 1 minute, or MAX(id) - 100). This
will lead to significant duplication in the Kafka topic. Even with log
compaction enabled, this will be annoying for users reading the feed in
real time.
2. Some kind of pre-query assignment: essentially a field that is set to
NULL on insert and reset to NULL by every update. An async process
periodically wakes up and sets the field to a monotonically increasing
value. The PDF suggests the SCN for Oracle. MySQL doesn't easily expose an
equivalent, but I think UNIX_TIMESTAMP() would suffice as long as updates
don't happen too frequently.
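For (1), a rough sketch of how a look-back offset would change the incrementing-mode query (hypothetical function, not the connector's actual internals):

```python
# Sketch of an incrementing-mode poll query with a look-back window.
# Re-reading the last `look_back` IDs on every poll trades duplicates
# for a much smaller chance of missing late-committing rows.

def incrementing_query(table, last_offset, look_back=100):
    """Build the poll query, rewinding the offset by `look_back` IDs."""
    effective_offset = max(last_offset - look_back, 0)
    return (
        f"SELECT * FROM {table} "
        f"WHERE id > {effective_offset} ORDER BY id ASC"
    )

query = incrementing_query("funding_instructions", 25607888)
```

Downstream consumers would then have to tolerate (or deduplicate) the re-read rows, which is the duplication cost mentioned above.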

For (2), one idea would be to add the ability in the JDBC connector to run a
pre-query just before the SELECT. In this mode, you could execute UPDATE
my_table SET scn = UNIX_TIMESTAMP() WHERE scn IS NULL (note that WHERE scn =
NULL never matches in SQL; IS NULL is required). As far as I can tell, with
MySQL, this would also involve using a trigger to reset the `scn` field to
NULL whenever the row is updated.
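A rough sketch of what that could look like on the MySQL side (hypothetical table, column, and trigger names; untested):

```sql
-- Hypothetical scn column: NULL on insert, cleared again on every
-- application update so the async process will re-stamp the row.
ALTER TABLE my_table ADD COLUMN scn BIGINT NULL DEFAULT NULL;

-- Clear scn on update, but only when the update did not itself change
-- scn; the NULL-safe comparison (<=>) keeps the pre-query's own UPDATE
-- from immediately wiping out the value it just assigned.
CREATE TRIGGER my_table_clear_scn
BEFORE UPDATE ON my_table
FOR EACH ROW SET NEW.scn = IF(NEW.scn <=> OLD.scn, NULL, NEW.scn);

-- Pre-query, run just before the connector's SELECT:
UPDATE my_table SET scn = UNIX_TIMESTAMP() WHERE scn IS NULL;
```

The trigger guard matters: without it, the pre-query's UPDATE would fire the trigger and set scn straight back to NULL.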

Cheers,
Chris

----------

#### Start with an empty table

mysql> select * from funding_instructions;

Empty set (0.00 sec)

#### Create a transaction, and insert a row with state=2, but don't commit (this is done in another terminal).

mysql> begin;

Query OK, 0 rows affected (0.00 sec)

mysql> insert into funding_instructions (state) VALUES (2);

Query OK, 1 row affected, 4 warnings (0.00 sec)

#### Outside of the transaction, in the main terminal, insert a second row with state=3

mysql> insert into funding_instructions (state) VALUES (3);

Query OK, 1 row affected, 4 warnings (0.00 sec)

#### Now you see only the second row that was inserted. Note the ID.

mysql> select * from funding_instructions;

+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
| id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
| 25607888 |     3 |    0 |      0 | XXX      | NULL  | NULL     |           0 |           0 |       0 |          0 | NULL | NULL                        |
+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+

1 row in set (0.00 sec)

#### Now commit the first transaction.

mysql> commit;

Query OK, 0 rows affected (0.00 sec)

#### Now look at your results. Note that the smaller ID now appears. This
breaks strict ID-based replication, even if the table is append-only.

mysql> select * from funding_instructions;

+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
| id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
| 25607887 |     2 |    0 |      0 | XXX      | NULL  | NULL     |           0 |           0 |       0 |          0 | NULL | NULL                        |
| 25607888 |     3 |    0 |      0 | XXX      | NULL  | NULL     |           0 |           0 |       0 |          0 | NULL | NULL                        |
+----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+

On Tue, Feb 23, 2016 at 4:26 PM, Chris Riccomini <[email protected]>
wrote:

> Hey all,
>
> I was reviewing the Kafka connect JDBC driver, and I had a question. Is it
> possible to use the JDBC driver with a look-back configured? The reason
> that I ask is that there are some known issues with using a modified
> timestamp:
>
> Slide 14 here explains one with Oracle:
>
>
> https://qconsf.com/sf2007/dl/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf?path=/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf
>
> There is also some SCN-related discussion here:
>
> https://github.com/linkedin/databus/wiki/Databus-for-MySQL
>
> Though that is more specific to MySQL.
>
> I am concerned that using insert IDs might not even be good enough
> (assuming my tables were immutable, which they're not), since I believe
> some DB storage systems might have the same issue. I think InnoDB's pkey
> auto increment ID commit order is even tunable based on config.
>
> I would rather get some duplicates than lose data if at all possible. Can
> I configure the JDBC driver to subtract some number from the offset to
> prevent (or drastically reduce) lost data?
>
> Cheers,
> Chris
>
