Given how messed up MySQL is, in particular, it seems to me that adding a config that can do a look-back (either ID- or timestamp-based) would be useful, and pretty straightforward to add. This would at least help prevent data loss, at the expense of duplicate rows. If people *really* care about not losing data, they can configure a second connector to do bulk loads on a more periodic basis. Thoughts?
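To make the proposal concrete, here is a minimal sketch of what an ID-based look-back might compute (hypothetical function name; the connector has no such config today): instead of polling `WHERE id > last_seen`, subtract a configured window so rows that committed late with smaller IDs are re-scanned, trading duplicates for completeness.

```python
def lookback_lower_bound(last_seen_id, lookback):
    """Compute the lower bound for the next incremental poll.

    Instead of polling WHERE id > last_seen_id, re-scan the last
    `lookback` IDs so rows that committed late (with smaller IDs)
    are still picked up, at the cost of re-emitting duplicates.
    """
    return max(last_seen_id - lookback, 0)

# If the poller last saw ID 25607888 and lookback=100, the next query
# would effectively be:
#   SELECT * FROM funding_instructions WHERE id > 25607788
print(lookback_lower_bound(25607888, 100))  # -> 25607788
```

A timestamp-based look-back would be the same idea with `NOW() - interval` substituted for the ID arithmetic.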
On Wed, Feb 24, 2016 at 12:41 AM, Chris Riccomini <[email protected]> wrote:

> Hey all,
>
> Some more details on my question. My concern about using even IDs on
> immutable tables as a way to replicate data seems to be confirmed, at least
> for InnoDB if not configured properly. This page:
>
> http://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html
>
> describes how auto-increment IDs can get out of order. I have replicated
> this issue on my local machine (see example below the fold) with all three
> innodb_autoinc_lock_mode settings (0, 1, and 2). They all show the same
> results.
>
> As far as I can tell, at least with MySQL+InnoDB, the JDBC connector
> cannot be configured such that it isn't at risk of dropping data on
> incremental DB polls (whether you use incrementing, timestamp, or
> timestamp+incrementing).
>
> I'm curious if anyone has any thoughts on this. Again, the link I showed
> in the above email suggests that this is a problem with Oracle as well. Are
> people seeing data loss with the JDBC driver on high-write-load tables? I
> would be surprised if they weren't.
>
> Several solutions covered in that PDF are of interest:
>
> 1. The look-back-based approach (NOW() - 1 minute, or MAX(id) - 100). This
> will lead to significant duplication in the Kafka topic. Even with log
> compaction enabled, this will be annoying to users that are reading the
> feed in realtime.
>
> 2. Some kind of pre-query assignment. Essentially, a field that's set to
> NULL on insert. Every update sets the field back to NULL. An async process
> periodically wakes up and sets the field to a monotonically increasing ID.
> In the PDF, they suggest the SCN for Oracle. MySQL doesn't easily expose
> this, but I think UNIX_TIMESTAMP() would suffice as long as the updates
> don't happen too frequently.
>
> For (2), one idea would be to add the ability in the JDBC connector to add
> a pre-query, which could be executed just before the SELECT.
> In this mode, you could execute an UPDATE my_table SET scn =
> UNIX_TIMESTAMP() WHERE scn IS NULL. As far as I can tell, with MySQL, this
> would involve using a trigger to set the `scn` field back to NULL whenever
> an update on the row occurs.
>
> Cheers,
> Chris
>
> ----------
>
> #### Start with an empty table
>
> mysql> select * from funding_instructions;
> Empty set (0.00 sec)
>
> #### Create a transaction, and insert a row with state=2, but don't commit
> (this is done in another terminal).
>
> mysql> begin;
> Query OK, 0 rows affected (0.00 sec)
>
> mysql> insert into funding_instructions (state) VALUES (2);
> Query OK, 1 row affected, 4 warnings (0.00 sec)
>
> #### Outside of the transaction in the main terminal, insert a second row
> with state=3
>
> mysql> insert into funding_instructions (state) VALUES (3);
> Query OK, 1 row affected, 4 warnings (0.00 sec)
>
> #### Now you see only the second row that was inserted. Note the ID.
>
> mysql> select * from funding_instructions;
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | 25607888 |     3 |    0 |      0 | XXX      |  NULL |     NULL |           0 |           0 |       0 |          0 | NULL |                        NULL |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> 1 row in set (0.00 sec)
>
> #### Now commit the first transaction.
>
> mysql> commit;
> Query OK, 0 rows affected (0.00 sec)
>
> #### Now look at your results. Note that the smaller ID now appears.
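The pre-query idea in (2) can be sketched end to end. This uses Python's sqlite3 as a stand-in for MySQL (so `?` placeholders and an explicit stamp value replace `UNIX_TIMESTAMP()`, and the reset-on-update trigger is omitted); table and column names are illustrative, not anything the connector defines.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT, scn INTEGER)"
)

# Rows land with scn = NULL; in MySQL a trigger would also reset scn to
# NULL on every update so changed rows get re-stamped and re-polled.
db.execute("INSERT INTO my_table (payload) VALUES ('a')")
db.execute("INSERT INTO my_table (payload) VALUES ('b')")

def poll(db, last_scn, now):
    # Pre-query: stamp every unstamped row with a monotonically
    # increasing value (UNIX_TIMESTAMP() in the real MySQL version)...
    db.execute("UPDATE my_table SET scn = ? WHERE scn IS NULL", (now,))
    # ...then the usual incremental SELECT against the stamped column.
    return db.execute(
        "SELECT id, payload, scn FROM my_table WHERE scn > ? ORDER BY scn, id",
        (last_scn,),
    ).fetchall()

rows = poll(db, last_scn=0, now=1000)
print(rows)  # -> [(1, 'a', 1000), (2, 'b', 1000)]
```

Because the stamp is assigned after commit (by the async pre-query) rather than at insert time, it cannot go backwards the way auto-increment IDs do in the session below.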
> This breaks strict ID-based replication, even if the table is append-only.
>
> mysql> select * from funding_instructions;
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | id       | state | type | amount | currency | batch | litle_id | modify_time | create_time | version | payment_id | mid  | funds_transfer_request_date |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
> | 25607887 |     2 |    0 |      0 | XXX      |  NULL |     NULL |           0 |           0 |       0 |          0 | NULL |                        NULL |
> | 25607888 |     3 |    0 |      0 | XXX      |  NULL |     NULL |           0 |           0 |       0 |          0 | NULL |                        NULL |
> +----------+-------+------+--------+----------+-------+----------+-------------+-------------+---------+------------+------+-----------------------------+
>
> On Tue, Feb 23, 2016 at 4:26 PM, Chris Riccomini <[email protected]>
> wrote:
>
>> Hey all,
>>
>> I was reviewing the Kafka Connect JDBC driver, and I had a question. Is
>> it possible to use the JDBC driver with a look-back configured? The reason
>> that I ask is that there are some known issues with using a modified
>> timestamp:
>>
>> Slide 14 here explains one with Oracle:
>>
>> https://qconsf.com/sf2007/dl/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf?path=/QConSF2007/slides/public/JeanLucVaillant_LinkedIn.pdf
>>
>> There is also some SCN-related discussion here:
>>
>> https://github.com/linkedin/databus/wiki/Databus-for-MySQL
>>
>> though that is more specific to MySQL.
>>
>> I am concerned that using insert IDs might not even be good enough
>> (assuming my tables were immutable, which they're not), since I believe
>> some DB storage systems might have the same issue. I think InnoDB's
>> primary-key auto-increment commit order is even tunable based on config.
>>
>> I would rather get some duplicates than lose data if at all possible.
>> Can I configure the JDBC driver to subtract some number from the offset
>> to prevent (or drastically reduce) lost data?
>>
>> Cheers,
>> Chris
