[ https://issues.apache.org/jira/browse/KUDU-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763963#comment-16763963 ]
Andrew Wong commented on KUDU-2690:
-----------------------------------

I've got a very targeted test to reproduce this issue; the issue seems to be the following race:

{{s[n] = schema n, s3 = s2 + column}}
{{ // 1. tserver receives op 1(->s2)}}
{{ // 2. tserver receives op 2(->s2)}}
{{ // 3. tserver checks op 1(->s2) is valid under lock}}
{{ // 4. tserver checks op 2(->s2) is valid under lock}}
{{ // 5. tserver performs op 1(->s2) succeed}}
{{ // 6. tserver receives op 3(->s3)}}
{{ // 7. tserver checks op 3(->s3) is valid under lock}}
{{ // 8. tserver performs op 3(->s3) succeed}}
{{ // GC op 3}}
{{ // 9. tserver performs op 2(->s2) fail, but the log segment has s2}}
{{ // write a bunch of stuff with s3}}
{{ // on replay:}}
{{ // 1. see s2 in the WAL header}}
{{ // 2. writes are in s3}}
{{ // 3. FATAL}}

While a bit contrived, given the cluster is extra susceptible to KUDU-2681 because of KUDU-2684, the race of op1 and op2 doesn't seem too unlikely.

> Alter schema seems to be missing
> --------------------------------
>
>                 Key: KUDU-2690
>                 URL: https://issues.apache.org/jira/browse/KUDU-2690
>             Project: Kudu
>          Issue Type: Bug
>          Components: log, master, tablet
>    Affects Versions: 1.7.1
>            Reporter: Andrew Wong
>            Priority: Major
>
> I've seen an issue that looks as though an ADD_COLUMN is not fully applied
> before performing writes.
> This results in a failure to bootstrap with an error like:
> {{F0112 19:58:08.591284 8692 transaction_driver.cc:383] T
> 578f2c6e60d84cb18d704889ea323cda P dc0af5867d52468f8fd47abf13c08040 S R-NP Ts
> 6317323785408049152: Cannot cancel transactions that have already replicated:
> Invalid argument: Client provided column <COLUMN NAME>[double NULLABLE] not
> present in tablet transaction:R-NP WriteTransaction [type=REPLICA,
> start_time=2019-01-12 19:58:08, state=WriteTransactionState 0x5d52000
> [op_id=(term: 2548 index: 160364490), ts=6317323785408049152, rows=[]]]}}
>
> One clue is that in the WALs, the "client schema" (the schema in each write
> request) contains a column that is not in the "tablet schema" (the schema in
> the log segment), and so dumping the WALs will fail. This alone shouldn't
> prevent bootstrapping, but when replaying the WAL, we decode the write
> request against the schema in the tablet metadata. This failure seems to
> indicate that the tablet metadata's schema is missing a column that is being
> used by a committed write. I've been trying to piece together various ALTER
> SCHEMA bugs that we have (e.g. KUDU-860) to recreate this, but haven't had
> much luck.
>
> It's worth noting that this cluster is misconfigured so its tablet servers
> point to duplicate master addresses, and is therefore susceptible to
> KUDU-2681 and KUDU-2684, meaning each tablet report will result in multiple
> concurrent tasks being scheduled in response.
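
For readers who want to step through the interleaving from the comment, here is a minimal toy model of it in Python. It is only a sketch under stated assumptions, not Kudu code: the class `ToyTablet`, its methods, and `simulate_race` are hypothetical names, schemas are reduced to integer versions, and the model assumes that a stale alter overwrites the schema recorded for the log segment before the alter itself is rejected (one reading of "fail, but the log segment has s2").

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTablet:
    """Hypothetical stand-in for a tablet replica; not a Kudu class."""
    schema_version: int = 1        # schema the tablet currently uses (s1)
    wal_header_version: int = 1    # schema recorded in the log segment header
    wal_writes: List[int] = field(default_factory=list)

    def check(self, target: int) -> bool:
        # Validity check "under lock": only an alter to a newer schema passes.
        return target > self.schema_version

    def apply(self, target: int) -> bool:
        # Assumption: the segment schema is updated first, and only then
        # is a stale alter detected and rejected.
        self.wal_header_version = target
        if target <= self.schema_version:
            return False
        self.schema_version = target
        return True

    def write(self) -> None:
        # Each write is logged with the tablet's current schema version.
        self.wal_writes.append(self.schema_version)

    def replay_ok(self) -> bool:
        # Replay fails if any write uses a schema newer than the header's.
        return all(v <= self.wal_header_version for v in self.wal_writes)


def simulate_race() -> ToyTablet:
    t = ToyTablet()
    ok1 = t.check(2)          # step 3: op 1 (->s2) checked
    ok2 = t.check(2)          # step 4: op 2 (->s2) checked, still sees s1
    assert ok1 and ok2
    assert t.apply(2)         # step 5: op 1 succeeds
    assert t.check(3)         # step 7: op 3 (->s3) checked
    assert t.apply(3)         # step 8: op 3 succeeds
    assert not t.apply(2)     # step 9: op 2 fails, header is now s2
    t.write()                 # write with s3
    return t


if __name__ == "__main__":
    tablet = simulate_race()
    print("tablet schema:", tablet.schema_version)
    print("WAL header schema:", tablet.wal_header_version)
    print("replay ok:", tablet.replay_ok())
```

In this model the two checks for op 1 and op 2 both pass because neither has been applied yet, which is the window the duplicate tasks from KUDU-2684 would widen. The model leaves out the "GC op 3" step and everything else about real log segments.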