[jira] [Updated] (KUDU-1736) kudu crash in debug build: unordered undo delta
[ https://issues.apache.org/jira/browse/KUDU-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-1736: Affects Version/s: 1.17.0 1.16.0 1.15.0 1.14.0 1.13.0 > kudu crash in debug build: unordered undo delta > --- > > Key: KUDU-1736 > URL: https://issues.apache.org/jira/browse/KUDU-1736 > Project: Kudu > Issue Type: Bug > Components: tablet >Affects Versions: 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.12.0, > 1.11.1, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0 >Reporter: zhangsong >Priority: Critical > Labels: stability > Attachments: mt-tablet-test-20171123.txt.xz, > mt-tablet-test-20191227.txt.xz, mt-tablet-test.1.txt.xz, > mt-tablet-test.3.txt, mt-tablet-test.txt, mt-tablet-test.txt.gz > > > In the JD cluster we hit a kudu-tserver crash with a fatal message described as > follows: > Check failed: last_key_.CompareTo(key) <= 0 must insert undo deltas in > sorted order (ascending key, then descending ts): got key (row > 1422@tx6052042821982183424) after (row 1422@tx6052042821953155072) > This is a DCHECK which should not fail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
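For readers unfamiliar with the invariant behind the DCHECK above: undo deltas must be written in ascending row-key order, and for the same row key in descending timestamp order. A minimal C++ sketch of that comparison, using a hypothetical `UndoKey` type and invented function names rather than Kudu's actual `DeltaKey` code:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical (row key, timestamp) pair identifying an undo delta;
// a simplification of Kudu's real delta key, for illustration only.
struct UndoKey {
  std::string row_key;
  uint64_t ts;
};

// Undo deltas must be appended in ascending row-key order; for the same
// row key, timestamps must be strictly descending (newest undo first).
// This mirrors "must insert undo deltas in sorted order (ascending key,
// then descending ts)" from the failed check.
bool InSortedUndoOrder(const UndoKey& last, const UndoKey& next) {
  if (last.row_key != next.row_key) {
    return last.row_key < next.row_key;  // ascending key order
  }
  return next.ts < last.ts;  // same key: descending timestamps
}

// Checks a whole sequence of undo deltas against the invariant.
bool SequenceOk(const std::vector<UndoKey>& deltas) {
  for (size_t i = 1; i < deltas.size(); ++i) {
    if (!InSortedUndoOrder(deltas[i - 1], deltas[i])) return false;
  }
  return true;
}
```

In the reported crash the two deltas share row key 1422 but arrive with ascending timestamps (...953155072 followed by ...982183424), violating the descending-ts clause.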
[jira] [Updated] (KUDU-2667) MultiThreadedTabletTest/DeleteAndReinsert is flaky
[ https://issues.apache.org/jira/browse/KUDU-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2667: Affects Version/s: 1.17.0 1.16.0 1.15.0 1.14.0 1.13.0 1.11.1 1.12.0 1.11.0 1.10.1 1.10.0 > MultiThreadedTabletTest/DeleteAndReinsert is flaky > > > Key: KUDU-2667 > URL: https://issues.apache.org/jira/browse/KUDU-2667 > Project: Kudu > Issue Type: Test >Affects Versions: 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0, > 1.14.0, 1.15.0, 1.16.0, 1.17.0 >Reporter: Hao Hao >Priority: Major > Fix For: n/a > > Attachments: mt-tablet-test.1.txt.xz, mt-tablet-test.3.txt > > > I recently came across a failure in MultiThreadedTabletTest/DeleteAndReinsert > in an ASAN build. The error message is: > {noformat} > Error Message > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > Stacktrace > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > @ 0x7f66b32a5c37 gsignal at ??:0 > @ 0x7f66b32a9028 abort at ??:0 > @ 0x62c995 > kudu::tablet::MultiThreadedTabletTest<>::DeleteAndReinsertCycleThread() at > /home/jenkins-slave/workspace/kudu-master/0/src/kudu/tablet/mt-tablet-test.cc:378 > @ 0x617e63 boost::_bi::bind_t<>::operator()() at > /home/jenkins-slave/workspace/kudu-master/0/thirdparty/installed/uninstrumented/include/boost/bind/bind.hpp:1223 > @ 0x7f66b92d8dac boost::function0<>::operator()() at ??:0 > @ 0x7f66b7792afb kudu::Thread::SuperviseThread() at ??:0 > @ 0x7f66bec0e184 start_thread at ??:0 > @ 0x7f66b336cffd clone at ??:0 > {noformat} > Attached the full log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KUDU-2667) MultiThreadedTabletTest/DeleteAndReinsert is flaky
[ https://issues.apache.org/jira/browse/KUDU-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886281#comment-17886281 ] Alexey Serbin edited comment on KUDU-2667 at 10/1/24 9:38 PM: -- It's Kudu 1.17.0 (and 1.18.0 is about to be released soon), and the issue is still present: {noformat} I20240928 20:12:04.655783 7516 mvcc.cc:204] Tried to move back new op lower bound from 70669 to 70668. Current Snapshot: MvccSnapshot[applied={T|T < 70565 or (T in {70566,70567,70568,70569,70570,70572,70571,70573,70574,70575,70576,70578,70577,70579,70580,70583,70582,70581,70584,70585,70586,70587,70588,70589,70590,70592,70591,70594,70593,70595,70596,70597,70598,70599,70602,70601,70600,70603,70604,70606,70605,70608,70607,70609,70610,70613,70612,70611,70616,70614,70615,70617,70618,70619,70621,70620,70622,70623,70624,70625,70627,70626,70628,70629,70630,70631,70632,70633,70634,70635,70636,70638,70637,70639,70640,70641,70642,70643,70644,70645,70647,70646,70648,70649,70650,70651,70652,70653,70654,70655,70656,70658,70657,70659,70660,70661,70662,70663,70664,70665,70666,70667,70669})}] src/kudu/tablet/mt-tablet-test.cc:489: Failure Failed Bad status: Already present: int32 key=9, int32 key_idx=9, int32 val=162: key already present Google Test trace: src/kudu/tablet/mt-tablet-test.cc:487: DeleteAndReinsert thread ID 16 {noformat} There has been evidence of the issue happening during pre-commit test runs, at least in ASAN and RELEASE builds. The full log is attached (RELEASE build). [^mt-tablet-test.1.txt.xz] was (Author: aserbin): It's Kudu 1.17.0 (and 1.18.0 is about to be released soon), and the issue is still present: {noformat} I20240928 20:12:04.655783 7516 mvcc.cc:204] Tried to move back new op lower bound from 70669 to 70668. 
Current Snapshot: MvccSnapshot[applied={T|T < 70565 or (T in {70566,70567,70568,70569,70570,70572,70571,70573,70574,70575,70576,70578,70577,70579,70580,70583,70582,70581,70584,70585,70586,70587,70588,70589,70590,70592,70591,70594,70593,70595,70596,70597,70598,70599,70602,70601,70600,70603,70604,70606,70605,70608,70607,70609,70610,70613,70612,70611,70616,70614,70615,70617,70618,70619,70621,70620,70622,70623,70624,70625,70627,70626,70628,70629,70630,70631,70632,70633,70634,70635,70636,70638,70637,70639,70640,70641,70642,70643,70644,70645,70647,70646,70648,70649,70650,70651,70652,70653,70654,70655,70656,70658,70657,70659,70660,70661,70662,70663,70664,70665,70666,70667,70669})}] src/kudu/tablet/mt-tablet-test.cc:489: Failure Failed Bad status: Already present: int32 key=9, int32 key_idx=9, int32 val=162: key already present Google Test trace: src/kudu/tablet/mt-tablet-test.cc:487: DeleteAndReinsert thread ID 16 {noformat} The full log is attached. [^mt-tablet-test.1.txt.xz] > MultiThreadedTabletTest/DeleteAndReinsert is flaky > > > Key: KUDU-2667 > URL: https://issues.apache.org/jira/browse/KUDU-2667 > Project: Kudu > Issue Type: Test >Affects Versions: 1.9.0 >Reporter: Hao Hao >Priority: Major > Fix For: n/a > > Attachments: mt-tablet-test.1.txt.xz, mt-tablet-test.3.txt > > > I recently came across a failure in MultiThreadedTabletTest/DeleteAndReinsert > of ASAN. 
The error message is: > {noformat} > Error Message > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > Stacktrace > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > @ 0x7f66b32a5c37 gsignal at ??:0 > @ 0x7f66b32a9028 abort at ??:0 > @ 0x62c995 > kudu::tablet::MultiThreadedTabletTest<>::DeleteAndReinsertCycleThread() at > /home/jenkins-slave/workspace/kudu-master/0/src/kudu/tablet/mt-tablet-test.cc:378 > @ 0x617e63 boost::_bi::bind_t<>::operator()() at > /home/jenkins-slave/workspace/kudu-master/0/thirdparty/installed/uninstrumented/include/boost/bind/bind.hpp:1223 > @ 0x7f66b92d8dac boost::function0<>::operator()() at ??:0 > @ 0x7f66b7792afb kudu::Thread::SuperviseThread() at ??:0 > @ 0x7f66bec0e184 start_thread at ??:0 > @ 0x7f66b336cffd clone at ??:0 > {noformat} > Attached the full log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
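The invariant the test CHECKs is that re-inserting a row immediately after deleting it must succeed. A toy, single-threaded model of that cycle (hypothetical `ToyStore` type and function names; the real test drives a Kudu tablet across many threads, which is where this invariant can be violated transiently and make the test flaky):

```cpp
#include <unordered_map>

// Toy single-threaded model of the DeleteAndReinsertCycleThread loop:
// delete a row, then immediately re-insert it, and expect the insert
// not to fail with "Already present".
class ToyStore {
 public:
  // Returns false ("Already present") if the key is still live.
  bool Insert(int key, int val) {
    return rows_.emplace(key, val).second;
  }
  // Returns true if the key existed and was removed.
  bool Delete(int key) { return rows_.erase(key) > 0; }

 private:
  std::unordered_map<int, int> rows_;
};

// One delete-and-reinsert cycle; mirrors the Status check the test makes.
bool DeleteAndReinsertOnce(ToyStore& store, int key, int val) {
  store.Delete(key);
  return store.Insert(key, val);  // must not report "Already present"
}
```

In this single-threaded model the cycle always succeeds; the reported failures arise only under the multi-threaded tablet workload, where the insert can race with concurrent flush/compaction and MVCC state.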
[jira] [Comment Edited] (KUDU-1736) kudu crash in debug build: unordered undo delta
[ https://issues.apache.org/jira/browse/KUDU-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004358#comment-17004358 ] Alexey Serbin edited comment on KUDU-1736 at 10/1/24 9:35 PM: -- Another occurrence in pre-commit build (ASAN). Attaching the log. [^mt-tablet-test-20191227.txt.xz] was (Author: aserbin): Another occurrence if pre-commit build (ASAN). Attaching the log. [^mt-tablet-test-20191227.txt.xz] > kudu crash in debug build: unordered undo delta > --- > > Key: KUDU-1736 > URL: https://issues.apache.org/jira/browse/KUDU-1736 > Project: Kudu > Issue Type: Bug > Components: tablet >Affects Versions: 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.12.0, > 1.11.1 >Reporter: zhangsong >Priority: Critical > Labels: stability > Attachments: mt-tablet-test-20171123.txt.xz, > mt-tablet-test-20191227.txt.xz, mt-tablet-test.1.txt.xz, > mt-tablet-test.3.txt, mt-tablet-test.txt, mt-tablet-test.txt.gz > > > In the JD cluster we hit a kudu-tserver crash with a fatal message described as > follows: > Check failed: last_key_.CompareTo(key) <= 0 must insert undo deltas in > sorted order (ascending key, then descending ts): got key (row > 1422@tx6052042821982183424) after (row 1422@tx6052042821953155072) > This is a DCHECK which should not fail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-2667) MultiThreadedTabletTest/DeleteAndReinsert is flaky
[ https://issues.apache.org/jira/browse/KUDU-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2667: Attachment: mt-tablet-test.1.txt.xz > MultiThreadedTabletTest/DeleteAndReinsert is flaky > > > Key: KUDU-2667 > URL: https://issues.apache.org/jira/browse/KUDU-2667 > Project: Kudu > Issue Type: Test >Affects Versions: 1.9.0 >Reporter: Hao Hao >Priority: Major > Fix For: n/a > > Attachments: mt-tablet-test.1.txt.xz, mt-tablet-test.3.txt > > > I recently came across a failure in MultiThreadedTabletTest/DeleteAndReinsert > of ASAN. The error message is: > {noformat} > Error Message > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > Stacktrace > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > @ 0x7f66b32a5c37 gsignal at ??:0 > @ 0x7f66b32a9028 abort at ??:0 > @ 0x62c995 > kudu::tablet::MultiThreadedTabletTest<>::DeleteAndReinsertCycleThread() at > /home/jenkins-slave/workspace/kudu-master/0/src/kudu/tablet/mt-tablet-test.cc:378 > @ 0x617e63 boost::_bi::bind_t<>::operator()() at > /home/jenkins-slave/workspace/kudu-master/0/thirdparty/installed/uninstrumented/include/boost/bind/bind.hpp:1223 > @ 0x7f66b92d8dac boost::function0<>::operator()() at ??:0 > @ 0x7f66b7792afb kudu::Thread::SuperviseThread() at ??:0 > @ 0x7f66bec0e184 start_thread at ??:0 > @ 0x7f66b336cffd clone at ??:0 > {noformat} > Attached the full log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KUDU-2667) MultiThreadedTabletTest/DeleteAndReinsert is flaky
[ https://issues.apache.org/jira/browse/KUDU-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886281#comment-17886281 ] Alexey Serbin commented on KUDU-2667: - It's Kudu 1.17.0 (and 1.18.0 is about to be released soon), and the issue is still present: {noformat} I20240928 20:12:04.655783 7516 mvcc.cc:204] Tried to move back new op lower bound from 70669 to 70668. Current Snapshot: MvccSnapshot[applied={T|T < 70565 or (T in {70566,70567,70568,70569,70570,70572,70571,70573,70574,70575,70576,70578,70577,70579,70580,70583,70582,70581,70584,70585,70586,70587,70588,70589,70590,70592,70591,70594,70593,70595,70596,70597,70598,70599,70602,70601,70600,70603,70604,70606,70605,70608,70607,70609,70610,70613,70612,70611,70616,70614,70615,70617,70618,70619,70621,70620,70622,70623,70624,70625,70627,70626,70628,70629,70630,70631,70632,70633,70634,70635,70636,70638,70637,70639,70640,70641,70642,70643,70644,70645,70647,70646,70648,70649,70650,70651,70652,70653,70654,70655,70656,70658,70657,70659,70660,70661,70662,70663,70664,70665,70666,70667,70669})}] src/kudu/tablet/mt-tablet-test.cc:489: Failure Failed Bad status: Already present: int32 key=9, int32 key_idx=9, int32 val=162: key already present Google Test trace: src/kudu/tablet/mt-tablet-test.cc:487: DeleteAndReinsert thread ID 16 {noformat} The full log is attached. [^mt-tablet-test.1.txt.xz] > MultiThreadedTabletTest/DeleteAndReinsert is flaky > > > Key: KUDU-2667 > URL: https://issues.apache.org/jira/browse/KUDU-2667 > Project: Kudu > Issue Type: Test >Affects Versions: 1.9.0 >Reporter: Hao Hao >Priority: Major > Fix For: n/a > > Attachments: mt-tablet-test.1.txt.xz, mt-tablet-test.3.txt > > > I recently came across a failure in MultiThreadedTabletTest/DeleteAndReinsert > of ASAN. 
The error message is: > {noformat} > Error Message > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > Stacktrace > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > @ 0x7f66b32a5c37 gsignal at ??:0 > @ 0x7f66b32a9028 abort at ??:0 > @ 0x62c995 > kudu::tablet::MultiThreadedTabletTest<>::DeleteAndReinsertCycleThread() at > /home/jenkins-slave/workspace/kudu-master/0/src/kudu/tablet/mt-tablet-test.cc:378 > @ 0x617e63 boost::_bi::bind_t<>::operator()() at > /home/jenkins-slave/workspace/kudu-master/0/thirdparty/installed/uninstrumented/include/boost/bind/bind.hpp:1223 > @ 0x7f66b92d8dac boost::function0<>::operator()() at ??:0 > @ 0x7f66b7792afb kudu::Thread::SuperviseThread() at ??:0 > @ 0x7f66bec0e184 start_thread at ??:0 > @ 0x7f66b336cffd clone at ??:0 > {noformat} > Attached the full log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-2667) MultiThreadedTabletTest/DeleteAndReinsert is flaky
[ https://issues.apache.org/jira/browse/KUDU-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2667: Summary: MultiThreadedTabletTest/DeleteAndReinsert is flaky (was: MultiThreadedTabletTest/DeleteAndReinsert is flaky in ASAN) > MultiThreadedTabletTest/DeleteAndReinsert is flaky > > > Key: KUDU-2667 > URL: https://issues.apache.org/jira/browse/KUDU-2667 > Project: Kudu > Issue Type: Test >Affects Versions: 1.9.0 >Reporter: Hao Hao >Priority: Major > Fix For: n/a > > Attachments: mt-tablet-test.3.txt > > > I recently came across a failure in MultiThreadedTabletTest/DeleteAndReinsert > of ASAN. The error message is: > {noformat} > Error Message > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > Stacktrace > mt-tablet-test.cc:378] Check failed: _s.ok() Bad status: Already present: > int32 key=2, int32 key_idx=2, int32 val=NULL: key already present > @ 0x7f66b32a5c37 gsignal at ??:0 > @ 0x7f66b32a9028 abort at ??:0 > @ 0x62c995 > kudu::tablet::MultiThreadedTabletTest<>::DeleteAndReinsertCycleThread() at > /home/jenkins-slave/workspace/kudu-master/0/src/kudu/tablet/mt-tablet-test.cc:378 > @ 0x617e63 boost::_bi::bind_t<>::operator()() at > /home/jenkins-slave/workspace/kudu-master/0/thirdparty/installed/uninstrumented/include/boost/bind/bind.hpp:1223 > @ 0x7f66b92d8dac boost::function0<>::operator()() at ??:0 > @ 0x7f66b7792afb kudu::Thread::SuperviseThread() at ??:0 > @ 0x7f66bec0e184 start_thread at ??:0 > @ 0x7f66b336cffd clone at ??:0 > {noformat} > Attached the full log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3453) Fine-grained anchoring for WAL segments for tablet copy
[ https://issues.apache.org/jira/browse/KUDU-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3453: Description: Tablet copying is a provision to implement the process of automatic tablet re-replication in Kudu. When the system catalog (Kudu master) detects that a tablet replica is no longer available, it automatically re-replicates a tablet to a destination tablet server using another healthy tablet replica in the cluster as the source. When copying a tablet from one tablet server to another, the source tablet copying session "anchors" WAL segments to be transferred to the destination server, so they are not GC-ed by the tablet maintenance operation when they are no longer needed locally, but the tablet copy session is still in progress. The anchored WAL segments are released all at once when the tablet copying session completes with success or failure. However, there might be long running tablet copying sessions, and with a high data ingest rate, the source tablet replica might accumulate a huge amount of WAL data which isn't relevant to either the source or the destination server. To prevent accumulation of WAL data for long-running tablet copying sessions, it's necessary to update the WAL anchors in a more granular manner, e.g. un-anchor a segment once it has been successfully copied and persisted by the client tablet copying session. was: Tablet copying is a provision to implement the process of automatic tablet re-replication in Kudu. When the system catalog (Kudu master) detects that a tablet replica is no longer available, it automatically re-replicates a tablet to a destination tablet server using another healthy tablet replica in the cluster as the source. 
When copying a tablet from one tablet server to another, the source tablet copying session "anchors" WAL segments to be transfered to the destination server, so they are not GC-ed by the tablet maintenance operation when they are no longer needed locally, but the tablet copy session is still in progress. The anchored WAL segments are releases all at once when the tablet copying session completes with success of failure. However, there might be long running tablet copying sessions, and with high data ingest rate, the source tablet replica might accumulate huge amount of WAL data which isn't relevant at both the source and the destination server. To prevent accumulation of WAL data for long-running tablet copying sessions, it's necessary to update the WAL anchors in a more granular manner, e.g. un-anchor a segment once it has been successfully copied and persisted by the client tablet copying session. > Fine-grained anchoring for WAL segments for tablet copy > --- > > Key: KUDU-3453 > URL: https://issues.apache.org/jira/browse/KUDU-3453 > Project: Kudu > Issue Type: Improvement > Components: tablet, tserver >Reporter: Alexey Serbin >Priority: Major > > Tablet copying is a provision to implement the process of automatic tablet > re-replication in Kudu. When the system catalog (Kudu master) detects that a > tablet replica is no longer available, it automatically re-replicates a > tablet to a destination tablet server using another healthy tablet replica in > the cluster as the source. > When copying a tablet from one tablet server to another, the source tablet > copying session "anchors" WAL segments to be transferred to the destination > server, so they are not GC-ed by the tablet maintenance operation when they > are no longer needed locally, but the tablet copy session is still in > progress. > The anchored WAL segments are released all at once when the tablet copying > session completes with success or failure. 
However, there might be long > running tablet copying sessions, and with a high data ingest rate, the source > tablet replica might accumulate a huge amount of WAL data which isn't relevant > to either the source or the destination server. > To prevent accumulation of WAL data for long-running tablet copying sessions, > it's necessary to update the WAL anchors in a more granular manner, e.g. > un-anchor a segment once it has been successfully copied and persisted by the > client tablet copying session. -- This message was sent by Atlassian Jira (v8.20.10#820010)
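The proposed fine-grained behavior can be sketched in C++ as follows. This is a hypothetical illustration (the `CopySessionAnchors` class and its methods are invented names; Kudu's actual anchoring machinery differs), showing anchors released per segment instead of all at once at session end:

```cpp
#include <cstdint>
#include <set>

// Hypothetical per-segment WAL anchoring for a tablet copy session.
// Today the session effectively anchors every needed segment for its
// whole lifetime; the proposal is to release each anchor as soon as the
// segment has been copied and persisted by the destination.
class CopySessionAnchors {
 public:
  // Anchor all segments in [first, last] at session start so the WAL GC
  // maintenance op cannot delete them while the copy is in progress.
  void AnchorRange(int64_t first, int64_t last) {
    for (int64_t s = first; s <= last; ++s) anchored_.insert(s);
  }

  // Fine-grained release: called once the destination acknowledges that
  // segment 'seq' has been durably copied.
  void ReleaseSegment(int64_t seq) { anchored_.erase(seq); }

  // A segment is GC-able once no session anchors it.
  bool IsAnchored(int64_t seq) const { return anchored_.count(seq) > 0; }

 private:
  std::set<int64_t> anchored_;
};
```

With this scheme, a long-running session that anchored segments 1..10 stops pinning segment 1 right after it is transferred, rather than holding all ten until the session completes.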
[jira] [Comment Edited] (KUDU-3465) MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction is flaky
[ https://issues.apache.org/jira/browse/KUDU-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884855#comment-17884855 ] Alexey Serbin edited comment on KUDU-3465 at 9/26/24 5:50 AM: -- Should be fixed with: * [3666d2026|https://github.com/apache/kudu/commit/3666d2026] in Kudu 1.18.0 * [05043e6ab|https://github.com/apache/kudu/commit/05043e6ab] in Kudu 1.17.1 See KUDU-3619 for details. was (Author: aserbin): Should be fixed with: * [3666d2026|https://github.com/apache/kudu/commit/3666d2026] in Kudu 1.18.0 * [05043e6ab|https://github.com/apache/kudu/commit/05043e6ab] in Kudu 1.17.1 > MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction is flaky > - > > Key: KUDU-3465 > URL: https://issues.apache.org/jira/browse/KUDU-3465 > Project: Kudu > Issue Type: Bug > Components: compaction, test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0, 1.17.1 > > Attachments: mt-tablet-test.3.txt.xz > > > MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction sometimes fails > in TSAN builds with error like below: > {noformat} > W20230401 03:59:16.541988 18207 diskrowset.cc:592] T test_tablet_id P > fedb48ed543846edbabf74b3b7007739: RowSet(1): Error during major delta > compaction! Rolling back rowset metadata > F20230401 03:59:16.543404 18207 mt-tablet-test.cc:339] Check failed: _s.ok() > Bad status: Corruption: Failed major delta compaction on RowSet(1): No min > key found: CFile base data in RowSet(1) > {noformat} > I attached the full log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KUDU-3465) MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction is flaky
[ https://issues.apache.org/jira/browse/KUDU-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3465. - Fix Version/s: 1.18.0 1.17.1 Resolution: Fixed Should be fixed with: * [3666d2026|https://github.com/apache/kudu/commit/3666d2026] in Kudu 1.18.0 * [05043e6ab|https://github.com/apache/kudu/commit/05043e6ab] in Kudu 1.17.1 > MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction is flaky > - > > Key: KUDU-3465 > URL: https://issues.apache.org/jira/browse/KUDU-3465 > Project: Kudu > Issue Type: Bug > Components: compaction, test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0, 1.17.1 > > Attachments: mt-tablet-test.3.txt.xz > > > MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction sometimes fails > in TSAN builds with error like below: > {noformat} > W20230401 03:59:16.541988 18207 diskrowset.cc:592] T test_tablet_id P > fedb48ed543846edbabf74b3b7007739: RowSet(1): Error during major delta > compaction! Rolling back rowset metadata > F20230401 03:59:16.543404 18207 mt-tablet-test.cc:339] Check failed: _s.ok() > Bad status: Corruption: Failed major delta compaction on RowSet(1): No min > key found: CFile base data in RowSet(1) > {noformat} > I attached the full log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3620) Race condition in OpDriver::ReplicationFinished()
[ https://issues.apache.org/jira/browse/KUDU-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3620: Description: There is a race condition in {{OpDriver::ReplicationFinished}} that, with [1b99da532|https://github.com/apache/kudu/commit/1b99da532f52d143c46440c3903785d642fb45a3], manifests itself in the following ways when running ts_recovery-itest: # A tablet server crashes with SIGSEGV (DEBUG builds and probably RELEASE builds as well) # The address sanitizer issues warnings (ASAN builds) Full logs are attached. The stack trace for item 1: {noformat} *** Aborted at 1727269462 (unix time) try "date -d @1727269462" if you are using GNU date *** PC: @0x0 (unknown) *** SIGSEGV (@0x30) received by PID 14694 (TID 0x7f734f91b700) from PID 48; stack trace: *** @ 0x7f73830a5980 (unknown) at ??:0 @ 0x7f73848b3db6 kudu::tablet::OpState::tablet_replica() at ??:0 @ 0x7f73848d55c3 kudu::tablet::OpDriver::ReplicationFinished() at ??:0 @ 0x7f73848aa27e _ZZN4kudu6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS_9consensus14ConsensusRoundEEENKUlRKNS_6StatusEE_clESA_ at ??:0 @ 0x7f73848b0f41 _ZNSt17_Function_handlerIFvRKN4kudu6StatusEEZNS0_6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS0_9consensus14ConsensusRoundEEEUlS3_E_E9_M_invokeERKSt9_Any_dataS3_ at ??:0 @ 0x7f7386351325 std::function<>::operator()() at ??:0 @ 0x7f7384407f2b kudu::consensus::ConsensusRound::NotifyReplicationFinished() at ??:0 @ 0x7f73843d774b kudu::consensus::PendingRounds::AdvanceCommittedIndex() at ??:0 @ 0x7f73843f6888 kudu::consensus::RaftConsensus::UpdateReplica() at ??:0 @ 0x7f73843f1ef5 kudu::consensus::RaftConsensus::Update() at ??:0 @ 0x7f7385467de7 kudu::tserver::ConsensusServiceImpl::UpdateConsensus() at ??:0 @ 0x7f7383c95fd2 _ZZN4kudu9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS_12MetricEntityEERKS2_INS_3rpc13ResultTrackerEEENKUlPKN6google8protobuf7MessageEPSE_PNS7_10RpcContextEE0_clESG_SH_SJ_ at ??:0 @ 
0x7f7383c9a063 _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEERKSD_INS7_13ResultTrackerEEEUlS4_S5_S9_E0_E9_M_invokeERKSt9_Any_dataOS4_OS5_OS9_ at ??:0 @ 0x7f73834af4b8 std::function<>::operator()() at ??:0 @ 0x7f73834aed6c kudu::rpc::GeneratedServiceIf::Handle() at ??:0 @ 0x7f73834b1a7d kudu::rpc::ServicePool::RunThread() at ??:0 @ 0x7f73834b03c7 _ZZN4kudu3rpc11ServicePool4InitEiENKUlvE_clEv at ??:0 @ 0x7f73834b1e06 _ZNSt17_Function_handlerIFvvEZN4kudu3rpc11ServicePool4InitEiEUlvE_E9_M_invokeERKSt9_Any_data at ??:0 @ 0x55ab245f526e std::function<>::operator()() at ??:0 @ 0x7f7382853bb1 kudu::Thread::SuperviseThread() at ??:0 @ 0x7f738309a6db start_thread at ??:0 @ 0x7f73805ae71f clone at ??:0 {noformat} A sample of output for item 2: {noformat} ==26864==ERROR: AddressSanitizer: heap-use-after-free on address 0x617000212830 at pc 0x7fd36dc2c636 bp 0x7fd32f986530 sp 0x7fd32f986528 READ of size 8 at 0x617000212830 thread T84 (rpc worker-2694) #0 0x7fd36dc2c635 in kudu::tablet::OpState::tablet_replica() const /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op.h:189:12 #1 0x7fd36dc70732 in kudu::tablet::OpDriver::ReplicationFinished(kudu::Status const&) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op_driver.cc:443:37 #2 0x7fd36dc20493 in kudu::tablet::TabletReplica::StartFollowerOp(scoped_refptr const&)::$_7::operator()(kudu::Status const&) const /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/tablet_replica.cc:857:51 #3 0x7fd36dc202fc in std::_Function_handler const&)::$_7>::_M_invoke(std::_Any_data const&, kudu::Status const&) ../../../include/c++/7.5.0/bits/std_function.h:316:2 #4 0x7fd37460bd0d in std::function::operator()(kudu::Status const&) const ../../../include/c++/7.5.0/bits/std_function.h:706:14 #5 0x7fd36c940afc in 
kudu::consensus::ConsensusRound::NotifyReplicationFinished(kudu::Status const&) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consensus/raft_consensus.cc:3311:3 #6 0x7fd36c8cdbbc in kudu::consensus::PendingRounds::AdvanceCommittedIndex(long) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consensus/pending_rounds.cc:185:12 #7 0x7fd36c916f16 in kudu::consensus::RaftConsensus::UpdateReplica(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consensus/raft_consensus.cc:1530:5 #8 0x7fd36c914e57 in kudu::consensus::RaftConsensus::Update(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*) /home/jen
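The heap-use-after-free pattern ASAN reports here, a callback dereferencing driver state that has already been freed on another path, can be illustrated with a minimal sketch (hypothetical `DriverState` and factory functions, not Kudu's actual OpDriver code):

```cpp
#include <functional>
#include <memory>

// Hypothetical illustration of the bug class behind the report: a
// replication-finished callback keeps a raw pointer to driver state
// that another thread may free first.
struct DriverState {
  int tablet_id = 42;
};

// Unsafe variant: the callback captures a raw pointer. If the state is
// destroyed before the callback fires, dereferencing it is a
// use-after-free (the kind of access ASAN flagged in
// OpDriver::ReplicationFinished).
std::function<int()> MakeUnsafeCallback(DriverState* raw) {
  return [raw] { return raw->tablet_id; };
}

// Safe variant: the callback shares ownership, keeping the state alive
// until the last outstanding callback has completed.
std::function<int()> MakeSafeCallback(std::shared_ptr<DriverState> state) {
  return [state] { return state->tablet_id; };
}
```

The safe variant remains valid even after the original owner drops its reference, because the lambda co-owns the state.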
[jira] [Updated] (KUDU-3620) Race condition in OpDriver::ReplicationFinished()
[ https://issues.apache.org/jira/browse/KUDU-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3620: Attachment: ts_recovery-itest.asan.txt.xz > Race condition in OpDriver::ReplicationFinished() > - > > Key: KUDU-3620 > URL: https://issues.apache.org/jira/browse/KUDU-3620 > Project: Kudu > Issue Type: Bug > Components: master, tserver >Reporter: Alexey Serbin >Priority: Major > Attachments: ts_recovery-itest.asan.txt.xz, > ts_recovery-itest.sigsegv.txt.xz > > > There is a race condition in {{OpDriver::ReplicationFinished}} that > with [1b99da532f52d143c46440c3903785d642fb45a3] manifests itself in the > following ways when running ts_recovery-itest: > # A tablet server crashes with SIGSEGV (DEBUG builds and probably RELEASE > builds as well) > # Address sanitizer issues warnings (ASAN builds) > Full logs are attached. > The stack trace for item 1: > {noformat} > *** Aborted at 1727269462 (unix time) try "date -d @1727269462" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGSEGV (@0x30) received by PID 14694 (TID 0x7f734f91b700) from PID 48; > stack trace: *** > @ 0x7f73830a5980 (unknown) at ??:0 > @ 0x7f73848b3db6 kudu::tablet::OpState::tablet_replica() at ??:0 > @ 0x7f73848d55c3 kudu::tablet::OpDriver::ReplicationFinished() at ??:0 > @ 0x7f73848aa27e > _ZZN4kudu6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS_9consensus14ConsensusRoundEEENKUlRKNS_6StatusEE_clESA_ > at ??:0 > @ 0x7f73848b0f41 > _ZNSt17_Function_handlerIFvRKN4kudu6StatusEEZNS0_6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS0_9consensus14ConsensusRoundEEEUlS3_E_E9_M_invokeERKSt9_Any_dataS3_ > at ??:0 > @ 0x7f7386351325 std::function<>::operator()() at ??:0 > @ 0x7f7384407f2b > kudu::consensus::ConsensusRound::NotifyReplicationFinished() at ??:0 > @ 0x7f73843d774b > kudu::consensus::PendingRounds::AdvanceCommittedIndex() at ??:0 > @ 0x7f73843f6888 kudu::consensus::RaftConsensus::UpdateReplica() at > ??:0 > @ 
0x7f73843f1ef5 kudu::consensus::RaftConsensus::Update() at ??:0 > @ 0x7f7385467de7 > kudu::tserver::ConsensusServiceImpl::UpdateConsensus() at ??:0 > @ 0x7f7383c95fd2 > _ZZN4kudu9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS_12MetricEntityEERKS2_INS_3rpc13ResultTrackerEEENKUlPKN6google8protobuf7MessageEPSE_PNS7_10RpcContextEE0_clESG_SH_SJ_ > at ??:0 > @ 0x7f7383c9a063 > _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEERKSD_INS7_13ResultTrackerEEEUlS4_S5_S9_E0_E9_M_invokeERKSt9_Any_dataOS4_OS5_OS9_ > at ??:0 > @ 0x7f73834af4b8 std::function<>::operator()() at ??:0 > @ 0x7f73834aed6c kudu::rpc::GeneratedServiceIf::Handle() at ??:0 > @ 0x7f73834b1a7d kudu::rpc::ServicePool::RunThread() at ??:0 > @ 0x7f73834b03c7 _ZZN4kudu3rpc11ServicePool4InitEiENKUlvE_clEv at ??:0 > @ 0x7f73834b1e06 > _ZNSt17_Function_handlerIFvvEZN4kudu3rpc11ServicePool4InitEiEUlvE_E9_M_invokeERKSt9_Any_data > at ??:0 > @ 0x55ab245f526e std::function<>::operator()() at ??:0 > @ 0x7f7382853bb1 kudu::Thread::SuperviseThread() at ??:0 > @ 0x7f738309a6db start_thread at ??:0 > @ 0x7f73805ae71f clone at ??:0 > {noformat} > A sample of output for item 2: > {noformat} > ==26864==ERROR: AddressSanitizer: heap-use-after-free on address > 0x617000212830 at pc 0x7fd36dc2c636 bp 0x7fd32f986530 sp 0x7fd32f986528 > READ of size 8 at 0x617000212830 thread T84 (rpc worker-2694) > #0 0x7fd36dc2c635 in kudu::tablet::OpState::tablet_replica() const > /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op.h:189:12 > #1 0x7fd36dc70732 in > kudu::tablet::OpDriver::ReplicationFinished(kudu::Status const&) > /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op_driver.cc:443:37 > #2 0x7fd36dc20493 in > kudu::tablet::TabletReplica::StartFollowerOp(scoped_refptr > const&)::$_7::operator()(kudu::Status const&) const > 
/home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/tablet_replica.cc:857:51 > #3 0x7fd36dc202fc in std::_Function_handler kudu::tablet::TabletReplica::StartFollowerOp(scoped_refptr > const&)::$_7>::_M_invoke(std::_Any_data const&, kudu::Status const&) > ../../../include/c++/7.5.0/bits/std_function.h:316:2 > #4 0x7fd37460bd0d in std::function const&)>::operator()(kudu::Status const&) const > ../../../include/c++/7.5.0/bits/std_function.h:706:14 > #5 0x7fd36c940afc in > kudu::consensus::ConsensusRound::NotifyReplicationFinished(kudu::Status > const&) > /home/jenkins-slave/workspace/build_an
[jira] [Updated] (KUDU-3620) Race condition in OpDriver::ReplicationFinished()
[ https://issues.apache.org/jira/browse/KUDU-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3620: Description: There is a race condition in {{OpDriver::ReplicationFinished}} that, with [1b99da532f52d143c46440c3903785d642fb45a3], manifests itself in the following ways when running ts_recovery-itest: # A tablet server crashes with SIGSEGV (DEBUG builds and probably RELEASE builds as well) # Address sanitizer issues warnings (ASAN builds) Full logs are attached. The stack trace for item 1: {noformat} *** Aborted at 1727269462 (unix time) try "date -d @1727269462" if you are using GNU date *** PC: @0x0 (unknown) *** SIGSEGV (@0x30) received by PID 14694 (TID 0x7f734f91b700) from PID 48; stack trace: *** @ 0x7f73830a5980 (unknown) at ??:0 @ 0x7f73848b3db6 kudu::tablet::OpState::tablet_replica() at ??:0 @ 0x7f73848d55c3 kudu::tablet::OpDriver::ReplicationFinished() at ??:0 @ 0x7f73848aa27e _ZZN4kudu6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS_9consensus14ConsensusRoundEEENKUlRKNS_6StatusEE_clESA_ at ??:0 @ 0x7f73848b0f41 _ZNSt17_Function_handlerIFvRKN4kudu6StatusEEZNS0_6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS0_9consensus14ConsensusRoundEEEUlS3_E_E9_M_invokeERKSt9_Any_dataS3_ at ??:0 @ 0x7f7386351325 std::function<>::operator()() at ??:0 @ 0x7f7384407f2b kudu::consensus::ConsensusRound::NotifyReplicationFinished() at ??:0 @ 0x7f73843d774b kudu::consensus::PendingRounds::AdvanceCommittedIndex() at ??:0 @ 0x7f73843f6888 kudu::consensus::RaftConsensus::UpdateReplica() at ??:0 @ 0x7f73843f1ef5 kudu::consensus::RaftConsensus::Update() at ??:0 @ 0x7f7385467de7 kudu::tserver::ConsensusServiceImpl::UpdateConsensus() at ??:0 @ 0x7f7383c95fd2 _ZZN4kudu9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS_12MetricEntityEERKS2_INS_3rpc13ResultTrackerEEENKUlPKN6google8protobuf7MessageEPSE_PNS7_10RpcContextEE0_clESG_SH_SJ_ at ??:0 @ 0x7f7383c9a063 
_ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEERKSD_INS7_13ResultTrackerEEEUlS4_S5_S9_E0_E9_M_invokeERKSt9_Any_dataOS4_OS5_OS9_ at ??:0 @ 0x7f73834af4b8 std::function<>::operator()() at ??:0 @ 0x7f73834aed6c kudu::rpc::GeneratedServiceIf::Handle() at ??:0 @ 0x7f73834b1a7d kudu::rpc::ServicePool::RunThread() at ??:0 @ 0x7f73834b03c7 _ZZN4kudu3rpc11ServicePool4InitEiENKUlvE_clEv at ??:0 @ 0x7f73834b1e06 _ZNSt17_Function_handlerIFvvEZN4kudu3rpc11ServicePool4InitEiEUlvE_E9_M_invokeERKSt9_Any_data at ??:0 @ 0x55ab245f526e std::function<>::operator()() at ??:0 @ 0x7f7382853bb1 kudu::Thread::SuperviseThread() at ??:0 @ 0x7f738309a6db start_thread at ??:0 @ 0x7f73805ae71f clone at ??:0 {noformat} A sample of output for item 2: {noformat} ==26864==ERROR: AddressSanitizer: heap-use-after-free on address 0x617000212830 at pc 0x7fd36dc2c636 bp 0x7fd32f986530 sp 0x7fd32f986528 READ of size 8 at 0x617000212830 thread T84 (rpc worker-2694) #0 0x7fd36dc2c635 in kudu::tablet::OpState::tablet_replica() const /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op.h:189:12 #1 0x7fd36dc70732 in kudu::tablet::OpDriver::ReplicationFinished(kudu::Status const&) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op_driver.cc:443:37 #2 0x7fd36dc20493 in kudu::tablet::TabletReplica::StartFollowerOp(scoped_refptr const&)::$_7::operator()(kudu::Status const&) const /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/tablet_replica.cc:857:51 #3 0x7fd36dc202fc in std::_Function_handler const&)::$_7>::_M_invoke(std::_Any_data const&, kudu::Status const&) ../../../include/c++/7.5.0/bits/std_function.h:316:2 #4 0x7fd37460bd0d in std::function::operator()(kudu::Status const&) const ../../../include/c++/7.5.0/bits/std_function.h:706:14 #5 0x7fd36c940afc in kudu::consensus::ConsensusRound::NotifyReplicationFinished(kudu::Status 
const&) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consensus/raft_consensus.cc:3311:3 #6 0x7fd36c8cdbbc in kudu::consensus::PendingRounds::AdvanceCommittedIndex(long) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consensus/pending_rounds.cc:185:12 #7 0x7fd36c916f16 in kudu::consensus::RaftConsensus::UpdateReplica(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consensus/raft_consensus.cc:1530:5 #8 0x7fd36c914e57 in kudu::consensus::RaftConsensus::Update(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu
[jira] [Updated] (KUDU-3620) Race condition in OpDriver::ReplicationFinished()
[ https://issues.apache.org/jira/browse/KUDU-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3620: Attachment: ts_recovery-itest.sigsegv.txt.xz > Race condition in OpDriver::ReplicationFinished() > - > > Key: KUDU-3620 > URL: https://issues.apache.org/jira/browse/KUDU-3620 > Project: Kudu > Issue Type: Bug > Components: master, tserver >Reporter: Alexey Serbin >Priority: Major > Attachments: ts_recovery-itest.sigsegv.txt.xz > > > There is a race condition in {{OpDriver::ReplicationFinished}} that > with [1b99da532f52d143c46440c3903785d642fb45a3] manifests itself in the > following ways when running ts_recovery-itest: > # A tablet server crashes with SIGSEGV (DEBUG builds and probably RELEASE > builds as well) > # Address sanitizer issues warnings (ASAN builds) > Full logs are attached. > The stack trace for item 1: > {noformat} > *** Aborted at 1727269462 (unix time) try "date -d @1727269462" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGSEGV (@0x30) received by PID 14694 (TID 0x7f734f91b700) from PID 48; > stack trace: *** > @ 0x7f73830a5980 (unknown) at ??:0 > @ 0x7f73848b3db6 kudu::tablet::OpState::tablet_replica() at ??:0 > @ 0x7f73848d55c3 kudu::tablet::OpDriver::ReplicationFinished() at ??:0 > @ 0x7f73848aa27e > _ZZN4kudu6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS_9consensus14ConsensusRoundEEENKUlRKNS_6StatusEE_clESA_ > at ??:0 > @ 0x7f73848b0f41 > _ZNSt17_Function_handlerIFvRKN4kudu6StatusEEZNS0_6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS0_9consensus14ConsensusRoundEEEUlS3_E_E9_M_invokeERKSt9_Any_dataS3_ > at ??:0 > @ 0x7f7386351325 std::function<>::operator()() at ??:0 > @ 0x7f7384407f2b > kudu::consensus::ConsensusRound::NotifyReplicationFinished() at ??:0 > @ 0x7f73843d774b > kudu::consensus::PendingRounds::AdvanceCommittedIndex() at ??:0 > @ 0x7f73843f6888 kudu::consensus::RaftConsensus::UpdateReplica() at > ??:0 > @ 0x7f73843f1ef5 
kudu::consensus::RaftConsensus::Update() at ??:0 > @ 0x7f7385467de7 > kudu::tserver::ConsensusServiceImpl::UpdateConsensus() at ??:0 > @ 0x7f7383c95fd2 > _ZZN4kudu9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS_12MetricEntityEERKS2_INS_3rpc13ResultTrackerEEENKUlPKN6google8protobuf7MessageEPSE_PNS7_10RpcContextEE0_clESG_SH_SJ_ > at ??:0 > @ 0x7f7383c9a063 > _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEERKSD_INS7_13ResultTrackerEEEUlS4_S5_S9_E0_E9_M_invokeERKSt9_Any_dataOS4_OS5_OS9_ > at ??:0 > @ 0x7f73834af4b8 std::function<>::operator()() at ??:0 > @ 0x7f73834aed6c kudu::rpc::GeneratedServiceIf::Handle() at ??:0 > @ 0x7f73834b1a7d kudu::rpc::ServicePool::RunThread() at ??:0 > @ 0x7f73834b03c7 _ZZN4kudu3rpc11ServicePool4InitEiENKUlvE_clEv at ??:0 > @ 0x7f73834b1e06 > _ZNSt17_Function_handlerIFvvEZN4kudu3rpc11ServicePool4InitEiEUlvE_E9_M_invokeERKSt9_Any_data > at ??:0 > @ 0x55ab245f526e std::function<>::operator()() at ??:0 > @ 0x7f7382853bb1 kudu::Thread::SuperviseThread() at ??:0 > @ 0x7f738309a6db start_thread at ??:0 > @ 0x7f73805ae71f clone at ??:0 > {noformat} > A sample of output for item 2: > {noformat} > ==26864==ERROR: AddressSanitizer: heap-use-after-free on address > 0x617000212830 at pc 0x7fd36dc2c636 bp 0x7fd32f986530 sp 0x7fd32f986528 > READ of size 8 at 0x617000212830 thread T84 (rpc worker-2694) > #0 0x7fd36dc2c635 in kudu::tablet::OpState::tablet_replica() const > /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op.h:189:12 > #1 0x7fd36dc70732 in > kudu::tablet::OpDriver::ReplicationFinished(kudu::Status const&) > /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op_driver.cc:443:37 > #2 0x7fd36dc20493 in > kudu::tablet::TabletReplica::StartFollowerOp(scoped_refptr > const&)::$_7::operator()(kudu::Status const&) const > 
/home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/tablet_replica.cc:857:51 > #3 0x7fd36dc202fc in std::_Function_handler kudu::tablet::TabletReplica::StartFollowerOp(scoped_refptr > const&)::$_7>::_M_invoke(std::_Any_data const&, kudu::Status const&) > ../../../include/c++/7.5.0/bits/std_function.h:316:2 > #4 0x7fd37460bd0d in std::function const&)>::operator()(kudu::Status const&) const > ../../../include/c++/7.5.0/bits/std_function.h:706:14 > #5 0x7fd36c940afc in > kudu::consensus::ConsensusRound::NotifyReplicationFinished(kudu::Status > const&) > /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consens
[jira] [Created] (KUDU-3620) Race condition in OpDriver::ReplicationFinished()
Alexey Serbin created KUDU-3620: --- Summary: Race condition in OpDriver::ReplicationFinished() Key: KUDU-3620 URL: https://issues.apache.org/jira/browse/KUDU-3620 Project: Kudu Issue Type: Bug Components: master, tserver Reporter: Alexey Serbin There is a race condition in {{OpDriver::ReplicationFinished}} that, with [1b99da532f52d143c46440c3903785d642fb45a3], manifests itself in the following ways when running ts_recovery-itest: # A tablet server crashes with SIGSEGV # Address sanitizer issues warnings The stack trace for item 1: {noformat} *** Aborted at 1727269462 (unix time) try "date -d @1727269462" if you are using GNU date *** PC: @0x0 (unknown) *** SIGSEGV (@0x30) received by PID 14694 (TID 0x7f734f91b700) from PID 48; stack trace: *** @ 0x7f73830a5980 (unknown) at ??:0 @ 0x7f73848b3db6 kudu::tablet::OpState::tablet_replica() at ??:0 @ 0x7f73848d55c3 kudu::tablet::OpDriver::ReplicationFinished() at ??:0 @ 0x7f73848aa27e _ZZN4kudu6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS_9consensus14ConsensusRoundEEENKUlRKNS_6StatusEE_clESA_ at ??:0 @ 0x7f73848b0f41 _ZNSt17_Function_handlerIFvRKN4kudu6StatusEEZNS0_6tablet13TabletReplica15StartFollowerOpERK13scoped_refptrINS0_9consensus14ConsensusRoundEEEUlS3_E_E9_M_invokeERKSt9_Any_dataS3_ at ??:0 @ 0x7f7386351325 std::function<>::operator()() at ??:0 @ 0x7f7384407f2b kudu::consensus::ConsensusRound::NotifyReplicationFinished() at ??:0 @ 0x7f73843d774b kudu::consensus::PendingRounds::AdvanceCommittedIndex() at ??:0 @ 0x7f73843f6888 kudu::consensus::RaftConsensus::UpdateReplica() at ??:0 @ 0x7f73843f1ef5 kudu::consensus::RaftConsensus::Update() at ??:0 @ 0x7f7385467de7 kudu::tserver::ConsensusServiceImpl::UpdateConsensus() at ??:0 @ 0x7f7383c95fd2 _ZZN4kudu9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS_12MetricEntityEERKS2_INS_3rpc13ResultTrackerEEENKUlPKN6google8protobuf7MessageEPSE_PNS7_10RpcContextEE0_clESG_SH_SJ_ at ??:0 @ 0x7f7383c9a063 
_ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_9consensus18ConsensusServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEERKSD_INS7_13ResultTrackerEEEUlS4_S5_S9_E0_E9_M_invokeERKSt9_Any_dataOS4_OS5_OS9_ at ??:0 @ 0x7f73834af4b8 std::function<>::operator()() at ??:0 @ 0x7f73834aed6c kudu::rpc::GeneratedServiceIf::Handle() at ??:0 @ 0x7f73834b1a7d kudu::rpc::ServicePool::RunThread() at ??:0 @ 0x7f73834b03c7 _ZZN4kudu3rpc11ServicePool4InitEiENKUlvE_clEv at ??:0 @ 0x7f73834b1e06 _ZNSt17_Function_handlerIFvvEZN4kudu3rpc11ServicePool4InitEiEUlvE_E9_M_invokeERKSt9_Any_data at ??:0 @ 0x55ab245f526e std::function<>::operator()() at ??:0 @ 0x7f7382853bb1 kudu::Thread::SuperviseThread() at ??:0 @ 0x7f738309a6db start_thread at ??:0 @ 0x7f73805ae71f clone at ??:0 {noformat} A sample of output for item 2: {noformat} ==26864==ERROR: AddressSanitizer: heap-use-after-free on address 0x617000212830 at pc 0x7fd36dc2c636 bp 0x7fd32f986530 sp 0x7fd32f986528 READ of size 8 at 0x617000212830 thread T84 (rpc worker-2694) #0 0x7fd36dc2c635 in kudu::tablet::OpState::tablet_replica() const /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op.h:189:12 #1 0x7fd36dc70732 in kudu::tablet::OpDriver::ReplicationFinished(kudu::Status const&) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/ops/op_driver.cc:443:37 #2 0x7fd36dc20493 in kudu::tablet::TabletReplica::StartFollowerOp(scoped_refptr const&)::$_7::operator()(kudu::Status const&) const /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/tablet/tablet_replica.cc:857:51 #3 0x7fd36dc202fc in std::_Function_handler const&)::$_7>::_M_invoke(std::_Any_data const&, kudu::Status const&) ../../../include/c++/7.5.0/bits/std_function.h:316:2 #4 0x7fd37460bd0d in std::function::operator()(kudu::Status const&) const ../../../include/c++/7.5.0/bits/std_function.h:706:14 #5 0x7fd36c940afc in kudu::consensus::ConsensusRound::NotifyReplicationFinished(kudu::Status 
const&) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consensus/raft_consensus.cc:3311:3 #6 0x7fd36c8cdbbc in kudu::consensus::PendingRounds::AdvanceCommittedIndex(long) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consensus/pending_rounds.cc:185:12 #7 0x7fd36c916f16 in kudu::consensus::RaftConsensus::UpdateReplica(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResponsePB*) /home/jenkins-slave/workspace/build_and_test_flaky@2/src/kudu/consensus/raft_consensus.cc:1530:5 #8 0x7fd36c914e57 in kudu::consensus::RaftConsensus::Update(kudu::consensus::ConsensusRequestPB const*, kudu::consensus::ConsensusResp
[jira] [Resolved] (KUDU-3619) The 'supplement to GC algorithm' breaks major delta compaction
[ https://issues.apache.org/jira/browse/KUDU-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3619. - Fix Version/s: 1.18.0 1.17.1 Resolution: Fixed > The 'supplement to GC algorithm' breaks major delta compaction > -- > > Key: KUDU-3619 > URL: https://issues.apache.org/jira/browse/KUDU-3619 > Project: Kudu > Issue Type: Bug > Components: compaction, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0, 1.17.1 > > > The functionality introduced with > [ad920e69f|https://github.com/apache/kudu/commit/ad920e69fcd67ceefa25ea81a38a10a27d9e3afc] > doesn't handle the appearance of an empty rowset as the result of a scheduled major > delta compaction, and that leads to errors like the one below once it's run > its course: > {noformat} > W20240906 10:59:01.768857 189660 tablet_mm_ops.cc:364] T > 64144a1d4b864aa080e6cc53056546a5 P 574954b3b13a415c83a1660e7f51ee4e: Major > delta compaction failed on 64144a1d4b864aa080e6cc53056546a5: Corruption: > Failed major delta compaction on RowSet(1675): No min key found: CFile base > data in RowSet(1675) > {noformat} > Similarly, the {{mt-tablet-test}} is sporadically failing due to the same > issue when the test workload happens to create a similar situation with > all-the-rows-deleted rowsets: > {noformat} > MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction: > src/kudu/tablet/mt-tablet-test.cc:489: Failure > Failed > Bad status: Corruption: Failed major delta compaction on RowSet(1): No min > key found: CFile base data in RowSet(1) > {noformat} > There is a simple test scenario that triggers the issue: > [https://gerrit.cloudera.org/#/c/21809/|https://gerrit.cloudera.org/#/c/21809/]. > As a workaround, it's possible to set the > {{\-\-all_delete_op_delta_file_cnt_for_compaction}} to a very high value, > e.g. 100. 
> To address the issue properly, it's necessary to update the major delta > compaction code to handle situations where the result rowset is completely > empty. In theory, swapping out the result rowset with an empty one should be > enough: for example, see how it's done in [changelist > 705954872|https://github.com/apache/kudu/commit/705954872dc86238556456abed0a879bb1462e51]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3619) The 'supplement to GC algorithm' breaks major delta compaction
[ https://issues.apache.org/jira/browse/KUDU-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3619: Code Review: http://gerrit.cloudera.org:8080/21848 > The 'supplement to GC algorithm' breaks major delta compaction > -- > > Key: KUDU-3619 > URL: https://issues.apache.org/jira/browse/KUDU-3619 > Project: Kudu > Issue Type: Bug > Components: compaction, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > > The functionality introduced with > [ad920e69f|https://github.com/apache/kudu/commit/ad920e69fcd67ceefa25ea81a38a10a27d9e3afc] > doesn't handle the appearance of an empty rowset as the result of a scheduled major > delta compaction, and that leads to errors like the one below once it's run > its course: > {noformat} > W20240906 10:59:01.768857 189660 tablet_mm_ops.cc:364] T > 64144a1d4b864aa080e6cc53056546a5 P 574954b3b13a415c83a1660e7f51ee4e: Major > delta compaction failed on 64144a1d4b864aa080e6cc53056546a5: Corruption: > Failed major delta compaction on RowSet(1675): No min key found: CFile base > data in RowSet(1675) > {noformat} > Similarly, the {{mt-tablet-test}} is sporadically failing due to the same > issue when the test workload happens to create a similar situation with > all-the-rows-deleted rowsets: > {noformat} > MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction: > src/kudu/tablet/mt-tablet-test.cc:489: Failure > Failed > Bad status: Corruption: Failed major delta compaction on RowSet(1): No min > key found: CFile base data in RowSet(1) > {noformat} > There is a simple test scenario that triggers the issue: > [https://gerrit.cloudera.org/#/c/21809/|https://gerrit.cloudera.org/#/c/21809/]. > As a workaround, it's possible to set the > {{\-\-all_delete_op_delta_file_cnt_for_compaction}} to a very high value, > e.g. 100. 
> To address the issue properly, it's necessary to update the major delta > compaction code to handle situations where the result rowset is completely > empty. In theory, swapping out the result rowset with an empty one should be > enough: for example, see how it's done in [changelist > 705954872|https://github.com/apache/kudu/commit/705954872dc86238556456abed0a879bb1462e51]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3619) The 'supplement to GC algorithm' breaks major delta compaction
[ https://issues.apache.org/jira/browse/KUDU-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3619: Description: The functionality introduced with [ad920e69f|https://github.com/apache/kudu/commit/ad920e69fcd67ceefa25ea81a38a10a27d9e3afc] doesn't handle the appearance of an empty rowset as the result of a scheduled major delta compaction, and that leads to errors like the one below once it's run its course: {noformat} W20240906 10:59:01.768857 189660 tablet_mm_ops.cc:364] T 64144a1d4b864aa080e6cc53056546a5 P 574954b3b13a415c83a1660e7f51ee4e: Major delta compaction failed on 64144a1d4b864aa080e6cc53056546a5: Corruption: Failed major delta compaction on RowSet(1675): No min key found: CFile base data in RowSet(1675) {noformat} Similarly, the {{mt-tablet-test}} is sporadically failing due to the same issue when the test workload happens to create a similar situation with all-the-rows-deleted rowsets: {noformat} MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction: src/kudu/tablet/mt-tablet-test.cc:489: Failure Failed Bad status: Corruption: Failed major delta compaction on RowSet(1): No min key found: CFile base data in RowSet(1) {noformat} There is a simple test scenario that triggers the issue: [https://gerrit.cloudera.org/#/c/21809/|https://gerrit.cloudera.org/#/c/21809/]. As a workaround, it's possible to set the {{\-\-all_delete_op_delta_file_cnt_for_compaction}} to a very high value, e.g. 100. To address the issue properly, it's necessary to update the major delta compaction code to handle situations where the result rowset is completely empty. In theory, swapping out the result rowset with an empty one should be enough: for example, see how it's done in [changelist 705954872|https://github.com/apache/kudu/commit/705954872dc86238556456abed0a879bb1462e51]. 
was: The functionality introduced with [ad920e69f|https://github.com/apache/kudu/commit/ad920e69fcd67ceefa25ea81a38a10a27d9e3afc] doesn't handle the appearance of an empty rowset as the result of a scheduled major delta compaction, and that leads to errors like the one below once it's run its course: {noformat} W20240906 10:59:01.768857 189660 tablet_mm_ops.cc:364] T 64144a1d4b864aa080e6cc53056546a5 P 574954b3b13a415c83a1660e7f51ee4e: Major delta compaction failed on 64144a1d4b864aa080e6cc53056546a5: Corruption: Failed major delta compaction on RowSet(1675): No min key found: CFile base data in RowSet(1675) {noformat} Similarly, the {{mt-tablet-test}} is sporadically failing due to the same issue when the test workload happens to create a similar situation with all-the-rows-deleted rowsets: {noformat} MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction: src/kudu/tablet/mt-tablet-test.cc:489: Failure Failed Bad status: Corruption: Failed major delta compaction on RowSet(1): No min key found: CFile base data in RowSet(1) {noformat} There is a simple test scenario that triggers the issue: [https://gerrit.cloudera.org/#/c/21809/|https://gerrit.cloudera.org/#/c/21809/]. As a workaround, it's possible to set the {{\-\-all_delete_op_delta_file_cnt_for_compaction}} to a very high value, e.g. 100. To address the issue properly, it's necessary to update the major delta compaction code to handle situations where the result rowset is completely empty. In theory, swapping the rowset with an empty one should be enough: for example, see how it's done in [changelist 705954872|https://github.com/apache/kudu/commit/705954872dc86238556456abed0a879bb1462e51]. 
> The 'supplement to GC algorithm' breaks major delta compaction > -- > > Key: KUDU-3619 > URL: https://issues.apache.org/jira/browse/KUDU-3619 > Project: Kudu > Issue Type: Bug > Components: compaction, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > > The functionality introduced with > [ad920e69f|https://github.com/apache/kudu/commit/ad920e69fcd67ceefa25ea81a38a10a27d9e3afc] > doesn't handle the appearance of an empty rowset as the result of a scheduled major > delta compaction, and that leads to errors like the one below once it's run > its course: > {noformat} > W20240906 10:59:01.768857 189660 tablet_mm_ops.cc:364] T > 64144a1d4b864aa080e6cc53056546a5 P 574954b3b13a415c83a1660e7f51ee4e: Major > delta compaction failed on 64144a1d4b864aa080e6cc53056546a5: Corruption: > Failed major delta compaction on RowSet(1675): No min key found: CFile base > data in RowSet(1675) > {noformat} > Similarly, the {{mt-tablet-test}} is sporadically failing due to the same > issue when the test workload happens to create a similar situation with > all-the-rows-deleted rowsets: > {noformat} > MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction: > src/kudu/tablet/mt-tablet-test.cc:489:
[jira] [Created] (KUDU-3619) The 'supplement to GC algorithm' breaks major delta compaction
Alexey Serbin created KUDU-3619: --- Summary: The 'supplement to GC algorithm' breaks major delta compaction Key: KUDU-3619 URL: https://issues.apache.org/jira/browse/KUDU-3619 Project: Kudu Issue Type: Bug Components: compaction, tserver Affects Versions: 1.17.0 Reporter: Alexey Serbin The functionality introduced with [ad920e69f|https://github.com/apache/kudu/commit/ad920e69fcd67ceefa25ea81a38a10a27d9e3afc] doesn't handle the appearance of an empty rowset as the result of a scheduled major delta compaction, and that leads to errors like the one below once it's run its course: {noformat} W20240906 10:59:01.768857 189660 tablet_mm_ops.cc:364] T 64144a1d4b864aa080e6cc53056546a5 P 574954b3b13a415c83a1660e7f51ee4e: Major delta compaction failed on 64144a1d4b864aa080e6cc53056546a5: Corruption: Failed major delta compaction on RowSet(1675): No min key found: CFile base data in RowSet(1675) {noformat} Similarly, the {{mt-tablet-test}} is sporadically failing due to the same issue when the test workload happens to create a similar situation with all-the-rows-deleted rowsets: {noformat} MultiThreadedHybridClockTabletTest/5.UpdateNoMergeCompaction: src/kudu/tablet/mt-tablet-test.cc:489: Failure Failed Bad status: Corruption: Failed major delta compaction on RowSet(1): No min key found: CFile base data in RowSet(1) {noformat} There is a simple test scenario that triggers the issue: [https://gerrit.cloudera.org/#/c/21809/|https://gerrit.cloudera.org/#/c/21809/]. As a workaround, it's possible to set the {{\-\-all_delete_op_delta_file_cnt_for_compaction}} to a very high value, e.g. 100. To address the issue properly, it's necessary to update the major delta compaction code to handle situations where the result rowset is completely empty. In theory, swapping the rowset with an empty one should be enough: for example, see how it's done in [changelist 705954872|https://github.com/apache/kudu/commit/705954872dc86238556456abed0a879bb1462e51]. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3613) ReplaceTabletITest.ReplaceTabletsWhileWriting fails from time to time
Alexey Serbin created KUDU-3613: --- Summary: ReplaceTabletITest.ReplaceTabletsWhileWriting fails from time to time Key: KUDU-3613 URL: https://issues.apache.org/jira/browse/KUDU-3613 Project: Kudu Issue Type: Bug Reporter: Alexey Serbin Attachments: replace_tablet-itest.txt.xz The ReplaceTabletsWhileWriting scenario of the ReplaceTabletITest test fails from time to time with an error like the one below: {noformat} I20240903 18:48:30.731338 2161 replace_tablet-itest.cc:78] Replacing tablet 84944f0d32304f639244185c3ea9323a I20240903 18:48:30.731984 2195 master_service.cc:946] ReplaceTablet: received request to replace tablet 84944f0d32304f639244185c3ea9323a from {username='slave'} at 127.0.0.1:52450 src/kudu/integration-tests/replace_tablet-itest.cc:121: Failure Failed Bad status: Not found: Tablet 84944f0d32304f639244185c3ea9323a already deleted I20240903 18:48:30.735751 2764 ts_tablet_manager.cc:1918] T 84944f0d32304f639244185c3ea9323a P 21583b3f42394236b177d3789a521308: tablet deleted with delete type TABLET_DATA_DELETED: last-logged OpId 1.213 {noformat} The full log is attached (ASAN build). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3363) impala get wrong timestamp when scan kudu timestamp with timezone
[ https://issues.apache.org/jira/browse/KUDU-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3363: Fix Version/s: n/a Resolution: Fixed Status: Resolved (was: In Review) This issue has been addressed in Impala 4.5.0 in the context of [IMPALA-12370|https://issues.apache.org/jira/browse/IMPALA-12370] and [IMPALA-12322|https://issues.apache.org/jira/browse/IMPALA-12322]. > impala get wrong timestamp when scan kudu timestamp with timezone > - > > Key: KUDU-3363 > URL: https://issues.apache.org/jira/browse/KUDU-3363 > Project: Kudu > Issue Type: Bug > Components: impala >Reporter: daicheng >Priority: Major > Fix For: n/a > > Attachments: image-2022-04-24-00-01-05-746.png, > image-2022-04-24-00-01-37-520.png, image-2022-04-24-00-03-14-467.png, > image-2022-04-24-00-04-16-240.png, image-2022-04-24-00-04-52-860.png, > image-2022-04-24-00-05-52-086.png, image-2022-04-24-00-07-09-776.png > > > The Impala version is 3.1.0-cdh6.1 > !image-2022-04-24-00-01-37-520.png|width=504,height=37! > I have set the system timezone to Asia/Shanghai: > !image-2022-04-24-00-01-05-746.png|width=566,height=91! > Here is the bug: > *step 1* > I have a parquet file with two columns like below, and read it with impala-shell > and spark (timezone=shanghai) > !image-2022-04-24-00-03-14-467.png|width=666,height=101! > !image-2022-04-24-00-04-16-240.png|width=551,height=214! > Both results are exactly right. > *step 2* > Create a kudu table with impala-shell: > CREATE TABLE default.test_{_}test{_}_test_time2 (id BIGINT,t > TIMESTAMP,PRIMARY KEY (id) ) STORED AS KUDU; > note: kudu version: 1.8 > and insert 2 rows into the table with spark: > !image-2022-04-24-00-04-52-860.png|width=577,height=176! > *step 3* > Read it with spark (timezone=shanghai); spark reads the kudu table with the kudu-client > API. Here is the result: > !image-2022-04-24-00-05-52-086.png|width=747,height=246! 
> The result is still exactly right. > But reading it with impala-shell: > !image-2022-04-24-00-07-09-776.png|width=701,height=118! > the result is 8 hours late > *conclusion* > It seems the impala timezone setting doesn't take effect when the kudu column type is > timestamp, although it works fine for parquet files. I don't know why. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3611) Rare flakiness in TestRpc.TimedOutOnResponseMetricServiceQueue
[ https://issues.apache.org/jira/browse/KUDU-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3611: Code Review: https://gerrit.cloudera.org/#/c/21710/ Component/s: test > Rare flakiness in TestRpc.TimedOutOnResponseMetricServiceQueue > -- > > Key: KUDU-3611 > URL: https://issues.apache.org/jira/browse/KUDU-3611 > Project: Kudu > Issue Type: Bug > Components: test >Reporter: Alexey Serbin >Priority: Minor > Attachments: rpc-test.4.txt.xz > > > The TimedOutOnResponseMetricServiceQueue scenario of TestRpc fails in very > rare cases with an error message like the one below: > {noformat} > src/kudu/rpc/rpc-test.cc:1427: Failure > Expected equality of these values: > 3 > latency_histogram->TotalCount() > Which is: 2 > {noformat} > Full log is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3611) Rare flakiness in TestRpc.TimedOutOnResponseMetricServiceQueue
[ https://issues.apache.org/jira/browse/KUDU-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3611: Status: In Review (was: Open) > Rare flakiness in TestRpc.TimedOutOnResponseMetricServiceQueue > -- > > Key: KUDU-3611 > URL: https://issues.apache.org/jira/browse/KUDU-3611 > Project: Kudu > Issue Type: Bug > Components: test >Reporter: Alexey Serbin >Priority: Minor > Attachments: rpc-test.4.txt.xz > > > The TimedOutOnResponseMetricServiceQueue scenario of TestRpc fails in very > rare cases with an error message like the one below: > {noformat} > src/kudu/rpc/rpc-test.cc:1427: Failure > Expected equality of these values: > 3 > latency_histogram->TotalCount() > Which is: 2 > {noformat} > Full log is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3611) Rare flakiness in TestRpc.TimedOutOnResponseMetricServiceQueue
Alexey Serbin created KUDU-3611: --- Summary: Rare flakiness in TestRpc.TimedOutOnResponseMetricServiceQueue Key: KUDU-3611 URL: https://issues.apache.org/jira/browse/KUDU-3611 Project: Kudu Issue Type: Bug Reporter: Alexey Serbin Attachments: rpc-test.4.txt.xz The TimedOutOnResponseMetricServiceQueue scenario of TestRpc fails in very rare cases with an error message like the one below: {noformat} src/kudu/rpc/rpc-test.cc:1427: Failure Expected equality of these values: 3 latency_histogram->TotalCount() Which is: 2 {noformat} Full log is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3577) Altering a table with per-range hash partitions might make the table unusable
[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3577: Description: For tables with per-range hash schemas, dropping or adding a particular number of columns might make the table inaccessible for Kudu client applications. For example, dropping a nullable column from a table with per-range hash bucketing might make the table unusable. In this particular case, a workaround exists: just add the dropped column back using the {{kudu table add_column}} CLI tool. For example, in the reproduction scenario below, use the following command to restore access to the table's data: {noformat} $ kudu table add_column $M test city string {noformat} As for the reproduction scenario, see below for the sequence of {{kudu}} CLI commands. Set an environment variable with the Kudu cluster's RPC endpoint: {noformat} $ export M= {noformat} Create a table with two range partitions. It's crucial that the {{city}} column is nullable. 
{noformat} $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { "column_name": "id", "column_type": "INT64" }, { "column_name": "name", "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], "key_column_names": ["id", "name", "age"] }, "partition": { "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' {noformat} Add an extra range partition with custom hash schema: {noformat} $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}' {noformat} Check the updated partitioning info: {noformat} $ kudu table describe $M test TABLE test ( id INT64 NOT NULL, name STRING NOT NULL, age INT32 NOT NULL, city STRING NULLABLE, PRIMARY KEY (id, name, age) ) HASH (id) PARTITIONS 4 SEED 1, HASH (name) PARTITIONS 4 SEED 2, RANGE (age) ( PARTITION 30 <= VALUES < 60, PARTITION 60 <= VALUES < 90, PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3 ) OWNER root REPLICAS 1 COMMENT {noformat} Drop the {{city}} column: {noformat} $ kudu table delete_column $M test city {noformat} Now try to run the {{kudu table describe}} against the table once the {{city}} column is dropped. 
It errors out with {{Invalid argument}}: {noformat} $ kudu table describe $M test Invalid argument: Invalid split row type UNKNOWN {noformat} A similar issue manifests itself when trying to run {{kudu table scan}} against the table: {noformat} $ kudu table scan $M test Invalid argument: Invalid split row type UNKNOWN {noformat} was: For particular table schemas with per-range hash schemas, dropping a nullable column might make the table unusable. A workaround exists: just add the dropped column back using the {{kudu table add_column}} CLI tool. For example, for the reproduction scenario below, use the following command to restore access to the table's data: {noformat} $ kudu table add_column $M test city string {noformat} As for the reproduction scenario, see below for the sequence of {{kudu}} CLI commands. Set an environment variable with the Kudu cluster's RPC endpoint: {noformat} $ export M= {noformat} Create a table with two range partitions. It's crucial that the {{city}} column is nullable. 
{noformat} $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { "column_name": "id", "column_type": "INT64" }, { "column_name": "name", "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], "key_column_names": ["id", "name", "age"] }, "partition": { "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' {noformat} Add an extra range partition with custom hash schema: {noformat} $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema '{"hash_schema": [ {"columns
[jira] [Updated] (KUDU-3462) LogBlockManagerTest.TestContainerBlockLimitingByMetadataSizeWithCompaction sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3462: Description: The TestContainerBlockLimitingByMetadataSizeWithCompaction scenario of the LogBlockManagerTest sometimes fails with messages like the following: {noformat} src/kudu/fs/log_block_manager-test.cc:1343: Failure Expected: (FLAGS_log_container_metadata_max_size * FLAGS_log_container_metadata_size_before_compact_ratio) >= (file_size), actual: 26214.4 vs 31565 {noformat} I'm attaching the full log generated by one of the pre-commit test runs (ASAN). To reproduce the flakiness, build the test in ASAN configuration and run the scenario with the {{\-\-stress_cpu_threads=16}} extra flag. was: The TestContainerBlockLimitingByMetadataSizeWithCompaction scenario of the LogBlockManagerTest sometimes fails with messages like the following: {noformat} src/kudu/fs/log_block_manager-test.cc:1343: Failure Expected: (FLAGS_log_container_metadata_max_size * FLAGS_log_container_metadata_size_before_compact_ratio) >= (file_size), actual: 26214.4 vs 31565 {noformat} I'm attaching the full log generated by one of the pre-commit test runs (ASAN). 
A good > LogBlockManagerTest.TestContainerBlockLimitingByMetadataSizeWithCompaction > sometimes fails > -- > > Key: KUDU-3462 > URL: https://issues.apache.org/jira/browse/KUDU-3462 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Minor > Attachments: log_block_manager-test.txt.xz > > > The TestContainerBlockLimitingByMetadataSizeWithCompaction scenario of the > LogBlockManagerTest sometimes fails with messages like the following: > {noformat} > src/kudu/fs/log_block_manager-test.cc:1343: Failure > Expected: (FLAGS_log_container_metadata_max_size * > FLAGS_log_container_metadata_size_before_compact_ratio) >= (file_size), > actual: 26214.4 vs 31565 > {noformat} > I'm attaching the full log generated by one of the pre-commit test runs > (ASAN). > To reproduce the flakiness, build the test in ASAN configuration and run the > scenario with the {{\-\-stress_cpu_threads=16}} extra flag. -- This message was sent by Atlassian Jira (v8.20.10#820010)
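The failed expectation in KUDU-3462 reduces to a simple threshold check: the metadata file must stay at or below {{max_size * compact_ratio}}. A minimal Python model is sketched below; the flag values are assumptions chosen to reproduce the logged threshold of 26214.4 and are not taken from Kudu's source.

```python
# Minimal model of the threshold check behind the KUDU-3462 failure above.
# Both flag values are assumptions (they yield the logged threshold 26214.4);
# only file_size is taken from the failure message.
FLAGS_log_container_metadata_max_size = 32 * 1024             # assumed: 32 KiB
FLAGS_log_container_metadata_size_before_compact_ratio = 0.8  # assumed

threshold = (FLAGS_log_container_metadata_max_size *
             FLAGS_log_container_metadata_size_before_compact_ratio)
file_size = 31565  # metadata file size observed in the failing run

# The test expects compaction to have kept the metadata file at or below the
# threshold; under --stress_cpu_threads compaction can lag behind appends,
# so the file overshoots the threshold and the expectation fails.
print(f"threshold={threshold}, file_size={file_size}, "
      f"ok={threshold >= file_size}")
```

This illustrates why the failure is timing-dependent rather than a wrong bound: the comparison itself is fine, but it races with background compaction under CPU stress.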
[jira] [Updated] (KUDU-3462) LogBlockManagerTest.TestContainerBlockLimitingByMetadataSizeWithCompaction sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3462: Description: The TestContainerBlockLimitingByMetadataSizeWithCompaction scenario of the LogBlockManagerTest sometimes fails with messages like the following: {noformat} src/kudu/fs/log_block_manager-test.cc:1343: Failure Expected: (FLAGS_log_container_metadata_max_size * FLAGS_log_container_metadata_size_before_compact_ratio) >= (file_size), actual: 26214.4 vs 31565 {noformat} I'm attaching the full log generated by one of the pre-commit test runs (ASAN). A good was: The TestContainerBlockLimitingByMetadataSizeWithCompaction scenario of the LogBlockManagerTest sometimes fails with messages like the following: {noformat} src/kudu/fs/log_block_manager-test.cc:1343: Failure Expected: (FLAGS_log_container_metadata_max_size * FLAGS_log_container_metadata_size_before_compact_ratio) >= (file_size), actual: 26214.4 vs 31565 {noformat} I'm attaching the full log generated by one of the pre-commit test runs (ASAN). > LogBlockManagerTest.TestContainerBlockLimitingByMetadataSizeWithCompaction > sometimes fails > -- > > Key: KUDU-3462 > URL: https://issues.apache.org/jira/browse/KUDU-3462 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Minor > Attachments: log_block_manager-test.txt.xz > > > The TestContainerBlockLimitingByMetadataSizeWithCompaction scenario of the > LogBlockManagerTest sometimes fails with messages like the following: > {noformat} > src/kudu/fs/log_block_manager-test.cc:1343: Failure > Expected: (FLAGS_log_container_metadata_max_size * > FLAGS_log_container_metadata_size_before_compact_ratio) >= (file_size), > actual: 26214.4 vs 31565 > {noformat} > I'm attaching the full log generated by one of the pre-commit test runs > (ASAN). > A good -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KUDU-3599) tablet_copy_service-test is flaky in TSAN builds
[ https://issues.apache.org/jira/browse/KUDU-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3599. - Fix Version/s: 1.18.0 Resolution: Fixed > tablet_copy_service-test is flaky in TSAN builds > > > Key: KUDU-3599 > URL: https://issues.apache.org/jira/browse/KUDU-3599 > Project: Kudu > Issue Type: Sub-task >Reporter: Mahesh Reddy >Priority: Major > Fix For: 1.18.0 > > Attachments: tablet_copy_service-test.txt.gz > > > TSAN reports data races. The entire trace is attached. The following is a > snippet of the trace: > {code:java} > WARNING: ThreadSanitizer: data race (pid=20262) > Read of size 8 at 0x7b580003adb8 by thread T58 (mutexes: write > M867641427390348624): > #0 kudu::MonoTime::Initialized() const > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/util/monotime.cc:202:10 > (libkudu_util.so+0x3d7b79) > #1 kudu::tserver::RemoteTabletCopySourceSession::UpdateTabletMetrics() > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/tserver/tablet_copy_source_session.cc:518:7 > (libtserver.so+0x249407) > #2 > kudu::tserver::TabletCopyServiceImpl::DoEndTabletCopySessionUnlocked(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, > kudu::tserver::TabletCopyErrorPB_Code*) > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/tserver/tablet_copy_service.cc:431:12 > (libtserver.so+0x23b567) > #3 kudu::tserver::TabletCopyServiceImpl::EndExpiredSessions() > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/tserver/tablet_copy_service.cc:461:7 > (libtserver.so+0x23ca63) > #4 > kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0::operator()() const > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/tserver/tablet_copy_service.cc:115:3 > (libtserver.so+0x23dd01) > #5 > decltype(std::__1::forward<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&>(fp)()) > std::__1::__invoke<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&>(kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&) > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/type_traits:3899:1 > (libtserver.so+0x23dcb9) > #6 void > std::__1::__invoke_void_return_wrapper<void>::__call<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&>(kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&) > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/__functional_base:348:9 > (libtserver.so+0x23dc49) > #7 > std::__1::__function::__alloc_func<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0, > std::__1::allocator<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0>, void ()>::operator()() > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/functional:1557:16 > (libtserver.so+0x23dc11) > #8 > std::__1::__function::__func<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0, > std::__1::allocator<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0>, void ()>::operator()() > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/functional:1731:12 > (libtserver.so+0x23cf0d) > #9 std::__1::__function::__value_func<void ()>::operator()() const > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/functional:1884:16 > (libtserver_test_util.so+0x601a4) > #10 std::__1::function<void ()>::operator()() const > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/functional:2556:12 > (libtserver_test_util.so+0x5ffd9) > #11 kudu::Thread::SuperviseThread(void*) > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/util/thread.cc:694:3 > (libkudu_util.so+0x449546){code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3599) tablet_copy_service-test is flaky in TSAN builds
[ https://issues.apache.org/jira/browse/KUDU-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3599: Code Review: http://gerrit.cloudera.org:8080/21658 > tablet_copy_service-test is flaky in TSAN builds > > > Key: KUDU-3599 > URL: https://issues.apache.org/jira/browse/KUDU-3599 > Project: Kudu > Issue Type: Sub-task >Reporter: Mahesh Reddy >Priority: Major > Attachments: tablet_copy_service-test.txt.gz > > > TSAN reports data races. The entire trace is attached. The following is a > snippet of the trace: > {code:java} > WARNING: ThreadSanitizer: data race (pid=20262) > Read of size 8 at 0x7b580003adb8 by thread T58 (mutexes: write > M867641427390348624): > #0 kudu::MonoTime::Initialized() const > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/util/monotime.cc:202:10 > (libkudu_util.so+0x3d7b79) > #1 kudu::tserver::RemoteTabletCopySourceSession::UpdateTabletMetrics() > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/tserver/tablet_copy_source_session.cc:518:7 > (libtserver.so+0x249407) > #2 > kudu::tserver::TabletCopyServiceImpl::DoEndTabletCopySessionUnlocked(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, > kudu::tserver::TabletCopyErrorPB_Code*) > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/tserver/tablet_copy_service.cc:431:12 > (libtserver.so+0x23b567) > #3 kudu::tserver::TabletCopyServiceImpl::EndExpiredSessions() > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/tserver/tablet_copy_service.cc:461:7 > (libtserver.so+0x23ca63) > #4 > kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0::operator()() const > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/tserver/tablet_copy_service.cc:115:3 > (libtserver.so+0x23dd01) > #5 > decltype(std::__1::forward<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&>(fp)()) > std::__1::__invoke<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&>(kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&) > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/type_traits:3899:1 > (libtserver.so+0x23dcb9) > #6 void > std::__1::__invoke_void_return_wrapper<void>::__call<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&>(kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0&) > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/__functional_base:348:9 > (libtserver.so+0x23dc49) > #7 > std::__1::__function::__alloc_func<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0, > std::__1::allocator<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0>, void ()>::operator()() > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/functional:1557:16 > (libtserver.so+0x23dc11) > #8 > std::__1::__function::__func<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0, > std::__1::allocator<kudu::tserver::TabletCopyServiceImpl::TabletCopyServiceImpl(kudu::server::ServerBase*, > kudu::tserver::TabletReplicaLookupIf*)::$_0>, void ()>::operator()() > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/functional:1731:12 > (libtserver.so+0x23cf0d) > #9 std::__1::__function::__value_func<void ()>::operator()() const > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/functional:1884:16 > (libtserver_test_util.so+0x601a4) > #10 std::__1::function<void ()>::operator()() const > /home/jenkins-slave/workspace/build_and_test@2/thirdparty/installed/tsan/include/c++/v1/functional:2556:12 > (libtserver_test_util.so+0x5ffd9) > #11 kudu::Thread::SuperviseThread(void*) > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/util/thread.cc:694:3 > (libkudu_util.so+0x449546){code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3601) TimestampAdvancementITest.TestUpgradeFromOlderCorruptedData fails intermittently
Alexey Serbin created KUDU-3601: --- Summary: TimestampAdvancementITest.TestUpgradeFromOlderCorruptedData fails intermittently Key: KUDU-3601 URL: https://issues.apache.org/jira/browse/KUDU-3601 Project: Kudu Issue Type: Bug Reporter: Alexey Serbin Attachments: timestamp_advancement-itest.txt.xz The TestUpgradeFromOlderCorruptedData scenario of the TimestampAdvancementITest fails from time to time (at least in RELEASE builds) with errors like below; the logs are attached. {noformat} I20240731 20:55:43.510514 3223 timestamp_advancement-itest.cc:261] GCing logs... src/kudu/integration-tests/timestamp_advancement-itest.cc:261: Failure Expected: (gcable_size) > (0), actual: 0 vs 0 src/kudu/util/test_util.cc:395: Failure Failed Timed out waiting for assertion to pass. src/kudu/integration-tests/timestamp_advancement-itest.cc:326: Failure Expected: GCUntilNoWritesInWAL(ts, replica) doesn't generate new fatal failures in the current thread. Actual: it does. {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3598) ReplicatedAlterTableTest.AlterReplicationFactorWhileScanning fails from time to time
Alexey Serbin created KUDU-3598: --- Summary: ReplicatedAlterTableTest.AlterReplicationFactorWhileScanning fails from time to time Key: KUDU-3598 URL: https://issues.apache.org/jira/browse/KUDU-3598 Project: Kudu Issue Type: Bug Components: test Reporter: Alexey Serbin Attachments: alter_table-test.txt.xz The ReplicatedAlterTableTest.AlterReplicationFactorWhileScanning test scenario fails from time to time (at least in TSAN builds) with error output like below. The full log is attached. {noformat} src/kudu/integration-tests/alter_table-test.cc:327: Failure Expected equality of these values: replication_factor Which is: 3 actual_replica_count Which is: 2 src/kudu/util/test_util.cc:395: Failure Failed Timed out waiting for assertion to pass. src/kudu/integration-tests/alter_table-test.cc:2476: Failure Expected: VerifyTabletReplicaCount(3, VerifyRowCount::kEnable) doesn't generate new fatal failures in the current thread. Actual: it does. src/kudu/util/test_util.cc:395: Failure Failed Timed out waiting for assertion to pass. {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
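Several of the flaky scenarios above fail with "Timed out waiting for assertion to pass", which is produced by a retry-until-timeout assertion helper. Below is a minimal Python sketch of that pattern; it is a hypothetical illustration, not Kudu's actual ASSERT_EVENTUALLY implementation.

```python
import time

def assert_eventually(check, timeout_s=5.0, interval_s=0.05):
    """Retry `check` until it stops raising AssertionError, or re-raise
    the last failure once the timeout expires (the 'Timed out waiting
    for assertion to pass' case in the logs above)."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            check()
            return
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)

# Usage: a replica count that converges asynchronously, mimicking the
# AlterReplicationFactorWhileScanning expectation of 3 replicas.
state = {"replicas": 2}
start = time.monotonic()

def check():
    # Simulate convergence ~0.2s after the first attempt.
    if time.monotonic() - start > 0.2:
        state["replicas"] = 3
    assert state["replicas"] == 3

assert_eventually(check, timeout_s=2.0)
```

The flakiness reported in these issues corresponds to the convergence taking longer than the helper's timeout, not to the assertion itself being wrong.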
[jira] [Updated] (KUDU-3597) Test scenarios of TabletServerDiskErrorITest fail from time to time
[ https://issues.apache.org/jira/browse/KUDU-3597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3597: Summary: Test scenarios of TabletServerDiskErrorITest fail from time to time (was: Test scenarios from TabletServerDiskErrorITest fail from time to time) > Test scenarios of TabletServerDiskErrorITest fail from time to time > --- > > Key: KUDU-3597 > URL: https://issues.apache.org/jira/browse/KUDU-3597 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: disk_failure-itest.txt.xz > > > Test scenarios in TabletServerDiskErrorITest fail from time to time. For > example, the > {{TabletServerDiskError/TabletServerDiskErrorITest.TestSpaceAvailableMetrics}} > test scenario sometimes fails with errors like below (at least in TSAN > builds). The full log is attached. > {noformat} > src/kudu/integration-tests/disk_failure-itest.cc:267: Failure > Failed > Bad status: Network error: rpc failed: Client connection negotiation failed: > client connection to 127.21.83.1:33163: connect: Connection refused (error > 111) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3597) Test scenarios from TabletServerDiskErrorITest fail from time to time
Alexey Serbin created KUDU-3597: --- Summary: Test scenarios from TabletServerDiskErrorITest fail from time to time Key: KUDU-3597 URL: https://issues.apache.org/jira/browse/KUDU-3597 Project: Kudu Issue Type: Bug Components: test Affects Versions: 1.17.0 Reporter: Alexey Serbin Attachments: disk_failure-itest.txt.xz Test scenarios in TabletServerDiskErrorITest fail from time to time. For example, the {{TabletServerDiskError/TabletServerDiskErrorITest.TestSpaceAvailableMetrics}} test scenario sometimes fails with errors like below (at least in TSAN builds). The full log is attached. {noformat} src/kudu/integration-tests/disk_failure-itest.cc:267: Failure Failed Bad status: Network error: rpc failed: Client connection negotiation failed: client connection to 127.21.83.1:33163: connect: Connection refused (error 111) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3596) TsTabletManagerITest.TestTableStats fails from time to time (TSAN build)
[ https://issues.apache.org/jira/browse/KUDU-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3596: Affects Version/s: 1.17.0 > TsTabletManagerITest.TestTableStats fails from time to time (TSAN build) > > > Key: KUDU-3596 > URL: https://issues.apache.org/jira/browse/KUDU-3596 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: ts_tablet_manager-itest.txt.xz > > > The {{TsTabletManagerITest.TestTableStats}} test scenario fails from time to > time, at least in TSAN builds. I'm not sure whether that's something related > to the test itself, an issue related to stack collection, or a bug somewhere > else. When it fails, it reports something like below. The full log is > attached. > {noformat} > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc:783: > Failure > Expected equality of these values: > > live_row_count > > Which is: 67 > > table_info->GetMetrics()->live_row_count->value() > > Which is: 0 > > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/util/test_util.cc:395: > Failure > Failed > > Timed out waiting for assertion to pass. > > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc:767: > Failure > Expected: check_function(table_infos[0].get(), live_row_count) doesn't > generate new fatal failures in the current thread. > Actual: it does. 
> > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc:785: > Failure > Expected: GetLeaderMasterAndRun(live_row_count, [&] ( TableInfo* table_info, > int64_t live_row_count) { do { AssertEventually([&] () { switch (0) case 0: > default: if (const ::testing::AssertionResult gtest_ar_ = > ::testing::AssertionResult(table_info->GetMetrics()->TableSupportsLiveRowCount())) > ; else return > ::testing::internal::AssertHelper(::testing::TestPartResult::kFatalFailure, > "/home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc", > 782, ::testing::internal::GetBoolAssertionFailureMessage( gtest_ar_, > "table_info->GetMetrics()->TableSupportsLiveRowCount()", "false", "true") > .c_str()) = ::testing::Message(); switch (0) case 0: default: if (const > ::testing::AssertionResult gtest_ar = > (::testing::internal::EqHelper::Compare("live_row_count", > "table_info->GetMetrics()->live_row_count->value()", live_row_count, > table_info->GetMetrics()->live_row_count->value( ; else return > ::testing::internal::AssertHelper(::testing::TestPartResult::kFatalFailure, > "/home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc", > 783, gtest_ar.failure_message()) = ::testing::Message(); }); do { if > (testing::Test::HasFatalFailure()) { return; } } while (0); } while (0); }) > doesn't generate new fatal failures in the current thread. > Actual: it does. > > /home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc:895: > Failure > Expected: CheckStats(kRowsCount) doesn't generate new fatal failures in the > current thread. > Actual: it does. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3596) TsTabletManagerITest.TestTableStats fails from time to time (TSAN build)
Alexey Serbin created KUDU-3596: --- Summary: TsTabletManagerITest.TestTableStats fails from time to time (TSAN build) Key: KUDU-3596 URL: https://issues.apache.org/jira/browse/KUDU-3596 Project: Kudu Issue Type: Bug Components: test Reporter: Alexey Serbin Attachments: ts_tablet_manager-itest.txt.xz The {{TsTabletManagerITest.TestTableStats}} test scenario fails from time to time, at least in TSAN builds. I'm not sure whether that's something related to the test itself, an issue related to stack collection, or a bug somewhere else. When it fails, it reports something like below. The full log is attached. {noformat} /home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc:783: Failure Expected equality of these values: live_row_count Which is: 67 table_info->GetMetrics()->live_row_count->value() Which is: 0 /home/jenkins-slave/workspace/build_and_test@2/src/kudu/util/test_util.cc:395: Failure Failed Timed out waiting for assertion to pass. /home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc:767: Failure Expected: check_function(table_infos[0].get(), live_row_count) doesn't generate new fatal failures in the current thread. Actual: it does. 
/home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc:785: Failure Expected: GetLeaderMasterAndRun(live_row_count, [&] ( TableInfo* table_info, int64_t live_row_count) { do { AssertEventually([&] () { switch (0) case 0: default: if (const ::testing::AssertionResult gtest_ar_ = ::testing::AssertionResult(table_info->GetMetrics()->TableSupportsLiveRowCount())) ; else return ::testing::internal::AssertHelper(::testing::TestPartResult::kFatalFailure, "/home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc", 782, ::testing::internal::GetBoolAssertionFailureMessage( gtest_ar_, "table_info->GetMetrics()->TableSupportsLiveRowCount()", "false", "true") .c_str()) = ::testing::Message(); switch (0) case 0: default: if (const ::testing::AssertionResult gtest_ar = (::testing::internal::EqHelper::Compare("live_row_count", "table_info->GetMetrics()->live_row_count->value()", live_row_count, table_info->GetMetrics()->live_row_count->value( ; else return ::testing::internal::AssertHelper(::testing::TestPartResult::kFatalFailure, "/home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc", 783, gtest_ar.failure_message()) = ::testing::Message(); }); do { if (testing::Test::HasFatalFailure()) { return; } } while (0); } while (0); }) doesn't generate new fatal failures in the current thread. Actual: it does. /home/jenkins-slave/workspace/build_and_test@2/src/kudu/integration-tests/ts_tablet_manager-itest.cc:895: Failure Expected: CheckStats(kRowsCount) doesn't generate new fatal failures in the current thread. Actual: it does. {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KUDU-2376) SIGSEGV while adding and dropping the same range partition and concurrently writing
[ https://issues.apache.org/jira/browse/KUDU-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-2376. - Fix Version/s: 1.18.0 Resolution: Duplicate > SIGSEGV while adding and dropping the same range partition and concurrently > writing > --- > > Key: KUDU-2376 > URL: https://issues.apache.org/jira/browse/KUDU-2376 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.7.0 >Reporter: William Berkeley >Priority: Major > Fix For: 1.18.0 > > Attachments: alter_table-test.patch > > > While adding a test to https://gerrit.cloudera.org/#/c/9393/, I ran into the > problem that writing while doing a replace tablet operation caused the client > to segfault. After inspecting the client code, it looked like the same > problem could occur if the same range partition was added and dropped with > concurrent writes. > Attached is a patch that adds a test to alter_table-test that reliably > reproduces the segmentation fault. > I don't totally understand what's happening, but here's what I think I have > figured out: > Suppose the range partition P=[0, 100) is dropped and re-added in a single > alter. This causes the tablet X for hash bucket 0 and range partition P to be > dropped, and a new one Y created for the same partition. There is a batch > pending to X which the client attempts to send to each of the replicas of X > in turn. Once the replicas are exhausted, the client attempts to find a new > leader with MetaCacheServerPicker::PickLeader, which triggers a master lookup > to get the latest consensus info for X (#5 in the big comment in PickLeader). > This calls LookupTabletByKey, which attempts a fast path lookup. Assuming > other metadata operations have already cached a tablet for Y, the tablet for > X will have been removed from the by-table-and-by-key map, and the fast > lookup will return an entry for Y. 
The client code doesn't know the > difference because the code paths just look at partition boundaries, which > match for X and Y. The lookup doesn't happen, and the client ends up in a > pretty tight loop repeating the above process, until the segfault. > I'm not sure exactly what the segmentation fault is. I looked at it a bit in > gdb and the segfault was a few calls deep into STL maps in release mode and > inside a refcount increment in debug mode. I'll try to attach some gdb output > showing that later. > The problem is also hinted at in a TODO in PickLeader: > {noformat} > // TODO: When we support tablet splits, we should let the lookup shift > // the write to another tablet (i.e. if it's since been split). > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
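The stale-lookup loop described in KUDU-2376 above can be sketched as a simplified model. All names here (Tablet, MetaCache, and so on) are hypothetical illustrations, not Kudu's actual client classes; the point is only that a by-key lookup returns the replacement tablet Y with bounds identical to the dropped tablet X, so bound comparison cannot detect the swap.

```python
# Simplified model of the stale meta-cache lookup described above.
# Hypothetical names; not Kudu's client code.

class Tablet:
    def __init__(self, tablet_id, lower, upper, replicas):
        self.tablet_id = tablet_id
        self.partition = (lower, upper)   # [lower, upper) range bounds
        self.replicas = list(replicas)

class MetaCache:
    """Maps partition start key -> tablet. Dropping X and re-adding the
    same range installs Y under the same key, silently replacing X."""
    def __init__(self):
        self.by_start_key = {}

    def upsert(self, tablet):
        self.by_start_key[tablet.partition[0]] = tablet

    def lookup(self, key):
        # Fast-path lookup by containing range.
        for t in self.by_start_key.values():
            lo, hi = t.partition
            if lo <= key < hi:
                return t
        return None

cache = MetaCache()
x = Tablet("X", 0, 100, ["ts-1", "ts-2", "ts-3"])
cache.upsert(x)
batch_target = cache.lookup(42)   # the client caches tablet X for its batch

# Range [0, 100) is dropped and re-added in a single alter: X replaced by Y.
y = Tablet("Y", 0, 100, ["ts-4", "ts-5", "ts-6"])
cache.upsert(y)

# Retrying the pending batch: the lookup now returns Y, but a client that
# only compares partition bounds sees no difference from its cached X, so
# it keeps retrying against a tablet that no longer exists.
retry_target = cache.lookup(42)
print(batch_target.tablet_id, "->", retry_target.tablet_id,
      "bounds equal:", batch_target.partition == retry_target.partition)
```

This mirrors the report's observation that the code paths "just look at partition boundaries, which match for X and Y," which is why the client loops instead of refreshing its notion of the tablet.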
[jira] [Updated] (KUDU-3591) FsManagerTestBase.TestAddRemoveDataDirsFuzz fails from time to time with RocksDB
[ https://issues.apache.org/jira/browse/KUDU-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3591: Description: The fuzz test scenario FsManagerTestBase.TestAddRemoveDataDirsFuzz fails from time to time when run with the RocksDB option. I'm not sure whether the failure is just due to very long runtime, or something got stuck -- that's to be triaged. Below are a few lines from the log: they came just before the dump of the threads' stacks. The test was declared failed for exceeding the maximum unit test run time. The full log is attached. {noformat} W20240713 01:16:21.454174 31297 fs_manager.cc:538] unable to create missing filesystem roots: Already present: FSManager roots already exist: /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_849 W20240713 01:16:21.635000 31297 dir_manager.cc:228] Invalid argument: could not initialize /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data: open RocksDB failed, path: /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data/rdb: Invalid argument: /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data/rdb/CURRENT: does not exist (create_if_missing is false) ... F20240713 01:16:37.134395 31300 test_main.cc:63] Maximum unit test time exceeded (870 sec) {noformat} was: The fuzz test scenario FsManagerTestBase.TestAddRemoveDataDirsFuzz fails from time to time when run with the RocksDB option. I'm not sure whether the failure is just due to very long runtime, or something got stuck -- that's to be triaged. 
A few lines before the failure is declared due to long run-time are below, and the full log is attached. {noformat} W20240713 01:16:21.454174 31297 fs_manager.cc:538] unable to create missing filesystem roots: Already present: FSManager roots already exist: /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_849 W20240713 01:16:21.635000 31297 dir_manager.cc:228] Invalid argument: could not initialize /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data: open RocksDB failed, path: /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data/rdb: Invalid argument: /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data/rdb/CURRENT: does not exist (create_if_missing is false) {noformat} > FsManagerTestBase.TestAddRemoveDataDirsFuzz fails from time to time with > RocksDB > > > Key: KUDU-3591 > URL: https://issues.apache.org/jira/browse/KUDU-3591 > Project: Kudu > Issue Type: Bug > Components: fs, test >Reporter: Alexey Serbin >Priority: Major > Attachments: fs_manager-test.txt.xz > > > The fuzz test scenario FsManagerTestBase.TestAddRemoveDataDirsFuzz from time > to time fails when run with the RocksDB option. I'm not sure whether the > failure is just due to very long runtime, or something has stuck -- that's to > be triaged. Below are a few lines from the log: they came just before the > dump of the threads' stacks. The test was declared as failed because of > exceeding the maximum unit test's run time. The full log is attached. 
> {noformat} > W20240713 01:16:21.454174 31297 fs_manager.cc:538] unable to create missing > filesystem roots: Already present: FSManager roots already exist: > /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_849 > W20240713 01:16:21.635000 31297 dir_manager.cc:228] Invalid argument: could > not initialize > /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data: > open RocksDB failed, path: > /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data/rdb: > Invalid argument: > /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_
[jira] [Created] (KUDU-3591) FsManagerTestBase.TestAddRemoveDataDirsFuzz fails from time to time with RocksDB
Alexey Serbin created KUDU-3591: --- Summary: FsManagerTestBase.TestAddRemoveDataDirsFuzz fails from time to time with RocksDB Key: KUDU-3591 URL: https://issues.apache.org/jira/browse/KUDU-3591 Project: Kudu Issue Type: Bug Components: fs, test Reporter: Alexey Serbin Attachments: fs_manager-test.txt.xz The fuzz test scenario FsManagerTestBase.TestAddRemoveDataDirsFuzz from time to time fails when run with the RocksDB option. I'm not sure whether the failure is just due to very long runtime, or something has stuck -- that's to be triaged. A few lines before the failure is declared due to long run-time are below, and the full log is attached. {noformat} W20240713 01:16:21.454174 31297 fs_manager.cc:538] unable to create missing filesystem roots: Already present: FSManager roots already exist: /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_849 W20240713 01:16:21.635000 31297 dir_manager.cc:228] Invalid argument: could not initialize /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data: open RocksDB failed, path: /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data/rdb: Invalid argument: /tmp/dist-test-taskLd2saP/test-tmp/fs_manager-test.0.BlockManagerTypes_FsManagerTestBase.TestAddRemoveDataDirsFuzz_7.1720832511594434-31297-0/new_data_654/data/rdb/CURRENT: does not exist (create_if_missing is false) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KUDU-3590) update certs in test_certs.cc
[ https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866481#comment-17866481 ] Alexey Serbin edited comment on KUDU-3590 at 7/16/24 5:29 PM: -- When updating the certs, please take the following into consideration: * re-generate them to have a validity period of at least 20 years * use meaningful entries for the certificate fields The [prior patch|https://github.com/apache/kudu/commit/2d624f877262efafcd4fbaa59d7bb5a8c65ff54e] that touched the certificates regenerated the test certs with just a 1-year validity interval, and put junk into the certificate fields. I didn't verify that in review, and it went unnoticed, sorry. You could take a look at the original certificates prior to [2d624f877|https://github.com/apache/kudu/commit/2d624f877262efafcd4fbaa59d7bb5a8c65ff54e] to get an idea of what the contents of the corresponding certificate fields should be. Thank you! was (Author: aserbin): When updating the certs, please take the following into the consideration: * re-generate them to have at least 20 years validity period * use meaningful entries for the certificate fields The [prior patch|https://github.com/apache/kudu/commit/2d624f877262efafcd4fbaa59d7bb5a8c65ff54e] that touched the certificates regenerated the test certs with just 1 year validity interval, and put junk into the certificate fields. I didn't verify that in review, and it went unnoticed, sorry. > update certs in test_certs.cc > - > > Key: KUDU-3590 > URL: https://issues.apache.org/jira/browse/KUDU-3590 > Project: Kudu > Issue Type: Sub-task >Reporter: Bakai Ádám >Assignee: Bakai Ádám >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KUDU-3590) update certs in test_certs.cc
[ https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866481#comment-17866481 ] Alexey Serbin commented on KUDU-3590: - When updating the certs, please take the following into consideration: * re-generate them to have a validity period of at least 20 years * use meaningful entries for the certificate fields The [prior patch|https://github.com/apache/kudu/commit/2d624f877262efafcd4fbaa59d7bb5a8c65ff54e] that touched the certificates regenerated the test certs with just a 1-year validity interval, and put junk into the certificate fields. I didn't verify that in review, and it went unnoticed, sorry. > update certs in test_certs.cc > - > > Key: KUDU-3590 > URL: https://issues.apache.org/jira/browse/KUDU-3590 > Project: Kudu > Issue Type: Sub-task >Reporter: Bakai Ádám >Assignee: Bakai Ádám >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3587) Implement smarter back-off strategy for RetriableRpc upon receiving REPLICA_NOT_LEADER response
[ https://issues.apache.org/jira/browse/KUDU-3587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3587: Description: As of Kudu 1.17.0, the implementation of RetriableRpc for WriteRpc in the C++ client uses a linear back-off strategy, where the hold-off time interval (in milliseconds) is computed as {noformat} num_attempts + (rand() % 5) {noformat} Even if Kudu servers use separate incoming queues for different RPC interfaces (e.g. TabletServerService, ConsensusService, etc.), in the presence of many active clients, many tablet replicas per tablet server, and on-going Raft election storms due to frozen and/or slow RPC worker threads, many more unrelated write requests might be dropped out of the overflown TabletServerService RPC queues because the queues are flooded with too many retried write requests to tablets whose leader replicas aren't yet established. It doesn't make sense to self-inflict such a DoS condition because of a non-optimal RPC retry strategy at the client side. One option might be using a linear back-off strategy when going round-robin through the recently refreshed list of tablet replicas, but switching to an exponential strategy upon completing a full circle and issuing the next GetTableLocations request to the Kudu master. was: As of Kudu 1.17.0, the implementation of RetriableRpc for WriteRpc in the C++ client uses linear back-off strategy, where the hold-off time interval (in milliseconds) is computed as {noformat} num_attempts + (rand() % 5) {noformat} Since Kudu servers use a single queue for all their RPC interfaces (e.g. TabletServerService, ConsensusService, etc.), in the presence of many active clients and busy server nodes, this might start Raft election storm or exacerbate an existing one by keeping the RPC queue full or almost full, so more ConsensusService requests are dropped out of overflown RPC queues. 
Of course, separating RPC queues for different interfaces is one part of the remedy (e.g., see [KUDU-2955|https://issues.apache.org/jira/browse/KUDU-2955]), but even with separate RPC queues it doesn't make sense to self-inflict a DoS condition because of non-optimal RPC retry strategy when there are many active clients and tablet leadership transition is in progress for many "hot" tables. One option might be using linear back-off strategy when going round-robin through the recently refreshed list of tablet replicas, but using exponential strategy upon completing a full circle and issuing next GetTablesLocation request to Kudu master. > Implement smarter back-off strategy for RetriableRpc upon receiving > REPLICA_NOT_LEADER response > -- > > Key: KUDU-3587 > URL: https://issues.apache.org/jira/browse/KUDU-3587 > Project: Kudu > Issue Type: Improvement > Components: client >Reporter: Alexey Serbin >Priority: Major > > As of Kudu 1.17.0, the implementation of RetriableRpc for WriteRpc in the C++ > client uses a linear back-off strategy, where the hold-off time interval (in > milliseconds) is computed as > {noformat} > num_attempts + (rand() % 5) > {noformat} > Even if Kudu servers use separate incoming queues for different RPC > interfaces (e.g. TabletServerService, ConsensusService, etc.), in the > presence of many active clients, many tablet replicas per tablet server, and > on-going Raft election storms due to frozen and/or slow RPC worker threads, > many more unrelated write requests might be dropped out of the overflown > TabletServerService RPC queues because the queues are flooded with too many > retried write requests to tablets whose leader replicas aren't yet > established. It doesn't make sense to self-inflict such a DoS condition > because of a non-optimal RPC retry strategy at the client side. 
> One option might be using a linear back-off strategy when going round-robin > through the recently refreshed list of tablet replicas, but switching to an exponential > strategy upon completing a full circle and issuing the next GetTableLocations > request to the Kudu master. -- This message was sent by Atlassian Jira (v8.20.10#820010)
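The hybrid policy suggested above can be sketched roughly as follows. This is a hypothetical Java illustration, not Kudu's actual client code: `backoffMs` and `replicasPerCycle` are made-up names, only the `num_attempts + (rand() % 5)` formula comes from the description, and the exponential branch with its ~10-second cap is an assumption about what such a policy might look like.

```java
import java.util.Random;

// Hypothetical sketch of the proposed retry back-off (not Kudu's code):
// stay linear while round-robining the cached replica list, switch to a
// capped exponential hold-off after each completed full circle, i.e. when
// the client would go back to the master with another GetTableLocations
// request.
class RetryBackoff {
    private static final Random RAND = new Random(42);

    static int backoffMs(int numAttempts, int replicasPerCycle) {
        int jitter = RAND.nextInt(5);            // same 0..4 ms jitter as today
        int fullCycles = numAttempts / replicasPerCycle;
        if (fullCycles == 0) {
            // Still within the first pass over the replicas: keep the
            // current linear behavior.
            return numAttempts + jitter;
        }
        // After each completed circle, grow exponentially, capped at ~10 s,
        // so retried writes don't flood the TabletServerService RPC queues.
        int expMs = Math.min(1 << Math.min(fullCycles, 14), 10_000);
        return expMs + jitter;
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 12; attempt++) {
            System.out.println("attempt " + attempt + ": "
                    + backoffMs(attempt, 3) + " ms");
        }
    }
}
```

The idea is that retries stay cheap while the client still has untried replicas in its location cache, and only grow expensive once it has to go back to the master for fresh tablet locations.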
[jira] [Created] (KUDU-3587) Implement smarter back-off strategy for RetriableRpc upon receiving REPLICA_NOT_LEADER response
Alexey Serbin created KUDU-3587: --- Summary: Implement smarter back-off strategy for RetriableRpc upon receiving REPLICA_NOT_LEADER response Key: KUDU-3587 URL: https://issues.apache.org/jira/browse/KUDU-3587 Project: Kudu Issue Type: Improvement Components: client Reporter: Alexey Serbin As of Kudu 1.17.0, the implementation of RetriableRpc for WriteRpc in the C++ client uses a linear back-off strategy, where the hold-off time interval (in milliseconds) is computed as {noformat} num_attempts + (rand() % 5) {noformat} Since Kudu servers use a single queue for all their RPC interfaces (e.g. TabletServerService, ConsensusService, etc.), in the presence of many active clients and busy server nodes, this might start a Raft election storm or exacerbate an existing one by keeping the RPC queue full or almost full, so more ConsensusService requests are dropped out of overflown RPC queues. Of course, separating RPC queues for different interfaces is one part of the remedy (e.g., see [KUDU-2955|https://issues.apache.org/jira/browse/KUDU-2955]), but even with separate RPC queues it doesn't make sense to self-inflict a DoS condition because of a non-optimal RPC retry strategy when there are many active clients and tablet leadership transition is in progress for many "hot" tables. One option might be using a linear back-off strategy when going round-robin through the recently refreshed list of tablet replicas, but switching to an exponential strategy upon completing a full circle and issuing the next GetTableLocations request to the Kudu master. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KUDU-2679) In some scenarios, a Spark Kudu application can be devoid of fresh authn tokens
[ https://issues.apache.org/jira/browse/KUDU-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856067#comment-17856067 ] Alexey Serbin commented on KUDU-2679: - [~ArnaudL], {quote} It has a similar topology (driver = job manager ; executor = task manager) ; however the "job manager" does not query the kudu tables ; only the task managers do ; so effectively it is not the same issue. {quote} One distinctive trait of this issue is that Spark executors never have Kerberos credentials, even at the time when a task starts. They receive the authn credentials in the form of authentication tokens from the Spark driver, but on their own they can never acquire Kudu authn tokens, and that's how it's supposed to be. It's the Spark driver that is supposed to re-acquire authn tokens and spawn new tasks with new, non-expired tokens. However, in the case when the Spark driver keeps its RPC connection to Kudu master open and authn tokens are expired, this issue happens. This issue isn't about expiring authn tokens for long-running tasks, no. This issue is about specific conditions when the Spark driver fails to recognize that the authn token it uses to start tasks on executors has expired, and doesn't automatically re-acquire the token, even if it has the required Kerberos credentials. That's the issue this JIRA is about. {quote} However it is somewhat related since the kudu client initialisation occurs on each task manager when the task starts and each time a small batch of rows is received an insertion/upsertion is made, failing after 7 days. {quote} Please open a separate JIRA for that "somewhat related" issue, and describe the problem and how it manifests itself. From what I understood so far, the issue you hit isn't going to be fixed when this JIRA is addressed, so I don't see a reason to mention that "somewhat related" issue here. 
As for more context, Kudu authentication tokens are designed to expire after the configured time interval (default is 7 days) -- [that's by design|https://github.com/apache/kudu/blob/master/docs/security.adoc#authentication-tokens], and it's working exactly as expected. The Kudu client libraries automatically re-acquire authn tokens when detecting their expiration, but it's necessary to have valid Kerberos credentials to do so. > In some scenarios, a Spark Kudu application can be devoid of fresh authn > tokens > --- > > Key: KUDU-2679 > URL: https://issues.apache.org/jira/browse/KUDU-2679 > Project: Kudu > Issue Type: Bug > Components: client, security, spark >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1 >Reporter: Alexey Serbin >Priority: Major > > When running in {{cluster}} mode, tasks run as a part of Spark Kudu client > application can be devoid of getting new (i.e. non-expired) authentication > tokens even if they run for a very short time. Essentially, if the driver > runs longer than the authn token expiration interval and has a particular > pattern of making RPC calls to Kudu masters and tablet servers, all tasks > scheduled to run after the authn token expiration interval will be supplied > with expired authn tokens, making every task fail. The only way to fix that > is restarting the application or dropping long-established connections from > the driver to Kudu masters/tservers. > Below are some details, explaining why that can happen. > Let's assume the following holds true for a Spark Kudu application: > * The application is running against a secured Kudu cluster. > * The application is running in the {{cluster}} mode. > * There are no primary authentication credentials at the machines for the > user under which the Spark executors are running (i.e. {{kinit}} hasn't been > run at those executor machines for the corresponding user or the Kerberos > credentials has already expired there). 
> * The {{--authn_token_validity_seconds}} masters' flag is set to {{X}} > seconds (default is 60 * 60 * 24 * 7 seconds, i.e. 7 days). > * The {{--rpc_default_keepalive_time_ms}} flag for masters (and tablet > servers, if they are involved into the communications between the driver > process and the Kudu backend) is set to {{Y}} milliseconds (default is 65000 > ms). > * The application is running for longer than {{X}} seconds. > * The driver process makes requests to Kudu masters at least every {{Y}} > milliseconds. > * The driver either doesn't make requests to Kudu tablet servers or makes > such requests at least every {{Y}} milliseconds to each of the involved > tablet servers. > * The executors are running tasks that keep connections to tablet servers > idle for longer than {{Y}} milliseconds or the driver spawns tasks at an > executor after {{Y}} milliseconds since last ta
[jira] [Resolved] (KUDU-3585) ClientTest.ClearCacheAndConcurrentWorkload fails from time to time in TSAN builds
[ https://issues.apache.org/jira/browse/KUDU-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3585. - Fix Version/s: 1.18.0 Resolution: Fixed Fixed with [8ed4db154|https://github.com/apache/kudu/commit/8ed4db154596136e3ef4fbe27457992c119ed2b6]. > ClientTest.ClearCacheAndConcurrentWorkload fails from time to time in TSAN > builds > - > > Key: KUDU-3585 > URL: https://issues.apache.org/jira/browse/KUDU-3585 > Project: Kudu > Issue Type: Sub-task > Components: client, test >Affects Versions: 1.14.0, 1.15.0, 1.16.0, 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0 > > Attachments: client-test.5.txt.xz > > > The scenario sometimes fails in TSAN builds with output like that cited below. > It seems the root cause was RPC queue overflows at kudu-master and > kudu-tserver: both spend much more time on regular requests when built with > TSAN instrumentation, and resetting the client's meta-cache too often > induces a lot of GetTableLocations requests; serving them eats a lot of CPU > and keeps many threads busy. Since an internal mini-cluster is used in > the scenario (i.e. all masters and tablet servers are a part of just one > process), that affects kudu-tserver RPC worker threads as well, so many > requests accumulate in the RPC queues. > {noformat} > src/kudu/client/client-test.cc:408: Failure > Expected equality of these values: 0 > > server->server()->rpc_server()-> > service_pool("kudu.tserver.TabletServerService")-> > RpcsQueueOverflowMetric()->value() > Which is: 1 > src/kudu/client/client-test.cc:584: Failure > Expected: CheckNoRpcOverflow() doesn't generate new fatal failures in the > current thread. > Actual: it does. > > src/kudu/client/client-test.cc:2466: Failure > Expected: DeleteTestRows(client_table_.get(), kLowIdx, kHighIdx) doesn't > generate new fatal failures in the current thread. > Actual: it does. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KUDU-2679) In some scenarios, a Spark Kudu application can be devoid of fresh authn tokens
[ https://issues.apache.org/jira/browse/KUDU-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855666#comment-17855666 ] Alexey Serbin edited comment on KUDU-2679 at 6/17/24 6:38 PM: -- {quote} The same happens with Flink streaming applications with a Kudu Sink. After the --authn_token_validity_seconds period we have to restart the application. {quote} [~ArnaudL], The crux of this issue is in the presence of two actors of different types in Spark: the driver and the executors. That's why the built-in connection renegotiation upon expiring authentication token is working such an unexpected way. Does Flink have similar topology when assigning tasks? If not, then it's not the same issue. was (Author: aserbin): {quote} The same happens with Flink streaming applications with a Kudu Sink. After the --authn_token_validity_seconds period we have to restart the application. {quote} [~ArnaudL], The crux of this issue with in presence of two actors of different types in Spark: the driver and the executors. Does Flink have similar topology when assigning tasks? If not, then it's not the same issue. > In some scenarios, a Spark Kudu application can be devoid of fresh authn > tokens > --- > > Key: KUDU-2679 > URL: https://issues.apache.org/jira/browse/KUDU-2679 > Project: Kudu > Issue Type: Bug > Components: client, security, spark >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1 >Reporter: Alexey Serbin >Priority: Major > > When running in {{cluster}} mode, tasks run as a part of Spark Kudu client > application can be devoid of getting new (i.e. non-expired) authentication > tokens even if they run for a very short time. 
Essentially, if the driver > runs longer than the authn token expiration interval and has a particular > pattern of making RPC calls to Kudu masters and tablet servers, all tasks > scheduled to run after the authn token expiration interval will be supplied > with expired authn tokens, making every task fail. The only way to fix that > is restarting the application or dropping long-established connections from > the driver to Kudu masters/tservers. > Below are some details, explaining why that can happen. > Let's assume the following holds true for a Spark Kudu application: > * The application is running against a secured Kudu cluster. > * The application is running in the {{cluster}} mode. > * There are no primary authentication credentials at the machines for the > user under which the Spark executors are running (i.e. {{kinit}} hasn't been > run at those executor machines for the corresponding user or the Kerberos > credentials has already expired there). > * The {{--authn_token_validity_seconds}} masters' flag is set to {{X}} > seconds (default is 60 * 60 * 24 * 7 seconds, i.e. 7 days). > * The {{--rpc_default_keepalive_time_ms}} flag for masters (and tablet > servers, if they are involved into the communications between the driver > process and the Kudu backend) is set to {{Y}} milliseconds (default is 65000 > ms). > * The application is running for longer than {{X}} seconds. > * The driver process makes requests to Kudu masters at least every {{Y}} > milliseconds. > * The driver either doesn't make requests to Kudu tablet servers or makes > such requests at least every {{Y}} milliseconds to each of the involved > tablet servers. > * The executors are running tasks that keep connections to tablet servers > idle for longer than {{Y}} milliseconds or the driver spawns tasks at an > executor after {{Y}} milliseconds since last task has completed by the > executor. 
> Essentially, that's about a Spark Kudu application where the driver process > keeps once opened connections active and the executors need to open new > connections to Kudu tablet servers (and/or masters). Also, the executor > machines doesn't have Kerberos credentials for the OS user under which the > executor processes are run. > In such scenarios, the application's tasks spawned after {{X}} seconds from > the application start will fail because of expired authentication tokens, > while the driver process will never re-acquire its authn token, keeping the > expired token in {{KuduContext}} forever. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3585) ClientTest.ClearCacheAndConcurrentWorkload fails from time to time in TSAN builds
[ https://issues.apache.org/jira/browse/KUDU-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3585: Attachment: client-test.5.txt.xz > ClientTest.ClearCacheAndConcurrentWorkload fails from time to time in TSAN > builds > - > > Key: KUDU-3585 > URL: https://issues.apache.org/jira/browse/KUDU-3585 > Project: Kudu > Issue Type: Sub-task > Components: client, test >Affects Versions: 1.14.0, 1.15.0, 1.16.0, 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: client-test.5.txt.xz > > > The scenario sometimes fails in TSAN builds with output like that cited below. > It seems the root cause was RPC queue overflows at kudu-master and > kudu-tserver: both spend much more time on regular requests when built with > TSAN instrumentation, and resetting the client's meta-cache too often > induces a lot of GetTableLocations requests; serving them eats a lot of CPU > and keeps many threads busy. Since an internal mini-cluster is used in > the scenario (i.e. all masters and tablet servers are a part of just one > process), that affects kudu-tserver RPC worker threads as well, so many > requests accumulate in the RPC queues. > {noformat} > src/kudu/client/client-test.cc:408: Failure > Expected equality of these values: 0 > > server->server()->rpc_server()-> > service_pool("kudu.tserver.TabletServerService")-> > RpcsQueueOverflowMetric()->value() > Which is: 1 > src/kudu/client/client-test.cc:584: Failure > Expected: CheckNoRpcOverflow() doesn't generate new fatal failures in the > current thread. > Actual: it does. > > src/kudu/client/client-test.cc:2466: Failure > Expected: DeleteTestRows(client_table_.get(), kLowIdx, kHighIdx) doesn't > generate new fatal failures in the > current thread. > Actual: it does. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3585) ClientTest.ClearCacheAndConcurrentWorkload fails from time to time in TSAN builds
[ https://issues.apache.org/jira/browse/KUDU-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3585: Code Review: https://gerrit.cloudera.org/#/c/21523/ > ClientTest.ClearCacheAndConcurrentWorkload fails from time to time in TSAN > builds > - > > Key: KUDU-3585 > URL: https://issues.apache.org/jira/browse/KUDU-3585 > Project: Kudu > Issue Type: Sub-task > Components: client, test >Affects Versions: 1.14.0, 1.15.0, 1.16.0, 1.17.0 >Reporter: Alexey Serbin >Priority: Major > > The scenario sometimes fails in TSAN builds with output like that cited below. > It seems the root cause was RPC queue overflows at kudu-master and > kudu-tserver: both spend much more time on regular requests when built with > TSAN instrumentation, and resetting the client's meta-cache too often > induces a lot of GetTableLocations requests; serving them eats a lot of CPU > and keeps many threads busy. Since an internal mini-cluster is used in > the scenario (i.e. all masters and tablet servers are a part of just one > process), that affects kudu-tserver RPC worker threads as well, so many > requests accumulate in the RPC queues. > {noformat} > src/kudu/client/client-test.cc:408: Failure > Expected equality of these values: 0 > > server->server()->rpc_server()-> > service_pool("kudu.tserver.TabletServerService")-> > RpcsQueueOverflowMetric()->value() > Which is: 1 > src/kudu/client/client-test.cc:584: Failure > Expected: CheckNoRpcOverflow() doesn't generate new fatal failures in the > current thread. > Actual: it does. > > src/kudu/client/client-test.cc:2466: Failure > Expected: DeleteTestRows(client_table_.get(), kLowIdx, kHighIdx) doesn't > generate new fatal failures in the current thread. > Actual: it does. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3585) ClientTest.ClearCacheAndConcurrentWorkload fails from time to time in TSAN builds
Alexey Serbin created KUDU-3585: --- Summary: ClientTest.ClearCacheAndConcurrentWorkload fails from time to time in TSAN builds Key: KUDU-3585 URL: https://issues.apache.org/jira/browse/KUDU-3585 Project: Kudu Issue Type: Sub-task Components: client, test Affects Versions: 1.17.0, 1.16.0, 1.15.0, 1.14.0 Reporter: Alexey Serbin The scenario sometimes fails in TSAN builds with output like that cited below. It seems the root cause was RPC queue overflows at kudu-master and kudu-tserver: both spend much more time on regular requests when built with TSAN instrumentation, and resetting the client's meta-cache too often induces a lot of GetTableLocations requests; serving them eats a lot of CPU and keeps many threads busy. Since an internal mini-cluster is used in the scenario (i.e. all masters and tablet servers are a part of just one process), that affects kudu-tserver RPC worker threads as well, so many requests accumulate in the RPC queues. {noformat} src/kudu/client/client-test.cc:408: Failure Expected equality of these values: 0 server->server()->rpc_server()-> service_pool("kudu.tserver.TabletServerService")-> RpcsQueueOverflowMetric()->value() Which is: 1 src/kudu/client/client-test.cc:584: Failure Expected: CheckNoRpcOverflow() doesn't generate new fatal failures in the current thread. Actual: it does. src/kudu/client/client-test.cc:2466: Failure Expected: DeleteTestRows(client_table_.get(), kLowIdx, kHighIdx) doesn't generate new fatal failures in the current thread. Actual: it does. {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KUDU-2679) In some scenarios, a Spark Kudu application can be devoid of fresh authn tokens
[ https://issues.apache.org/jira/browse/KUDU-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855666#comment-17855666 ] Alexey Serbin commented on KUDU-2679: - {quote} The same happens with Flink streaming applications with a Kudu Sink. After the --authn_token_validity_seconds period we have to restart the application. {quote} [~ArnaudL], The crux of this issue is in the presence of two actors of different types in Spark: the driver and the executors. Does Flink have a similar topology when assigning tasks? If not, then it's not the same issue. > In some scenarios, a Spark Kudu application can be devoid of fresh authn > tokens > --- > > Key: KUDU-2679 > URL: https://issues.apache.org/jira/browse/KUDU-2679 > Project: Kudu > Issue Type: Bug > Components: client, security, spark >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1 >Reporter: Alexey Serbin >Priority: Major > > When running in {{cluster}} mode, tasks run as a part of Spark Kudu client > application can be devoid of getting new (i.e. non-expired) authentication > tokens even if they run for a very short time. Essentially, if the driver > runs longer than the authn token expiration interval and has a particular > pattern of making RPC calls to Kudu masters and tablet servers, all tasks > scheduled to run after the authn token expiration interval will be supplied > with expired authn tokens, making every task fail. The only way to fix that > is restarting the application or dropping long-established connections from > the driver to Kudu masters/tservers. > Below are some details, explaining why that can happen. > Let's assume the following holds true for a Spark Kudu application: > * The application is running against a secured Kudu cluster. > * The application is running in the {{cluster}} mode. > * There are no primary authentication credentials at the machines for the > user under which the Spark executors are running (i.e. 
{{kinit}} hasn't been > run at those executor machines for the corresponding user, or the Kerberos > credentials have already expired there). > * The {{--authn_token_validity_seconds}} masters' flag is set to {{X}} > seconds (default is 60 * 60 * 24 * 7 seconds, i.e. 7 days). > * The {{--rpc_default_keepalive_time_ms}} flag for masters (and tablet > servers, if they are involved in the communications between the driver > process and the Kudu backend) is set to {{Y}} milliseconds (default is 65000 > ms). > * The application is running for longer than {{X}} seconds. > * The driver process makes requests to Kudu masters at least every {{Y}} > milliseconds. > * The driver either doesn't make requests to Kudu tablet servers or makes > such requests at least every {{Y}} milliseconds to each of the involved > tablet servers. > * The executors are running tasks that keep connections to tablet servers > idle for longer than {{Y}} milliseconds, or the driver spawns tasks at an > executor more than {{Y}} milliseconds after the last task completed at that > executor. > Essentially, that's a Spark Kudu application where the driver process > keeps its once-opened connections active while the executors need to open new > connections to Kudu tablet servers (and/or masters). Also, the executor > machines don't have Kerberos credentials for the OS user under which the > executor processes run. > In such scenarios, the application's tasks spawned after {{X}} seconds from > the application start will fail because of expired authentication tokens, > while the driver process will never re-acquire its authn token, keeping the > expired token in {{KuduContext}} forever. -- This message was sent by Atlassian Jira (v8.20.10#820010)
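The timing condition above boils down to a single comparison: a task fails whenever it is launched after the driver's only token has aged past the validity window. Below is a minimal, hypothetical model of that condition, not actual Kudu client code; the names {{AuthnTokenExpiryModel}}, {{taskGetsExpiredToken}}, {{tokenAcquiredAtSec}} and {{validitySec}} are illustrative only.

```java
// Hypothetical model of the KUDU-2679 failure condition. The driver keeps
// its connections warm, so it never re-authenticates and never refreshes
// the authn token it hands to tasks; any task launched after the validity
// window closes is therefore supplied with an already-expired token.
class AuthnTokenExpiryModel {
    // tokenAcquiredAtSec: when the driver obtained its token (seconds).
    // validitySec: the masters' --authn_token_validity_seconds setting.
    // launchSec: when the task is spawned.
    static boolean taskGetsExpiredToken(long tokenAcquiredAtSec,
                                        long validitySec,
                                        long launchSec) {
        return launchSec > tokenAcquiredAtSec + validitySec;
    }

    public static void main(String[] args) {
        long validitySec = 7L * 24 * 60 * 60;  // default: 7 days, in seconds
        // A task spawned inside the validity window still gets a valid token;
        // one spawned after the window gets an expired token and fails.
        System.out.println(taskGetsExpiredToken(0, validitySec, validitySec - 1));
        System.out.println(taskGetsExpiredToken(0, validitySec, validitySec + 1));
    }
}
```

Under this model the only remedies are exactly the ones named in the report: restart the application (a fresh token acquisition) or drop the driver's long-established connections so it is forced to re-authenticate.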
[jira] [Updated] (KUDU-3567) Resource leakage related to HashedWheelTimer in AsyncKuduScanner
[ https://issues.apache.org/jira/browse/KUDU-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3567: Code Review: https://gerrit.cloudera.org/#/c/21512/ > Resource leakage related to HashedWheelTimer in AsyncKuduScanner > > > Key: KUDU-3567 > URL: https://issues.apache.org/jira/browse/KUDU-3567 > Project: Kudu > Issue Type: Bug > Components: client, java >Affects Versions: 1.18.0 >Reporter: Alexey Serbin >Assignee: YifanZhang >Priority: Major > > With KUDU-3498 implemented in > [8683b8bdb|https://github.com/apache/kudu/commit/8683b8bdb675db96aac52d75a31d00232f7b9fb8], > there are now resource leak reports; see below. > Overall, the way {{HashedWheelTimer}} is used for keeping scanners alive > is in direct contradiction with the recommendation at [this documentation > page|https://netty.io/4.1/api/io/netty/util/HashedWheelTimer.html]: > {quote}*Do not create many instances.* > HashedWheelTimer creates a new thread whenever it is instantiated and > started. Therefore, you should make sure to create only one instance and > share it across your application. One of the common mistakes, that makes your > application unresponsive, is to create a new instance for every connection. > {quote} > Probably, a better way of implementing the keep-alive feature for scanner > objects in the Kudu Java client would be to reuse the {{HashedWheelTimer}} > instance from the corresponding {{AsyncKuduClient}} instance, not to create > a new instance of the timer (along with a corresponding thread) per > {{AsyncKuduScanner}} object. At least, an instance of {{HashedWheelTimer}} > should be properly released/shut down to avoid resource leakage (a running > thread?) when GC-ing {{AsyncKuduScanner}} objects. 
> For example, below is an example how the leak is reported when running > {{TestKuduClient.testStrings}}: > {noformat} > 23:04:57.774 [ERROR - main] (ResourceLeakDetector.java:327) LEAK: > HashedWheelTimer.release() was not called before it's garbage-collected. See > https://netty.io/wiki/reference-counted-objects.html for more information. > Recent access records: > Created at: > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:312) > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:251) > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:224) > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:203) > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:185) > org.apache.kudu.client.AsyncKuduScanner.(AsyncKuduScanner.java:296) > org.apache.kudu.client.AsyncKuduScanner.(AsyncKuduScanner.java:431) > > org.apache.kudu.client.KuduScanner$KuduScannerBuilder.build(KuduScanner.java:260) > org.apache.kudu.client.TestKuduClient.testStrings(TestKuduClient.java:692) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) > > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) > java.util.concurrent.FutureTask.run(FutureTask.java:266) > java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian Jira 
(v8.20.10#820010)
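The direction suggested in the report, a single shared timer owned by the client and released exactly once at client shutdown instead of one timer thread per scanner, can be sketched as follows. This is an illustrative pattern only: it uses a JDK {{ScheduledExecutorService}} as a stand-in for Netty's {{HashedWheelTimer}}, and the class and method names ({{SharedTimerClient}}, {{KeepAliveScanner}}) are hypothetical, not the actual Kudu Java client API.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the client owns the one and only timer; every scanner
// borrows it instead of spawning its own timer thread.
class SharedTimerClient implements AutoCloseable {
    // A single shared timer thread per client, mirroring Netty's advice to
    // "create only one instance and share it across your application".
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    KeepAliveScanner newScanner() {
        return new KeepAliveScanner(timer);  // borrow the timer, don't create one
    }

    @Override
    public void close() {
        timer.shutdownNow();  // the timer is released exactly once, with the client
    }
}

class KeepAliveScanner {
    private final ScheduledExecutorService timer;
    private ScheduledFuture<?> keepAliveTask;

    KeepAliveScanner(ScheduledExecutorService sharedTimer) {
        this.timer = sharedTimer;
    }

    void startKeepAlive(Runnable keepAlive, long periodMs) {
        keepAliveTask = timer.scheduleAtFixedRate(
                keepAlive, periodMs, periodMs, TimeUnit.MILLISECONDS);
    }

    void close() {
        // Closing (or GC-ing) a scanner cancels only its own task; it never
        // tears down the shared timer, so no timer thread can leak.
        if (keepAliveTask != null) {
            keepAliveTask.cancel(false);
        }
    }
}
```

With this shape, the ResourceLeakDetector complaint above cannot arise from scanners: there is nothing per-scanner left to release when a scanner is garbage-collected.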
[jira] [Resolved] (KUDU-3567) Resource leakage related to HashedWheelTimer in AsyncKuduScanner
[ https://issues.apache.org/jira/browse/KUDU-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3567. - Fix Version/s: 1.18.0 Resolution: Fixed > Resource leakage related to HashedWheelTimer in AsyncKuduScanner > > > Key: KUDU-3567 > URL: https://issues.apache.org/jira/browse/KUDU-3567 > Project: Kudu > Issue Type: Bug > Components: client, java >Affects Versions: 1.18.0 >Reporter: Alexey Serbin >Assignee: YifanZhang >Priority: Major > Fix For: 1.18.0 > > > With KUDU-3498 implemented in > [8683b8bdb|https://github.com/apache/kudu/commit/8683b8bdb675db96aac52d75a31d00232f7b9fb8], > there are now resource leak reports; see below. > Overall, the way {{HashedWheelTimer}} is used for keeping scanners alive > is in direct contradiction with the recommendation at [this documentation > page|https://netty.io/4.1/api/io/netty/util/HashedWheelTimer.html]: > {quote}*Do not create many instances.* > HashedWheelTimer creates a new thread whenever it is instantiated and > started. Therefore, you should make sure to create only one instance and > share it across your application. One of the common mistakes, that makes your > application unresponsive, is to create a new instance for every connection. > {quote} > Probably, a better way of implementing the keep-alive feature for scanner > objects in the Kudu Java client would be to reuse the {{HashedWheelTimer}} > instance from the corresponding {{AsyncKuduClient}} instance, not to create > a new instance of the timer (along with a corresponding thread) per > {{AsyncKuduScanner}} object. At least, an instance of {{HashedWheelTimer}} > should be properly released/shut down to avoid resource leakage (a running > thread?) when GC-ing {{AsyncKuduScanner}} objects. 
> For example, below is an example how the leak is reported when running > {{TestKuduClient.testStrings}}: > {noformat} > 23:04:57.774 [ERROR - main] (ResourceLeakDetector.java:327) LEAK: > HashedWheelTimer.release() was not called before it's garbage-collected. See > https://netty.io/wiki/reference-counted-objects.html for more information. > Recent access records: > Created at: > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:312) > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:251) > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:224) > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:203) > io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:185) > org.apache.kudu.client.AsyncKuduScanner.(AsyncKuduScanner.java:296) > org.apache.kudu.client.AsyncKuduScanner.(AsyncKuduScanner.java:431) > > org.apache.kudu.client.KuduScanner$KuduScannerBuilder.build(KuduScanner.java:260) > org.apache.kudu.client.TestKuduClient.testStrings(TestKuduClient.java:692) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) > > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) > java.util.concurrent.FutureTask.run(FutureTask.java:266) > java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian Jira 
(v8.20.10#820010)
[jira] [Updated] (KUDU-3584) TableKeyRangeTest.TestGetTableKeyRange fails from time to time
[ https://issues.apache.org/jira/browse/KUDU-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3584: Fix Version/s: 1.18.0 Resolution: Fixed Status: Resolved (was: In Review) > TableKeyRangeTest.TestGetTableKeyRange fails from time to time > -- > > Key: KUDU-3584 > URL: https://issues.apache.org/jira/browse/KUDU-3584 > Project: Kudu > Issue Type: Sub-task > Components: test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0 > > Attachments: client-test.00.6.txt.xz, client-test.01.6.txt.xz > > > The {{TableKeyRangeTest.TestGetTableKeyRange}} scenario in {{client-test}} is > flaky, especially in sanitizer builds, failing from time to time with an > error like the one below: > {noformat} > src/kudu/client/client-test.cc:9050: Failure > Expected equality of these values: > 1000 > CountRows(tokens) > Which is: 990 > {noformat} > Logs from two different failed runs are attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3584) TableKeyRangeTest.TestGetTableKeyRange fails from time to time
[ https://issues.apache.org/jira/browse/KUDU-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3584: Status: In Review (was: Open) > TableKeyRangeTest.TestGetTableKeyRange fails from time to time > -- > > Key: KUDU-3584 > URL: https://issues.apache.org/jira/browse/KUDU-3584 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: client-test.00.6.txt.xz, client-test.01.6.txt.xz > > > The {{TableKeyRangeTest.TestGetTableKeyRange}} scenario in {{client-test}} is > flaky, especially in sanitizer builds, failing from time to time with an > error like the one below: > {noformat} > src/kudu/client/client-test.cc:9050: Failure > Expected equality of these values: > 1000 > CountRows(tokens) > Which is: 990 > {noformat} > Logs from two different failed runs are attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3584) TableKeyRangeTest.TestGetTableKeyRange fails from time to time
[ https://issues.apache.org/jira/browse/KUDU-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3584: Code Review: https://gerrit.cloudera.org/#/c/21506/ > TableKeyRangeTest.TestGetTableKeyRange fails from time to time > -- > > Key: KUDU-3584 > URL: https://issues.apache.org/jira/browse/KUDU-3584 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: client-test.00.6.txt.xz, client-test.01.6.txt.xz > > > The {{TableKeyRangeTest.TestGetTableKeyRange}} scenario in {{client-test}} is > flaky, especially in sanitizer builds, failing from time to time with an > error like the one below: > {noformat} > src/kudu/client/client-test.cc:9050: Failure > Expected equality of these values: > 1000 > CountRows(tokens) > Which is: 990 > {noformat} > Logs from two different failed runs are attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3584) TableKeyRangeTest.TestGetTableKeyRange fails from time to time
[ https://issues.apache.org/jira/browse/KUDU-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3584: Affects Version/s: 1.17.0 > TableKeyRangeTest.TestGetTableKeyRange fails from time to time > -- > > Key: KUDU-3584 > URL: https://issues.apache.org/jira/browse/KUDU-3584 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: client-test.00.6.txt.xz, client-test.01.6.txt.xz > > > The {{TableKeyRangeTest.TestGetTableKeyRange}} scenario in {{client-test}} is > flaky, especially in sanitizer builds, failing from time to time with an > error like the one below: > {noformat} > src/kudu/client/client-test.cc:9050: Failure > Expected equality of these values: > 1000 > CountRows(tokens) > Which is: 990 > {noformat} > Logs from two different failed runs are attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3584) TableKeyRangeTest.TestGetTableKeyRange fails from time to time
Alexey Serbin created KUDU-3584: --- Summary: TableKeyRangeTest.TestGetTableKeyRange fails from time to time Key: KUDU-3584 URL: https://issues.apache.org/jira/browse/KUDU-3584 Project: Kudu Issue Type: Bug Components: test Reporter: Alexey Serbin Attachments: client-test.00.6.txt.xz, client-test.01.6.txt.xz The {{TableKeyRangeTest.TestGetTableKeyRange}} scenario in {{client-test}} is flaky, especially in sanitizer builds, failing from time to time with an error like the one below: {noformat} src/kudu/client/client-test.cc:9050: Failure Expected equality of these values: 1000 CountRows(tokens) Which is: 990 {noformat} Logs from two different failed runs are attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KUDU-2782) Implement distributed tracing support in Kudu
[ https://issues.apache.org/jira/browse/KUDU-2782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853838#comment-17853838 ] Alexey Serbin commented on KUDU-2782: - It's been a while: the OpenTracing project has since been sunset, and [OpenTelemetry|https://opentelemetry.io/] emerged out of OpenTracing and OpenCensus. The need for distributed tracing in Kudu still stands, and many projects have added support for OpenTelemetry at this point. When implementing this, it makes sense to take a look at [already existing integrations with OpenTelemetry|https://opentelemetry.io/ecosystem/integrations/]. > Implement distributed tracing support in Kudu > - > > Key: KUDU-2782 > URL: https://issues.apache.org/jira/browse/KUDU-2782 > Project: Kudu > Issue Type: Task > Components: ops-tooling >Reporter: Mike Percy >Priority: Major > Labels: hackathon, roadmap-candidate, supportability > > It would be useful to implement distributed tracing support in Kudu, > especially something like OpenTracing support that we could use with Zipkin, > Jaeger, DataDog, etc. Particularly useful would be auto-sampled and on-demand > traces of write RPCs since that would help us identify slow nodes or hotspots > in the replication group and troubleshoot performance and stability issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3577) Altering a table with per-range hash partitions might make the table unusable
[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3577: Fix Version/s: 1.17.1 1.18.0 Resolution: Fixed Status: Resolved (was: In Review) > Altering a table with per-range hash partitions might make the table unusable > - > > Key: KUDU-3577 > URL: https://issues.apache.org/jira/browse/KUDU-3577 > Project: Kudu > Issue Type: Bug > Components: client, master, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.17.1 1.18.0 > > > For particular table schemas with per-range hash schemas, dropping a nullable > column might make the table unusable. A workaround exists: just add the > dropped column back using the {{kudu table add_column}} CLI tool. For > example, for the reproduction scenario below, use the following command to > restore access to the table's data: > {noformat} > $ kudu table add_column $M test city string > {noformat} > As for the reproduction scenario, see below for the sequence of {{kudu}} CLI > commands. > Set an environment variable for the Kudu cluster's RPC endpoint: > {noformat} > $ export M= > {noformat} > Create a table with two range partitions. It's crucial that the {{city}} > column is nullable. 
> {noformat} > $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { > "column_name": "id", "column_type": "INT64" }, { "column_name": "name", > "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, > { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], > "key_column_names": ["id", "name", "age"] }, "partition": { > "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, > {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { > "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' > {noformat} > Add an extra range partition with custom hash schema: > {noformat} > $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema > '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, > {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}' > {noformat} > Check the updated partitioning info: > {noformat} > $ kudu table describe $M test > TABLE test ( > id INT64 NOT NULL, > name STRING NOT NULL, > age INT32 NOT NULL, > city STRING NULLABLE, > PRIMARY KEY (id, name, age) > ) > HASH (id) PARTITIONS 4 SEED 1, > HASH (name) PARTITIONS 4 SEED 2, > RANGE (age) ( > PARTITION 30 <= VALUES < 60, > PARTITION 60 <= VALUES < 90, > PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3 > ) > OWNER root > REPLICAS 1 > COMMENT > {noformat} > Drop the {{city}} column: > {noformat} > $ kudu table delete_column $M test city > {noformat} > Now try to run the {{kudu table describe}} against the table once the > {{city}} column is dropped. 
It errors out with {{Invalid argument}}: > {noformat} > $ kudu table describe $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} > A similar issue manifests itself when trying to run {{kudu table scan}} > against the table: > {noformat} > $ kudu table scan $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3577) Altering a table with per-range hash partitions might make the table unusable
[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3577: Status: In Review (was: Open) > Altering a table with per-range hash partitions might make the table unusable > - > > Key: KUDU-3577 > URL: https://issues.apache.org/jira/browse/KUDU-3577 > Project: Kudu > Issue Type: Bug > Components: client, master, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > > For particular table schemas with per-range hash schemas, dropping a nullable > column might make the table unusable. A workaround exists: just add the > dropped column back using the {{kudu table add_column}} CLI tool. For > example, for the reproduction scenario below, use the following command to > restore access to the table's data: > {noformat} > $ kudu table add_column $M test city string > {noformat} > As for the reproduction scenario, see below for the sequence of {{kudu}} CLI > commands. > Set an environment variable for the Kudu cluster's RPC endpoint: > {noformat} > $ export M= > {noformat} > Create a table with two range partitions. It's crucial that the {{city}} > column is nullable. 
> {noformat} > $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { > "column_name": "id", "column_type": "INT64" }, { "column_name": "name", > "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, > { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], > "key_column_names": ["id", "name", "age"] }, "partition": { > "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, > {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { > "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' > {noformat} > Add an extra range partition with custom hash schema: > {noformat} > $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema > '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, > {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}' > {noformat} > Check the updated partitioning info: > {noformat} > $ kudu table describe $M test > TABLE test ( > id INT64 NOT NULL, > name STRING NOT NULL, > age INT32 NOT NULL, > city STRING NULLABLE, > PRIMARY KEY (id, name, age) > ) > HASH (id) PARTITIONS 4 SEED 1, > HASH (name) PARTITIONS 4 SEED 2, > RANGE (age) ( > PARTITION 30 <= VALUES < 60, > PARTITION 60 <= VALUES < 90, > PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3 > ) > OWNER root > REPLICAS 1 > COMMENT > {noformat} > Drop the {{city}} column: > {noformat} > $ kudu table delete_column $M test city > {noformat} > Now try to run the {{kudu table describe}} against the table once the > {{city}} column is dropped. 
It errors out with {{Invalid argument}}: > {noformat} > $ kudu table describe $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} > A similar issue manifests itself when trying to run {{kudu table scan}} > against the table: > {noformat} > $ kudu table scan $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3577) Altering a table with per-range hash partitions might make the table unusable
[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3577: Code Review: https://gerrit.cloudera.org/#/c/21486/ > Altering a table with per-range hash partitions might make the table unusable > - > > Key: KUDU-3577 > URL: https://issues.apache.org/jira/browse/KUDU-3577 > Project: Kudu > Issue Type: Bug > Components: client, master, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > > For particular table schemas with per-range hash schemas, dropping a nullable > column might make the table unusable. A workaround exists: just add the > dropped column back using the {{kudu table add_column}} CLI tool. For > example, for the reproduction scenario below, use the following command to > restore access to the table's data: > {noformat} > $ kudu table add_column $M test city string > {noformat} > As for the reproduction scenario, see below for the sequence of {{kudu}} CLI > commands. > Set an environment variable for the Kudu cluster's RPC endpoint: > {noformat} > $ export M= > {noformat} > Create a table with two range partitions. It's crucial that the {{city}} > column is nullable. 
> {noformat} > $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { > "column_name": "id", "column_type": "INT64" }, { "column_name": "name", > "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, > { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], > "key_column_names": ["id", "name", "age"] }, "partition": { > "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, > {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { > "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' > {noformat} > Add an extra range partition with custom hash schema: > {noformat} > $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema > '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, > {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}' > {noformat} > Check the updated partitioning info: > {noformat} > $ kudu table describe $M test > TABLE test ( > id INT64 NOT NULL, > name STRING NOT NULL, > age INT32 NOT NULL, > city STRING NULLABLE, > PRIMARY KEY (id, name, age) > ) > HASH (id) PARTITIONS 4 SEED 1, > HASH (name) PARTITIONS 4 SEED 2, > RANGE (age) ( > PARTITION 30 <= VALUES < 60, > PARTITION 60 <= VALUES < 90, > PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3 > ) > OWNER root > REPLICAS 1 > COMMENT > {noformat} > Drop the {{city}} column: > {noformat} > $ kudu table delete_column $M test city > {noformat} > Now try to run the {{kudu table describe}} against the table once the > {{city}} column is dropped. 
It errors out with {{Invalid argument}}: > {noformat} > $ kudu table describe $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} > A similar issue manifests itself when trying to run {{kudu table scan}} > against the table: > {noformat} > $ kudu table scan $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3577) Altering a table with per-range hash partitions might make the table unusable
[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3577: Summary: Altering a table with per-range hash partitions might make the table unusable (was: Dropping a nullable column from a table with per-range hash partitions make the table unusable) > Altering a table with per-range hash partitions might make the table unusable > - > > Key: KUDU-3577 > URL: https://issues.apache.org/jira/browse/KUDU-3577 > Project: Kudu > Issue Type: Bug > Components: client, master, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > > For particular table schemas with per-range hash schemas, dropping a nullable > column might make the table unusable. A workaround exists: just add the > dropped column back using the {{kudu table add_column}} CLI tool. For > example, for the reproduction scenario below, use the following command to > restore access to the table's data: > {noformat} > $ kudu table add_column $M test city string > {noformat} > As for the reproduction scenario, see below for the sequence of {{kudu}} CLI > commands. > Set an environment variable for the Kudu cluster's RPC endpoint: > {noformat} > $ export M= > {noformat} > Create a table with two range partitions. It's crucial that the {{city}} > column is nullable. 
> {noformat} > $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { > "column_name": "id", "column_type": "INT64" }, { "column_name": "name", > "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, > { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], > "key_column_names": ["id", "name", "age"] }, "partition": { > "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, > {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { > "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' > {noformat} > Add an extra range partition with custom hash schema: > {noformat} > $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema > '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, > {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}' > {noformat} > Check the updated partitioning info: > {noformat} > $ kudu table describe $M test > TABLE test ( > id INT64 NOT NULL, > name STRING NOT NULL, > age INT32 NOT NULL, > city STRING NULLABLE, > PRIMARY KEY (id, name, age) > ) > HASH (id) PARTITIONS 4 SEED 1, > HASH (name) PARTITIONS 4 SEED 2, > RANGE (age) ( > PARTITION 30 <= VALUES < 60, > PARTITION 60 <= VALUES < 90, > PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3 > ) > OWNER root > REPLICAS 1 > COMMENT > {noformat} > Drop the {{city}} column: > {noformat} > $ kudu table delete_column $M test city > {noformat} > Now try to run the {{kudu table describe}} against the table once the > {{city}} column is dropped. 
It errors out with {{Invalid argument}}: > {noformat} > $ kudu table describe $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} > A similar issue manifests itself when trying to run {{kudu table scan}} > against the table: > {noformat} > $ kudu table scan $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KUDU-3582) Incomplete sidecar data returned by RpcContext::GetInboundSidecar()
[ https://issues.apache.org/jira/browse/KUDU-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850533#comment-17850533 ] Alexey Serbin edited comment on KUDU-3582 at 5/30/24 12:53 AM: --- Hi [~wzhou], I'm not sure there is enough information in the description to understand what this report is about. IIUC, the summary states that the sidecar data returned by {{RpcContext}} wasn't complete, but it's also stated that the corresponding RPC has been cancelled. Why would one expect complete data from a cancelled RPC? Could you please clarify the following: # What was the expected KRPC behavior in your scenario? # What is the actual KRPC behavior you observed? In other words, what's the essence of the bug reported here? Thank you! was (Author: aserbin): Hi [~wzhou], I'm not sure there is enough information in the description to understand what this report is about. Could you please clarify the following: # What was the expected KRPC behavior in your scenario? # What is the actual KRPC behavior you observed? In other words, what's the essence of the bug reported here? Thank you! > Incomplete sidecar data returned by RpcContext::GetInboundSidecar() > --- > > Key: KUDU-3582 > URL: https://issues.apache.org/jira/browse/KUDU-3582 > Project: Kudu > Issue Type: Bug > Components: rpc >Reporter: Wenzhe Zhou >Priority: Major > > The Impala executor calls the KRPC sidecar API RpcContext::GetInboundSidecar() > to read a serialized Thrift object from KRPC, then does Thrift > deserialization. (See GetSidecar() at > https://github.com/apache/impala/blob/master/be/src/rpc/sidecar-util.h#L60-L67) > In a customer-reported case, extra workloads were added to the Impala cluster, > which caused long delays for KRPCs between Impala daemons. The long delays > caused KRPCs to be cancelled, and hence Impala query failures. 
> {code:java} > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.383988 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.243.38.160:27000 > (fragment_instance_id=9940332ce09828fd:b75196630632): took 59m57s. Error: > Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state > ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.384006 1182322 kudu-status-util.h:55] EndDataStream() to > 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 > is cancelled in state ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.384631 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.34.163.32:27000 > (fragment_instance_id=9940332ce09828fd:b75196630735): took 59m57s. Error: > Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state > ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.384668 1182314 kudu-status-util.h:55] EndDataStream() to > 10.34.163.32:27000 failed: Aborted: EndDataStream RPC to 10.34.163.32:27000 > is cancelled in state ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.420662 1182313 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.243.36.21:27000 > (fragment_instance_id=9940332ce09828fd:b7519663033a): took 1h. 
Error: > Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state > ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.420683 1182313 kudu-status-util.h:55] EndDataStream() to > 10.243.36.21:27000 failed: Aborted: EndDataStream RPC to 10.243.36.21:27000 > is cancelled in state ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.420779 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.243.38.160:27000 > (fragment_instance_id=9940332ce09828fd:b7519663033b): took 1h. Error: > Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state > ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.420799 1182322 kudu-status-util.h:55] EndDataStream() to > 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 > is cancelled in state ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.421937 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.34.163.32:27000 > (fragment_instance_id=9940332ce09828fd:b7519663043e): took 1h. Error: > Aborted: > {code} > Then
[jira] [Commented] (KUDU-3582) Incomplete sidecar data returned by RpcContext::GetInboundSidecar()
[ https://issues.apache.org/jira/browse/KUDU-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850533#comment-17850533 ] Alexey Serbin commented on KUDU-3582: - Hi [~wzhou], I'm not sure there is enough information in the description to understand what this report is about. Could you please clarify the following: # What was the expected KRPC behavior in your scenario? # What is the actual KRPC behavior you observed? In other words, what's the essence of the bug reported here? Thank you! > Incomplete sidecar data returned by RpcContext::GetInboundSidecar() > --- > > Key: KUDU-3582 > URL: https://issues.apache.org/jira/browse/KUDU-3582 > Project: Kudu > Issue Type: Bug > Components: rpc >Reporter: Wenzhe Zhou >Priority: Major > > The Impala executor calls the KRPC sidecar API RpcContext::GetInboundSidecar() to > read a serialized Thrift object from KRPC, then performs Thrift deserialization. (See > GetSidecar() at > https://github.com/apache/impala/blob/master/be/src/rpc/sidecar-util.h#L60-L67) > In a customer-reported case, extra workloads were added to the Impala cluster, > which caused long delays for KRPCs between Impala daemons. The long delays > caused KRPCs to be cancelled, hence Impala query failures. > {code:java} > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.383988 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.243.38.160:27000 > (fragment_instance_id=9940332ce09828fd:b75196630632): took 59m57s. 
Error: > Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state > ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.384006 1182322 kudu-status-util.h:55] EndDataStream() to > 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 > is cancelled in state ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.384631 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.34.163.32:27000 > (fragment_instance_id=9940332ce09828fd:b75196630735): took 59m57s. Error: > Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state > ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.384668 1182314 kudu-status-util.h:55] EndDataStream() to > 10.34.163.32:27000 failed: Aborted: EndDataStream RPC to 10.34.163.32:27000 > is cancelled in state ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.420662 1182313 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.243.36.21:27000 > (fragment_instance_id=9940332ce09828fd:b7519663033a): took 1h. Error: > Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state > ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.420683 1182313 kudu-status-util.h:55] EndDataStream() to > 10.243.36.21:27000 failed: Aborted: EndDataStream RPC to 10.243.36.21:27000 > is cancelled in state ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.420779 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.243.38.160:27000 > (fragment_instance_id=9940332ce09828fd:b7519663033b): took 1h. 
Error: > Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state > ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.420799 1182322 kudu-status-util.h:55] EndDataStream() to > 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 > is cancelled in state ON_OUTBOUND_QUEUE > impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 > 05:40:09.421937 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream > RPC to 10.34.163.32:27000 > (fragment_instance_id=9940332ce09828fd:b7519663043e): took 1h. Error: > Aborted: > {code} > Then extra workloads were removed and the Impala cluster was restarted. While > the Impala cluster was restarting, many Impala daemons crashed. The stacktraces from > core files and log messages show that Impala daemons received incomplete > data from the KRPC sidecar. The incomplete data did not cause a Thrift > deserialization failure, so the valid-looking but incomplete data was not caught and > handled properly. > See the Impala Jira: IMPALA-13107. The issue could not be reproduced locally. > A quick fix on the Impala side was merged to mitigate the crash issue. The issue needs > further investigation inside KRPC. -- This message was sent by
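The failure mode described above -- truncated sidecar bytes that still deserialize without error -- is the kind of problem a length-prefixed frame check catches early. Below is a minimal, hypothetical sketch of such a guard in Python; it is neither the actual Impala fix nor KRPC's wire format, just an illustration of rejecting incomplete payloads before deserialization:

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix a payload with its 4-byte little-endian length."""
    return struct.pack("<I", len(payload)) + payload

def unframe(buf: bytes) -> bytes:
    """Reject truncated buffers instead of silently returning partial data."""
    if len(buf) < 4:
        raise ValueError("truncated frame header")
    (n,) = struct.unpack_from("<I", buf)
    if len(buf) - 4 < n:
        raise ValueError(f"incomplete payload: expected {n} bytes, got {len(buf) - 4}")
    return buf[4:4 + n]
```

With such a check in place, a payload cut short by a cancelled RPC fails loudly at the framing layer instead of producing a "valid" but incomplete deserialized object.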
[jira] [Updated] (KUDU-3581) Netty CVE Rapid Reset
[ https://issues.apache.org/jira/browse/KUDU-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3581: Fix Version/s: 1.18.0 1.17.1 Resolution: Fixed Status: Resolved (was: In Review) > Netty CVE Rapid Reset > - > > Key: KUDU-3581 > URL: https://issues.apache.org/jira/browse/KUDU-3581 > Project: Kudu > Issue Type: Task >Reporter: Colm O hEigeartaigh >Priority: Minor > Fix For: 1.18.0, 1.17.1 > > > The version of Netty in Kudu 1.17.0 (4.1.94.Final - > [https://github.com/apache/kudu/blob/6d6364d19d287d8effb604b6ab11dfdff5db794e/java/gradle/dependencies.gradle#L52)] > is vulnerable to a security issue: > [https://github.com/advisories/GHSA-xpw8-rcwv-8f8p] > Please upgrade to at least 4.1.100.Final -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3581) Netty CVE Rapid Reset
[ https://issues.apache.org/jira/browse/KUDU-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3581: Status: In Review (was: Open) > Netty CVE Rapid Reset > - > > Key: KUDU-3581 > URL: https://issues.apache.org/jira/browse/KUDU-3581 > Project: Kudu > Issue Type: Task >Reporter: Colm O hEigeartaigh >Priority: Minor > > The version of Netty in Kudu 1.17.0 (4.1.94.Final - > [https://github.com/apache/kudu/blob/6d6364d19d287d8effb604b6ab11dfdff5db794e/java/gradle/dependencies.gradle#L52)] > is vulnerable to a security issue: > [https://github.com/advisories/GHSA-xpw8-rcwv-8f8p] > Please upgrade to at least 4.1.100.Final -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3581) Netty CVE Rapid Reset
[ https://issues.apache.org/jira/browse/KUDU-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3581: Code Review: https://gerrit.cloudera.org/#/c/21464/ > Netty CVE Rapid Reset > - > > Key: KUDU-3581 > URL: https://issues.apache.org/jira/browse/KUDU-3581 > Project: Kudu > Issue Type: Task >Reporter: Colm O hEigeartaigh >Priority: Minor > > The version of Netty in Kudu 1.17.0 (4.1.94.Final - > [https://github.com/apache/kudu/blob/6d6364d19d287d8effb604b6ab11dfdff5db794e/java/gradle/dependencies.gradle#L52)] > is vulnerable to a security issue: > [https://github.com/advisories/GHSA-xpw8-rcwv-8f8p] > Please upgrade to at least 4.1.100.Final -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KUDU-3581) Netty CVE Rapid Reset
[ https://issues.apache.org/jira/browse/KUDU-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850053#comment-17850053 ] Alexey Serbin commented on KUDU-3581: - Thank you for the report. IIUC, Kudu isn't affected by [https://github.com/advisories/GHSA-xpw8-rcwv-8f8p|https://github.com/advisories/GHSA-xpw8-rcwv-8f8p] since it doesn't use Netty for any of its server-side functionality. Kudu's server side is C++ only; no Java is involved. The Netty component in the Java client should be upgraded eventually, at least to satisfy various security scanners. > Netty CVE Rapid Reset > - > > Key: KUDU-3581 > URL: https://issues.apache.org/jira/browse/KUDU-3581 > Project: Kudu > Issue Type: Task >Reporter: Colm O hEigeartaigh >Priority: Minor > > The version of Netty in Kudu 1.17.0 (4.1.94.Final - > [https://github.com/apache/kudu/blob/6d6364d19d287d8effb604b6ab11dfdff5db794e/java/gradle/dependencies.gradle#L52)] > is vulnerable to a security issue: > [https://github.com/advisories/GHSA-xpw8-rcwv-8f8p] > Please upgrade to at least 4.1.100.Final -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3581) Netty CVE Rapid Reset
[ https://issues.apache.org/jira/browse/KUDU-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3581: Priority: Minor (was: Major) > Netty CVE Rapid Reset > - > > Key: KUDU-3581 > URL: https://issues.apache.org/jira/browse/KUDU-3581 > Project: Kudu > Issue Type: Task >Reporter: Colm O hEigeartaigh >Priority: Minor > > The version of Netty in Kudu 1.17.0 (4.1.94.Final - > [https://github.com/apache/kudu/blob/6d6364d19d287d8effb604b6ab11dfdff5db794e/java/gradle/dependencies.gradle#L52)] > is vulnerable to a security issue: > [https://github.com/advisories/GHSA-xpw8-rcwv-8f8p] > Please upgrade to at least 4.1.100.Final -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KUDU-3429) Refactor CompactRowSetsOp to run on a pre-determined memory budget
[ https://issues.apache.org/jira/browse/KUDU-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-3429: --- Assignee: Ashwani Raina (was: Alexey Serbin) > Refactor CompactRowSetsOp to run on a pre-determined memory budget > --- > > Key: KUDU-3429 > URL: https://issues.apache.org/jira/browse/KUDU-3429 > Project: Kudu > Issue Type: Improvement >Reporter: Alexey Serbin >Assignee: Ashwani Raina >Priority: Major > > [KUDU-3406|https://issues.apache.org/jira/browse/KUDU-3406] added memory > budgeting for running CompactRowSetsOp maintenance operations. By its > nature, that is an interim approach, adding memory budgeting on top of > the current CompactRowSetsOp implementation as-is. > Ideally, the implementation of CompactRowSetsOp should be refactored to merge > the deltas in participating rowsets sequentially, chunk by chunk, persisting > the results and allocating memory just for a small batch of processed deltas, > not loading all the deltas at once. > This JIRA item is to track the work in the context outlined above. > Key points to address in this scope: > * even if it's a merge-like operation by its nature, the current > implementation of CompactRowSetsOp allocates all the memory necessary to load > the UNDO deltas at once, and it keeps all the preliminary results in > memory as well before persisting the result data to disk > * the current implementation of CompactRowSetsOp loads all the UNDO deltas > from the rowsets selected for compaction regardless of whether they are ancient > or not; it discards the data sourced from the ancient deltas only at the very > end, before persisting the result data > Also, while keeping memory usage on a predetermined budget, the new > implementation of CompactRowSetsOp should strive to avoid IO multiplication > as much as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KUDU-3429) Refactor CompactRowSetsOp to run on a pre-determined memory budget
[ https://issues.apache.org/jira/browse/KUDU-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-3429: --- Assignee: Alexey Serbin > Refactor CompactRowSetsOp to run on a pre-determined memory budget > --- > > Key: KUDU-3429 > URL: https://issues.apache.org/jira/browse/KUDU-3429 > Project: Kudu > Issue Type: Improvement >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > [KUDU-3406|https://issues.apache.org/jira/browse/KUDU-3406] added memory > budgeting for running CompactRowSetsOp maintenance operations. By its > nature, that is an interim approach, adding memory budgeting on top of > the current CompactRowSetsOp implementation as-is. > Ideally, the implementation of CompactRowSetsOp should be refactored to merge > the deltas in participating rowsets sequentially, chunk by chunk, persisting > the results and allocating memory just for a small batch of processed deltas, > not loading all the deltas at once. > This JIRA item is to track the work in the context outlined above. > Key points to address in this scope: > * even if it's a merge-like operation by its nature, the current > implementation of CompactRowSetsOp allocates all the memory necessary to load > the UNDO deltas at once, and it keeps all the preliminary results in > memory as well before persisting the result data to disk > * the current implementation of CompactRowSetsOp loads all the UNDO deltas > from the rowsets selected for compaction regardless of whether they are ancient > or not; it discards the data sourced from the ancient deltas only at the very > end, before persisting the result data > Also, while keeping memory usage on a predetermined budget, the new > implementation of CompactRowSetsOp should strive to avoid IO multiplication > as much as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3429) Refactor CompactRowSetsOp to run on a pre-determined memory budget
[ https://issues.apache.org/jira/browse/KUDU-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3429: Description: [KUDU-3406|https://issues.apache.org/jira/browse/KUDU-3406] added memory budgeting for running CompactRowSetsOp maintenance operations. By its nature, that is an interim approach, adding memory budgeting on top of the current CompactRowSetsOp implementation as-is. Ideally, the implementation of CompactRowSetsOp should be refactored to merge the deltas in participating rowsets sequentially, chunk by chunk, persisting the results and allocating memory just for a small batch of processed deltas, not loading all the deltas at once. This JIRA item is to track the work in the context outlined above. Key points to address in this scope: * even if it's a merge-like operation by its nature, the current implementation of CompactRowSetsOp allocates all the memory necessary to load the UNDO deltas at once, and it keeps all the preliminary results in memory as well before persisting the result data to disk * the current implementation of CompactRowSetsOp loads all the UNDO deltas from the rowsets selected for compaction regardless of whether they are ancient or not; it discards the data sourced from the ancient deltas only at the very end, before persisting the result data Also, while keeping memory usage on a predetermined budget, the new implementation of CompactRowSetsOp should strive to avoid IO multiplication as much as possible. was: [KUDU-3406|https://issues.apache.org/jira/browse/KUDU-3406] added memory budgeting for running CompactRowSetsOp maintenance operations. On its nature, that provides an interim approach adding memory budgeting on top of the current CompactRowSetsOp implementation as-is. 
Ideally, the implementation of CompactRowSetsOp should be refactored to merge the deltas in participating rowsets sequentially, chunk by chunk, persisting the results and allocating memory just for small bunch of processed deltas, not loading all the deltas at once. This JIRA item is to track the work in the context outlined above. Below are a key points to address in this scope: * even if it's a merge-like operation by its nature, the current implementation of CompactRowSetsOp allocates all the memory necessary to load the UNDO deltas at once, and it keeps all the preliminary results in the memory as well before persisting the result data to disk * the current implementation of CompactRowSetsOp loads all the UNDO deltas from the rowsets selected for compaction regardless whether they are ancient or not; it discards of the data sourced from the ancient deltas in the very end before persisting the result data Also, while keeping memory usage on a predetermined budget, the new implementation for CompactRowSetsOp should strive to avoid IO multiplication as much as possible. > Refactor CompactRowSetsOp to run on a pre-determined memory budget > --- > > Key: KUDU-3429 > URL: https://issues.apache.org/jira/browse/KUDU-3429 > Project: Kudu > Issue Type: Improvement >Reporter: Alexey Serbin >Priority: Major > > [KUDU-3406|https://issues.apache.org/jira/browse/KUDU-3406] added memory > budgeting for running CompactRowSetsOp maintenance operations. By its > nature, that is an interim approach, adding memory budgeting on top of > the current CompactRowSetsOp implementation as-is. > Ideally, the implementation of CompactRowSetsOp should be refactored to merge > the deltas in participating rowsets sequentially, chunk by chunk, persisting > the results and allocating memory just for a small batch of processed deltas, > not loading all the deltas at once. > This JIRA item is to track the work in the context outlined above. 
> Key points to address in this scope: > * even if it's a merge-like operation by its nature, the current > implementation of CompactRowSetsOp allocates all the memory necessary to load > the UNDO deltas at once, and it keeps all the preliminary results in > memory as well before persisting the result data to disk > * the current implementation of CompactRowSetsOp loads all the UNDO deltas > from the rowsets selected for compaction regardless of whether they are ancient > or not; it discards the data sourced from the ancient deltas only at the very > end, before persisting the result data > Also, while keeping memory usage on a predetermined budget, the new > implementation of CompactRowSetsOp should strive to avoid IO multiplication > as much as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
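The chunk-by-chunk merge proposed in this issue can be sketched as follows. This is a hypothetical Python illustration of the technique, not Kudu's actual C++ implementation; the (key, timestamp) delta shape and the `ancient_cutoff_ts` parameter are assumptions made for the example:

```python
import heapq
from itertools import islice

def compact_rowsets(delta_streams, chunk_size, ancient_cutoff_ts):
    """Merge sorted (key, timestamp) delta streams chunk by chunk.

    Memory is bounded by chunk_size deltas: the k-way merge is lazy,
    and ancient deltas are discarded before each chunk is 'persisted'
    (yielded here in place of a disk write).
    """
    merged = heapq.merge(*delta_streams)  # lazy merge, O(k) buffering
    while True:
        chunk = list(islice(merged, chunk_size))  # at most chunk_size deltas in memory
        if not chunk:
            break
        yield [d for d in chunk if d[1] >= ancient_cutoff_ts]
```

The design point mirrors the key points above: instead of loading all UNDO deltas and holding preliminary results until the end, each fixed-size chunk is filtered and persisted immediately, so peak memory no longer depends on rowset size.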
[jira] [Commented] (KUDU-3406) CompactRowSetsOp can allocate much more memory than specified by the hard memory limit
[ https://issues.apache.org/jira/browse/KUDU-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849077#comment-17849077 ] Alexey Serbin commented on KUDU-3406: - A note: it's also necessary to pick up [ae7b08c00|https://github.com/apache/kudu/commit/ae7b08c006167da1ebb0c4302e5d6d7aa739a862] -- it contains the fix for the threshold MiBytes-to-bytes conversion. > CompactRowSetsOp can allocate much more memory than specified by the hard > memory limit > -- > > Key: KUDU-3406 > URL: https://issues.apache.org/jira/browse/KUDU-3406 > Project: Kudu > Issue Type: Bug > Components: master, tserver >Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0, 1.14.0, > 1.15.0, 1.16.0 >Reporter: Alexey Serbin >Assignee: Ashwani Raina >Priority: Critical > Labels: compaction, stability > Fix For: 1.17.0 > > Attachments: 270.svg, 283.svg, 296.svg, 308.svg, 332.svg, 344.svg, > fs_list.before > > > In some scenarios, rowsets can accumulate a lot of data, so {{kudu-master}} > and {{kudu-tserver}} processes grow far beyond the hard memory limit > (controlled by the {{\-\-memory_limit_hard_bytes}} flag) when running > CompactRowSetsOp. In some cases, a Kudu server process consumes all the > available memory, so that the OS might invoke the OOM killer. > At this point I'm not yet sure about the exact versions affected, and what > leads to accumulating so much data in flushed rowsets, but I know that 1.13, > 1.14, 1.15 and 1.16 are affected. It's also not clear whether the actual > regression is in allowing the flushed rowsets to grow that big. > There is a reproduction scenario for this bug with {{kudu-master}} using > real data from the field. With that data, {{kudu fs list}} reveals a rowset > with many UNDOs: see the attached {{fs_list.before}} file. 
When starting > {{kudu-master}} with the data, the process memory usage eventually peaked > at about 25 GBytes of RSS while running CompactRowSetsOp, and then the RSS > eventually subsided to about 200 MBytes once the CompactRowSetsOp > completed. > I also attached several SVG files generated by TCMalloc's pprof from the > memory profile snapshots output by {{kudu-master}} when configured to dump > allocation stats every 512 MBytes. I generated the SVG reports for profiles > attributed to the highest memory usage: > {noformat} > Dumping heap profile to /opt/tmp/master/nn1/profile.0270.heap (24573 MB > currently in use) > Dumping heap profile to /opt/tmp/master/nn1/profile.0283.heap (64594 MB > allocated cumulatively, 13221 MB currently in use) > Dumping heap profile to /opt/tmp/master/nn1/profile.0296.heap (77908 MB > allocated cumulatively, 12110 MB currently in use) > Dumping heap profile to /opt/tmp/master/nn1/profile.0308.heap (90197 MB > allocated cumulatively, 12406 MB currently in use) > Dumping heap profile to /opt/tmp/master/nn1/profile.0332.heap (114775 MB > allocated cumulatively, 23884 MB currently in use) > Dumping heap profile to /opt/tmp/master/nn1/profile.0344.heap (127064 MB > allocated cumulatively, 12648 MB currently in use) > {noformat} > The report from the compaction doesn't look like anything extraordinary > (except for the duration): > {noformat} > I20221012 10:45:49.684247 101750 maintenance_manager.cc:603] P > 68dbea0ec022440d9fc282099a8656cb: > CompactRowSetsOp() complete. 
Timing: real > 522.617s user 471.783s sys 46.588s Metrics: > {"bytes_written":1665145,"cfile_cache_hit":846,"cfile_cache_hit_bytes":14723646,"cfile_cache_miss":1786556,"cfile_cache_miss_bytes":4065589152,"cfile_init":7,"delta_iterators_relevant":1558,"dirs.queue_time_us":220086,"dirs.run_cpu_time_us":89219,"dirs.run_wall_time_us":89163,"drs_written":1,"fdatasync":15,"fdatasync_us":150709,"lbm_read_time_us":11120726,"lbm_reads_1-10_ms":1,"lbm_reads_lt_1ms":1786583,"lbm_write_time_us":14120016,"lbm_writes_1-10_ms":3,"lbm_writes_lt_1ms":894069,"mutex_wait_us":108,"num_input_rowsets":5,"rows_written":4043,"spinlock_wait_cycles":14720,"thread_start_us":741,"threads_started":9,"wal-append.queue_time_us":307} > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
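The MiBytes-to-bytes conversion fix mentioned in the comment above is an instance of the classic unit-mismatch bug: comparing a byte count against a threshold expressed in MiB. A tiny hedged illustration (the function and parameter names are hypothetical, not Kudu's code):

```python
MIB = 1024 * 1024  # bytes per mebibyte

def over_budget(allocated_bytes: int, threshold_mib: int) -> bool:
    # Convert the MiB-denominated threshold to bytes before comparing.
    # The buggy variant compares allocated_bytes against threshold_mib
    # directly, making the effective threshold ~a million times too small.
    return allocated_bytes > threshold_mib * MIB
```

Keeping all internal accounting in a single unit (bytes) and converting at the configuration boundary avoids this bug class entirely.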
[jira] [Resolved] (KUDU-3452) Support creating three-replicas table or partition when only 2 tservers healthy
[ https://issues.apache.org/jira/browse/KUDU-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3452. - Fix Version/s: 1.18.0 Resolution: Fixed > Support creating three-replicas table or partition when only 2 tservers > healthy > --- > > Key: KUDU-3452 > URL: https://issues.apache.org/jira/browse/KUDU-3452 > Project: Kudu > Issue Type: Improvement >Reporter: Xixu Wang >Priority: Major > Fix For: 1.18.0 > > > h1. Background > In my case, every day a new Kudu table (called: history_data_table) is > created to store historical data, and a new partition is added to another table (called: > business_data_table) to store today's data. These tables and > partitions all require 3 replicas. This business logic was implemented by > some Python scripts. My Kudu cluster contains 3 masters and 3 tservers. Flag: > --catalog_manager_check_ts_count_for_create_table is false. > Sometimes a tserver may become unavailable. The table-creation task then > retries continuously and keeps failing until the tserver becomes healthy again. > See the error: > {color:#ff8b00}E0222 11:10:32.767140 3321 catalog_manager.cc:672] Error > processing pending assignments: Invalid argument: error selecting replicas > for tablet 41dffa9783f14f36a5b6c35e89075c1a, state:0: Not enough tablet > servers are online for table 'test_table'. Need at least 3 replicas, but only > 2 tablet servers are available{color} > {color:#172b4d}As there are not enough replicas, the tablet will never be > created, and its state is not RUNNING. Therefore, reads and writes to > this tablet will fail even though 2 tservers could be used to create 2 > replicas.{color} > > An already-created tablet can still serve requests even if one of its 3 > replicas becomes unavailable. Why, then, can a three-replica table not be created when > only 2 tservers are healthy? > > Besides, a valid table-creation task can be affected by other > invalid tasks. 
In the example above, a table-creation task with RF=1 will > still not succeed even if more than one tablet server is alive, > because the background task manager breaks the whole process when it finds a > failed tablet-creation task and begins a new process to try to execute all > tasks. > > > h1. Design > A new flag, --support_create_tablet_without_enough_healthy_tservers, is added. > The original logic stays the same. When this flag is set to true, a > three-replica tablet can be created successfully, with its status indicating one missing > replica. Such a tablet can be read and written normally. > > There are 4 things to do: > # A tool to cancel the table-creation task. > # A tool to show the running table-creation tasks. > # A method to create tables without enough healthy tservers. > # Make a valid table-creation task unaffected by other invalid tasks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
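The flag's proposed behavior amounts to a relaxed replica-selection step: place as many replicas as there are live tservers and mark the tablet under-replicated instead of failing creation. A hypothetical Python sketch of that decision (the names and return shape are illustrative, not Kudu's catalog-manager code):

```python
def place_replicas(live_tservers, rf, allow_under_replication=False):
    """Pick tablet servers to host a new tablet's replicas.

    Returns (chosen_tservers, under_replicated). With the proposed flag
    enabled, creation succeeds with fewer replicas than RF and the tablet
    is reported as under-replicated instead of failing outright.
    """
    if len(live_tservers) >= rf:
        return live_tservers[:rf], False
    if not allow_under_replication:
        # Mirrors the current behavior behind the reported error message.
        raise RuntimeError(
            f"Need at least {rf} replicas, but only "
            f"{len(live_tservers)} tablet servers are available")
    return list(live_tservers), True
```

Once an under-replicated tablet exists, the usual re-replication machinery can add the missing replica when a third tserver becomes healthy again.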
[jira] [Comment Edited] (KUDU-3577) Dropping a nullable column from a table with per-range hash partitions makes the table unusable
[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848014#comment-17848014 ] Alexey Serbin edited comment on KUDU-3577 at 5/21/24 12:46 AM: --- The root cause of the issue is storing the range partition information using {{RowOperationsPB}}: {noformat} message RangeWithHashSchemaPB { // Row operations containing the lower and upper range bound for the range. optional RowOperationsPB range_bounds = 1; // Hash schema for the range. repeated HashBucketSchemaPB hash_schema = 2; } {noformat} The problem is that the range partition boundaries information is stored in a serialized format, where the serialization depends on the current schema of the table, since the data in the serialized format looks like this: {noformat} [type_of_range_boundary][columns_set_bitmap][non_null_bitmap][encoded_range_key] {noformat} Essentially, the size of the 'columns_set_bitmap' and 'null_bitmap' depend on the total number of columns in the table and the number of nullable columns in the table correspondingly. Also, the latter field can be absent if there isn't any nullable column in the table -- that's exactly the case exposed by the reproduction scenario. The information should have been encoded independently of the table schema, similar to how the tablet's start/end ranges are encoded and stored. Alternatively, primary-key-only sub-schema should have been used to encode the range boundaries in the field of the {{RowOperationsPB}} type -- since the primary key is immutable for a table since its creation, the serialized representation wouldn't change on any allowable ALTER operations for the table. Since the feature is already released and we don't control the deployment of Kudu clients that use the original way of encoding/decoding the data, it adds compatibility constraints. 
The following approach preserves backward compatibility (while it isn't optimal from the performance standpoint): # Upon processing ALTER table operations for adding and removing columns, it's necessary to check if the size of the columns-set bitmap and non-null bitmap changes after applying the ALTER operation on a table. # If the size of either bitmap changes, it's necessary to re-encode the information stored as {{PartitionSchemaPB::custom_hash_schema_ranges}} in the system catalog table for the partition ranges of the affected table. The backwards-compatible approach above might still expose a gap when two clients are working with the same table and at least one of them is altering the table by dropping/adding columns, but at least it's better than the current state, where a table becomes inaccessible since its schema information becomes effectively corrupted under the conditions described above. was (Author: aserbin): The root cause of the issue is storing the range partition information using {{RowOperationsPB}}: {noformat} message RangeWithHashSchemaPB { // Row operations containing the lower and upper range bound for the range. optional RowOperationsPB range_bounds = 1; // Hash schema for the range. repeated HashBucketSchemaPB hash_schema = 2; } {noformat} The problem is that the range partition boundaries information is stored in a serialized format, where the serialization depends on the current schema of the table, since the data in the serialized format looks like this: {noformat} [type_of_range_boundary][columns_set_bitmap][non_null_bitmap][encoded_range_key] {noformat} Essentially, the size of the 'columns_set_bitmap' and 'null_bitmap' depend on the total number of columns in the table and the number of nullable columns in the table correspondingly. Also, the latter field can be absent if there isn't any nullable column in the table -- that's exactly the case exposed by the reproduction scenario. 
The information should have been encoded independently of the table schema, similar to how the tablet's start/end ranges are encoded and stored. Alternatively, an alternative primary-key-only schema should have been used to encode the range boundaries in the field of the {{RowOperationsPB}} type -- since the primary key is immutable for a table since its creation, the serialized representation wouldn't change on any allowable ALTER operations for the table. Since the feature is already released and we don't control the deployment of Kudu clients that use the original way of encoding/decoding the data, it adds compatibility constraints. The following approach preserves the backward compatibility (while it isn't optimal from the performance standpoint): # Upon processing ALTER table operations for adding and removing columns, it's necessary to check if the size of the columns-set bitmap and non-null bitmap changes after applying the ALTER operation on a table. # If the size of either of the bitmap changes
[jira] [Commented] (KUDU-3577) Dropping a nullable column from a table with per-range hash partitions makes the table unusable
[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848014#comment-17848014 ] Alexey Serbin commented on KUDU-3577: - The root cause of the issue is storing the range partition information using {{RowOperationsPB}}: {noformat} message RangeWithHashSchemaPB { // Row operations containing the lower and upper range bound for the range. optional RowOperationsPB range_bounds = 1; // Hash schema for the range. repeated HashBucketSchemaPB hash_schema = 2; } {noformat} The problem is that the range partition boundary information is stored in a serialized format that depends on the current schema of the table, since the serialized data looks like this: {noformat} [type_of_range_boundary][columns_set_bitmap][non_null_bitmap][encoded_range_key] {noformat} Essentially, the sizes of the 'columns_set_bitmap' and 'non_null_bitmap' depend on the total number of columns in the table and the number of nullable columns in the table, respectively. Also, the latter field can be absent if there isn't any nullable column in the table -- that's exactly the case exposed by the reproduction scenario. The information should have been encoded independently of the table schema, similar to how the tablet's start/end ranges are encoded and stored. Alternatively, a primary-key-only schema could have been used to encode the range boundaries in the field of the {{RowOperationsPB}} type -- since the primary key is immutable after table creation, the serialized representation wouldn't change on any allowable ALTER operation on the table. Since the feature is already released and we don't control the deployment of Kudu clients that use the original way of encoding/decoding the data, this adds compatibility constraints.
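To make the schema dependence concrete, the prefix sizing can be sketched as follows (the byte-level sizing here is an illustrative assumption, not Kudu's actual encoder):

```python
def prefix_sizes(num_columns, num_nullable):
    """Illustrative sizes, in bytes, of the two bitmaps that prefix an
    encoded range bound: one bit per column, rounded up to whole bytes.
    The non-null bitmap is absent when the table has no nullable columns."""
    columns_set = (num_columns + 7) // 8
    non_null = (num_nullable + 7) // 8 if num_nullable > 0 else 0
    return columns_set, non_null

# The reproduction scenario: 4 columns, one of them ('city') nullable.
print(prefix_sizes(4, 1))  # (1, 1): two prefix bytes before the encoded key
# After dropping 'city': 3 columns, none nullable.
print(prefix_sizes(3, 0))  # (1, 0): one prefix byte; old bounds no longer parse
```

Any encoding with this property breaks the same way: bytes serialized under the pre-ALTER schema get reinterpreted under the post-ALTER schema, which is why the decoder ends up reporting a bogus split row type.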
The following approach preserves backward compatibility (though it isn't optimal from a performance standpoint): # Upon processing ALTER table operations that add or remove columns, check whether the size of the columns-set bitmap or of the non-null bitmap changes after applying the ALTER operation to the table. # If the size of either bitmap changes, re-encode the information stored as {{PartitionSchemaPB::custom_hash_schema_ranges}} in the system catalog table for the partition ranges of the affected table. The backwards-compatible approach above might still expose a gap when two clients are working with the same table and at least one of them alters the table by dropping/adding columns, but at least it's better than the current state, where a table becomes inaccessible because its schema information becomes effectively corrupted under the conditions described above. > Dropping a nullable column from a table with per-range hash partitions makes > the table unusable > -- > > Key: KUDU-3577 > URL: https://issues.apache.org/jira/browse/KUDU-3577 > Project: Kudu > Issue Type: Bug > Components: client, master, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > > For particular table schemas with per-range hash schemas, dropping a nullable > column might make the table unusable. A workaround exists: just add the > dropped column back using the {{kudu table add_column}} CLI tool. For > example, for the reproduction scenario below, use the following command to > restore access to the table's data: > {noformat} > $ kudu table add_column $M test city string > {noformat} > As for the reproduction scenario, see below for the sequence of {{kudu}} CLI > commands. > Set environment variable for the Kudu cluster's RPC endpoint: > {noformat} > $ export M= > {noformat} > Create a table with two range partitions. It's crucial that the {{city}} > column is nullable.
> {noformat} > $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { > "column_name": "id", "column_type": "INT64" }, { "column_name": "name", > "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, > { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], > "key_column_names": ["id", "name", "age"] }, "partition": { > "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, > {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { > "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' > {noform
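The backward-compatible approach from the comment above (detect a bitmap size change on ALTER, then re-encode the stored ranges) can be sketched like this (illustrative Python with hypothetical helper names; Kudu's real catalog-manager code differs):

```python
def bitmap_sizes(num_columns, num_nullable):
    # One bit per column, rounded up to whole bytes; the non-null bitmap
    # is absent when the table has no nullable columns.
    columns_set = (num_columns + 7) // 8
    non_null = (num_nullable + 7) // 8 if num_nullable > 0 else 0
    return columns_set, non_null

def needs_reencoding(before, after):
    """True when an ALTER changes the size of either bitmap, i.e. when range
    bounds persisted in PartitionSchemaPB::custom_hash_schema_ranges would no
    longer deserialize correctly under the post-ALTER schema.  Each argument
    is a (num_columns, num_nullable) pair."""
    return bitmap_sizes(*before) != bitmap_sizes(*after)

# Dropping the only nullable column: the non-null bitmap disappears.
print(needs_reencoding((4, 1), (3, 0)))  # True
# Dropping a column that shrinks neither bitmap: nothing to re-encode.
print(needs_reencoding((4, 0), (3, 0)))  # False
```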
[jira] [Updated] (KUDU-3577) Dropping a nullable column from a table with per-range hash partitions makes the table unusable
[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3577: Description: For particular table schemas with per-range hash schemas, dropping a nullable column might make the table unusable. A workaround exists: just add the dropped column back using the {{kudu table add_column}} CLI tool. For example, for the reproduction scenario below, use the following command to restore access to the table's data: {noformat} $ kudu table add_column $M test city string {noformat} As for the reproduction scenario, see below for the sequence of {{kudu}} CLI commands. Set environment variable for the Kudu cluster's RPC endpoint: {noformat} $ export M= {noformat} Create a table with two range partitions. It's crucial that the {{city}} column is nullable. {noformat} $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { "column_name": "id", "column_type": "INT64" }, { "column_name": "name", "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], "key_column_names": ["id", "name", "age"] }, "partition": { "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' {noformat} Add an extra range partition with custom hash schema: {noformat} $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}' {noformat} Check the 
updated partitioning info: {noformat} $ kudu table describe $M test TABLE test ( id INT64 NOT NULL, name STRING NOT NULL, age INT32 NOT NULL, city STRING NULLABLE, PRIMARY KEY (id, name, age) ) HASH (id) PARTITIONS 4 SEED 1, HASH (name) PARTITIONS 4 SEED 2, RANGE (age) ( PARTITION 30 <= VALUES < 60, PARTITION 60 <= VALUES < 90, PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3 ) OWNER root REPLICAS 1 COMMENT {noformat} Drop the {{city}} column: {noformat} $ kudu table delete_column $M test city {noformat} Now try to run the {{kudu table describe}} against the table once the {{city}} column is dropped. It errors out with {{Invalid argument}}: {noformat} $ kudu table describe $M test Invalid argument: Invalid split row type UNKNOWN {noformat} A similar issue manifests itself when trying to run {{kudu table scan}} against the table: {noformat} $ kudu table scan $M test Invalid argument: Invalid split row type UNKNOWN {noformat} was: See the reproduction scenario using the {{kudu}} CLI tools below. Set environment variable for the Kudu cluster's RPC endpoint: {noformat} $ export M= {noformat} Create a table with two range partitions. It's crucial that the {{city}} column is nullable. 
{noformat} $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { "column_name": "id", "column_type": "INT64" }, { "column_name": "name", "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], "key_column_names": ["id", "name", "age"] }, "partition": { "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' {noformat} Add an extra range partition with custom hash schema: {noformat} $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}' {noformat} Check the updated partitioning info: {noformat} $ kudu table describe $M test TABLE test ( id INT64 NOT NULL, name STRING NOT NULL, age INT32 NOT NULL, city STRING NULLABLE, PRIMARY KEY (id, name, age) ) HASH (id) PARTITIONS 4 SEED 1, HASH (name) PARTITIONS 4 SEED 2, RANGE (age) ( PARTITION 30 <= VALUES < 60, PARTITION 60 <= VALUES < 90, PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3 ) OWNER root REPLICAS 1 COMMENT {noformat} Drop th
[jira] [Updated] (KUDU-3577) Dropping a nullable column from a table with per-range hash partitions makes the table unusable
[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3577: Component/s: master > Dropping a nullable column from a table with per-range hash partitions make > the table unusable > -- > > Key: KUDU-3577 > URL: https://issues.apache.org/jira/browse/KUDU-3577 > Project: Kudu > Issue Type: Bug > Components: client, master, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > > See the reproduction scenario using the {{kudu}} CLI tools below. > Set environment variable for the Kudu cluster's RPC endpoint: > {noformat} > $ export M= > {noformat} > Create a table with two range partitions. It's crucial that the {{city}} > column is nullable. > {noformat} > $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { > "column_name": "id", "column_type": "INT64" }, { "column_name": "name", > "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, > { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], > "key_column_names": ["id", "name", "age"] }, "partition": { > "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, > {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { > "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": > "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": > "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' > {noformat} > Add an extra range partition with custom hash schema: > {noformat} > $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema > '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, > {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}' > {noformat} > Check the updated partitioning info: > {noformat} > $ 
kudu table describe $M test > TABLE test ( > id INT64 NOT NULL, > name STRING NOT NULL, > age INT32 NOT NULL, > city STRING NULLABLE, > PRIMARY KEY (id, name, age) > ) > HASH (id) PARTITIONS 4 SEED 1, > HASH (name) PARTITIONS 4 SEED 2, > RANGE (age) ( > PARTITION 30 <= VALUES < 60, > PARTITION 60 <= VALUES < 90, > PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3 > ) > OWNER root > REPLICAS 1 > COMMENT > {noformat} > Drop the {{city}} column: > {noformat} > $ kudu table delete_column $M test city > {noformat} > Now try to run the {{kudu table describe}} against the table once the > {{city}} column is dropped. It errors out with {{Invalid argument}}: > {noformat} > $ kudu table describe $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} > A similar issue manifests itself when trying to run {{kudu table scan}} > against the table: > {noformat} > $ kudu table scan $M test > Invalid argument: Invalid split row type UNKNOWN > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3577) Dropping a nullable column from a table with per-range hash partitions makes the table unusable
Alexey Serbin created KUDU-3577: --- Summary: Dropping a nullable column from a table with per-range hash partitions make the table unusable Key: KUDU-3577 URL: https://issues.apache.org/jira/browse/KUDU-3577 Project: Kudu Issue Type: Bug Components: client, tserver Affects Versions: 1.17.0 Reporter: Alexey Serbin See the reproduction scenario using the {{kudu}} CLI tools below. Set environment variable for the Kudu cluster's RPC endpoint: {noformat} $ export M= {noformat} Create a table with two range partitions. It's crucial that the {{city}} column is nullable. {noformat} $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { "column_name": "id", "column_type": "INT64" }, { "column_name": "name", "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], "key_column_names": ["id", "name", "age"] }, "partition": { "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }' {noformat} Add an extra range partition with custom hash schema: {noformat} $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}' {noformat} Check the updated partitioning info: {noformat} $ kudu table describe $M test TABLE test ( id INT64 NOT NULL, name STRING NOT NULL, age INT32 NOT NULL, city STRING NULLABLE, PRIMARY KEY (id, name, age) ) HASH (id) PARTITIONS 4 SEED 1, HASH (name) PARTITIONS 4 SEED 2, RANGE (age) ( 
PARTITION 30 <= VALUES < 60, PARTITION 60 <= VALUES < 90, PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3 ) OWNER root REPLICAS 1 COMMENT {noformat} Drop the {{city}} column: {noformat} $ kudu table delete_column $M test city {noformat} Now try to run the {{kudu table describe}} against the table once the {{city}} column is dropped. It errors out with {{Invalid argument}}: {noformat} $ kudu table describe $M test Invalid argument: Invalid split row type UNKNOWN {noformat} A similar issue manifests itself when trying to run {{kudu table scan}} against the table: {noformat} $ kudu table scan $M test Invalid argument: Invalid split row type UNKNOWN {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KUDU-3576) An NPE thrown in Connection.exceptionCaught() makes the connection to corresponding tablet server unusable
[ https://issues.apache.org/jira/browse/KUDU-3576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3576. - Resolution: Fixed > An NPE thrown in Connection.exceptionCaught() makes the connection to > corresponding tablet server unusable > -- > > Key: KUDU-3576 > URL: https://issues.apache.org/jira/browse/KUDU-3576 > Project: Kudu > Issue Type: Bug > Components: client, java >Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0, 1.17.1 > > > If a Kudu Java client application keeps a connection to a tablet server open > and the tablet server is killed/restarted or a network error happens on the > connection, the client application might end up in a state when it cannot > communicate with the tablet server even after the tablet server is up and > running again. If the application tries to write to any tablet replica that > is hosted at the tablet server, all such requests will timeout on the very > first attempt, and the state of the connection to the server remains in a > limbo since then. The only way to get out of the trouble is to recreate the > affected Java Kudu client instance, e.g., by restarting the application. > More details are below. > Once the NPE is thrown by {{Connection.exceptionCaught()}} upon an attempt to > access null {{ctx}} variable of the {{ChannelHandlerContext}} type, all the > subsequent attempts to send Write RPC to any tablet replica hosted at the > tablet server end up with a timeout on a very first attempt (i.e. there are > no retries): > {noformat} > java.lang.RuntimeException: PendingErrors overflowed. 
Failed to write at > least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before > timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" > [0x000B8134D82B, 0x000B8134D82C), ignoredErrors=[], > rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot > complete before timeout: Batch{operations=1000, > tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, > 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot > complete before timeout: Batch{operations=1000, > tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, > 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot > complete before timeout: Batch{operations=1000, > tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, > 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot > complete before timeout: Batch{operations=1000, > tablet="f4c271e3b0d74d5bb6b45ea06987f395" 
[0x000B8134D82B, > 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))} > {noformat} > The root cause of the problem manifests itself as an NPE in > {{Connection.exceptionCaught()}} with a stack trace like below: > {noformat} > 24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by > org.apache.kudu.client.Connection$1.operationComplete() > java.lang.NullPointerException > at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434) > at > org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746) > at > org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) > at > org.apache.kudu.s
[jira] [Created] (KUDU-3576) An NPE thrown in Connection.exceptionCaught() makes the connection to corresponding tablet server unusable
Alexey Serbin created KUDU-3576: --- Summary: An NPE thrown in Connection.exceptionCaught() makes the connection to corresponding tablet server unusable Key: KUDU-3576 URL: https://issues.apache.org/jira/browse/KUDU-3576 Project: Kudu Issue Type: Bug Components: client, java Affects Versions: 1.17.0, 1.16.0, 1.15.0, 1.14.0, 1.13.0, 1.12.0 Reporter: Alexey Serbin Fix For: 1.18.0, 1.17.1 If a Kudu Java client application keeps a connection to a tablet server open and the tablet server is killed/restarted or a network error happens on the connection, the client application might end up in a state where it cannot communicate with the tablet server even after the tablet server is up and running again. If the application tries to write to any tablet replica that is hosted at the tablet server, all such requests will time out on the very first attempt, and the state of the connection to the server remains in limbo from then on. The only way out is to recreate the affected Java Kudu client instance, e.g., by restarting the application. More details are below. Once the NPE is thrown by {{Connection.exceptionCaught()}} upon an attempt to access the null {{ctx}} variable of the {{ChannelHandlerContext}} type, all the subsequent attempts to send a Write RPC to any tablet replica hosted at the tablet server end up with a timeout on the very first attempt (i.e. there are no retries): {noformat} java.lang.RuntimeException: PendingErrors overflowed. 
Failed to write at least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot complete before timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, 0x000B8134D82C), ignoredErrors=[], 
rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))} {noformat} The root cause of the problem manifests itself as an NPE in {{Connection.exceptionCaught()}} with a stack trace like below: {noformat} 24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by org.apache.kudu.client.Connection$1.operationComplete() java.lang.NullPointerException at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434) at org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:6
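The failure mode described above is a general one: when an error handler itself throws, the cleanup it was supposed to perform never happens. A minimal sketch of the mechanism (illustrative Python, not the Java client's actual code):

```python
class Connection:
    """Illustrative stand-in for a client connection; the real Kudu Java
    client and Netty types differ."""

    def __init__(self):
        self.pending = []  # callbacks for in-flight RPCs

    def exception_caught(self, ctx, error):
        # Mirrors the bug: dereferencing a missing context raises before
        # the pending RPCs are failed, so they silently wait until timeout.
        ctx.close()
        self.fail_pending(error)  # never reached when ctx is None

    def fail_pending(self, error):
        for callback in self.pending:
            callback(error)
        self.pending.clear()

conn = Connection()
conn.pending.append(lambda e: print("rpc failed:", e))
try:
    conn.exception_caught(None, RuntimeError("connection reset"))
except AttributeError:
    pass  # the handler blew up; the pending RPC was never failed
print(len(conn.pending))  # 1: the RPC is stuck and will just time out
```

The robust version guards the context (and wraps the cleanup in its own try/except) so that pending RPCs are always failed fast and the connection can be reopened.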
[jira] [Updated] (KUDU-3576) An NPE thrown in Connection.exceptionCaught() makes the connection to corresponding tablet server unusable
[ https://issues.apache.org/jira/browse/KUDU-3576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3576: Code Review: http://gerrit.cloudera.org:8080/20858 > An NPE thrown in Connection.exceptionCaught() makes the connection to > corresponding tablet server unusable > -- > > Key: KUDU-3576 > URL: https://issues.apache.org/jira/browse/KUDU-3576 > Project: Kudu > Issue Type: Bug > Components: client, java >Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0, 1.17.1 > > > If a Kudu Java client application keeps a connection to a tablet server open > and the tablet server is killed/restarted or a network error happens on the > connection, the client application might end up in a state when it cannot > communicate with the tablet server even after the tablet server is up and > running again. If the application tries to write to any tablet replica that > is hosted at the tablet server, all such requests will timeout on the very > first attempt, and the state of the connection to the server remains in a > limbo since then. The only way to get out of the trouble is to recreate the > affected Java Kudu client instance, e.g., by restarting the application. > More details are below. > Once the NPE is thrown by {{Connection.exceptionCaught()}} upon an attempt to > access null {{ctx}} variable of the {{ChannelHandlerContext}} type, all the > subsequent attempts to send Write RPC to any tablet replica hosted at the > tablet server end up with a timeout on a very first attempt (i.e. there are > no retries): > {noformat} > java.lang.RuntimeException: PendingErrors overflowed. 
Failed to write at > least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before > timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395" > [0x000B8134D82B, 0x000B8134D82C), ignoredErrors=[], > rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot > complete before timeout: Batch{operations=1000, > tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, > 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot > complete before timeout: Batch{operations=1000, > tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, > 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot > complete before timeout: Batch{operations=1000, > tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x000B8134D82B, > 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot > complete before timeout: Batch{operations=1000, > tablet="f4c271e3b0d74d5bb6b45ea06987f395" 
[0x000B8134D82B, > 0x000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1, > TimeoutTracker(timeout=3, elapsed=30018), Trace Summary(0 ms): Sent(1), > Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false > Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))} > {noformat} > The root cause of the problem manifests itself as an NPE in > {{Connection.exceptionCaught()}} with a stack trace like below: > {noformat} > 24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by > org.apache.kudu.client.Connection$1.operationComplete() > java.lang.NullPointerException > at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434) > at > org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746) > at > org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:5
[jira] [Resolved] (KUDU-3568) TestRowSetCompactionSkipWithBudgetingConstraints fails when run on some nodes
[ https://issues.apache.org/jira/browse/KUDU-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3568. - Fix Version/s: 1.18.0 Resolution: Fixed > TestRowSetCompactionSkipWithBudgetingConstraints fails when run on some nodes > - > > Key: KUDU-3568 > URL: https://issues.apache.org/jira/browse/KUDU-3568 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.18.0 >Reporter: Alexey Serbin >Assignee: Ashwani Raina >Priority: Major > Fix For: 1.18.0 > > Attachments: test-failure.log.xz > > > The {{TestCompaction.TestRowSetCompactionSkipWithBudgetingConstraints}} > scenario fails with an error like the one below when run on a machine with > relatively high memory (it might be just a Docker instance with tiny actual > memory allocated, but having access to the {{/proc}} filesystem of the > host machine). The full test log is attached. > {noformat} > src/kudu/tablet/compaction-test.cc:908: Failure > Value of: JoinStrings(sink.logged_msgs(), "\n") > Expected: has substring "removed from compaction input due to memory > constraints" > Actual: "I20240425 10:13:05.497732 3573764 compaction-test.cc:902] > CompactRowSetsOp complete. 
Timing: real 0.673s\tuser 0.669s\tsys 0.004s > Metrics: > {\"bytes_written\":4817,\"cfile_cache_hit\":90,\"cfile_cache_hit_bytes\":4310,\"cfile_cache_miss\":330,\"cfile_cache_miss_bytes\":3794180,\"cfile_init\":41,\"delta_iterators_relevant\":40,\"dirs.queue_time_us\":503,\"dirs.run_cpu_time_us\":338,\"dirs.run_wall_time_us\":1780,\"drs_written\":1,\"lbm_read_time_us\":1951,\"lbm_reads_lt_1ms\":494,\"lbm_write_time_us\":1767,\"lbm_writes_lt_1ms\":132,\"mutex_wait_us\":189,\"num_input_rowsets\":10,\"peak_mem_usage\":2147727,\"rows_written\":20,\"thread_start_us\":242,\"threads_started\":5}" > (of type std::string) > {noformat} > For extra information, below is 10 lines from {{/proc/meminfo}} file on a > node where the test failed: > {noformat} > # cat /proc/meminfo | head -10 > MemTotal: 527417196 kB > MemFree:96640684 kB > MemAvailable: 363590980 kB > Buffers:15352304 kB > Cached: 246687576 kB > SwapCached: 1294016 kB > Active: 214889608 kB > Inactive: 189745504 kB > Active(anon): 133110648 kB > Inactive(anon): 16977280 kB > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3575) Update kudu::tools::ParseValue() to handle all supported column types
Alexey Serbin created KUDU-3575: --- Summary: Update kudu::tools::ParseValue() to handle all supported column types Key: KUDU-3575 URL: https://issues.apache.org/jira/browse/KUDU-3575 Project: Kudu Issue Type: Task Components: CLI Affects Versions: 1.17.0, 1.16.0, 1.15.0, 1.14.0, 1.13.0, 1.11.1, 1.12.0, 1.11.0, 1.10.1, 1.10.0, 1.9.0 Reporter: Alexey Serbin With [0afeddf9e530762e0e47beb7428982763715c746|https://github.com/apache/kudu/commit/0afeddf9e530762e0e47beb7428982763715c746], new functionality was introduced in Kudu 1.9.0. It's currently used in various CLI tools such as {{kudu table scan}}, {{kudu table copy}}, etc. However, when predicates are used, the tools did not handle all the column types available in Kudu tables at the time, and with the addition of new types such as DECIMAL and VARCHAR, the {{kudu::tools::ParseValue()}} utility function became even more outdated. As a result, an attempt to run the corresponding CLI tools against tables using predicates on columns of particular types (e.g. 
UNIXTIME_MICROS) results in a crash due to SIGABRT, with stack traces like the one below produced by the {{kudu table copy}} CLI tool: {noformat} F0509 13:08:49.058050 226781 table_scanner.cc:189] unhandled data type 9 *** Check failure stack trace: *** @ 0x1411c8d google::LogMessage::Fail() @ 0x141656d google::LogMessage::SendToLog() @ 0x1411970 google::LogMessage::Flush() @ 0x14121d9 google::LogMessageFatal::~LogMessageFatal() @ 0x145d955 kudu::tools::ParseValue() @ 0x145f1e2 kudu::tools::NewComparisonPredicate() @ 0x14606b0 kudu::tools::AddPredicate() @ 0x14610ee kudu::tools::AddPredicates() @ 0x146297c kudu::tools::TableScanner::StartWork() @ 0x1465232 kudu::tools::TableScanner::StartCopy() @ 0xde7415 kudu::tools::(anonymous namespace)::CopyTable() @ 0xd87dd4 std::_Function_handler<>::_M_invoke() @ 0x14667b2 kudu::tools::Action::Run() @ 0xe166b5 kudu::tools::DispatchCommand() @ 0xe17325 kudu::tools::RunTool() @ 0xd080c4 main {noformat} It's necessary to update the implementation of {{kudu::tools::ParseValue()}} to handle all the supported column types. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3574) MasterAuthzITest.TestAuthzListTablesConcurrentRename fails from time to time
[ https://issues.apache.org/jira/browse/KUDU-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3574: Affects Version/s: 1.17.0 > MasterAuthzITest.TestAuthzListTablesConcurrentRename fails from time to time > > > Key: KUDU-3574 > URL: https://issues.apache.org/jira/browse/KUDU-3574 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: master_authz-itest.6.txt.xz > > > The {{MasterAuthzITest.TestAuthzListTablesConcurrentRename}} scenario > sometimes fails with errors like the one below: > {noformat} > src/kudu/integration-tests/master_authz-itest.cc:913: Failure > Expected equality of these values: > 1 > tables.size() > Which is: 2 > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3574) MasterAuthzITest.TestAuthzListTablesConcurrentRename fails from time to time
Alexey Serbin created KUDU-3574: --- Summary: MasterAuthzITest.TestAuthzListTablesConcurrentRename fails from time to time Key: KUDU-3574 URL: https://issues.apache.org/jira/browse/KUDU-3574 Project: Kudu Issue Type: Bug Reporter: Alexey Serbin Attachments: master_authz-itest.6.txt.xz The {{MasterAuthzITest.TestAuthzListTablesConcurrentRename}} scenario sometimes fails with errors like the one below: {noformat} src/kudu/integration-tests/master_authz-itest.cc:913: Failure Expected equality of these values: 1 tables.size() Which is: 2 {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3573) TestNewOpsDontGetScheduledDuringUnregister sometimes fail
[ https://issues.apache.org/jira/browse/KUDU-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3573: Affects Version/s: 1.17.0 > TestNewOpsDontGetScheduledDuringUnregister sometimes fail > - > > Key: KUDU-3573 > URL: https://issues.apache.org/jira/browse/KUDU-3573 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: maintenance_manager-test.txt.xz > > > The {{MaintenanceManagerTest.TestNewOpsDontGetScheduledDuringUnregister}} > scenario fails from time to time with output like below: > {noformat} > src/kudu/util/maintenance_manager-test.cc:468: Failure > Expected: (op1.DurationHistogram()->TotalCount()) <= (2), actual: 3 vs 2 > {noformat} > Full output produced by the test scenario is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3573) TestNewOpsDontGetScheduledDuringUnregister sometimes fail
Alexey Serbin created KUDU-3573: --- Summary: TestNewOpsDontGetScheduledDuringUnregister sometimes fail Key: KUDU-3573 URL: https://issues.apache.org/jira/browse/KUDU-3573 Project: Kudu Issue Type: Bug Reporter: Alexey Serbin Attachments: maintenance_manager-test.txt.xz The {{MaintenanceManagerTest.TestNewOpsDontGetScheduledDuringUnregister}} scenario fails from time to time with output like below: {noformat} src/kudu/util/maintenance_manager-test.cc:468: Failure Expected: (op1.DurationHistogram()->TotalCount()) <= (2), actual: 3 vs 2 {noformat} Full output produced by the test scenario is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3566) Incorrect semantics for Prometheus-style histogram metrics
[ https://issues.apache.org/jira/browse/KUDU-3566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3566: Fix Version/s: 1.18.0 Resolution: Fixed Status: Resolved (was: In Review) > Incorrect semantics for Prometheus-style histogram metrics > -- > > Key: KUDU-3566 > URL: https://issues.apache.org/jira/browse/KUDU-3566 > Project: Kudu > Issue Type: Bug > Components: master, tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Labels: metrics, observability > Fix For: 1.18.0 > > > The original KUDU-3375 implementation incorrectly exposes [summary-type > Prometheus metrics|https://prometheus.io/docs/concepts/metric_types/#summary] > as [histogram-type > ones|https://prometheus.io/docs/concepts/metric_types/#histogram] for data > collected by the corresponding HDR histograms. For example, below are snippets > from {{/metrics}} and {{/metrics_prometheus}} for statistics on the ListMasters > RPC. > The data exposed as Prometheus-style histogram metrics should have been > reported as summary metrics instead. 
> JSON-style: > {noformat} > { > "name": "handler_latency_kudu_master_MasterService_ListMasters", > "total_count": 26, > "min": 152, > "mean": 301.2692307692308, > "percentile_75": 324, > "percentile_95": 468, > "percentile_99": 844, > "percentile_99_9": 844, > "percentile_99_99": 844, > "max": 844, > "total_sum": 7833 > } > {noformat} > Prometheus-style counterpart: > {noformat} > # HELP kudu_master_handler_latency_kudu_master_MasterService_ListMasters > Microseconds spent handling kudu.master.MasterService.ListMasters RPC requests > # TYPE kudu_master_handler_latency_kudu_master_MasterService_ListMasters > histogram > kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds", > le="0.75"} 324 > kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds", > le="0.95"} 468 > kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds", > le="0.99"} 844 > kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds", > le="0.999"} 844 > kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds", > le="0."} 844 > kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds", > le="+Inf"} 26 > kudu_master_handler_latency_kudu_master_MasterService_ListMasters_sum{unit_type="microseconds"} > 7833 > kudu_master_handler_latency_kudu_master_MasterService_ListMasters_count{unit_type="microseconds"} > 26 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KUDU-3216) CatalogManagerTskITest.LeadershipChangeOnTskGeneration sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3216. - Fix Version/s: 1.18.0 Resolution: Fixed > CatalogManagerTskITest.LeadershipChangeOnTskGeneration sometimes fails > -- > > Key: KUDU-3216 > URL: https://issues.apache.org/jira/browse/KUDU-3216 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.13.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0 > > Attachments: catalog_manager_tsk-itest.txt.xz > > > The {{CatalogManagerTskITest.LeadershipChangeOnTskGeneration}} sometimes > fails with the following error: > {noformat} > src/kudu/integration-tests/catalog_manager_tsk-itest.cc:129: Failure > Failed > > Bad status: Service unavailable: Error creating table test-table on the > master: an error occurred while writing to the sys-catalog: leader is not yet > ready > {noformat} > This time the issue happened with a DEBUG build; the full log is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()
[ https://issues.apache.org/jira/browse/KUDU-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3569: Fix Version/s: 1.17.1 > Data race in CFileSet::Iterator::OptimizePKPredicates() > --- > > Key: KUDU-3569 > URL: https://issues.apache.org/jira/browse/KUDU-3569 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0, 1.17.1 > > > Running {{alter_table-randomized-test}} under TSAN produced data race > warnings like below, indicating a race in > {{CFileSet::Iterator::OptimizePKPredicates()}}. One actor was > {{tablet::AlterSchemaOp::Apply()}} initiated by AlterTable, the other > concurrent actor was the maintenance thread running major delta compaction. > Apparently, the same data race might happen if the other concurrent actor was > a thread handling a scan request containing IN-list predicates optimized at > the DRS level. > {noformat} > WARNING: ThreadSanitizer: data race (pid=3919595) > Write of size 8 at 0x7b44000f4a20 by thread T7: > #0 std::__1::__vector_base long>>::__destruct_at_end(unsigned long*) > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:429:12 > (kudu+0x4d4080) > #1 std::__1::__vector_base long>>::clear() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:371:29 > (kudu+0x4d3f94) > #2 std::__1::__vector_base long>>::~__vector_base() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:465:9 > (kudu+0x4d3d4b) > #3 std::__1::vector > >::~vector() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:557:5 > (kudu+0x4d1261) > #4 kudu::Schema::~Schema() > /root/Projects/kudu/src/kudu/common/schema.h:491:7 (kudu+0x4cc40f) > #5 std::__1::__shared_ptr_emplace std::__1::allocator>::__on_zero_shared() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3503:23 > (libtablet.so+0x389d45) > #6 std::__1::__shared_count::__release_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3341:9 > (kudu+0x4d4d05) > #7 std::__1::__shared_weak_count::__release_shared() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3383:27 > (kudu+0x4d4ca9) > #8 std::__1::shared_ptr::~shared_ptr() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:4098:19 > (kudu+0x5303e8) > #9 > kudu::tablet::TabletMetadata::SetSchema(std::__1::shared_ptr > const&, unsigned int) > /root/Projects/kudu/src/kudu/tablet/tablet_metadata.cc:957:1 > (libtablet.so+0x4d8882) > #10 kudu::tablet::Tablet::AlterSchema(kudu::tablet::AlterSchemaOpState*) > /root/Projects/kudu/src/kudu/tablet/tablet.cc:1727:14 (libtablet.so+0x32720a) > #11 kudu::tablet::AlterSchemaOp::Apply(kudu::consensus::CommitMsg**) > /root/Projects/kudu/src/kudu/tablet/ops/alter_schema_op.cc:127:3 > (libtablet.so+0x4013f8) > #12 kudu::tablet::OpDriver::ApplyTask() > /root/Projects/kudu/src/kudu/tablet/ops/op_driver.cc:527:21 > (libtablet.so+0x40873a) > ... 
> Previous read of size 8 at 0x7b44000f4a20 by thread T22 (mutexes: write > M799524414306809968, write M765184518688777856): > #0 std::__1::vector > >::empty() const > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:664:41 > (kudu+0x5ca926) > #1 kudu::Schema::initialized() const > /root/Projects/kudu/src/kudu/common/schema.h:676:26 (kudu+0x5ca3fd) > #2 kudu::Schema::key_byte_size() const > /root/Projects/kudu/src/kudu/common/schema.h:572:5 > (libkudu_common.so+0x171eae) > #3 kudu::EncodedKey::DecodeEncodedString(kudu::Schema const&, > kudu::Arena*, kudu::Slice const&, kudu::EncodedKey**) > /root/Projects/kudu/src/kudu/common/encoded_key.cc:60:76 > (libkudu_common.so+0x171091) > #4 > kudu::tablet::CFileSet::Iterator::OptimizePKPredicates(kudu::ScanSpec*) > /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:444:5 (libtablet.so+0x428934) > #5 kudu::tablet::CFileSet::Iterator::Init(kudu::ScanSpec*) > /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:410:3 (libtablet.so+0x4285d7) > #6 kudu::MaterializingIterator::Init(kudu::ScanSpec*) > /root/Projects/kudu/src/kudu/common/generic_iterators.cc:1176:3 > (libkudu_common.so+0x178872) > #7 > kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext > const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:130:3 > (libtablet.so+0x54ca30) > #8 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext > const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:340:3 > (libtablet.so+0x54ead0) > #9 > kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector std::__1::allocator > const&,
[jira] [Updated] (KUDU-3570) Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running
[ https://issues.apache.org/jira/browse/KUDU-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3570: Fix Version/s: 1.18.0 1.17.1 Resolution: Fixed Status: Resolved (was: In Review) > Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is > running > -- > > Key: KUDU-3570 > URL: https://issues.apache.org/jira/browse/KUDU-3570 > Project: Kudu > Issue Type: Bug > Components: tserver >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0, 1.17.1 > > > Running {{alter_table-randomized-test}} under TSAN produced > heap-use-after-free and data race warnings like the ones below, indicating > these conditions might be hit when a major delta compaction > (MajorDeltaCompactionOp) maintenance operation runs while the table is being > altered. > In addition to the TSAN warnings, running the {{alter_table-randomized-test}} for > DEBUG/ASAN/TSAN builds would crash with SIGABRT and fatal messages like below > due to a triggered DCHECK constraint. In a RELEASE build that condition might > lead to unexpected behavior (e.g., silent data corruption or a crash) > when {{Schema::num_columns()}} or other methods are called on a corrupted > {{Schema}} object. > The DCHECK triggers a crash via SIGABRT with garbage size values: > {noformat} > F20240426 14:25:15.006683 245509 schema.h:584] Check failed: cols_.size() == > name_to_index_.size() (5270498306772959232 vs. 
643461730718517486) > *** Check failure stack trace: *** > @ 0x7f2006677390 google::LogMessage::Flush() > @ 0x7f200667c4cb google::LogMessageFatal::~LogMessageFatal() > @ 0x4eefff kudu::Schema::num_columns() > @ 0x7f200dd18529 kudu::tablet::DeltaPreparer<>::Start() > @ 0x7f200dcde94f kudu::tablet::DeltaFileIterator<>::PrepareBatch() > @ 0x7f200dd07d81 kudu::tablet::DeltaIteratorMerger::PrepareBatch() > @ 0x7f200dd012b1 > kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas() > @ 0x7f200dd02fa1 kudu::tablet::MajorDeltaCompaction::Compact() > @ 0x7f200dc1f85d > kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds() > @ 0x7f200dc1f504 kudu::tablet::DiskRowSet::MajorCompactDeltaStores() > @ 0x7f200dae1cc3 kudu::tablet::Tablet::CompactWorstDeltas() > @ 0x7f200db74cd7 kudu::tablet::MajorDeltaCompactionOp::Perform() > @ 0x7f2007498827 kudu::MaintenanceManager::LaunchOp() > @ 0x7f200749c773 > kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()() > {noformat} > TSAN warning on use-after-free: > {noformat} > WARNING: ThreadSanitizer: heap-use-after-free (pid=3392364) > Read of size 8 at 0x7b4400100060 by thread T23 (mutexes: write > M236855935862339888, write M206456689917755456): > #0 std::__1::vector std::__1::allocator >::size() const > thirdparty/installed/tsan/include/c++/v1/vector:658:46 (kudu+0x4ee25b) > #1 kudu::Schema::num_columns() const src/kudu/common/schema.h:584:5 > (kudu+0x4eef50) > #2 > kudu::tablet::DeltaPreparer > >::Start(unsigned long, int) src/kudu/tablet/delta_store.cc:204:46 > (libtablet.so+0x578488) > #3 > kudu::tablet::DeltaFileIterator<(kudu::tablet::DeltaType)0>::PrepareBatch(unsigned > long, int) src/kudu/tablet/deltafile.cc:608:13 (libtablet.so+0x53e8ae) > #4 kudu::tablet::DeltaIteratorMerger::PrepareBatch(unsigned long, int) > src/kudu/tablet/delta_iterator_merger.cc:66:5 (libtablet.so+0x567ce0) > #5 > kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext > const*) 
src/kudu/tablet/delta_compaction.cc:155:5 (libtablet.so+0x561210) > #6 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext > const*) src/kudu/tablet/delta_compaction.cc:340:3 (libtablet.so+0x562f00) > #7 > kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector std::__1::allocator > const&, kudu::fs::IOContext const*, > kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:588:3 > (libtablet.so+0x47f7bc) > #8 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(kudu::fs::IOContext > const*, kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:572:10 > (libtablet.so+0x47f463) > #9 > kudu::tablet::Tablet::CompactWorstDeltas(kudu::tablet::RowSet::DeltaCompactionType) > src/kudu/tablet/tablet.cc:2881:5 (libtablet.so+0x341c92) > #10 kudu::tablet::MajorDeltaCompactionOp::Perform() > src/kudu/tablet/tablet_mm_ops.cc:364:3 (libtablet.so+0x3d4ca6) > #11 kudu::MaintenanceManager::LaunchOp(kudu::MaintenanceOp*) > src/kudu/util/maintenance_manager.cc:640:9 (libkudu_util.so+0x38c826) > #12 kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()() > const src/kudu/ut
[jira] [Created] (KUDU-3571) AutoIncrementingItest.BootstrapNoWalsNoData fails sometimes
Alexey Serbin created KUDU-3571: --- Summary: AutoIncrementingItest.BootstrapNoWalsNoData fails sometimes Key: KUDU-3571 URL: https://issues.apache.org/jira/browse/KUDU-3571 Project: Kudu Issue Type: Bug Components: test Affects Versions: 1.17.0 Reporter: Alexey Serbin Attachments: auto_incrementing-itest.txt.xz The {{AutoIncrementingItest.BootstrapNoWalsNoData}} scenario fails from time to time with one of its assertions triggered, see below. Full log is attached. {noformat} /root/Projects/kudu/src/kudu/tserver/tablet_server-test-base.cc:362: Failure Failed Bad status: Invalid argument: Index 0 does not reference a valid sidecar /root/Projects/kudu/src/kudu/integration-tests/auto_incrementing-itest.cc:446: Failure Expected equality of these values: 200 results.size() Which is: 0 {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KUDU-3570) Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running
[ https://issues.apache.org/jira/browse/KUDU-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3570: Code Review: https://gerrit.cloudera.org/#/c/21362/
[jira] [Updated] (KUDU-3570) Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running
[ https://issues.apache.org/jira/browse/KUDU-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3570: Status: In Review (was: Open)
[jira] [Updated] (KUDU-3570) Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running
[ https://issues.apache.org/jira/browse/KUDU-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3570: Description: Running {{alter_table-randomized-test}} under TSAN produced heap-use-after-free and data race warnings, indicating these conditions might be hit when a major delta compaction (MajorDeltaCompactionOp) maintenance operation runs while the table is being altered.
[jira] [Created] (KUDU-3570) Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running
Alexey Serbin created KUDU-3570: --- Summary: Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running Key: KUDU-3570 URL: https://issues.apache.org/jira/browse/KUDU-3570 Project: Kudu Issue Type: Bug Components: tserver Reporter: Alexey Serbin Running {{alter_table-randomized-test}} under TSAN produced heap-use-after-free and data race warnings like the ones below, indicating that these conditions can be hit when a major delta compaction maintenance operation (MajorDeltaCompactionOp) runs while the table is being altered. In addition to the TSAN warnings, running {{alter_table-randomized-test}} in DEBUG/ASAN/TSAN builds crashes with SIGABRT and fatal messages like the ones below due to a triggered DCHECK constraint. In a RELEASE build, the same condition might lead to silent data corruption or a crash. The DCHECK triggers a SIGABRT crash with nonsensical size values, consistent with reading already-freed memory: {noformat} F20240426 14:25:15.006683 245509 schema.h:584] Check failed: cols_.size() == name_to_index_.size() (5270498306772959232 vs. 
643461730718517486) *** Check failure stack trace: *** @ 0x7f2006677390 google::LogMessage::Flush() @ 0x7f200667c4cb google::LogMessageFatal::~LogMessageFatal() @ 0x4eefff kudu::Schema::num_columns() @ 0x7f200dd18529 kudu::tablet::DeltaPreparer<>::Start() @ 0x7f200dcde94f kudu::tablet::DeltaFileIterator<>::PrepareBatch() @ 0x7f200dd07d81 kudu::tablet::DeltaIteratorMerger::PrepareBatch() @ 0x7f200dd012b1 kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas() @ 0x7f200dd02fa1 kudu::tablet::MajorDeltaCompaction::Compact() @ 0x7f200dc1f85d kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds() @ 0x7f200dc1f504 kudu::tablet::DiskRowSet::MajorCompactDeltaStores() @ 0x7f200dae1cc3 kudu::tablet::Tablet::CompactWorstDeltas() @ 0x7f200db74cd7 kudu::tablet::MajorDeltaCompactionOp::Perform() @ 0x7f2007498827 kudu::MaintenanceManager::LaunchOp() @ 0x7f200749c773 kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()() {noformat} TSAN warning on use-after-free: {noformat} WARNING: ThreadSanitizer: heap-use-after-free (pid=3392364) Read of size 8 at 0x7b4400100060 by thread T23 (mutexes: write M236855935862339888, write M206456689917755456): #0 std::__1::vector >::size() const thirdparty/installed/tsan/include/c++/v1/vector:658:46 (kudu+0x4ee25b) #1 kudu::Schema::num_columns() const src/kudu/common/schema.h:584:5 (kudu+0x4eef50) #2 kudu::tablet::DeltaPreparer >::Start(unsigned long, int) src/kudu/tablet/delta_store.cc:204:46 (libtablet.so+0x578488) #3 kudu::tablet::DeltaFileIterator<(kudu::tablet::DeltaType)0>::PrepareBatch(unsigned long, int) src/kudu/tablet/deltafile.cc:608:13 (libtablet.so+0x53e8ae) #4 kudu::tablet::DeltaIteratorMerger::PrepareBatch(unsigned long, int) src/kudu/tablet/delta_iterator_merger.cc:66:5 (libtablet.so+0x567ce0) #5 kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext const*) src/kudu/tablet/delta_compaction.cc:155:5 (libtablet.so+0x561210) #6 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext 
const*) src/kudu/tablet/delta_compaction.cc:340:3 (libtablet.so+0x562f00) #7 kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector > const&, kudu::fs::IOContext const*, kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:588:3 (libtablet.so+0x47f7bc) #8 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(kudu::fs::IOContext const*, kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:572:10 (libtablet.so+0x47f463) #9 kudu::tablet::Tablet::CompactWorstDeltas(kudu::tablet::RowSet::DeltaCompactionType) src/kudu/tablet/tablet.cc:2881:5 (libtablet.so+0x341c92) #10 kudu::tablet::MajorDeltaCompactionOp::Perform() src/kudu/tablet/tablet_mm_ops.cc:364:3 (libtablet.so+0x3d4ca6) #11 kudu::MaintenanceManager::LaunchOp(kudu::MaintenanceOp*) src/kudu/util/maintenance_manager.cc:640:9 (libkudu_util.so+0x38c826) #12 kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()() const src/kudu/util/maintenance_manager.cc:422:5 (libkudu_util.so+0x390772) ... Previous write of size 8 at 0x7b4400100060 by thread T158: #0 operator delete(void*) thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:126 (kudu+0x4dd4e9) #1 std::__1::_DeallocateCaller::__do_call(void*) thirdparty/installed/tsan/include/c++/v1/new:334:12 (kudu+0x4e9389) #2 std::__1::_DeallocateCaller::__do_deallocate_handle_size(void*, unsigned long) thirdparty/installed/tsan/include/c++/v1/new:292:12 (kudu+0x4e9329) #3 std::__1::_DeallocateCaller::__do_deallocate_handle_size_align(void*, unsigned long, unsigned long) thirdparty/installed/tsan/include/c++/v1/new:268:14
[jira] [Updated] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()
[ https://issues.apache.org/jira/browse/KUDU-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3569: Fix Version/s: 1.18.0 Resolution: Fixed Status: Resolved (was: In Review) > Data race in CFileSet::Iterator::OptimizePKPredicates() > --- > > Key: KUDU-3569 > URL: https://issues.apache.org/jira/browse/KUDU-3569 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: 1.18.0 > > > Running {{alter_table-randomized-test}} under TSAN produced data race > warnings like below, indicating a race in > {{CFileSet::Iterator::OptimizePKPredicates()}}. One actor was > {{tablet::AlterSchemaOp::Apply()}} initiated by AlterTable, the other > concurrent actor was the maintenance thread running major delta compaction. > Apparently, the same data race might happen if the other concurrent actor was > a thread handling a scan request containing IN-list predicates optimized at > the DRS level. 
> {noformat} > WARNING: ThreadSanitizer: data race (pid=3919595) > Write of size 8 at 0x7b44000f4a20 by thread T7: > #0 std::__1::__vector_base long>>::__destruct_at_end(unsigned long*) > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:429:12 > (kudu+0x4d4080) > #1 std::__1::__vector_base long>>::clear() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:371:29 > (kudu+0x4d3f94) > #2 std::__1::__vector_base long>>::~__vector_base() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:465:9 > (kudu+0x4d3d4b) > #3 std::__1::vector > >::~vector() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:557:5 > (kudu+0x4d1261) > #4 kudu::Schema::~Schema() > /root/Projects/kudu/src/kudu/common/schema.h:491:7 (kudu+0x4cc40f) > #5 std::__1::__shared_ptr_emplace std::__1::allocator>::__on_zero_shared() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3503:23 > (libtablet.so+0x389d45) > #6 std::__1::__shared_count::__release_shared() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3341:9 > (kudu+0x4d4d05) > #7 std::__1::__shared_weak_count::__release_shared() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3383:27 > (kudu+0x4d4ca9) > #8 std::__1::shared_ptr::~shared_ptr() > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:4098:19 > (kudu+0x5303e8) > #9 > kudu::tablet::TabletMetadata::SetSchema(std::__1::shared_ptr > const&, unsigned int) > /root/Projects/kudu/src/kudu/tablet/tablet_metadata.cc:957:1 > (libtablet.so+0x4d8882) > #10 kudu::tablet::Tablet::AlterSchema(kudu::tablet::AlterSchemaOpState*) > /root/Projects/kudu/src/kudu/tablet/tablet.cc:1727:14 (libtablet.so+0x32720a) > #11 kudu::tablet::AlterSchemaOp::Apply(kudu::consensus::CommitMsg**) > /root/Projects/kudu/src/kudu/tablet/ops/alter_schema_op.cc:127:3 > (libtablet.so+0x4013f8) > #12 kudu::tablet::OpDriver::ApplyTask() > 
/root/Projects/kudu/src/kudu/tablet/ops/op_driver.cc:527:21 > (libtablet.so+0x40873a) > ... > Previous read of size 8 at 0x7b44000f4a20 by thread T22 (mutexes: write > M799524414306809968, write M765184518688777856): > #0 std::__1::vector > >::empty() const > /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:664:41 > (kudu+0x5ca926) > #1 kudu::Schema::initialized() const > /root/Projects/kudu/src/kudu/common/schema.h:676:26 (kudu+0x5ca3fd) > #2 kudu::Schema::key_byte_size() const > /root/Projects/kudu/src/kudu/common/schema.h:572:5 > (libkudu_common.so+0x171eae) > #3 kudu::EncodedKey::DecodeEncodedString(kudu::Schema const&, > kudu::Arena*, kudu::Slice const&, kudu::EncodedKey**) > /root/Projects/kudu/src/kudu/common/encoded_key.cc:60:76 > (libkudu_common.so+0x171091) > #4 > kudu::tablet::CFileSet::Iterator::OptimizePKPredicates(kudu::ScanSpec*) > /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:444:5 (libtablet.so+0x428934) > #5 kudu::tablet::CFileSet::Iterator::Init(kudu::ScanSpec*) > /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:410:3 (libtablet.so+0x4285d7) > #6 kudu::MaterializingIterator::Init(kudu::ScanSpec*) > /root/Projects/kudu/src/kudu/common/generic_iterators.cc:1176:3 > (libkudu_common.so+0x178872) > #7 > kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext > const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:130:3 > (libtablet.so+0x54ca30) > #8 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext > const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:340:3 > (libtablet.so+0x54ead0) > #9 > kudu::tablet::DiskRowSet::MajorCompactDeltaStor
[jira] [Updated] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()
[ https://issues.apache.org/jira/browse/KUDU-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3569: Status: In Review (was: Open)
[jira] [Updated] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()
[ https://issues.apache.org/jira/browse/KUDU-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3569: Code Review: http://gerrit.cloudera.org:8080/21359
[jira] [Updated] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()
[ https://issues.apache.org/jira/browse/KUDU-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3569: Description: (edited; the data-race report and TSAN trace match the first KUDU-3569 message quoted above)