Hi Yuji and Ben,

I tried the revised script and hit the same issue. I think it's definitely a bug that should be fixed ASAP.
> Ben: What do you mean by "an undocumented limitation"?

Thanks,
Hiro

On Sat, Nov 26, 2016 at 3:13 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
> Nice detective work! Seems to me that it's at best an undocumented limitation
> and potentially could be viewed as a bug - maybe log another JIRA?
>
> One note - there is a nodetool truncatehints command that could be used to
> clear out the hints
> (http://cassandra.apache.org/doc/latest/tools/nodetool/truncatehints.html?highlight=truncate).
> However, it seems to clear all hints on a particular endpoint, not just those
> for a specific table.
>
> Cheers
> Ben
>
> On Fri, 25 Nov 2016 at 17:42 Yuji Ito <y...@imagine-orb.com> wrote:
>>
>> Hi all,
>>
>> I revised the script to reproduce the issue.
>> I think the issue now happens more frequently than before.
>> Killing another node has been added to the previous script.
>>
>> ==== [script] ====
>> #!/bin/sh
>>
>> node1_ip=<node1 IP address>
>> node2_ip=<node2 IP address>
>> node3_ip=<node3 IP address>
>> node2_user=<user name>
>> node3_user=<user name>
>> rows=10000
>>
>> echo "consistency quorum;" > init_data.cql
>> for key in $(seq 0 $(expr $rows - 1))
>> do
>>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT EXISTS;" >> init_data.cql
>> done
>>
>> while true
>> do
>>     echo "truncate the table"
>>     cqlsh $node1_ip -e "truncate table testdb.testtbl" > /dev/null 2>&1
>>     if [ $? -ne 0 ]; then
>>         echo "truncating failed"
>>         continue
>>     else
>>         break
>>     fi
>> done
>>
>> echo "kill C* process on node3"
>> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>>
>> echo "insert $rows rows"
>> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
>>
>> echo "restart C* process on node3"
>> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
>>
>> while true
>> do
>>     echo "truncate the table again"
>>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
>>     if [ $? -ne 0 ]; then
>>         echo "truncating failed"
>>         continue
>>     else
>>         echo "truncation succeeded!"
>>         break
>>     fi
>> done
>>
>> echo "kill C* process on node2"
>> pdsh -l $node2_user -R ssh -w $node2_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>>
>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
>> sleep 10
>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
>>
>> echo "restart C* process on node2"
>> pdsh -l $node2_user -R ssh -w $node2_ip "sudo /etc/init.d/cassandra start"
>>
>> Thanks,
>> yuji
>>
>> On Fri, Nov 18, 2016 at 7:52 PM, Yuji Ito <y...@imagine-orb.com> wrote:
>>>
>>> I investigated the source code and the logs of the killed node.
>>> I suspect that unexpected writes are executed while the truncation is in progress.
>>>
>>> Some writes were executed after the flush (the first flush) inside the truncation,
>>> and these writes could be read afterwards.
>>> These writes were requested as MUTATIONs by another node for hinted handoff.
>>> Their data was stored in a new memtable and flushed (the second flush) to
>>> a new SSTable before the snapshot in the truncation.
>>> So the truncation discarded only the old SSTables, not the new SSTable.
>>> That's because the ReplayPosition used for discarding SSTables was
>>> that of the first flush.
>>>
>>> I copied some parts of the log below.
>>> Lines starting with "##" are my comments.
>>> The point is that the ReplayPosition is moved forward by the second flush.
>>> It means some writes were executed after the first flush.
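To make sure I follow the analysis, here is a tiny sketch of the discard decision as described above (my own simplification, not Cassandra code): truncate drops an SSTable only if its commitlog position is at or before the ReplayPosition captured by the first flush, so the SSTable produced by the second flush survives.

```shell
#!/bin/sh
# My own simplification, NOT Cassandra code: model the SSTable discard
# decision. ReplayPositions compare as (segmentId, position) pairs.
replay_le() {
    # replay_le segA posA segB posB -> true if (segA,posA) <= (segB,posB)
    [ "$1" -lt "$3" ] || { [ "$1" -eq "$3" ] && [ "$2" -le "$4" ]; }
}

first_seg=1479371760395; first_pos=315867   # ReplayPosition of the first flush

check() {   # check <sstable name> <segmentId> <position>
    if replay_le "$2" "$3" "$first_seg" "$first_pos"; then
        echo "$1: discarded by truncate"
    else
        echo "$1: survives truncate"
    fi
}

check lb-1-big 1479371760395 315867   # written by the first flush
check lb-2-big 1479371760395 486627   # written by the second flush (hints)
# -> lb-1-big: discarded by truncate
# -> lb-2-big: survives truncate
```

Under that model, lb-2-big (position 486627) sorts after the first flush's ReplayPosition and is kept, which matches the deletion of only lb-1-big in the log below.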
>>>
>>> == log ==
>>> ## truncation started
>>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:04,612 ColumnFamilyStore.java:2790 - truncating testtbl
>>> ## the first flush started at the beginning of the truncation
>>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:04,612 ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 591360 (0%) on-heap, 0 (0%) off-heap
>>> INFO [MemtableFlushWriter:1] 2016-11-17 08:36:04,613 Memtable.java:352 - Writing Memtable-testtbl@1863835308(42.625KiB serialized bytes, 2816 ops, 0%/0% of on/off-heap limit)
>>> ...
>>> DEBUG [MemtableFlushWriter:1] 2016-11-17 08:36:04,973 Memtable.java:386 - Completed flushing /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-1-big-Data.db (17.651KiB) for commitlog position ReplayPosition(segmentId=1479371760395, position=315867)
>>> ## this ReplayPosition was used for discarding SSTables
>>> ...
>>> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,022 CommitLog.java:298 - discard completed log segments for ReplayPosition(segmentId=1479371760395, position=315867), table 562848f0-a556-11e6-8b14-51065d58fdfb
>>> ## end of the first flush
>>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028 ColumnFamilyStore.java:2823 - Discarding sstable data for truncated CF + indexes
>>> ## the second flush, before the snapshot
>>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028 ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 698880 (0%) on-heap, 0 (0%) off-heap
>>> INFO [MemtableFlushWriter:2] 2016-11-17 08:36:05,029 Memtable.java:352 - Writing Memtable-testtbl@1186728207(50.375KiB serialized bytes, 3328 ops, 0%/0% of on/off-heap limit)
>>> ...
>>> DEBUG [MemtableFlushWriter:2] 2016-11-17 08:36:05,258 Memtable.java:386 - Completed flushing /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-2-big-Data.db (17.696KiB) for commitlog position ReplayPosition(segmentId=1479371760395, position=486627)
>>> ...
>>> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,289 CommitLog.java:298 - discard completed log segments for ReplayPosition(segmentId=1479371760395, position=486627), table 562848f0-a556-11e6-8b14-51065d58fdfb
>>> ## end of the second flush: the position was moved forward
>>> ...
>>> ## only the old SSTable was deleted, because only it was older than ReplayPosition(segmentId=1479371760395, position=315867)
>>> TRACE [NonPeriodicTasks:1] 2016-11-17 08:36:05,303 SSTable.java:118 - Deleted /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/lb-1-big
>>> ...
>>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 ColumnFamilyStore.java:2841 - truncate complete
>>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 TruncateVerbHandler.java:53 - Truncation(keyspace='testdb', cf='testtbl') applied. Enqueuing response to 36512@/10.91.145.7
>>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320 MessagingService.java:728 - /10.91.145.27 sending REQUEST_RESPONSE to 36512@/10.91.145.7
>>> ## end of truncation
>>> ====
>>>
>>> In fact, the "truncated_at" value for this table in system.local after running
>>> the script was 0x00000158716da30b0004d1db00000158716db524.
>>> It decodes to:
>>>     segmentId=1479371760395, position=315867
>>>     truncated_at=1479371765028 (2016-11-17 08:36:05,028)
>>>
>>> thanks,
>>> yuji
>>>
>>> On Wed, Nov 16, 2016 at 5:25 PM, Yuji Ito <y...@imagine-orb.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I found stale data after truncating a table.
>>>> It seems that the truncation starts while recovery is still being executed
>>>> just after a node restarts.
>>>> Does the recovery continue even after the truncation finishes?
>>>> Is this expected?
>>>>
>>>> I use C* 2.2.8 and can reproduce the issue as below.
>>>>
>>>> ==== [create table] ====
>>>> cqlsh $ip -e "drop keyspace testdb;"
>>>> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'};"
>>>> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY, val int);"
>>>>
>>>> ==== [script] ====
>>>> #!/bin/sh
>>>>
>>>> node1_ip=<node1 IP address>
>>>> node2_ip=<node2 IP address>
>>>> node3_ip=<node3 IP address>
>>>> node3_user=<user name>
>>>> rows=10000
>>>>
>>>> echo "consistency quorum;" > init_data.cql
>>>> for key in $(seq 0 $(expr $rows - 1))
>>>> do
>>>>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT EXISTS;" >> init_data.cql
>>>> done
>>>>
>>>> while true
>>>> do
>>>>     echo "truncate the table"
>>>>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
>>>>     if [ $? -ne 0 ]; then
>>>>         echo "truncating failed"
>>>>         continue
>>>>     else
>>>>         break
>>>>     fi
>>>> done
>>>>
>>>> echo "kill C* process on node3"
>>>> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>>>>
>>>> echo "insert $rows rows"
>>>> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
>>>>
>>>> echo "restart C* process on node3"
>>>> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
>>>>
>>>> while true
>>>> do
>>>>     echo "truncate the table again"
>>>>     cqlsh $node1_ip -e "truncate table testdb.testtbl"
>>>>     if [ $? -ne 0 ]; then
>>>>         echo "truncating failed"
>>>>         continue
>>>>     else
>>>>         break
>>>>     fi
>>>> done
>>>>
>>>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
>>>> sleep 10
>>>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select count(*) from testdb.testtbl;"
>>>>
>>>> ==== [result] ====
>>>> truncate the table
>>>> kill C* process on node3
>>>> insert 10000 rows
>>>> restart C* process on node3
>>>> 10.91.145.27: Starting Cassandra: OK
>>>> truncate the table again
>>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>>>> truncating failed
>>>> truncate the table again
>>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>>>> truncating failed
>>>> truncate the table again
>>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>>>> truncating failed
>>>> truncate the table again
>>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>>>> truncating failed
>>>> truncate the table again
>>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>>>> truncating failed
>>>> truncate the table again
>>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency level ALL
>>>> truncating failed
>>>> truncate the table again
>>>> Consistency level set to SERIAL.
>>>>
>>>>  count
>>>> -------
>>>>    300
>>>>
>>>> (1 rows)
>>>>
>>>> Warnings :
>>>> Aggregation query used without partition key
>>>>
>>>> Consistency level set to SERIAL.
>>>>
>>>>  count
>>>> -------
>>>>   2304
>>>>
>>>> (1 rows)
>>>>
>>>> Warnings :
>>>> Aggregation query used without partition key
>>>> ====
>>>>
>>>> I found this while investigating a data-loss problem (see the "failure node rejoin" thread).
>>>> I'm not sure whether this problem is related to that data loss.
>>>>
>>>> Thanks,
>>>> yuji
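P.S. As a sanity check on Yuji's decoding of the truncated_at blob, the three fields can be pulled apart with plain shell arithmetic. The field layout is assumed from his analysis (8-byte segmentId, 4-byte commitlog position, 8-byte truncation timestamp in milliseconds, big-endian hex), and the results match his numbers:

```shell
#!/bin/sh
# Decode the truncated_at blob quoted in the thread. Field layout assumed
# from Yuji's analysis: 8 bytes segmentId, 4 bytes commitlog position,
# 8 bytes truncation timestamp (ms since epoch), all big-endian hex.
blob=00000158716da30b0004d1db00000158716db524

segment_id=$((0x$(echo "$blob" | cut -c1-16)))
position=$((0x$(echo "$blob" | cut -c17-24)))
truncated_at_ms=$((0x$(echo "$blob" | cut -c25-40)))

echo "segmentId=$segment_id"         # segmentId=1479371760395
echo "position=$position"            # position=315867
echo "truncated_at=$truncated_at_ms" # truncated_at=1479371765028
```

So truncated_at records exactly the ReplayPosition of the first flush (position=315867), which is consistent with the second flush's SSTable surviving.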