[ https://issues.apache.org/jira/browse/CASSANDRA-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076509#comment-14076509 ]
Benedict commented on CASSANDRA-7582:
-------------------------------------

I'm -1000 on encountering an error and silently swallowing it on something as core to correctness as the commit log - this at least gives the user a big red flag they may want to seek expert help.

I think there are two distinct problems here - there are the 'unexpected' errors, which should almost certainly involve the user seeking help from an expert to diagnose (or perhaps JIRA, since it possibly means a bug), and the unknown table exceptions. The latter are debatably more ok to ignore, but I would much rather we simply retain information about dropped tables, much as we do truncated tables, so that we can suppress those known to have been dropped (with knowledge of exactly _when_ they were dropped, so if we see CL records past that time we can still fail and ask the user to at least file a bug report).

Consider the following (pretty plausible) scenario:
* User turns on CL saving
* User creates table X, populates it with some data (let's say it's a fairly static dataset)
* User uses the database for a period, mostly changing other tables
* At time T, user drops table X, recreates it (instead of, e.g., truncating it, which is separately also subtly dangerous in this scenario), and repopulates it with subtly but business-wise importantly different data
* Some time after T, user has to restore the cluster, and restores the schema from prior to T by mistake (let's say the team member restoring doesn't realise the table was recreated since then), then performs a PIT restore

The user now has no idea they have stale business data in their tables. Now, assuming we have saved the ids of all dropped tables, we could report to the user that they are likely restoring data from a future schema, and they could then decide if this was safe or not; in this case they would be able to restore a newer schema (assuming they had saved it) and a major business error would have been averted.
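
As a rough illustration of the dropped-table bookkeeping described above, the replay decision could look something like the following. This is a hypothetical sketch, not Cassandra code: DroppedTableRegistry, ReplayDecision, recordDrop and decide are invented names, and ordering the drop against CL records by a millisecond timestamp is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch only: none of these names exist in Cassandra.
final class DroppedTableRegistry {

    enum ReplayDecision {
        APPLY,                  // table exists in the current schema
        SKIP_DROPPED,           // table is known to have been dropped before this record was written
        FAIL_WRITTEN_AFTER_DROP,// record is newer than the recorded drop: likely a bug, ask for a report
        FAIL_UNKNOWN            // no record of this table at all: fail fast
    }

    // cfId -> time at which the table was dropped (assumed: epoch millis)
    private final Map<UUID, Long> droppedAt = new HashMap<>();

    void recordDrop(UUID cfId, long dropTimeMillis) {
        droppedAt.put(cfId, dropTimeMillis);
    }

    ReplayDecision decide(UUID cfId, boolean existsInSchema, long recordTimeMillis) {
        if (existsInSchema) {
            return ReplayDecision.APPLY;
        }
        Long dropTime = droppedAt.get(cfId);
        if (dropTime == null) {
            return ReplayDecision.FAIL_UNKNOWN;
        }
        return recordTimeMillis <= dropTime
                ? ReplayDecision.SKIP_DROPPED
                : ReplayDecision.FAIL_WRITTEN_AFTER_DROP;
    }
}
```

Only the SKIP_DROPPED case is suppressed silently; the two failure cases keep the fail-fast behaviour and differ only in what could be reported to the user.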

In general this fail-fast is likely to result in an increase in JIRA filing, and possibly for relatively benign bugs, but on the whole I would prefer that scenario to leaving subtle bugs in the CL. We've already caught at least one as a result of this, and we've had long-standing bugs with respect to drain that still affect 2.0 that would have been caught a long time ago with better reporting.

> 2.1 multi-dc upgrade errors
> ---------------------------
>
>                 Key: CASSANDRA-7582
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7582
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Ryan McGuire
>            Assignee: Benedict
>            Priority: Critical
>             Fix For: 2.1.1
>
>
> Multi-dc upgrade [was working from 2.0 -> 2.1 fairly recently|http://cassci.datastax.com/job/cassandra_upgrade_dtest/55/testReport/upgrade_through_versions_test/TestUpgrade_from_cassandra_2_0_latest_tag_to_cassandra_2_1_HEAD/], but is currently failing.
> Running upgrade_through_versions_test.py:TestUpgrade_from_cassandra_2_0_HEAD_to_cassandra_2_1_HEAD.bootstrap_multidc_test I get the following errors when starting 2.1 upgraded from 2.0:
> {code}
> ERROR [main] 2014-07-21 23:54:20,862 CommitLog.java:143 - Commit log replay failed due to replaying a mutation for a missing table. This error can be ignored by providing -Dcassandra.commitlog.stop_on_missing_tables=false on the command line
> ERROR [main] 2014-07-21 23:54:20,869 CassandraDaemon.java:474 - Exception encountered during startup
> java.lang.RuntimeException: org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=a1b676f3-0c5d-3276-bfd5-07cf43397004
>       at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300) [main/:na]
>       at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:457) [main/:na]
>       at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:546) [main/:na]
> Caused by: org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=a1b676f3-0c5d-3276-bfd5-07cf43397004
>       at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[main/:na]
>       at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[main/:na]
>       at org.apache.cassandra.db.Mutation$MutationSerializer.deserializeOneCf(Mutation.java:353) ~[main/:na]
>       at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:333) ~[main/:na]
>       at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:365) ~[main/:na]
>       at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:98) ~[main/:na]
>       at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:137) ~[main/:na]
>       at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:115) ~[main/:na]
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)