[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383976#comment-14383976 ] Tyler Hobbs commented on CASSANDRA-8993: I'm partial to the more explicit {{nextSamplePosition = 0}}, so I've committed that as f7856c22. Thanks, I think that should wrap this ticket up. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993-2.1-v2.txt, 8993-2.1.txt, 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383561#comment-14383561 ] Benedict commented on CASSANDRA-8993: - I hate it when the explanation is simply that I'm an idiot (or if I'm charitable, that I forgot the reason). Given this explanation, I think either piece of code is as good as the other, so since it's your baby perhaps you can decide which you prefer? > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993-2.1-v2.txt, 8993-2.1.txt, 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382867#comment-14382867 ] Tyler Hobbs commented on CASSANDRA-8993: Since you +1'ed the current patch, I went ahead and committed it as 7ff25f0d. I can make a follow-up commit for the nextSamplePosition change if the proposed logic sounds correct to you. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993-2.1-v2.txt, 8993-2.1.txt, 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382831#comment-14382831 ] Tyler Hobbs commented on CASSANDRA-8993: bq. without a setNextSamplePosition(-minIndexInterval) it doesn't pass its test cases (i.e. initiating the first sample index deterministically to zero caused unit test failures) I looked into this. When I simply replaced the {{setNextSamplePosition(-minIndexInterval)}} call with {{nextSamplePosition = 0}}, I got failures in IndexSummaryManagerTest. Perhaps that's what you were seeing? It looks like the correct replacement for that line is: {noformat} nextSamplePosition = 0; indexIntervalMatches++; {noformat} since we implicitly have an index interval match at 0. With that code, the index summary related tests pass. bq. I wonder if we couldn't get a lot of the benefit of downsampling by sticking to powers of 2, as it might simplify the code significantly? That's true, it would simplify the code, we would just lose granularity in how the index summary memory pool is distributed. I wouldn't be opposed to doing that, but I don't think it's a pressing issue yet. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993-2.1-v2.txt, 8993-2.1.txt, 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382505#comment-14382505 ] Benedict commented on CASSANDRA-8993: - OK, so it makes a lot more sense now that I realise the downsampling granularity can be so small - I was thrown by the "BSL must be a power of two" without an equivalent statement for the sampling itself, and in my head I just assumed it was all dealing with powers of 2 (so nothing technically wrong with the comments, just my interpretation of them). This also explains why the effective index intervals were always the same - with powers of 2 they would be. I wonder if we couldn't get a lot of the benefit of downsampling by sticking to powers of 2, as it might simplify the code significantly? The original indices, indices to skip, and effective intervals could each be implemented with approximately one simple statement. Not pushing for it, mind, just airing the question. Thanks for taking the time to explain, anyway, and with that clarification I am +1 the patch as stands. On the topic of the zero index always being present: I can vouch that this assumption breaks somewhere, because I assumed this to be the case when modifying IndexSummaryBuilder, and without a setNextSamplePosition(-minIndexInterval) it doesn't pass its test cases (i.e. initiating the first sample index deterministically to zero caused unit test failures). So we should perhaps track down where the logical flaw is, however minor it may be. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993-2.1-v2.txt, 8993-2.1.txt, 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382302#comment-14382302 ] Tyler Hobbs commented on CASSANDRA-8993: I'll try to explain a bit about how downsampling works overall so that more people besides myself understand how it works :) I can put whatever info is useful into comments for posterity. bq. If I print out the "original indices" and "effective intervals", it seems that at the first downsampling level (64) The sampling level after minimal downsampling is 127, not 64. The sampling level can be anywhere between 0 and BASE_SAMPLING_LEVEL. When a summary moves from sampling level 128 to level 127, it will drop one summary entry with an index between \[0, 127\], one entry between \[127, 255\], and so on for the rest of the summary. The index to drop is determined by {{Downsampling.getSamplingPattern()}}. The list of integers returned from {{Downsampling.getSamplingPattern(BASE_SAMPLING_LEVEL)}} are the indexes that we'll drop for each round of downsampling. As an example, suppose BASE_SAMPLING_LEVEL is 16 instead of 128. {{Downsampling.getSamplingPattern(16)}} returns the following pattern: {noformat} 15, 7, 11, 3, 13, 5, 9, 1, 14, 6, 10, 2, 12, 4, 8, 0 {noformat} So, when we move from sampling level 16 to 15, we'll drop the entry at index 15 (and repeat that for indexes 15 + (16 * 1), 15 + (16 * 2), 15 + (16 * 3), etc). When we move from sampling level 15 to 14, we'll drop the entry at index 7 (and repeat as before, but take into account the fact that we've already dropped the entry at index 15). This pattern of dropping minimizes the maximum distance between remaining summary entries. Now, in practice, we will never move from sampling level 128 directly to level 127 because of IndexSummaryManager's {{DOWNSAMPLE_THRESHOLD}}. However, an index summary could go through multiple rounds of down and upsampling and arrive at level 127, so we need to be able to handle that. bq. Further confusion to understanding Downsampling as a whole stems from the permission of a -1 index into getEffectiveIndexIntervalAfterIndex without explanation Hmm, yeah, looking at the code, I don't think we actually need to handle that. I believe it is leftover logic from earlier in the development of the code when downsampling would remove the 0th index in an earlier round. With the current code, the 0th index entry should always be present. I'll make some changes to remove that. bq. and the fact that every effective interval is the same despite there being multiple avenues for calculating it I'm not sure what you mean here. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993-2.1-v2.txt, 8993-2.1.txt, 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381053#comment-14381053 ] Benedict commented on CASSANDRA-8993: - bq. that it should hopefully clarify the reason. Unfortunately not, at least for me. I am afraid I don't really follow the Downsampling logic, and the comment actually confused me further. I tried to print out the various static method states in Downsampling to visualise them better, and it still doesn't make much sense to me, so I'll outline my confusion and we can decide either to help me understand, or to perhaps get [~jbellis] (the original reviewer iirc) to step in. I should note that, either way, I expect the test you've introduced to be sound for solving this problem, so if this takes too long we can roll a version without further ceremony, but I think it may be an unnecessarily lengthy test on startup for old format files, which could be onerous. bq. Unfortunately, the first entry to be dropped is the entry at index (BASE_SAMPLING_LEVEL - 1), so we need to check a full set of BASE_SAMPLING_LEVEL entries. If I print out the "original indices" and "effective intervals", it seems that at the first downsampling level (64) alternating summary entries are dropped (and again for each extra level), up to dropping all but each 128th (beginning with zero), so the first half of the comment doesn't seem to match with the behaviour I see in a way I understand. If the behaviour matches what I see, and not the comment, then it seems we could safely just check the first two "expected" intervals and if they mismatch we've downsampled and need to rebuild. This would translate into always just checking 2 * minIndexInterval records in the index, or 1/64th the data. Further confusion to understanding Downsampling as a whole stems from the permission of a -1 index into getEffectiveIndexIntervalAfterIndex without explanation, and the fact that every effective interval is the same despite there being multiple avenues for calculating it (it would be much clearer if it were just minIndexInterval * (BASE_SAMPLING_LEVEL / samplingLevel). I apologise if I'm just being a bit dimwitted. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993-2.1-v2.txt, 8993-2.1.txt, 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380553#comment-14380553 ] Tyler Hobbs commented on CASSANDRA-8993: bq. Might it be easier to filter the list of readers earlier Sure, there's no good reason not to do this earlier, so I've moved it to the suggested location. bq. Could you clarify SSTableReader Line 875: why do we check BASE_SAMPLING_LEVEL entries? I meant to explain this a little better, thanks for catching that. I've improved the comment enough that it should hopefully clarify the reason. My [branch|https://github.com/thobbs/cassandra/tree/CASSANDRA-8993] has been updated in addition to the v2 patch. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993-2.1-v2.txt, 8993-2.1.txt, 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380252#comment-14380252 ] Benedict commented on CASSANDRA-8993: - * Might it be easier to filter the list of readers earlier, in {{redistributeSummaries(List compacting, List nonCompacting, long memoryPoolBytes)}} ? * Could you clarify SSTableReader Line 875: why do we check BASE_SAMPLING_LEVEL entries? > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993-2.1.txt, 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377985#comment-14377985 ] Benedict commented on CASSANDRA-8993: - [~Hachmann]: Yes, that's the expectation - I'm as certain as I can be that this will prevent the underlying bug causing any problems. We would love for you to corroborate this in the wild, though, by deploying the patch if you can. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377981#comment-14377981 ] Björn Hachmann commented on CASSANDRA-8993: --- Hi @Benedict, as I do not completely understand the discussion in this ticket, I ask myself, if the attached patch might also be a cure for the NPEs reported for CASSANDRA-8851? Thank you + Kind regards! > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370218#comment-14370218 ] Benedict commented on CASSANDRA-8993: - Huh, you're right, it's CQL rows we track. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370213#comment-14370213 ] Tyler Hobbs commented on CASSANDRA-8993: I don't think we actually track a row count for sstables anywhere (including SSTable/StatsMetadata), so we'll need to compare summary entries against index entries for the first 2048 index entries. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370187#comment-14370187 ] Tyler Hobbs commented on CASSANDRA-8993: For the fix, we'll need to do two things. First, stop overwriting summaries when the sstable is not a 2.1 format. Second, to recover from this bug, when we load sstables with the old format, we need to check the summary against the actual index to verify that it's at full sampling. If not, trigger a rebuild of the summary. Unfortunately, I think that will need to be done every time until the sstables have been converted to the new format. We could avoid that by persisting a flag somewhere, but I don't think the additional complexity is worth the savings in startup time. I'll work on the patch for this. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370178#comment-14370178 ] Benedict commented on CASSANDRA-8993: - Could we corroborate this using SSTableMetadata rowCount when deserializing? > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370150#comment-14370150 ] Tyler Hobbs commented on CASSANDRA-8993: Okay, I think I have an idea of what's happening. The 2.0-format SSTable is loaded by 2.1 and is eventually downsampled. This results in the summary being overwritten on disk (with every 2048th index entry). However, we overwrite it with the old 2.0 format, which does _not_ include the sampling level. Whenever the sstable and summary are reloaded (due to a restart, etc), there's no sampling level to read from the serialized summary, so it's assumed to be fully sampled (with every 128th index entry). I'm pretty sure that's what's leading to this mismatch. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370092#comment-14370092 ] Benedict commented on CASSANDRA-8993: - bq. I'm assuming you inspected the serialized index summary and saw an (old) interval of 2048? No, I'm using the summary to search the index, and logging how far I have to go to find the record I want. It ranges up to 2048 items. The value we deserialize is 128, so presumably it was serialized incorrectly, but I'm really not at all familiar with this to know it isn't a conversion problem of some kind. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370087#comment-14370087 ] Tyler Hobbs commented on CASSANDRA-8993: bq. I have had another closer look. It seems that we're reading an interval of 128, when in fact it is 2048. Are you referring to when we deserialize the index summary? The default {{index_interval}} in 2.0 was 128. In 2.1, we use the old value of {{index_interval}} for {{min_index_interval}} and set {{max_index_interval}} to 2048. Unfortunately there was a bug in cqlsh in 2.1, so neither property is shown in the {{DESCRIBE}} output. I'm assuming you inspected the serialized index summary and saw an (old) interval of 2048? > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370027#comment-14370027 ] Benedict commented on CASSANDRA-8993: - I've pushed the current mitigation, anyway > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369998#comment-14369998 ] Benedict commented on CASSANDRA-8993: - I have had another closer look. It seems that we're reading an interval of 128, when in fact it is 2048. Exactly why this is happening, I don't know. I've tried to poll git to see where it's gone wrong, but it's probably more easily done by someone with knowledge of the history here. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369973#comment-14369973 ] Benedict commented on CASSANDRA-8993: - Either way, I think it would be helpful to expand the current test coverage to include summaries after the serialization (under different de/serialization settings) > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369967#comment-14369967 ] Benedict commented on CASSANDRA-8993: - Early opening doesn't touch the index or index summary contents. I've reproduced the problem on the raw sstable directly with some hacky code to just open the index and summary directly. The sstable is a prior format, so it has never been successfully subjected to early opening or rewriting. You can download the files from CASSANDRA-8851 yourself, and I can provide you with the gpg key. I suspect the problem is related to the elimination of "indexInterval" from CfMetaData prematurely. It looks like it is needed to establish the actual sampling level - users that have modified this will have the incorrect level set after upgrade until their sstables are rewritten. > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8993) EffectiveIndexInterval calculation is incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369903#comment-14369903 ] Tyler Hobbs commented on CASSANDRA-8993: I'm okay with the patch to mitigate incorrect calculations, but at the moment I'm at a loss as to where the calculation is going wrong. We have pretty thorough test coverage around downsampling (especially with the modifications to the test in your patch). I can only guess that it may be some interaction between downsampling and early opening. We don't have a way to reproduce yet, correct? Do you think extending the SSTableRewriterTests with checks on {{getPosition(..., EQ)}} and downsampling of index summaries would yield something? > EffectiveIndexInterval calculation is incorrect > --- > > Key: CASSANDRA-8993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8993 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Blocker > Fix For: 2.1.4 > > Attachments: 8993.txt > > > I'm not familiar enough with the calculation itself to understand why this is > happening, but see discussion on CASSANDRA-8851 for the background. I've > introduced a test case to look for this during downsampling, but it seems to > pass just fine, so it may be an artefact of upgrading. > The problem was, unfortunately, not manifesting directly because it would > simply result in a failed lookup. This was only exposed when early opening > used firstKeyBeyond, which does not use the effective interval, and provided > the result to getPosition(). > I propose a simple fix that ensures a bug here cannot break correctness. > Perhaps [~thobbs] can follow up with an investigation as to how it actually > went wrong? -- This message was sent by Atlassian JIRA (v6.3.4#6332)