[ https://issues.apache.org/jira/browse/CASSANDRA-15259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900575#comment-16900575 ]
Jordan West edited comment on CASSANDRA-15259 at 8/6/19 2:43 PM:
-----------------------------------------------------------------

[~bdeggleston] good catch re: 2.x sstables. I see two ways to handle that off the top of my head – besides not including the legacy sstables in the calculation, which is broken. I think I prefer {{getMeanRowCount2}} (average of the row count and column count) because in the case of 100% legacy sstables or 100% new sstables it degrades to {{getMeanColumns}} or the original {{getMeanRowCount}}. Neither implementation is ideal since we have to handle it at the per-sstable level and what that means for an average is ambiguous.

Also, I wonder if the method name should change and/or if the logic should be moved to somewhere index-specific like {{CassandraIndex}}, now that what it's doing is a bit more specialized and less clear. WDYT?

{code:java}
public int getMeanRowCount()
{
    long totalRows = 0;
    long totalPartitions = 0;

    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        }
        else
        {
            long colCount = sstable.getEstimatedColumnCount().count();
            totalPartitions += colCount;
            totalRows += sstable.getEstimatedColumnCount().mean() * colCount;
        }
    }

    return totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
}

public int getMeanRowCount2()
{
    long totalRows = 0;
    long totalPartitions = 0;
    long legacyCols = 0;
    long legacyTotal = 0;

    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        }
        else
        {
            long colCount = sstable.getEstimatedColumnCount().count();
            legacyCols += sstable.getEstimatedColumnCount().mean() * colCount;
            legacyTotal += colCount;
        }
    }

    int rowMean = totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
    int legacyMean = legacyTotal > 0 ? (int) (legacyCols / legacyTotal) : 0;

    return (int) (((rowMean * totalPartitions) + (legacyMean * legacyTotal)) / (totalPartitions + legacyTotal));
}
{code}

was (Author: jrwest):
[~bdeggleston] good catch re: 2.1 sstables. I see two ways to handle that off the top of my head – besides not including the legacy sstables in the calculation, which is broken. I think I prefer {{getMeanRowCount2}} (average of the row count and column count) because in the case of 100% legacy sstables or 100% new sstables it degrades to {{getMeanColumns}} or the original {{getMeanRowCount}}. Neither implementation is ideal since we have to handle it at the per-sstable level and what that means for an average is ambiguous.

Also, I wonder if the method name should change and/or if the logic should be moved to somewhere index-specific like {{CassandraIndex}}, now that what it's doing is a bit more specialized and less clear. WDYT?

{code:java}
public int getMeanRowCount()
{
    long totalRows = 0;
    long totalPartitions = 0;

    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        }
        else
        {
            long colCount = sstable.getEstimatedColumnCount().count();
            totalPartitions += colCount;
            totalRows += sstable.getEstimatedColumnCount().mean() * colCount;
        }
    }

    return totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
}

public int getMeanRowCount2()
{
    long totalRows = 0;
    long totalPartitions = 0;
    long legacyCols = 0;
    long legacyTotal = 0;

    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        }
        else
        {
            long colCount = sstable.getEstimatedColumnCount().count();
            legacyCols += sstable.getEstimatedColumnCount().mean() * colCount;
            legacyTotal += colCount;
        }
    }

    int rowMean = totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
    int legacyMean = legacyTotal > 0 ? (int) (legacyCols / legacyTotal) : 0;

    return (int) (((rowMean * totalPartitions) + (legacyMean * legacyTotal)) / (totalPartitions + legacyTotal));
}
{code}

> Selecting Index by Lowest Mean Column Count Selects Random Index
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-15259
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15259
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Feature/2i Index
>            Reporter: Jordan West
>            Assignee: Jordan West
>            Priority: Urgent
>             Fix For: 3.0.19, 4.0, 3.11.x
>
>
> {{CassandraIndex}} uses
> [{{ColumnFamilyStore#getMeanColumns}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/internal/CassandraIndex.java#L273],
> the average number of columns per partition, which always returns the same
> answer for index CFs because they contain no regular columns and clustering
> columns aren't included in the count in Cassandra 3.0+.
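
For readers who want to compare the two proposals in the comment without a Cassandra checkout, the following is a minimal standalone sketch. It is not the project's code: {{Stats}}, {{pooledMean}} and {{weightedMean}} are hypothetical names, the per-sstable numbers are made up, and the zero-denominator guard in {{weightedMean}} is an addition of this sketch (the {{getMeanRowCount2}} snippet divides unconditionally). Under those assumptions, {{pooledMean}} mirrors the arithmetic of {{getMeanRowCount}} and {{weightedMean}} mirrors {{getMeanRowCount2}}.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MeanRowCountSketch
{
    /** Hypothetical stand-in for the per-sstable statistics read in the comment's code. */
    static final class Stats
    {
        final boolean storesRows; // true for a 3.0+ sstable, false for a legacy one
        final long count;         // partition count (new) or column-histogram count (legacy)
        final long total;         // total rows (new) or mean columns * count (legacy)

        private Stats(boolean storesRows, long count, long total)
        {
            this.storesRows = storesRows;
            this.count = count;
            this.total = total;
        }

        static Stats modern(long partitions, long rows)
        {
            return new Stats(true, partitions, rows);
        }

        static Stats legacy(long histogramCount, long meanColumns)
        {
            return new Stats(false, histogramCount, meanColumns * histogramCount);
        }
    }

    /** Same arithmetic as getMeanRowCount: one pooled ratio over all sstables. */
    static int pooledMean(List<Stats> sstables)
    {
        long total = 0;
        long count = 0;
        for (Stats s : sstables)
        {
            total += s.total;
            count += s.count;
        }
        return count > 0 ? (int) (total / count) : 0;
    }

    /** Same arithmetic as getMeanRowCount2: truncate each group's mean, then weight. */
    static int weightedMean(List<Stats> sstables)
    {
        long rows = 0, partitions = 0, legacyCols = 0, legacyTotal = 0;
        for (Stats s : sstables)
        {
            if (s.storesRows)
            {
                partitions += s.count;
                rows += s.total;
            }
            else
            {
                legacyTotal += s.count;
                legacyCols += s.total;
            }
        }
        int rowMean = partitions > 0 ? (int) (rows / partitions) : 0;
        int legacyMean = legacyTotal > 0 ? (int) (legacyCols / legacyTotal) : 0;
        long denominator = partitions + legacyTotal;
        // Guard added in this sketch only; it avoids a division by zero with no sstables.
        return denominator > 0
               ? (int) ((rowMean * partitions + legacyMean * legacyTotal) / denominator)
               : 0;
    }

    public static void main(String[] args)
    {
        // One new sstable (3 partitions, 8 rows) and one legacy sstable (count 1, mean 4).
        List<Stats> mixed = Arrays.asList(Stats.modern(3, 8), Stats.legacy(1, 4));
        System.out.println(pooledMean(mixed));   // 12 / 4 = 3
        System.out.println(weightedMean(mixed)); // (2 * 3 + 4 * 1) / 4 = 2
        System.out.println(weightedMean(Collections.<Stats>emptyList())); // 0
    }
}
```

In this sketch the two results differ only because {{weightedMean}} truncates each group's mean to an integer before weighting; with all-new or all-legacy input both methods return the same value.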