[jira] [Comment Edited] (CASSANDRA-15259) Selecting Index by Lowest Mean Column Count Selects Random Index

Jordan West (JIRA) Tue, 06 Aug 2019 07:44:13 -0700


    [ https://issues.apache.org/jira/browse/CASSANDRA-15259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900575#comment-16900575 ]


Jordan West edited comment on CASSANDRA-15259 at 8/6/19 2:43 PM:
-----------------------------------------------------------------

[~bdeggleston] good catch re: 2.x sstables. I see two ways to handle that off 
the top of my head – besides leaving the legacy sstables out of the 
calculation, which is broken.

I think I prefer {{getMeanRowCount2}} (average of the row count and column 
count) because in the case of 100% legacy sstables or 100% new sstables it 
degrades to {{getMeanColumns}} or the original {{getMeanRowCount}}, 
respectively. Neither implementation is ideal since we have to handle it at 
the per-sstable level, and what that means for an average is ambiguous.

Also, I wonder if the method name should change and/or if the logic should be 
moved somewhere index-specific like {{CassandraIndex}}, now that what it's 
doing is a bit more specialized and less clear. WDYT?

 
{code:java}
public int getMeanRowCount()
{
    long totalRows = 0;
    long totalPartitions = 0;
    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        } else
        {
            long colCount = sstable.getEstimatedColumnCount().count();
            totalPartitions += colCount;
            totalRows += sstable.getEstimatedColumnCount().mean() * colCount;
        }
    }

    return totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
}

public int getMeanRowCount2()
{
    long totalRows = 0;
    long totalPartitions = 0;
    long legacyCols = 0;
    long legacyTotal = 0;
    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        } else
        {
            long colCount = sstable.getEstimatedColumnCount().count();
            legacyCols += sstable.getEstimatedColumnCount().mean() * colCount;
            legacyTotal += colCount;
        }
    }

    long total = totalPartitions + legacyTotal;
    if (total == 0)
        return 0; // no sstables at all; avoid dividing by zero

    int rowMean = totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
    int legacyMean = legacyTotal > 0 ? (int) (legacyCols / legacyTotal) : 0;

    return (int) (((rowMean * totalPartitions) + (legacyMean * legacyTotal)) / total);
}
{code}
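
To make the comparison easier to reason about, here is a minimal standalone sketch of the two calculations over plain numbers. {{Stats}} is a hypothetical stand-in for the per-sstable values read above (it is not a Cassandra class), the sample numbers are made up, and the zero guard in {{weightedMean}} is an addition of this sketch.

```java
import java.util.Arrays;
import java.util.List;

class MeanRowCountSketch
{
    // Hypothetical stand-in for the per-sstable values; not a Cassandra class.
    static final class Stats
    {
        final boolean storesRows; // true for new-format sstables
        final long count;         // partitions (new) or column-histogram count (legacy)
        final long total;         // total rows (new) or mean columns * count (legacy)

        Stats(boolean storesRows, long count, long total)
        {
            this.storesRows = storesRows;
            this.count = count;
            this.total = total;
        }
    }

    // Mirrors the first variant: pool rows and legacy columns, then divide once.
    static int pooledMean(List<Stats> sstables)
    {
        long total = 0;
        long count = 0;
        for (Stats s : sstables)
        {
            total += s.total;
            count += s.count;
        }
        return count > 0 ? (int) (total / count) : 0;
    }

    // Mirrors the second variant: separate means, then a count-weighted average.
    static int weightedMean(List<Stats> sstables)
    {
        long rows = 0, partitions = 0, legacyCols = 0, legacyTotal = 0;
        for (Stats s : sstables)
        {
            if (s.storesRows)
            {
                rows += s.total;
                partitions += s.count;
            }
            else
            {
                legacyCols += s.total;
                legacyTotal += s.count;
            }
        }
        long all = partitions + legacyTotal;
        if (all == 0)
            return 0; // guard added in this sketch to avoid dividing by zero

        long rowMean = partitions > 0 ? rows / partitions : 0;
        long legacyMean = legacyTotal > 0 ? legacyCols / legacyTotal : 0;
        return (int) ((rowMean * partitions + legacyMean * legacyTotal) / all);
    }

    public static void main(String[] args)
    {
        Stats modern = new Stats(true, 100, 1000); // 100 partitions, 1000 rows
        Stats legacy = new Stats(false, 50, 200);  // histogram count 50, mean 4
        List<Stats> mixed = Arrays.asList(modern, legacy);
        System.out.println(pooledMean(mixed));
        System.out.println(weightedMean(mixed));
    }
}
```

With these sample numbers both variants return 8 for the mixed set, 10 when only the new-format sstable is present, and 4 when only the legacy one is.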
 


was (Author: jrwest):
[~bdeggleston] good catch re: 2.1 sstables. I see two ways to handle that off 
the top of my head – besides leaving the legacy sstables out of the 
calculation, which is broken.

I think I prefer {{getMeanRowCount2}} (average of the row count and column 
count) because in the case of 100% legacy sstables or 100% new sstables it 
degrades to {{getMeanColumns}} or the original {{getMeanRowCount}}, 
respectively. Neither implementation is ideal since we have to handle it at 
the per-sstable level, and what that means for an average is ambiguous.

Also, I wonder if the method name should change and/or if the logic should be 
moved somewhere index-specific like {{CassandraIndex}}, now that what it's 
doing is a bit more specialized and less clear. WDYT?

 
{code:java}
public int getMeanRowCount()
{
    long totalRows = 0;
    long totalPartitions = 0;
    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        } else
        {
            long colCount = sstable.getEstimatedColumnCount().count();
            totalPartitions += colCount;
            totalRows += sstable.getEstimatedColumnCount().mean() * colCount;
        }
    }

    return totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
}

public int getMeanRowCount2()
{
    long totalRows = 0;
    long totalPartitions = 0;
    long legacyCols = 0;
    long legacyTotal = 0;
    for (SSTableReader sstable : getSSTables(SSTableSet.CANONICAL))
    {
        if (sstable.descriptor.version.storeRows())
        {
            totalPartitions += sstable.getEstimatedPartitionSize().count();
            totalRows += sstable.getTotalRows();
        } else
        {
            long colCount = sstable.getEstimatedColumnCount().count();
            legacyCols += sstable.getEstimatedColumnCount().mean() * colCount;
            legacyTotal += colCount;
        }
    }

    int rowMean = totalPartitions > 0 ? (int) (totalRows / totalPartitions) : 0;
    int legacyMean = legacyTotal > 0 ? (int) (legacyCols / legacyTotal) : 0;

    return (int) (((rowMean * totalPartitions) + (legacyMean * legacyTotal)) / 
(totalPartitions + legacyTotal));
}
{code}
 

> Selecting Index by Lowest Mean Column Count Selects Random Index
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-15259
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15259
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Feature/2i Index
>            Reporter: Jordan West
>            Assignee: Jordan West
>            Priority: Urgent
>             Fix For: 3.0.19, 4.0, 3.11.x
>
>
> {{CassandraIndex}} uses 
> [{{ColumnFamilyStore#getMeanColumns}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/internal/CassandraIndex.java#L273],
>  average columns per partition, which always returns the same answer for 
> index CFs because they contain no regular columns and clustering columns 
> aren't included in the count in Cassandra 3.0+.
>  
>  
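
A minimal sketch of why identical estimates make the choice effectively arbitrary, assuming the selector simply keeps the candidate with the lowest estimate. {{Candidate}} and {{pickLowest}} are hypothetical names used only for illustration (they are not Cassandra APIs), and the estimate value 1 is made up.

```java
import java.util.Arrays;
import java.util.List;

class IndexTieSketch
{
    // Hypothetical candidate: an index name plus its estimated mean column count.
    static final class Candidate
    {
        final String name;
        final long estimate;

        Candidate(String name, long estimate)
        {
            this.name = name;
            this.estimate = estimate;
        }
    }

    // Keeps the candidate with the lowest estimate; on a tie the first one seen wins.
    static String pickLowest(List<Candidate> candidates)
    {
        Candidate best = null;
        for (Candidate c : candidates)
        {
            if (best == null || c.estimate < best.estimate)
                best = c;
        }
        return best == null ? null : best.name;
    }

    public static void main(String[] args)
    {
        // Every index reports the same estimate, as described for index CFs.
        Candidate a = new Candidate("idx_a", 1);
        Candidate b = new Candidate("idx_b", 1);

        System.out.println(pickLowest(Arrays.asList(a, b))); // prints idx_a
        System.out.println(pickLowest(Arrays.asList(b, a))); // prints idx_b
    }
}
```

When all estimates tie, the result depends only on iteration order, not on anything about the indexes themselves.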



