[ 
https://issues.apache.org/jira/browse/PHOENIX-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165794#comment-14165794
 ] 

James Taylor commented on PHOENIX-1333:
---------------------------------------

bq. Which byte[] do you mean? The row that is added to the list in the 
Pair<Integer,List<byte[]>>? If we add it here then we may not know the 
exact length until we traverse the guidePost map.

I'd structure this slightly differently. Don't convert the List<byte[]> to 
byte[] until the end when you serialize it out as a KeyValue. Also, don't use 
Pair, as you'll need to store a bit more information and this will become 
unwieldy. Do this instead:
{code}
private static class GuidePostInfo {
    public long byteCount; // Number of bytes traversed in the region
    public long keyByteSize; // Total number of bytes in keys stored in guidePosts
    public List<byte[]> guidePosts;
}

private Map<String, GuidePostInfo> guidePostsMap = Maps.newHashMap();
{code}

Then in public void updateStatistic(KeyValue kv), it's easy to update the 
GuidePostInfo in place as you traverse the bytes (you're just updating the 
values in the structure instead of the Pair first/second values).
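
A minimal sketch of that in-place update, under stated assumptions: the class name GuidePostCollectorSketch, the guidePostDepth threshold, the per-family pending counter, and the update(family, rowKey, kvLength) signature are all hypothetical stand-ins for the real updateStatistic(KeyValue kv), which would pull the family, row key and length from the KeyValue itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical collector; only the GuidePostInfo fields come from the comment above.
class GuidePostCollectorSketch {
    static class GuidePostInfo {
        long byteCount;    // Number of bytes traversed in the region
        long keyByteSize;  // Total number of bytes in keys stored in guidePosts
        final List<byte[]> guidePosts = new ArrayList<>();
    }

    private final long guidePostDepth; // assumed: bytes to traverse between guideposts
    private final Map<String, GuidePostInfo> guidePostsMap = new HashMap<>();
    private final Map<String, Long> bytesSinceLastGuidePost = new HashMap<>();

    GuidePostCollectorSketch(long guidePostDepth) {
        this.guidePostDepth = guidePostDepth;
    }

    // Stand-in for updateStatistic(KeyValue kv): mutate the structure in place.
    void update(String family, byte[] rowKey, int kvLength) {
        GuidePostInfo info = guidePostsMap.computeIfAbsent(family, f -> new GuidePostInfo());
        info.byteCount += kvLength;
        long pending = bytesSinceLastGuidePost.merge(family, (long) kvLength, Long::sum);
        if (pending >= guidePostDepth) {
            // Record a guidepost and keep the running key size current,
            // so no second pass over the list is needed at serialization time.
            info.guidePosts.add(rowKey);
            info.keyByteSize += rowKey.length;
            bytesSinceLastGuidePost.put(family, 0L);
        }
    }

    GuidePostInfo get(String family) {
        return guidePostsMap.get(family);
    }
}
```

Because keyByteSize is maintained as each guidepost is added, the exact serialized length is known without traversing the guidePosts list again.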

When you serialize the information in public byte[] getGuidePosts(String fam), 
I'd modify the serialization format to include all the information in 
GuidePostInfo:
<byteCount as long>  // this is the total number of bytes traversed in the region
<keyByteSize as long> // total number of bytes in the list of keys
<totalNumOfKeys as int>
<numBytesInKey as vint><keyBytes><numBytesInKey as vint><keyBytes>...
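
As a rough illustration of the layout above, here is a self-contained sketch. The class and method names are hypothetical, the fixed-width fields are assumed to be big-endian (what DataOutputStream writes), and the writeVInt shown is a simple 7-bits-per-byte stand-in rather than the vint encoding Phoenix or Hadoop actually uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical serializer for the format sketched in the comment above.
class GuidePostSerializerSketch {
    // Stand-in varint: 7 bits per byte, low-order group first,
    // high bit set on every byte except the last.
    static void writeVInt(DataOutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    static byte[] serialize(long byteCount, long keyByteSize, List<byte[]> guidePosts) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(byteCount);        // <byteCount as long>
            out.writeLong(keyByteSize);      // <keyByteSize as long>
            out.writeInt(guidePosts.size()); // <totalNumOfKeys as int>
            for (byte[] key : guidePosts) {
                writeVInt(out, key.length);  // <numBytesInKey as vint>
                out.write(key);              // <keyBytes>
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

The List<byte[]> is only flattened here, at the point where the value is written out, which matches the advice to defer the conversion to byte[] until the end.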

In separate key values, more or less as you are doing now, capture the 
totalNumOfKeys in GUIDE_POSTS_COUNT and the byteCount in 
GUIDE_POSTS_WIDTH_BYTES. The one tweak on what you have: it's better to 
capture the total byte count here than the byte count per guidepost.

Then in StatisticsUtil.readStatistics(), you can pretty easily roll up the 
above serialized format into an aggregated one. Just read the first three 
fields: <byteCount as long>,<keyByteSize as long>, and <totalNumOfKeys as int> 
and you'll have the sizing information you need. You can just keep the same Map 
as before to aggregate them: Map<String, GuidePostInfo>. When you do the put, 
if you had an old value, just combine it with the new value. You can 
pass the final GuidePostInfo to the PTableStatsImpl constructor and store 
the List<byte[]> and the byteCount (i.e. the byteCount summed across all 
regions) in a new PTableStatsImpl member variable. Make sure this byteCount 
value gets serialized and pushed into a new PColumnFamilyImpl and PTableImpl 
member variable that provides the number of bytes traversed (a sibling of the 
List<byte[]> guidePosts member variable).
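
A hedged sketch of the roll-up step, assuming each region's serialized value begins with the three fixed-width fields in big-endian order. GuidePostRollupSketch, Totals and addRegion are hypothetical names. Only the sizing header is aggregated here; decoding the keys and appending them to the combined List<byte[]> is left out.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical aggregator: sums the per-region sizing header per column family.
class GuidePostRollupSketch {
    static class Totals {
        long byteCount;      // summed across all regions
        long keyByteSize;    // summed across all regions
        int totalNumOfKeys;  // summed across all regions
    }

    private final Map<String, Totals> totalsByFamily = new HashMap<>();

    // Read only the first three fields of one region's serialized guideposts
    // and fold them into any existing entry for the same column family.
    void addRegion(String family, byte[] serialized) {
        ByteBuffer in = ByteBuffer.wrap(serialized); // ByteBuffer is big-endian by default
        long byteCount = in.getLong();
        long keyByteSize = in.getLong();
        int totalNumOfKeys = in.getInt();

        Totals totals = totalsByFamily.computeIfAbsent(family, f -> new Totals());
        totals.byteCount += byteCount;
        totals.keyByteSize += keyByteSize;
        totals.totalNumOfKeys += totalNumOfKeys;
    }

    Totals get(String family) {
        return totalsByFamily.get(family);
    }
}
```

With the totals known up front, the combined key list and any backing buffer can be sized once before the individual keys are read.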





> Store statistics guideposts as VARBINARY
> ----------------------------------------
>
>                 Key: PHOENIX-1333
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1333
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>         Attachments: Phoenix-1333.patch
>
>
> There's a potential problem with storing the guideposts as a VARBINARY ARRAY, 
> as pointed out by PHOENIX-1329. We'd run into this issue if we're collecting 
> stats for a table with a trailing VARBINARY row key column if the value 
> contained embedded null bytes. Because of this, we're better off storing 
> guideposts as VARBINARY and serializing/deserializing in the following manner:
> <byte length as vint><bytes><byte length as vint><bytes>...
> We should also store as a separate KeyValue column the total number of 
> guideposts. So the schema of SYSTEM.STATS would look like this now instead:
> {code}
>     public static final String CREATE_STATS_TABLE_METADATA = 
>             "CREATE TABLE " + SYSTEM_CATALOG_SCHEMA + ".\"" + 
> SYSTEM_STATS_TABLE + "\"(\n" +
>             // PK columns
>             PHYSICAL_NAME  + " VARCHAR NOT NULL," +
>             COLUMN_FAMILY + " VARCHAR," +
>             REGION_NAME + " VARCHAR," +
>             GUIDE_POSTS  + " VARBINARY," +
>             GUIDE_POSTS_COUNT + " SMALLINT," +
>             MIN_KEY + " VARBINARY," + 
>             MAX_KEY + " VARBINARY," +
>             LAST_STATS_UPDATE_TIME+ " DATE, "+
>             "CONSTRAINT " + SYSTEM_TABLE_PK_NAME + " PRIMARY KEY ("
>             + PHYSICAL_NAME + ","
>             + COLUMN_FAMILY + ","+ REGION_NAME+"))\n" +
>             // TODO: should we support versioned stats?
>             // Install split policy to prevent a physical table's stats from being split across regions.
>             HTableDescriptor.SPLIT_POLICY + "='" + 
> MetaDataSplitPolicy.class.getName() + "'\n";
> {code}
> Then the serialization code in StatisticsTable.addStats() would need to 
> change to populate the GUIDE_POSTS_COUNT and serialize the GUIDE_POSTS in the 
> new format.
> The deserialization code is isolated to StatisticsUtil.readStatistics(). It 
> would need to read the GUIDE_POSTS_COUNT first for estimated sizing, and then 
> deserialize the GUIDE_POSTS in the new format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
