Re: Any reason for so small phoenix.mutate.batchSize by default?

Alexander Batyrshin Tue, 03 Sep 2019 09:45:19 -0700

I observe that there are some extra mutations in the batch for each of my UPSERTs.
For example, if the app calls executeUpdate() only 5 times, then on commit there will 
be "DEBUG MutationState:1046 - Sent batch of 10".
I can’t figure out where these extra mutations come from, or why.


This means that the “useful” batch size is phoenix.mutate.batchSize / 2.

> * What does your table DDL look like?

CREATE TABLE IF NOT EXISTS TABLE_CODES (
    "id" VARCHAR NOT NULL PRIMARY KEY,
    "d"."tg" VARCHAR,
    "d"."drip" VARCHAR,
    "d"."s" UNSIGNED_TINYINT,
    "d"."se" UNSIGNED_TINYINT,
    "d"."rle" UNSIGNED_TINYINT,
    "d"."dme" TIMESTAMP,
    "d"."dpa" TIMESTAMP,
    "d"."p" VARCHAR,
    "d"."pt" UNSIGNED_TINYINT,
    "d"."x" VARCHAR,
    "d"."pn" VARCHAR,
    "d"."b" VARCHAR,
    "d"."hc" VARCHAR ARRAY,
    "d"."ns" VARCHAR(16),
    "d"."tv" VARCHAR(10),
    "d"."vcp" VARCHAR,
    "d"."et" UNSIGNED_TINYINT,
    "d"."xoa" BINARY(16),
    "d"."j" VARCHAR
) SALT_BUCKETS=30, COLUMN_ENCODED_BYTES=NONE;

CREATE INDEX "IDX_CIS_O" ON "TABLE_CODES" ("d"."x", "d"."dme") 
INCLUDE("d"."tg", "d"."rle", "d"."pt" ... ) SALT_BUCKETS=30;
CREATE INDEX "IDX_CIS_PRID" ON "TABLE_CODES" ("d"."drip", "d"."dme") 
INCLUDE("d"."tg", "d"."rle", "d"."pt" ...) SALT_BUCKETS=30;

In my case, with SALT_BUCKETS=30, every batch with default settings will carry only 50 
“useful” rows, and they will be split across 30 servers, so every server will 
get only 1-2 rows.
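
The arithmetic behind that estimate can be sketched as follows. This is only a rough illustration: the class and method names are made up for the example, the factor of 2 mutations per row is an assumption taken from the "5 upserts, batch of 10" log line above, and rows are assumed to spread evenly across the buckets.

```java
import java.util.Locale;

public class BatchMath {
    // Application rows carried by one batch, if each row costs `mutationsPerRow` mutations.
    public static int usefulRows(int batchSize, int mutationsPerRow) {
        return batchSize / mutationsPerRow;
    }

    // Average rows per salt bucket from one batch, assuming an even spread.
    public static double rowsPerBucket(int usefulRows, int saltBuckets) {
        return (double) usefulRows / saltBuckets;
    }

    public static void main(String[] args) {
        int mutationsPerRow = 2; // assumption: observed "Sent batch of 10" for 5 upserts
        int saltBuckets = 30;    // SALT_BUCKETS from the DDL above

        // Compare the current default (100) with the documented value (1000).
        for (int batchSize : new int[] {100, 1000}) {
            int rows = usefulRows(batchSize, mutationsPerRow);
            System.out.println(String.format(Locale.ROOT,
                "batchSize=%d -> %d useful rows, about %.2f rows per bucket",
                batchSize, rows, rowsPerBucket(rows, saltBuckets)));
        }
    }
}
```

With batchSize = 100 this works out to 50 rows, or fewer than 2 rows per bucket; with batchSize = 1000 it would be 500 rows, or roughly 16-17 per bucket.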

> * How large is one mutation you're writing (in bytes)?

Any idea how to calculate it?
https://phoenix.apache.org/metrics.html will give me the total mutation count 
and the total batch size in bytes. But, as I mentioned above, there are “extra” 
mutations that will distort the statistics.
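
One rough way to work around that, assuming the metrics do report a total mutation count and total bytes, would be to divide the count by the observed factor of 2 before averaging. The sketch below is plain arithmetic and does not call any Phoenix API; the class, the method names, and the sample numbers are placeholders for illustration, not measurements.

```java
public class MutationSizeEstimate {
    // Application rows behind a reported mutation count, given mutations per row.
    public static long rowsWritten(long mutationCount, int mutationsPerRow) {
        return mutationCount / mutationsPerRow;
    }

    // Average bytes sent per application row (extra mutations included in the bytes).
    public static double bytesPerRow(long totalBytes, long mutationCount, int mutationsPerRow) {
        long rows = rowsWritten(mutationCount, mutationsPerRow);
        if (rows == 0) {
            throw new IllegalArgumentException("no complete rows in this sample");
        }
        return (double) totalBytes / rows;
    }

    public static void main(String[] args) {
        long mutationCount = 10; // placeholder: e.g. "Sent batch of 10"
        long totalBytes = 4000;  // placeholder value, not a real measurement
        int mutationsPerRow = 2; // assumption from the 5-upserts / batch-of-10 observation

        System.out.println("rows: " + rowsWritten(mutationCount, mutationsPerRow));
        System.out.println("bytes per row: "
            + bytesPerRow(totalBytes, mutationCount, mutationsPerRow));
    }
}
```

This only gives a per-row average; it says nothing about how the bytes split between the "useful" and the "extra" mutation.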

> * How much data ends up being sent to a RegionServer in one RPC?
Where can I get this metric?


> On 3 Sep 2019, at 17:19, Josh Elser <[email protected]> wrote:
> 
> Hey Alexander,
> 
> Was just poking at the code for this: it looks like this is really just 
> determining the number of mutations that get "processed together" (as opposed 
> to a hard limit).
> 
> Since you have done some work, I'm curious if you could generate some data to 
> help back up your suggestion:
> 
> * What does your table DDL look like?
> * How large is one mutation you're writing (in bytes)?
> * How much data ends up being sent to a RegionServer in one RPC?
> 
> You're right in that we would want to make sure that we're sending an 
> adequate amount of data to a RegionServer in an RPC, but this is tricky to 
> balance for all cases (thus, setting a smaller value to avoid sending batches 
> that are too large is safer).
> 
> On 9/3/19 8:03 AM, Alexander Batyrshin wrote:
>>  Hello all,
>> 1) There is a bug in the documentation - http://phoenix.apache.org/tuning.html
>> phoenix.mutate.batchSize is not 1000, but only 100 by default
>> https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/query/QueryServicesOptions.java#L164
>> Changed for https://issues.apache.org/jira/browse/PHOENIX-541
>> 2) I want to discuss this default value. In PHOENIX-541 I read about an issue with 
>> MR and wide rows (2MB per row), and it looks like a rare case. But in most 
>> common cases we can get much better write performance with batchSize = 1000, 
>> especially if it is used with a SALT table
