Re: "upsert select" with "limit" clause

2018-12-19 Thread Jaanai Zhang
Shawn,

I ran some tests on version 4.14.1-HBase-1.4. The details are as
follows:

CREATE TABLE test (id VARCHAR PRIMARY KEY, c1 varchar, c2 varchar)
SALT_BUCKETS = 10;

explain select * from test where c1 = 'xx' limit 5 offset 100;

CREATE TABLE test1 (id VARCHAR PRIMARY KEY, c1 varchar, c2 varchar)
SALT_BUCKETS = 10;

explain upsert into test1 select * from test limit 10;



0: jdbc:phoenix:thin:url=http://localhost:876> explain upsert into test1 select * from test limit 10;

+------------------------------------------------------------------------------------+-----------------+----------------+--------------+
|                                        PLAN                                        | EST_BYTES_READ  | EST_ROWS_READ  | EST_INFO_TS  |
+------------------------------------------------------------------------------------+-----------------+----------------+--------------+
| UPSERT SELECT                                                                      | 2040            | 10             | 0            |
| CLIENT 10-CHUNK 10 ROWS 2040 BYTES *SERIAL* 10-WAY ROUND ROBIN FULL SCAN OVER TEST | 2040            | 10             | 0            |
| SERVER 10 ROW LIMIT                                                                | 2040            | 10             | 0            |
| CLIENT 10 ROW LIMIT                                                                | 2040            | 10             | 0            |
+------------------------------------------------------------------------------------+-----------------+----------------+--------------+
4 rows selected (0.028 seconds)

0: jdbc:phoenix:thin:url=http://localhost:876> explain upsert into test1 select * from test;

+-----------------------------------------------------------------+-----------------+----------------+--------------+
|                              PLAN                               | EST_BYTES_READ  | EST_ROWS_READ  | EST_INFO_TS  |
+-----------------------------------------------------------------+-----------------+----------------+--------------+
| UPSERT SELECT                                                   | null            | null           | null         |
| CLIENT 10-CHUNK PARALLEL 10-WAY ROUND ROBIN FULL SCAN OVER TEST | null            | null           | null         |
+-----------------------------------------------------------------+-----------------+----------------+--------------+
2 rows selected (0.033 seconds)


I notice that the UPSERT produces serial scans when the limit clause is
present. What is your Phoenix version?   @Vincent FYI







   Jaanai Zhang
   Best regards!



Vincent Poon wrote on Thu, Dec 20, 2018 at 6:04 AM:

> Shawn,
>
> Took a quick look, I think what is happening is the UPSERT is done
> serially when you have LIMIT.
> Parallel scans are issued for the SELECT, which is why the explain plan
> shows PARALLEL, but then the results are concatenated via a single
> LimitingResultIterator, in order to apply the CLIENT LIMIT.
> The upsert then reads from that iterator and does the mutations in batches.
>
> To insert in parallel, we would need some sort of shared state between the
> writing threads to ensure we respect the limit, and I don't think we
> currently have something like that.
>
> Vincent
>
> On Tue, Dec 18, 2018 at 2:31 PM Vincent Poon 
> wrote:
>
>>
>> Shawn, that sounds like a bug, I would file a JIRA.
>>
>> On Tue, Dec 18, 2018 at 12:33 PM Shawn Li  wrote:
>>
>>> Hi Vincent & William,
>>>
>>>
>>>
>>> Below are the explain plans; both show PARALLEL execution:
>>>
>>>
>>>
>>> explain upsert into table1 select * from table2;
>>>
>>>
>>>
>>> UPSERT SELECT
>>> CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>>
>>>
>>>
>>> explain upsert into table1 select * from table2 limit 200;
>>>
>>>
>>>
>>> UPSERT SELECT
>>> CLIENT 27-CHUNK 3600 ROWS 48114000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>> SERVER 200 ROW LIMIT
>>> CLIENT 200 ROW LIMIT
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Shawn
>>>
On Tue, Dec 18, 2018, 13:30 Vincent Poon wrote:
 Shawn,

 Can you do an "explain" to show what your two statements are doing?
 That might give some clues.  Perhaps one is able to be run on the server
 for some reason and the other is not.
 Otherwise, I don't see why one would be substantially slower than the
 other.

 Vincent

 On Mon, Dec 17, 2018 at 9:14 PM Shawn Li  wrote:

> Hi Jonathan,
>
> The single threaded on one side sounds logical to me. Hopefully
> Vincent can confirm it.
>
> Thanks,
> Shawn
>
> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech 
> wrote:
>
>> My guess is that in order to enforce the limit that it’s effectively
>> single threaded in either the select or the upsert.
>>
>> On Dec 17, 2018, at 6:43 PM, Shawn Li  wrote:
>>
>> Hi Vincent,
>>
> Thanks for explaining. That makes much more sense now, and it explains
> the high memory usage without the "limit" clause.

Re: "upsert select" with "limit" clause

2018-12-19 Thread Vincent Poon
Shawn,

Took a quick look, I think what is happening is the UPSERT is done serially
when you have LIMIT.
Parallel scans are issued for the SELECT, which is why the explain plan
shows PARALLEL, but then the results are concatenated via a single
LimitingResultIterator, in order to apply the CLIENT LIMIT.
The upsert then reads from that iterator and does the mutations in batches.
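
In plain JDBC terms, that flow is roughly equivalent to the sketch below
(the connection URL, table names, and batch size are illustrative, not
actual Phoenix internals):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class LimitedUpsertSelectSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative URL and tables; adjust for your cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            conn.setAutoCommit(false);
            final int batchSize = 1000; // mutations are flushed in batches

            try (Statement select = conn.createStatement();
                 // Parallel scans feed this one result set, but the LIMIT is
                 // enforced by a single iterator, so consumption is serial.
                 ResultSet rs = select.executeQuery(
                         "SELECT id, c1, c2 FROM table2 LIMIT 200");
                 PreparedStatement upsert = conn.prepareStatement(
                         "UPSERT INTO table1 (id, c1, c2) VALUES (?, ?, ?)")) {
                long written = 0;
                while (rs.next()) {
                    upsert.setString(1, rs.getString(1));
                    upsert.setString(2, rs.getString(2));
                    upsert.setString(3, rs.getString(3));
                    upsert.executeUpdate();     // buffered on the client
                    if (++written % batchSize == 0) {
                        conn.commit();          // flush one batch of mutations
                    }
                }
                conn.commit();                  // flush the final partial batch
            }
        }
    }
}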

To insert in parallel, we would need some sort of shared state between the
writing threads to ensure we respect the limit, and I don't think we
currently have something like that.
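
Purely as a sketch of that missing piece (nothing like this exists in
Phoenix today), the writer threads would have to claim rows against a
shared counter before writing:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical shared state between parallel writer threads: each thread
// must claim a slot under the LIMIT before it is allowed to upsert a row.
class SharedLimit {
    private final AtomicLong remaining;

    SharedLimit(long limit) {
        this.remaining = new AtomicLong(limit);
    }

    // Returns true if the caller may write one more row.
    boolean tryClaim() {
        long r = remaining.decrementAndGet();
        if (r < 0) {
            remaining.incrementAndGet(); // roll back the over-claim
            return false;
        }
        return true;
    }
}

// Each writer thread would loop over its own scan results:
//   while (scanner.hasNext() && sharedLimit.tryClaim()) { write(scanner.next()); }
// Once tryClaim() returns false every thread stops, so at most LIMIT rows
// are written across all threads.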

Vincent

On Tue, Dec 18, 2018 at 2:31 PM Vincent Poon  wrote:

>
> Shawn, that sounds like a bug, I would file a JIRA.
>
> On Tue, Dec 18, 2018 at 12:33 PM Shawn Li  wrote:
>
>> Hi Vincent & William,
>>
>>
>>
>> Below are the explain plans; both show PARALLEL execution:
>>
>>
>>
>> explain upsert into table1 select * from table2;
>>
>>
>>
>> UPSERT SELECT
>> CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>
>>
>>
>> explain upsert into table1 select * from table2 limit 200;
>>
>>
>>
>> UPSERT SELECT
>> CLIENT 27-CHUNK 3600 ROWS 48114000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>> SERVER 200 ROW LIMIT
>> CLIENT 200 ROW LIMIT
>>
>>
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Shawn
>>
>> On Tue, Dec 18, 2018, 13:30 Vincent Poon wrote:
>>> Shawn,
>>>
>>> Can you do an "explain" to show what your two statements are doing?
>>> That might give some clues.  Perhaps one is able to be run on the server
>>> for some reason and the other is not.
>>> Otherwise, I don't see why one would be substantially slower than the
>>> other.
>>>
>>> Vincent
>>>
>>> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li  wrote:
>>>
 Hi Jonathan,

 The single threaded on one side sounds logical to me. Hopefully Vincent
 can confirm it.

 Thanks,
 Shawn

 On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech 
 wrote:

> My guess is that in order to enforce the limit that it’s effectively
> single threaded in either the select or the upsert.
>
> On Dec 17, 2018, at 6:43 PM, Shawn Li  wrote:
>
> Hi Vincent,
>
> Thanks for explaining. That makes much more sense now, and it explains
> the high memory usage without the "limit" clause. Because "upsert select"
> without "limit" upserts much more quickly, the memory usage on the client
> machine is much higher than with "limit".
>
> So back to the other question: can you explain the underlying Phoenix
> implementation for "upsert select" with "limit"? Why is it slower than
> without "limit" when inserting a huge number of rows (2M), as in "upsert
> into table2 select * from table1 limit 2,000,000;"? This is much slower
> than inserting the whole table (upsert into table2 select * from table1;).
>
> Thanks,
> Xiang
>
>
> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon 
> wrote:
>
>> Shawn,
>> Your query *upsert into table2 select * from table1;* would not be
>> run on the server - the source and target tables are different. It would
>> have to be something like:
>> *upsert into table1 select * from table1;*
>>
>> If you want to run server-side upsert select on a target table that
>> is different from the source table, you need to set
>> "phoenix.client.enable.server.upsert.select" to true on your client.
>> There are some other restrictions: the table can't have any global
>> indexes, and the statement can't have a join or a where subquery. We need
>> to update the documentation with this information.
>>
>> The reason for all these hurdles is that it's generally not
>> recommended to do server-side upsert select across different tables:
>> it means you're doing cross-regionserver RPCs (e.g. reading data
>> from a region of sourcetable and writing to a region of targettable on a
>> different regionserver), potentially tying up handlers in an
>> unpredictable way.
>>
>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li 
>> wrote:
>>
>>> Hi Jaanai,
>>>
>>> According to the Phoenix website: "If auto commit is on, and both a)
>>> the target table matches the source table, and b) the select performs no
>>> aggregation, then the population of the target table will be done
>>> completely on the server-side (with constraint violations logged, but
>>> otherwise ignored). Otherwise, data is buffered on the client and, if auto
>>> commit is on, committed in row batches as specified by the UpsertBatchSize
>>> connection property (or the phoenix.mutate.upsertBatchSize HBase config
>>> property which defaults to 10000 rows)."
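>>>
>>> As a sketch (the connection URL and batch size are illustrative), that
>>> batch size can be set as a connection property:
>>>
>>> import java.sql.Connection;
>>> import java.sql.DriverManager;
>>> import java.sql.Statement;
>>> import java.util.Properties;
>>>
>>> public class UpsertBatchSizeExample {
>>>     public static void main(String[] args) throws Exception {
>>>         Properties props = new Properties();
>>>         // Commit client-buffered mutations every 500 rows instead of
>>>         // the default 10000.
>>>         props.setProperty("UpsertBatchSize", "500");
>>>         try (Connection conn =
>>>                 DriverManager.getConnection("jdbc:phoenix:localhost", props)) {
>>>             conn.setAutoCommit(true); // rows are committed in batches
>>>             try (Statement stmt = conn.createStatement()) {
>>>                 stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
>>>             }
>>>         }
>>>     }
>>> }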
>>>
>>> And our SQL statement is just: *upsert into table2 select * from table1;*