Shawn,
Took a quick look. I think what's happening is that the UPSERT is done
serially when you have a LIMIT.
Parallel scans are issued for the SELECT, which is why the explain plan
shows PARALLEL, but then the results are concatenated via a single
LimitingResultIterator, in order to apply the CLIENT LIMIT.
The upsert then reads from that iterator and does the mutations in batches.
To insert in parallel, we would need some sort of shared state between the
writing threads to ensure we respect the limit, and I don't think we
currently have something like that.
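A minimal sketch of that flow (illustrative only, not Phoenix's actual code; the chunk count, row counts, batch size, and function names here are all invented):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(chunk_id):
    # Stand-in for one parallel region scan on the SELECT side.
    return [f"row-{chunk_id}-{i}" for i in range(100)]

def limited_upsert(limit, batch_size=50):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The scans themselves run in parallel (the "PARALLEL n-WAY" part
        # of the explain plan).
        scans = pool.map(scan_chunk, range(4))
        # But their results are concatenated into ONE stream so the client
        # LIMIT can be applied exactly -- the LimitingResultIterator analogue.
        limited = itertools.islice(itertools.chain.from_iterable(scans), limit)
        # The UPSERT side then drains that single stream in batches, so the
        # mutations are effectively written serially by one consumer.
        batches, batch = [], []
        for row in limited:
            batch.append(row)
            if len(batch) == batch_size:
                batches.append(batch)
                batch = []
        if batch:
            batches.append(batch)
        return batches
```

Running `limited_upsert(200)` yields four batches of 50 rows, all consumed by the single reading thread, even though the four scans ran concurrently.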
Vincent
On Tue, Dec 18, 2018 at 2:31 PM Vincent Poon wrote:
>
> Shawn, that sounds like a bug, I would file a JIRA.
>
> On Tue, Dec 18, 2018 at 12:33 PM Shawn Li wrote:
>
>> Hi Vincent & William,
>>
>>
>>
>> Below are the explain plans; both show PARALLEL execution:
>>
>>
>>
>> explain upsert into table1 select * from table2;
>>
>>
>>
>> UPSERT SELECT
>>
>> CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN
>> FULL SCAN OVER table2
>>
>>
>>
>> explain upsert into table1 select * from table2 limit 200;
>>
>>
>>
>> UPSERT SELECT
>>
>> CLIENT 27-CHUNK 3600 ROWS 48114000 BYTES PARALLEL 18-WAY ROUND ROBIN
>> FULL SCAN OVER table2
>>     SERVER 200 ROW LIMIT
>> CLIENT 200 ROW LIMIT
>>
>>
>> Thanks,
>>
>> Shawn
>>
>> On Tue, Dec 18, 2018, 13:30 Vincent Poon wrote:
>>> Shawn,
>>>
>>> Can you do an "explain" to show what your two statements are doing?
>>> That might give some clues. Perhaps one is able to be run on the server
>>> for some reason and the other is not.
>>> Otherwise, I don't see why one would be substantially slower than the
>>> other.
>>>
>>> Vincent
>>>
>>> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li wrote:
>>>
Hi Jonathan,
The single-threaded execution on one side sounds logical to me. Hopefully
Vincent can confirm it.
Thanks,
Shawn
On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech
wrote:
> My guess is that in order to enforce the limit, it's effectively
> single-threaded in either the select or the upsert.
>
> On Dec 17, 2018, at 6:43 PM, Shawn Li wrote:
>
> Hi Vincent,
>
> Thanks for explaining. That makes much more sense now, and it explains
> the high memory usage when there is no "limit" clause: because "upsert
> select" without "limit" upserts much more quickly, memory usage on the
> client machine is much higher than with "limit".
>
> So back to the other question: can you explain the underlying Phoenix
> implementation of "upsert select" with "limit"? Why is it slower than the
> version without "limit" when inserting a huge number of rows (2M), e.g.
> "upsert into table2 select * from table1 limit 2,000,000;"? This is much
> slower than inserting the whole table (upsert into table2 select * from
> table1;).
>
> Thanks,
> Xiang
>
>
> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon
> wrote:
>
>> Shawn,
>> Your query *upsert into table2 select * from table1; *would not be
>> run on the server, since the source and target tables are different. It would
>> have to be something like:
>> *upsert into table1 select * from table1;*
>>
>> If you want to run server-side upsert select on a target table that
>> is different from the source table, you need to set
>> "phoenix.client.enable.server.upsert.select" to true on your client.
>> There are some other restrictions: the table can't have any global
>> indexes, and the statement can't have a join or a where subquery. We need
>> to update the documentation with this information.
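For anyone looking for where to set that flag: a sketch of the client-side configuration, assuming the usual Phoenix mechanism of an hbase-site.xml on the client classpath (the property name is the one given above; it can typically also be passed as a JDBC connection property):

```xml
<!-- client-side hbase-site.xml -->
<property>
  <name>phoenix.client.enable.server.upsert.select</name>
  <value>true</value>
</property>
```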
>>
>> The reason there are all these hurdles is because it's generally not
>> recommended to do server-side upsert select across different tables,
>> because that means you're doing cross-regionserver RPCs (e.g. read data
>> from a region of sourcetable, and write to a region of targettable on a
>> different regionserver), potentially tying up handlers in an
>> unpredictable
>> way.
>>
>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li
>> wrote:
>>
>>> Hi Jaanai,
>>>
>>> According to Phoenix website, " If auto commit is on, and both a)
>>> the target table matches the source table, and b) the select performs no
>>> aggregation, then the population of the target table will be done
>>> completely on the server-side (with constraint violations logged, but
>>> otherwise ignored). Otherwise, data is buffered on the client and, if
>>> auto
>>> commit is on, committed in row batches as specified by the
>>> UpsertBatchSize
>>> connection property (or the phoenix.mutate.upsertBatchSize HBase config
>>> property which defaults to 1 rows)"
>>>
>>> And our sql statement is just: *upsert