Re: Scaling with SQL query

2018-06-27 Thread Tom M

Hi Dmitry,

 

Another question. I am curious: if I query via JDBC or the new thin client, will the reduce happen on one of the server nodes serving as a proxy?
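For context, this is roughly how I run it today over JDBC (a minimal sketch; I am assuming the thin JDBC driver with its default port 10800, and the host is a placeholder):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ThinJdbcQuery {
    public static void main(String[] args) throws Exception {
        // The thin JDBC driver connects to a single server node
        // (10800 is the default client connector port).
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://192.168.0.1:10800");
             PreparedStatement st = conn.prepareStatement(
                     "SELECT * FROM \"Logs\".LOGS ORDER BY time DESC LIMIT ?")) {
            st.setInt(1, 100);
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next())
                    System.out.println(rs.getObject("TIME"));
            }
        }
    }
}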

 

Sent: Thursday, June 28, 2018 at 2:02 PM
From: "Tom M" 
To: user@ignite.apache.org
Subject: Re: Scaling with SQL query




Hi Dmitry, 

 

Thanks for the great explanation!

 

Looks like "reduce" hapenning on the client is the issue that can be solved with adding clients.

 

Sent: Wednesday, June 27, 2018 at 6:22 PM
From: dkarachentsev 
To: user@ignite.apache.org
Subject: Re: Scaling with SQL query

Hi,

Slight degradation is expected in some cases. Let me explain how it works.
1) The client sends a request to each node (if you have query parallelism > 1, the
number of requests is multiplied by that value).
2) Each node runs the query against its local dataset.
3) Each node responds with 100 entries.
4) The client collects all responses and performs the reduce.
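From the application side this is just a regular SqlFieldsQuery; a minimal sketch of the client side, assuming the cache is called "Logs":

import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class ClientSideQuery {
    public static void main(String[] args) {
        // Start a client node: the query is split into map queries
        // (one per server node) and the reduce step runs here, on the client.
        Ignition.setClientMode(true);

        try (Ignite client = Ignition.start()) {
            IgniteCache<Object, Object> cache = client.cache("Logs");

            SqlFieldsQuery qry = new SqlFieldsQuery(
                    "SELECT * FROM Logs ORDER BY time DESC LIMIT ?").setArgs(100);

            List<List<?>> rows = cache.query(qry).getAll();
            System.out.println("Fetched " + rows.size() + " rows");
        }
    }
}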

So what happens when you add a node? First of all, the dataset is split across a
larger number of nodes. But if the dataset is too small, or if the newly added node
does not significantly reduce the amount of data on each of the other nodes, you will
not see any difference in query processing. For example, if you have 9 nodes and add
one more, each node loses no more than 10% of its data. With a small dataset that
will not give you any performance boost.

On the other hand, the client has to send more requests and reduce more data. For
instance, with 9 nodes it receives 900 entries, with 10 nodes 1,000 entries. Again, if
the dataset is relatively small, you just get extra overhead on the client for the
additional requests/responses and data.

Queries by primary key scale best, because in that case the client can send the
request directly to the affinity node without broadcasting to all nodes.
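For illustration, a key-based lookup (a sketch only; the cache name and key value are assumed, and _key is the implicit key column in Ignite SQL):

import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class KeyLookup {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        try (Ignite client = Ignition.start()) {
            IgniteCache<Object, Object> cache = client.cache("Logs");

            Object key = 42L; // placeholder; the real key type depends on the cache

            // Plain cache get: routed to the primary node that owns this key.
            Object byKey = cache.get(key);

            // SQL lookup by the key column: no need to fan out to every node.
            List<List<?>> row = cache.query(new SqlFieldsQuery(
                    "SELECT * FROM Logs WHERE _key = ?").setArgs(key)).getAll();

            System.out.println(byKey + " / " + row);
        }
    }
}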

So when do you get a scaling benefit for SQL?
1) You have a very large dataset. Each node will process less data, and the nodes
will do it in parallel. Here the per-node speedup outweighs the additional overhead
on the client.
2) You add more clients that run queries in parallel. Total throughput increases
because the request/response overhead is divided between a larger number of clients.
(Or you can open more connections per node to better utilize the client machine's
resources; see the sketch after this list.)
3) You query by primary key.
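A rough illustration of point 2: several query threads on one client (more client nodes work the same way). Sketch only, the pool size is arbitrary:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class ParallelQueryClients {
    public static void main(String[] args) throws Exception {
        Ignition.setClientMode(true);

        try (Ignite client = Ignition.start()) {
            IgniteCache<Object, Object> cache = client.cache("Logs");
            ExecutorService pool = Executors.newFixedThreadPool(8);

            for (int i = 0; i < 8; i++) {
                pool.submit(() -> {
                    // Each thread runs its own query, so the per-request overhead
                    // is paid in parallel instead of sequentially.
                    List<List<?>> rows = cache.query(new SqlFieldsQuery(
                            "SELECT * FROM Logs ORDER BY time DESC LIMIT ?").setArgs(100)).getAll();
                    System.out.println(Thread.currentThread().getName() + ": " + rows.size() + " rows");
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }
}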

Please note one more thing: overall latency depends on how fast the slowest node is,
because the client waits for all responses.

Thanks!
-Dmitry



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Scaling with SQL query

2018-06-27 Thread Tom M

Hi Pavel,

 

Thank you for the reply. The cache is partitioned (with 3 copies).
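(Roughly this configuration; a sketch only, assuming "3 copies" means one primary plus two backups:)

import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class LogsCacheConfig {
    public static CacheConfiguration<Object, Object> logsCache() {
        return new CacheConfiguration<Object, Object>("Logs")
                .setCacheMode(CacheMode.PARTITIONED)
                .setBackups(2); // primary + 2 backups = 3 copies of each partition
    }
}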


 

[[SELECT
    __Z0.ID AS __C0_0,
    __Z0.CS AS __C0_1,
    __Z0.TIME AS __C0_2,
    __Z0.SID AS __C0_3,
    __Z0.SCITY AS __C0_4,
    __Z0.SADDRESS AS __C0_5,
    __Z0.IID AS __C0_6,
    __Z0.IURL AS __C0_7
FROM "Logs".LOGS __Z0
    /* "Logs".TIME_IDX: TIME > 0 */
WHERE (__Z0.TIME > 0)
ORDER BY 3 DESC
LIMIT ?1
/* index sorted */], [SELECT
    __C0_0 AS ID,
    __C0_1 AS CS,
    __C0_2 AS TIME,
    __C0_3 AS SID,
    __C0_4 AS SCITY,
    __C0_5 AS SADDRESS,
    __C0_6 AS IID,
    __C0_7 AS IURL
FROM PUBLIC.__T0
    /* "Logs"."merge_sorted" */
ORDER BY 3 DESC
LIMIT ?1
/* index sorted */]]

 

This is what I have (I added "WHERE time > unix_epoch_time" to the original query here; it appears as 0 in the plan above).
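In case it is useful, a plan like the one above can also be produced by prefixing the statement with EXPLAIN (a sketch; cache name assumed):

import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class ExplainPlan {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        try (Ignite client = Ignition.start()) {
            IgniteCache<Object, Object> cache = client.cache("Logs");

            // EXPLAIN returns the query plan rows instead of the data.
            List<List<?>> plan = cache.query(new SqlFieldsQuery(
                    "EXPLAIN SELECT * FROM Logs WHERE time > ? ORDER BY time DESC LIMIT ?")
                    .setArgs(0L, 100)).getAll();

            plan.forEach(row -> System.out.println(row.get(0)));
        }
    }
}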

 

I thought that adding more nodes shouldn't introduce much overhead. How will this query be processed? By sending requests to each node, looking up the data in each node's index, and returning the data to the "reducer" node? Is any inter-node data exchange involved?

 


Sent: Wednesday, June 27, 2018 at 1:36 PM
From: "Pavel Vinokurov" 
To: user@ignite.apache.org
Subject: Re: Scaling with SQL query


Hi Tom,
 

In the case of a replicated cache, Ignite plans the execution of the SQL query across the whole cluster by splitting it into multiple map queries and a single reduce query.

It is therefore possible that there is communication overhead caused by the "reduce" node collecting data from multiple nodes.

Please show the metrics for this query for your configuration.
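(A quick way to pull basic query metrics from the cache API, if that is convenient; a sketch only, cache name assumed:)

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryMetrics;

public class PrintQueryMetrics {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        try (Ignite client = Ignition.start()) {
            IgniteCache<Object, Object> cache = client.cache("Logs");

            // Aggregated metrics for queries executed against this cache.
            QueryMetrics m = cache.queryMetrics();
            System.out.printf("executions=%d fails=%d avg=%.1f ms min=%d ms max=%d ms%n",
                    m.executions(), m.fails(), m.averageTime(), m.minimumTime(), m.maximumTime());
        }
    }
}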

 

Thanks,

Pavel

 


 
2018-06-26 6:24 GMT+03:00 Tom M <tar...@mail.com>:




Hi,

 

I have a cluster of 10 nodes, and a cache with replication factor 3 and no persistence enabled.

The SQL query is pretty simple -- "SELECT * FROM Logs ORDER by time DESC LIMIT 100".

I have checked that the index on the "time" attribute is applied.

 

When I increase the number of nodes, throughput drops and latency increases.

Can you please explain why and how Ignite processes this SQL request?




 

 
--


Regards

Pavel Vinokurov

Scaling with SQL query

2018-06-25 Thread Tom M
Hi,

 

I have a cluster of 10 nodes, and a cache with replication factor 3 and no persistence enabled.

The SQL query is pretty simple -- "SELECT * FROM Logs ORDER by time DESC LIMIT 100".

I have checked that the index on the "time" attribute is applied.
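(For completeness, a sketch of how such an index can be declared via annotations; class and field names are assumed, and the DDL form would be along the lines of CREATE INDEX time_idx ON Logs (time DESC):)

import org.apache.ignite.cache.query.annotations.QuerySqlField;

public class Log {
    // Descending index on the sort column so that
    // "ORDER BY time DESC LIMIT n" can be served from the index.
    @QuerySqlField(index = true, descending = true)
    private long time;

    @QuerySqlField
    private String sid;

    // ... remaining fields (cs, scity, saddress, iid, iurl) omitted
}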

 

When I increase the number of nodes, throughput drops and latency increases.

Can you please explain why and how Ignite processes this SQL request?