Re: Scaling with SQL query
Hi Dmitry,

Another question. If I query via JDBC or the new thin client, will the reduce happen on one of the server nodes, serving as a proxy?

Sent: Thursday, June 28, 2018 at 2:02 PM
From: "Tom M"
To: user@ignite.apache.org
Subject: Re: Scaling with SQL query

Hi Dmitry,

Thanks for the great explanation! It looks like the "reduce" happening on the client is the issue, and it can be solved by adding clients.

Sent: Wednesday, June 27, 2018 at 6:22 PM
From: dkarachentsev
To: user@ignite.apache.org
Subject: Re: Scaling with SQL query

Hi,

Slight degradation is expected in some cases. Let me explain how it works.

1) The client sends a request to each node (if query parallelism is greater than 1, the number of requests is multiplied by that number).
2) Each node runs the query against its local dataset.
3) Each node responds with 100 entries.
4) The client collects all responses and performs the reduce.

So what happens when you add a node? First of all, the dataset is split across a larger number of nodes. But you will not see any difference in query processing if the dataset is too small, or if the newly added node does not significantly reduce the amount of data on each of the other nodes. For example, you have 9 nodes and add one more. Each node loses no more than 10% of its data. With a small dataset that will not give you any performance boost.

On the other hand, the client has to send more requests and reduce more data. For instance, with 9 nodes it receives 900 entries, with 10 nodes, 1K entries. Again, if the dataset is relatively small, you get overhead on the client for the additional requests, responses and data.

Queries by primary key scale best, because in that case the client can send the request directly to the affinity node without broadcasting to all nodes.

So when can you get a scaling benefit for SQL?

1) You have a very large dataset. Each node will process less data, and the nodes will do it in parallel. Here the boost on each node will outweigh the additional overhead on the client.
2) You add more clients that run queries in parallel. Total throughput increases because the request/response overhead is divided among a larger number of clients. (Or you can set more connections per node to better utilize the client machine's resources.)
3) You query by primary key.

Please note one more thing: overall latency depends on how fast the slowest node is, because the client waits for all responses.

Thanks!
-Dmitry

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
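To put numbers on the explanation above, here is a rough sketch of the client-side cost, assuming data is spread evenly and every node returns a full LIMIT worth of rows. The class and method names are made up for illustration and are not Ignite APIs; the effect of query parallelism on the number of rows returned is left out because the thread does not spell it out.

```java
// Back-of-the-envelope model of the scatter/gather cost described in the thread.
// All names are hypothetical; nothing here calls Ignite.
final class ScatterGatherCost {

    /** Map requests the client sends: one per node, multiplied by query parallelism. */
    static int requestsSent(int nodes, int queryParallelism) {
        return nodes * queryParallelism;
    }

    /** Rows the client has to reduce if every node returns `limit` rows. */
    static int rowsToReduce(int nodes, int limit) {
        return nodes * limit;
    }

    /**
     * Fraction of its own data an existing node gives up when one node joins,
     * assuming an even distribution: (1/n - 1/(n+1)) / (1/n) = 1/(n+1).
     */
    static double relativeDataReduction(int nodesBefore) {
        return 1.0 / (nodesBefore + 1);
    }

    public static void main(String[] args) {
        int limit = 100; // LIMIT 100 from the query in the thread
        for (int nodes = 9; nodes <= 10; nodes++) {
            System.out.println(nodes + " nodes: "
                + requestsSent(nodes, 1) + " requests, "
                + rowsToReduce(nodes, limit) + " rows to reduce");
        }
        System.out.println("Per-node data reduction going from 9 to 10 nodes: "
            + relativeDataReduction(9));
    }
}
```

Going from 9 to 10 nodes, the rows the client reduces grow from 900 to 1000 while each node sheds only about a tenth of its data, which is why a small dataset sees overhead rather than a speedup.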
Re: Scaling with SQL query
Hi Dmitry,

Thanks for the great explanation! It looks like the "reduce" happening on the client is the issue, and it can be solved by adding clients.
Re: Scaling with SQL query
Hi Pavel,

Thank you for the reply. The cache is partitioned (with 3 copies).

[[SELECT __Z0.ID AS __C0_0, __Z0.CS AS __C0_1, __Z0.TIME AS __C0_2, __Z0.SID AS __C0_3, __Z0.SCITY AS __C0_4, __Z0.SADDRESS AS __C0_5, __Z0.IID AS __C0_6, __Z0.IURL AS __C0_7
FROM "Logs".LOGS __Z0 /* "Logs".TIME_IDX: TIME > 0 */
WHERE (__Z0.TIME > 0)
ORDER BY 3 DESC LIMIT ?1 /* index sorted */],
[SELECT __C0_0 AS ID, __C0_1 AS CS, __C0_2 AS TIME, __C0_3 AS SID, __C0_4 AS SCITY, __C0_5 AS SADDRESS, __C0_6 AS IID, __C0_7 AS IURL
FROM PUBLIC.__T0 /* "Logs"."merge_sorted" */
ORDER BY 3 DESC LIMIT ?1 /* index sorted */]]

This is what I have (I added "WHERE time > unix_epoch_time" to the original query here; it is changed to 0 above). I thought that adding more nodes shouldn't introduce much overhead. How will this query be processed? By sending requests to each node, looking up the data in each node's index, and returning the data to the "reducer" node? Is any inter-node data exchange involved?

Sent: Wednesday, June 27, 2018 at 1:36 PM
From: "Pavel Vinokurov"
To: user@ignite.apache.org
Subject: Re: Scaling with SQL query

Hi Tom,

In the case of a replicated cache, Ignite plans the execution of the SQL query across the whole cluster by splitting it into multiple map queries and a single reduce query. Communication overhead is therefore possible, because the "reduce" node collects data from multiple nodes. Please show the metrics for this query with your configuration.

Thanks,
Pavel

2018-06-26 6:24 GMT+03:00 Tom M <tar...@mail.com>:

Hi,

I have a cluster of 10 nodes, and a cache with replication factor 3 and no persistence enabled. The SQL query is pretty simple: "SELECT * FROM Logs ORDER BY time DESC LIMIT 100". I have checked that the index on the "time" attribute is used. When I increase the number of nodes, throughput drops and latency increases. Can you please explain why, and how Ignite processes this SQL request?

--
Regards
Pavel Vinokurov
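The plan quoted in this message has two parts: a map query that each node answers from its own index in sorted order, and a reduce query that reads from "merge_sorted" and applies ORDER BY ... LIMIT again. A minimal sketch of what such a reduce step might look like is below. It is a simulation in plain Java, not Ignite's actual implementation: each inner list stands in for one node's map result (already sorted by time, descending), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Simulated reduce step: k-way merge of per-node results that are already
// sorted by time descending, stopping once `limit` rows have been taken.
final class MergeSortedReduce {

    /** Cursor over one node's map result; `head` is the next unread row. */
    private static final class Cursor {
        final Iterator<Long> rows;
        long head;

        Cursor(Iterator<Long> rows) {
            this.rows = rows;
            this.head = rows.next();
        }

        /** Moves to the next row; returns false when this node's result is exhausted. */
        boolean advance() {
            if (!rows.hasNext()) return false;
            head = rows.next();
            return true;
        }
    }

    /** Returns the `limit` largest time values across all map results, in descending order. */
    static List<Long> reduce(List<List<Long>> mapResults, int limit) {
        // Max-heap on the current head of each cursor.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            Comparator.comparingLong((Cursor c) -> c.head).reversed());
        for (List<Long> result : mapResults) {
            if (!result.isEmpty()) heap.add(new Cursor(result.iterator()));
        }

        List<Long> out = new ArrayList<>();
        while (out.size() < limit && !heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.advance()) heap.add(c); // re-insert with its new head
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Long>> mapResults = List.of(
            List.of(9L, 5L, 1L),   // rows from a hypothetical node A
            List.of(8L, 7L, 2L));  // rows from a hypothetical node B
        System.out.println(reduce(mapResults, 4));
    }
}
```

In this model the nodes do not exchange data with each other; every map result travels to the single reducer, so the reducer's work grows with the number of nodes even though the final result stays at LIMIT rows.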
Scaling with SQL query
Hi,

I have a cluster of 10 nodes, and a cache with replication factor 3 and no persistence enabled. The SQL query is pretty simple: "SELECT * FROM Logs ORDER BY time DESC LIMIT 100". I have checked that the index on the "time" attribute is used. When I increase the number of nodes, throughput drops and latency increases. Can you please explain why, and how Ignite processes this SQL request?