Pavel, maybe it's time to put your five-cent in. Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.
As for MSSQL data retrieval being the bottleneck - don't think so, I got 15 min load time for 1 node and 3.5 min time for 4 nodes. Looks like a linear dependency (the table and the RDBMS server were the same). -- Vladimir пт, 19 февр. 2021 г. в 19:47, Pavel Tupitsyn <[email protected]>: > > First of all, I tried to select the whole table as once > > Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not > Ignite. > > Can you run a test where some dummy data of the same size as real data is > generated and inserted into Ignite, > so that we test Ignite perf only, excluding MSSQL from the equation? > For example, streaming 300 million entries (total size 6 GB) takes around > 1 minute on my machine, with a simple single-threaded DataStreamer. > > On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[email protected]> > wrote: > >> Hi folks, >> thanks for your interest in my work. >> >> I didn't try COPY FROM since I've tried to work with Ignite SQL a couple >> of years ago and didn't succeed. Probably because examples available aren't >> complete/downloadable/compilable (the paper [1] contains GitHub repo, that >> is my five cents in changing the status quo). My interest is in KV API. >> >> I did try a data streamer, and that was my first try. I did not notice a >> significant time reduction in using code from my paper [1] versus data >> streamer/receiver. There was some memory economy with the streamer, though. >> I must say my experiment was made on a heavily loaded production mssql >> server. Filtered query with 300K rows resultset takes about 15 sec. The >> story follows. >> >> First of all, I tried to select the whole table as once, I got the >> network timeout and the client node was dropped off the cluster (is node >> still alive?). >> So I'd partitioned the table and executed a number of queries one-by-one >> on the client node, each query for the specific table partition. That >> process took about 90 min. Inacceptable time. >> >> Then I tried to execute my queries in parallel on the client node, each >> query executing dataStreamer.addData() for a single dataStreamer. The >> timing was not less than 15 min. All the attempts were the same, probably >> that was the network throughput limit on the client node (same interface >> used for the resultset and for cluster intercom). Say it again - that was >> the production environment. >> >> Final schema: >> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job >> for one table partition; >> * each job executes SQL query, constructs a map with binary object key >> and value. Then the job executes targetCache.invokeAll() specifying the >> constructed map and the static EntryProcessor class. The EntryProcessor >> contains the logic for cache binary entry update; >> * ComputeTask.reduce() summarizes the row count reported by each job. >> >> The schema described proved to be network error-free in my production >> network and gives acceptable timing. >> >> Vladimir >> >> [1] >> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api >> >> пт, 19 февр. 2021 г. в 16:41, Stephen Darlington < >> [email protected]>: >> >>> I think it’s more that that putAll is mostly atomic, so the more records >>> you save in one chunk, the more locking, etc. happens. Distributing as >>> compute jobs means all the putAlls will be local which is beneficial, and >>> the size of each put is going to be smaller (also beneficial). >>> >>> But that’s a lot of work that the data streamer already does for you and >>> the data streamer also batches updates so would still be faster. >>> >>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <[email protected]> >>> wrote: >>> >>> What would be the difference between doing cache.putAll(all rows) and >>> separating them by affinity key+executing putAll inside a compute job. >>> If I'm not mistaken, doing putAll should end up splitting those rows by >>> affinity key in one of the servers, right? >>> Is there a comparison of that? >>> >>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]> >>> wrote: >>> >>>> Hi Vladimir, >>>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC? >>>> This command uses 'IgniteDataStreamer' to write data into cluster and >>>> parse CSV on the server node. >>>> >>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data. >>>> >>>> Hi Denis, >>>> >>>> Data space is 3.7Gb according to MSSQL table properries >>>> >>>> Vladimir >>>> >>>> 9:47, 19 февраля 2021 г., Denis Magda <[email protected]> >>>> <[email protected]>: >>>> >>>> Hello Vladimir, >>>> >>>> Good to hear from you! How much is that in gigabytes? >>>> >>>> - >>>> Denis >>>> >>>> >>>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote: >>>> >>>> Sep 2020 I've published the paper about Loading Large Datasets into >>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2] >>>> version). The approach described works in production, but shows >>>> inacceptable perfomance for very large tables. >>>> >>>> The story continues, and yesterday I've finished the proof of concept >>>> for very fast loading of very big table. The partitioned MSSQL table about >>>> 295 million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec. >>>> Each node had executed its own SQL queries in parallel and then distributed >>>> the loaded values across the other cluster nodes. >>>> >>>> Probably that result will be of interest for the community. >>>> >>>> Regards, >>>> Vladimir Chernyi >>>> >>>> [1] >>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api >>>> [2] https://m.habr.com/ru/post/526708/ >>>> >>>> >>>> >>>> -- >>>> Отправлено из мобильного приложения Яндекс.Почты >>>> >>>> -- >>>> Taras Ledkov >>>> Mail-To: [email protected] >>>> >>>> >>> >>>
