Hi, I've spent some time thinking about the community comments on my post. It seems that Ignite is really not the bottleneck here. The performance of my production MSSQL server is a given constraint, and the problem is to ensure fast loading by executing multiple parallel queries. I'll test my code in production for a couple of months to shake out possible problems. If all goes well, a complete/downloadable/compilable GitHub example will probably be useful for the community.
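To make the planned GitHub example a bit more concrete, below is a rough sketch of the per-partition load schema from the "Final schema" part of my quoted message below. The class names (PartitionLoadTask, Person), the cache name "person", the JDBC URL and the partition SQL are placeholders, and the binary-object keys/values of the real code are simplified to plain types - this is a sketch, not the production code.

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.cache.processor.MutableEntry;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeJobResult;
import org.apache.ignite.compute.ComputeTaskAdapter;
import org.apache.ignite.resources.IgniteInstanceResource;

/** Schedules one job per MSSQL table partition and sums the loaded row counts. */
public class PartitionLoadTask extends ComputeTaskAdapter<List<Integer>, Long> {
    @Override public Map<? extends ComputeJob, ClusterNode> map(List<ClusterNode> subgrid, List<Integer> partIds) {
        Map<ComputeJob, ClusterNode> jobs = new HashMap<>();
        int i = 0;
        for (Integer partId : partIds)
            jobs.put(new PartitionLoadJob(partId), subgrid.get(i++ % subgrid.size())); // round-robin over nodes
        return jobs;
    }

    @Override public Long reduce(List<ComputeJobResult> results) {
        return results.stream().mapToLong(r -> r.<Long>getData()).sum(); // total rows loaded
    }

    /** Loads one table partition over JDBC and pushes it into the cache via invokeAll(). */
    private static class PartitionLoadJob extends ComputeJobAdapter {
        private final int partId;

        @IgniteInstanceResource
        private transient Ignite ignite; // injected on the executing node

        PartitionLoadJob(int partId) { this.partId = partId; }

        @Override public Object execute() {
            Map<Long, Person> batch = new HashMap<>();

            // Placeholder MSSQL query for a single table partition.
            String sql = "SELECT id, name FROM dbo.Person WHERE $PARTITION.personPf(id) = ?";

            try (Connection conn = DriverManager.getConnection("jdbc:sqlserver://mssql-host;databaseName=prod");
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, partId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next())
                        batch.put(rs.getLong(1), new Person(rs.getString(2)));
                }
            }
            catch (Exception e) {
                throw new RuntimeException("Failed to load partition " + partId, e);
            }

            IgniteCache<Long, Person> cache = ignite.cache("person");

            // Static entry processor; the loaded batch is passed as an argument.
            cache.invokeAll(batch.keySet(), new UpsertProcessor(), batch);

            return (long) batch.size();
        }
    }

    /** Writes the pre-loaded value for each key it is invoked on. */
    private static class UpsertProcessor implements CacheEntryProcessor<Long, Person, Void> {
        @SuppressWarnings("unchecked")
        @Override public Void process(MutableEntry<Long, Person> e, Object... args) {
            e.setValue(((Map<Long, Person>) args[0]).get(e.getKey()));
            return null;
        }
    }

    /** Minimal placeholder value class. */
    public static class Person implements Serializable {
        public final String name;
        public Person(String name) { this.name = name; }
    }
}

It would be launched from the client with something like ignite.compute().execute(new PartitionLoadTask(), partitionIds); connection pooling, retries and the real binary-object handling are omitted here.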
WDYT?

On Fri, Feb 19, 2021 at 21:47, Vladimir Tchernyi <[email protected]> wrote:

> Pavel,
>
> maybe it's time to put your five cents in. Can you share your code as a
> GitHub project? Maybe with a script to reproduce the 6 GB of data.
>
> As for MSSQL data retrieval being the bottleneck - I don't think so: I got
> a 15 min load time for 1 node and 3.5 min for 4 nodes. Looks like a
> linear dependency (the table and the RDBMS server were the same).
> --
> Vladimir
>
> On Fri, Feb 19, 2021 at 19:47, Pavel Tupitsyn <[email protected]> wrote:
>
>> > First of all, I tried to select the whole table at once
>>
>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
>> Ignite.
>>
>> Can you run a test where some dummy data of the same size as the real
>> data is generated and inserted into Ignite, so that we test Ignite perf
>> only, excluding MSSQL from the equation?
>> For example, streaming 300 million entries (total size 6 GB) takes around
>> 1 minute on my machine, with a simple single-threaded DataStreamer.
>>
>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[email protected]>
>> wrote:
>>
>>> Hi folks,
>>> thanks for your interest in my work.
>>>
>>> I didn't try COPY FROM, since I tried to work with Ignite SQL a couple
>>> of years ago and didn't succeed - probably because the available
>>> examples aren't complete/downloadable/compilable (the paper [1] contains
>>> a GitHub repo, that is my five cents toward changing the status quo).
>>> My interest is in the KV API.
>>>
>>> I did try a data streamer, and that was my first attempt. I did not
>>> notice a significant time reduction using the code from my paper [1]
>>> versus the data streamer/receiver. There was some memory economy with
>>> the streamer, though. I must say my experiment was made on a heavily
>>> loaded production MSSQL server, where a filtered query with a 300K-row
>>> resultset takes about 15 sec. The story follows.
>>>
>>> First of all, I tried to select the whole table at once; I got a network
>>> timeout and the client node was dropped off the cluster ("is node still
>>> alive?").
>>> So I partitioned the table and executed a number of queries one by one
>>> on the client node, each query for a specific table partition. That
>>> process took about 90 min. Unacceptable time.
>>>
>>> Then I tried to execute my queries in parallel on the client node, each
>>> query executing dataStreamer.addData() for a single dataStreamer. The
>>> timing was never less than 15 min. All the attempts were the same,
>>> probably because of the network throughput limit on the client node (the
>>> same interface was used for the resultset and for cluster intercom). I
>>> say it again - that was the production environment.
>>>
>>> Final schema:
>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job
>>> for one table partition;
>>> * each job executes an SQL query and constructs a map with binary object
>>> keys and values. Then the job executes targetCache.invokeAll(),
>>> specifying the constructed map and the static EntryProcessor class. The
>>> EntryProcessor contains the logic for the cache binary entry update;
>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>
>>> The schema described proved to be network error-free in my production
>>> network and gives acceptable timing.
>>>
>>> Vladimir
>>>
>>> [1]
>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>
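[As an aside to Pavel's dummy-data suggestion quoted above: a single-threaded streaming test could be as small as the sketch below. The config path "client-config.xml", the cache name "person" and the value format are placeholders, and the cache is expected to exist already.]

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerPerfTest {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start("client-config.xml");
             IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("person")) {
            long start = System.currentTimeMillis();

            // ~20 bytes of payload per entry, 300 million entries, roughly 6 GB in total.
            for (long i = 0; i < 300_000_000L; i++)
                streamer.addData(i, "dummy-value-" + i);

            streamer.flush();

            System.out.println("Streamed in " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}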
>>> On Fri, Feb 19, 2021 at 16:41, Stephen Darlington
>>> <[email protected]> wrote:
>>>
>>>> I think it's more that putAll is mostly atomic, so the more records you
>>>> save in one chunk, the more locking, etc. happens. Distributing as
>>>> compute jobs means all the putAlls will be local, which is beneficial,
>>>> and the size of each put is going to be smaller (also beneficial).
>>>>
>>>> But that's a lot of work that the data streamer already does for you,
>>>> and the data streamer also batches updates, so it would still be faster.
>>>>
>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez
>>>> <[email protected]> wrote:
>>>>
>>>> What would be the difference between doing cache.putAll(all rows) and
>>>> separating them by affinity key + executing putAll inside a compute job?
>>>> If I'm not mistaken, doing putAll should end up splitting those rows by
>>>> affinity key on one of the servers, right?
>>>> Is there a comparison of that?
>>>>
>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Vladimir,
>>>>> Did you try to use the SQL command 'COPY FROM <csv_file>' via thin
>>>>> JDBC? This command uses 'IgniteDataStreamer' to write data into the
>>>>> cluster and parses the CSV on the server node.
>>>>>
>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>>>>
>>>>> Hi Denis,
>>>>>
>>>>> Data space is 3.7 GB according to the MSSQL table properties.
>>>>>
>>>>> Vladimir
>>>>>
>>>>> 9:47, February 19, 2021, Denis Magda <[email protected]>:
>>>>>
>>>>> Hello Vladimir,
>>>>>
>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote:
>>>>>
>>>>> In September 2020 I published a paper about Loading Large Datasets into
>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>> versions). The approach described works in production, but shows
>>>>> unacceptable performance for very large tables.
>>>>>
>>>>> The story continues, and yesterday I finished a proof of concept for
>>>>> very fast loading of a very big table. A partitioned MSSQL table of
>>>>> about 295 million rows was loaded by a 4-node Ignite cluster in 3 min
>>>>> 35 sec. Each node executed its own SQL queries in parallel and then
>>>>> distributed the loaded values across the other cluster nodes.
>>>>>
>>>>> Probably that result will be of interest to the community.
>>>>>
>>>>> Regards,
>>>>> Vladimir Chernyi
>>>>>
>>>>> [1]
>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>
>>>>> --
>>>>> Sent from the Yandex.Mail mobile app
>>>>>
>>>>> --
>>>>> Taras Ledkov
>>>>> Mail-To: [email protected]
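For completeness, the 'COPY FROM' route Taras mentions above would look roughly like this through the JDBC thin driver; the table definition, column list and CSV path are placeholders, not from this thread.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyFromCsv {
    public static void main(String[] args) throws Exception {
        // Placeholder thin-driver URL; the driver ships with ignite-core.
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "CREATE TABLE IF NOT EXISTS Person (id BIGINT PRIMARY KEY, name VARCHAR)");

            // Per the message above, COPY feeds the rows through IgniteDataStreamer.
            stmt.executeUpdate(
                "COPY FROM '/data/person.csv' INTO Person (id, name) FORMAT CSV");
        }
    }
}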
