Hi Pavel, the code [1] you shared is a kind of in-memory experiment with all the processes inside a single JVM. My work is different: it comes from a big retail business, and hence it is 100% practice-oriented. Truth be told, it is oriented to the state of things inside my company, and that is my question - will my results be interesting to the community? I have seen a lot of questions on the user list regarding data loading, and the difficulties here seem to be a blocker to extending Ignite's popularity.
Please let me know if my case is not common in the industry. We have a big
bare-metal Windows MSSQL server and a number of bare-metal hosts, each
running virtualization software with a single CentOS virtual server inside.
These CentOS hosts currently form an Ignite cluster with 4 data nodes and 1
client node. The example [2] I published last year is intended to solve the
business problem we have here:
1) the data currently present in the cluster has zero value;
2) the actual data is in the database and must be loaded into the cluster
ASAP (we use BinaryObject as the cache key and value);
3) the cluster performs some data processing and writes the result to the
database.

Unfortunately, the code [2] is not 100% OK in my case: it tends to say "is
node still alive" and to drop the client node off the cluster. The
performance of the MSSQL server and the network is what it is; I consider
it a given restriction. It seems I made some progress when I managed to
move the data loading process from a single client node to multiple data
nodes. When extra data nodes are added, I expect the load performance to
improve, at least while my big MSSQL server is able to hold the load.

So I want to know how interesting my results would be if published. WDYT?

[1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
[2] https://github.com/vtchernyi/FastDataLoad

Thu, 25 Feb 2021 at 11:01, Pavel Tupitsyn <ptupit...@apache.org>:

> Vladimir,
>
> Thanks for getting back to us. A full example that clarifies the
> situation will be great!
>
> > Can you share your code as a GitHub project? Maybe with the script to
> > reproduce 6 GB of data.
>
> It is super trivial, I just wanted to get a sense of the throughput and
> check if we have some kind of a regression in recent versions (we don't)
> [1].
> Also I realised that the data size can be counted very differently - do
> we account for DB overhead and how?
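A minimal single-threaded DataStreamer throughput check of the kind Pavel describes might look like the sketch below. This is not his actual gist [1]; the cache name, the entry shape, and the entry count are made-up stand-ins, and a real test would run against a multi-node cluster rather than a single local node.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerBench {
    public static void main(String[] args) {
        // Start a single local node; swap in a real cluster config for a fair test.
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("bench");

            long started = System.nanoTime();
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("bench")) {
                for (long i = 0; i < 300_000_000L; i++)
                    streamer.addData(i, "value-" + i); // dummy data instead of MSSQL rows
            } // close() flushes the remaining batches

            System.out.printf("Loaded in %d ms%n", (System.nanoTime() - started) / 1_000_000);
        }
    }
}
```

The streamer batches entries and routes each batch to the node owning the key, which is why it is the usual tool for bulk loading and a reasonable baseline to compare custom schemes against.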
>
> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>
> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <vtcher...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'd spent some time thinking about the community comments on my post. It
>> seems that Ignite is really not the bottleneck here. The performance of
>> my production MSSQL server is a given restriction, and the problem is to
>> ensure fast loading by executing multiple parallel queries. I'll test my
>> code in production for a couple of months for possible problems. If it
>> is OK, a complete/downloadable/compilable GitHub example will probably
>> be useful to the community.
>>
>> WDYT?
>>
>> Fri, 19 Feb 2021 at 21:47, Vladimir Tchernyi <vtcher...@gmail.com>:
>>
>>> Pavel,
>>>
>>> maybe it's time to put your five cents in. Can you share your code as a
>>> GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>
>>> As for MSSQL data retrieval being the bottleneck - I don't think so: I
>>> got a 15 min load time for 1 node and 3.5 min for 4 nodes. Looks like a
>>> linear dependency (the table and the RDBMS server were the same).
>>> --
>>> Vladimir
>>>
>>> Fri, 19 Feb 2021 at 19:47, Pavel Tupitsyn <ptupit...@apache.org>:
>>>
>>>> > First of all, I tried to select the whole table at once
>>>>
>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here,
>>>> not Ignite.
>>>>
>>>> Can you run a test where some dummy data of the same size as the real
>>>> data is generated and inserted into Ignite, so that we test Ignite
>>>> perf only, excluding MSSQL from the equation?
>>>> For example, streaming 300 million entries (total size 6 GB) takes
>>>> around 1 minute on my machine, with a simple single-threaded
>>>> DataStreamer.
>>>>
>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vtcher...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>> thanks for your interest in my work.
>>>>>
>>>>> I didn't try COPY FROM, since I tried to work with Ignite SQL a
>>>>> couple of years ago and didn't succeed - probably because the examples
>>>>> available aren't complete/downloadable/compilable (the paper [1]
>>>>> contains a GitHub repo; that is my five cents in changing the status
>>>>> quo). My interest is in the KV API.
>>>>>
>>>>> I did try a data streamer, and that was my first try. I did not
>>>>> notice a significant time reduction using the code from my paper [1]
>>>>> versus a data streamer/receiver. There was some memory economy with
>>>>> the streamer, though. I must say my experiment was made on a heavily
>>>>> loaded production MSSQL server. A filtered query with a 300K-row
>>>>> result set takes about 15 sec. The story follows.
>>>>>
>>>>> First of all, I tried to select the whole table at once; I got a
>>>>> network timeout, and the client node was dropped off the cluster ("is
>>>>> node still alive?").
>>>>> So I partitioned the table and executed a number of queries
>>>>> one-by-one on the client node, each query for a specific table
>>>>> partition. That process took about 90 min. Unacceptable time.
>>>>>
>>>>> Then I tried to execute my queries in parallel on the client node,
>>>>> each query calling dataStreamer.addData() on a single dataStreamer.
>>>>> The timing was never less than 15 min across all attempts; probably
>>>>> that was the network throughput limit on the client node (the same
>>>>> interface is used for the result set and for cluster intercom). I say
>>>>> it again - that was the production environment.
>>>>>
>>>>> Final schema:
>>>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one
>>>>> job for one table partition;
>>>>> * each job executes an SQL query and constructs a map with a binary
>>>>> object key and value. Then the job executes targetCache.invokeAll(),
>>>>> specifying the constructed map and the static EntryProcessor class.
>>>>> The EntryProcessor contains the logic for the cache binary entry
>>>>> update;
>>>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>>>
>>>>> The schema described proved to be network-error-free in my production
>>>>> network and gives acceptable timing.
>>>>>
>>>>> Vladimir
>>>>>
>>>>> [1]
>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>
>>>>> Fri, 19 Feb 2021 at 16:41, Stephen Darlington <
>>>>> stephen.darling...@gridgain.com>:
>>>>>
>>>>>> I think it's more that putAll is mostly atomic, so the more records
>>>>>> you save in one chunk, the more locking, etc. happens. Distributing
>>>>>> the work as compute jobs means all the putAlls will be local, which
>>>>>> is beneficial, and the size of each put is going to be smaller (also
>>>>>> beneficial).
>>>>>>
>>>>>> But that's a lot of work that the data streamer already does for
>>>>>> you, and the data streamer also batches updates, so it would still
>>>>>> be faster.
>>>>>>
>>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <
>>>>>> maximiliano....@gmail.com> wrote:
>>>>>>
>>>>>> What would be the difference between doing cache.putAll(all rows)
>>>>>> and separating them by affinity key and executing putAll inside a
>>>>>> compute job?
>>>>>> If I'm not mistaken, doing putAll should end up splitting those rows
>>>>>> by affinity key on one of the servers, right?
>>>>>> Is there a comparison of that?
>>>>>>
>>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tled...@gridgain.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Vladimir,
>>>>>>> Did you try the SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>>>>>> This command uses IgniteDataStreamer to write data into the cluster
>>>>>>> and parses the CSV on the server node.
>>>>>>>
>>>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load
>>>>>>> data.
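The per-partition query step in the scheme above (one range-restricted SQL query per table partition, whether run one-by-one on the client or inside compute jobs) can be sketched in plain Java. The table and column names below are illustrative, not taken from the FastDataLoad repo:

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a big table into contiguous key ranges, one SQL query per range. */
public class RangeSplitter {
    /** Builds one range-restricted query per partition; bounds are half-open [lo, hi). */
    public static List<String> partitionQueries(String table, String keyCol,
                                                long minId, long maxId, int partitions) {
        List<String> queries = new ArrayList<>();
        long span = maxId - minId + 1;
        for (int p = 0; p < partitions; p++) {
            long lo = minId + span * p / partitions;
            long hi = minId + span * (p + 1) / partitions;
            queries.add("SELECT * FROM " + table
                + " WHERE " + keyCol + " >= " + lo + " AND " + keyCol + " < " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical 295M-row table split across 4 loader jobs.
        for (String q : partitionQueries("dbo.BigTable", "id", 1, 295_000_000L, 4))
            System.out.println(q);
    }
}
```

Half-open bounds guarantee every row lands in exactly one query, so the per-job row counts can simply be summed in reduce().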
>>>>>>>
>>>>>>> Hi Denis,
>>>>>>>
>>>>>>> The data space is 3.7 GB according to the MSSQL table properties.
>>>>>>>
>>>>>>> Vladimir
>>>>>>>
>>>>>>> 9:47, 19 February 2021, Denis Magda <dma...@apache.org>:
>>>>>>>
>>>>>>> Hello Vladimir,
>>>>>>>
>>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>>
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <vtcher...@gmail.com> wrote:
>>>>>>>
>>>>>>> In Sep 2020 I published a paper about Loading Large Datasets into
>>>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>>>> versions). The approach described works in production but shows
>>>>>>> unacceptable performance for very large tables.
>>>>>>>
>>>>>>> The story continues, and yesterday I finished the proof of concept
>>>>>>> for very fast loading of a very big table. A partitioned MSSQL
>>>>>>> table of about 295 million rows was loaded by the 4-node Ignite
>>>>>>> cluster in 3 min 35 sec. Each node executed its own SQL queries in
>>>>>>> parallel and then distributed the loaded values across the other
>>>>>>> cluster nodes.
>>>>>>>
>>>>>>> Probably that result will be of interest to the community.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Vladimir Chernyi
>>>>>>>
>>>>>>> [1]
>>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>>
>>>>>>> --
>>>>>>> Sent from the Yandex.Mail mobile app
>>>>>>>
>>>>>>> --
>>>>>>> Taras Ledkov
>>>>>>> Mail-To: tled...@gridgain.com
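Vladimir's final schema from earlier in the thread (ComputeTask.map() with one job per table partition, invokeAll() with a static EntryProcessor, reduce() summing row counts) could be wired up roughly as below. This is a hedged sketch, not the FastDataLoad code: the cache name "targetCache", the round-robin job-to-node assignment, and the queryPartition() stub are illustrative stand-ins, and it needs a running Ignite cluster to execute.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeJobResult;
import org.apache.ignite.compute.ComputeTaskAdapter;
import org.apache.ignite.resources.IgniteInstanceResource;

/** One ComputeJob per table partition; reduce() sums the per-job row counts. */
public class LoadTask extends ComputeTaskAdapter<List<Integer>, Long> {
    @Override public Map<? extends ComputeJob, ClusterNode> map(
            List<ClusterNode> nodes, List<Integer> tableParts) {
        Map<ComputeJob, ClusterNode> jobs = new HashMap<>();
        int i = 0;
        for (int part : tableParts)                    // round-robin assignment
            jobs.put(new LoadJob(part), nodes.get(i++ % nodes.size()));
        return jobs;
    }

    @Override public Long reduce(List<ComputeJobResult> results) {
        return results.stream().mapToLong(r -> r.<Long>getData()).sum();
    }

    private static class LoadJob extends ComputeJobAdapter {
        @IgniteInstanceResource private transient Ignite ignite;
        private final int part;

        LoadJob(int part) { this.part = part; }

        @Override public Object execute() {
            // Stand-in for the real JDBC query against one MSSQL table partition.
            Map<BinaryObject, BinaryObject> rows = queryPartition(part);

            IgniteCache<BinaryObject, BinaryObject> cache =
                ignite.cache("targetCache").withKeepBinary();
            // The static EntryProcessor applies each value on the node owning the key.
            cache.invokeAll(rows.keySet(), new Updater(), rows);
            return (long) rows.size();
        }

        private Map<BinaryObject, BinaryObject> queryPartition(int part) {
            return new HashMap<>(); // JDBC + BinaryObjectBuilder work goes here
        }
    }

    private static class Updater
            implements CacheEntryProcessor<BinaryObject, BinaryObject, Void> {
        @SuppressWarnings("unchecked")
        @Override public Void process(MutableEntry<BinaryObject, BinaryObject> e,
                                      Object... args) {
            Map<BinaryObject, BinaryObject> rows = (Map<BinaryObject, BinaryObject>) args[0];
            e.setValue(rows.get(e.getKey())); // the entry-update logic lives here
            return null;
        }
    }
}
```

A client could then run it with ignite.compute().execute(new LoadTask(), partitionIds), keeping both the SQL work and the cache writes on the data nodes, which is the property the thread credits for avoiding the "is node still alive" drops.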