Vladimir, I think all real-world use cases are very valuable to the community. However, we should be careful to avoid misleading conclusions.
We have well-known patterns for loading data from other systems: DataStreamer [1] and CacheStore [2]. The article [3] seems a bit confusing to me, since neither of those two patterns is mentioned there. When proposing a custom approach, it would be great to compare it to the standard alternatives.

[1] https://ignite.apache.org/docs/latest/data-streaming
[2] https://ignite.apache.org/docs/latest/persistence/custom-cache-store
[3] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api

On Fri, Feb 26, 2021 at 9:19 AM Vladimir Tchernyi <vtcher...@gmail.com> wrote:

> Hi Pavel,
>
> the code [1] you shared is a kind of in-memory experiment with all the
> processes inside a single JVM. My work differs: it comes from a large
> retail business, and hence it is 100% practice-oriented. To be fair, it is
> oriented to the state of things inside my company, and that is my
> question: will my results be interesting to the community? I have seen a
> lot of questions on the user list regarding data loading, and the
> difficulties here seem to be a blocker to extending Ignite's popularity.
>
> Please let me know if my case is not common in the industry. We have a big
> bare-metal Windows MSSQL server and a number of bare-metal hosts, each
> running virtualization software with a single CentOS virtual server
> inside. These CentOS hosts currently form an Ignite cluster with 4 data
> nodes and 1 client node. The example [2] I published last year is intended
> to solve the business problem we have here:
> 1) the data currently present in the cluster has zero value;
> 2) the actual data is in the database and must be loaded into the cluster
> ASAP; we use BinaryObject as the cache key and value;
> 3) the cluster performs some data processing and writes the result back to
> the database.
>
> Unfortunately, the code [2] does not work 100% OK in my case: it tends to
> report "is node still alive" and to drop the client node off the cluster.
> The performance of the MSSQL server and the network is what it is; I
> consider it a given restriction. It seems I made some progress when I
> managed to move the data loading process from a single client node to
> multiple data nodes. When extra data nodes are added, I expect the load
> performance to improve, at least as long as my big MSSQL server can handle
> the load. So I want to know how interesting my results would be if they
> were published.
>
> WDYT?
>
> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
> [2] https://github.com/vtchernyi/FastDataLoad
>
> On Thu, Feb 25, 2021 at 11:01 AM Pavel Tupitsyn <ptupit...@apache.org> wrote:
>
>> Vladimir,
>>
>> Thanks for getting back to us. A full example that clarifies the
>> situation will be great!
>>
>> > Can you share your code as a GitHub project? Maybe with the script to
>> > reproduce 6 GB of data.
>>
>> It is super trivial; I just wanted to get a sense of the throughput and
>> check whether we have some kind of regression in recent versions (we
>> don't) [1]. Also, I realised that the data size can be counted very
>> differently - do we account for DB overhead, and how?
>>
>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
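For context, the single-threaded DataStreamer test Pavel refers to is roughly the following kind of code; this is a minimal sketch, not the actual gist, and the config path, cache name, and generated values are placeholders:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerLoadSketch {
    public static void main(String[] args) {
        // Client node config path is a placeholder; server nodes are assumed to be running.
        try (Ignite ignite = Ignition.start("client-config.xml")) {
            ignite.getOrCreateCache("testCache");

            // Single-threaded streaming of generated entries (~300 million, as in the test above).
            try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("testCache")) {
                for (int i = 0; i < 300_000_000; i++)
                    streamer.addData(i, "value-" + i);
            } // close() flushes the remaining buffered entries.
        }
    }
}

Whether the same rate survives a loaded MSSQL server and a shared network interface is, of course, exactly the question Vladimir raises below.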
>> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <vtcher...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've spent some time thinking about the community comments on my post.
>>> It seems that Ignite is really not the bottleneck here. The performance
>>> of my production MSSQL server is a given restriction, and the problem is
>>> to ensure fast loading by executing multiple parallel queries. I'll test
>>> my code in production for a couple of months to catch possible problems.
>>> If it works OK, a complete/downloadable/compilable GitHub example will
>>> probably be useful for the community.
>>>
>>> WDYT?
>>>
>>> On Fri, Feb 19, 2021 at 9:47 PM Vladimir Tchernyi <vtcher...@gmail.com> wrote:
>>>
>>>> Pavel,
>>>>
>>>> maybe it's time to put your five cents in. Can you share your code as a
>>>> GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>>
>>>> As for MSSQL data retrieval being the bottleneck - I don't think so: I
>>>> got a 15 min load time for 1 node and 3.5 min for 4 nodes. That looks
>>>> like a linear dependency (the table and the RDBMS server were the same).
>>>> --
>>>> Vladimir
>>>>
>>>> On Fri, Feb 19, 2021 at 7:47 PM Pavel Tupitsyn <ptupit...@apache.org> wrote:
>>>>
>>>>> > First of all, I tried to select the whole table at once
>>>>>
>>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here,
>>>>> not Ignite.
>>>>>
>>>>> Can you run a test where some dummy data of the same size as the real
>>>>> data is generated and inserted into Ignite, so that we test Ignite perf
>>>>> only, excluding MSSQL from the equation? For example, streaming 300
>>>>> million entries (total size 6 GB) takes around 1 minute on my machine,
>>>>> with a simple single-threaded DataStreamer.
>>>>>
>>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vtcher...@gmail.com> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>> thanks for your interest in my work.
>>>>>>
>>>>>> I didn't try COPY FROM since I tried to work with Ignite SQL a couple
>>>>>> of years ago and didn't succeed, probably because the available
>>>>>> examples aren't complete/downloadable/compilable (the paper [1]
>>>>>> contains a GitHub repo; that is my five cents toward changing the
>>>>>> status quo). My interest is in the KV API.
>>>>>>
>>>>>> I did try a data streamer, and that was my first attempt. I did not
>>>>>> notice a significant time reduction using the code from my paper [1]
>>>>>> versus a data streamer/receiver. There were some memory savings with
>>>>>> the streamer, though. I must say my experiment was made on a heavily
>>>>>> loaded production MSSQL server; a filtered query with a 300K-row
>>>>>> result set takes about 15 sec. The story follows.
>>>>>>
>>>>>> First of all, I tried to select the whole table at once; I got a
>>>>>> network timeout and the client node was dropped off the cluster ("is
>>>>>> node still alive?"). So I partitioned the table and executed a number
>>>>>> of queries one by one on the client node, each query for a specific
>>>>>> table partition. That process took about 90 min - an unacceptable time.
>>>>>>
>>>>>> Then I tried to execute my queries in parallel on the client node,
>>>>>> each query calling dataStreamer.addData() on a single data streamer.
>>>>>> The timing was never less than 15 min. All the attempts were about the
>>>>>> same; probably that was the network throughput limit on the client
>>>>>> node (the same interface is used for the result sets and for cluster
>>>>>> intercom). Again, that was the production environment.
>>>>>>
>>>>>> Final schema:
>>>>>> * ComputeTask.map() schedules ComputeJobs amongst the cluster nodes,
>>>>>> one job per table partition;
>>>>>> * each job executes its SQL query and constructs a map with binary
>>>>>> object keys and values; then the job calls targetCache.invokeAll(),
>>>>>> specifying the constructed map and the static EntryProcessor class;
>>>>>> the EntryProcessor contains the logic for the cache binary entry
>>>>>> update;
>>>>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>>>>
>>>>>> The schema described proved to be network error-free in my production
>>>>>> network and gives acceptable timing.
>>>>>>
>>>>>> Vladimir
>>>>>>
>>>>>> [1]
>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
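To make the final schema above concrete, here is a rough, simplified sketch of the map/invokeAll/reduce flow. The JDBC URL, the $PARTITION query, and the plain Long/String key-value types are invented for illustration; the real code uses BinaryObject keys and values, and the FastDataLoad repository remains the authoritative version.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.cache.processor.MutableEntry;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteException;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeJobResult;
import org.apache.ignite.compute.ComputeTaskAdapter;
import org.apache.ignite.resources.IgniteInstanceResource;

/** Distributes one load job per table partition, then sums the loaded row counts. */
public class LoadTask extends ComputeTaskAdapter<Integer, Integer> {
    @Override public Map<? extends ComputeJob, ClusterNode> map(List<ClusterNode> subgrid, Integer partitions) {
        Map<ComputeJob, ClusterNode> jobs = new HashMap<>();

        // Round-robin the table partitions over the available nodes.
        for (int part = 0; part < partitions; part++)
            jobs.put(new LoadJob(part), subgrid.get(part % subgrid.size()));

        return jobs;
    }

    @Override public Integer reduce(List<ComputeJobResult> results) {
        int total = 0;

        for (ComputeJobResult res : results)
            total += res.<Integer>getData();

        return total;
    }

    /** Loads a single MSSQL table partition and applies it to the cache. */
    private static class LoadJob extends ComputeJobAdapter {
        @IgniteInstanceResource
        private Ignite ignite;

        private final int part;

        LoadJob(int part) { this.part = part; }

        @Override public Object execute() {
            Map<Long, String> batch = new HashMap<>();

            // Hypothetical per-partition query; $PARTITION requires a partitioned MSSQL table.
            try (Connection conn = DriverManager.getConnection("jdbc:sqlserver://mssql-host;databaseName=db");
                 PreparedStatement st = conn.prepareStatement(
                     "SELECT id, name FROM dbo.person WHERE $PARTITION.personPartFn(id) = ?")) {
                st.setInt(1, part);

                try (ResultSet rs = st.executeQuery()) {
                    while (rs.next())
                        batch.put(rs.getLong(1), rs.getString(2));
                }
            }
            catch (SQLException e) {
                throw new IgniteException(e);
            }

            // One static EntryProcessor class for all keys; the batch travels as an invoke argument.
            IgniteCache<Long, String> cache = ignite.cache("person");
            cache.invokeAll(batch.keySet(), new UpdateProcessor(), batch);

            return batch.size();
        }
    }

    /** Writes the value from the supplied batch into the cache entry. */
    private static class UpdateProcessor implements CacheEntryProcessor<Long, String, Void> {
        @Override public Void process(MutableEntry<Long, String> entry, Object... args) {
            Map<?, ?> batch = (Map<?, ?>)args[0];
            entry.setValue((String)batch.get(entry.getKey()));
            return null;
        }
    }
}

Something like int rows = ignite.compute().execute(new LoadTask(), partitionCount); would then kick the load off from the client node.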
>>>>>> On Fri, Feb 19, 2021 at 4:41 PM Stephen Darlington <stephen.darling...@gridgain.com> wrote:
>>>>>>
>>>>>>> I think it's more that putAll is mostly atomic, so the more records
>>>>>>> you save in one chunk, the more locking, etc. happens. Distributing
>>>>>>> the work as compute jobs means all the putAlls will be local, which
>>>>>>> is beneficial, and the size of each put is going to be smaller (also
>>>>>>> beneficial).
>>>>>>>
>>>>>>> But that's a lot of work that the data streamer already does for you,
>>>>>>> and the data streamer also batches updates, so it would still be
>>>>>>> faster.
>>>>>>>
>>>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <maximiliano....@gmail.com> wrote:
>>>>>>>
>>>>>>> What would be the difference between doing cache.putAll(all rows) and
>>>>>>> separating them by affinity key and executing putAll inside a compute
>>>>>>> job? If I'm not mistaken, doing putAll should end up splitting those
>>>>>>> rows by affinity key on one of the servers, right?
>>>>>>> Is there a comparison of that?
>>>>>>>
>>>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tled...@gridgain.com> wrote:
>>>>>>>
>>>>>>>> Hi Vladimir,
>>>>>>>> Did you try to use the SQL command 'COPY FROM <csv_file>' via thin
>>>>>>>> JDBC? This command uses 'IgniteDataStreamer' to write data into the
>>>>>>>> cluster and parses the CSV on the server node.
>>>>>>>>
>>>>>>>> PS. AFAIK, IgniteDataStreamer is one of the fastest ways to load data.
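For anyone who wants to try the route Taras describes, the COPY command goes through the thin JDBC driver; a minimal sketch might look roughly like this, where the connection URL, CSV path, table, and columns are placeholders and the target table has to exist already (created via CREATE TABLE):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyFromSketch {
    public static void main(String[] args) throws Exception {
        // Thin JDBC connection to any node of the cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {
            // Bulk-load a CSV file into a pre-created table.
            stmt.executeUpdate(
                "COPY FROM '/data/person.csv' INTO person (id, name, salary) FORMAT CSV");
        }
    }
}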
>>>>>>>> Hi Denis,
>>>>>>>>
>>>>>>>> Data space is 3.7 GB according to the MSSQL table properties.
>>>>>>>>
>>>>>>>> Vladimir
>>>>>>>>
>>>>>>>> 9:47 AM, February 19, 2021, Denis Magda <dma...@apache.org>:
>>>>>>>>
>>>>>>>> Hello Vladimir,
>>>>>>>>
>>>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>>>
>>>>>>>> -
>>>>>>>> Denis
>>>>>>>>
>>>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <vtcher...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> In Sep 2020 I published the paper about Loading Large Datasets into
>>>>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>>>>> versions). The approach described works in production, but shows
>>>>>>>> unacceptable performance for very large tables.
>>>>>>>>
>>>>>>>> The story continues, and yesterday I finished a proof of concept for
>>>>>>>> very fast loading of a very big table. A partitioned MSSQL table of
>>>>>>>> about 295 million rows was loaded by the 4-node Ignite cluster in
>>>>>>>> 3 min 35 sec. Each node executed its own SQL queries in parallel and
>>>>>>>> then distributed the loaded values across the other cluster nodes.
>>>>>>>>
>>>>>>>> Probably that result will be of interest to the community.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vladimir Chernyi
>>>>>>>>
>>>>>>>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sent from the Yandex.Mail mobile application
>>>>>>>>
>>>>>>>> --
>>>>>>>> Taras Ledkov
>>>>>>>> Mail-To: tled...@gridgain.com