Pavel, thanks for mentioning the patterns. Of course, I spent a lot of time reading the documentation: [2] at the very beginning and [1] a couple of months ago. Here is the origin of my nagging about a complete GitHub example: neither [1] nor [2] answers my problem. The keyword in my case is ASAP, so there should be a multithreaded example. Of course, a real-world example must not use primitive types as cache values; I tried to illustrate that in [3].
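To make the ASAP/multithreaded point concrete, here is a minimal sketch of the shape I have in mind, using only JDK classes. A ConcurrentHashMap stands in for the Ignite cache and a synthetic row generator stands in for the per-partition MSSQL query (both stand-ins are mine, not code from [1]-[3]); in real code each worker would run its own SQL query and feed IgniteDataStreamer.addData():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

public class ParallelPartitionLoader {

    // Stands in for "SELECT ... WHERE partition_id = ?" against MSSQL:
    // returns rowsPerPartition synthetic rows for the given partition.
    static Map<Integer, String> queryPartition(int partitionId, int rowsPerPartition) {
        Map<Integer, String> rows = new ConcurrentHashMap<>();
        for (int i = 0; i < rowsPerPartition; i++) {
            int key = partitionId * rowsPerPartition + i;
            rows.put(key, "value-" + key);
        }
        return rows;
    }

    // One worker thread per table partition; all workers feed the same target map,
    // so a slow response on one partition does not stall the others.
    public static Map<Integer, String> loadAll(int partitions, int rowsPerPartition)
            throws Exception {
        ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>(); // stands in for IgniteCache
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int p = 0; p < partitions; p++) {
                int partitionId = p;
                // Real code: iterate the JDBC ResultSet and call streamer.addData(key, binaryValue).
                futures.add(pool.submit(
                    () -> cache.putAll(queryPartition(partitionId, rowsPerPartition))));
            }
            for (Future<?> f : futures)
                f.get(); // propagate any worker failure
        } finally {
            pool.shutdown();
        }
        return cache;
    }
}
```

With a real DataStreamer, a single streamer instance is thread-safe and can be shared by all workers.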
I'd built one with the data streamer [1], but it seems I was limited by the network adapter performance (see my previous post in this thread). That is the reason I decided to move the SQL queries to the data nodes.

Vladimir

On Fri, Feb 26, 2021 at 10:54, Pavel Tupitsyn <[email protected]> wrote:

> Vladimir,
>
> I think all real-world use cases are very valuable to the community. However, we should be careful to avoid misleading conclusions.
>
> We have well-known patterns for loading data from other systems: DataStreamer [1] and CacheStore [2]. The article [3] seems a bit confusing to me, since neither of those two patterns is mentioned there. When proposing a custom approach, it would be great to compare it to the standard alternatives.
>
> [1] https://ignite.apache.org/docs/latest/data-streaming
> [2] https://ignite.apache.org/docs/latest/persistence/custom-cache-store
> [3] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>
> On Fri, Feb 26, 2021 at 9:19 AM Vladimir Tchernyi <[email protected]> wrote:
>
>> Hi Pavel,
>>
>> the code [1] you shared is a kind of in-memory experiment with all the processes inside a single JVM. My work differs: it comes from a big retail business, and hence it is 100% practice-oriented. To be fair, it is oriented to the state of things inside my company, and that is my question: will my results be interesting to the community? I have seen a lot of questions on the user list regarding data loading, and the difficulties here seem to be a blocker in extending Ignite's popularity.
>>
>> Please let me know if my case is not common in the industry. We have a big bare-metal Windows MSSQL server and a number of bare-metal hosts, each with virtualization software and a single CentOS virtual server inside. These CentOS hosts currently form an Ignite cluster with 4 data nodes and 1 client node.
>> The example [2] I published last year is intended to solve the business problem we have out here:
>> 1) the data currently present in the cluster has zero value;
>> 2) the actual data is in the database and must be loaded into the cluster ASAP. We use BinaryObject as cache key and value;
>> 3) the cluster performs some data processing and writes the result to the database.
>>
>> Unfortunately, the code [2] is not 100% OK in my case: it tends to say "is node still alive" and to drop the client node off the cluster. The performance of the MSSQL server and the network is what it is; I consider it a given restriction. It seems I made some progress when I managed to move the data loading process from a single client node to multiple data nodes. When extra data nodes are added, I expect the load performance to improve, at least until my big MSSQL server can no longer hold the load. So I want to know how interesting my results will be if they are published.
>>
>> WDYT?
>>
>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>> [2] https://github.com/vtchernyi/FastDataLoad
>>
>> On Thu, Feb 25, 2021 at 11:01, Pavel Tupitsyn <[email protected]> wrote:
>>
>>> Vladimir,
>>>
>>> Thanks for getting back to us. A full example that clarifies the situation would be great!
>>>
>>> > Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>
>>> It is super trivial, I just wanted to get a sense of the throughput and check if we have some kind of a regression in recent versions (we don't) [1]. Also I realized that the data size can be counted very differently: do we account for DB overhead, and how?
>>>
>>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>>>
>>> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd spent some time thinking about the community comments on my post.
>>>> It seems that Ignite is really not the bottleneck here. The performance of my production MSSQL server is a given restriction, and the problem is to ensure fast loading by executing multiple parallel queries. I'll test my code in production for a couple of months to watch for possible problems. If it is OK, a complete/downloadable/compilable GitHub example will probably be useful for the community.
>>>>
>>>> WDYT?
>>>>
>>>> On Fri, Feb 19, 2021 at 21:47, Vladimir Tchernyi <[email protected]> wrote:
>>>>
>>>>> Pavel,
>>>>>
>>>>> maybe it's time to put your five cents in. Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>>>
>>>>> As for MSSQL data retrieval being the bottleneck - I don't think so; I got a 15 min load time for 1 node and 3.5 min for 4 nodes. That looks like a linear dependency (the table and the RDBMS server were the same).
>>>>> --
>>>>> Vladimir
>>>>>
>>>>> On Fri, Feb 19, 2021 at 19:47, Pavel Tupitsyn <[email protected]> wrote:
>>>>>
>>>>>> > First of all, I tried to select the whole table at once
>>>>>>
>>>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not Ignite.
>>>>>>
>>>>>> Can you run a test where some dummy data of the same size as the real data is generated and inserted into Ignite, so that we test Ignite perf only, excluding MSSQL from the equation? For example, streaming 300 million entries (total size 6 GB) takes around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>>>>
>>>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[email protected]> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>> thanks for your interest in my work.
>>>>>>>
>>>>>>> I didn't try COPY FROM since I tried to work with Ignite SQL a couple of years ago and didn't succeed.
>>>>>>> Probably because the examples available aren't complete/downloadable/compilable (the paper [1] contains a GitHub repo, which is my five cents in changing the status quo). My interest is in the KV API.
>>>>>>>
>>>>>>> I did try a data streamer, and that was my first try. I did not notice a significant time reduction using the code from my paper [1] versus a data streamer/receiver. There was some memory economy with the streamer, though. I must say my experiment was made on a heavily loaded production MSSQL server. A filtered query with a 300K-row resultset takes about 15 sec. The story follows.
>>>>>>>
>>>>>>> First of all, I tried to select the whole table at once; I got a network timeout and the client node was dropped off the cluster ("is node still alive?"). So I partitioned the table and executed a number of queries one-by-one on the client node, each query for a specific table partition. That process took about 90 min. Unacceptable time.
>>>>>>>
>>>>>>> Then I tried to execute my queries in parallel on the client node, each query executing dataStreamer.addData() for a single dataStreamer. The timing was never less than 15 min. All the attempts were the same; probably that was the network throughput limit on the client node (the same interface is used for the resultset and for cluster intercom). I say it again - that was the production environment.
>>>>>>>
>>>>>>> Final schema:
>>>>>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job for one table partition;
>>>>>>> * each job executes an SQL query and constructs a map with binary object keys and values. Then the job executes targetCache.invokeAll(), specifying the constructed map and the static EntryProcessor class.
>>>>>>> The EntryProcessor contains the logic for the cache binary entry update;
>>>>>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>>>>>
>>>>>>> The schema described proved to be network error-free in my production network and gives acceptable timing.
>>>>>>>
>>>>>>> Vladimir
>>>>>>>
>>>>>>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>>
>>>>>>> On Fri, Feb 19, 2021 at 16:41, Stephen Darlington <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think it's more that putAll is mostly atomic, so the more records you save in one chunk, the more locking, etc. happens. Distributing the work as compute jobs means all the putAlls will be local, which is beneficial, and the size of each put is going to be smaller (also beneficial).
>>>>>>>>
>>>>>>>> But that's a lot of work that the data streamer already does for you, and the data streamer also batches updates, so it would still be faster.
>>>>>>>>
>>>>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <[email protected]> wrote:
>>>>>>>>
>>>>>>>> What would be the difference between doing cache.putAll(all rows) and separating them by affinity key + executing putAll inside a compute job? If I'm not mistaken, doing putAll should end up splitting those rows by affinity key on one of the servers, right? Is there a comparison of that?
>>>>>>>>
>>>>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Vladimir,
>>>>>>>>> Did you try to use the SQL command 'COPY FROM <csv_file>' via thin JDBC? This command uses IgniteDataStreamer to write data into the cluster and parses the CSV on the server node.
>>>>>>>>>
>>>>>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load
data.
>>>>>>>>>
>>>>>>>>> Hi Denis,
>>>>>>>>>
>>>>>>>>> The data space is 3.7 GB according to the MSSQL table properties.
>>>>>>>>>
>>>>>>>>> Vladimir
>>>>>>>>>
>>>>>>>>> At 9:47 on Feb 19, 2021, Denis Magda <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hello Vladimir,
>>>>>>>>>
>>>>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> In Sep 2020 I published a paper about Loading Large Datasets into Apache Ignite by Using a Key-Value API (English [1] and Russian [2] versions). The approach described works in production, but shows unacceptable performance for very large tables.
>>>>>>>>>
>>>>>>>>> The story continues, and yesterday I finished a proof of concept for very fast loading of a very big table. A partitioned MSSQL table of about 295 million rows was loaded by a 4-node Ignite cluster in 3 min 35 sec. Each node executed its own SQL queries in parallel and then distributed the loaded values across the other cluster nodes.
>>>>>>>>>
>>>>>>>>> Probably that result will be of interest to the community.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Vladimir Chernyi
>>>>>>>>>
>>>>>>>>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sent from the Yandex.Mail mobile app
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Taras Ledkov
>>>>>>>>> Mail-To: [email protected]
