Pavel, thanks for mentioning the patterns. Of course, I spent a lot of time reading the documentation: [2] at the very beginning and [1] a couple of months ago. Here is the origin of my nagging about a complete GitHub example: neither [1] nor [2] answers my problem. The keyword in my case is ASAP, so there should be a multithreaded example. Of course, a real-world example must not use primitive types as cache values; I tried to illustrate that in [3].
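To make the ASAP/multithreaded point concrete, here is a minimal sketch of the shape I have in mind, using only JDK classes. A ConcurrentHashMap stands in for the Ignite cache and a synthetic row generator stands in for the per-partition MSSQL query (both stand-ins are mine, not code from [1]-[3]); in real code each worker would run its own SQL query and feed IgniteDataStreamer.addData():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

public class ParallelPartitionLoader {

    // Stands in for "SELECT ... WHERE partition_id = ?" against MSSQL:
    // returns rowsPerPartition synthetic rows for the given partition.
    static Map<Integer, String> queryPartition(int partitionId, int rowsPerPartition) {
        Map<Integer, String> rows = new ConcurrentHashMap<>();
        for (int i = 0; i < rowsPerPartition; i++) {
            int key = partitionId * rowsPerPartition + i;
            rows.put(key, "value-" + key);
        }
        return rows;
    }

    // One worker thread per table partition; all workers feed the same target map,
    // so a slow response on one partition does not stall the others.
    public static Map<Integer, String> loadAll(int partitions, int rowsPerPartition)
            throws Exception {
        ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>(); // stands in for IgniteCache
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int p = 0; p < partitions; p++) {
                int partitionId = p;
                // Real code: iterate the JDBC ResultSet and call streamer.addData(key, binaryValue).
                futures.add(pool.submit(
                    () -> cache.putAll(queryPartition(partitionId, rowsPerPartition))));
            }
            for (Future<?> f : futures)
                f.get(); // propagate any worker failure
        } finally {
            pool.shutdown();
        }
        return cache;
    }
}
```

With a real DataStreamer, a single streamer instance is thread-safe and can be shared by all workers.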
I'd built one with the data streamer [1], but it seems I was limited by the network adapter performance (see my previous post in this thread). That is the reason I decided to move the SQL queries to the data nodes.

Vladimir

On Fri, Feb 26, 2021 at 10:54, Pavel Tupitsyn <[email protected]> wrote:

> Vladimir,
>
> I think all real-world use cases are very valuable to the community. However, we should be careful to avoid misleading conclusions.
>
> We have well-known patterns for loading data from other systems: DataStreamer [1] and CacheStore [2]. The article [3] seems a bit confusing to me, since neither of those two patterns is mentioned there. When proposing a custom approach, it would be great to compare it to the standard alternatives.
>
> [1] https://ignite.apache.org/docs/latest/data-streaming
> [2] https://ignite.apache.org/docs/latest/persistence/custom-cache-store
> [3] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>
> On Fri, Feb 26, 2021 at 9:19 AM Vladimir Tchernyi <[email protected]> wrote:
>
>> Hi Pavel,
>>
>> the code [1] you shared is a kind of in-memory experiment with all the processes inside a single JVM. My work differs: it comes from a big retail business, and hence it is 100% practice-oriented. To be fair, it is oriented to the state of things inside my company, and that is my question: will my results be interesting to the community? I have seen a lot of questions on the user list regarding data loading, and the difficulties here seem to be a blocker in extending Ignite's popularity.
>>
>> Please let me know if my case is not common in the industry. We have a big bare-metal Windows MSSQL server and a number of bare-metal hosts, each with virtualization software and a single CentOS virtual server inside. These CentOS hosts currently form an Ignite cluster with 4 data nodes and 1 client node.
>> The example [2] I published last year is intended to solve the business problem we have out here:
>> 1) the data currently present in the cluster has zero value;
>> 2) the actual data is in the database and must be loaded into the cluster ASAP. We use BinaryObject as cache key and value;
>> 3) the cluster performs some data processing and writes the result to the database.
>>
>> Unfortunately, the code [2] is not 100% OK in my case: it tends to say "is node still alive" and to drop the client node off the cluster. The performance of the MSSQL server and the network is what it is; I consider it a given restriction. It seems I made some progress when I managed to move the data loading process from a single client node to multiple data nodes. When extra data nodes are added, I expect the load performance to improve, at least until my big MSSQL server can no longer hold the load. So I want to know how interesting my results will be if they are published.
>>
>> WDYT?
>>
>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>> [2] https://github.com/vtchernyi/FastDataLoad
>>
>> On Thu, Feb 25, 2021 at 11:01, Pavel Tupitsyn <[email protected]> wrote:
>>
>>> Vladimir,
>>>
>>> Thanks for getting back to us. A full example that clarifies the situation would be great!
>>>
>>> > Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>
>>> It is super trivial, I just wanted to get a sense of the throughput and check if we have some kind of a regression in recent versions (we don't) [1]. Also I realized that the data size can be counted very differently: do we account for DB overhead, and how?
>>>
>>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>>>
>>> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd spent some time thinking about the community comments on my post.
>>>> It seems that Ignite is really not the bottleneck here. The performance of my production MSSQL server is a given restriction, and the problem is to ensure fast loading by executing multiple parallel queries. I'll test my code in production for a couple of months to watch for possible problems. If it is OK, a complete/downloadable/compilable GitHub example will probably be useful for the community.
>>>>
>>>> WDYT?
>>>>
>>>> On Fri, Feb 19, 2021 at 21:47, Vladimir Tchernyi <[email protected]> wrote:
>>>>
>>>>> Pavel,
>>>>>
>>>>> maybe it's time to put your five cents in. Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>>>
>>>>> As for MSSQL data retrieval being the bottleneck - I don't think so; I got a 15 min load time for 1 node and 3.5 min for 4 nodes. That looks like a linear dependency (the table and the RDBMS server were the same).
>>>>> --
>>>>> Vladimir
>>>>>
>>>>> On Fri, Feb 19, 2021 at 19:47, Pavel Tupitsyn <[email protected]> wrote:
>>>>>
>>>>>> > First of all, I tried to select the whole table at once
>>>>>>
>>>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not Ignite.
>>>>>>
>>>>>> Can you run a test where some dummy data of the same size as the real data is generated and inserted into Ignite, so that we test Ignite perf only, excluding MSSQL from the equation? For example, streaming 300 million entries (total size 6 GB) takes around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>>>>
>>>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[email protected]> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>> thanks for your interest in my work.
>>>>>>>
>>>>>>> I didn't try COPY FROM since I tried to work with Ignite SQL a couple of years ago and didn't succeed.
>>>>>>> Probably because the examples available aren't complete/downloadable/compilable (the paper [1] contains a GitHub repo, which is my five cents in changing the status quo). My interest is in the KV API.
>>>>>>>
>>>>>>> I did try a data streamer, and that was my first try. I did not notice a significant time reduction using the code from my paper [1] versus a data streamer/receiver. There was some memory economy with the streamer, though. I must say my experiment was made on a heavily loaded production MSSQL server. A filtered query with a 300K-row resultset takes about 15 sec. The story follows.
>>>>>>>
>>>>>>> First of all, I tried to select the whole table at once; I got a network timeout and the client node was dropped off the cluster ("is node still alive?"). So I partitioned the table and executed a number of queries one-by-one on the client node, each query for a specific table partition. That process took about 90 min. Unacceptable time.
>>>>>>>
>>>>>>> Then I tried to execute my queries in parallel on the client node, each query executing dataStreamer.addData() for a single dataStreamer. The timing was never less than 15 min. All the attempts were the same; probably that was the network throughput limit on the client node (the same interface is used for the resultset and for cluster intercom). I say it again - that was the production environment.
>>>>>>>
>>>>>>> Final schema:
>>>>>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job for one table partition;
>>>>>>> * each job executes an SQL query and constructs a map with binary object keys and values. Then the job executes targetCache.invokeAll(), specifying the constructed map and the static EntryProcessor class.
>>>>>>> The EntryProcessor contains the logic for the cache binary entry update;
>>>>>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>>>>>
>>>>>>> The schema described proved to be network error-free in my production network and gives acceptable timing.
>>>>>>>
>>>>>>> Vladimir
>>>>>>>
>>>>>>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>>
>>>>>>> On Fri, Feb 19, 2021 at 16:41, Stephen Darlington <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think it's more that putAll is mostly atomic, so the more records you save in one chunk, the more locking, etc. happens. Distributing the work as compute jobs means all the putAlls will be local, which is beneficial, and the size of each put is going to be smaller (also beneficial).
>>>>>>>>
>>>>>>>> But that's a lot of work that the data streamer already does for you, and the data streamer also batches updates, so it would still be faster.
>>>>>>>>
>>>>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <[email protected]> wrote:
>>>>>>>>
>>>>>>>> What would be the difference between doing cache.putAll(all rows) and separating them by affinity key + executing putAll inside a compute job? If I'm not mistaken, doing putAll should end up splitting those rows by affinity key on one of the servers, right? Is there a comparison of that?
>>>>>>>>
>>>>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Vladimir,
>>>>>>>>> Did you try to use the SQL command 'COPY FROM <csv_file>' via thin JDBC? This command uses IgniteDataStreamer to write data into the cluster and parses the CSV on the server node.
>>>>>>>>>
>>>>>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load
data.
>>>>>>>>>
>>>>>>>>> Hi Denis,
>>>>>>>>>
>>>>>>>>> The data space is 3.7 GB according to the MSSQL table properties.
>>>>>>>>>
>>>>>>>>> Vladimir
>>>>>>>>>
>>>>>>>>> At 9:47 on Feb 19, 2021, Denis Magda <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hello Vladimir,
>>>>>>>>>
>>>>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> In Sep 2020 I published a paper about Loading Large Datasets into Apache Ignite by Using a Key-Value API (English [1] and Russian [2] versions). The approach described works in production, but shows unacceptable performance for very large tables.
>>>>>>>>>
>>>>>>>>> The story continues, and yesterday I finished a proof of concept for very fast loading of a very big table. A partitioned MSSQL table of about 295 million rows was loaded by a 4-node Ignite cluster in 3 min 35 sec. Each node executed its own SQL queries in parallel and then distributed the loaded values across the other cluster nodes.
>>>>>>>>>
>>>>>>>>> Probably that result will be of interest to the community.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Vladimir Chernyi
>>>>>>>>>
>>>>>>>>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sent from the Yandex.Mail mobile app
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Taras Ledkov
>>>>>>>>> Mail-To: [email protected]
