Hi Pavel, the code [1] you shared is a kind of in-memory experiment with all the processes inside a single JVM. My work is different: it comes from a big retail business, and hence it is 100% practice-oriented. Truth be told, it is oriented to the state of things inside my company, and that is my question - will my results be interesting to the community? I have seen a lot of questions on the user list regarding data loading, and the difficulties here seem to be a blocker to extending Ignite's popularity.
Please let me know if my case is not common in the industry. We have a big
bare-metal Windows MSSQL server and a number of bare-metal hosts, each
running virtualization software with a single CentOS virtual server inside.
These CentOS hosts currently form an Ignite cluster with 4 data nodes and 1
client node. The example [2] I published last year is intended to solve the
business problem we have here:
1) the data currently present in the cluster has zero value;
2) the actual data is in the database and must be loaded into the cluster
ASAP (we use BinaryObject as the cache key and value);
3) the cluster performs some data processing and writes the result to the
database.

Unfortunately, the code [2] is not 100% OK in my case: it tends to say "is
node still alive" and to drop the client node off the cluster. The
performance of the MSSQL server and the network is what it is; I consider
it a given restriction. It seems I made some progress when I managed to
move the data loading process from a single client node to multiple data
nodes. When extra data nodes are added, I expect the load performance to
improve, at least while my big MSSQL server is able to hold the load.

So I want to know how interesting my results would be if published. WDYT?

[1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
[2] https://github.com/vtchernyi/FastDataLoad

Thu, 25 Feb 2021 at 11:01, Pavel Tupitsyn <ptupit...@apache.org>:

> Vladimir,
>
> Thanks for getting back to us. A full example that clarifies the
> situation will be great!
>
> > Can you share your code as a GitHub project? Maybe with the script to
> > reproduce 6 GB of data.
>
> It is super trivial, I just wanted to get a sense of the throughput and
> check if we have some kind of a regression in recent versions (we don't)
> [1].
> Also I realised that the data size can be counted very differently - do
> we account for DB overhead and how?
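A minimal single-threaded DataStreamer throughput check of the kind Pavel describes might look like the sketch below. This is not his actual gist [1]; the cache name, the entry shape, and the entry count are made-up stand-ins, and a real test would run against a multi-node cluster rather than a single local node.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerBench {
    public static void main(String[] args) {
        // Start a single local node; swap in a real cluster config for a fair test.
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("bench");

            long started = System.nanoTime();
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("bench")) {
                for (long i = 0; i < 300_000_000L; i++)
                    streamer.addData(i, "value-" + i); // dummy data instead of MSSQL rows
            } // close() flushes the remaining batches

            System.out.printf("Loaded in %d ms%n", (System.nanoTime() - started) / 1_000_000);
        }
    }
}
```

The streamer batches entries and routes each batch to the node owning the key, which is why it is the usual tool for bulk loading and a reasonable baseline to compare custom schemes against.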
>
> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>
> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <vtcher...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'd spent some time thinking about the community comments on my post. It
>> seems that Ignite is really not the bottleneck here. The performance of
>> my production MSSQL server is a given restriction, and the problem is to
>> ensure fast loading by executing multiple parallel queries. I'll test my
>> code in production for a couple of months for possible problems. If it
>> is OK, a complete/downloadable/compilable GitHub example will probably
>> be useful to the community.
>>
>> WDYT?
>>
>> Fri, 19 Feb 2021 at 21:47, Vladimir Tchernyi <vtcher...@gmail.com>:
>>
>>> Pavel,
>>>
>>> maybe it's time to put your five cents in. Can you share your code as a
>>> GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>
>>> As for MSSQL data retrieval being the bottleneck - I don't think so: I
>>> got a 15 min load time for 1 node and 3.5 min for 4 nodes. Looks like a
>>> linear dependency (the table and the RDBMS server were the same).
>>> --
>>> Vladimir
>>>
>>> Fri, 19 Feb 2021 at 19:47, Pavel Tupitsyn <ptupit...@apache.org>:
>>>
>>>> > First of all, I tried to select the whole table at once
>>>>
>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here,
>>>> not Ignite.
>>>>
>>>> Can you run a test where some dummy data of the same size as the real
>>>> data is generated and inserted into Ignite, so that we test Ignite
>>>> perf only, excluding MSSQL from the equation?
>>>> For example, streaming 300 million entries (total size 6 GB) takes
>>>> around 1 minute on my machine, with a simple single-threaded
>>>> DataStreamer.
>>>>
>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vtcher...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>> thanks for your interest in my work.
>>>>>
>>>>> I didn't try COPY FROM, since I tried to work with Ignite SQL a
>>>>> couple of years ago and didn't succeed - probably because the examples
>>>>> available aren't complete/downloadable/compilable (the paper [1]
>>>>> contains a GitHub repo; that is my five cents in changing the status
>>>>> quo). My interest is in the KV API.
>>>>>
>>>>> I did try a data streamer, and that was my first try. I did not
>>>>> notice a significant time reduction using the code from my paper [1]
>>>>> versus a data streamer/receiver. There was some memory economy with
>>>>> the streamer, though. I must say my experiment was made on a heavily
>>>>> loaded production MSSQL server. A filtered query with a 300K-row
>>>>> result set takes about 15 sec. The story follows.
>>>>>
>>>>> First of all, I tried to select the whole table at once; I got a
>>>>> network timeout, and the client node was dropped off the cluster ("is
>>>>> node still alive?").
>>>>> So I partitioned the table and executed a number of queries
>>>>> one-by-one on the client node, each query for a specific table
>>>>> partition. That process took about 90 min. Unacceptable time.
>>>>>
>>>>> Then I tried to execute my queries in parallel on the client node,
>>>>> each query calling dataStreamer.addData() on a single dataStreamer.
>>>>> The timing was never less than 15 min across all attempts; probably
>>>>> that was the network throughput limit on the client node (the same
>>>>> interface is used for the result set and for cluster intercom). I say
>>>>> it again - that was the production environment.
>>>>>
>>>>> Final schema:
>>>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one
>>>>> job for one table partition;
>>>>> * each job executes an SQL query and constructs a map with a binary
>>>>> object key and value. Then the job executes targetCache.invokeAll(),
>>>>> specifying the constructed map and the static EntryProcessor class.
>>>>> The EntryProcessor contains the logic for the cache binary entry
>>>>> update;
>>>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>>>
>>>>> The schema described proved to be network-error-free in my production
>>>>> network and gives acceptable timing.
>>>>>
>>>>> Vladimir
>>>>>
>>>>> [1]
>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>
>>>>> Fri, 19 Feb 2021 at 16:41, Stephen Darlington <
>>>>> stephen.darling...@gridgain.com>:
>>>>>
>>>>>> I think it's more that putAll is mostly atomic, so the more records
>>>>>> you save in one chunk, the more locking, etc. happens. Distributing
>>>>>> the work as compute jobs means all the putAlls will be local, which
>>>>>> is beneficial, and the size of each put is going to be smaller (also
>>>>>> beneficial).
>>>>>>
>>>>>> But that's a lot of work that the data streamer already does for
>>>>>> you, and the data streamer also batches updates, so it would still
>>>>>> be faster.
>>>>>>
>>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <
>>>>>> maximiliano....@gmail.com> wrote:
>>>>>>
>>>>>> What would be the difference between doing cache.putAll(all rows)
>>>>>> and separating them by affinity key and executing putAll inside a
>>>>>> compute job?
>>>>>> If I'm not mistaken, doing putAll should end up splitting those rows
>>>>>> by affinity key on one of the servers, right?
>>>>>> Is there a comparison of that?
>>>>>>
>>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tled...@gridgain.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Vladimir,
>>>>>>> Did you try the SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>>>>>> This command uses IgniteDataStreamer to write data into the cluster
>>>>>>> and parses the CSV on the server node.
>>>>>>>
>>>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load
>>>>>>> data.
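The per-partition query step in the scheme above (one range-restricted SQL query per table partition, whether run one-by-one on the client or inside compute jobs) can be sketched in plain Java. The table and column names below are illustrative, not taken from the FastDataLoad repo:

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a big table into contiguous key ranges, one SQL query per range. */
public class RangeSplitter {
    /** Builds one range-restricted query per partition; bounds are half-open [lo, hi). */
    public static List<String> partitionQueries(String table, String keyCol,
                                                long minId, long maxId, int partitions) {
        List<String> queries = new ArrayList<>();
        long span = maxId - minId + 1;
        for (int p = 0; p < partitions; p++) {
            long lo = minId + span * p / partitions;
            long hi = minId + span * (p + 1) / partitions;
            queries.add("SELECT * FROM " + table
                + " WHERE " + keyCol + " >= " + lo + " AND " + keyCol + " < " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical 295M-row table split across 4 loader jobs.
        for (String q : partitionQueries("dbo.BigTable", "id", 1, 295_000_000L, 4))
            System.out.println(q);
    }
}
```

Half-open bounds guarantee every row lands in exactly one query, so the per-job row counts can simply be summed in reduce().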
>>>>>>>
>>>>>>> Hi Denis,
>>>>>>>
>>>>>>> The data space is 3.7 GB according to the MSSQL table properties.
>>>>>>>
>>>>>>> Vladimir
>>>>>>>
>>>>>>> 9:47, 19 February 2021, Denis Magda <dma...@apache.org>:
>>>>>>>
>>>>>>> Hello Vladimir,
>>>>>>>
>>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>>
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <vtcher...@gmail.com> wrote:
>>>>>>>
>>>>>>> In Sep 2020 I published a paper about Loading Large Datasets into
>>>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>>>> versions). The approach described works in production but shows
>>>>>>> unacceptable performance for very large tables.
>>>>>>>
>>>>>>> The story continues, and yesterday I finished the proof of concept
>>>>>>> for very fast loading of a very big table. A partitioned MSSQL
>>>>>>> table of about 295 million rows was loaded by the 4-node Ignite
>>>>>>> cluster in 3 min 35 sec. Each node executed its own SQL queries in
>>>>>>> parallel and then distributed the loaded values across the other
>>>>>>> cluster nodes.
>>>>>>>
>>>>>>> Probably that result will be of interest to the community.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Vladimir Chernyi
>>>>>>>
>>>>>>> [1]
>>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>>
>>>>>>> --
>>>>>>> Sent from the Yandex.Mail mobile app
>>>>>>>
>>>>>>> --
>>>>>>> Taras Ledkov
>>>>>>> Mail-To: tled...@gridgain.com
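Vladimir's final schema from earlier in the thread (ComputeTask.map() with one job per table partition, invokeAll() with a static EntryProcessor, reduce() summing row counts) could be wired up roughly as below. This is a hedged sketch, not the FastDataLoad code: the cache name "targetCache", the round-robin job-to-node assignment, and the queryPartition() stub are illustrative stand-ins, and it needs a running Ignite cluster to execute.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeJobResult;
import org.apache.ignite.compute.ComputeTaskAdapter;
import org.apache.ignite.resources.IgniteInstanceResource;

/** One ComputeJob per table partition; reduce() sums the per-job row counts. */
public class LoadTask extends ComputeTaskAdapter<List<Integer>, Long> {
    @Override public Map<? extends ComputeJob, ClusterNode> map(
            List<ClusterNode> nodes, List<Integer> tableParts) {
        Map<ComputeJob, ClusterNode> jobs = new HashMap<>();
        int i = 0;
        for (int part : tableParts)                    // round-robin assignment
            jobs.put(new LoadJob(part), nodes.get(i++ % nodes.size()));
        return jobs;
    }

    @Override public Long reduce(List<ComputeJobResult> results) {
        return results.stream().mapToLong(r -> r.<Long>getData()).sum();
    }

    private static class LoadJob extends ComputeJobAdapter {
        @IgniteInstanceResource private transient Ignite ignite;
        private final int part;

        LoadJob(int part) { this.part = part; }

        @Override public Object execute() {
            // Stand-in for the real JDBC query against one MSSQL table partition.
            Map<BinaryObject, BinaryObject> rows = queryPartition(part);

            IgniteCache<BinaryObject, BinaryObject> cache =
                ignite.cache("targetCache").withKeepBinary();
            // The static EntryProcessor applies each value on the node owning the key.
            cache.invokeAll(rows.keySet(), new Updater(), rows);
            return (long) rows.size();
        }

        private Map<BinaryObject, BinaryObject> queryPartition(int part) {
            return new HashMap<>(); // JDBC + BinaryObjectBuilder work goes here
        }
    }

    private static class Updater
            implements CacheEntryProcessor<BinaryObject, BinaryObject, Void> {
        @SuppressWarnings("unchecked")
        @Override public Void process(MutableEntry<BinaryObject, BinaryObject> e,
                                      Object... args) {
            Map<BinaryObject, BinaryObject> rows = (Map<BinaryObject, BinaryObject>) args[0];
            e.setValue(rows.get(e.getKey())); // the entry-update logic lives here
            return null;
        }
    }
}
```

A client could then run it with ignite.compute().execute(new LoadTask(), partitionIds), keeping both the SQL work and the cache writes on the data nodes, which is the property the thread credits for avoiding the "is node still alive" drops.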