> First of all, I tried to select the whole table at once

Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not Ignite.
Can you run a test where some dummy data of the same size as the real data is generated and inserted into Ignite, so that we test Ignite performance only, excluding MSSQL from the equation? For example, streaming 300 million entries (total size 6 GB) takes around 1 minute on my machine with a simple single-threaded DataStreamer.

On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[email protected]> wrote:

> Hi folks,
> thanks for your interest in my work.
>
> I didn't try COPY FROM, since I tried to work with Ignite SQL a couple
> of years ago and didn't succeed, probably because the available examples
> aren't complete/downloadable/compilable (the paper [1] contains a GitHub
> repo; that is my five cents toward changing the status quo). My interest
> is in the KV API.
>
> I did try a data streamer, and that was my first attempt. I did not
> notice a significant time reduction using the code from my paper [1]
> versus a data streamer/receiver. There was some memory economy with the
> streamer, though. I must say my experiment was made on a heavily loaded
> production MSSQL server. A filtered query with a 300K-row resultset takes
> about 15 sec. The story follows.
>
> First of all, I tried to select the whole table at once; I got a network
> timeout and the client node was dropped off the cluster ("is node still
> alive?").
> So I partitioned the table and executed a number of queries one by one
> on the client node, each query for a specific table partition. That
> process took about 90 min. Unacceptable time.
>
> Then I tried to execute my queries in parallel on the client node, each
> query executing dataStreamer.addData() on a single dataStreamer. The
> timing was not less than 15 min. All the attempts came out about the same;
> probably that was the network throughput limit on the client node (the
> same interface was used for the resultset and for cluster intercom). To
> repeat: that was the production environment.
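The dummy-data test suggested above could be sketched roughly as follows, with a simple single-threaded IgniteDataStreamer. The cache name, entry count, and the ~20-byte payload are illustrative assumptions (300M entries at ~20 bytes gives roughly the 6 GB mentioned), not values confirmed in the thread, and the sketch assumes a server cluster is already up:

```java
import java.util.Arrays;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StreamerBenchmark {
    public static void main(String[] args) {
        // Start a client node; assumes server nodes are already running.
        Ignite ignite = Ignition.start(new IgniteConfiguration()
                .setIgniteInstanceName("loader")
                .setClientMode(true));

        long entries = 300_000_000L;   // same row count as the real table
        byte[] payload = new byte[20]; // dummy value of roughly real-data size
        Arrays.fill(payload, (byte) 1);

        long start = System.currentTimeMillis();

        // try-with-resources: close() flushes any remaining buffered entries.
        try (IgniteDataStreamer<Long, byte[]> streamer =
                 ignite.dataStreamer("testCache")) {
            for (long key = 0; key < entries; key++)
                streamer.addData(key, payload);
        }

        System.out.printf("Streamed %d entries in %d ms%n",
                entries, System.currentTimeMillis() - start);
    }
}
```

Timing this against the MSSQL-backed load should show whether the database read or the Ignite write is the bottleneck.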
>
> Final schema:
> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job
> for one table partition;
> * each job executes an SQL query and constructs a map of binary object
> keys and values. Then the job executes targetCache.invokeAll(), specifying
> the constructed map and the static EntryProcessor class. The
> EntryProcessor contains the logic for the cache binary entry update;
> * ComputeTask.reduce() summarizes the row count reported by each job.
>
> The schema described proved to be network-error-free in my production
> network and gives acceptable timing.
>
> Vladimir
>
> [1]
> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>
> On Fri, Feb 19, 2021 at 4:41 PM, Stephen Darlington
> <[email protected]> wrote:
>
>> I think it's more that putAll is mostly atomic, so the more records
>> you save in one chunk, the more locking, etc. happens. Distributing the
>> work as compute jobs means all the putAlls will be local, which is
>> beneficial, and the size of each put is going to be smaller (also
>> beneficial).
>>
>> But that's a lot of work the data streamer already does for you, and the
>> data streamer also batches updates, so it would still be faster.
>>
>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <[email protected]>
>> wrote:
>>
>> What would be the difference between doing cache.putAll(all rows) and
>> separating the rows by affinity key, then executing putAll inside a
>> compute job?
>> If I'm not mistaken, doing putAll should end up splitting those rows by
>> affinity key on one of the servers, right?
>> Is there a comparison of that?
>>
>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]>
>> wrote:
>>
>>> Hi Vladimir,
>>> Did you try to use the SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>> This command uses IgniteDataStreamer to write data into the cluster and
>>> parses the CSV on the server node.
>>>
>>> PS. AFAIK, IgniteDataStreamer is one of the fastest ways to load data.
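The per-partition job from the final schema above might look roughly like this. The cache name, the stubbed-out JDBC read, and the exact EntryProcessor update logic are hypothetical; the thread only states that each job builds a binary key/value map and applies it via targetCache.invokeAll() with a static EntryProcessor:

```java
import java.util.HashMap;
import java.util.Map;

import javax.cache.processor.EntryProcessorException;
import javax.cache.processor.MutableEntry;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.resources.IgniteInstanceResource;

/** One job per MSSQL table partition, scheduled from ComputeTask.map(). */
public class LoadPartitionJob implements ComputeJob {
    @IgniteInstanceResource
    private transient Ignite ignite;

    private final int partition; // MSSQL table partition this job loads

    public LoadPartitionJob(int partition) { this.partition = partition; }

    @Override public Object execute() {
        // 1. Run this partition's SQL query against MSSQL (stubbed here)
        //    and build a map of binary keys and values.
        Map<BinaryObject, BinaryObject> batch = new HashMap<>();
        // ... JDBC read + ignite.binary().builder(...).build() calls ...

        // 2. Apply the whole batch with a static EntryProcessor; entries
        //    are updated on their primary nodes.
        IgniteCache<BinaryObject, BinaryObject> cache =
            ignite.cache("targetCache").withKeepBinary();
        cache.invokeAll(batch.keySet(), new UpdateProcessor(), batch);

        return batch.size(); // row count, summed up in ComputeTask.reduce()
    }

    @Override public void cancel() { /* no-op */ }

    /** Static processor: sets each entry's value from the passed batch. */
    static class UpdateProcessor
            implements CacheEntryProcessor<BinaryObject, BinaryObject, Void> {
        @Override public Void process(MutableEntry<BinaryObject, BinaryObject> e,
                Object... args) throws EntryProcessorException {
            @SuppressWarnings("unchecked")
            Map<BinaryObject, BinaryObject> batch =
                (Map<BinaryObject, BinaryObject>) args[0];
            e.setValue(batch.get(e.getKey()));
            return null;
        }
    }
}
```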
>>>
>>> Hi Denis,
>>>
>>> The data space is 3.7 GB according to the MSSQL table properties.
>>>
>>> Vladimir
>>>
>>> At 9:47 AM, Feb 19, 2021, Denis Magda <[email protected]> wrote:
>>>
>>> Hello Vladimir,
>>>
>>> Good to hear from you! How much is that in gigabytes?
>>>
>>> -
>>> Denis
>>>
>>>
>>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote:
>>>
>>> In Sep 2020 I published the paper about loading large datasets into
>>> Apache Ignite by using a key-value API (English [1] and Russian [2]
>>> versions). The approach described works in production but shows
>>> unacceptable performance for very large tables.
>>>
>>> The story continues, and yesterday I finished the proof of concept for
>>> very fast loading of a very big table. A partitioned MSSQL table of
>>> about 295 million rows was loaded by a 4-node Ignite cluster in
>>> 3 min 35 sec. Each node executed its own SQL queries in parallel and
>>> then distributed the loaded values across the other cluster nodes.
>>>
>>> Probably that result will be of interest to the community.
>>>
>>> Regards,
>>> Vladimir Chernyi
>>>
>>> [1]
>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>> [2] https://m.habr.com/ru/post/526708/
>>>
>>> --
>>> Sent from the Yandex.Mail mobile app
>>>
>>> --
>>> Taras Ledkov
>>> Mail-To: [email protected]
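One way each of the 4 nodes could decide which slice of the table to query (in the spirit of "each node executed its own SQL queries in parallel" and Stephen's point about local putAlls) is to ask Ignite's affinity function which cache partitions the local node is primary for. This is a sketch under the assumption that the MSSQL table is partitioned compatibly with the Ignite cache; the cache name is hypothetical:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.Affinity;

public class LocalPartitionsDemo {
    public static void main(String[] args) {
        // Runs on a server node that holds a slice of "targetCache".
        Ignite ignite = Ignition.start();

        // Partitions for which the local node is primary. If the source
        // table is partitioned the same way, each node can query only its
        // own slice and write mostly locally, avoiding the single
        // client-node network bottleneck described earlier in the thread.
        Affinity<Object> aff = ignite.affinity("targetCache");
        int[] localParts = aff.primaryPartitions(ignite.cluster().localNode());

        System.out.println("This node owns " + localParts.length +
            " of " + aff.partitions() + " partitions");
    }
}
```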
