Re: very fast loading of very big table

Vladimir Tchernyi Fri, 19 Feb 2021 10:47:32 -0800

Pavel,

maybe it's time to put your five-cent in. Can you share your code as a
GitHub project? Maybe with the script to reproduce 6 GB of data.


As for MSSQL data retrieval being the bottleneck - don't think so, I got 15
min load time for 1 node and 3.5 min time for 4 nodes. Looks like a linear
dependency (the table and the RDBMS server were the same).
--
Vladimir

пт, 19 февр. 2021 г. в 19:47, Pavel Tupitsyn <[email protected]>:

> > First of all, I tried to select the whole table as once
>
> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
> Ignite.
>
> Can you run a test where some dummy data of the same size as real data is
> generated and inserted into Ignite,
> so that we test Ignite perf only, excluding MSSQL from the equation?
> For example, streaming 300 million entries (total size 6 GB) takes around
> 1 minute on my machine, with a simple single-threaded DataStreamer.
>
> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[email protected]>
> wrote:
>
>> Hi folks,
>> thanks for your interest in my work.
>>
>> I didn't try COPY FROM since I've tried to work with Ignite SQL a couple
>> of years ago and didn't succeed. Probably because examples available aren't
>> complete/downloadable/compilable (the paper [1] contains GitHub repo, that
>> is my five cents in changing the status quo). My interest is in KV API.
>>
>> I did try a data streamer, and that was my first try. I did not notice a
>> significant time reduction in using code from my paper [1] versus data
>> streamer/receiver. There was some memory economy with the streamer, though.
>> I must say my experiment was made on a heavily loaded production mssql
>> server. Filtered query with 300K rows resultset takes about 15 sec. The
>> story follows.
>>
>> First of all, I tried to select the whole table as once, I got the
>> network timeout and the client node was dropped off the cluster (is node
>> still alive?).
>> So I'd partitioned the table and executed a number of queries one-by-one
>> on the client node, each query for the specific table partition. That
>> process took about 90 min. Inacceptable time.
>>
>> Then I tried to execute my queries in parallel on the client node, each
>> query executing dataStreamer.addData() for a single dataStreamer. The
>> timing was not less than 15 min. All the attempts were the same, probably
>> that was the network throughput limit on the client node (same interface
>> used for the resultset and for cluster intercom). Say it again - that was
>> the production environment.
>>
>> Final schema:
>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job
>> for one table partition;
>> * each job executes SQL query, constructs a map with binary object key
>> and value. Then the job executes targetCache.invokeAll() specifying the
>> constructed map and the static EntryProcessor class. The EntryProcessor
>> contains the logic for cache binary entry update;
>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>
>> The schema described proved to be network error-free in my production
>> network and gives acceptable timing.
>>
>> Vladimir
>>
>> [1]
>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>
>> пт, 19 февр. 2021 г. в 16:41, Stephen Darlington <
>> [email protected]>:
>>
>>> I think it’s more that that putAll is mostly atomic, so the more records
>>> you save in one chunk, the more locking, etc. happens. Distributing as
>>> compute jobs means all the putAlls will be local which is beneficial, and
>>> the size of each put is going to be smaller (also beneficial).
>>>
>>> But that’s a lot of work that the data streamer already does for you and
>>> the data streamer also batches updates so would still be faster.
>>>
>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <[email protected]>
>>> wrote:
>>>
>>> What would be the difference between doing cache.putAll(all rows) and
>>> separating them by affinity key+executing putAll inside a compute job.
>>> If I'm not mistaken, doing putAll should end up splitting those rows by
>>> affinity key in one of the servers, right?
>>> Is there a comparison of that?
>>>
>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]>
>>> wrote:
>>>
>>>> Hi Vladimir,
>>>> Did you try to use SQL command 'COPY FROM <csv_file>' via thin JDBC?
>>>> This command uses 'IgniteDataStreamer' to write data into cluster and
>>>> parse CSV on the server node.
>>>>
>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>>>
>>>> Hi Denis,
>>>>
>>>> Data space is 3.7Gb according to MSSQL table properries
>>>>
>>>> Vladimir
>>>>
>>>> 9:47, 19 февраля 2021 г., Denis Magda <[email protected]>
>>>> <[email protected]>:
>>>>
>>>> Hello Vladimir,
>>>>
>>>> Good to hear from you! How much is that in gigabytes?
>>>>
>>>> -
>>>> Denis
>>>>
>>>>
>>>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote:
>>>>
>>>> Sep 2020 I've published the paper about Loading Large Datasets into
>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>> version). The approach described works in production, but shows
>>>> inacceptable perfomance for very large tables.
>>>>
>>>> The story continues, and yesterday I've finished the proof of concept
>>>> for very fast loading of very big table. The partitioned MSSQL table about
>>>> 295 million rows was loaded by the 4-node Ignite cluster in 3 min 35 sec.
>>>> Each node had executed its own SQL queries in parallel and then distributed
>>>> the loaded values across the other cluster nodes.
>>>>
>>>> Probably that result will be of interest for the community.
>>>>
>>>> Regards,
>>>> Vladimir Chernyi
>>>>
>>>> [1]
>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>> [2] https://m.habr.com/ru/post/526708/
>>>>
>>>>
>>>>
>>>> --
>>>> Отправлено из мобильного приложения Яндекс.Почты
>>>>
>>>> --
>>>> Taras Ledkov
>>>> Mail-To: [email protected]
>>>>
>>>>
>>>
>>>

Re: very fast loading of very big table

Reply via email to