Hi, I've spent some time thinking about the community comments on my post. It seems that Ignite is really not the bottleneck here. The performance of my production MSSQL server is a given constraint, and the problem is to ensure fast loading by executing multiple parallel queries. I'll test my code in production for a couple of months to shake out possible problems. If all goes well, a complete/downloadable/compilable GitHub example will probably be useful for the community.
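To make the planned GitHub example a bit more concrete, below is a rough sketch of the per-partition load schema from the "Final schema" part of my quoted message below. The class names (PartitionLoadTask, Person), the cache name "person", the JDBC URL and the partition SQL are placeholders, and the binary-object keys/values of the real code are simplified to plain types - this is a sketch, not the production code.

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.cache.processor.MutableEntry;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeJobResult;
import org.apache.ignite.compute.ComputeTaskAdapter;
import org.apache.ignite.resources.IgniteInstanceResource;

/** Schedules one job per MSSQL table partition and sums the loaded row counts. */
public class PartitionLoadTask extends ComputeTaskAdapter<List<Integer>, Long> {
    @Override public Map<? extends ComputeJob, ClusterNode> map(List<ClusterNode> subgrid, List<Integer> partIds) {
        Map<ComputeJob, ClusterNode> jobs = new HashMap<>();
        int i = 0;
        for (Integer partId : partIds)
            jobs.put(new PartitionLoadJob(partId), subgrid.get(i++ % subgrid.size())); // round-robin over nodes
        return jobs;
    }

    @Override public Long reduce(List<ComputeJobResult> results) {
        return results.stream().mapToLong(r -> r.<Long>getData()).sum(); // total rows loaded
    }

    /** Loads one table partition over JDBC and pushes it into the cache via invokeAll(). */
    private static class PartitionLoadJob extends ComputeJobAdapter {
        private final int partId;

        @IgniteInstanceResource
        private transient Ignite ignite; // injected on the executing node

        PartitionLoadJob(int partId) { this.partId = partId; }

        @Override public Object execute() {
            Map<Long, Person> batch = new HashMap<>();

            // Placeholder MSSQL query for a single table partition.
            String sql = "SELECT id, name FROM dbo.Person WHERE $PARTITION.personPf(id) = ?";

            try (Connection conn = DriverManager.getConnection("jdbc:sqlserver://mssql-host;databaseName=prod");
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, partId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next())
                        batch.put(rs.getLong(1), new Person(rs.getString(2)));
                }
            }
            catch (Exception e) {
                throw new RuntimeException("Failed to load partition " + partId, e);
            }

            IgniteCache<Long, Person> cache = ignite.cache("person");

            // Static entry processor; the loaded batch is passed as an argument.
            cache.invokeAll(batch.keySet(), new UpsertProcessor(), batch);

            return (long) batch.size();
        }
    }

    /** Writes the pre-loaded value for each key it is invoked on. */
    private static class UpsertProcessor implements CacheEntryProcessor<Long, Person, Void> {
        @SuppressWarnings("unchecked")
        @Override public Void process(MutableEntry<Long, Person> e, Object... args) {
            e.setValue(((Map<Long, Person>) args[0]).get(e.getKey()));
            return null;
        }
    }

    /** Minimal placeholder value class. */
    public static class Person implements Serializable {
        public final String name;
        public Person(String name) { this.name = name; }
    }
}

It would be launched from the client with something like ignite.compute().execute(new PartitionLoadTask(), partitionIds); connection pooling, retries and the real binary-object handling are omitted here.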
WDYT?

On Fri, Feb 19, 2021 at 21:47, Vladimir Tchernyi <[email protected]> wrote:

> Pavel,
>
> maybe it's time to put your five cents in. Can you share your code as a
> GitHub project? Maybe with a script to reproduce the 6 GB of data.
>
> As for MSSQL data retrieval being the bottleneck - I don't think so: I got
> a 15 min load time for 1 node and 3.5 min for 4 nodes. Looks like a
> linear dependency (the table and the RDBMS server were the same).
> --
> Vladimir
>
> On Fri, Feb 19, 2021 at 19:47, Pavel Tupitsyn <[email protected]> wrote:
>
>> > First of all, I tried to select the whole table at once
>>
>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
>> Ignite.
>>
>> Can you run a test where some dummy data of the same size as the real
>> data is generated and inserted into Ignite, so that we test Ignite perf
>> only, excluding MSSQL from the equation?
>> For example, streaming 300 million entries (total size 6 GB) takes around
>> 1 minute on my machine, with a simple single-threaded DataStreamer.
>>
>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[email protected]>
>> wrote:
>>
>>> Hi folks,
>>> thanks for your interest in my work.
>>>
>>> I didn't try COPY FROM, since I tried to work with Ignite SQL a couple
>>> of years ago and didn't succeed - probably because the available
>>> examples aren't complete/downloadable/compilable (the paper [1] contains
>>> a GitHub repo, that is my five cents toward changing the status quo).
>>> My interest is in the KV API.
>>>
>>> I did try a data streamer, and that was my first attempt. I did not
>>> notice a significant time reduction using the code from my paper [1]
>>> versus the data streamer/receiver. There was some memory economy with
>>> the streamer, though. I must say my experiment was made on a heavily
>>> loaded production MSSQL server, where a filtered query with a 300K-row
>>> resultset takes about 15 sec. The story follows.
>>>
>>> First of all, I tried to select the whole table at once; I got a network
>>> timeout and the client node was dropped off the cluster ("is node still
>>> alive?").
>>> So I partitioned the table and executed a number of queries one by one
>>> on the client node, each query for a specific table partition. That
>>> process took about 90 min. Unacceptable time.
>>>
>>> Then I tried to execute my queries in parallel on the client node, each
>>> query executing dataStreamer.addData() for a single dataStreamer. The
>>> timing was never less than 15 min. All the attempts were the same,
>>> probably because of the network throughput limit on the client node (the
>>> same interface was used for the resultset and for cluster intercom). I
>>> say it again - that was the production environment.
>>>
>>> Final schema:
>>> * ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job
>>> for one table partition;
>>> * each job executes an SQL query and constructs a map with binary object
>>> keys and values. Then the job executes targetCache.invokeAll(),
>>> specifying the constructed map and the static EntryProcessor class. The
>>> EntryProcessor contains the logic for the cache binary entry update;
>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>
>>> The schema described proved to be network error-free in my production
>>> network and gives acceptable timing.
>>>
>>> Vladimir
>>>
>>> [1]
>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>
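[As an aside to Pavel's dummy-data suggestion quoted above: a single-threaded streaming test could be as small as the sketch below. The config path "client-config.xml", the cache name "person" and the value format are placeholders, and the cache is expected to exist already.]

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerPerfTest {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start("client-config.xml");
             IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("person")) {
            long start = System.currentTimeMillis();

            // ~20 bytes of payload per entry, 300 million entries, roughly 6 GB in total.
            for (long i = 0; i < 300_000_000L; i++)
                streamer.addData(i, "dummy-value-" + i);

            streamer.flush();

            System.out.println("Streamed in " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}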
>>> On Fri, Feb 19, 2021 at 16:41, Stephen Darlington
>>> <[email protected]> wrote:
>>>
>>>> I think it's more that putAll is mostly atomic, so the more records you
>>>> save in one chunk, the more locking, etc. happens. Distributing as
>>>> compute jobs means all the putAlls will be local, which is beneficial,
>>>> and the size of each put is going to be smaller (also beneficial).
>>>>
>>>> But that's a lot of work that the data streamer already does for you,
>>>> and the data streamer also batches updates, so it would still be faster.
>>>>
>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez
>>>> <[email protected]> wrote:
>>>>
>>>> What would be the difference between doing cache.putAll(all rows) and
>>>> separating them by affinity key + executing putAll inside a compute job?
>>>> If I'm not mistaken, doing putAll should end up splitting those rows by
>>>> affinity key on one of the servers, right?
>>>> Is there a comparison of that?
>>>>
>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Vladimir,
>>>>> Did you try to use the SQL command 'COPY FROM <csv_file>' via thin
>>>>> JDBC? This command uses 'IgniteDataStreamer' to write data into the
>>>>> cluster and parses the CSV on the server node.
>>>>>
>>>>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>>>>
>>>>> Hi Denis,
>>>>>
>>>>> Data space is 3.7 GB according to the MSSQL table properties.
>>>>>
>>>>> Vladimir
>>>>>
>>>>> 9:47, February 19, 2021, Denis Magda <[email protected]>:
>>>>>
>>>>> Hello Vladimir,
>>>>>
>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>
>>>>> -
>>>>> Denis
>>>>>
>>>>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote:
>>>>>
>>>>> In September 2020 I published a paper about Loading Large Datasets into
>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>> versions). The approach described works in production, but shows
>>>>> unacceptable performance for very large tables.
>>>>>
>>>>> The story continues, and yesterday I finished a proof of concept for
>>>>> very fast loading of a very big table. A partitioned MSSQL table of
>>>>> about 295 million rows was loaded by a 4-node Ignite cluster in 3 min
>>>>> 35 sec. Each node executed its own SQL queries in parallel and then
>>>>> distributed the loaded values across the other cluster nodes.
>>>>>
>>>>> Probably that result will be of interest to the community.
>>>>>
>>>>> Regards,
>>>>> Vladimir Chernyi
>>>>>
>>>>> [1]
>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>
>>>>> --
>>>>> Sent from the Yandex.Mail mobile app
>>>>>
>>>>> --
>>>>> Taras Ledkov
>>>>> Mail-To: [email protected]
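For completeness, the 'COPY FROM' route Taras mentions above would look roughly like this through the JDBC thin driver; the table definition, column list and CSV path are placeholders, not from this thread.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyFromCsv {
    public static void main(String[] args) throws Exception {
        // Placeholder thin-driver URL; the driver ships with ignite-core.
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "CREATE TABLE IF NOT EXISTS Person (id BIGINT PRIMARY KEY, name VARCHAR)");

            // Per the message above, COPY feeds the rows through IgniteDataStreamer.
            stmt.executeUpdate(
                "COPY FROM '/data/person.csv' INTO Person (id, name) FORMAT CSV");
        }
    }
}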
