Vladimir,

I think all real-world use cases are very valuable to the community.
However, we should be careful to avoid misleading conclusions.

We have well-known patterns for loading data from other systems:
DataStreamer [1] and CacheStore [2].
The article [3] seems a bit confusing to me, since neither of those two
patterns is mentioned there.
When proposing a custom approach, it would be great to compare it to the
standard alternatives.
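
For reference, the DataStreamer pattern [1] boils down to something like
the following minimal sketch (the cache name, key/value types and record
count are made up for illustration):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;
    import org.apache.ignite.Ignition;

    public class StreamerSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // The target cache must exist before a streamer is created for it.
                ignite.getOrCreateCache("myCache");

                try (IgniteDataStreamer<Long, String> streamer =
                         ignite.dataStreamer("myCache")) {
                    // Allow updating keys that are already present in the cache.
                    streamer.allowOverwrite(true);

                    for (long i = 0; i < 1_000_000; i++)
                        streamer.addData(i, "value-" + i); // batched and routed per node

                    // close() flushes the remaining per-node buffers.
                }
            }
        }
    }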

[1] https://ignite.apache.org/docs/latest/data-streaming
[2] https://ignite.apache.org/docs/latest/persistence/custom-cache-store
[3]
https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api

On Fri, Feb 26, 2021 at 9:19 AM Vladimir Tchernyi <vtcher...@gmail.com>
wrote:

> Hi Pavel,
>
> the code [1] you shared is a kind of in-memory experiment with all the
> processes inside a single JVM. My work differs: it comes from a big retail
> business, and hence it is 100% practice-oriented. To be fair, it is
> oriented toward the state of things inside my company, and that is my
> question - will my results be interesting to the community? I have seen a
> lot of questions on the user list regarding data loading, and the
> difficulties here seem to be a blocker to extending Ignite's popularity.
>
> Please let me know if my case is not common in the industry. We have a big
> bare-metal Windows MSSQL server and a number of bare-metal hosts, each with
> virtualization software and a single CentOS virtual server inside. These
> CentOS hosts currently form an Ignite cluster with 4 data nodes and 1
> client node. The example [2] I published last year is intended to solve the
> business problem we have here:
> 1) the data currently present in the cluster has zero value;
> 2) the actual data is in the database and must be loaded into the cluster
> ASAP. We use BinaryObject as the cache key and value;
> 3) the cluster performs some data processing and writes the result to the
> database.
>
> Unfortunately, the code [2] is not 100% OK in my case: it tends to say
> "is node still alive" and to drop the client node off the cluster. The
> performance of the MSSQL server and the network is what it is; I consider
> it a given restriction. It seems I made some progress when I managed to
> move the data loading process from a single client node to multiple data
> nodes. When extra data nodes are added, I expect the load performance to
> improve, at least as long as my big MSSQL server can handle the load. So I
> want to know how interesting my results would be if they were published.
>
> WDYT?
>
> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
> [2] https://github.com/vtchernyi/FastDataLoad
>
> On Thu, Feb 25, 2021 at 11:01 AM Pavel Tupitsyn <ptupit...@apache.org> wrote:
>
>> Vladimir,
>>
>> Thanks for getting back to us. A full example that clarifies the
>> situation would be great!
>>
>> > Can you share your code as a GitHub project? Maybe with the script to
>> reproduce 6 GB of data.
>>
>> It is super trivial; I just wanted to get a sense of the throughput and
>> check whether we have some kind of regression in recent versions (we
>> don't) [1].
>> Also, I realised that the data size can be counted very differently: do
>> we account for DB overhead, and how?
>>
>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>>
>> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <vtcher...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I've spent some time thinking about the community comments on my post.
>>> It seems that Ignite is really not the bottleneck here. The performance
>>> of my production MSSQL server is a given restriction, and the problem is
>>> to ensure fast loading by executing multiple parallel queries. I'll test
>>> my code in production for a couple of months to catch possible problems.
>>> If it turns out OK, a complete/downloadable/compilable GitHub example
>>> will probably be useful for the community.
>>>
>>> WDYT?
>>>
>>> On Fri, Feb 19, 2021 at 9:47 PM Vladimir Tchernyi <vtcher...@gmail.com> wrote:
>>>
>>>> Pavel,
>>>>
>>>> maybe it's time to put your five cents in. Can you share your code as a
>>>> GitHub project? Maybe with the script to reproduce 6 GB of data.
>>>>
>>>> As for MSSQL data retrieval being the bottleneck - I don't think so; I
>>>> got a 15 min load time for 1 node and 3.5 min for 4 nodes. That looks
>>>> like a linear dependency (the table and the RDBMS server were the same).
>>>> --
>>>> Vladimir
>>>>
>>>> On Fri, Feb 19, 2021 at 7:47 PM Pavel Tupitsyn <ptupit...@apache.org> wrote:
>>>>
>>>>> > First of all, I tried to select the whole table at once
>>>>>
>>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here,
>>>>> not Ignite.
>>>>>
>>>>> Can you run a test where some dummy data of the same size as real data
>>>>> is generated and inserted into Ignite,
>>>>> so that we test Ignite perf only, excluding MSSQL from the equation?
>>>>> For example, streaming 300 million entries (total size 6 GB) takes
>>>>> around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>>>
>>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <vtcher...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi folks,
>>>>>> thanks for your interest in my work.
>>>>>>
>>>>>> I didn't try COPY FROM, since I tried to work with Ignite SQL a couple
>>>>>> of years ago and didn't succeed, probably because the available
>>>>>> examples aren't complete/downloadable/compilable (the paper [1]
>>>>>> contains a GitHub repo; that is my five cents toward changing the
>>>>>> status quo). My interest is in the KV API.
>>>>>>
>>>>>> I did try a data streamer, and that was my first attempt. I did not
>>>>>> notice a significant time reduction using the code from my paper [1]
>>>>>> versus a data streamer/receiver. There were some memory savings with
>>>>>> the streamer, though. I must say my experiment was made on a heavily
>>>>>> loaded production MSSQL server; a filtered query with a 300K-row
>>>>>> result set takes about 15 sec. The story follows.
>>>>>>
>>>>>> First of all, I tried to select the whole table at once; I got a
>>>>>> network timeout and the client node was dropped off the cluster (is
>>>>>> node still alive?).
>>>>>> So I partitioned the table and executed a number of queries one by one
>>>>>> on the client node, each query for a specific table partition. That
>>>>>> process took about 90 min. Unacceptable time.
>>>>>>
>>>>>> Then I tried to execute my queries in parallel on the client node,
>>>>>> each query calling dataStreamer.addData() on a single dataStreamer.
>>>>>> The timing was never less than 15 min. All the attempts were about the
>>>>>> same; probably that was the network throughput limit on the client
>>>>>> node (the same interface is used for the result set and for cluster
>>>>>> intercom). To say it again - that was the production environment.
>>>>>>
>>>>>> Final schema:
>>>>>> * ComputeTask.map() schedules ComputeJobs amongst the cluster nodes,
>>>>>> one job per table partition;
>>>>>> * each job executes its SQL query and constructs a map with binary
>>>>>> object keys and values. Then the job calls targetCache.invokeAll(),
>>>>>> specifying the constructed map and the static EntryProcessor class.
>>>>>> The EntryProcessor contains the logic for updating the cache binary
>>>>>> entry;
>>>>>> * ComputeTask.reduce() sums up the row counts reported by each job.
>>>>>>
>>>>>> The schema described proved to be network error-free in my production
>>>>>> network and gives acceptable timing.
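>>>>>>
>>>>>> Roughly, a single job looks like this (a simplified sketch, not the
>>>>>> production code [2]; the connection string, cache, table, column and
>>>>>> type names are illustrative only):
>>>>>>
>>>>>> import java.sql.Connection;
>>>>>> import java.sql.DriverManager;
>>>>>> import java.sql.PreparedStatement;
>>>>>> import java.sql.ResultSet;
>>>>>> import java.sql.SQLException;
>>>>>> import java.util.HashMap;
>>>>>> import java.util.Map;
>>>>>> import javax.cache.processor.MutableEntry;
>>>>>> import org.apache.ignite.Ignite;
>>>>>> import org.apache.ignite.IgniteCache;
>>>>>> import org.apache.ignite.IgniteException;
>>>>>> import org.apache.ignite.binary.BinaryObject;
>>>>>> import org.apache.ignite.cache.CacheEntryProcessor;
>>>>>> import org.apache.ignite.compute.ComputeJobAdapter;
>>>>>> import org.apache.ignite.resources.IgniteInstanceResource;
>>>>>>
>>>>>> public class LoadPartitionJob extends ComputeJobAdapter {
>>>>>>     @IgniteInstanceResource
>>>>>>     private transient Ignite ignite;
>>>>>>
>>>>>>     private final int part; // table partition assigned in ComputeTask.map()
>>>>>>
>>>>>>     public LoadPartitionJob(int part) { this.part = part; }
>>>>>>
>>>>>>     @Override public Object execute() {
>>>>>>         IgniteCache<BinaryObject, BinaryObject> cache =
>>>>>>             ignite.cache("targetCache").withKeepBinary();
>>>>>>
>>>>>>         Map<BinaryObject, BinaryObject> rows = new HashMap<>();
>>>>>>
>>>>>>         // Each job runs its own query for its own table partition.
>>>>>>         try (Connection c = DriverManager.getConnection("jdbc:sqlserver://...");
>>>>>>              PreparedStatement st = c.prepareStatement(
>>>>>>                  "SELECT id, name FROM bigTable WHERE part = ?")) {
>>>>>>             st.setInt(1, part);
>>>>>>
>>>>>>             try (ResultSet rs = st.executeQuery()) {
>>>>>>                 while (rs.next()) {
>>>>>>                     rows.put(
>>>>>>                         ignite.binary().builder("RowKey")
>>>>>>                             .setField("id", rs.getLong(1)).build(),
>>>>>>                         ignite.binary().builder("RowVal")
>>>>>>                             .setField("name", rs.getString(2)).build());
>>>>>>                 }
>>>>>>             }
>>>>>>         }
>>>>>>         catch (SQLException e) {
>>>>>>             throw new IgniteException(e);
>>>>>>         }
>>>>>>
>>>>>>         // The static EntryProcessor runs on the primary node of each
>>>>>>         // key and applies the update logic.
>>>>>>         cache.invokeAll(rows.keySet(), new UpsertProcessor(), rows);
>>>>>>
>>>>>>         return rows.size(); // row count, summed up in ComputeTask.reduce()
>>>>>>     }
>>>>>>
>>>>>>     /** Writes the value prepared by the job; the real update logic goes here. */
>>>>>>     public static class UpsertProcessor
>>>>>>         implements CacheEntryProcessor<BinaryObject, BinaryObject, Void> {
>>>>>>         @SuppressWarnings("unchecked")
>>>>>>         @Override public Void process(
>>>>>>             MutableEntry<BinaryObject, BinaryObject> e, Object... args) {
>>>>>>             Map<BinaryObject, BinaryObject> rows =
>>>>>>                 (Map<BinaryObject, BinaryObject>)args[0];
>>>>>>             e.setValue(rows.get(e.getKey()));
>>>>>>             return null;
>>>>>>         }
>>>>>>     }
>>>>>> }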
>>>>>>
>>>>>> Vladimir
>>>>>>
>>>>>> [1]
>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>
>>>>>> On Fri, Feb 19, 2021 at 4:41 PM Stephen Darlington <
>>>>>> stephen.darling...@gridgain.com> wrote:
>>>>>>
>>>>>>> I think it’s more that putAll is mostly atomic, so the more records
>>>>>>> you save in one chunk, the more locking, etc. happens. Distributing
>>>>>>> as compute jobs means all the putAlls will be local, which is
>>>>>>> beneficial, and the size of each put is going to be smaller (also
>>>>>>> beneficial).
>>>>>>>
>>>>>>> But that’s a lot of work that the data streamer already does for you,
>>>>>>> and the data streamer also batches updates, so it would still be
>>>>>>> faster.
>>>>>>>
>>>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <
>>>>>>> maximiliano....@gmail.com> wrote:
>>>>>>>
>>>>>>> What would be the difference between doing cache.putAll(all rows)
>>>>>>> and separating them by affinity key + executing putAll inside a
>>>>>>> compute job?
>>>>>>> If I'm not mistaken, doing putAll should end up splitting those rows
>>>>>>> by affinity key on one of the servers, right?
>>>>>>> Is there a comparison of that?
>>>>>>>
>>>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <tled...@gridgain.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Vladimir,
>>>>>>>> Did you try to use the SQL command 'COPY FROM <csv_file>' via thin
>>>>>>>> JDBC? This command uses 'IgniteDataStreamer' to write data into the
>>>>>>>> cluster and parses the CSV on the server node.
>>>>>>>>
>>>>>>>> PS. AFAIK, IgniteDataStreamer is one of the fastest ways to load
>>>>>>>> data.
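>>>>>>>>
>>>>>>>> For example, something like this over the thin JDBC driver (a
>>>>>>>> hypothetical sketch: the host, table and CSV layout are made up,
>>>>>>>> and the target table must already exist):
>>>>>>>>
>>>>>>>> import java.sql.Connection;
>>>>>>>> import java.sql.DriverManager;
>>>>>>>> import java.sql.Statement;
>>>>>>>>
>>>>>>>> public class CopyFromSketch {
>>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>>         // Thin JDBC connection to any cluster node.
>>>>>>>>         try (Connection conn =
>>>>>>>>                  DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1");
>>>>>>>>              Statement stmt = conn.createStatement()) {
>>>>>>>>             stmt.executeUpdate(
>>>>>>>>                 "COPY FROM '/data/bigTable.csv' " +
>>>>>>>>                 "INTO bigTable (id, name) FORMAT CSV");
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>> }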
>>>>>>>>
>>>>>>>> Hi Denis,
>>>>>>>>
>>>>>>>> Data space is 3.7 GB according to the MSSQL table properties.
>>>>>>>>
>>>>>>>> Vladimir
>>>>>>>>
>>>>>>>> On Fri, Feb 19, 2021 at 9:47 AM Denis Magda <dma...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Hello Vladimir,
>>>>>>>>
>>>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>>>
>>>>>>>> -
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <vtcher...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> In Sep 2020 I published a paper about Loading Large Datasets into
>>>>>>>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>>>>>>>> versions). The approach described works in production, but shows
>>>>>>>> unacceptable performance for very large tables.
>>>>>>>>
>>>>>>>> The story continues, and yesterday I finished the proof of concept
>>>>>>>> for very fast loading of a very big table. A partitioned MSSQL table
>>>>>>>> of about 295 million rows was loaded into the 4-node Ignite cluster
>>>>>>>> in 3 min 35 sec. Each node executed its own SQL queries in parallel
>>>>>>>> and then distributed the loaded values across the other cluster
>>>>>>>> nodes.
>>>>>>>>
>>>>>>>> Probably that result will be of interest to the community.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vladimir Chernyi
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sent from the Yandex.Mail mobile app
>>>>>>>>
>>>>>>>> --
>>>>>>>> Taras Ledkov
>>>>>>>> Mail-To: tled...@gridgain.com
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
