Re: Re: BinaryObject Data Cannot Be Mapped To SQL Data

2022-04-19 Thread Vladimir Tchernyi
Hello Huty,
please read my post [1]. The approach described there has worked successfully in
production for more than a year and seems to be correct.
[1]
https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api

Vladimir
telegram @vtchernyi

PS
hope I spelled your name correctly; it is not widespread here in Russia

Tue, 19 Apr 2022 at 09:46, y :

> Hi Vladimir,
> Thank you for your answer. Emmm... actually, most of my methods are the
> same as yours except for the following two points:
> 1. I didn't use ComputeTask. The data is sent to the server node through
> the thin client.
>
> 2. I didn't use a standard POJO. The key type is the code below and
> the value type is an empty class. That means all columns are dynamically
> specified through BinaryObjectBuilder.
>
> public class PubPartionKeys_1_7 {
>     @AffinityKeyMapped
>     private String TBDATA_DX01;
>     private String TBDATA_DX02;
>     private String TBDATA_DX03;
>     private String TBDATA_DX04;
>     private String TBDATA_DX05;
>     private String TBDATA_DX06;
>     private String TBDATA_DX07;
>
>     public PubPartionKeys_1_7() {
>     }
>
>     // getters and setters omitted
> }
>
> I would appreciate it very much if you could attach your code! :)
>
> Huty,
> 2022/4/19
>
>
> At 2022-04-19 12:40:20, vtcher...@gmail.com wrote:
>
> Hi,
>
> I have had the same experience without SQL, using the KV API only. My cluster
> consists of several data nodes and a self-written jar application that starts
> the client node. When started, the client node executes map-reduce tasks for
> data loading and processing.
>
> The workaround is as follows:
> 1. create a POJO on the client node;
> 2. convert it to a binary object;
> 3. on the data node, get the binary object over the network and get its
> builder (obj.toBuilder());
> 4. set some fields, build, and put the result in the cache.
>
> The builder in step 3 seems to be the same as the one on the client
> node.
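>
> Something like this minimal sketch, assuming a started Ignite instance
> 'ignite' and a hypothetical Person class and "persons" cache (all names
> are illustrative, not from the original thread):
>
> import org.apache.ignite.binary.BinaryObject;
> import org.apache.ignite.binary.BinaryObjectBuilder;
>
> // steps 1-2, on the client node
> Person pojo = new Person("John", 42);
> BinaryObject template = ignite.binary().toBinary(pojo);
>
> // steps 3-4, on the data node (e.g. inside a compute job)
> BinaryObjectBuilder builder = template.toBuilder();
> builder.setField("salary", 100_000);
> ignite.cache("persons").withKeepBinary().put(1, builder.build());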
>
> Hope that helps,
> Vladimir
>
13:06, 18 April 2022, y :
>
> Hi,
> When using binary objects to insert data, I need to get an existing
> BinaryObject/BinaryObjectBuilder from the database, similar to the code
> below.
> [inline image with code, not preserved in the archive]
>
> If I create a BinaryObjectBuilder directly, inserting binary data does not
> map to table data. The following code does not throw an error, but the data
> is not mapped to SQL. If there is no data in my table at first, how
> can I insert data?
> [inline image with code, not preserved in the archive]
>
> --
> Sent from the Yandex Mail mobile app


rendezvous partitions on the web

2021-11-23 Thread Vladimir Tchernyi
Thanks,

I'd found that already in my recent project repo. The question is about the
website: I did not manage to find that info there.
"How to set a non-default partition number for some cache" - do we have a
Java/XML example on the ignite.apache.org website?

Vladimir

PS: sorry for the missing thread name, I've just changed it

13:49, 23 November 2021, andrei :

Hi,

You can set it as part of CacheConfiguration:

https://www.javadoc.io/doc/org.apache.ignite/ignite-core/1.6.0/org/apache/ignite/configuration/CacheConfiguration.html#setAffinity(org.apache.ignite.cache.affinity.AffinityFunction)
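
A minimal Java sketch (the cache name is illustrative; 512 is just an
example of a non-default partition count):

import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("myCache");
// args: excludeNeighbors = false, number of partitions = 512
cfg.setAffinity(new RendezvousAffinityFunction(false, 512));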

Regards,
Andrei

11/23/2021 1:39 PM, vtcher...@gmail.com wrote:

Hi community,

I'm trying to remember how to set a non-default partition number for some cache.
Do we have a Java/XML example on the ignite.apache.org website?

Vladimir



-- 
Sent from the Yandex Mail mobile app


Re: Peer ClassLoading Issue | Apache Ignite 2.10 with Spring Boot 2.3

2021-05-11 Thread Vladimir Tchernyi
Hi Siva,

I still have no clue what's wrong, your code seems OK. Just one more thing
to check.

>>  cacheConfiguration.setIndexedTypes(String.class, IgniteUser1.class);
What do you need this line for? Does your code work without it?

It seems to be something about the SQL layer. Here's a little post I wrote [1];
it is written in Russian, but I hope you will be able to translate and read it.
I'd spent a lot of time digging through StackOverflow and learned that user
POJOs are not peer-deployed at all. So if you need the IgniteUser1 class,
probably that class should extend/implement something from the Ignite platform.

--
Vladimir

[1] https://habr.com/ru/post/472568/

Tue, 11 May 2021 at 06:00, :

> Thanks, Vladimir Chernyi, for your blog. That helped a lot in improving
> the performance. I am converting the object to a BinaryObject before
> storing. This doesn't seem to be an issue only when loading the data.
>
>
>
> I have a scan query which also fails with same exception.
>
>
>
> List<BinaryObject> cacheObjects = cache.withKeepBinary().query(
>     new ScanQuery<>(
>         // filter for just Blade Enabled Users
>         new IgniteBiPredicate<String, BinaryObject>() {
>             @Override
>             public boolean apply(String s, BinaryObject binaryObject) {
>                 return ((HashSet<String>) binaryObject.field("field1")).contains("34");
>             }
>         }),
>     // Transformer
>     new IgniteClosure<Cache.Entry<String, BinaryObject>, BinaryObject>() {
>         @Override
>         public BinaryObject apply(Cache.Entry<String, BinaryObject> stringBinaryObjectEntry) {
>             return stringBinaryObjectEntry.getValue();
>         }
>     }).getAll();
>
> The weirdest thing is that the classes seem to be loaded on the cluster only
> when I modify something in the class. I can replicate it again when I start
> the cluster after wiping the data folder. It appears that the grid deployer
> is not deploying classes to the cluster until it sees that the class has
> changed, but shouldn't it deploy them the first time?
>
>
>
>
>
> Thanks,
>
> Siva.
>
>
>
> From: vtcher...@gmail.com
> Sent: Monday, May 10, 2021 1:07
> To: user@ignite.apache.org
> Subject: Re: Peer ClassLoading Issue | Apache Ignite 2.10 with Spring
> Boot 2.3
>
>
>
> Hi Siva,
>
>
>
> Thank you for reading my blog post. I have no idea what the problem is in
> your case; I just want to share some experience.
>
>
>
> I do not use any user POJOs on the remote nodes. Instead, I create a POJO on
> the thick client node, convert it to a BinaryObject, and change that object on
> the remote node via object.toBuilder().setField(...).build(). I use the
> key-value API only.
>
>
>
> So no class-not-found issues arise. Hope that helps.
>
>
>
> Vladimir Chernyi
>
> 8:28, 9 May 2021, "siva.velich...@barclays.com" <
> siva.velich...@barclays.com>:
>
> Hi,
>
>
>
> We are trying to use Ignite for the first time in our project, with
> persistence enabled.
>
>
>
> Architecture is as follows.
>
>
>
> A Spring Boot 2.3 application (thick client) tries to connect to an Apache
> Ignite cluster (3 nodes) with persistence enabled and peer class loading
> enabled.
>
>
>
> There seems to be a weird  issue with peer class loading.
>
>
>
> We are trying to load a huge amount of data following the same approach as here:
> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
> 
>
>
>
> Cache Configuration
>
>
>
> cacheConfiguration.setName(CacheIdentifiers.USER_IGNITE_CACHE.toString());
> cacheConfiguration.setIndexedTypes(String.class, IgniteUser1.class);
> cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
> cacheConfiguration.setStoreKeepBinary(true);
> RendezvousAffinityFunction rendezvousAffinityFunction = new
>     RendezvousAffinityFunction();
> rendezvousAffinityFunction.setPartitions(512);
> cacheConfiguration.setBackups(1);
> cacheConfiguration.setAffinity(rendezvousAffinityFunction);
>
>
>
>
>
> Scenario 1.
>
>
>
> Start the cluster -> activate the cluster -> start the thick client ->
> loading data into the cluster fails
>
>
>
> Exception occurred in adding the data: javax.cache.CacheException: class
> org.apache.ignite.IgniteCheckedException: Failed to resolve class name
> [platformId=0, platform=Java, typeId=620850656]
>
>
>
> Scenario 2.
>
>
>
> Stop the thick client, rename the class from IgniteUser1 to IgniteUser, and
> restart the thick client; the classes are now copied to the cluster and
> everything works fine.
>
>
>
> I am not sure if there is an issue with grid deployment. Any help would be
> appreciated.
>
>
>
> Thanks,
> Siva.
>
>
> 

Re: very fast loading of very big table

2021-02-26 Thread Vladimir Tchernyi
Pavel,

thanks for mentioning the patterns. Of course, I spent a lot of time
reading the documentation, [2] at the very beginning and [1] a couple of months
ago. Here is the origin of my pain in the neck about a complete GitHub
example: neither [1] nor [2] answers my problem. The keyword
in my case is ASAP, so there should be a multithreaded example. Of course, a
real-world example must not use primitive types as cache values; I
tried to illustrate that in [3].

I'd built a loader with the data streamer [1]; it seems I was limited by the
network adapter's performance (see my previous post in this thread). That is the
reason I decided to move the SQL queries to the data nodes.

Vladimir

Fri, 26 Feb 2021 at 10:54, Pavel Tupitsyn :

> Vladimir,
>
> I think all real-world use cases are very valuable to the community.
> However, we should be careful to avoid misleading conclusions.
>
> We have well-known patterns for loading data from other systems:
> DataStreamer [1] and CacheStore [2].
> The article [3] seems a bit confusing to me, since neither of those two
> patterns is mentioned there.
> When proposing a custom approach, it would be great to compare it to the
> standard alternatives.
>
> [1] https://ignite.apache.org/docs/latest/data-streaming
> [2] https://ignite.apache.org/docs/latest/persistence/custom-cache-store
> [3]
> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>
> On Fri, Feb 26, 2021 at 9:19 AM Vladimir Tchernyi 
> wrote:
>
>> Hi Pavel,
>>
>> the code [1] you shared is a kind of in-memory experiment with all the
>> processes inside a single JVM. My work differs - it is from the big retail
>> business, and hence it is 100% practice-oriented. Truth be told, it's
>> oriented to the state of things inside my company, and that is my question
>> - will my results be interesting to the community? I have seen a lot of
>> questions on the user list regarding data loading and difficulties here
>> seem to be a blocker in extending Ignite's popularity.
>>
>> Please let me know if my case is not common in the industry. We have a
>> big bare-metal Windows MSSQL server and a number of bare-metal hosts, each
>> with virtualization software and a single CentOS virtual server inside.
>> These CentOS hosts currently form an Ignite cluster with 4 data nodes and 1
>> client node. The example [2] I published last year is intended to solve the
>> business problem we have out here:
>> 1) the data currently present in the cluster have zero value;
>> 2) actual data is in the database and must be loaded in the cluster ASAP.
>> We use BinaryObject as cache key and value;
>> 3) cluster performs some data processing and writes the result to the
>> database.
>>
>> Unfortunately, the code [2] is not 100% OK in my case; it tends to say
>> "is node still alive" and to drop the client node off the cluster. The
>> performance of the MSSQL and network is what it is; I consider it a
>> given restriction. It seems I made some progress when I managed to move the
>> data loading process from a single client node to multiple data nodes. When
>> extra data nodes are added, I expect the load performance to be
>> better. Of course, only as long as my big MSSQL can hold the load. So I
>> want to know how interesting my results will be if they are published.
>>
>> WDYT?
>>
>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>> [2] https://github.com/vtchernyi/FastDataLoad
>>
>> Thu, 25 Feb 2021 at 11:01, Pavel Tupitsyn :
>>
>>> Vladimir,
>>>
>>> Thanks for getting back to us. A full example that clarifies the
>>> situation will be great!
>>>
>>> > Can you share your code as a GitHub project? Maybe with the script to
>>> reproduce 6 GB of data.
>>>
>>> It is super trivial, I just wanted to get a sense of the throughput and
>>> check if we have some kind of regression in recent versions (we don't) [1].
>>> Also I realised that the data size can be counted very differently - do
>>> we account for DB overhead and how?
>>>
>>> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>>>
>>> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd spent some time thinking about the community comments on my post.
>>>> It seems that Ignite is really not a bottleneck here. The performance of my
>>>> production MSSQL is a given restriction, and the problem is to ensure fast
>>>> loading by executing multiple parallel queries. I'll test my code in
>>>> production for a couple of months for possible problems. If it is OK, a
>>>> complete/downloadable/compilable GitHub example will probably be useful
>>>> for the community.

Re: very fast loading of very big table

2021-02-25 Thread Vladimir Tchernyi
Hi Pavel,

the code [1] you shared is a kind of in-memory experiment with all the
processes inside a single JVM. My work differs - it is from the big retail
business, and hence it is 100% practice-oriented. Truth be told, it's
oriented to the state of things inside my company, and that is my question
- will my results be interesting to the community? I have seen a lot of
questions on the user list regarding data loading and difficulties here
seem to be a blocker in extending Ignite's popularity.

Please let me know if my case is not common in the industry. We have a big
bare-metal Windows MSSQL server and a number of bare-metal hosts, each with
virtualization software and a single CentOS virtual server inside.
These CentOS hosts currently form an Ignite cluster with 4 data nodes and 1
client node. The example [2] I published last year is intended to solve the
business problem we have out here:
1) the data currently present in the cluster have zero value;
2) actual data is in the database and must be loaded in the cluster ASAP.
We use BinaryObject as cache key and value;
3) cluster performs some data processing and writes the result to the
database.

Unfortunately, the code [2] is not 100% OK in my case; it tends to say
"is node still alive" and to drop the client node off the cluster. The
performance of the MSSQL and network is what it is; I consider it a
given restriction. It seems I made some progress when I managed to move the
data loading process from a single client node to multiple data nodes. When
extra data nodes are added, I expect the load performance to be
better. Of course, only as long as my big MSSQL can hold the load. So I
want to know how interesting my results will be if they are published.

WDYT?

[1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
[2] https://github.com/vtchernyi/FastDataLoad

Thu, 25 Feb 2021 at 11:01, Pavel Tupitsyn :

> Vladimir,
>
> Thanks for getting back to us. A full example that clarifies the situation
> will be great!
>
> > Can you share your code as a GitHub project? Maybe with the script to
> reproduce 6 GB of data.
>
> It is super trivial, I just wanted to get a sense of the throughput and
> check if we have some kind of regression in recent versions (we don't) [1].
> Also I realised that the data size can be counted very differently - do we
> account for DB overhead and how?
>
> [1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
>
> On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi 
> wrote:
>
>> Hi,
>>
>> I'd spent some time thinking about the community comments on my post. It
>> seems that Ignite is really not a bottleneck here. The performance of my
>> production MSSQL is a given restriction, and the problem is to ensure fast
>> loading by executing multiple parallel queries. I'll test my code in
>> production for a couple of months for possible problems. If it is OK, a
>> complete/downloadable/compilable GitHub example will probably be useful
>> for the community.
>>
>> WDYT?
>>
>> Fri, 19 Feb 2021 at 21:47, Vladimir Tchernyi :
>>
>>> Pavel,
>>>
>>> maybe it's time to put your five cents in. Can you share your code as a
>>> GitHub project? Maybe with a script to reproduce the 6 GB of data.
>>>
>>> As for MSSQL data retrieval being the bottleneck - I don't think so: I got
>>> a 15-min load time for 1 node and 3.5 min for 4 nodes. Looks like a
>>> linear dependency (the table and the RDBMS server were the same).
>>> --
>>> Vladimir
>>>
>>> Fri, 19 Feb 2021 at 19:47, Pavel Tupitsyn :
>>>
>>>> > First of all, I tried to select the whole table at once
>>>>
>>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
>>>> Ignite.
>>>>
>>>> Can you run a test where some dummy data of the same size as real data
>>>> is generated and inserted into Ignite,
>>>> so that we test Ignite perf only, excluding MSSQL from the equation?
>>>> For example, streaming 300 million entries (total size 6 GB) takes
>>>> around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>>
>>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi 
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>> thanks for your interest in my work.
>>>>>
>>>>> I didn't try COPY FROM since I tried to work with Ignite SQL a
>>>>> couple of years ago and didn't succeed, probably because the examples
>>>>> available aren't complete/downloadable/compilable (the paper [1] contains
>>>>> a GitHub repo; that is my five cents in changing the status quo). My
>>>>> interest is in the KV API.

Re: very fast loading of very big table

2021-02-24 Thread Vladimir Tchernyi
Hi,

I'd spent some time thinking about the community comments on my post. It
seems that Ignite is really not a bottleneck here. The performance of my
production MSSQL is a given restriction, and the problem is to ensure fast
loading by executing multiple parallel queries. I'll test my code in
production for a couple of months for possible problems. If it is OK, a
complete/downloadable/compilable GitHub example will probably be useful
for the community.

WDYT?

Fri, 19 Feb 2021 at 21:47, Vladimir Tchernyi :

> Pavel,
>
> maybe it's time to put your five cents in. Can you share your code as a
> GitHub project? Maybe with a script to reproduce the 6 GB of data.
>
> As for MSSQL data retrieval being the bottleneck - I don't think so: I got
> a 15-min load time for 1 node and 3.5 min for 4 nodes. Looks like a
> linear dependency (the table and the RDBMS server were the same).
> --
> Vladimir
>
> Fri, 19 Feb 2021 at 19:47, Pavel Tupitsyn :
>
>> > First of all, I tried to select the whole table at once
>>
>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
>> Ignite.
>>
>> Can you run a test where some dummy data of the same size as real data is
>> generated and inserted into Ignite,
>> so that we test Ignite perf only, excluding MSSQL from the equation?
>> For example, streaming 300 million entries (total size 6 GB) takes around
>> 1 minute on my machine, with a simple single-threaded DataStreamer.
>>
>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi 
>> wrote:
>>
>>> Hi folks,
>>> thanks for your interest in my work.
>>>
>>> I didn't try COPY FROM since I tried to work with Ignite SQL a couple
>>> of years ago and didn't succeed, probably because the available examples
>>> aren't complete/downloadable/compilable (the paper [1] contains a GitHub
>>> repo; that is my five cents in changing the status quo). My interest is in
>>> the KV API.
>>>
>>> I did try a data streamer; in fact, that was my first attempt. I did not
>>> notice a significant time reduction using the code from my paper [1] versus
>>> the data streamer/receiver. There were some memory savings with the streamer,
>>> though. I must say my experiment was made on a heavily loaded production
>>> MSSQL server. A filtered query with a 300K-row resultset takes about 15 sec.
>>> The story follows.
>>>
>>> First of all, I tried to select the whole table at once; I got a
>>> network timeout and the client node was dropped off the cluster ("is node
>>> still alive?").
>>> So I partitioned the table and executed a number of queries one by one
>>> on the client node, each query for a specific table partition. That
>>> process took about 90 min. Unacceptable time.
>>>
>>> Then I tried to execute my queries in parallel on the client node, each
>>> query calling dataStreamer.addData() on a single dataStreamer. The
>>> timing was never less than 15 min. All the attempts were the same; probably
>>> that was the network throughput limit on the client node (the same interface
>>> is used for the resultset and for cluster intercom). I'll say it again - that
>>> was the production environment.
>>>
>>> Final schema:
>>> * ComputeTask.map() schedules ComputeJobs among the cluster nodes, one job
>>> per table partition;
>>> * each job executes an SQL query and constructs a map with binary object
>>> keys and values. Then the job executes targetCache.invokeAll(), specifying
>>> the constructed map and the static EntryProcessor class. The EntryProcessor
>>> contains the logic for the cache binary entry update;
>>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>>
>>> The schema described proved to be network error-free in my production
>>> network and gives acceptable timing.
>>>
>>> Vladimir
>>>
>>> [1]
>>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>
>>> Fri, 19 Feb 2021 at 16:41, Stephen Darlington <
>>> stephen.darling...@gridgain.com>:
>>>
>>>> I think it's more that putAll is mostly atomic, so the more
>>>> records you save in one chunk, the more locking, etc. happens. Distributing
>>>> as compute jobs means all the putAlls will be local, which is beneficial,
>>>> and the size of each put is going to be smaller (also beneficial).
>>>>
>>>> But that's a lot of work that the data streamer already does for you,
>>>> and the data streamer also batches updates, so it would still be faster.

Re: very fast loading of very big table

2021-02-19 Thread Vladimir Tchernyi
Pavel,

maybe it's time to put your five cents in. Can you share your code as a
GitHub project? Maybe with a script to reproduce the 6 GB of data.

As for MSSQL data retrieval being the bottleneck - I don't think so: I got a
15-min load time for 1 node and 3.5 min for 4 nodes. Looks like a linear
dependency (the table and the RDBMS server were the same).
--
Vladimir

Fri, 19 Feb 2021 at 19:47, Pavel Tupitsyn :

> > First of all, I tried to select the whole table at once
>
> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not
> Ignite.
>
> Can you run a test where some dummy data of the same size as real data is
> generated and inserted into Ignite,
> so that we test Ignite perf only, excluding MSSQL from the equation?
> For example, streaming 300 million entries (total size 6 GB) takes around
> 1 minute on my machine, with a simple single-threaded DataStreamer.
>
> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi 
> wrote:
>
>> Hi folks,
>> thanks for your interest in my work.
>>
>> I didn't try COPY FROM since I tried to work with Ignite SQL a couple
>> of years ago and didn't succeed, probably because the available examples
>> aren't complete/downloadable/compilable (the paper [1] contains a GitHub
>> repo; that is my five cents in changing the status quo). My interest is in
>> the KV API.
>>
>> I did try a data streamer; in fact, that was my first attempt. I did not
>> notice a significant time reduction using the code from my paper [1] versus
>> the data streamer/receiver. There were some memory savings with the streamer,
>> though. I must say my experiment was made on a heavily loaded production
>> MSSQL server. A filtered query with a 300K-row resultset takes about 15 sec.
>> The story follows.
>>
>> First of all, I tried to select the whole table at once; I got a
>> network timeout and the client node was dropped off the cluster ("is node
>> still alive?").
>> So I partitioned the table and executed a number of queries one by one
>> on the client node, each query for a specific table partition. That
>> process took about 90 min. Unacceptable time.
>>
>> Then I tried to execute my queries in parallel on the client node, each
>> query calling dataStreamer.addData() on a single dataStreamer. The
>> timing was never less than 15 min. All the attempts were the same; probably
>> that was the network throughput limit on the client node (the same interface
>> is used for the resultset and for cluster intercom). I'll say it again - that
>> was the production environment.
>>
>> Final schema:
>> * ComputeTask.map() schedules ComputeJobs among the cluster nodes, one job
>> per table partition;
>> * each job executes an SQL query and constructs a map with binary object
>> keys and values. Then the job executes targetCache.invokeAll(), specifying
>> the constructed map and the static EntryProcessor class. The EntryProcessor
>> contains the logic for the cache binary entry update;
>> * ComputeTask.reduce() summarizes the row count reported by each job.
>>
>> The schema described proved to be network error-free in my production
>> network and gives acceptable timing.
>>
>> Vladimir
>>
>> [1]
>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>
>> Fri, 19 Feb 2021 at 16:41, Stephen Darlington <
>> stephen.darling...@gridgain.com>:
>>
>>> I think it's more that putAll is mostly atomic, so the more records
>>> you save in one chunk, the more locking, etc. happens. Distributing as
>>> compute jobs means all the putAlls will be local, which is beneficial, and
>>> the size of each put is going to be smaller (also beneficial).
>>>
>>> But that's a lot of work that the data streamer already does for you, and
>>> the data streamer also batches updates, so it would still be faster.
>>>
>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez 
>>> wrote:
>>>
>>> What would be the difference between doing cache.putAll(all rows) and
>>> separating them by affinity key + executing putAll inside a compute job?
>>> If I'm not mistaken, doing putAll should end up splitting those rows by
>>> affinity key on one of the servers, right?
>>> Is there a comparison of that?
>>>
>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov 
>>> wrote:
>>>
>>>> Hi Vladimir,
>>>> Did you try to use SQL command 'COPY FROM ' via thin JDBC?
>>>> This command uses 'IgniteDataStreamer' to write data into cluster and
>>>> parse CSV on the server node.
>>>>

Re: very fast loading of very big table

2021-02-19 Thread Vladimir Tchernyi
Hi folks,
thanks for your interest in my work.

I didn't try COPY FROM since I tried to work with Ignite SQL a couple of
years ago and didn't succeed, probably because the available examples aren't
complete/downloadable/compilable (the paper [1] contains a GitHub repo; that
is my five cents in changing the status quo). My interest is in the KV API.

I did try a data streamer; in fact, that was my first attempt. I did not
notice a significant time reduction using the code from my paper [1] versus
the data streamer/receiver. There were some memory savings with the streamer,
though. I must say my experiment was made on a heavily loaded production
MSSQL server. A filtered query with a 300K-row resultset takes about 15 sec.
The story follows.

First of all, I tried to select the whole table at once; I got a network
timeout and the client node was dropped off the cluster ("is node still
alive?").
So I partitioned the table and executed a number of queries one by one on
the client node, each query for a specific table partition. That process
took about 90 min. Unacceptable time.

Then I tried to execute my queries in parallel on the client node, each
query calling dataStreamer.addData() on a single dataStreamer. The
timing was never less than 15 min. All the attempts were the same; probably
that was the network throughput limit on the client node (the same interface
is used for the resultset and for cluster intercom). I'll say it again - that
was the production environment.

Final schema:
* ComputeTask.map() schedules ComputeJobs among the cluster nodes, one job
per table partition;
* each job executes an SQL query and constructs a map with binary object keys
and values. Then the job executes targetCache.invokeAll(), specifying the
constructed map and the static EntryProcessor class (sketched below). The
EntryProcessor contains the logic for the cache binary entry update;
* ComputeTask.reduce() summarizes the row count reported by each job.

The schema described proved to be network error-free in my production
network and gives acceptable timing.
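
A minimal sketch of the per-job invokeAll() part (the cache name and the
loadPartition() helper are hypothetical, for illustration only):

import java.util.Map;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.CacheEntryProcessor;

IgniteCache<BinaryObject, BinaryObject> cache =
    ignite.cache("bigTable").withKeepBinary();

// binary keys/values built from this job's SQL query
Map<BinaryObject, BinaryObject> partMap = loadPartition(partId);

cache.invokeAll(partMap.keySet(), new UpdateProcessor(), partMap);

// the static EntryProcessor; its process() runs on the data node
public static class UpdateProcessor
        implements CacheEntryProcessor<BinaryObject, BinaryObject, Object> {
    @Override public Object process(MutableEntry<BinaryObject, BinaryObject> entry,
            Object... args) {
        Map<BinaryObject, BinaryObject> vals = (Map<BinaryObject, BinaryObject>) args[0];
        entry.setValue(vals.get(entry.getKey())); // update logic for one binary entry
        return null;
    }
}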

Vladimir

[1]
https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api

Fri, 19 Feb 2021 at 16:41, Stephen Darlington <
stephen.darling...@gridgain.com>:

> I think it's more that putAll is mostly atomic, so the more records
> you save in one chunk, the more locking, etc. happens. Distributing as
> compute jobs means all the putAlls will be local, which is beneficial, and
> the size of each put is going to be smaller (also beneficial).
>
> But that's a lot of work that the data streamer already does for you, and
> the data streamer also batches updates, so it would still be faster.
>
> On 19 Feb 2021, at 13:33, Maximiliano Gazquez 
> wrote:
>
> What would be the difference between doing cache.putAll(all rows) and
> separating them by affinity key + executing putAll inside a compute job?
> If I'm not mistaken, doing putAll should end up splitting those rows by
> affinity key on one of the servers, right?
> Is there a comparison of that?
>
> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov  wrote:
>
>> Hi Vladimir,
>> Did you try to use the SQL command 'COPY FROM' via thin JDBC?
>> This command uses IgniteDataStreamer to write data into the cluster and
>> parses the CSV on the server node.
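>>
>> A sketch of that over the thin JDBC driver (the CSV path and table are
>> illustrative, not from this thread):
>>
>> import java.sql.Connection;
>> import java.sql.DriverManager;
>> import java.sql.Statement;
>>
>> try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
>>      Statement stmt = conn.createStatement()) {
>>     stmt.executeUpdate(
>>         "COPY FROM '/data/person.csv' INTO person (id, name) FORMAT CSV");
>> }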
>>
>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>
>> Hi Denis,
>>
>> The data space is 3.7 GB according to the MSSQL table properties
>>
>> Vladimir
>>
>> 9:47, 19 February 2021, Denis Magda :
>>
>> Hello Vladimir,
>>
>> Good to hear from you! How much is that in gigabytes?
>>
>> -
>> Denis
>>
>>
>> On Thu, Feb 18, 2021 at 10:06 PM  wrote:
>>
>> In Sep 2020 I published a paper about Loading Large Datasets into
>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>> versions). The approach described works in production, but shows
>> unacceptable performance for very large tables.
>>
>> The story continues, and yesterday I finished a proof of concept for
>> very fast loading of a very big table. A partitioned MSSQL table of about 295
>> million rows was loaded by a 4-node Ignite cluster in 3 min 35 sec. Each
>> node executed its own SQL queries in parallel and then distributed the
>> loaded values across the other cluster nodes.
>>
>> Probably that result will be of interest to the community.
>>
>> Regards,
>> Vladimir Chernyi
>>
>> [1]
>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>> [2] https://m.habr.com/ru/post/526708/
>>
>>
>>
>> --
>> Sent from the Yandex Mail mobile app
>>
>> --
>> Taras Ledkov
>> Mail-To: tled...@gridgain.com
>>
>>
>
>


Re: IgniteDataStreamer.keepBinary proposal

2020-12-05 Thread Vladimir Tchernyi
Hi Denis,

I think the code examples we already have do not show the nature of Ignite
as a DISTRIBUTED database. These examples are oriented toward a single-node
start. An inexperienced user can get the false impression that a single
Ignite node can outperform, for example, a commercial database server.

IMHO the documentation should be written for a multinode Ignite cluster. I
do not understand the purpose of showing how to stream 100_000 integer
values into a cache defined as <Integer, Integer>. In the real world, I need
to stream structured records (Kafka Avro messages), and I will create a
POJO to hold each message. It is known that Ignite does not
peer-deploy user POJOs, so using BinaryObject is the only way to forward my
POJOs to the remote nodes (correct me if I am wrong).
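
For completeness, a minimal sketch of building such a record without any POJO
class, via the binary builder ('ignite' is a started node; the type and field
names are illustrative):

import org.apache.ignite.binary.BinaryObject;

BinaryObject msg = ignite.binary().builder("KafkaMessage")
    .setField("id", 42L)
    .setField("payload", "hello")
    .build();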

I trust Ignite, and I managed to create a really fast Ignite app in
production. But recently I faced again the long-forgotten feeling: the
page is nice but hard to use. I hope my experience will help to
improve the documentation.

Vladimir

PS
as for contributing, I need some time to get my Kafka Ignite app to
production to be sure of it. After that, I will be ready to contribute

Sat, 5 Dec 2020 at 06:31, Denis Magda :

> Hi Vladimir,
>
> Most of the code snippets are already arranged in complete and
> ready-for-usage samples:
>
> https://github.com/apache/ignite/tree/master/docs/_docs/code-snippets/java/src/main/java/org/apache/ignite/snippets
>
> Anyway, those are code snippets that are injected into quite generic
> documentation pages. Your case represents a situation when someone needs to
> work with binary objects and streaming APIs. What if we add a data streamer
> example for BinaryObjects into Ignite's examples and put a reference to
> that example from the documentation page? Are you interested in
> contributing the example?
> https://github.com/apache/ignite/tree/master/examples
>
> -
> Denis
>
>
> On Fri, Dec 4, 2020 at 2:58 AM Vladimir Tchernyi 
> wrote:
>
>> Hi, community
>>
>>
>>
>> I've just finished drilling a small page [1] about Ignite data streaming
>> and I want to share my impressions. The situation is common for many Ignite
>> documentation pages, impressions are the same.
>>
>>
>>
>> My problem was to adapt IgniteDataStreamer to data loading using the
>> binary format as described in my article [2]. I tried to use the same
>> approach:
>>
>> 1) load data on the client node;
>>
>> 2) convert it to the binary form;
>>
>> 3) use the IgniteDataStreamer/StreamReceiver pair (instead of
>> ComputeTaskAdapter/ComputeJobAdapter) to ingest data into the cache.
>>
>>
>>
>> I modified my production code to use an IgniteDataStreamer with
>> BinaryObject values and a custom StreamReceiver, and tried to start it on
>> a dev cluster made of 2 server nodes and 1 client node. That was
>> it: a ClassNotFoundException for a class that exists on the client node
>> only.
>>
>>
>>
>> The solution to the problem seems to be setting
>> streamer.keepBinary(true), but page [1] never mentions it. I found that
>> setter in the IgniteDataStreamer source code after a single day of
>> troubleshooting. Definitely, "In Ignite We Trust" - what other reason would
>> drive me to spend so much time?
>>
>>
>>
>> The code snippets on the page [1] are hard to implement in real-world
>> applications because they use only primitive types (String, Integer, etc.).
>> These are more like unit tests.
>>
>>
>>
>> My proposal - it would be great to create a small GitHub repo containing
>> a complete compilable code example, one repo for every page. I think such
>> repos will keep the newbie Ignite users inside the project and prevent them
>> from leaving.
>>
>>
>>
>> Regards,
>>
>> Vladimir Tchernyi
>>
>> --
>>
>> [1] https://ignite.apache.org/docs/latest/data-streaming
>>
>> [2]
>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>
>>
>>
>


IgniteDataStreamer.keepBinary proposal

2020-12-04 Thread Vladimir Tchernyi
Hi, community



I've just finished drilling into a small page [1] about Ignite data streaming,
and I want to share my impressions. The situation is common for many Ignite
documentation pages; the impressions are the same.



My problem was to adapt IgniteDataStreamer to data loading using the binary
format as described in my article [2]. I tried to use the same approach:

1) load data on the client node;

2) convert it to the binary form;

3) use the IgniteDataStreamer/StreamReceiver pair (instead of
ComputeTaskAdapter/ComputeJobAdapter) to ingest data into the cache.



I modified my production code to use an IgniteDataStreamer with BinaryObject
values and a custom StreamReceiver, and tried to start it on a dev cluster
made of 2 server nodes and 1 client node. That was it: a
ClassNotFoundException for a class that exists on the client node only.



The solution to the problem seems to be setting
streamer.keepBinary(true), but page [1] never mentions it. I found that
setter in the IgniteDataStreamer source code after a single day of
troubleshooting. Definitely, "In Ignite We Trust" - what other reason would
drive me to spend so much time?
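
A minimal sketch of the fix (the cache name and the binKey/binValue variables
are illustrative; both are assumed to already be BinaryObjects):

import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.binary.BinaryObject;

try (IgniteDataStreamer<BinaryObject, BinaryObject> streamer =
         ignite.dataStreamer("myCache")) {
    // without this, the receiving side tries to deserialize the entries and
    // throws ClassNotFoundException for classes that exist on the client only
    streamer.keepBinary(true);
    streamer.addData(binKey, binValue);
}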



The code snippets on the page [1] are hard to implement in real-world
applications because they use only primitive types (String, Integer, etc.).
These are more like unit tests.



My proposal - it would be great to create a small GitHub repo containing a
complete compilable code example, one repo for every page. I think such
repos will keep the newbie Ignite users inside the project and prevent them
from leaving.



Regards,

Vladimir Tchernyi

--

[1] https://ignite.apache.org/docs/latest/data-streaming

[2]
https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api


Re: read-through tutorial for a big table

2020-06-21 Thread Vladimir Tchernyi
Hi Denis,

Some progress has happened, and I have some material to share with the
community. I think it will be interesting to newbies. It is about loading
big tables from an RDBMS and creating cache entries based on the table info.
This approach was tested in production and showed good timing when paired with
MSSQL, on tables from tens to hundreds of millions of rows.

The loading jar process:
* starts an Ignite client node;
* creates user POJOs according to the business logic;
* converts the POJOs to BinaryObjects;
* uses the affinity function and creates a separate key-value HashMap for every
cache partition (sketched below);
* uses ComputeTaskAdapter/ComputeJobAdapter to place the HashMaps on the
corresponding data nodes.
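
A minimal sketch of the per-partition grouping step ('ignite' is the client
node; the cache name and the 'loaded' map of binary keys/values are
illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.affinity.Affinity;

Affinity<BinaryObject> aff = ignite.affinity("bigTable");
Map<Integer, Map<BinaryObject, BinaryObject>> byPartition = new HashMap<>();

for (Map.Entry<BinaryObject, BinaryObject> e : loaded.entrySet()) {
    int part = aff.partition(e.getKey()); // the partition this key maps to
    byPartition.computeIfAbsent(part, p -> new HashMap<>())
               .put(e.getKey(), e.getValue());
}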

I would like to publish a tutorial, say, on the GridGain website in English,
with a Russian version on habr.com.

WDYT?

Thu, 12 Mar 2020 at 08:25, :

> Hello Denis,
>
> That is possible; my writing activities should continue. The only
> question is getting my local project to production; there is no sense in
> writing another model example. So I hope there will be progress in the
> near future.
>
> Vladimir
>
> 2:25, 12 March 2020, Denis Magda :
>
> Hello Vladimir,
>
> Just to clarify, are you suggesting to create a tutorial for data loading
> scenarios when data resides in an external database?
>
> -
> Denis
>
>
> On Tue, Mar 10, 2020 at 11:41 PM  wrote:
>
> Andrei, Evgenii, thanks for the answers.
>
> As far as I see, there is no ready-to-use tutorial. I managed to build a
> multi-threaded cache load procedure; the out-of-the-box loadCache method is
> extremely slow.
>
> I spent about a month studying write-through topics, and finally got the
> same result as "capacity planning" says: a 0.8 GB MSSQL table on disk expands
> to 2.3 GB; the size in RAM is 2.875 times bigger.
>
> Is it beneficial to use BinaryObject instead of a user POJO? If yes, how do I
> create a BinaryObject without a POJO definition and deserialize it back to a
> POJO? It would be great to have a kind of advanced GitHub example like this:
>
> https://github.com/dmagda/MicroServicesExample
>
> It helped a lot with understanding. The current documentation links do not
> help to build a real solution; they are mostly like a reference, with no
> option to compile and debug.
>
> Vladimir
>
> 2:51, 11 March 2020, Evgenii Zhuravlev :
>
> When you're saying that the result was poor, do you mean that data
> preloading took too much time, or is it just about get operations?
>
> Evgenii
>
> Tue, 10 Mar 2020 at 03:29, aealexsandrov :
>
> Hi,
>
> You can read the documentation articles:
>
> https://apacheignite.readme.io/docs/3rd-party-store
>
> If you are going to load the cache from a 3rd-party store (RDBMS),
> then the default implementation of CacheJdbcPojoStore can take a lot of time
> to load the data, because it uses a plain JDBC connection inside (not a pool
> of connections).
>
> Probably you should implement your own version of CacheStore that reads
> data from the RDBMS in several threads, e.g. using a JDBC connection pool
> there. The sources are open, so you can copy the existing implementation
> and modify it (a rough sketch follows the link below):
>
>
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cache/store/jdbc/CacheJdbcPojoStore.java
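>
> A rough sketch of such a store (the connection URL, table, and columns are
> illustrative; load-through and write-through are intentionally left out):
>
> import java.sql.*;
> import java.util.concurrent.*;
> import javax.cache.integration.CacheLoaderException;
> import org.apache.ignite.cache.store.CacheStoreAdapter;
> import org.apache.ignite.lang.IgniteBiInClosure;
>
> public class ParallelJdbcStore extends CacheStoreAdapter<Long, String> {
>     private static final String URL = "jdbc:sqlserver://db-host;databaseName=mydb";
>     private static final int THREADS = 4;
>
>     @Override public void loadCache(IgniteBiInClosure<Long, String> clo, Object... args) {
>         ExecutorService pool = Executors.newFixedThreadPool(THREADS);
>         for (int i = 0; i < THREADS; i++) {
>             final int slice = i;
>             pool.submit(() -> {
>                 // each thread owns its own connection and reads its slice of rows
>                 try (Connection c = DriverManager.getConnection(URL);
>                      PreparedStatement st = c.prepareStatement(
>                          "SELECT id, name FROM person WHERE id % " + THREADS + " = ?")) {
>                     st.setInt(1, slice);
>                     try (ResultSet rs = st.executeQuery()) {
>                         while (rs.next())
>                             clo.apply(rs.getLong(1), rs.getString(2)); // hand entry to Ignite
>                     }
>                 }
>                 catch (SQLException e) {
>                     throw new CacheLoaderException(e);
>                 }
>             });
>         }
>         pool.shutdown();
>         try { pool.awaitTermination(1, TimeUnit.HOURS); }
>         catch (InterruptedException e) { Thread.currentThread().interrupt(); }
>     }
>
>     @Override public String load(Long key) { return null; /* load-through not sketched */ }
>     @Override public void write(javax.cache.Cache.Entry<? extends Long, ? extends String> e) { /* no-op */ }
>     @Override public void delete(Object key) { /* no-op */ }
> }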
>
> Otherwise, you can do the initial data loading using some streaming tools:
>
> 1) Spark integration with Ignite -
> https://apacheignite-fs.readme.io/docs/ignite-data-frame
> 2) Kafka integration with Ignite -
> https://apacheignite-mix.readme.io/docs/kafka-streamer
>
> BR,
> Andrei
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
>
>
> --
> Sent from the Yandex Mail mobile app
>
>
>
> --
> Sent from the Yandex Mail mobile app