C# DataStreamer Best Practices

2021-04-23 Thread William.L
Hi,

We are hitting a performance wall using the C# thin client's PutAsync for
data upload to Ignite, so we are in the process of migrating to the C# thick
client's DataStreamer (running in client mode). I would appreciate advice on
best practices for using the thick client:
* Ignite instance -- should we use a single instance within the server
process? Since the C# API is a wrapper around an embedded JVM, it seems
there is no point in using multiple instances.
* DataStreamer -- I am aware from earlier postings that it is thread-safe.
My question is whether there is any benefit in sharing/reusing one
DataStreamer instance (e.g. better batching for colocated data) versus
using more DataStreamer instances (more parallelism/connections).
* Retries and connection failures -- there is no mention of connection
failure scenarios or retry settings. Can I assume the DataStreamer (via the
JVM) will take care of all connection failures and reconnecting?
* Failures -- are these surfaced as exceptions from the Task returned by
AddData(), or do I have to use Flush()/Close()? (A sketch of the pattern I
have in mind follows below.)
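
For concreteness, here is a minimal sketch of the upload pattern I have in
mind. This assumes the Ignite.NET 2.x API, where AddData() returns a Task
that, as I understand it, completes (or faults) when that entry's batch is
flushed; the cache name, key/value types and tuning values are made up:

using System;
using System.Linq;
using System.Threading.Tasks;
using Apache.Ignite.Core;

class StreamerSketch
{
    static void Main()
    {
        // One thick client node per process (client mode).
        var cfg = new IgniteConfiguration { ClientMode = true };

        using (IIgnite ignite = Ignition.Start(cfg))
        using (var streamer = ignite.GetDataStreamer<int, string>("myCache"))
        {
            streamer.AllowOverwrite = true;          // false is faster for pure initial loads
            streamer.PerNodeBufferSize = 1024;       // per-node batch size
            streamer.AutoFlushFrequency = TimeSpan.FromSeconds(1);

            // Collect the per-entry tasks; each completes when its batch is flushed.
            var tasks = Enumerable.Range(0, 100000)
                .Select(i => streamer.AddData(i, "value-" + i))
                .ToArray();

            streamer.Flush();       // push out anything still buffered
            Task.WaitAll(tasks);    // batch-level failures surface here as exceptions
        }                           // Dispose() closes the streamer: final flush, throws on failure
    }
}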

Thanks







Understanding SQL join performance

2021-04-23 Thread William.L
Hi,

I am trying to understand why my colocated join between two tables/caches
takes so long compared to the individual table filters.

TABLE1:
Returns 1 count -- 0.13s

TABLE2:
Returns 65000 count -- 0.643s

JOIN of TABLE1 and TABLE2:
Returns 650K count -- 7s

Both analysis_input and analysis_output have an index on (cohort_id,
user_id, timestamp). The affinity key is user_id. How do I analyze the
performance further?

Here's the EXPLAIN output, which does not tell me much:

Is Ignite doing the join and filtering at each data node and then sending
all 650K rows to the reducer before aggregation? If so, is it possible for
Ignite to do some aggregation at the data nodes first and then send the
first-level aggregation results to the reducer?
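
One thing I plan to try, assuming the grouping stays on the affinity key:
SqlFieldsQuery has a Collocated flag which, as I understand it, tells the
engine the GROUP BY is colocated, so each data node can aggregate its own
partitions and send only partial results to the reducer. A rough C# sketch
(the join shape, the cache handle and the cohortId parameter are made up
for illustration):

using System;
using Apache.Ignite.Core.Cache;
using Apache.Ignite.Core.Cache.Query;

static class CollocatedAggregate
{
    public static void Run(ICache<object, object> cache, int cohortId)
    {
        var qry = new SqlFieldsQuery(
            @"SELECT i.user_id, COUNT(*)
              FROM analysis_input i
              JOIN analysis_output o
                ON o.user_id = i.user_id AND o.cohort_id = i.cohort_id
              WHERE i.cohort_id = ?
              GROUP BY i.user_id", cohortId)
        {
            // Safe only because user_id is the affinity key: map nodes can
            // aggregate locally before the reduce step.
            Collocated = true
        };

        foreach (var row in cache.Query(qry))
            Console.WriteLine($"{row[0]}: {row[1]}");
    }
}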








Re: MapReduce - How to efficiently scan through subset of the caches?

2021-04-23 Thread William.L
Thanks for the pointers, stephendarlington and ptupitsyn.

Looks like I can run a mapper that does a local SQL query to get the set of
keys for the tenant that reside on the local server node, and then use
Compute.affinityRun or Cache.invokeAll.

For Cache.invokeAll, it takes a dictionary of keys to EntryProcessors, so
that is easy to understand.

For Compute.affinityRun, I am not sure how to work with it for my scenario
(see the sketch below):
* It takes an affinity key to find the partition's server node to run the
IgniteRunnable on, but I don't see an interface to pass in the specific
keys. Am I expected to pass the key set as part of the IgniteRunnable
object?
* Suppose the cache uses user_id as the affinity key; then it is possible
that two user_ids map to the same partition. How do I avoid duplicate
processing/scanning?
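
To make the question concrete, here is the shape I have in mind in C#
(cache name, key types and the processing body are placeholders, and I am
assuming both that the key subset is meant to ride along inside the
serialized action and that the partition-based AffinityRun overload
mentioned elsewhere in this thread is available in Ignite.NET):

using System.Linq;
using Apache.Ignite.Core;
using Apache.Ignite.Core.Compute;

class ProcessKeysAction : IComputeAction
{
    private readonly long[] _keys;

    public ProcessKeysAction(long[] keys) { _keys = keys; }

    public void Invoke()
    {
        var cache = Ignition.GetIgnite().GetCache<long, object>("myCache");

        foreach (var key in _keys)
        {
            // Entries are local here: the partition is pinned while we run.
            if (cache.TryLocalPeek(key, out var value))
            {
                // ... process value ...
            }
        }
    }
}

static class TenantJob
{
    // Group the tenant's keys by partition and submit one job per partition,
    // so a partition shared by two user_ids is still processed only once.
    public static void Run(IIgnite ignite, long[] tenantKeys)
    {
        var aff = ignite.GetAffinity("myCache");

        foreach (var group in tenantKeys.GroupBy(k => aff.GetPartition(k)))
            ignite.GetCompute().AffinityRun(
                new[] { "myCache" }, group.Key, new ProcessKeysAction(group.ToArray()));
    }
}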







Re: MapReduce - How to efficiently scan through subset of the caches?

2021-04-23 Thread Pavel Tupitsyn
1. Use a separate cache as an index.
   E.g. for every tenant, store a list of IDs for quick retrieval,
   then use Compute.affinityRun or Cache.invokeAll to process the subset
   of data.

2. Use SQL with an index, but enable it only for the tenantId field.
   Get the entry IDs for a given tenant with SQL, then again AffinityRun
   or InvokeAll.

> IgniteCache.localEntries
Be careful with localEntries - when topology changes and a rebalance is in
progress, you'll miss some data and/or process some of it twice.

Prefer Cache.invoke and Compute.affinity* APIs - they guarantee that the
given part of the data (key or partition) is locked during the processing.
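
For illustration, a minimal C# sketch of option 2 (cache, table and field
names are placeholders, and the processor body is up to you; InvokeAll holds
the entry lock while each processor runs):

using System.Linq;
using Apache.Ignite.Core;
using Apache.Ignite.Core.Cache;
using Apache.Ignite.Core.Cache.Configuration;
using Apache.Ignite.Core.Cache.Query;

class TenantRecord
{
    [QuerySqlField(IsIndexed = true)]
    public string TenantId { get; set; }
    // ... payload fields, no index needed on them ...
}

class TenantProcessor : ICacheEntryProcessor<long, TenantRecord, object, int>
{
    public int Process(IMutableCacheEntry<long, TenantRecord> entry, object arg)
    {
        // ... compute on entry.Value under the entry lock ...
        return 0;
    }
}

static class TenantInvoke
{
    public static void Run(IIgnite ignite, string tenantId)
    {
        var cache = ignite.GetCache<long, TenantRecord>("records");

        // Step 1: index lookup - only the tenantId field is indexed.
        var ids = cache.Query(new SqlFieldsQuery(
                "SELECT _key FROM TenantRecord WHERE TenantId = ?", tenantId))
            .Select(row => (long)row[0])
            .ToList();

        // Step 2: process the matching entries.
        var results = cache.InvokeAll(ids, new TenantProcessor(), null);
    }
}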

On Fri, Apr 23, 2021 at 7:17 PM William.L  wrote:

> Hi,
>
> I am investigating whether the MapReduce API is the right tool for my
> scenario. Here's the context of the caches:
> * Multiple caches for different types of datasets
> * Each cache has multi-tenant data and the tenant id is part of the cache
> key
> * Each cache entry is a complex json/binary object that I want to do
> computation on (let's just say it is hard to do it in SQL) and return some
> complex results for each entry (e.g. a dictionary) that I want to do
> reduce/aggregation on.
> * The cluster has persistence enabled because we have more data than memory
>
> My scenario is to do the MapReduce operation only on data for a specific
> tenant (small subset of the data). From reading the forum about MapReduce,
> it seems like the best way to do this is using the IgniteCache.localEntries
> API and iterate through the node's local cache. My concern with this
> approach is that we would be looping through the whole cache, which is very
> inefficient. Is there a more efficient way to filter only the relevant keys
> and then access just the matching entries?
>
> Thanks.
>
>
>
>
>


Re: MapReduce - How to efficiently scan through subset of the caches?

2021-04-23 Thread Stephen Darlington
Add an index on the tenant, run a SQL query with setLocal(true). You might also 
want to look at the IgniteCompute#affinityRun method that takes a partition as 
a parameter and run it by partition rather than node (higher degree of 
parallelism, potentially makes failover easier).
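
In C# terms, a rough sketch of the local-query variant, with placeholder
names (a "records" cache whose query entity exposes an indexed TenantId
field); the comment notes where the partition-based AffinityRun overload
fits in:

using Apache.Ignite.Core;
using Apache.Ignite.Core.Cache.Query;
using Apache.Ignite.Core.Compute;

class LocalTenantScan : IComputeAction
{
    private readonly string _tenantId;

    public LocalTenantScan(string tenantId) { _tenantId = tenantId; }

    public void Invoke()
    {
        var cache = Ignition.GetIgnite().GetCache<long, object>("records");

        // Local = true: the query runs against this node's data only,
        // using the index on the tenant field.
        var qry = new SqlFieldsQuery(
            "SELECT _key FROM TenantRecord WHERE TenantId = ?", _tenantId)
        {
            Local = true
        };

        foreach (var row in cache.Query(qry))
        {
            // ... process the local entry for key (long) row[0] ...
        }
    }
}

static class Driver
{
    public static void Run(IIgnite ignite, string tenantId)
    {
        // One job per server node. If a rebalance is in flight this can miss
        // or double-process entries; the AffinityRun overload that takes a
        // partition pins one partition per job and avoids that, at the cost
        // of more jobs.
        ignite.GetCompute().Broadcast(new LocalTenantScan(tenantId));
    }
}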

> On 23 Apr 2021, at 17:16, William.L  wrote:
> 
> Hi,
> 
> I am investigating whether the MapReduce API is the right tool for my
> scenario. Here's the context of the caches:
> * Multiple caches for different types of datasets
> * Each cache has multi-tenant data and the tenant id is part of the cache
> key
> * Each cache entry is a complex json/binary object that I want to do
> computation on (let's just say it is hard to do it in SQL) and return some
> complex results for each entry (e.g. a dictionary) that I want to do
> reduce/aggregation on.
> * The cluster has persistence enabled because we have more data than memory
> 
> My scenario is to do the MapReduce operation only on data for a specific
> tenant (small subset of the data). From reading the forum about MapReduce,
> it seems like the best way to do this is using the IgniteCache.localEntries
> API and iterate through the node's local cache. My concern with this
> approach is that we would be looping through the whole cache, which is very
> inefficient. Is there a more efficient way to filter only the relevant keys
> and then access just the matching entries?
> 
> Thanks.
> 
> 
> 
> 




MapReduce - How to efficiently scan through subset of the caches?

2021-04-23 Thread William.L
Hi,

I am investigating whether the MapReduce API is the right tool for my
scenario. Here's the context of the caches:
* Multiple caches for different types of datasets
* Each cache has multi-tenant data and the tenant id is part of the cache
key
* Each cache entry is a complex json/binary object that I want to do
computation on (let's just say it is hard to do it in SQL) and return some
complex results for each entry (e.g. a dictionary) that I want to do
reduce/aggregation on.
* The cluster has persistence enabled because we have more data than memory

My scenario is to do the MapReduce operation only on data for a specific
tenant (small subset of the data). From reading the forum about MapReduce,
it seems like the best way to do this is using the IgniteCache.localEntries
API and iterate through the node's local cache. My concern with this
approach is that we would be looping through the whole cache, which is very
inefficient. Is there a more efficient way to filter only the relevant keys
and then access just the matching entries?

Thanks.






Re: Ignite 2.10. Performance tests in Azure

2021-04-23 Thread Stephen Darlington
I would say that Ignite tends to work better with large numbers of small 
requests rather than a small number of big batches. 

The Java API has the Data Streamer API to help; I'm not sure whether that's
in the C++ API. But I think smaller batches would help.

I’ve not seen anyone request 18000 records in a single get command. Using 
colocated compute to avoid copying all those records over the network would 
improve performance. Or at least iterate over them using a scan or SQL query.
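
To illustrate the last point in C# (the original test here is C++, so this
is just the shape; key/value types, the filter predicate and the page size
are placeholders), a scan query streams matching entries in pages and runs
the filter on the server nodes, so only matching entries cross the network:

using Apache.Ignite.Core.Cache;
using Apache.Ignite.Core.Cache.Query;

class Order
{
    public decimal Amount { get; set; }   // placeholder payload
}

class OrderFilter : ICacheEntryFilter<long, Order>
{
    // Evaluated on the server nodes, next to the data.
    public bool Invoke(ICacheEntry<long, Order> entry)
        => entry.Value.Amount > 0;        // placeholder predicate
}

static class ScanExample
{
    public static void Run(ICache<long, Order> cache)
    {
        var qry = new ScanQuery<long, Order>(new OrderFilter())
        {
            PageSize = 1024   // stream results in pages, not one big batch
        };

        foreach (var entry in cache.Query(qry))
        {
            // ... process entry.Key / entry.Value ...
        }
    }
}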

Regards,
Stephen

> On 23 Apr 2021, at 15:24, jjimeno  wrote:
> 
> Hello, and thanks for answering so quickly
> 
> Because, as you say, I should get bigger throughput when increasing the
> number of nodes.
> 
> Of course I can't get the best out of Ignite with this configuration, but I
> would expect something similar to what I see when writing: time decreasing
> as nodes increase, until the single thread becomes the bottleneck.
> 
> Also, I wouldn't expect writes to be faster than reads.
> 
> Sorry, I don't have these values, but I'll try to repeat the tests to get
> them.
> 
> Josemari.
> 
> 
> 




Re: Ignite 2.10. Performance tests in Azure

2021-04-23 Thread jjimeno
Hello, and thanks for answering so quickly

Because, as you say, I should get bigger throughput when increasing the
number of nodes.

Of course I can't get the best out of Ignite with this configuration, but I
would expect something similar to what I see when writing: time decreasing
as nodes increase, until the single thread becomes the bottleneck.

Also, I wouldn't expect writes to be faster than reads.

Sorry, I don't have these values, but I'll try to repeat the tests to get
them.

Josemari.





Re: Ignite 2.10. Performance tests in Azure

2021-04-23 Thread Ilya Kasnacheev
Hello!

Why do you expect it to scale if you seem to run this in only a single
thread?

In a distributed system, throughput will scale with cluster growth, but
latency will be steady or become slightly worse.

You need to run the same test with a sufficient number of threads, and
maybe use more than one client (VM and all) to drive load in order to
saturate it.

What is the CPU usage during the test on server nodes, per cluster size?
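
For illustration, a minimal multi-threaded read driver in C# (thread count,
key range and cache name are arbitrary, and it assumes the 1.8M test keys
are already loaded; the point is simply that many operations must be in
flight before cluster throughput becomes the limit):

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Apache.Ignite.Core;

static class LoadDriver
{
    public static void Run(IIgnite ignite)
    {
        var cache = ignite.GetCache<int, byte[]>("test");

        const int Threads = 32;
        const int OpsPerThread = 50000;
        const int KeyRange = 1800000;

        var sw = Stopwatch.StartNew();

        // 32 workers keep 32 operations in flight instead of 1.
        Parallel.For(0, Threads, t =>
        {
            var rnd = new Random(t);
            for (var i = 0; i < OpsPerThread; i++)
                cache.Get(rnd.Next(KeyRange));
        });

        var totalOps = (long)Threads * OpsPerThread;
        Console.WriteLine($"{totalOps / sw.Elapsed.TotalSeconds:F0} ops/sec");
    }
}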

Regards,
-- 
Ilya Kasnacheev


Fri, 23 Apr 2021 at 10:59, jjimeno :

> Hello all,
>
> For our project we need a distributed database with transactional support,
> and Ignite is one of the options we are testing.
>
> Scalability is one of our must-haves, so we created an Ignite Kubernetes
> cluster in Azure to test it, but we found that the results were not what we
> expected.
>
> To rule out a problem in our own code or in the use of transactional
> caches, we created a small test program that writes/reads 1.8M keys of 528
> bytes each (it represents one of our data types).
>
> As you can see in this graph, reading doesn't seem to scale, especially
> for the transactional cache, where having 4, 8 or 16 nodes in the cluster
> performs worse than having only 2:
> [graph elided]
>
> While writing in atomic caches does... until 8 nodes, then it gets steady
> (no transactional times because of this):
> [graph elided]
>
> Another strange thing is that, for atomic caches, reading seems to be
> slower than writing:
> [graph elided]
>
> So, my questions are:
>   - Could I be doing something wrong that leads to these results?
>   - How could reading timings be worse in a 4/8/16-node cluster than in a
> 2-node cluster for a transactional cache?
>   - How could reading be slower than writing in atomic caches?
>
> These are the source code and configuration files we're using:
> Test.cpp
> Order.h
> node-configuration.xml
> <http://apache-ignite-users.70518.x6.nabble.com/file/t3059/node-configuration.xml>
>
>
> Best regards and thanks in advance!
>
>
>
>
>


Re: Docker Running Apache Ignite Official Images memory continues to increase

2021-04-23 Thread Stephen Darlington
As Ilya says, it’s not clear that there’s an issue here. Java will use up to 
the amount of memory you configure for its heap. There’s no suggestion that 
it’s running out of memory.

As you insert, update and remove data in a cluster you would expect the memory 
footprint to increase over time and then plateau when it reaches the heap and 
the data region size limit.
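
For reference, a hedged sketch of bounding both in the .NET configuration
that comes up later in this thread (the values are examples, not
recommendations; JvmInitialMemoryMb/JvmMaxMemoryMb map to -Xms/-Xmx):

using Apache.Ignite.Core;
using Apache.Ignite.Core.Configuration;

static class MemoryBoundedConfig
{
    public static IgniteConfiguration Create()
    {
        return new IgniteConfiguration
        {
            JvmInitialMemoryMb = 1024,   // -Xms1g
            JvmMaxMemoryMb = 2048,       // -Xmx2g: heap usage plateaus here

            DataStorageConfiguration = new DataStorageConfiguration
            {
                DefaultDataRegionConfiguration = new DataRegionConfiguration
                {
                    Name = "default",
                    // Off-heap cap: cache data plateaus at this size.
                    MaxSize = 2L * 1024 * 1024 * 1024   // 2 GB
                }
            }
        };
    }
}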

> On 23 Apr 2021, at 03:20, yangpeng0...@sina.cn wrote:
> 
> Hello!
> 
> 
> This is the memory usage over a 6-hour window that I picked. Doesn't it
> look like the memory keeps slowly increasing? Is this a configuration
> problem or an Apache Ignite problem? Is there any way to verify whether it
> is an Ignite docker image problem?
> yangpeng0...@sina.cn 
>  
> From: Ilya Kasnacheev 
> Date: 2021-04-22 18:10
> To: user 
> Subject: Re: Re: Docker Running Apache Ignite Official Images memory 
> continues to increase
> Hello!
> 
> An Apache Ignite node will collect metrics periodically and send heartbeats,
> which allocates objects on the heap. If there is no load, the Java GC is free
> not to release this memory; it will reclaim it once there's load.
> 
> Regards,
> -- 
> Ilya Kasnacheev
> 
> 
> Thu, 22 Apr 2021 at 12:55, yangpeng0...@sina.cn:
> Hello!
> The reason I tested this docker image is that I have a .NET Core console
> program as the server side of Ignite. I also set the JvmInitialMemoryMb and
> JvmMaxMemoryMb parameters to 1024 MB and 2048 MB. After starting the program
> in docker I did nothing, yet the memory behaves the same as with the
> official docker image: it grows slowly and is not released.
> 
> yangpeng0...@sina.cn 
>  
> From: Ilya Kasnacheev 
> Date: 2021-04-22 17:23
> To: user 
> Subject: Re: Docker Running Apache Ignite Official Images memory continues to 
> increase
> Hello!
> 
> A JVM is free to use any amount of memory from 0 to -Xmx, and this cannot be
> qualified as a memory leak without some solid evidence.
> 
> Regards,
> -- 
> Ilya Kasnacheev
> 
> 
> Thu, 22 Apr 2021 at 11:08, yangpeng0...@sina.cn:
> Hi:
>   I pulled the docker image provided by Apache Ignite
> (https://hub.docker.com/r/apacheignite/ignite):
> docker pull apacheignite/ignite
> 
> Then I executed the docker run command:
> docker run -itd --cpuset-cpus="0" -m 4096M --memory-reservation 4096M
> --name apacheignite_ignite --net=host apacheignite/ignite:2.10.0
> 
> After running for 2 days, we found that the memory it uses keeps
> increasing, as shown in the figure:
> [figure elided]
> 
> How can we address this continuous memory growth?
> Is there a memory leak?
> Or is this slow memory growth normal when docker is running?
> yangpeng0...@sina.cn 



Ignite 2.10. Performance tests in Azure

2021-04-23 Thread jjimeno
Hello all,

For our project we need a distributed database with transactional support,
and Ignite is one of the options we are testing.

Scalability is one of our must-haves, so we created an Ignite Kubernetes
cluster in Azure to test it, but we found that the results were not what we
expected.

To rule out a problem in our own code or in the use of transactional caches,
we created a small test program that writes/reads 1.8M keys of 528 bytes each
(it represents one of our data types).

As you can see in this graph, reading doesn't seem to scale, especially for
the transactional cache, where having 4, 8 or 16 nodes in the cluster
performs worse than having only 2:
[graph elided]

While writing in atomic caches does... until 8 nodes, then it gets steady
(no transactional times because of this):
[graph elided]

Another strange thing is that, for atomic caches, reading seems to be slower
than writing:
[graph elided]

So, my questions are:
  - Could I be doing something wrong that leads to these results?
  - How could reading timings be worse in a 4/8/16-node cluster than in a
2-node cluster for a transactional cache?
  - How could reading be slower than writing in atomic caches?

These are the source code and configuration files we're using:
Test.cpp
Order.h
node-configuration.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t3059/node-configuration.xml>

Best regards and thanks in advance!



