Thanks for the quick response Denis.

I did a port range of 10 ports. I'll take a look at the failureDetectionTimeout and networkTimeout.

Side question: Is there an easy way to map between the programmatic API and the Spring XML properties? For instance, I was trying to find the correct XML incantation for TcpDiscoverySpi.setMaxMissedHeartbeats(int), and I might have a similar issue finding IgniteConfiguration.setFailureDetectionTimeout(long). It seems like I can usually drop the "set" and adjust the capitalization (setFooBar() == <property name="fooBar">).
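If that rule holds, my best guess for the XML (unverified, and the values are just placeholders) would be something like:

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- IgniteConfiguration.setFailureDetectionTimeout(long) -->
    <property name="failureDetectionTimeout" value="10000"/>

    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- TcpDiscoverySpi.setMaxMissedHeartbeats(int) -->
            <property name="maxMissedHeartbeats" value="3"/>
        </bean>
    </property>
</bean>

Does that look roughly right, or is there a reference table somewhere?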

Please pardon my ignorance on terminology:
Are the nodes I run ignite.sh on considered server nodes or cluster nodes? (I would have thought they were the same.)

Thanks,
Joe

Quoting Denis Magda <dma...@gridgain.com>:

Hi Joe,

How big is the port range that you specified in your discovery configuration for every single node? Please take into account that the discovery may iterate over every port in the range before one node connects to another, and depending on the TCP-related settings of your network it may take significant time before the cluster is assembled.

Here I would recommend that you reduce the port range as much as possible and play with the following network-related parameters (see the sketch below):
- Try to use the failure detection timeout instead of setting the socket, ack, and other timeouts explicitly (https://apacheignite.readme.io/docs/cluster-config#failure-detection-timeout).
- Try to play with TcpDiscoverySpi.networkTimeout, because this timeout is used while a node is trying to join the cluster.
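A minimal sketch of both settings (the 10000 ms values are only placeholders; tune them to your environment):

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- single timeout that replaces the explicit socket/ack/etc. timeouts -->
    <property name="failureDetectionTimeout" value="10000"/>

    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- timeout used while a node is joining the cluster -->
            <property name="networkTimeout" value="10000"/>
        </bean>
    </property>
</bean>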

In order to help you with the hanging compute tasks and to give you more specific recommendations regarding the slow join process, please provide us with the following:
- config files for server and cluster nodes;
- log files from all the nodes. Please start the nodes with the -DIGNITE_QUIET=false virtual machine property; if you start the nodes using ignite.sh/bat, just pass '-v' as an argument to the script;
- thread dumps for the nodes that hang waiting for the compute tasks to complete.

Regards,
Denis

On 10/26/2015 6:56 AM, d...@eiler.net wrote:
Hi all,

I have been experimenting with ignite and have run into a problem scaling up to larger clusters.

I am playing with only two use cases: 1) a Hadoop MapReduce accelerator, and 2) an in-memory data grid (no secondary file system) accessed by frameworks through the HDFS API.

Everything works fine with a smaller cluster (8 nodes), but with a larger cluster (64 nodes) it takes a couple of minutes for all the nodes to register with the cluster (which would be OK), and MapReduce jobs just hang and never return.

I've compiled the latest Ignite 1.4 (with ignite.edition=hadoop) from source and am using it with Hadoop 2.7.1, just trying to run things like the pi estimator and wordcount examples.

I started with the config/hadoop/default-config.xml

I can't use multicast, so I've configured it to use static IP-based discovery with just a single node/port range.
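Roughly, the discovery section of my config looks like this (the address here is a placeholder, but the shape is the same):

<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="ipFinder">
            <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                <property name="addresses">
                    <list>
                        <!-- single seed node with a 10-port range -->
                        <value>10.0.0.1:47500..47509</value>
                    </list>
                </property>
            </bean>
        </property>
    </bean>
</property>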

I've increased the heartbeat frequency to 10000, and that seemed to help make things more stable once all the nodes do join the cluster. I've also played with increasing both the socket timeout and the ack timeout, but that just seemed to make it take longer for nodes to attempt to join the cluster after a failed attempt.
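Those tweaks go into the same TcpDiscoverySpi bean shown above; roughly like this (the 10000 is what I actually set, the other values are just illustrative):

    <!-- heartbeat frequency raised to 10000 ms -->
    <property name="heartbeatFrequency" value="10000"/>
    <!-- these two seemed to just slow down rejoin attempts after a failure -->
    <property name="socketTimeout" value="5000"/>
    <property name="ackTimeout" value="5000"/>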

I have access to a couple of different clusters; we allocate resources with Slurm, so I get a piece of a cluster to play with (hence the no-multicast restriction). The nodes all have fast networks (FDR InfiniBand) and a decent amount of memory (64GB-128GB), but no local storage (or swap space).

As mentioned earlier, I disabled the secondaryFileSystem.

Any advice/hints/example XML configs would be extremely welcome.


I also haven't been seeing the expected performance using the HDFS API to access Ignite. I've tried both using the HDFS CLI to do some simple timings of put/get and a little Java program that writes and then reads a file. Even with small files (500MB) that should be kept completely in a single node, I only see about 250MB/s for writes, and reads are much slower than that (4x to 10x). The writes are better than HDFS (our HDFS is backed with pretty poor storage), but the reads are much slower. I haven't tried scaling this at all, but with an 8-node Ignite cluster and a single "client" accessing a single file I would hope for something closer to memory speeds. (If you would like me to split this into another message to the list, just let me know; I'm assuming the cause is the same: I missed a required config setting ;-) )

Thanks in advance for any help,
Joe






