Reducing the port range (to a single port) and lowering
IgniteConfiguration.setFailureDetectionTimeout to 1000 helped speed up
everybody joining the topology, and I was able to get the pi estimator
to run on 64 nodes.
Thanks again for the help, I'm over the current hurdle.
Joe
Quoting d...@eiler.net:
Thanks for the quick response Denis.
I did a port range of 10 ports. I'll take a look at the
failureDetectionTimeout and networkTimeout.
Side question: Is there an easy way to map between the programmatic
API and the spring XML properties? For instance I was trying to find
the correct xml incantation for
TcpDiscoverySpi.setMaxMissedHeartbeats(int) and I might have a
similar issue finding
IgniteConfiguration.setFailureDetectionTimeout(long). It seems like
I can usually drop the "set" and adjust the capitalization
(setFooBar() becomes <property name="fooBar" .../>).
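Spring resolves property names using the JavaBeans naming convention, so the "drop the set and lower-case the first letter" rule can be sketched in a few lines of Java. The class SetterNames and its method toPropertyName are hypothetical helpers written for illustration; they are not part of Ignite or Spring. The only library call is java.beans.Introspector.decapitalize from the JDK.

```java
import java.beans.Introspector;

// Hypothetical helper (not part of Ignite or Spring): derives the name used in
// a Spring <property name="..."> element from a JavaBeans setter name.
class SetterNames {
    static String toPropertyName(String setterName) {
        if (!setterName.startsWith("set") || setterName.length() <= 3) {
            throw new IllegalArgumentException("Not a setter name: " + setterName);
        }
        // Drop the "set" prefix, then apply the JavaBeans decapitalization rule.
        return Introspector.decapitalize(setterName.substring(3));
    }

    public static void main(String[] args) {
        System.out.println(toPropertyName("setFailureDetectionTimeout"));
        System.out.println(toPropertyName("setMaxMissedHeartbeats"));
    }
}
```

One edge case of the convention: when the first two letters after "set" are both upper case, the name is left unchanged, so a setter named setURL would map to the property name URL rather than uRL.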
Please pardon my ignorance on terminology:
Are the nodes I run ignite.sh on considered server nodes or cluster
nodes? (I would have thought they are the same.)
Thanks,
Joe
Quoting Denis Magda <dma...@gridgain.com>:
Hi Joe,
How big is the port range that you specified in your discovery
configuration for every single node?
Please take into account that the discovery may iterate over every
port from the range before one node connects to the other and
depending on the TCP related settings of your network it may take
significant time before the cluster is assembled.
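To get a feel for why a wide port range can slow the join, here is a rough back-of-the-envelope model in Java. It is only a sketch under stated assumptions: it assumes endpoints are tried one after another and that each attempt may cost one full timeout, which is not necessarily how Ignite's discovery behaves internally. The class JoinProbeEstimate, its methods, and the 5000 ms timeout are hypothetical values chosen for illustration.

```java
// Hypothetical estimator (not an Ignite API): a crude upper bound on the time
// a joining node could spend probing configured address:port endpoints.
class JoinProbeEstimate {
    // Number of address:port endpoints listed for discovery.
    static long endpointsToProbe(int hosts, int portRange) {
        return (long) hosts * portRange;
    }

    // Assumption: sequential attempts, each costing at most one full timeout.
    static long worstCaseMillis(int hosts, int portRange, long perAttemptTimeoutMs) {
        return endpointsToProbe(hosts, portRange) * perAttemptTimeoutMs;
    }

    public static void main(String[] args) {
        // One configured host with a range of 10 ports, assumed 5000 ms per attempt.
        System.out.println(worstCaseMillis(1, 10, 5000) + " ms");
        // The same host with a single port and an assumed 1000 ms per attempt.
        System.out.println(worstCaseMillis(1, 1, 1000) + " ms");
    }
}
```

Under this model the bound scales with the product of the port range and the per-attempt timeout, so shrinking either one shortens the worst case.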
I would recommend reducing the port range as much as possible and
experimenting with the following network-related parameters:
- Try to use the failure detection timeout instead of setting
socket, ack and many other timeouts explicitly
(https://apacheignite.readme.io/docs/cluster-config#failure-detection-timeout);
- Try to play with TcpDiscoverySpi.networkTimeout because this
timeout is considered during the time when a cluster node tries to
join a cluster.
In order to help you with the hanging compute tasks and to give you
more specific recommendations regarding the slow join process
please provide us with the following:
- config files for server and cluster nodes;
- log files from all the nodes. Please start the nodes with
-DIGNITE_QUIET=false virtual machine property. If you start the
nodes using ignite.sh/bat then just pass '-v' as an argument to the
script.
- thread dumps for the nodes that are hanging waiting for the
compute tasks to be completed.
Regards,
Denis
On 10/26/2015 6:56 AM, d...@eiler.net wrote:
Hi all,
I have been experimenting with ignite and have run into a problem
scaling up to larger clusters.
I am playing with only two different use cases: 1) a Hadoop
MapReduce accelerator and 2) an in-memory data grid (no secondary file
system) being accessed by frameworks using the HDFS API.
Everything works fine with a smaller cluster (8 nodes), but with a
larger cluster (64 nodes) it takes a couple of minutes for all the
nodes to register with the cluster (which would be OK) and
MapReduce jobs just hang and never return.
I've compiled the latest Ignite 1.4 (with ignite.edition=hadoop)
from source, and am using it with Hadoop 2.7.1 just trying to run
things like the pi estimator and wordcount examples.
I started with the config/hadoop/default-config.xml
I can't use multicast so I've configured it to use static IP based
discovery with just a single node/port range.
I've increased the heartbeat frequency to 10000 and that seemed to
help make things more stable once all the nodes do join the
cluster. I've also played with increasing both the socket timeout
and the ack timeout but that seemed to just make it take longer
for nodes to attempt to join the cluster after a failed attempt.
I have access to a couple of different clusters, we allocate
resources with slurm so I get a piece of a cluster to play with
(hence the no-multicast restriction). The nodes all have fast
networks (FDR InfiniBand) and a decent amount of memory
(64GB-128GB) but no local storage (or swap space).
As mentioned earlier, I disable the secondaryFilesystem.
Any advice/hints/example xml configs would be extremely welcome.
I also haven't been seeing the expected performance using the hdfs
api to access ignite. I've tried both using the hdfs cli to do
some simple timings of put/get and a little java program that
writes then reads a file. Even with small files (500MB) that
should be kept completely in a single node, I only see about
250MB/s for writes and reads are much slower than that (4x to
10x). The writes are better than hdfs (our hdfs is backed with
pretty poor storage) but reads are much slower. Now I haven't
tried scaling this at all, but with an 8 node ignite cluster and a
single "client" accessing a single file I would hope for something
closer to memory speeds. (If you would like me to split this into
another message to the list just let me know, I'm assuming the
cause is the same---I missed a required config setting ;-) )
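For reference, a write-then-read timing of this kind can be sketched as below. This is not the program used for the numbers above: it is a minimal stand-in that times a local temporary file with java.nio, and the class ThroughputTimer, its method mbPerSec, and the 16 MB test size are hypothetical. Measuring Ignite through the HDFS API would mean replacing the Files calls with the Hadoop FileSystem streams.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical timing harness: writes a buffer to a file, reads it back,
// and reports the throughput of each step in MB/s.
class ThroughputTimer {
    // Converts a byte count and an elapsed time in nanoseconds to MB/s.
    static double mbPerSec(long bytes, long elapsedNanos) {
        double megabytes = bytes / (1024.0 * 1024.0);
        double seconds = elapsedNanos / 1e9;
        return megabytes / seconds;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[16 * 1024 * 1024]; // assumed small test size
        Path file = Files.createTempFile("throughput", ".bin");
        try {
            long t0 = System.nanoTime();
            Files.write(file, data);
            long t1 = System.nanoTime();
            byte[] back = Files.readAllBytes(file);
            long t2 = System.nanoTime();
            System.out.printf("write: %.1f MB/s%n", mbPerSec(data.length, t1 - t0));
            System.out.printf("read:  %.1f MB/s%n", mbPerSec(back.length, t2 - t1));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

As a sanity check on the arithmetic, a 500MB file written in 2 seconds works out to 250MB/s with this formula.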
Thanks in advance for any help,
Joe