Thanks Steve - we are already taking the safe route: the NN and
datanodes run on the central mesos-masters, which are on-demand instances.
Later (much later!) we _may_ put some datanodes on spot instances, spread
across several spot instance types, since the price spikes seem to affect
only one type at a time (worst case, we can rebuild the data as well).
OTOH this would mainly be beneficial only if spark/mesos understood data
locality, which is probably some time off (we don't need this ability now).
Indeed, the error we are seeing is orthogonal to the setup - however my
understanding of ha-hdfs is that the nameservice should be resolved via the
hdfs-site.xml file and doesn't use DNS whatsoever (and indeed, it _does_
work - but only after we initialise the driver with a bad hdfs url). I
think some HDFS initialisation is therefore missing when running
spark on mesos - my suspicion is on the spark side (or my spark config).
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details
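For reference, here is a minimal Python sketch of the client-side settings that page describes, and of why hdfs://nameservice1 needs no DNS entry when they are present. The property names are the standard HA ones; the NameNode ids (nn1, nn2) and the example.com hostnames are made up for illustration, and is_logical_uri is only a simplified stand-in for the check the HDFS client makes, not its actual code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Prefix of the per-nameservice failover proxy provider property.
PROXY_PREFIX = "dfs.client.failover.proxy.provider."


def ha_client_properties(nameservice, namenodes):
    """Build the client-side HA properties for one logical nameservice.

    namenodes maps a NameNode id (e.g. "nn1") to its "host:port" RPC address.
    """
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        PROXY_PREFIX + nameservice: (
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider"
        ),
    }
    for nn_id, address in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = address
    return props


def to_hdfs_site_xml(props):
    """Render the properties in the <configuration> layout of hdfs-site.xml."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def is_logical_uri(uri, props):
    """Simplified check: the authority is a logical nameservice only if a
    failover proxy provider is configured for it; otherwise it is treated
    as an ordinary hostname (and would need DNS)."""
    host = urlparse(uri).hostname
    return host is not None and (PROXY_PREFIX + host) in props


# Hypothetical NameNode ids and hostnames, for illustration only.
props = ha_client_properties(
    "nameservice1",
    {"nn1": "master1.example.com:8020", "nn2": "master2.example.com:8020"},
)
print(to_hdfs_site_xml(props))
print(is_logical_uri("hdfs://nameservice1", props))   # HA config visible
print(is_logical_uri("hdfs://nameservice1", {}))      # HA config not visible
```

If the process opening the URL doesn't see these properties (an empty dict in the sketch), nameservice1 is just treated as a hostname, which would be consistent with the failure we see, though it doesn't prove where the config is getting lost.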
On 15/09/2015 10:24, Steve Loughran wrote:
On 15 Sep 2015, at 08:55, Adrian Bridgett <adr...@opensignal.com> wrote:
Hi Sam, in short, no, it's a traditional install as we plan to use spot
instances and didn't want price spikes to kill off HDFS.
We're actually doing a bit of a hybrid, using spot instances for the mesos
slaves and on-demand for the mesos masters. So for the time being we're putting
hdfs on the masters (we'll probably move to multiple slave instance types to
avoid losing too many when the spot price spikes, but for now this is acceptable).
Masters running CDH5.
It's incredibly dangerous using hdfs NNs on spot vms; a significant enough
spike will lose all of them in one go, and there goes your entire filesystem.
Have a static VM, maybe even backed by EBS.
If you look at Hadoop architectures from Hortonworks, Cloudera and Amazon
themselves, the usual stance is HDFS on static nodes, spot instances for
compute only.
Using hdfs://current-hdfs-master:8020 works fine, however using
hdfs://nameservice1 fails in the rather odd way described (well, more that the
workaround actually works!) I think there's some underlying bug here that's
being exposed.
This sounds like an issue orthogonal to spot instances. Maybe related to how JVMs
cache DNS entries forever?
--
*Adrian Bridgett* | Sysadmin Engineer, OpenSignal
<http://www.opensignal.com>