Thanks Steve - we are already taking the safe route, putting the NN and datanodes on the central mesos-masters, which are on-demand instances. Later (much later!) we _may_ put some datanodes on spot instances (and use several spot instance types, as the price spikes seem to affect only one type - worst case we can rebuild the data as well). OTOH this would mainly be beneficial only if Spark/Mesos understood the data locality, which is probably some time off (we don't need this ability now).

Indeed, the error we are seeing is orthogonal to the setup - however, my understanding of HA HDFS is that the nameservice should be resolved via the hdfs-site.xml file and doesn't use DNS at all (and indeed, it _does_ work - but only after we initialise the driver with a bad HDFS URL). I therefore think there's some (missing) HDFS initialisation when running Spark on Mesos - my suspicion is on the Spark side (or in my Spark config).

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html#Configuration_details
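For reference, a minimal sketch of the client-side HA settings from that page, as they'd appear in hdfs-site.xml (the NameNode hostnames are placeholders; "nameservice1" matches the nameservice we use):

```xml
<!-- Client-side HA configuration sketch; hostnames are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>nameservice1</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameservice1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<!-- How HDFS clients pick the active NN for "nameservice1" -->
<property>
  <name>dfs.client.failover.proxy.provider.nameservice1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With this in place, clients can resolve hdfs://nameservice1/... purely from configuration, without any DNS entry for "nameservice1" - which is why a DNS explanation doesn't quite fit here.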

On 15/09/2015 10:24, Steve Loughran wrote:
On 15 Sep 2015, at 08:55, Adrian Bridgett <adr...@opensignal.com> wrote:

Hi Sam, in short, no, it's a traditional install as we plan to use spot 
instances and didn't want price spikes to kill off HDFS.

We're actually doing a bit of a hybrid, using spot instances for the mesos
slaves and on-demand for the mesos masters.  So for the time being we're putting HDFS on
the masters (we'll probably move to multiple slave instance types to avoid
losing too many when the spot price spikes, but for now this is acceptable).
Masters are running CDH5.
It's incredibly dangerous to run HDFS NNs on spot VMs; a significant enough
spike will lose all of them in one go, and there goes your entire filesystem.
Have a static VM, maybe even backed by EBS.

If you look at Hadoop architectures from Hortonworks, Cloudera and Amazon
themselves, the usual stance is HDFS on static nodes, with spot instances for
compute only.

Using hdfs://current-hdfs-master:8020 works fine; however, using
hdfs://nameservice1 fails in the rather odd way described (well, more that the
workaround actually works!).  I think there's some underlying bug here that's
being exposed.

this sounds like an issue orthogonal to spot instances. Maybe related to how JVMs
cache DNS entries forever?
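If DNS caching were the culprit, the JVM-wide positive-lookup cache TTL can be capped via the standard "networkaddress.cache.ttl" security property; a minimal sketch (the 60-second value is illustrative, not a recommendation):

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // The JVM may cache successful DNS lookups indefinitely (notably
        // when a security manager is set). Capping the TTL lets clients
        // pick up re-resolved addresses, e.g. after a spot node is replaced.
        // Must run before the first lookup to take effect.
        Security.setProperty("networkaddress.cache.ttl", "60");
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

This would only matter for hostnames that are actually looked up, though - which is why the hdfs-site.xml-driven nameservice resolution above doesn't obviously fit a DNS-caching explanation.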

--
*Adrian Bridgett* | Sysadmin Engineer, OpenSignal <http://www.opensignal.com>
_____________________________________________________
Office: First Floor, Scriptor Court, 155-157 Farringdon Road, Clerkenwell, London, EC1R 3AD
Phone #: +44 777-377-8251
Skype: abridgett |@adrianbridgett <http://twitter.com/adrianbridgett>| LinkedIn link <https://uk.linkedin.com/in/abridgett>
_____________________________________________________
