Re: Do i really need HDFS?

2014-10-22 Thread Dick Davies
Be interested to know what that is, if you don't mind sharing.

We're thinking of deploying a Ceph cluster for another project anyway,
it seems to remove some of the chokepoints/points of failure HDFS suffers from
but I've no idea how well it can interoperate with the usual HDFS clients
(Spark in my particular case but I'm trying to keep this general).
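(Worth noting: on Mesos, Spark can pull the executor tarball over plain HTTP too,
via spark.executor.uri in spark-defaults.conf; the hostname below is made up:

```
spark.executor.uri  http://repo.internal.example/spark-1.1.0-bin-hadoop2.4.tgz
```

so strictly speaking HDFS isn't needed just for executor distribution.)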

On 21 October 2014 13:16, David Greenberg dsg123456...@gmail.com wrote:
 We use spark without HDFS--in our case, we just use ansible to copy the
 spark executors onto all hosts at the same path. We also load and store our
 spark data from non-HDFS sources.

 On Tue, Oct 21, 2014 at 4:57 AM, Dick Davies d...@hellooperator.net wrote:

 I think Spark needs a way to send jobs to/from the workers - the Spark
 distro itself
 will pull down the executor ok, but in my (very basic) tests I got
 stuck without HDFS.

 So basically it depends on the framework. I think in Sparks case they
 assume most
 users are migrating from an existing Hadoop deployment, so HDFS is
 sort of assumed.


 On 20 October 2014 23:18, CCAAT cc...@tampabay.rr.com wrote:
  On 10/20/14 11:46, Steven Schlansker wrote:
 
 
  We are running Mesos entirely without HDFS with no problems.  We use
  Docker to distribute our
  application to slave nodes, and keep no state on individual nodes.
 
 
 
  Background: I'm building up a 3 node cluster to run mesos and spark. No
  legacy Hadoop needed or wanted. I am using btrfs for the local file
  system,
  with (2) drives set up for raid1 on each system.
 
  So you  are suggesting that I can install mesos + spark + docker
  and not a DFS on these (3) machines?
 
 
  Will I need any other softwares? My application is a geophysical
  fluid simulator, so scala, R, and all sorts of advanced math will
  be required on the cluster for the Finite Element Methods.
 
 
  James
 
 




Re: Do i really need HDFS?

2014-10-22 Thread David Greenberg
We use lustre and a couple of internal data storage services. I wouldn't
recommend lustre much; it's got an SPOF, which is a problem at scale. I just
wanted to point out that you can skip hdfs if you so choose.

On Wednesday, October 22, 2014, Dick Davies d...@hellooperator.net wrote:

 Be interested to know what that is, if you don't mind sharing.

 We're thinking of deploying a Ceph cluster for another project anyway,
 it seems to remove some of the chokepoints/points of failure HDFS suffers
 from
 but I've no idea how well it can interoperate with the usual HDFS clients
 (Spark in my particular case but I'm trying to keep this general).




Re: Do i really need HDFS?

2014-10-22 Thread Tim St Clair
If it's locally mounted via fuse then there is no issue.

Also, there are tickets open about volume mounting in the sandbox; that would be
the ideal solution.

Cheers,
Tim

- Original Message -
 From: Dick Davies d...@hellooperator.net
 To: user@mesos.apache.org
 Sent: Wednesday, October 22, 2014 2:29:20 AM
 Subject: Re: Do i really need HDFS?
 
 Be interested to know what that is, if you don't mind sharing.
 
 We're thinking of deploying a Ceph cluster for another project anyway,
 it seems to remove some of the chokepoints/points of failure HDFS suffers
 from
 but I've no idea how well it can interoperate with the usual HDFS clients
 (Spark in my particular case but I'm trying to keep this general).
 
 

-- 
Cheers,
Timothy St. Clair
Red Hat Inc.


Re: Do i really need HDFS?

2014-10-22 Thread CCAAT

Ok so,

I'd be curious to know your final architecture (D. Davies)?

I was looking to put Ceph on top of the (3) btrfs nodes in case we need
a DFS at some later point. We're not really sure what software will be
in our final mix. Certainly installing Ceph does not hurt anything (?);
and I'm not sure we want to use Ceph from userspace only. We have had
excellent success with btrfs, so that is firm for us, short of some
gaping problem emerging. Growing the cluster will happen once we
establish its basic functionality.

Right now the focus is on subsurface fluid simulations for carbon
sequestration, but using the cluster for general (cron/Chronos)
batch jobs is a secondary appeal to us. So, I guess my question is:
knowing that we want to avoid the hdfs/hadoop setup entirely, will
local FS/DFS with btrfs/ceph be sufficiently robust to test not only
mesos+spark but many other related software packages (R, scala, sparkR,
SQL databases, and so on)? We're just trying to avoid some common
mistakes as we move forward with mesos.


James



On 10/22/14 02:29, Dick Davies wrote:

Be interested to know what that is, if you don't mind sharing.

We're thinking of deploying a Ceph cluster for another project anyway,
it seems to remove some of the chokepoints/points of failure HDFS suffers from
but I've no idea how well it can interoperate with the usual HDFS clients
(Spark in my particular case but I'm trying to keep this general).



Re: Do i really need HDFS?

2014-10-21 Thread David Greenberg
We use spark without HDFS--in our case, we just use ansible to copy the
spark executors onto all hosts at the same path. We also load and store our
spark data from non-HDFS sources.

On Tue, Oct 21, 2014 at 4:57 AM, Dick Davies d...@hellooperator.net wrote:

 I think Spark needs a way to send jobs to/from the workers - the Spark
 distro itself
 will pull down the executor ok, but in my (very basic) tests I got
 stuck without HDFS.

 So basically it depends on the framework. I think in Sparks case they
 assume most
 users are migrating from an existing Hadoop deployment, so HDFS is
 sort of assumed.


 On 20 October 2014 23:18, CCAAT cc...@tampabay.rr.com wrote:
  On 10/20/14 11:46, Steven Schlansker wrote:
 
 
  We are running Mesos entirely without HDFS with no problems.  We use
  Docker to distribute our
  application to slave nodes, and keep no state on individual nodes.
 
 
 
  Background: I'm building up a 3 node cluster to run mesos and spark. No
  legacy Hadoop needed or wanted. I am using btrfs for the local file
 system,
  with (2) drives set up for raid1 on each system.
 
  So you  are suggesting that I can install mesos + spark + docker
  and not a DFS on these (3) machines?
 
 
  Will I need any other softwares? My application is a geophysical
  fluid simulator, so scala, R, and all sorts of advanced math will
  be required on the cluster for the Finite Element Methods.
 
 
  James
 
 



Re: Do i really need HDFS?

2014-10-21 Thread Ankur Chauhan
So that means even if I don't use the DFS, I would still need the HDFS namenode
and datanode (and related config) to fetch s3 and s3n URIs.

Sent from my iPhone

 On Oct 21, 2014, at 8:40 AM, Tim St Clair tstcl...@redhat.com wrote:
 
 Ankur - 
 
 To answer your specific question re: 
 Q: Is an s3 path considered non-hdfs? 
 A: At this time no; it uses the hdfs layer to resolve (for better or worse).  
  
 
 -
  // Grab the resource using the hadoop client if it's one of the known schemes
  // TODO(tarnfeld): This isn't very scalable with hadoop's pluggable
  // filesystem implementations.
  // TODO(matei): Enforce some size limits on files we get from HDFS
  if (strings::startsWith(uri, "hdfs://") ||
      strings::startsWith(uri, "hftp://") ||
      strings::startsWith(uri, "s3://") ||
      strings::startsWith(uri, "s3n://")) {
    Try<string> base = os::basename(uri);
    if (base.isError()) {
      LOG(ERROR) << "Invalid basename for URI: " << base.error();
      return Error("Invalid basename for URI");
    }
    string path = path::join(directory, base.get());
 
    HDFS hdfs;
 
    LOG(INFO) << "Downloading resource from '" << uri
              << "' to '" << path << "'";
    Try<Nothing> result = hdfs.copyToLocal(uri, path);
    if (result.isError()) {
      LOG(ERROR) << "HDFS copyToLocal failed: " << result.error();
      return Error(result.error());
    }
 -
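 In plain terms, the scheme check above boils down to something like this
 standalone sketch (my own illustration, not the actual Mesos source):

```cpp
#include <cassert>
#include <string>

// Returns true when the fetcher would hand the URI to the Hadoop
// client (and hence needs the hdfs layer configured), mirroring
// the strings::startsWith checks in the snippet above.
bool usesHadoopClient(const std::string& uri) {
  const std::string schemes[] = {"hdfs://", "hftp://", "s3://", "s3n://"};
  for (const std::string& scheme : schemes) {
    if (uri.compare(0, scheme.size(), scheme) == 0) {
      return true;
    }
  }
  return false;
}
```

 So an s3:// artifact goes down the same code path as hdfs://, which is why a
 bare s3 URI still pulls in the Hadoop client and its config.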
 
 - Original Message - 
 
 From: Ankur Chauhan an...@malloc64.com
 To: user@mesos.apache.org
 Sent: Tuesday, October 21, 2014 10:28:50 AM
 Subject: Re: Do i really need HDFS?
 
 This is what I also intend to do. Is a s3 path considered non-hdfs? If so,
 how does it know the credentials to use to fetch the file.
 
 Sent from my iPhone
 
 -- 
 Cheers,
 Timothy St. Clair
 Red Hat Inc.


Do i really need HDFS?

2014-10-20 Thread Ankur Chauhan
Hi all,


I am trying to set up a new mesos cluster, and so far I have a set of master 
and slave nodes working and I can get everything running. I am able to install 
and run a couple of sample apps, hook up jenkins, etc. My main question now is: do 
I really need HDFS? All my artifacts (for apps) are on a protected S3 bucket or 
in a private docker registry.


If I need HDFS, do I need to go all in even when I am not using hdfs as a 
data store but rather as a simple way to fetch files from s3? Or can I get away 
with just putting the correct core-site.xml and hdfs-site.xml in HADOOP_HOME?


It would really help to know how others have their mesos set up in production, 
and what they would recommend regarding my setup.



--Ankur Chauhan

Re: Do i really need HDFS?

2014-10-20 Thread David Greenberg
You certainly don't need hdfs if you've got other infrastructure that's
providing the same sorts of features. We don't run hdfs on our mesos
cluster, and it's fine.
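
For the S3 fetch specifically, I believe the slaves only need the Hadoop client
on the path plus credentials in core-site.xml; a sketch (s3n property names from
the Hadoop s3n connector, values are placeholders, double-check for your Hadoop
version):

```xml
<!-- core-site.xml: just enough for the fetcher to resolve s3n:// URIs.
     Values below are placeholders, not real credentials. -->
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```

No namenode or datanode processes are required for that path.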

On Monday, October 20, 2014, Ankur Chauhan an...@malloc64.com wrote:

  Hi all,

 I am trying to setup a new mesos cluster and I so far I have a set of
 master and slave nodes working and I can get everything running. I am able
 to install and run a couple of sample apps, hookup jenkins etc. My main
 question now is Do I really need HDFS? All my artifacts (for apps) are on a
 protected S3 bucket or in a private docker registry.

 If I need HDFS, do I need to go all in even when I am not using hdfs as
 a data store but rather as a simple way to fetch files from s3; or can I
 get away with putting the correct core-site.xml and hdfs-site.xml in
 HADOOP_HOME and get away with it?

 It would really help how other have their mesos setup in production or
 what they would recommend regarding my setup?


 --
 Ankur Chauhan



Re: Do i really need HDFS?

2014-10-20 Thread Ankur Chauhan
If HDFS is not required, what is the minimum config needed to fetch s3:// URIs? 

Sent from my iPhone

 On Oct 20, 2014, at 7:53 AM, David Greenberg dsg123456...@gmail.com wrote:
 
 A DFS is merely a convenience for getting data to all the nodes of your mesos 
 cluster. If you want to have all nodes retrieve data from HTTP servers (like 
 S3), this is feasible as well. If you want your nodes to store data, you can 
 have them write to any database (although I'd recommend something scalable, 
 like Riak, rather than a DB like an unsharded MySQL)
 
 On Mon, Oct 20, 2014 at 11:33 AM, CCAAT cc...@tampabay.rr.com wrote:
 If one is building a mesos-cluster, then is a DFS mandatory, or is there
 combinations of other codes that suffice for the needs of a mesos cluster?
 
 So what is the list of Distributed File Systems that are generally 
 available and mesos is known to work on top of?
 
 Glusterfs, Lustrefs, FhGFS (BeeGFS), Ceph
 
 [1] http://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
 
 
 An explicit list of other DFS's or combinations of codes that provide
 these required 'feature sets' is of interest to many.
 
 
 James
 
 
 
 
 On 10/20/14 07:08, David Greenberg wrote:
 You certainly don't need hdfs if you've got other infrastructure that's
 providing the same sorts of features. We don't run hdfs on our mesos
 cluster, and it's fine.
 
 On Monday, October 20, 2014, Ankur Chauhan an...@malloc64.com
 mailto:an...@malloc64.com wrote:
 
 __
 Hi all,
 
 I am trying to setup a new mesos cluster and I so far I have a set
 of master and slave nodes working and I can get everything running.
 I am able to install and run a couple of sample apps, hookup jenkins
 etc. My main question now is Do I really need HDFS? All my artifacts
 (for apps) are on a protected S3 bucket or in a private docker registry.
 
 If I need HDFS, do I need to go all in even when I am not using
 hdfs as a data store but rather as a simple way to fetch files from
 s3; or can I get away with putting the correct core-site.xml and
 hdfs-site.xml in HADOOP_HOME and get away with it?
 
 It would really help how other have their mesos setup in production
 or what they would recommend regarding my setup?
 
 
 --
 Ankur Chauhan