Re: Do i really need HDFS?
Be interested to know what that is, if you don't mind sharing.

We're thinking of deploying a Ceph cluster for another project anyway; it seems to remove some of the chokepoints/points of failure HDFS suffers from, but I've no idea how well it can interoperate with the usual HDFS clients (Spark in my particular case, but I'm trying to keep this general).

On 21 October 2014 13:16, David Greenberg <dsg123456...@gmail.com> wrote:

We use Spark without HDFS--in our case, we just use Ansible to copy the Spark executors onto all hosts at the same path. We also load and store our Spark data from non-HDFS sources.

On Tue, Oct 21, 2014 at 4:57 AM, Dick Davies <d...@hellooperator.net> wrote:

I think Spark needs a way to send jobs to/from the workers - the Spark distro itself will pull down the executor OK, but in my (very basic) tests I got stuck without HDFS. So basically it depends on the framework. I think in Spark's case they assume most users are migrating from an existing Hadoop deployment, so HDFS is sort of assumed.

On 20 October 2014 23:18, CCAAT <cc...@tampabay.rr.com> wrote:

On 10/20/14 11:46, Steven Schlansker wrote:

We are running Mesos entirely without HDFS with no problems. We use Docker to distribute our application to slave nodes, and keep no state on individual nodes.

Background: I'm building up a 3-node cluster to run Mesos and Spark. No legacy Hadoop needed or wanted. I am using btrfs for the local file system, with (2) drives set up for RAID1 on each system. So you are suggesting that I can install Mesos + Spark + Docker and no DFS on these (3) machines? Will I need any other software? My application is a geophysical fluid simulator, so Scala, R, and all sorts of advanced math will be required on the cluster for the Finite Element Methods.

James
Re: Do i really need HDFS?
We use Lustre and a couple of internal data storage services. I wouldn't recommend Lustre much; it's got an SPOF, which is a problem at scale. I just wanted to point out that you can skip HDFS if you so choose.

On Wednesday, October 22, 2014, Dick Davies <d...@hellooperator.net> wrote:

Be interested to know what that is, if you don't mind sharing. We're thinking of deploying a Ceph cluster for another project anyway; it seems to remove some of the chokepoints/points of failure HDFS suffers from, but I've no idea how well it can interoperate with the usual HDFS clients (Spark in my particular case, but I'm trying to keep this general).
Re: Do i really need HDFS?
If it's locally mounted via FUSE then there is no issue. Also, there are tickets open about volume mounting in the sandbox; that would be the ideal solution.

Cheers,
Tim

- Original Message -
From: Dick Davies <d...@hellooperator.net>
To: user@mesos.apache.org
Sent: Wednesday, October 22, 2014 2:29:20 AM
Subject: Re: Do i really need HDFS?

Be interested to know what that is, if you don't mind sharing. We're thinking of deploying a Ceph cluster for another project anyway; it seems to remove some of the chokepoints/points of failure HDFS suffers from, but I've no idea how well it can interoperate with the usual HDFS clients (Spark in my particular case, but I'm trying to keep this general).

--
Cheers,
Timothy St. Clair
Red Hat Inc.
Re: Do i really need HDFS?
OK so, I'd be curious to know your final architecture (D. Davies)? I was looking to put Ceph on top of the (3) btrfs nodes in case we need a DFS at some later point. We're not really sure what software will be in our final mix. Certainly installing Ceph does not hurt anything (?); and I'm not sure we want to use Ceph from userspace only.

We have had excellent success using btrfs, so that is firm for us, short of some gaping problem emerging. Growing the cluster size will happen once we establish the basic functionality of the cluster. Right now there is a focus on subsurface fluid simulations for carbon sequestration, but using the cluster for general (cron/Chronos) batch jobs is a secondary appeal to us.

So, I guess my question is: knowing that we want to avoid the HDFS/Hadoop setup entirely, will a local FS/DFS built on btrfs/Ceph be sufficiently robust to test not only Mesos + Spark but many other related packages, such as (but not limited to) R, Scala, SparkR, and SQL databases? We're just trying to avoid some common mistakes as we move forward with Mesos.

James

On 10/22/14 02:29, Dick Davies wrote:

Be interested to know what that is, if you don't mind sharing. We're thinking of deploying a Ceph cluster for another project anyway; it seems to remove some of the chokepoints/points of failure HDFS suffers from, but I've no idea how well it can interoperate with the usual HDFS clients (Spark in my particular case, but I'm trying to keep this general).
Re: Do i really need HDFS?
We use Spark without HDFS--in our case, we just use Ansible to copy the Spark executors onto all hosts at the same path. We also load and store our Spark data from non-HDFS sources.

On Tue, Oct 21, 2014 at 4:57 AM, Dick Davies <d...@hellooperator.net> wrote:

I think Spark needs a way to send jobs to/from the workers - the Spark distro itself will pull down the executor OK, but in my (very basic) tests I got stuck without HDFS. So basically it depends on the framework. I think in Spark's case they assume most users are migrating from an existing Hadoop deployment, so HDFS is sort of assumed.
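For anyone curious what that Ansible approach looks like, here is a minimal sketch; the group name, archive name, and destination path are placeholders of ours, not details from David's setup:

```yaml
# Hypothetical playbook: unpack the same Spark distribution at the same
# path on every node. Group name, archive, and paths are placeholders.
- hosts: mesos_slaves
  tasks:
    - name: unpack spark to a common path on all hosts
      unarchive:
        src: spark-1.1.0-bin-hadoop2.4.tgz   # local archive, copied to the host and unpacked
        dest: /opt
```

Spark's Mesos backend can then be pointed at that common path (or at a tarball via `spark.executor.uri`) without any DFS in the picture.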
Re: Do i really need HDFS?
So that means even if I don't use the DFS, I would need the HDFS namenode and datanode (and related config) to fetch s3 and s3n URIs.

Sent from my iPhone

On Oct 21, 2014, at 8:40 AM, Tim St Clair <tstcl...@redhat.com> wrote:

Ankur -

To answer your specific question:

Q: Is a s3 path considered non-hdfs?
A: At this time no, it uses the hdfs layer to resolve (for better or worse).

-------------------------------------------------------
  // Grab the resource using the hadoop client if it's one of the known schemes
  // TODO(tarnfeld): This isn't very scalable with hadoop's pluggable
  // filesystem implementations.
  // TODO(matei): Enforce some size limits on files we get from HDFS
  if (strings::startsWith(uri, "hdfs://") ||
      strings::startsWith(uri, "hftp://") ||
      strings::startsWith(uri, "s3://") ||
      strings::startsWith(uri, "s3n://")) {
    Try<string> base = os::basename(uri);
    if (base.isError()) {
      LOG(ERROR) << "Invalid basename for URI: " << base.error();
      return Error("Invalid basename for URI");
    }
    string path = path::join(directory, base.get());

    HDFS hdfs;

    LOG(INFO) << "Downloading resource from '" << uri << "' to '" << path << "'";

    Try<Nothing> result = hdfs.copyToLocal(uri, path);
    if (result.isError()) {
      LOG(ERROR) << "HDFS copyToLocal failed: " << result.error();
      return Error(result.error());
    }
-------------------------------------------------------

- Original Message -
From: Ankur Chauhan <an...@malloc64.com>
To: user@mesos.apache.org
Sent: Tuesday, October 21, 2014 10:28:50 AM
Subject: Re: Do i really need HDFS?

This is what I also intend to do. Is a s3 path considered non-hdfs? If so, how does it know the credentials to use to fetch the file?

--
Cheers,
Timothy St. Clair
Red Hat Inc.
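For reference, the routing rule in that fetcher snippet reduces to a URI-prefix check. The same decision sketched in Python, purely as an illustration (the function name is ours; only the scheme list comes from the snippet):

```python
# Illustration only: mirrors the prefix check in the Mesos fetcher snippet
# quoted above. The function name is ours, not a Mesos API.
HADOOP_SCHEMES = ("hdfs://", "hftp://", "s3://", "s3n://")

def fetched_via_hadoop_client(uri: str) -> bool:
    """Return True if a URI with this scheme would be resolved via the hdfs layer."""
    return uri.startswith(HADOOP_SCHEMES)
```

So an `s3://` or `s3n://` artifact goes through the Hadoop client even on a cluster that stores nothing in HDFS, which is exactly Ankur's concern above.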
Do i really need HDFS?
Hi all,

I am trying to set up a new Mesos cluster, and so far I have a set of master and slave nodes working and can get everything running. I am able to install and run a couple of sample apps, hook up Jenkins, etc.

My main question now is: do I really need HDFS? All my artifacts (for apps) are on a protected S3 bucket or in a private Docker registry. If I need HDFS, do I need to go all in even when I am not using HDFS as a data store but rather as a simple way to fetch files from S3, or can I get away with putting the correct core-site.xml and hdfs-site.xml in HADOOP_HOME?

It would really help to know how others have their Mesos setup in production, or what they would recommend regarding my setup.

--Ankur Chauhan
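For the record, the core-site.xml route asked about here looks roughly like the following. This is only a sketch: the property names are the standard Hadoop s3n credential keys, and the values are obviously placeholders.

```xml
<?xml version="1.0"?>
<!-- Sketch only: standard Hadoop s3n credential properties; values are placeholders. -->
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```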
Re: Do i really need HDFS?
You certainly don't need HDFS if you've got other infrastructure that's providing the same sorts of features. We don't run HDFS on our Mesos cluster, and it's fine.

On Monday, October 20, 2014, Ankur Chauhan <an...@malloc64.com> wrote:

Hi all, I am trying to set up a new Mesos cluster... My main question now is: do I really need HDFS?
Re: Do i really need HDFS?
If HDFS is not required, what is the minimum config needed to fetch s3:// URIs?

Sent from my iPhone

On Oct 20, 2014, at 7:53 AM, David Greenberg <dsg123456...@gmail.com> wrote:

A DFS is merely a convenience for getting data to all the nodes of your Mesos cluster. If you want to have all nodes retrieve data from HTTP servers (like S3), this is feasible as well. If you want your nodes to store data, you can have them write to any database (although I'd recommend something scalable, like Riak, rather than a DB like an unsharded MySQL).

On Mon, Oct 20, 2014 at 11:33 AM, CCAAT <cc...@tampabay.rr.com> wrote:

If one is building a Mesos cluster, then is a DFS mandatory, or are there combinations of other codes that suffice for the needs of a Mesos cluster? So what is the list of distributed file systems that are generally available and that Mesos is known to work on top of? GlusterFS, Lustre, FhGFS (BeeGFS), Ceph [1]

[1] http://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems

An explicit list of other DFSs, or combinations of codes that provide these required feature sets, is of interest to many.

James
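David's "retrieve data from HTTP servers" option is nothing exotic. As a toy illustration (our own function, not Mesos or Spark code), fetching one artifact into a task's working directory is just:

```python
import os
import urllib.request

def fetch_artifact(url: str, directory: str) -> str:
    """Toy sketch: download one artifact into a working directory.

    Placeholder logic, not Mesos fetcher code; works for any URL scheme
    urllib understands (http://, https://, file://, ...).
    """
    os.makedirs(directory, exist_ok=True)
    dest = os.path.join(directory, os.path.basename(url))
    urllib.request.urlretrieve(url, dest)
    return dest
```

Anything reachable over plain HTTP (an S3 bucket, an internal artifact server) can play the role HDFS usually does for distributing executors and data files.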