Re: [gentoo-user] Clusters on Gentoo ?
On Tue 19 Aug 2014 05:34:40 AM EDT, J. Roeleveld wrote:
> On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
>> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
>>> Hadoop is a very specialized tool. It does what it does very well,
>>> but if you want to use it for something other than map/reduce then
>>> consider carefully whether it is the right tool for the job.
>>
>> Agreed; unless you have decent hardware and can comfortably measure
>> your data in TB, it'll be quicker to use something else once you factor
>> in the administration time and learning curve.
>
> The benefit of clustering technologies is that you don't need high-end
> hardware to start with. You can use the old hardware you found collecting
> dust in the basement.

Yes, but... if you are doing anything that *needs* to be fast (i.e. if you're not a hobbyist), you don't need some super fancy database machine, but you still need decent hardware (gotta have enough RAM for that JVM ;) ). If you'd like to take a look at our hardware, check out http://caen.github.io/hadoop/hardware.html.

> The learning curve isn't as steep as it used to be. There are plenty of tools
> to make it easier to start using Hadoop.

There are plenty of great tools (Pig, Sqoop, Hive, RHadoop, etc.) that you can use so you're not writing Java. That is all client-side, though; it doesn't make the administration easier. I agree that it's easy to start using it (it's possible to configure a small cluster from scratch in half an hour), but it takes a lot more time to tune your installation so it actually performs well. It's like any other piece of server software: serving a website with httpd is easy, but serving it well and securing it takes a lot more time.

Rich Freeman wrote:
> As long as you're counting words and don't mind coding everything in Java. :)

We discourage researchers from writing in Java and instead point them at any of the tools I list above, unless they really like Java.

> I found that if you want to avoid using Java, then the
> available documentation plummets

Yeah, this is still a pretty big problem. Documentation is pretty sparse.

Alec
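For what it's worth, the "small cluster from scratch in half an hour" mostly comes down to a couple of short config files plus formatting the namenode. A minimal sketch for Hadoop 2.x as I remember it (the hostname and port are placeholders, not anything from this thread):

```xml
<!-- core-site.xml: tell HDFS clients where the namenode lives -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

Getting it running is the easy part; the tuning Alec mentions (block sizes, container memory, JVM heaps) is where the real time goes.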
Re: [gentoo-user] Clusters on Gentoo ?
On Tuesday, August 19, 2014 06:33:29 AM Rich Freeman wrote:
> On Tue, Aug 19, 2014 at 5:34 AM, J. Roeleveld wrote:
>> On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
>>> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
>>>> Hadoop is a very specialized tool. It does what it does very well,
>>>> but if you want to use it for something other than map/reduce then
>>>> consider carefully whether it is the right tool for the job.
>>>
>>> Agreed; unless you have decent hardware and can comfortably measure
>>> your data in TB, it'll be quicker to use something else once you factor
>>> in the administration time and learning curve.
>>
>> The benefit of clustering technologies is that you don't need high-end
>> hardware to start with. You can use the old hardware you found collecting
>> dust in the basement.
>>
>> The learning curve isn't as steep as it used to be. There are plenty of
>> tools to make it easier to start using Hadoop.
>
> As long as you're counting words and don't mind coding everything in Java.
> :)
>
> I found that if you want to avoid using Java, then the available
> documentation plummets, and I'm pretty sure the version I was
> attempting to use was buggy - it was losing records in the sort/reduce
> phase I believe. Or perhaps I was just using it incorrectly, but the
> same exact code worked just fine when I ran it on a single host with a
> smaller dataset and just piped map | sort | reduce without using
> Hadoop. The documentation was pretty sparse on how to get Hadoop to
> work via stdin/out with non-Java code and it is quite possible I
> wasn't quite doing things right. In the end my problem wasn't big
> enough to necessitate using Hadoop and I used GNU parallel instead.

No need for Java knowledge to develop against Hadoop.

A commercial product:
http://www.informatica.com/Images/01603_powerexchange-for-hadoop_ds_en-US.pdf
Nice and easy graphical interface.

The same "code" that works against a relational database also works with Hadoop; the tool does the translation. I would be surprised if there were no other tools that make it easier to develop code against Hadoop. I just haven't had a reason to search for them yet.

--
Joost
Re: [gentoo-user] Clusters on Gentoo ?
On Tue, Aug 19, 2014 at 5:34 AM, J. Roeleveld wrote:
> On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
>> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
>>> Hadoop is a very specialized tool. It does what it does very well,
>>> but if you want to use it for something other than map/reduce then
>>> consider carefully whether it is the right tool for the job.
>>
>> Agreed; unless you have decent hardware and can comfortably measure
>> your data in TB, it'll be quicker to use something else once you factor
>> in the administration time and learning curve.
>
> The benefit of clustering technologies is that you don't need high-end
> hardware to start with. You can use the old hardware you found collecting
> dust in the basement.
>
> The learning curve isn't as steep as it used to be. There are plenty of tools
> to make it easier to start using Hadoop.

As long as you're counting words and don't mind coding everything in Java. :)

I found that if you want to avoid using Java, then the available documentation plummets, and I'm pretty sure the version I was attempting to use was buggy - it was losing records in the sort/reduce phase, I believe. Or perhaps I was just using it incorrectly, but the exact same code worked just fine when I ran it on a single host with a smaller dataset and just piped map | sort | reduce without using Hadoop. The documentation was pretty sparse on how to get Hadoop to work via stdin/stdout with non-Java code, and it is quite possible I wasn't doing things right. In the end my problem wasn't big enough to necessitate using Hadoop, and I used GNU parallel instead.

--
Rich
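The map | sort | reduce pattern described above is easy to sketch locally. This is a stand-in for the single-host run, not Hadoop itself: the "mapper" emits one `word<TAB>1` record per word, `sort` groups the keys, and the "reducer" sums counts per key (the sample input is made up):

```shell
#!/bin/sh
# Local word count in the streaming map/sort/reduce style:
#   mapper:  one "word<TAB>1" line per word
#   sort:    groups identical keys together
#   reducer: sums the counts for each key
printf 'the quick fox the\n' \
  | tr -s ' ' '\n' \
  | awk '{ print $1 "\t" 1 }' \
  | sort \
  | awk -F'\t' '{ count[$1] += $2 } END { for (w in count) print w, count[w] }' \
  | sort
```

On a real cluster the same mapper/reducer scripts would be handed to Hadoop Streaming (jar path and script names here are illustrative, not from this thread): `hadoop jar hadoop-streaming.jar -mapper map.sh -reducer reduce.sh -input in -output out`. Locally, GNU parallel can fan the mapper out across cores, e.g. `parallel --pipe ./map.sh < in.txt | sort | ./reduce.sh`.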
Re: [gentoo-user] Clusters on Gentoo ?
On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
>> Hadoop is a very specialized tool. It does what it does very well,
>> but if you want to use it for something other than map/reduce then
>> consider carefully whether it is the right tool for the job.
>
> Agreed; unless you have decent hardware and can comfortably measure
> your data in TB, it'll be quicker to use something else once you factor
> in the administration time and learning curve.

The benefit of clustering technologies is that you don't need high-end hardware to start with. You can use the old hardware you found collecting dust in the basement.

The learning curve isn't as steep as it used to be. There are plenty of tools to make it easier to start using Hadoop.

--
Joost
Re: [gentoo-user] Clusters on Gentoo ?
On Monday, August 18, 2014 08:09:00 PM thegeezer wrote:
> On 18/08/14 15:31, J. Roeleveld wrote:
>
> valid points, and interesting to see the corrections of my
> understanding, always welcome :)

You're welcome :)

>> Looks nice, but is not going to help with performance if the application
>> is not designed for distributed processing.
>>
>> --
>> Joost
>
> this is the key point i would raise about clusters really -- it would be
> nice to not need for example distcc configured and just have portage run
> across all connected nodes without any further work, or to use a tablet
> computer which is "borrowing" cycles from a GFX card across the network
> without having to configure nvidia grid: specifically these two use
> cases have wildly different characteristics and are a great example of
> why clustering has to be designed first to fit the application and
> vice versa.

I had a better look at that site you linked to. It won't be as "hidden" as you'd like. The software you run on it needs to be designed to actually use the infrastructure. That means that for your ideal to work, the "industry" needs to settle on a single clustering technology. I wish you good luck with that venture. :)

> /me continues to wonder if 10GigE is fast enough to page fault across
> the network ... ;)

Depends on how fast you want the environment to be. For old i386-era performance, probably. For performance equivalent to a modern system, no. Check the bus speeds between the CPU and memory being employed these days: that is the minimum speed the network link needs in order to keep up, and that assumes a perfect link with no errors on the wire.

--
Joost
Re: [gentoo-user] Clusters on Gentoo ?
On 18/08/14 15:31, J. Roeleveld wrote:

valid points, and interesting to see the corrections of my understanding, always welcome :)

> Looks nice, but is not going to help with performance if the application is
> not designed for distributed processing.
>
> --
> Joost

this is the key point i would raise about clusters really -- it would be nice to not need for example distcc configured and just have portage run across all connected nodes without any further work, or to use a tablet computer which is "borrowing" cycles from a GFX card across the network without having to configure nvidia grid: specifically these two use cases have wildly different characteristics and are a great example of why clustering has to be designed first to fit the application and vice versa.

/me continues to wonder if 10GigE is fast enough to page fault across the network ... ;)
Re: [gentoo-user] Clusters on Gentoo ?
On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
> Hadoop is a very specialized tool. It does what it does very well,
> but if you want to use it for something other than map/reduce then
> consider carefully whether it is the right tool for the job.

Agreed; unless you have decent hardware and can comfortably measure your data in TB, it'll be quicker to use something else once you factor in the administration time and learning curve.

Alec
Re: [gentoo-user] Clusters on Gentoo ?
On Mon, Aug 18, 2014 at 10:31 AM, J. Roeleveld wrote:
>
> I wouldn't use Hadoop for storage of files. It's only useful if you have a lot
> (and I do mean a LOT) of data where a query only returns a very small amount.

Not to mention a lot of data in a small number of files. I think the minimum allocation size for Hadoop is measured in megabytes. I tried using it to process gentoo-x86 and the sheer number of files just clobbered the thing. Since in my job the files were really just static data and not the actual subject of the map/reduce, I instead replicated the data to all the nodes and had them retrieve it from the local filesystem.

Hadoop is a very specialized tool. It does what it does very well, but if you want to use it for something other than map/reduce then consider carefully whether it is the right tool for the job.

--
Rich
Re: [gentoo-user] Clusters on Gentoo ?
On Sunday, August 17, 2014 08:46:58 PM thegeezer wrote:
> there are many ways to do clustering and one thing that i would consider
> a "holy grail" would be something like pvm [1]
> because nothing else seems to have similar horizontal scaling of cpu at
> the kernel level

PVM, from the webpage, looks more like a pre-built VM, not a kernel module that distributes existing code to different nodes. This kind of clustering also has no benefit for most uses; you really need to design your tasks for this kind of environment.

> i would love to know the mechanism behind dell's equallogic san as it
> really is clustered lvm on steroids.
> GFS / orangefs / ocfs are not the easiest things to setup (ocfs is) and
> i've not found performance to be so great for writes.

I have seen weird issues when using Oracle's filesystems for anything not Oracle. How important is reliability?

> DRBD is only 2 devices as far as i understand, so not really super scalable
> i'm still not convinced over the likes of hadoop for storage, maybe i
> just don't have the scale to "get" it?

I wouldn't use Hadoop for storage of files. It's only useful if you have a lot (and I do mean a LOT) of data where a query only returns a very small amount. Performance of a Hadoop cluster is high because the same query is sent to all nodes at once and the answers are merged into a single answer on the way back to the requestor. I don't see it as a valid system for storing important data you cannot risk losing.

> the thing with clusters is that you want to be able to spin an extra
> node up and join it to the group and then you increase cpu / storage by
> n+1 but also you want to be able to spin nodes down dynamically and go
> down by n-1. i guess this is where hadoop is of benefit because that is
> not a happy thing for a typical file system.

Not necessary; that is only one way to use a cluster. It's also an "easy" and "cheap" method of increasing the available processing power. This only works properly if the tasks can be distributed over multiple nodes easily. Having the option to quickly add and remove nodes makes it difficult to keep the data consistent. Hadoop especially prefers the nodes to stay available, as there is no single node containing all the data. There is some redundancy, but remove a few nodes and you can easily lose data.

> network load balancing is super easy, all info required is in each
> packet -- application load balancing requires more thought.
> this is where the likes of memcached can help but also why a good design
> of the cluster is better. localised data and tiered access etc... kind
> of why i would like to see a pvm kind of solution -- so that a page
> fault is triggered like swap memory which then fetches the relevant
> memory from the network:

That is going to kill performance... Have a look into NUMA. It's always best to have the data where it is being processed, either by moving the data to the processing unit or by using a processing unit local to the data. Moving data is always expensive in terms of performance.

This is how Hadoop clusters work: the data is processed on the node that actually holds it. The result (often less than 1% of the source data) is then sent over the network to another node, which merges the results at that stage and passes them on. This continues until all the results are merged into a single result set, which is returned to the requesting application.

> bearing in mind that a computer can typically
> trigger thousands of page faults a second and that memory access is very
> very many times faster than gigabit networking!
>
> [1] http://www.csm.ornl.gov/pvm/pvm_home.html

Looks nice, but is not going to help with performance if the application is not designed for distributed processing.

--
Joost
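The merge-on-the-way-back flow described above (each node returning a small partial aggregate, merged into one result set) can be sketched locally. The two blocks of counts below are made-up stand-ins for per-node partial results:

```shell
#!/bin/sh
# Sketch of the Hadoop-style merge step: each "node" has already reduced its
# local data to a small partial aggregate; the requestor only merges those.
{
  printf 'alpha 3\nbeta 1\n'   # partial result from "node 1" (made-up data)
  printf 'alpha 2\ngamma 4\n'  # partial result from "node 2" (made-up data)
} \
  | awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' \
  | sort
```

The point of the design is that only these small partials cross the network, while the bulk of the data never leaves the node that stores it.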
Re: [gentoo-user] Clusters on Gentoo ?
there are many ways to do clustering and one thing that i would consider a "holy grail" would be something like pvm [1] because nothing else seems to have similar horizontal scaling of cpu at the kernel level

i would love to know the mechanism behind dell's equallogic san as it really is clustered lvm on steroids.

GFS / orangefs / ocfs are not the easiest things to setup (ocfs is) and i've not found performance to be so great for writes. DRBD is only 2 devices as far as i understand, so not really super scalable.

i'm still not convinced over the likes of hadoop for storage, maybe i just don't have the scale to "get" it?

the thing with clusters is that you want to be able to spin an extra node up and join it to the group and then you increase cpu / storage by n+1, but also you want to be able to spin nodes down dynamically and go down by n-1. i guess this is where hadoop is of benefit because that is not a happy thing for a typical file system.

network load balancing is super easy, all info required is in each packet -- application load balancing requires more thought. this is where the likes of memcached can help but also why a good design of the cluster is better. localised data and tiered access etc... kind of why i would like to see a pvm kind of solution -- so that a page fault is triggered like swap memory which then fetches the relevant memory from the network: bearing in mind that a computer can typically trigger thousands of page faults a second and that memory access is very very many times faster than gigabit networking!

[1] http://www.csm.ornl.gov/pvm/pvm_home.html
Re: [gentoo-user] Clusters on Gentoo ?
I'm a Hadoop and related software sysadmin at the University of Michigan. I'm still a student, so it's only a part-time position. I have some documentation at http://caen.github.io/hadoop - if something is not clear, I will gladly take feedback and make appropriate changes.

> In a recent thread (schedulers) it was noted that several folks had interest
> in clusters (privately operated clouds) as more than a passing interest.

I'll try to chime in on any questions about scheduling/clusters in the future, as we have a pretty large installation (~20,000 cores) running a traditional HPC stack, plus a small Hadoop cluster.

> [2] http://hadoop.apache.org/docs/r1.2.1/cluster_setup.html

Hadoop is currently "stable" on 2.x (specifically 2.4), so the relevant documentation is at http://hadoop.apache.org/docs/stable.

Alec
Re: [gentoo-user] Clusters on Gentoo ?
On Wednesday, August 06, 2014 04:50:22 PM James wrote:
> Hopefully, we can all share ideas and brainstorm about how Gentoo users
> can lead the pack of linux distros into this brave_new world. [Overlays?]

A good place to start would also be:
http://www.yolinux.com/TUTORIALS/LinuxClustersAndFileSystems.html

--
Joost