Re: [gentoo-user] Clusters on Gentoo ?

2014-08-19 Thread Alec Ten Harmsel
On Tue 19 Aug 2014 05:34:40 AM EDT, J. Roeleveld wrote:
> On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
>> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
>>> Hadoop is a very specialized tool.  It does what it does very well,
>>> but if you want to use it for something other than map/reduce then
>>> consider carefully whether it is the right tool for the job.
>>
>> Agreed; unless you have decent hardware and can comfortably measure
>> your data in TB, it'll be quicker to use something else once you factor
>> in the administration time and learning curve.
>
> The benefit of clustering technologies is that you don't need high-end
> hardware to start with. You can use the old hardware you found collecting dust
> in the basement.

Yes, but... if you are doing anything that *needs* to be fast (i.e. if 
you're not a hobbyist), you don't need some super fancy database 
machine, but you still need some decent hardware (gotta have enough RAM 
for that JVM ;) ). If you'd like to take a look at our hardware, you 
can check out http://caen.github.io/hadoop/hardware.html.

> The learning curve isn't as steep as it used to be. There are plenty of tools
> to make it easier to start using Hadoop.

There are plenty of great tools (Pig, Sqoop, Hive, RHadoop, etc.) that 
you can use so you're not writing Java. This is all client-side; it 
doesn't make the administration easier.

I agree that it's easy to start using it (it's possible to configure a 
small cluster from scratch in half an hour), but it takes a lot more time 
to tune your installation so it actually performs well. It's just like any 
other piece of server software: serving a website with httpd is easy, but 
serving it well and securing it takes a lot more time.
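
For a sense of scale, the bare minimum for a single-node setup is only a 
couple of properties along these lines (a rough sketch based on the 2.x 
single-node guide; the values are illustrative, and a real cluster needs 
far more than this):

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Getting from that to something that holds up under real load is where the 
time goes.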

Rich Freeman wrote:
> As long as you're counting words and don't mind coding everything in Java. :)

We discourage researchers from writing in Java and instead point them to 
any of the tools I listed above, unless they really like Java.

> I found that if you want to avoid using Java, then the
> available documentation plummets

Yeah, this is still a pretty big problem. Documentation is pretty 
sparse.

Alec



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-19 Thread J. Roeleveld
On Tuesday, August 19, 2014 06:33:29 AM Rich Freeman wrote:
> On Tue, Aug 19, 2014 at 5:34 AM, J. Roeleveld  wrote:
> > On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
> >> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
> >> > Hadoop is a very specialized tool.  It does what it does very well,
> >> > but if you want to use it for something other than map/reduce then
> >> > consider carefully whether it is the right tool for the job.
> >> 
> >> Agreed; unless you have decent hardware and can comfortably measure
> >> your data in TB, it'll be quicker to use something else once you factor
> >> in the administration time and learning curve.
> > 
> > The benefit of clustering technologies is that you don't need high-end
> > hardware to start with. You can use the old hardware you found collecting
> > dust in the basement.
> > 
> > The learning curve isn't as steep as it used to be. There are plenty of
> > tools to make it easier to start using Hadoop.
> 
> As long as you're counting words and don't mind coding everything in Java. 
> :)
> 
> I found that if you want to avoid using Java, then the available
> documentation plummets, and I'm pretty sure the version I was
> attempting to use was buggy - it was losing records in the sort/reduce
> phase I believe.  Or perhaps I was just using it incorrectly, but the
> same exact code worked just fine when I ran it on a single host with a
> smaller dataset and just piped map | sort | reduce without using
> Hadoop.  The documentation was pretty sparse on how to get Hadoop to
> work via stdin/out with non-Java code and it is quite possible I
> wasn't quite doing things right.  In the end my problem wasn't big
> enough to necessitate using Hadoop and I used GNU parallel instead.

There's no need for Java knowledge to develop against Hadoop.
One commercial product:
http://www.informatica.com/Images/01603_powerexchange-for-hadoop_ds_en-US.pdf
It has a nice and easy graphical interface, and the same "code" that works 
against a relational database also works with Hadoop. The tool does the 
translation.

I would be surprised if there are no other tools that make it easier to 
develop code to work with Hadoop. I just haven't had a reason to search for 
them yet.

--
Joost



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-19 Thread Rich Freeman
On Tue, Aug 19, 2014 at 5:34 AM, J. Roeleveld  wrote:
> On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
>> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
>> > Hadoop is a very specialized tool.  It does what it does very well,
>> > but if you want to use it for something other than map/reduce then
>> > consider carefully whether it is the right tool for the job.
>>
>> Agreed; unless you have decent hardware and can comfortably measure
>> your data in TB, it'll be quicker to use something else once you factor
>> in the administration time and learning curve.
>
> The benefit of clustering technologies is that you don't need high-end
> hardware to start with. You can use the old hardware you found collecting dust
> in the basement.
>
> The learning curve isn't as steep as it used to be. There are plenty of tools
> to make it easier to start using Hadoop.
>

As long as you're counting words and don't mind coding everything in Java.  :)

I found that if you want to avoid using Java, then the available
documentation plummets, and I'm pretty sure the version I was
attempting to use was buggy - it was losing records in the sort/reduce
phase I believe.  Or perhaps I was just using it incorrectly, but the
same exact code worked just fine when I ran it on a single host with a
smaller dataset and just piped map | sort | reduce without using
Hadoop.  The documentation was pretty sparse on how to get Hadoop to
work via stdin/out with non-Java code and it is quite possible I
wasn't quite doing things right.  In the end my problem wasn't big
enough to necessitate using Hadoop and I used GNU parallel instead.
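
For the record, the stdin/stdout approach boils down to two small scripts 
along these lines (a word-count sketch, since that is the canonical example; 
the jar name and paths in the comments are illustrative, not the exact ones 
I used):

#!/usr/bin/env python
# mapper.py - read raw text on stdin, emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# ---------------------------------------------------------------
#!/usr/bin/env python
# reducer.py - read the sorted mapper output, sum the counts per word
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current:
        total += int(count)
    else:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, int(count)
if current is not None:
    print("%s\t%d" % (current, total))

# Run locally:
#   cat input.txt | ./mapper.py | sort | ./reducer.py
# Or submit via the streaming jar, roughly:
#   hadoop jar hadoop-streaming-*.jar \
#     -input /data/in -output /data/out \
#     -mapper mapper.py -reducer reducer.py \
#     -file mapper.py -file reducer.py

If the streaming version drops records, the local pipeline at least gives 
you a reference answer to diff against.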

--
Rich



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-19 Thread J. Roeleveld
On Monday, August 18, 2014 10:53:51 AM Alec Ten Harmsel wrote:
> On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:
> > Hadoop is a very specialized tool.  It does what it does very well,
> > but if you want to use it for something other than map/reduce then
> > consider carefully whether it is the right tool for the job.
> 
> Agreed; unless you have decent hardware and can comfortably measure
> your data in TB, it'll be quicker to use something else once you factor
> in the administration time and learning curve.

The benefit of clustering technologies is that you don't need high-end 
hardware to start with. You can use the old hardware you found collecting dust 
in the basement.

The learning curve isn't as steep as it used to be. There are plenty of tools 
to make it easier to start using Hadoop.

--
Joost



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-19 Thread J. Roeleveld
On Monday, August 18, 2014 08:09:00 PM thegeezer wrote:
> On 18/08/14 15:31, J. Roeleveld wrote:
> > 
> 
> valid points, and interesting to see the corrections of my
> understanding, always welcome :)

You're welcome :)

> > Looks nice, but is not going to help with performance if the application
> > is
> > not designed for distributed processing.
> > 
> > --
> > Joost
> 
> this is the key point i would raise about clusters really -- it would be
> nice to not need for example distcc configured and just have portage run
> across all connected nodes without any further work, or to use a tablet
> computer which is "borrowing" cycles from a GFX card across the network
> without having to configure nvidia grid: specifically these two use
> cases have wildly different characteristics and are a great example of
> why clustering has to be designed first to fit the application and
vice versa.

I had a closer look at that site you linked to. It won't be as "hidden" as 
you'd like. The software you run on it needs to be designed to actually use 
the infrastructure.
This means that for your ideal to work, the "industry" needs to settle on a 
single clustering technology for this. I wish you good luck on that venture. 
:)

> /me continues to wonder if 10GigE is fast enough to page fault across
> the network ... ;)

Depends on how fast you want the environment to be.
At old i386 speeds, probably.
If you expect performance equivalent to a modern system, no.

Check the bus speeds between the CPU and memory in the hardware being used 
these days. That is the minimum speed the network link needs in order to be 
fast enough to actually work, and that assumes a perfect link with no errors 
in the wiring.
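
Rough numbers: dual-channel DDR3-1600 has a theoretical peak of about 
25 GB/s, while 10GigE tops out at 1.25 GB/s before any protocol overhead, 
so you are a factor of 20 or so short on bandwidth alone. Latency is worse: 
a local DRAM access is on the order of 100 ns, while even a good 10GigE 
round trip costs at least a few microseconds, so every remote page fault 
would be orders of magnitude slower than a local memory access.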

--
Joost



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-18 Thread thegeezer
On 18/08/14 15:31, J. Roeleveld wrote:
> 
valid points, and interesting to see the corrections of my
understanding, always welcome :)
> Looks nice, but is not going to help with performance if the application is 
> not designed for distributed processing.
>
> --
> Joost
>
this is the key point i would raise about clusters really -- it would be
nice to not need for example distcc configured and just have portage run
across all connected nodes without any further work, or to use a tablet
computer which is "borrowing" cycles from a GFX card across the network
without having to configure nvidia grid: specifically these two use
cases have wildly different characteristics and are a great example of
why clustering has to be designed first to fit the application and
vice versa.

/me continues to wonder if 10GigE is fast enough to page fault across
the network ... ;)



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-18 Thread Alec Ten Harmsel
On Mon 18 Aug 2014 10:50:23 AM EDT, Rich Freeman wrote:

> Hadoop is a very specialized tool.  It does what it does very well,
> but if you want to use it for something other than map/reduce then
> consider carefully whether it is the right tool for the job.

Agreed; unless you have decent hardware and can comfortably measure 
your data in TB, it'll be quicker to use something else once you factor 
in the administration time and learning curve.

Alec



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-18 Thread Rich Freeman
On Mon, Aug 18, 2014 at 10:31 AM, J. Roeleveld  wrote:
>
> I wouldn't use Hadoop for storage of files. It's only useful if you have a lot
> (and I do mean a LOT) of data where a query only returns a very small amount.

Not to mention a lot of data in a small number of files.  I think the
minimum allocation size for Hadoop is measured in megabytes.  I tried
using it to process gentoo-x86 and the number of files just clobbered
the thing.  Since in my job the files were really just static data and
not the actual subject of the map/reduce, I instead just replicated the
data to all the nodes and had them retrieve the data from the local
filesystem.

Hadoop is a very specialized tool.  It does what it does very well,
but if you want to use it for something other than map/reduce then
consider carefully whether it is the right tool for the job.

--
Rich



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-18 Thread J. Roeleveld
On Sunday, August 17, 2014 08:46:58 PM thegeezer wrote:
> there are many ways to do clustering and one thing that i would consider
> a "holy grail" would be something like pvm [1]
> because nothing else seems to have similar horizontal scaling of cpu at
> the kernel level

PVM, from the webpage, looks more like a pre-built VM, not some kernel 
module that distributes existing code to different nodes.
This kind of clustering also has no benefit for most uses. You really need to 
design your tasks for these kinds of environments.

> i would love to know the mechanism behind dell's equallogic san as it
> really is clustered lvm on steroids.
> GFS / orangefs / ocfs are not the easiest things to setup (ocfs is) and
> i've not found performance to be so great for writes.

I have seen weird issues when using Oracle's filesystems for anything not 
Oracle. How important is reliability?

> DRBD is only 2 devices as far as i understand, so not really super scalable.
> i'm still not convinced over the likes of hadoop for storage, maybe i
> just don't have the scale to "get" it?

I wouldn't use Hadoop for storage of files. It's only useful if you have a lot 
(and I do mean a LOT) of data where a query only returns a very small amount.
Performance of a Hadoop cluster is high because the same query is sent to all 
nodes at once and the answers get merged into a single answer along the way 
back to the requestor. I don't see it as a valid system to actually store 
important data you do not want to risk losing.

> the thing with clusters is that you want to be able to spin an extra
> node up and join it to the group and then you increase cpu / storage by
> n+1   but also you want to be able to spin nodes down dynamically and go
> down by n-1.  i guess this is where hadoop is of benefit because that is
> not a happy thing for a typical file system.

Not necessarily. That is only one way to use a cluster.
It's also an "easy" and "cheap" method of increasing the available processing 
power. This only works properly if the tasks can be distributed over multiple 
nodes easily. Having the option to quickly add and remove nodes makes it 
difficult to keep the data consistent. Hadoop especially prefers the nodes to 
stay available, as there is no single node containing all the data. There is 
some redundancy, but remove a few nodes and you can easily lose data.

> network load balancing is super easy, all info required is in each
> packet -- application load balancing requires more thought.
> this is where the likes of memcached can help but also why a good design
> of the cluster is better. localised data and tiered access etc...  kind
> of why i would like to see a pvm kind of solution -- so that a page
> fault is triggered like swap memory which then fetches the relevant
> memory from the network:

That is going to kill performance...
Have a look into NUMA. It's always best to have the data where it is being 
processed, either by moving the data to the processing unit or by using a 
processing unit local to the data.
Moving data is always expensive with regard to performance.

This is how Hadoop clusters work: the data is processed on the node that 
actually holds it. The result (which is often less than 1% of the source 
data) is then sent over the network to another node, which, at that stage, 
merges the results and passes them on to the next node. This continues until 
all the results are merged into a single result-set, which is returned to the 
requesting application.
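
As a toy illustration of that merge step (plain Python, nothing 
Hadoop-specific; the data and the number of nodes are made up):

from collections import Counter

# Each "node" only sees its own local slice of the data.
node_slices = [
    ["error", "ok", "ok"],      # data local to node 1
    ["ok", "error", "error"],   # data local to node 2
    ["ok", "ok", "ok"],         # data local to node 3
]

# Map phase: every node counts its own slice locally.
partial_counts = [Counter(local) for local in node_slices]

# Merge phase: only the small partial results travel "over the network"
# and get combined into the final answer.
merged = sum(partial_counts, Counter())
print(merged)   # Counter({'ok': 6, 'error': 3})

The raw slices never leave their node; only the partial counts do.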

> bearing in mind that a computer can typically
> trigger thousands of page faults a second and that memory access is very
> very many times faster than gigabit networking!
> 
> [1] http://www.csm.ornl.gov/pvm/pvm_home.html

Looks nice, but is not going to help with performance if the application is 
not designed for distributed processing.

--
Joost



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-17 Thread thegeezer
there are many ways to do clustering and one thing that i would consider
a "holy grail" would be something like pvm [1]
because nothing else seems to have similar horizontal scaling of cpu at
the kernel level

i would love to know the mechanism behind dell's equallogic san as it
really is clustered lvm on steroids.
GFS / orangefs / ocfs are not the easiest things to setup (ocfs is) and
i've not found performance to be so great for writes.
DRBD is only 2 devices as far as i understand, so not really super scalable.
i'm still not convinced over the likes of hadoop for storage, maybe i
just don't have the scale to "get" it?

the thing with clusters is that you want to be able to spin an extra
node up and join it to the group and then you increase cpu / storage by
n+1   but also you want to be able to spin nodes down dynamically and go
down by n-1.  i guess this is where hadoop is of benefit because that is
not a happy thing for a typical file system.

network load balancing is super easy, all info required is in each
packet -- application load balancing requires more thought.
this is where the likes of memcached can help but also why a good design
of the cluster is better. localised data and tiered access etc...  kind
of why i would like to see a pvm kind of solution -- so that a page
fault is triggered like swap memory which then fetches the relevant
memory from the network: bearing in mind that a computer can typically
trigger thousands of page faults a second and that memory access is very
very many times faster than gigabit networking!

[1] http://www.csm.ornl.gov/pvm/pvm_home.html





Re: [gentoo-user] Clusters on Gentoo ?

2014-08-07 Thread Alec Ten Harmsel
I'm a sysadmin for Hadoop and related software at the University of 
Michigan. I'm still a student, so it's only a part-time position. I 
have some documentation at http://caen.github.io/hadoop - if something 
is not clear, I will gladly take feedback and make appropriate changes.

> In a recent thread (schedulers) it was noted that several folks had interest
> in clusters (privately operated clouds) as more than a passing interest.

I'll try to chime in on any questions about scheduling/clusters in the 
future, as we have a pretty large installation (~20,000 cores) running a 
traditional HPC stack, and a small Hadoop cluster.

> [2] http://hadoop.apache.org/docs/r1.2.1/cluster_setup.html

Hadoop is currently "stable" on 2.x (specifically 2.4), so relevant 
documentation is at http://hadoop.apache.org/docs/stable.

Alec



Re: [gentoo-user] Clusters on Gentoo ?

2014-08-07 Thread J. Roeleveld
On Wednesday, August 06, 2014 04:50:22 PM James wrote:


> Hopefully, we can all share ideas and brainstorm about how Gentoo users
> can lead the pack of linux distros into this brave_new world. [Overlays?]

Another good place to start would be:
http://www.yolinux.com/TUTORIALS/LinuxClustersAndFileSystems.html

--
Joost