Re: Infiniband

2006-01-15 Thread lichtner

I think I have found some information which, if I had hardware available,
would lead me to skip the prototyping stage entirely:

This paper benchmarks the performance of infiniband through 1) UDAPL and
2) Sockets Direct Protocol (SDP) - also available from openib.org:

"Sockets Direct Protocol over Infiniband Clusters: Is it Beneficial?"
Balaji et al.

I think the value of SDP over UDAPL is that it looks like a socket
interface, which means porting applications over could even be trivial.
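
To make that concrete: because SDP preserves the socket interface, plain
JDK socket code like the sketch below should in principle run over
infiniband unchanged, assuming an SDP library (e.g. Mellanox's libsdp,
interposed under the process with LD_PRELOAD) maps the TCP socket calls
onto SDP. The host and port here are placeholders.

    import java.io.OutputStream;
    import java.net.Socket;

    public class EchoClient {
        public static void main(String[] args) throws Exception {
            // Nothing Infiniband-specific here; with an SDP library
            // interposed under the JVM, this same code would go over IB.
            try (Socket s = new Socket("peer-host", 9999)) {
                OutputStream out = s.getOutputStream();
                out.write("hello, possibly over SDP".getBytes("US-ASCII"));
                out.flush();
            }
        }
    }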

However, I think that SDP in that paper does not have zero copy, and that
is why the paper shows that UDAPL is faster.

Mellanox, though, has a zero-copy SDP:

"Transparently Achieving Superior Socket Performance Using Zero Copy
Socket Direct Protocol over 20Gb/s Infiniband Links"
Goldenberg et al.

It's just enlightening to see the two main sources of waste in network
applications removed one at a time, namely 1) context switches and 2)
copies.

Using zero-copy SDP from Java should be pretty easy, although interfacing
with UDAPL would also be valuable.
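
As a rough illustration of what interfacing with UDAPL from Java might
look like, here is a hypothetical JNI surface; every Java name below is
invented for the sketch, and the C side would wrap the actual uDAPL
verbs (dat_ia_open, dat_ep_create, dat_lmr_create,
dat_ep_post_rdma_write, ...):

    import java.nio.ByteBuffer;

    // Hypothetical JNI binding; nothing like this exists yet.
    public final class Udapl {
        static { System.loadLibrary("udapljni"); } // assumed native lib

        // Open an interface adapter (would wrap dat_ia_open).
        public static native long openInterface(String iaName);
        // Register a direct buffer with the HCA (would wrap dat_lmr_create).
        public static native long registerMemory(long ia, ByteBuffer direct);
        // One-sided write into the peer's memory
        // (would wrap dat_ep_post_rdma_write).
        public static native void rdmaWrite(long endpoint, long localRegion,
                                            long remoteAddress, int length);
        public static native void close(long ia);
    }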

I have found evidence that Sun was planning to include support for Sockets
Direct Protocol in jdk 1.6, but that they gave up because infiniband is
not mainstream hardware (yet).

I think IBM may have put some of this in WebSphere 6:

http://domino.research.ibm.com/comm/research.nsf/pages/r.distributed.innovation.html?Open&printable

That would be just typical of IBM, understating or flat-out
hiding important achievements. When Parallel Sysplex came out, IBM did a
test with 100% data sharing, meaning _all_ reads and writes were remote
(to the CF), and measured the scalability: it's basically linear, but
off the diagonal by (only) 13%. Instead of recognizing that mainframes
now scaled horizontally, the press focused on the "overhead". This
prompted Lou Gerstner to remark that if he invited the press to his
house and showcased his dog walking on water, the press would report
"Gerstner buys dog that can't swim."

I think if I had a few thousand dollars to spare I would definitely get
a couple of opteron boxes and get this off the ground.

I think from the numbers you can conclude that even someone who wants
to be perverse and keep using object serialization will still get much
better throughput (if not better latency), because half the cpu can be
spent executing the jvm rather than switching contexts and copying data
from memory to memory.

I hope somebody with a budget picks this up soon.

Guglielmo

On Sun, 15 Jan 2006, James Strachan wrote:

> On 14 Jan 2006, at 22:27, lichtner wrote:
> > On Fri, 13 Jan 2006, James Strachan wrote:
> >
> >>> The infiniband transport would be native code, so you could use JNI.
> >>> However, it would definitely be worth it.
> >>
> >> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> >> & google every once in a while to see if one shows up :)
> >>
> >> It looks like MPI has support for Infiniband; would it be worth
> >> trying to wrap that in JNI?
> >> http://www-unix.mcs.anl.gov/mpi/
> >> http://www-unix.mcs.anl.gov/mpi/mpich2/
> >
> > I did find that HP has a Java interface for MPI. However, to me it doesn't
> > necessarily seem that this is the way to go. I think for writing
> > distributed computations it would be the right choice, but I think that
> > the people who write those choose to work in a natively compiled language,
> > and I think that this may be the reason why this Java mpi doesn't appear
> > to be that well-known.
> >
> > However I did find something which might work for us, namely UDAPL
> > from the DAT Collaborative. A consortium created a spec for an interface to
> > anything that provides RDMA capabilities:
> >
> > http://www.datcollaborative.org/udapl.html
> >
> > The header files and the spec are right there.
> >
> > I downloaded the only release made by infiniband.sf.net and they claim
> > that it only works with kernel 2.4, and that for 2.6 you have to use
> > openib.org. The latter claims to provide an implementation of UDAPL:
> >
> > http://openib.org/doc.html
> >
> > The wiki has a lot of info.
> >
> > From the mailing list archive you can tell that this project has a lot of
> > momentum:
> >
> > http://openib.org/pipermail/openib-general/
>
> Awesome! Thanks for all the links
>
>
> > I think the next thing to do would be to prove that using RDMA as opposed
> > to udp is worthwhile. I think it is, because JITs are so fast now, but I
> > think that before planning anything long-term I would get two
> > infiniband-enabled boxes and write a little prototype.

Re: Infiniband

2006-01-15 Thread James Strachan

On 14 Jan 2006, at 22:27, lichtner wrote:

> On Fri, 13 Jan 2006, James Strachan wrote:
>
> > > The infiniband transport would be native code, so you could use JNI.
> > > However, it would definitely be worth it.
> >
> > Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> > & google every once in a while to see if one shows up :)
> >
> > It looks like MPI has support for Infiniband; would it be worth
> > trying to wrap that in JNI?
> > http://www-unix.mcs.anl.gov/mpi/
> > http://www-unix.mcs.anl.gov/mpi/mpich2/


> I did find that HP has a Java interface for MPI. However, to me it doesn't
> necessarily seem that this is the way to go. I think for writing
> distributed computations it would be the right choice, but I think that
> the people who write those choose to work in a natively compiled language,
> and I think that this may be the reason why this Java mpi doesn't appear
> to be that well-known.
>
> However I did find something which might work for us, namely UDAPL
> from the DAT Collaborative. A consortium created a spec for an interface to
> anything that provides RDMA capabilities:
>
> http://www.datcollaborative.org/udapl.html
>
> The header files and the spec are right there.
>
> I downloaded the only release made by infiniband.sf.net and they claim
> that it only works with kernel 2.4, and that for 2.6 you have to use
> openib.org. The latter claims to provide an implementation of UDAPL:
>
> http://openib.org/doc.html
>
> The wiki has a lot of info.
>
> From the mailing list archive you can tell that this project has a lot of
> momentum:
>
> http://openib.org/pipermail/openib-general/


Awesome! Thanks for all the links


> I think the next thing to do would be to prove that using RDMA as opposed
> to udp is worthwhile. I think it is, because JITs are so fast now, but I
> think that before planning anything long-term I would get two
> infiniband-enabled boxes and write a little prototype.


Agreed; the important issue is gonna be: can Java with JNI (or Unsafe, or
one of the alternatives to JNI:
http://weblog.janek.org/Archive/2005/07/28/AlternativestoJavaNativeI.html)
work with RDMA using native ByteBuffers, so that copying is avoided and
things perform better than just using some Infiniband-optimised TCP/IP
implementation? Though to be able to test this we need to make a
prototype Java API to RDMA :) But it is definitely well worth the
experiment IMHO
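
For what it's worth, a minimal sketch of the Java side of that experiment
might look like the following; the native method is hypothetical (a real
binding would use JNI's GetDirectBufferAddress to get a stable pointer
into the direct buffer, which is exactly what makes the zero-copy
hand-off possible):

    import java.nio.ByteBuffer;

    public class RdmaBufferSketch {
        // Hypothetical native hook: a JNI library would register this
        // memory with the HCA and post an RDMA write. Not a real API.
        private static native void rdmaWrite(ByteBuffer direct, int length);

        public static void main(String[] args) {
            // allocateDirect places the buffer outside the Java heap, so
            // native code can read it in place; no copy at the JNI boundary.
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
            buf.put("payload".getBytes());
            buf.flip();
            // rdmaWrite(buf, buf.remaining()); // needs the native lib loaded
        }
    }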


The main win is obviously avoiding the double copy of working with
TCP/IP, though there are other benefits, like improved flow control (you
know that you can send a message to a consumer and how much capacity it
has at any point in time, so there is no need to worry about slow/dead
consumer detection). Another is concurrency: in a point-cast model,
sending to multiple consumers in one thread is trivial (and
multi-threading definitely slows down messaging, as we found in
ActiveMQ).
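
The flow-control point can be sketched with a toy credit gate: the sender
only posts into the consumer's buffers when it holds a credit, so the
remaining capacity is always known, and a slow or dead consumer simply
stops returning credits. All names here are invented for illustration:

    import java.util.concurrent.Semaphore;

    public class CreditGate {
        private final Semaphore credits;

        public CreditGate(int initialCredits) {
            credits = new Semaphore(initialCredits);
        }

        // Called before posting a send; blocks while the peer has no space.
        public void acquireSendCredit() throws InterruptedException {
            credits.acquire();
        }

        // Called when the peer signals it has drained a receive buffer.
        public void returnCredit() {
            credits.release();
        }

        // The sender can always see the consumer's remaining capacity.
        public int remainingCapacity() {
            return credits.availablePermits();
        }
    }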


In ActiveMQ the use of RDMA would allow us to do straight-through
processing for messages, which could dramatically cut down on the number
of objects created per message and the GC overhead. (Indeed we've been
musing that it might be possible to avoid most per-message dispatch
object allocations if selectors are not used and we wrote a custom
RDMA-friendly version of ActiveMQMessage; we should also be able to
optimise the use of the Journal, as we can just pass around the
ByteBuffer rather than using OpenWire marshalling.)
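
A hypothetical sketch of that idea (these are not ActiveMQ's actual
classes; just an illustration of passing the received buffer through
dispatch and the Journal without remarshalling):

    import java.nio.ByteBuffer;

    // Invented for illustration: a message whose payload stays in the
    // direct ByteBuffer it arrived in, so dispatch and journaling can
    // hand the same memory along instead of OpenWire-marshalling it.
    public final class BufferBackedMessage {
        private final ByteBuffer payload;

        public BufferBackedMessage(ByteBuffer payload) {
            // duplicate() shares the underlying memory (no copy) but
            // gives this message its own position/limit to read with.
            this.payload = payload.duplicate();
        }

        // Hand the same memory to the journal or a consumer.
        public ByteBuffer readOnlyView() {
            return payload.asReadOnlyBuffer();
        }

        public int sizeInBytes() {
            return payload.remaining();
        }
    }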




> I think Appro sells
> infiniband blades with Mellanox HCAs.
>
> There is also IBM's proprietary API for clustering mainframes, the
> Coupling Facility:
>
> http://www.research.ibm.com/journal/sj36-2.html
>
> There are some amazing articles there.


Great stuff - thanks for the link!


> Personally I also think there is value in implementing replication using
> udp (process-group libraries such as evs4j), so I would pursue both at
> the same time.


Yeah; like many things in distributed systems & messaging, it all depends
on what you are doing as to which solution is best for your scenario.
Certainly both technologies are useful tools to have in your bag when
creating middleware. I personally see RDMA as a possible faster
alternative to TCP/IP inside message brokers such as ActiveMQ, as well as
for request-response messaging such as in openejb.


James
---
http://radio.weblogs.com/0112098/





Re: Infiniband

2006-01-14 Thread lichtner


On Fri, 13 Jan 2006, James Strachan wrote:

> > The infiniband transport would be native code, so you could use JNI.
> > However, it would definitely be worth it.
>
> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> & google every once in a while to see if one shows up :)
>
> It looks like MPI has support for Infiniband; would it be worth
> trying to wrap that in JNI?
> http://www-unix.mcs.anl.gov/mpi/
> http://www-unix.mcs.anl.gov/mpi/mpich2/

I did find that HP has a Java interface for MPI. However, to me it doesn't
necessarily seem that this is the way to go. I think for writing
distributed computations it would be the right choice, but I think that
the people who write those choose to work in a natively compiled language,
and I think that this may be the reason why this Java mpi doesn't appear
to be that well-known.

However I did find something which might work for us, namely UDAPL
from the DAT Collaborative. A consortium created a spec for an interface to
anything that provides RDMA capabilities:

http://www.datcollaborative.org/udapl.html

The header files and the spec are right there.

I downloaded the only release made by infiniband.sf.net and they claim
that it only works with kernel 2.4, and that for 2.6 you have to use
openib.org. The latter claims to provide an implementation of UDAPL:

http://openib.org/doc.html

The wiki has a lot of info.

From the mailing list archive you can tell that this project has a lot of
momentum:

http://openib.org/pipermail/openib-general/

I think the next thing to do would be to prove that using RDMA as opposed
to udp is worthwhile. I think it is, because JITs are so fast now, but I
think that before planning anything long-term I would get two
infiniband-enabled boxes and write a little prototype. I think Appro sells
infiniband blades with Mellanox HCAs.

There is also IBM's proprietary API for clustering mainframes, the
Coupling Facility:

http://www.research.ibm.com/journal/sj36-2.html

There are some amazing articles there.

Personally I also think there is value in implementing replication using
udp (process-group libraries such as evs4j), so I would pursue both at
the same time.
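
For reference, the raw substrate such process-group libraries build on is
just JDK multicast; a totem-style protocol like evs4j layers ordering and
membership on top of something like this (group address and port are
arbitrary examples):

    import java.net.DatagramPacket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;

    public class MulticastPeer {
        public static void main(String[] args) throws Exception {
            InetAddress group = InetAddress.getByName("239.1.2.3");
            try (MulticastSocket sock = new MulticastSocket(4446)) {
                sock.joinGroup(group);
                byte[] msg = "state-update".getBytes("US-ASCII");
                sock.send(new DatagramPacket(msg, msg.length, group, 4446));

                byte[] buf = new byte[1500];
                DatagramPacket in = new DatagramPacket(buf, buf.length);
                sock.receive(in); // will also see our own message
                System.out.println(
                    new String(in.getData(), 0, in.getLength(), "US-ASCII"));
            }
        }
    }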

Guglielmo


Re: Infiniband

2006-01-14 Thread lichtner

On Fri, 13 Jan 2006, James Strachan wrote:

> > The infiniband transport would be native code, so you could use JNI.
> > However, it would definitely be worth it.
>
> Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages
> & google every once in a while to see if one shows up :)
>
> It looks like MPI has support for Infiniband; would it be worth
> trying to wrap that in JNI?
> http://www-unix.mcs.anl.gov/mpi/
> http://www-unix.mcs.anl.gov/mpi/mpich2/

I don't know MPI. Do you think it's a better interface, or that it is much
easier?

I will take a look at MPI.

Guglielmo


Re: Infiniband

2006-01-13 Thread James Strachan


On 13 Jan 2006, at 11:51, [EMAIL PROTECTED] wrote:
> With regard to clustering, I also want to mention a remote option, which
> is to use infiniband RDMA for inter-node communication.
>
> With an infiniband link between two machines you can copy a buffer
> directly from the memory of one to the memory of the other, without
> switching context. This means the kernel scheduler is not involved at all,
> and there are no copies.


I love infiniband RDMA! :)


> I think the bandwidth can be up to 30Gbps right now. Pathscale makes an IB
> adapter which plugs into the new HTX hypertransport slot, which is to say
> it bypasses the pci bus (!). They report an 8-byte message latency of 1.32
> microseconds.
>
> I think IB costs about $500 per node. But the cost is going down steadily
> because the people who use IB typically buy thousands of network cards at
> a time (for supercomputers.)
>
> The infiniband transport would be native code, so you could use JNI.
> However, it would definitely be worth it.


Agreed! I'd *love* a Java API to Infiniband! Have wanted one for ages  
& google every once in a while to see if one shows up :)


It looks like MPI has support for Infiniband; would it be worth  
trying to wrap that in JNI?

http://www-unix.mcs.anl.gov/mpi/
http://www-unix.mcs.anl.gov/mpi/mpich2/

James
---
http://radio.weblogs.com/0112098/



Re: Infiniband

2006-01-13 Thread lichtner

On Fri, 13 Jan 2006, Alan D. Cabrera wrote:

> > The infiniband transport would be native code, so you could use JNI.
> > However, it would definitely be worth it.
>
> Do you have any references to where one could get a peek at the
> transport API?

http://infiniband.sourceforge.net/



Re: Infiniband

2006-01-13 Thread Alan D. Cabrera

[EMAIL PROTECTED] wrote, On 1/13/2006 11:51 AM:

> With regard to clustering, I also want to mention a remote option, which
> is to use infiniband RDMA for inter-node communication.
>
> With an infiniband link between two machines you can copy a buffer
> directly from the memory of one to the memory of the other, without
> switching context. This means the kernel scheduler is not involved at all,
> and there are no copies.
>
> I think the bandwidth can be up to 30Gbps right now. Pathscale makes an IB
> adapter which plugs into the new HTX hypertransport slot, which is to say
> it bypasses the pci bus (!). They report an 8-byte message latency of 1.32
> microseconds.
>
> I think IB costs about $500 per node. But the cost is going down steadily
> because the people who use IB typically buy thousands of network cards at
> a time (for supercomputers.)
>
> The infiniband transport would be native code, so you could use JNI.
> However, it would definitely be worth it.


Do you have any references to where one could get a peek at the
transport API?



Regards,
Alan



Infiniband

2006-01-13 Thread lichtner

With regard to clustering, I also want to mention a remote option, which
is to use infiniband RDMA for inter-node communication.

With an infiniband link between two machines you can copy a buffer
directly from the memory of one to the memory of the other, without
switching context. This means the kernel scheduler is not involved at all,
and there are no copies.

I think the bandwidth can be up to 30Gbps right now. Pathscale makes an IB
adapter which plugs into the new HTX hypertransport slot, which is to say
it bypasses the pci bus (!). They report an 8-byte message latency of 1.32
microseconds.

I think IB costs about $500 per node. But the cost is going down steadily
because the people who use IB typically buy thousands of network cards at
a time (for supercomputers.)

The infiniband transport would be native code, so you could use JNI.
However, it would definitely be worth it.

Guglielmo