I've been experimenting with getting Crail over TCP to work with the
crail-spark-io <https://github.com/zrlio/crail-spark-io> shuffle extensions.

It seems to work fine for small shuffle sizes (up to about 10 gigabytes),
but anything larger hangs. I've investigated, and the hangs appear to have
a few causes, mostly confined to the NaRPC layer.

The benchmark numbers here
<https://crail.incubator.apache.org/blog/2019/03/disaggregation.html> seem
to imply that this has worked for at least 200 gigabyte shuffles (I'm not
certain because that second experiment does not explicitly give the test
parameters). Has anybody had success with Crail over TCP, or were pretty
much all of the tests run over RDMA/NVMe?

-- 
-Ben
