I've been experimenting with getting Crail over TCP to work with the crail-spark-io <https://github.com/zrlio/crail-spark-io> shuffle extensions.
It works fine for small shuffle sizes (up to about 10 gigabytes), but anything larger seems to hang. I've investigated, and the hangs appear to have a few causes, mostly contained in the NaRPC layer.

The benchmark numbers here <https://crail.incubator.apache.org/blog/2019/03/disaggregation.html> seem to imply that this has worked for shuffles of at least 200 gigabytes (I'm not certain, because the second experiment does not explicitly give its test parameters). Has anybody had success with Crail over TCP, or were pretty much all of the tests run over RDMA/NVMe?

--
-Ben