Thanks for the help. I've been able to get the RDMA setup working and am troubleshooting a few issues with the bench tests. The issues so far have all been configuration related: the ulimit -l setting and an incorrect value for "crail.namenode.rpctype". I am ignoring the TCP tier for now since I don't really need it yet.
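
For anyone hitting the same "Memory registration failed" error, the locked-memory limit can also be checked from a script rather than by hand. The following is a minimal Python sketch and not part of Crail: the helper names and the 1 GiB region size are assumptions chosen for illustration, and the real size depends on your Crail storage settings. Note that the limit is reported here in bytes, whereas ulimit -l prints kB.

```python
import resource


def memlock_limit_bytes():
    """Return (soft, hard) RLIMIT_MEMLOCK in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def normalize(value):
        return None if value == resource.RLIM_INFINITY else value

    return normalize(soft), normalize(hard)


def can_register(region_bytes, soft_limit):
    """True if a region of region_bytes fits under the soft memlock limit."""
    return soft_limit is None or region_bytes <= soft_limit


if __name__ == "__main__":
    # Hypothetical region size (1 GiB); substitute the amount of memory
    # your storage tier is actually configured to register.
    region = 1 << 30
    soft, hard = memlock_limit_bytes()
    print("soft limit:", "unlimited" if soft is None else soft)
    print("hard limit:", "unlimited" if hard is None else hard)
    if not can_register(region, soft):
        print("memlock limit is too low for a region of", region, "bytes")
```

The script should be run as the same user, and from the same kind of session, that starts the datanode, since limits can differ between login shells and services.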
I have more questions about data locality and Spark, which I'll ask in another post.

Thanks for all your help,
Sumit

On Sat, Jun 9, 2018 at 1:30 AM Animesh Trivedi <[email protected]> wrote:

> Hi Sumit,
>
> Great that you attended the talk. Please also join the crail mailing list
> ([email protected], cc'ed) and post issues there so that others can
> benefit from it. As you might have figured out, we are a new project,
> so we are still learning the ropes :)
>
> Having said that:
>
> 1) The RDMA tier failure looks like (i) the Infiniband device is not
> set up properly (what does ibv_devices show?); and/or (ii) you do not have
> permission to register large memory segments (check with ulimit -l). I
> think the default is 64kB. If that is so, then you have to increase the
> memory limit (https://access.redhat.com/solutions/61334, memlock). For
> the RDMA tier, crail needs to register memory that is typically more than
> just a few kBs.
>
> 2) The TCP tier error is more cryptic, so maybe other developers have
> an idea what might be wrong. Could you also please post your crail
> configuration?
>
> Cheers,
> --
> Animesh
>
>
> On Sat, Jun 9, 2018 at 1:00 AM, Sumit Sen <[email protected]> wrote:
>
>> Hi Animesh,
>>
>> I've just started trying to use Crail on a cluster running SLES12. I
>> attended the talk at Spark Summit which mentioned crail. Our nodes are
>> connected with both ethernet and infiniband. I want to run some of the
>> benchmarks to see what sort of performance I can get. However, I am running
>> into problems and haven't been able to figure out what to do. Can you help
>> me or give me the name of someone else who can help? I've given some
>> details below. I'd appreciate any help I can get to come up to speed on
>> this.
>>
>> Thanks,
>> Sumit
>>
>> Here are the issues I'm facing:
>> *RDMA configuration:*
>> Unable to start data node:
>> Exception in thread "main" java.io.IOException: Memory registration failed with -1
>>     at com.ibm.disni.rdma.verbs.impl.NatRegMrCall.execute(NatRegMrCall.java:80)
>>     at com.ibm.disni.rdma.verbs.impl.NatRegMrCall.execute(NatRegMrCall.java:33)
>>     at org.apache.crail.storage.rdma.RdmaStorageServer.allocateResource(RdmaStorageServer.java:120)
>>     at org.apache.crail.storage.StorageServer.main(StorageServer.java:152)
>>
>> *TCP configuration:*
>> - both namenode and datanode start up
>> However, I can't run "iobench -t write". I get an immediate error that
>> crashes the JVM on the datanode.
>> I see the following stack on the iobench console:
>> warmUp, warmupFile /tmp.dat2001725267, operations 32
>> Exception in thread "main" java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.io.IOException: Connection reset by peer
>>     at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:93)
>>     at org.apache.crail.tools.CrailBenchmark.warmUp(CrailBenchmark.java:978)
>>     at org.apache.crail.tools.CrailBenchmark.write(CrailBenchmark.java:97)
>>     at org.apache.crail.tools.CrailBenchmark.main(CrailBenchmark.java:1070)
>> Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.io.IOException: Connection reset by peer
>>     at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:93)
>>     at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:78)
>>     ... 3 more
>> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Connection reset by peer
>>     at com.ibm.narpc.NaRPCFuture.get(NaRPCFuture.java:73)
>>     at org.apache.crail.storage.tcp.TcpStorageFuture.get(TcpStorageFuture.java:56)
>>     at org.apache.crail.storage.tcp.TcpStorageFuture.get(TcpStorageFuture.java:30)
>>     at org.apache.crail.utils.MultiFuture.get(MultiFuture.java:78)
>>     ... 4 more
>> Caused by: java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>>     at com.ibm.narpc.NaRPCChannel.fetchBuffer(NaRPCChannel.java:51)
>>     at com.ibm.narpc.NaRPCEndpoint.pollResponse(NaRPCEndpoint.java:74)
>>     at com.ibm.narpc.NaRPCFuture.get(NaRPCFuture.java:70)
>>     ... 7 more
