Looking over the code, this is in fact an issue in 0.6.
It's fixed in trunk/0.7. Connections will be reused and closed properly, see 
https://issues.apache.org/jira/browse/CASSANDRA-1017 for more details.

We can either backport that patch or at least make sure the connections 
are closed properly in 0.6. Can you open a ticket for this bug?
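
To be concrete, the 0.6-side workaround would basically just be to close
the transport after each call. A rough, untested sketch of the pattern
(the class name, host/port and the calls inside the try block are
placeholders, not the actual ColumnFamilyInputFormat code):

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class CloseAfterUse {
        public static void main(String[] args) throws Exception {
            // placeholder host/port; 9160 is the default thrift port
            TTransport transport = new TSocket("localhost", 9160);
            transport.open();
            Cassandra.Client client =
                new Cassandra.Client(new TBinaryProtocol(transport));
            try {
                // ... thrift calls (e.g. get_range_slices) go here ...
            } finally {
                // without this, every call site leaks a socket and the
                // server keeps the matching file descriptor open
                transport.close();
            }
        }
    }

Trunk/0.7 goes further and reuses connections rather than just closing them.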

/Johan

On 12 May 2010, at 12.11, gabriele renzi wrote:

> a follow-up for anyone who may end up on this conversation again:
> 
> I kept trying, and neither changing the number of concurrent map tasks
> nor the slice size helped.
> Finally, I found a screw-up in our logging system, which had
> prevented us from noticing a couple of recurring errors in the logs:
> 
> ERROR [ROW-READ-STAGE:1] 2010-05-11 16:43:32,328 DebuggableThreadPoolExecutor.java (line 101) Error in ThreadPoolExecutor
> java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable
>         at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:53)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.RuntimeException: corrupt sstable
>         at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:73)
>         at org.apache.cassandra.db.ColumnFamilyStore.getKeyRange(ColumnFamilyStore.java:907)
>         at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1000)
>         at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41)
>         ... 4 more
> Caused by: java.io.FileNotFoundException: /path/to/data/Keyspace/CF-123-Index.db (Too many open files)
>         at java.io.RandomAccessFile.open(Native Method)
>         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
>         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
>         at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:143)
>         at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:138)
>         at org.apache.cassandra.io.SSTableReader.getNearestPosition(SSTableReader.java:414)
>         at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:62)
>         ... 7 more
> 
> and the related
> 
> WARN [main] 2010-05-11 16:43:38,076 TThreadPoolServer.java (line 190) Transport error occurred during acceptance of message.
> org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
>         at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124)
>         at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
>         at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>         at org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:184)
>         at org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:149)
>         at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:190)
> Caused by: java.net.SocketException: Too many open files
>         at java.net.PlainSocketImpl.socketAccept(Native Method)
>         at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
>         at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>         at java.net.ServerSocket.accept(ServerSocket.java:421)
>         at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119)
>         ... 5 more
> 
> The client was reporting timeouts in this case.
> 
> 
> The max fd limit on the process was in fact not particularly high
> (1024), and raising it seems to have solved the problem.
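> 
> In case it's useful to anyone hitting the same thing, a quick way to
> watch the descriptor usage from inside the JVM is the (Sun-specific)
> UnixOperatingSystemMXBean; this is just a sketch, lsof or
> /proc/<pid>/fd obviously work too:
> 
>     import java.lang.management.ManagementFactory;
>     import java.lang.management.OperatingSystemMXBean;
>     import com.sun.management.UnixOperatingSystemMXBean;
> 
>     public class FdCheck {
>         public static void main(String[] args) {
>             OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
>             if (os instanceof UnixOperatingSystemMXBean) {
>                 UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
>                 // compare current usage against the limit the JVM sees
>                 System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
>                         + " / max: " + unix.getMaxFileDescriptorCount());
>             }
>         }
>     }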
> 
> Anyway, it still seems that there may be two issues:
> 
> - since we had never seen this error before with normal client
> connections (as in: non-Hadoop), is it possible that the
> Cassandra/Hadoop layer is not closing sockets properly between one
> connection and the next, or not reusing connections efficiently?
> E.g. TSocket seems to have a close() method, but I don't see it used in
> ColumnFamilyInputFormat.(getSubSplits, getRangeMap); it may well be
> called inside CassandraClient, though. (A rough sketch of the reuse I
> have in mind follows after these two points.)
> 
> Anyway, judging by lsof's output I can only see about a hundred TCP
> connections, and those from the Hadoop jobs always seem to be below 60,
> so this may just be a wrong impression on my part.
> 
> - is it possible that such errors show up on the client side as
> timeout errors when they could be reported more clearly? This would
> probably help other people diagnose and report internal errors in the
> future.
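> 
> For the first point, this is roughly the kind of reuse I mean (purely
> hypothetical code, I haven't checked whether CassandraClient already
> does something like it): keep one client per task and close it once at
> the end, instead of opening a fresh socket per call:
> 
>     import org.apache.cassandra.thrift.Cassandra;
>     import org.apache.thrift.protocol.TBinaryProtocol;
>     import org.apache.thrift.transport.TSocket;
>     import org.apache.thrift.transport.TTransport;
> 
>     public class ReusableClient {
>         private TTransport transport;
>         private Cassandra.Client client;
> 
>         // hand out one cached client instead of a new socket per call
>         // (deliberately ignores the host changing between calls, to
>         // keep the sketch short)
>         Cassandra.Client get(String host, int port) throws Exception {
>             if (transport == null || !transport.isOpen()) {
>                 transport = new TSocket(host, port);
>                 transport.open();
>                 client = new Cassandra.Client(new TBinaryProtocol(transport));
>             }
>             return client;
>         }
> 
>         // one close() at the end of the task, so nothing is leaked
>         void close() {
>             if (transport != null && transport.isOpen())
>                 transport.close();
>         }
>     }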
> 
> 
> Thanks again to everyone for the help with this; I promise I'll put the
> discussion on the wiki for future reference :)
