Looking over the code, this is in fact an issue in 0.6.
It's fixed in trunk/0.7, where connections are reused and closed properly; see
https://issues.apache.org/jira/browse/CASSANDRA-1017 for more details.

We can either backport that patch or at least make sure the connections are
closed properly in 0.6. Can you open a ticket for this bug?
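
For illustration only, here is a minimal sketch of the "close it properly"
pattern. It is not the CASSANDRA-1017 patch and it does not use Thrift: it
uses a plain java.net.Socket against a local listener so that it runs on its
own. The class and method names (CloseConnectionSketch, openAndClose) are
made up for this example. With a Thrift TSocket the same idea would be to
call its close() in a finally block.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class CloseConnectionSketch {

    /**
     * Hypothetical stand-in for one per-split request to a node.
     * The socket is returned only so the caller can check that it was closed.
     */
    static Socket openAndClose(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), 1000);
            // ... the request/response exchange would happen here ...
            return socket;
        } finally {
            // Always release the file descriptor, even if the request fails.
            socket.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Local listener so the sketch runs without a Cassandra node.
        ServerSocket server = new ServerSocket(0);
        try {
            for (int i = 0; i < 5; i++) {
                Socket s = openAndClose("127.0.0.1", server.getLocalPort());
                System.out.println("connection " + i + " closed: " + s.isClosed());
            }
        } finally {
            server.close();
        }
    }
}
```

Without the finally block, every failed request would leave a descriptor open
until the garbage collector gets to it, which is how a job with many splits
can run into a low fd limit.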


On 12 May 2010, at 12:11, gabriele renzi wrote:

> A follow-up for anyone who may end up on this conversation again:
> I kept trying, and neither changing the number of concurrent map tasks
> nor changing the slice size helped.
> Finally, I found a problem in our logging system, which had
> prevented us from noticing a couple of recurring errors in the logs:
> ERROR [ROW-READ-STAGE:1] 2010-05-11 16:43:32,328 DebuggableThreadPoolExecutor.java (line 101) Error in ThreadPoolExecutor
> java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable
>        at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:53)
>        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.RuntimeException: corrupt sstable
>        at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:73)
>        at org.apache.cassandra.db.ColumnFamilyStore.getKeyRange(ColumnFamilyStore.java:907)
>        at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1000)
>        at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41)
>        ... 4 more
> Caused by: java.io.FileNotFoundException: /path/to/data/Keyspace/CF-123-Index.db (Too many open files)
>        at java.io.RandomAccessFile.open(Native Method)
>        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
>        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
>        at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:143)
>        at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:138)
>        at org.apache.cassandra.io.SSTableReader.getNearestPosition(SSTableReader.java:414)
>        at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:62)
>        ... 7 more
> and the related
> WARN [main] 2010-05-11 16:43:38,076 TThreadPoolServer.java (line 190) Transport error occurred during acceptance of message.
> org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
>        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124)
>        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
>        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>        at org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:184)
>        at org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:149)
>        at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:190)
> Caused by: java.net.SocketException: Too many open files
>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
>        at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>        at java.net.ServerSocket.accept(ServerSocket.java:421)
>        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119)
>        ... 5 more
> The client was reporting timeouts in this case.
> The max fd limit on the process was in fact not particularly high
> (1024), and raising it seems to have solved the problem.
> Anyway, it still seems that there may be two issues:
> - Since we had never seen this error before with normal client
> connections (as in: non-Hadoop), is it possible that the
> Cassandra/Hadoop layer is not closing sockets properly between one
> connection and the next, or not reusing connections efficiently?
> E.g. TSocket seems to have a close() method, and I don't see it used in
> ColumnFamilyInputFormat.(getSubSplits, getRangeMap), although it may well be
> inside CassandraClient.
> That said, judging by lsof's output I can only see about a hundred TCP
> connections, and those from the Hadoop jobs always seem to be below 60,
> so this may just be a wrong impression on my part.
> - Is it possible that such errors show up on the client side as
> timeoutErrors when they could be reported better? This would probably
> help other people in diagnosing/reporting internal errors in the
> future.
> Thanks again to everyone for the help with this. I promise I'll put the
> discussion on the wiki for future reference :)
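
For anyone diagnosing the "Too many open files" errors quoted above, a rough
sketch of how a JVM can report its own descriptor usage against its limit. It
assumes a Sun/Oracle-style JVM on a Unix system, where the platform bean
implements com.sun.management.UnixOperatingSystemMXBean. The class name
FdUsageSketch is made up for this example. Note that it reports on the JVM
that runs it; to look at a running Cassandra process you would read the same
attributes over JMX instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdUsageSketch {

    /** Returns {open, max}, or null when the JVM does not expose the counters. */
    static long[] fdUsage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null; // not a Unix JVM, or the counters are not available
    }

    public static void main(String[] args) {
        long[] usage = fdUsage();
        if (usage == null) {
            System.out.println("fd counters not available on this JVM");
        } else {
            System.out.println("open fds: " + usage[0] + " of max " + usage[1]);
        }
    }
}
```

An open count that keeps climbing towards the maximum while a job runs would
point at descriptors not being released, rather than at a limit that is
simply too low.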
