[ https://issues.apache.org/jira/browse/CASSANDRA-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265898#comment-14265898 ]
Benedict commented on CASSANDRA-8552:
-------------------------------------

[~philipthompson] could we try to reproduce ourselves, since Brent has navigated around the problem for the moment? It would be great to figure this out and fix it for 2.1.3.

> Large compactions run out of off-heap RAM
> -----------------------------------------
>
>                 Key: CASSANDRA-8552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Ubuntu 14.04
>                      AWS EC2
>                      12 m1.xlarge nodes [4 cores, 16GB RAM, 1TB storage (251GB used)]
>                      Java build 1.7.0_55-b13 and build 1.8.0_25-b17
>            Reporter: Brent Haines
>            Assignee: Benedict
>            Priority: Blocker
>             Fix For: 2.1.3
>
>         Attachments: Screen Shot 2015-01-02 at 9.36.11 PM.png, fhandles.log, freelog.log, lsof.txt, meminfo.txt, sysctl.txt, system.log
>
>
> We have a large table storing, effectively, event logs, and a pair of denormalized tables for indexing.
> When updating from 2.0 to 2.1 we saw performance improvements, but some random and silent crashes during nightly repairs. We lost a node (totally corrupted) and replaced it. That node has never stabilized -- it simply can't finish the compactions.
> Smaller compactions finish. Larger compactions, like these two, never finish:
> {code}
> pending tasks: 48
>    compaction type   keyspace             table     completed         total    unit   progress
>         Compaction       data           stories   16532973358   75977993784   bytes     21.76%
>         Compaction       data   stories_by_text   10593780658   38555048812   bytes     27.48%
> Active compaction remaining time :   0h10m51s
> {code}
> We are not getting exceptions and are not running out of heap space. The Ubuntu OOM killer is reaping the process after all of the memory is consumed. We watch memory in the OpsCenter console and it will grow. If we turn off the OOM killer for the process, it will run until everything else is killed instead and then the kernel panics.
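(Editor's aside, not part of the ticket: the progress column in the compaction status above is simply completed/total bytes, which is a quick way to estimate how much work remains. A minimal sketch reproducing the two figures from the numbers quoted above:)

```python
# Progress figures from the compaction status quoted in this ticket.
# nodetool reports progress as completed / total bytes.
compactions = {
    "stories": (16532973358, 75977993784),
    "stories_by_text": (10593780658, 38555048812),
}

for table, (completed, total) in compactions.items():
    pct = 100.0 * completed / total
    remaining_gb = (total - completed) / 1e9
    print(f"{table}: {pct:.2f}% done, {remaining_gb:.1f} GB remaining")
```

This matches the 21.76% and 27.48% shown, i.e. roughly 87 GB of compaction work was still pending on the struggling node.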
> We have the following settings configured:
> 2G Heap
> 512M New
> {code}
> memtable_heap_space_in_mb: 1024
> memtable_offheap_space_in_mb: 1024
> memtable_allocation_type: heap_buffers
> commitlog_total_space_in_mb: 2048
> concurrent_compactors: 1
> compaction_throughput_mb_per_sec: 128
> {code}
> The compaction strategy is leveled (these are read-intensive tables that are rarely updated).
> I have tried every setting and every option, and I have the system where the MTBF is about an hour now, but we never finish compacting because there are some large compactions pending. None of the GC tools or settings help because it is not a GC problem. It is an off-heap memory problem.
> We are getting these messages in our syslog:
> {code}
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219527] BUG: Bad page map in process java pte:00000320 pmd:2d6fa5067
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219545] addr:00007fb820be3000 vm_flags:08000070 anon_vma: (null) mapping: (null) index:7fb820be3
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219556] CPU: 3 PID: 27344 Comm: java Tainted: G B 3.13.0-24-generic #47-Ubuntu
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219559] ffff880028510e40 ffff88020d43da98 ffffffff81715ac4 00007fb820be3000
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219565] ffff88020d43dae0 ffffffff81174183 0000000000000320 00000007fb820be3
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219568] ffff8802d6fa5f18 0000000000000320 00007fb820be3000 00007fb820be4000
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219572] Call Trace:
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219584] [<ffffffff81715ac4>] dump_stack+0x45/0x56
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219591] [<ffffffff81174183>] print_bad_pte+0x1a3/0x250
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219594] [<ffffffff81175439>] vm_normal_page+0x69/0x80
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219598] [<ffffffff8117580b>] unmap_page_range+0x3bb/0x7f0
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219602] [<ffffffff81175cc1>] unmap_single_vma+0x81/0xf0
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219605] [<ffffffff81176d39>] unmap_vmas+0x49/0x90
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219610] [<ffffffff8117feec>] exit_mmap+0x9c/0x170
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219617] [<ffffffff8110fcf3>] ? __delayacct_add_tsk+0x153/0x170
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219621] [<ffffffff8106482c>] mmput+0x5c/0x120
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219625] [<ffffffff81069bbc>] do_exit+0x26c/0xa50
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219631] [<ffffffff810d7591>] ? __unqueue_futex+0x31/0x60
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219634] [<ffffffff810d83b6>] ? futex_wait+0x126/0x290
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219640] [<ffffffff8171d8e0>] ? _raw_spin_unlock_irqrestore+0x20/0x40
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219643] [<ffffffff8106a41f>] do_group_exit+0x3f/0xa0
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219649] [<ffffffff8107a050>] get_signal_to_deliver+0x1d0/0x6f0
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219655] [<ffffffff81013448>] do_signal+0x48/0x960
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219660] [<ffffffff811112fc>] ? acct_account_cputime+0x1c/0x20
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219664] [<ffffffff8109d76b>] ? account_user_time+0x8b/0xa0
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219667] [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219671] [<ffffffff81013dc9>] do_notify_resume+0x69/0xb0
> Jan 2 07:06:00 ip-10-0-2-226 kernel: [49801151.219676] [<ffffffff8172676a>] int_signal+0x12/0x17
> {code}
> This seems like unmap is failing, but I am uncertain about how to fix it or work around it.
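(Editor's aside, not part of the ticket: since the trace above dies during page unmapping at process exit, one cheap thing to rule out on an affected node is mapping exhaustion. Cassandra 2.1 mmaps sstable segments, and a node churning through very large compactions can accumulate a lot of mapped regions, bounded by the kernel's vm.max_map_count. A minimal sketch, using only standard Linux /proc interfaces, of comparing a process's mapping count to that limit; substitute the Cassandra JVM's pid:)

```python
import os

def count_mappings(maps_text: str) -> int:
    """Count mapped regions in /proc/<pid>/maps-format text (one region per line)."""
    return sum(1 for line in maps_text.splitlines() if line.strip())

def max_map_count(path: str = "/proc/sys/vm/max_map_count") -> int:
    """Read the kernel's per-process mapped-region limit (commonly 65530 by default)."""
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    pid = os.getpid()  # replace with the Cassandra JVM's pid on a real node
    with open(f"/proc/{pid}/maps") as f:
        n = count_mappings(f.read())
    print(f"{n} mapped regions; kernel limit {max_map_count()}")
```

If the count sits near the limit while compactions run, raising vm.max_map_count (or finding what is leaking mappings) would be the next step; if it stays low, the leak is elsewhere in off-heap memory.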
> For completeness' sake, let me point this out too: the system.log will show whatever was happening when the system stops and the service is restarted. There is no stack trace. Here is an example:
> {code}
> INFO [main] 2015-01-02 06:38:38,813 ColumnFamilyStore.java:840 - Enqueuing flush of local: 1552 (0%) on-heap, 0 (0%) off-heap
> INFO [MemtableFlushWriter:1] 2015-01-02 06:38:38,813 Memtable.java:325 - Writing Memtable-local@172795560(281 serialized bytes, 10 ops, 0%/0% of on/off-heap limit)
> INFO [MemtableFlushWriter:1] 2015-01-02 06:38:38,824 Memtable.java:364 - Completed flushing /data/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-778-Data.db (262 bytes) for commitlog position ReplayPosition(segmentId=1420180671225, position=87520)
> INFO [main] 2015-01-02 06:38:38,825 YamlConfigurationLoader.java:92 - Loading settings from file:/etc/cassandra/cassandra.yaml
> INFO [main] 2015-01-02 06:38:38,837 YamlConfigurationLoader.java:135 - Node configuration:[authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_snapshot=true; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; cas_contention_timeout_in_ms=1000; client_encryption_options=<REDACTED>; cluster_name=booshaka-batch; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_directory=/commitlog/cassandra/commitlog; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_period_in_ms=10000; commitlog_total_space_in_mb=2048; compaction_throughput_mb_per_sec=128; concurrent_compactors=1; concurrent_counter_writes=32; concurrent_reads=48; concurrent_writes=48; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; cross_node_timeout=false; data_file_directories=[/data/cassandra/data]; disk_failure_policy=stop; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; endpoint_snitch=Ec2Snitch; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=10.0.2.226; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; memtable_cleanup_threshold=0.33; memtable_heap_space_in_mb=1024; memtable_offheap_space_in_mb=1024; native_transport_port=9042; num_tokens=256; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_validity_in_ms=2000; phi_convict_threshold=12; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_timeout_in_ms=10000; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=10.0.2.226; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; saved_caches_directory=/data/cassandra/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=10.0.2.8,10.0.2.144,10.0.2.145}]}]; server_encryption_options=<REDACTED>; snapshot_before_compaction=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=true; storage_port=7000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; write_request_timeout_in_ms=2000]
> INFO [main] 2015-01-02 06:38:38,943 MessagingService.java:477 - Starting Messaging Service on port 7000
> INFO [main] 2015-01-02 06:38:38,981 YamlConfigurationLoader.java:92 - Loading settings from file:/etc/cassandra/cassandra.yaml
> INFO [main] 2015-01-02 06:38:38,987 YamlConfigurationLoader.java:135 - Node configuration:[identical to the Node configuration dump above]
> {code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)