I believe the "resource temporarily unavailable" actually is related to the
open files, most likely you are hitting the total file descriptor limit.
Sorry if you mentioned this and I missed it, but what was the
max.open.files in your RocksDBConfigSetter when you ran this? Actually,
could you just include your full RocksDBConfigSetter implementation?
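
For concreteness, here is a rough sketch of the kind of implementation I'm
asking about. The interface and config key below are the real Streams API,
but the class name and the value 500 are just placeholders for illustration,
not recommendations:

import java.util.Map;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

public class BoundedOpenFilesConfigSetter implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        // Cap how many files RocksDB may hold open per store instance;
        // the RocksDB default of -1 means unbounded.
        options.setMaxOpenFiles(500);
    }
}

// Registered via the Streams config, e.g.:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG,
//           BoundedOpenFilesConfigSetter.class);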

Not sure why Rocks would be spawning so many child processes... it does use
background threads for compaction, but this sounds like something totally
different. I also noticed a read of "This is a RocksDB options file" in the
strace output -- are you trying to use an options file instead of setting the
options with the config setter (or otherwise using RocksDB outside of
Streams)?

Have you tried cleaning up the state between runs? Are you using
interactive queries?
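
(By "cleaning up" I mean wiping the local state directory so each run starts
fresh. A minimal sketch, assuming "topology" and "props" are your existing
topology and config -- note the stores will be rebuilt from the changelog
topics on the next run:)

import org.apache.kafka.streams.KafkaStreams;

// cleanUp() deletes this instance's local state directory for the
// application id; it must be called while the instance is not running,
// i.e. before start() or after close().
final KafkaStreams streams = new KafkaStreams(topology, props);
streams.cleanUp();
streams.start();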

Regarding the .checkpoint file not found thing -- this is an annoying but
pretty much harmless bug we only recently figured out. There's an open PR for
it, but for now you can just ignore the warning. See KAFKA-5998
<https://issues.apache.org/jira/browse/KAFKA-5998> for the long history (but
scroll to the bottom for the actual explanation).


On Thu, Jul 4, 2019 at 11:02 AM Thameem Ansari <thame...@gmail.com> wrote:

> I have a few more details to share from today’s testing.
>
> I attached strace to the process and noticed that thousands of short-lived
> child processes have been created by the streams application. Not sure
> whether RocksDB is playing any role here. I noticed more than 73000 child
> processes were created, and this is after I increased the default Linux
> system limit from 32767 to 4 million per PID. I would appreciate it if
> someone could answer the following questions.
> 1. Why does RocksDB create this many processes? Each process is trying to
> read the contents of a file and getting EAGAIN (Resource temporarily
> unavailable).
> 2. I noticed a very high number of missing .checkpoint files, hence the
> “No such file or directory” messages.
> 3. The child processes were created using the “clone” system call. Not sure
> whether there is any significance to this, but I wanted to mention it.
> Moreover, these child processes keep increasing while load is applied
> continuously; not sure under what conditions it will hit the ceiling.
> 4. I noticed very high VIRT memory usage (around 31G), but RES memory was
> only 1.9G. Doesn’t every file open/close consume memory? But it is not
> going up even though the number of processes has increased.
>
> Thanks
> Thameem
>
> > On Jul 4, 2019, at 9:28 AM, Thameem Ansari <thame...@gmail.com> wrote:
> >
> > Sorry for reposting the previous message, as the images didn’t come
> > through; pasting as text.
> >
> > I have changed both system and user limits.
> >
> > To completely isolate the problem, I tried the application in a CentOS 7
> > environment. I set the ulimit to 1 million and the system limit to 10
> > million open files. Now 3 Kafka nodes are running on separate servers,
> > and the streams application is running in a dedicated VM. The application
> > is no longer explicitly throwing the “too many open files” error, but it
> > aborted with the message
> >
> > terminate called after throwing an instance of 'std::system_error'
> >   what():  Resource temporarily unavailable
> > ./bin/start.sh: line 42:  1100 Aborted
> >
> > Here are the last few lines from the strace output, which show the abort
> > message.
> >
> > 25721 14:58:35 open("/home/devops/data/kafka-streams/RawLog_Processor/393_7/MAIL.INCOMING_FILTER_BY_USER_MAIL_INCOMING_DOMAIN_TRAFFIC_DETAIL.DETECTED_CONTENT_FILTER_RECIPIENT_DOMAIN/MAIL.INCOMING_FILTER_BY_USER_MAIL_INCOMING_DOMAIN_TRAFFIC_DETAIL.DETECTED_CONTENT_FILTER_RECIPIENT_DOMAIN.1555200000/MANIFEST-000007", O_RDONLY|O_CLOEXEC) = 12505
> > 25721 14:58:35 open("/sys/devices/virtual/block/dm-2/queue/logical_block_size", O_RDONLY) = 12506
> > 25721 14:58:35 read(12506, "512\n", 4096) = 4
> > 25721 14:58:35 close(12506)             = 0
> > 25721 14:58:35 write(12502, "s.advise_random_on_open: 0\n2019/"..., 4096) = 4096
> > 25721 14:58:35 write(12502, "ions.comparator: leveldb.Bytewis"..., 4096) = 4096
> > 25721 14:58:35 read(12505, "V\371\270\370\34\0\1\1\32leveldb.BytewiseCompara"..., 32768) = 192
> > 25721 14:58:35 read(12505, "", 28672)   = 0
> > 25721 14:58:35 close(12505)             = 0
> > 17701 14:58:35 open("/home/devops/data/kafka-streams/RawLog_Processor/393_7/MAIL.INCOMING_FILTER_BY_USER_MAIL_INCOMING_DOMAIN_TRAFFIC_DETAIL.DETECTED_CONTENT_FILTER_RECIPIENT_DOMAIN/MAIL.INCOMING_FILTER_BY_USER_MAIL_INCOMING_DOMAIN_TRAFFIC_DETAIL.DETECTED_CONTENT_FILTER_RECIPIENT_DOMAIN.1555200000/000006.sst", O_RDONLY|O_CLOEXEC) = 12505
> > 17702 14:58:35 +++ exited with 0 +++
> > 25721 14:58:35 write(2, "terminate called after throwing "..., 48) = 48
> > 25721 14:58:35 write(2, "std::system_error", 17) = 17
> > 25721 14:58:35 write(2, "'\n", 2)       = 2
> > 25721 14:58:35 write(2, "  what():  ", 11) = 11
> > 25721 14:58:35 write(2, "Resource temporarily unavailable", 32) = 32
> > 25721 14:58:35 write(2, "\n", 1)        = 1
> > 17701 14:58:35 open("/sys/devices/virtual/block/dm-2/queue/logical_block_size", O_RDONLY) = 12506
> > 25721 14:58:35 write(12502, "ction_dynamic_level_bytes: 0\n201"..., 3672) = 3672
> > 25721 14:58:35 --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=25697, si_uid=1000} ---
> > 17701 14:58:35 read(12506,  <unfinished ...>
> > 17701 14:58:36 +++ killed by SIGABRT +++
> > 17700 14:58:36 +++ killed by SIGABRT +++
> > 17699 14:58:36 +++ killed by SIGABRT +++
> >
> >
> > As I can see, the number of open files (45201) is way lower than the max
> > limit, so can we rule out “open files” as the root cause?
> >
> > I have also noticed a lot of EAGAIN results (count 3334) and the
> > associated “Resource temporarily unavailable” messages as well:
> >
> > 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25732 14:49:23 close(59)                = 0
> > 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25709 14:49:23 close(59)                = 0
> > 25721 14:49:23 write(50, "\0", 1)       = 1
> > 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25709 14:49:23 close(59)                = 0
> > 25721 14:49:23 read(17, "PK\3\4\n\0\0\10\10\0\343SHNM\n4>Q\2\0\0\265\5\0\0A\0\0\0", 30) = 30
> > 25721 14:49:23 read(17, "\245T]O\23A\24=\323\226\326-\253\255\250\365\v\4\265ji\221E\344\305\200M\f\201\304\244j"..., 593) = 593
> > 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25709 14:49:23 close(59)                = 0
> > 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25732 14:49:23 close(59)                = 0
> > 25721 14:49:23 read(49, "\0", 16)       = 1
> > 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25709 14:49:23 close(59)                = 0
> > 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25732 14:49:23 close(59)                = 0
> > 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25709 14:49:23 close(59)                = 0
> > 25721 14:49:23 read(54, "\0\1dx", 4)    = 4
> > 25721 14:49:23 read(54, "\0\0\0\v\0\0\0\0\0\0\0\1dj\0\0\0\0\1\212\0\201RawLog_Pro"..., 91256) = 91256
> > 25721 14:49:23 read(54, 0x7f7044003b80, 4) = -1 EAGAIN (Resource temporarily unavailable)
> > 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25721 14:49:23 write(35, "2019-07-03 14:49:23.258 [rawLP1-"..., 219 <unfinished ...>
> > 25732 14:49:23 close(59)                = 0
> > 25721 14:49:23 <... write resumed> )    = 219
> > 25721 14:49:23 write(1, "2019-07-03 14:49:23.258 [rawLP1-"..., 219) = 219
> > 25721 14:49:23 write(50, "\0", 1)       = 1
> > 25721 14:49:23 read(49, "\0", 16)       = 1
> > 25708 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 60
> > 25708 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25708 14:49:23 close(59)                = 0
> > 25709 14:49:23 read(60, "9223372036854775807\n", 4096) = 20
> > 25709 14:49:23 close(60)                = 0
> > 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25709 14:49:23 close(59)                = 0
> > 25708 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25708 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25708 14:49:23 close(59)                = 0
> > 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25709 14:49:23 close(59)                = 0
> > 25732 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25732 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > 25732 14:49:23 close(59)                = 0
> > 25709 14:49:23 open("/sys/fs/cgroup/memory/memory.limit_in_bytes", O_RDONLY) = 59
> > 25709 14:49:23 read(59, "9223372036854775807\n", 4096) = 20
> > But I am unable to come to any conclusion from these findings. Any
> > insight is appreciated.
> >
> > Thanks
> > Thameem
> >
> >> On Jul 4, 2019, at 9:18 AM, Thameem Ansari <thame...@gmail.com> wrote:
> >>
> >> <strace2.png>
> >>
> >>
> >> But unable to come to any conclusion from these findings. Any insight
> is appreciated.
> >>
> >> Thanks
> >> Thameem
> >
>
>
