Hi, Dennis Zhou
Thanks for your nice answer,
but I still have a few questions.
> Percpu is not really cheap memory to allocate because it has a
> amplification factor of NR_CPUS. As a result, percpu on the critical
> path is really not something that is expected to be high throughput.
> Ideally things like btrfs snapshots should preallocate a number of these
> and not try to do atomic allocations because that in theory could fail
> because even after we go to the page allocator in the future we can't
> get enough pages due to needing to go into reclaim.
Per-module pre-allocation such as mempool_t is used in only a few places
in linux/fs, so it seems most people prefer system-wide pre-allocation
because it is easier to use?
Could we add more opportunities to manage the system-wide pre-allocation,
just like this?
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index dc1f4dc..eb3f592 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -226,6 +226,11 @@ static inline void memalloc_noio_restore(unsigned int flags)
 static inline unsigned int memalloc_nofs_save(void)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+
+	// just like slab_pre_alloc_hook
+	fs_reclaim_acquire(current->flags & gfp_allowed_mask);
+	fs_reclaim_release(current->flags & gfp_allowed_mask);
+
 	current->flags |= PF_MEMALLOC_NOFS;
 	return flags;
 }
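To make clear what I mean by per-module pre-allocation, here is a minimal
userspace sketch of the idea. It is only an illustration and not the kernel
mempool_t API: the struct and the pool_* names are hypothetical, and unlike
mempool_alloc() it never calls the underlying allocator on the allocation
path, it only hands out elements reserved at init time.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of a per-module reserve (not mempool_t). */
struct prealloc_pool {
	void **elems;	/* stack of reserved elements */
	size_t nr_free;	/* elements currently held in reserve */
	size_t min_nr;	/* size of the reserve */
};

/* Free every element still in reserve, then the stack itself. */
void pool_destroy(struct prealloc_pool *p)
{
	while (p->nr_free > 0)
		free(p->elems[--p->nr_free]);
	free(p->elems);
	p->elems = NULL;
}

/* Reserve min_nr elements up front, where failing or sleeping is acceptable. */
int pool_init(struct prealloc_pool *p, size_t min_nr, size_t elem_size)
{
	if (min_nr == 0 || elem_size == 0)
		return -1;
	p->elems = calloc(min_nr, sizeof(*p->elems));
	if (!p->elems)
		return -1;
	p->nr_free = 0;
	p->min_nr = min_nr;
	while (p->nr_free < min_nr) {
		void *e = malloc(elem_size);
		if (!e) {
			pool_destroy(p);
			return -1;
		}
		p->elems[p->nr_free++] = e;
	}
	return 0;
}

/* Critical-path allocation: take from the reserve, NULL when it is empty. */
void *pool_alloc(struct prealloc_pool *p)
{
	return p->nr_free ? p->elems[--p->nr_free] : NULL;
}

/* Refill the reserve first; release the element only when it is full. */
void pool_free(struct prealloc_pool *p, void *e)
{
	if (p->nr_free < p->min_nr)
		p->elems[p->nr_free++] = e;
	else
		free(e);
}
```

The cost of this approach is that every user has to size and manage its own
reserve, which may be why it is rare compared with relying on the system-wide
reserve.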
> The workqueue approach has been good enough so far. Technically there is
> a higher priority workqueue that this work could be scheduled on, but
> save for this miss on my part, the system workqueue has worked out fine.
> In the future as I mentioned above. It would be good to support actually
> getting pages, but it's work that needs to be tackled with a bit of
> care. I might target the work for v5.14.
>
> > this is our application pipeline.
> > file_pre_process |
> > bwa.nipt xx |
> > samtools.nipt sort xx |
> > file_post_process
> >
> > file_pre_process/file_post_process is fast, so often are blocked by
> > pipe input/output.
> >
> > 'bwa.nipt xx' is a high-cpu-load, almost all of CPU cores.
> >
> > 'samtools.nipt sort xx' is a high-mem-load, it keep the input in memory.
> > if the memory is not enough, it will save all the buffer to temp file,
> > so it is sometimes high-IO-load too(write 60G or more to file).
> >
> >
> > xfstests(generic/476) is just high-IO-load, cpu/memory load is NOT high.
> > so xfstests(generic/476) maybe easy than our application pipeline.
> >
> > Although there is yet not a simple reproducer for another problem
> > happend here, but there is a little high chance that something is wrong
> > in btrfs/mm/fs-buffer.
> > > but another problem(os freezed without call trace, PANIC without OOPS?,
> > > the reason is yet unkown) still happen.
>
> I do not have an answer for this. I would recommend looking into kdump.
Could the percpu ENOMEM problem have blocked many heavy-load tests for a
rather long time?
I still suspect that this system freeze is an mm/btrfs problem.
Neither the OOM killer nor an OOPS is triggered.
I am trying to reproduce it with a simple script. I noticed that the value
of 'free' is rather low, although 'available' is large.
# free -h
              total        used        free      shared  buff/cache   available
Mem:          188Gi       1.4Gi       5.5Gi        17Mi       181Gi       175Gi
Swap:            0B          0B          0B
vm.min_free_kbytes is automatically configured to 4Gi (4194304).
# write files with a total size >= 3 * memory size
# for((i=0;i<10;++i)); do dd if=/dev/zero bs=1M count=64K of=/nodetmp/${i}.txt; free -h; done
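As a sanity check of the numbers above, here is a small sketch of the
arithmetic. The helper names are hypothetical; it assumes the GNU dd
convention that the M and K suffixes are 1024-based, and it takes the 188Gi
total from the 'free -h' output.

```c
#include <stdint.h>

/* Bytes written by nfiles runs of dd with block size bs and block count. */
uint64_t dd_total_bytes(uint64_t nfiles, uint64_t bs, uint64_t count)
{
	return nfiles * bs * count;
}

/* Convert bytes to whole GiB (rounds down). */
uint64_t bytes_to_gib(uint64_t bytes)
{
	return bytes >> 30;
}

/* Convert a sysctl value in KiB, such as vm.min_free_kbytes, to whole GiB. */
uint64_t kib_to_gib(uint64_t kib)
{
	return kib >> 20;
}
```

With 10 files, bs=1M (1048576) and count=64K (65536), the loop writes 640GiB
in total, which is more than 3 * 188GiB = 564GiB, and a min_free_kbytes of
4194304 corresponds to 4GiB.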
Is there any advice or patch to make the value of 'free' a little bigger?
Best Regards
Wang Yugui ([email protected])
2021/04/10