Hello everyone!
I'm experiencing problems with memory consumption.

I have a class that does an ETL job. Here's what happens inside:
 - fetch existing objects from the DB via SQLAlchemy
 - iterate over the raw data
 - create new / update existing objects
 - commit the changes

Before processing the data I build an internal cache (a dictionary) and store all
existing objects in it. Every 10,000 items I do a bulk insert and flush. At the end
I commit.
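
To make the flow concrete, here is a minimal sketch of what the class does. It is
simplified: Item, its "key"/"value" columns, the raw-row format and the app/models
imports are made-up placeholders, and db is the Flask-SQLAlchemy handle.

from app import db             # hypothetical Flask-SQLAlchemy handle
from app.models import Item    # hypothetical mapped model

class EtlJob:
    def __init__(self):
        # internal cache: every existing object, keyed by its natural key
        self.cache = {obj.key: obj for obj in Item.query.all()}

    def run(self, raw_rows):
        new_objects = []
        for i, row in enumerate(raw_rows, start=1):
            obj = self.cache.get(row["key"])
            if obj is None:
                new_objects.append(Item(**row))      # create a new object
            else:
                obj.value = row["value"]             # update an existing one
            if i % 10000 == 0:                       # every 10,000 items
                db.session.bulk_save_objects(new_objects)
                db.session.flush()
                new_objects.clear()
        db.session.bulk_save_objects(new_objects)    # leftovers
        db.session.commit()                          # final commit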

The problem: before execution, my interpreter process weighs ~100 MB; after the first
run memory grows to roughly 500 MB, and after the second run to about 1 GB. If I keep
running this class, memory doesn't increase any further, so I think it's not a memory
leak but rather that Python won't release the allocated memory back to the OS. Maybe
I'm wrong.

What I tried after executing (roughly as in the sketch below):
 - gc.collect()
 - compared a tracemalloc snapshot taken after the run against one taken before it
   and searched the diff for garbage
 - searched with the "objgraph" library for references to the internal cache (the
   dictionary holding the elements loaded from the DB)
 - cleared the cache (dictionary)
 - db.session.expire_all()
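
For reference, the clean-up/inspection attempts looked roughly like this. Again just
a sketch: EtlJob is the hypothetical class from above and raw_rows is a placeholder
for the raw-data source.

import gc
import tracemalloc
import objgraph

tracemalloc.start()
snapshot_before = tracemalloc.take_snapshot()

job = EtlJob()
job.run(raw_rows)                                 # the ETL run described above

snapshot_after = tracemalloc.take_snapshot()
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:10]:
    print(stat)                                   # biggest remaining allocations

objgraph.show_backrefs([job.cache], filename="cache_refs.png")  # who holds the cache?

job.cache.clear()                                 # drop the internal cache
db.session.expire_all()                           # expire all loaded instances
gc.collect()                                      # force a full collection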

This class runs as a periodic Celery task, so once each worker has executed it at
least twice, every Celery worker needs 1 GB of RAM. Before Celery there was a cron
script and this class was executed via an API call, and the problem was the same. So
no matter how I run it, the interpreter consumes 1 GB of RAM after two runs.

I see a few solutions to this problem:
1. Execute this class in a separate process (see the sketch below). But I ran into
   errors when the same SQLAlchemy connection was shared between different processes.
2. Restart the Celery worker after executing this task by throwing an exception.
3. Use a separate queue for such tasks, but then the worker will sit idle most of
   the time.
All of this looks like a workaround. Do I have any other options?
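
For option 1, this is roughly what I mean. Sketch only: the engine.dispose() call in
the child is there because, as far as I understand, pooled connections must not be
carried across a fork; app, db, EtlJob and raw_rows are the placeholder names from
the sketches above.

import multiprocessing

from app import app, db        # hypothetical Flask app and SQLAlchemy handle

def _run_in_child(raw_rows):
    with app.app_context():
        db.engine.dispose()     # don't reuse connections inherited from the parent
        EtlJob().run(raw_rows)  # EtlJob: the sketch above

def run_etl_in_subprocess(raw_rows):
    p = multiprocessing.Process(target=_run_in_child, args=(raw_rows,))
    p.start()
    p.join()                    # the child's memory goes back to the OS when it exits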

I'm using:
Python - 3.6.13
Celery - 4.1.0
Flask-RESTful - 0.3.6
Flask-SQLAlchemy - 2.3.2

Thanks in advance!