[sqlalchemy] Re: using scoped_session in serial mode as well as parallel (multi-threaded) mode
Addendum: the types of error I'm seeing include SQLA apparently trying to execute notices from the PG server, e.g. one of the tracebacks I'm seeing is:

    #0  PyObject_Malloc (nbytes=86) at ../Objects/obmalloc.c:756
    #1  0x00455eb5 in PyString_FromString (str=0x2de0ece0 "WARNING: there is no transaction in progress\n") at ../Objects/stringobject.c:139

This smells like memory being overwritten. Any idea what might be causing this? Another possibility is that my usage of scoped_session is wrong. I can't find any explicit examples of usage in the official documentation, so this was partly guesswork on my part. Here is a sketch of my usage. The model I'm using is a thread pool, which lines up n jobs in a queue and has a pool of k threads executing them. The problems seem to occur when n is too large.

    Session = scoped_session(sessionmaker())
    Session.configure(bind=db)
    pool = ThreadPool(Session, self.jobs)
    ids = self.get_idlink_ids(Session)
    for i in ids:
        pool.add_task(self.load_geno_table_from_file, Session, i)
    pool.start()

where load_geno_table_from_file is

    def load_geno_table_from_file(self, session, i):
        session.execute("ALTER TABLE %s.geno%s SET ( autovacuum_enabled = false );" % (self.schema, i))
        self.drop_geno_table_constraints(session, 'geno%s' % i)
        self.copy_data_to_geno_table(session, 'geno%s' % i, 'tmp/geno%s' % i)
        self.restore_geno_table_constraints(session, 'geno%s' % i)
        session.execute("ALTER TABLE %s.geno%s SET ( autovacuum_enabled = true );" % (self.schema, i))

and add_task is

    def add_task(self, func, *args, **kargs):
        """Add a task to the queue."""
        self.queue.put((func, args, kargs))

So, Session is passed to load_geno_table_from_file, which executes inside each thread. Is that OK? I'm adding the rest of the thread code below for reference.
Regards, Faheem

***

    import Queue, threading, urllib2, time

    class Worker(threading.Thread):
        """Thread executing tasks from a given tasks queue."""
        def __init__(self, session, queue, num):
            threading.Thread.__init__(self)
            self.num = num
            self.queue = queue
            self.setDaemon(True)
            self.session = session

        def run(self):
            import traceback
            while True:
                func, args, kargs = self.queue.get()
                try:
                    func(*args, **kargs)
                except:
                    traceback.print_exc()
                self.queue.task_done()

    class ThreadPool:
        """Pool of threads consuming tasks from a queue."""
        def __init__(self, session, num_threads):
            from geno import Geno_Shard
            self.queue = Queue.Queue()
            self.workerlist = []
            self.num = num_threads
            self.session = session
            for i in range(num_threads):
                self.workerlist.append(Worker(session, self.queue, i))

        def add_task(self, func, *args, **kargs):
            """Add a task to the queue."""
            self.queue.put((func, args, kargs))

        def start(self):
            for w in self.workerlist:
                w.start()

        def wait_completion(self):
            """Wait for completion of all the tasks in the queue."""
            self.queue.join()

On Thu, 12 Aug 2010, Faheem Mitha wrote:

> Hi, I'm using scoped_session with PostgreSQL to run multiple threads, each thread creating a table. However, I'm getting some pretty weird and inconsistent errors (segfaults from Python with some very peculiar errors), and I was wondering if my usage was at fault. The program has a serial bit and a parallel bit, and currently I'm doing something like this.
> For the serial bit I do
>
>     db = create_engine(dbstring)
>     Session = sessionmaker()
>     session = Session(bind=db)
>
> and then later I do
>
>     session.close()
>     db = create_engine(self.dbstring)
>     Session = scoped_session(sessionmaker())
>     Session.configure(bind=db)
>
> Looking at this, it seems likely that it would be better to just use scoped_session everywhere, that is, just start with
>
>     db = create_engine(self.dbstring)
>     Session = scoped_session(sessionmaker())
>     Session.configure(bind=db)
>
> [proceed with using Session in serial mode and eventually use it in parallel mode too]
>
> I'm basically writing to confirm that it is OK to use scoped_session in this way. The way I'm doing it looks a little dodgy. I don't know if this is really the cause of my problem - just clutching at straws here. Thanks in advance. Please CC me on any reply.
>
> Regards, Faheem

--
You received this message because you are subscribed to the Google Groups sqlalchemy group. To post to this group, send email to sqlalch...@googlegroups.com. To unsubscribe from this group, send email to sqlalchemy+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en.
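[For reference, here is a modern-Python rendering of the thread-pool recipe quoted above. The Worker/ThreadPool names mirror the posted code, but the body is an illustrative sketch, not the original application: it omits the scoped_session argument the original passed to each task, and adds a tiny self-contained demo.]

```python
# Sketch of the thread-pool pattern from the posted code, in modern
# Python: queue.Queue plus daemon worker threads pulling
# (func, args, kwargs) tuples off a shared queue.
import queue
import threading
import traceback

class Worker(threading.Thread):
    """Thread executing tasks from a given task queue."""
    def __init__(self, tasks):
        super().__init__(daemon=True)
        self.tasks = tasks

    def run(self):
        while True:
            func, args, kwargs = self.tasks.get()
            try:
                func(*args, **kwargs)
            except Exception:
                traceback.print_exc()
            finally:
                self.tasks.task_done()

class ThreadPool:
    """Pool of threads consuming tasks from a queue."""
    def __init__(self, num_threads):
        self.tasks = queue.Queue()
        self.workers = [Worker(self.tasks) for _ in range(num_threads)]

    def add_task(self, func, *args, **kwargs):
        self.tasks.put((func, args, kwargs))

    def start(self):
        for w in self.workers:
            w.start()

    def wait_completion(self):
        self.tasks.join()

# Tiny demo: square ten numbers across four threads.
results = []
lock = threading.Lock()

def square(i):
    with lock:
        results.append(i * i)

pool = ThreadPool(4)
for i in range(10):
    pool.add_task(square, i)
pool.start()
pool.wait_completion()
assert sorted(results) == [i * i for i in range(10)]
```

Note the lock around the shared list: the queue itself is thread-safe, but anything the tasks touch in common still needs its own synchronization.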
[sqlalchemy] Re: using scoped_session in serial mode as well as parallel (multi-threaded) mode
Another reply to my own message, sorry. Another thing I might be doing wrong is my usage of MetaData. The code I have looks like

    db = create_engine(self.dbstring)
    meta = self.table_dict['metadata']
    meta.bind = db
    meta.create_all()
    Session = scoped_session(sessionmaker())
    Session.configure(bind=db)
    pool = ThreadPool(Session, self.jobs)
    ids = self.get_idlink_ids(Session)
    for i in ids:
        pool.add_task(self.load_geno_table_from_file, Session, i)
    pool.start()

Perhaps creation of this metadata instance is not thread-safe? I found a reference to ThreadLocalMetaData. Would it be better to use that instead?

Regards, Faheem.

On Thu, 12 Aug 2010, Faheem Mitha wrote:

> Addendum: the types of error I'm seeing include SQLA trying to execute notices from the PG server [...]
[sqlalchemy] Re: using scoped_session in serial mode as well as parallel (multi-threaded) mode
On Thu, 12 Aug 2010 08:47:33 -0400, Michael Bayer mike...@zzzcomputing.com wrote:

> On Aug 12, 2010, at 5:10 AM, Faheem Mitha wrote:
>
>> Hi, I'm using scoped_session with PostgreSQL to run multiple threads, each thread creating a table.
>
> It's generally a poor application practice for an application to need new permanent tables on the fly. I think reddit's application might do this, but it's considered poor design. Database schemas are supposed to be fixed.

I'm not sure what you mean by "on the fly". The app is creating the tables for later use. It is parallelizing the table creation for performance reasons - there are a lot of tables.

>> However, I'm getting some pretty weird and inconsistent errors (segfaults from Python with some very peculiar errors), and I was wondering if my usage was at fault. The program has a serial bit and a parallel bit, and currently I'm doing something like this. [...] I'm basically writing to confirm that it is OK to use scoped_session in this way. The way I'm doing it looks a little dodgy. I don't know if this is really the cause of my problem - just clutching at straws here. Thanks in advance. Please CC me on any reply.
>
> You can make as many scoped_sessions, metadatas, etc. as you want; none of that would cause a segfault. They are just Python objects. It's only if you share a psycopg2 connection between threads and have different threads hammer on it simultaneously that there would be issues like that.
> If you have a single session, and share that among threads which access it concurrently, this will produce that result. There should be no need to guess about it. If multiple threads are hitting the session you have above, then that's a likely cause of your issue.

Hi Mike,

Thanks for the response, but I don't follow. When you say "multiple threads are hitting the session you have above", which session are you referring to? There is more than one object above that could be called a session, i.e.:

Session1 in

    Session1 = sessionmaker()

session1 in

    session1 = Session1(bind=db)

Session2 in

    Session2 = scoped_session(sessionmaker())

Let me try to ask a precise question. If I do

    Session = scoped_session(sessionmaker())

then is it OK for this Session object to be passed around between multiple threads and used directly, as in

    Session.commit()

Does this correspond to a single psycopg2 connection? If it does, and this usage is wrong, should I be creating separate sessions within each thread, like

    session = Session()

and then doing

    session.commit()

within each thread? Or something else? My usage is based on examples online. My later posts have more details, along with healthy amounts of confusion. I apologise for my cluelessness, sorry.

Regards, Faheem
Re: [sqlalchemy] Re: using scoped_session in serial mode as well as parallel (multi-threaded) mode
On Aug 12, 2010, at 9:25 AM, Faheem Mitha wrote:

> Hi Mike,
>
> Thanks for the response, but I don't follow. When you say "multiple threads are hitting the session you have above", which session are you referring to? There is more than one object above that could be called a session.

I don't actually know; it would require that I have a full install of your application for me to run and step through with a debugger to fully understand what it's doing. All I can say from here is that the errors you have suggest concurrent access to a single psycopg2 connection resource, and that a single Session references a single connection when in use. A MetaData object does not, nor does an Engine - only a Session. If you remove all threading from your application and the errors go away, then you know you're accessing some resource illegally.

> Let me try to ask a precise question. If I do
>
>     Session = scoped_session(sessionmaker())
>
> then is it OK for this Session object to be passed around between multiple threads and used directly, as in
>
>     Session.commit()
>
> Does this correspond to a single psycopg2 connection? If it does, and this usage is wrong, should I be creating separate sessions within each thread?

That's a scoped_session, which is threadsafe. Everything you call upon it will acquire a Session object from a thread-local context, and commit() is called on that (for information on thread locals, see http://docs.python.org/library/threading.html#threading.local). If you pass a scoped_session from one thread to another, and the second thread calls commit(), the second thread is not affecting the transaction begun by the first. They are two separate transactions.
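[The thread-local behaviour Michael describes can be sketched with the stdlib threading.local. This is only an illustration of the idea that each thread asking the registry for a session gets its own object; it is not SQLAlchemy's actual scoped_session implementation, and FakeSession is a hypothetical stand-in.]

```python
# Sketch: a callable registry that hands each thread its own object.
import threading

class FakeSession:
    """Stand-in for a real Session; one instance per thread."""
    pass

class ThreadLocalRegistry:
    def __init__(self, factory):
        self.factory = factory
        self.registry = threading.local()

    def __call__(self):
        # First call in a given thread creates that thread's object;
        # later calls in the same thread return the same object.
        try:
            return self.registry.value
        except AttributeError:
            obj = self.registry.value = self.factory()
            return obj

Session = ThreadLocalRegistry(FakeSession)

# Same thread -> same object every time.
assert Session() is Session()

# A different thread sees a different object.
seen = []
t = threading.Thread(target=lambda: seen.append(Session()))
t.start()
t.join()
assert seen[0] is not Session()
```

This is why two threads calling commit() through the same scoped handle are operating on two separate transactions: they never share the underlying object.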
[sqlalchemy] Re: using scoped_session in serial mode as well as parallel (multi-threaded) mode
On Thu, 12 Aug 2010 09:44:04 -0400, Michael Bayer mike...@zzzcomputing.com wrote:

> I don't actually know; it would require that I have a full install of your application for me to run and step through with a debugger to fully understand what it's doing. All I can say from here is that the errors you have suggest concurrent access to a single psycopg2 connection resource, and that a single Session references a single connection when in use. A MetaData object does not, nor does an Engine - only a Session. If you remove all threading from your application and the errors go away, then you know you're accessing some resource illegally.

Yes, I see. The error does not show up unless I run multiple threads, and I agree with your interpretation. If MetaData is threadsafe, then using ThreadLocalMetaData is not necessary?

> That's a scoped_session, which is threadsafe. Everything you call upon it will acquire a Session object from a thread-local context, and commit() is called on that. If you pass a scoped_session from one thread to another, and the second thread calls commit(), the second thread is not affecting the transaction begun by the first. They are two separate transactions.

OK. Thanks for the confirmation. So, if I were to use scoped sessions systematically everywhere, this problem would likely disappear. Can you confirm that there is no reason not to use scoped sessions everywhere, even in serial execution? Of course, if that is the case, then I wonder why non-scoped sessions are used at all.

Regards, Faheem.
Re: [sqlalchemy] Re: using scoped_session in serial mode as well as parallel (multi-threaded) mode
On Aug 12, 2010, at 11:47 AM, Faheem Mitha wrote:

> Yes, I see. The error does not show up unless I run multiple threads, and I agree with your interpretation. If MetaData is threadsafe, then using ThreadLocalMetaData is not necessary?

ThreadLocalMetaData is not necessary and is not used for this purpose.

> OK. Thanks for the confirmation. So, if I were to use scoped sessions systematically everywhere, this problem would likely disappear.

That's not necessarily true - if you share individual persistent objects among threads, they reference their owning session in order to load additional state. If you pass objects between threads you should merge() them into the current thread's session first, then use that result.

> Can you confirm that there is no reason not to use scoped sessions everywhere, even in serial execution? Of course, if that is the case, then I wonder why non-scoped sessions are used at all.

scoped_sessions are usually the default choice for web applications, since they desire distinct transactions and object state for individual threads. They are overkill, and potentially confusing or inappropriate, in other situations, however.
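[A minimal sketch of the merge() advice, using SQLAlchemy against in-memory SQLite rather than the PostgreSQL setup in this thread. The Item class is hypothetical, and the two sessions run sequentially in one thread purely to show the mechanics; the API shown assumes SQLAlchemy 1.4 or later.]

```python
# An object loaded in one session is merge()d into a second session
# before being used there, yielding a distinct copy owned by the
# second session.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.pool import StaticPool

Base = declarative_base()

class Item(Base):
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)
    name = Column(String)

# StaticPool keeps a single shared connection, so both sessions see
# the same in-memory database.
engine = create_engine("sqlite://", poolclass=StaticPool)
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine, expire_on_commit=False)

s1 = Session()
obj = Item(id=1, name="geno1")
s1.add(obj)
s1.commit()              # obj is owned by s1

s2 = Session()
merged = s2.merge(obj)   # copy obj's state into s2's identity map
assert merged is not obj # a distinct instance, safe to use in s2
assert merged in s2 and obj not in s2
```

The point of the copy is that `merged` loads any additional state through s2, so nothing in s2's thread ever touches s1's connection.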
[sqlalchemy] Re: using scoped_session in serial mode as well as parallel (multi-threaded) mode
On Thu, 12 Aug 2010 12:47:37 -0400, Michael Bayer mike...@zzzcomputing.com wrote:

> That's not necessarily true - if you share individual persistent objects among threads, they reference their owning session in order to load additional state. If you pass objects between threads you should merge() them into the current thread's session first, then use that result.

I see. That's very enlightening. Can one query such objects to determine their owning session? Some attribute, perhaps?

> scoped_sessions are usually the default choice for web applications, since they desire distinct transactions and object state for individual threads. They are overkill, and potentially confusing or inappropriate, in other situations, however.

I'm not sure why they would be potentially confusing. What are some of the downsides? I'd have thought that not having shared state was less confusing.

Regards, Faheem.
Re: [sqlalchemy] Re: using scoped_session in serial mode as well as parallel (multi-threaded) mode
On Aug 12, 2010, at 2:41 PM, Faheem Mitha wrote:

> I see. That's very enlightening. Can one query such objects to determine their owning session? Some attribute, perhaps?

See http://www.sqlalchemy.org/docs/session.html#frequently-asked-questions

> I'm not sure why they would be potentially confusing. What are some of the downsides? I'd have thought that not having shared state was less confusing.

The scoped_session is a proxy object to the real object. It is less confusing for those unfamiliar with thread locals and proxy objects to deal with a Session directly.
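[The proxy idea can be sketched in a few lines: attribute access on the scoped handle is forwarded to the current thread's underlying object. RecordingSession and ScopedProxy below are hypothetical stand-ins for illustration, not SQLAlchemy's real classes.]

```python
# Sketch: a proxy whose attribute lookups are delegated to a
# per-thread object created on first use.
import threading

class RecordingSession:
    """Stand-in session that records which methods were called."""
    def __init__(self):
        self.calls = []
    def commit(self):
        self.calls.append("commit")

class ScopedProxy:
    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def _current(self):
        if not hasattr(self._local, "session"):
            self._local.session = self._factory()
        return self._local.session

    def __getattr__(self, name):
        # Called only for attributes not found on the proxy itself;
        # everything else is forwarded to this thread's session.
        return getattr(self._current(), name)

Session = ScopedProxy(RecordingSession)
Session.commit()   # proxied to this thread's underlying session
assert Session._current().calls == ["commit"]

# commit() in another thread touches a different underlying session.
other = {}
def worker():
    Session.commit()
    other["session"] = Session._current()

t = threading.Thread(target=worker)
t.start()
t.join()
assert other["session"] is not Session._current()
assert Session._current().calls == ["commit"]
```

This is also the source of the confusion Michael mentions: `Session` here looks like a session, but every attribute access quietly goes through the proxy indirection.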