Since you are effectively overwriting the table with the new file
contents, the fastest approach may well be to truncate the table and then
insert all of the contents. If you were instead just appending and
updating, session.merge() is a convenient way to do that, though I'm not
sure it's the fastest.
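
Roughly, both options could look like the sketch below; Product, csv_rows
and the "products" table name are assumed here, and TRUNCATE requires a
backend that supports it:

    from sqlalchemy import text

    # Overwrite everything: truncate, then insert all rows in one executemany batch.
    session.execute(text("TRUNCATE TABLE products"))
    session.execute(Product.__table__.insert(), csv_rows)   # csv_rows: list of dicts
    session.commit()

    # Or append/update: merge() checks each primary key and issues an
    # UPDATE for existing rows or an INSERT for new ones.
    for row in csv_rows:
        session.merge(Product(**row))
    session.commit()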
On Sep 7, 5:53 pm, Vlad K. v...@haronmedia.com wrote:
Great, thanks!
.oO V Oo.
On 09/06/2011 04:48 PM, Michael Bayer wrote:
On Sep 6, 2011, at 10:40 AM, Vlad K. wrote:
I have a products database which is synchronized daily with an external
source via a csv file. There are several thousand rows in question. The
synchronization does two things:
1. Update only the price, if it has changed, for existing products
2. Insert new products, with all fields from the csv, if they don't exist
But basically, for each row in the csv, after the row is processed (one of
the above two things is done; a rough sketch follows below), I don't need
the object in the session anymore. Memory and performance are of course an
issue, and I can't find a way to test memory consumption with or without
expunge_all(), so my questions are:
1. Do I need to session.expunge_all() after each csv row is processed, or
are they automatically garbage collected?
2. Is there any significant overhead inherent in expunge_all() that I'm
not seeing right now?
Performance-wise, it seems the task completes in more or less the same time
with or without expunge_all().
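
For reference, the per-row processing described above might look roughly
like this sketch; the Product model, its product_id primary key and its
price column are assumed names:

    for row in csv_rows:
        product = session.query(Product).get(row["product_id"])
        if product is None:
            # 2. insert new products with all fields from the csv
            session.add(Product(**row))
        elif product.price != row["price"]:
            # 1. update only the price if it changed
            product.price = row["price"]
    session.commit()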
In modern SQLAlchemy, the Session maintains only weak references to objects
that are clean, that is, persistent in the database with no pending changes
to be flushed. Once all references to them are lost, they are garbage
collected by the Python interpreter. Note that objects are strongly
referenced while they are present in a collection or attribute of a parent
object, until that parent is also garbage collected. There is some overhead
to the processing that occurs when an object is dereferenced and removed
from the Session (weakref callbacks handle the accounting), but calling
expunge_all() probably isn't doing much here, as the objects are likely
being cleaned out in the same way regardless.
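
That weak-referencing behavior can be observed with a plain Python weakref;
a minimal sketch, assuming a mapped Product class and no other strong
references to the loaded instance (no pending changes, no parent collection
holding onto it):

    import gc
    import weakref

    p = session.query(Product).first()   # loaded clean: the Session holds only a weak reference
    ref = weakref.ref(p)

    p = None                             # drop the last strong reference held by our code
    gc.collect()

    print(ref() is None)                 # True: the clean instance has been garbage collected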
While I'm at it, I also need to delete rows in the database that do not
have a corresponding row in the csv file (say, linked by a csv_key field).
The first solution that comes to mind is building a list of keys from the
csv file (a few thousand keys) and then doing:
session.query(Product).filter(not_(Product.product_id.in_(csv_keys))).delete()
I believe there is less overhead in sending such a large (but single!)
query to the database and letting it determine what to delete by itself,
than in selecting each row in the database, checking on the application
side whether its csv_key exists in the csv_keys list, and then issuing
delete statements for the rows that match the criteria. Am I wrong?
That's definitely a dramatically faster way to do things than loading each
record individually and marking it as deleted - it's the primary reason
delete() and update() are there. You'll probably want to pass False as the
value of synchronize_session to the delete() call so that it doesn't go
through the effort of locating local records that were affected (unless you
need that feature).
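
Put together, the bulk delete with session synchronization turned off would
look something like this (Product, product_id and csv_keys as above):

    from sqlalchemy import not_

    session.query(Product).filter(
        not_(Product.product_id.in_(csv_keys))
    ).delete(synchronize_session=False)
    session.commit()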