Hey guys,

I have a Pylons back-end running on SQLAlchemy. I have a script that
reads a tree of XML files from an HTTP server (it downloads an xml X,
and then downloads that X's children, and then iterates the children,
and so forth in recursion). Each xml file represents an SQLAlchemy
model.

The problem is that I have thousands of these xml's (sometimes 5000,
sometimes 26000). I was able to optimize the download process with
HTTP pooling, but I cannot seem to think of the best approach as to
committing the models to the DB. Every time an xml file is downloaded,
I create an orm object for it and add it to my session.

Problem 1: some xml's will exists multiple times in the tree so I am
checking that there is no duplicate insertion. Is the check in my code
optimal or should I keep an indexed collection on the side and use it
to check for duplicates?

Problem 2: my autocommit is set to False because I don't want to
commit on every add (not because its bad design, but because of
performance). But I also don't want to iterate the entire tree of
thousands of categories without committing at all. Therefor, I created
a constant number upon which my code commits the data. Is this a good
approach? What would be a good number for that? It might be important
to mention that I do not know in advance how many xml's I am looking
at.

Here is what my pseudo-code looks like now (ignore syntax errors):

    count = 0
    COMMIT_EVERY = 50

    def recursion(parent):
        global count, COMMIT_EVERY
        pool = get_http_connection_pool(...)
        sub_xmls = get_sub_xmls(pool, parent)

        if sub_xmls == None:
            return

        for sub_xml in sub_xmls:
            orm_obj = MyObj(sub_xml)

            duplicate = Session.query(MyObj).filter(MyObj.id ==
orm_obj.id).first()
            if not duplicate:
                Session.add(orm_obj)
                count = count + 1
                if count % COMMIT_EVERY == 0:
                    Session.commit()
                recursion(orm_obj.id)

    recursion(0)

-- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To post to this group, send email to sqlalchemy@googlegroups.com.
To unsubscribe from this group, send email to 
sqlalchemy+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/sqlalchemy?hl=en.

Reply via email to