You can make a simple dict cache for your authors only and let it grow.
If memory is limited, you can do something more sophisticated (say,
keep the most recent 1000 authors, or the most used ones, or whatever
criterion fits): look the name up in the cache, query the db if it is
not there, and create a new Author if it is not in the db either.
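
Something like this rough sketch, for example (it assumes your session
and the Author class from your code below; the cache dict and the
get_author helper are just names I made up):

author_cache = {}          # name -> Author, grows as you parse

def get_author(session, name):
    # 1. check the in-memory cache first
    author = author_cache.get(name)
    if author is None:
        # 2. not cached - ask the database
        author = session.query(Author).filter_by(name=name).first()
        if author is None:
            # 3. not in the db either - make a brand new one
            author = Author(name)
            session.add(author)
        author_cache[name] = author
    return author

Then while parsing: authors = [get_author(session, a) for a in names].
To cap memory, evict entries from author_cache by whatever criterion
you like (age, usage count, ...).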

But if it's going to be millions of records, it may run for days. You
can parse in big portions (say 1000-10000 entries at once) and
mass-insert at the table level, bypassing the ORM. Or, going even
lower (outside SA), use your db server's own bulk-import mechanism:
create a huge dump file and feed it in.
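
E.g. a rough sketch of such a table-level mass insert - it assumes the
authors_table from your code below, an engine, and a hypothetical
parse_names() generator standing in for your parser:

conn = engine.connect()
batch = []
for name in parse_names(bigfile):          # your parser goes here
    batch.append({'name': name})
    if len(batch) >= 10000:
        # executemany-style insert straight against the table, no ORM
        conn.execute(authors_table.insert(), batch)
        batch = []
if batch:                                  # flush the leftover tail
    conn.execute(authors_table.insert(), batch)

Note you would have to de-duplicate the names yourself (a set is
enough) before inserting, or the unique constraint on name will reject
repeats; the paper-author links would need a similar pass against
papers_to_authors_table.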

But YMMV; try the simpler/shorter ways first.
Once you get an idea of the speed, choose.

svil

On Tuesday 13 January 2009 13:21:40 Nathan Harmston wrote:
> Hi everyone,
>
> I'm trying to get to grips with SQLAlchemy by parsing a file,
> extracting certain fields and then storing them in a database. I am
> currently using the object relational mapper and trying to get a
> sense of how it all works. I have two objects, Paper and Author,
> which are in a many-to-many relationship. So currently I am parsing
> the file and then creating a Paper object:
>
> names = [ list of author names ]
> authors = [ Author(a) for a in names ]
> p = Paper(title, authors)
> session.add(p)
> session.commit()
>
> However, when I get to a new paper with an already existing author
> I get the following error,
> sqlalchemy.exceptions.IntegrityError: (IntegrityError) (1062,
> "Duplicate entry 'Sole R' for key 'name'") u'INSERT INTO authors
> (name) VALUES (%s)' ['Sole R']
>
> which makes sense; however, I'd prefer not to query the database
> over and over to check whether "Sole R" has already been added to
> the database. So I tried a different approach, passing the name
> strings directly to Paper instead of building Author objects, and
> got this error:
> AttributeError: 'str' object has no attribute '_state'
>
> So I'm out of ideas. I want to make this as fast as possible
> without firing off lots of queries to the database to check for
> identity. Is there a way supported by the ORM that I've missed
> completely, or is it more normal to create a cache whereby you add
> things to a dict and look for identity in the dict, otherwise
> querying the database? The problem is that some of the files I am
> going to be parsing are quite large (GB), and I don't want to
> saturate my database server with requests; likewise, I only have
> limited memory on my machines.
> Have I missed something simple in the documentation?
>
> My code is below:
>
> Many thanks in advance,
>
> Nathan
>
>
>
> document_table = Table('documents', metadata,
>                        Column('id', Integer, primary_key=True),
>                        Column('title', String, nullable=False),
>                           )
>
> authors_table = Table('authors', metadata,
>                       Column('id', Integer, primary_key=True),
>                       Column('name', String(40), unique=True),
>
>                       )
>
> papers_to_authors_table = Table('p2a_assocation', metadata,
>                                 Column('document', Integer,
>                                        ForeignKey('documents.id')),
>                                 Column('author', Integer,
>                                        ForeignKey('authors.id'))
>                                 )
>
> class Author(object):
>     def __init__(self, name):
>         self.name = name
>
> class Paper(object):
>     def __init__(self, title, authors):
>         self.title = title
>         self.authors = authors
>
>
> mapper(Paper, document_table,
>        properties={'authors': relation(Author,
>                                        secondary=papers_to_authors_table,
>                                        backref='publications')})
> mapper(Author, authors_table)
>
> 

