Re: Advice on using model to bulk/batch load database

2006-03-23 Thread Adrian Holovaty

On 3/23/06, DavidA <[EMAIL PROTECTED]> wrote:
> Occasionally I will modify my schema, drop the old tables and reload
> all the data from the "cached" files. Over time that could easily be
> millions of rows and some optimization of my technique will be in
> order. But for now, the simple approach (parse a row, create a django
> model object, stuff it and save it) works fine.

For bulk-loading stuff into a database, I would definitely recommend
bypassing the Django layer entirely. There's no need to use an ORM for
that; use whatever bulk-loading capability your database provides. Or,
if you absolutely want to do it in Python, use the
cursor.executemany() functionality of the Python DB-API layer.

Adrian

--
Adrian Holovaty
holovaty.com | djangoproject.com

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users
-~--~~~~--~~--~--~---



Re: Advice on using model to bulk/batch load database

2006-03-23 Thread DavidA

Eric,

The typical mode of operation will be to incrementally load new data
each night - about 10 files each containing dozens to hundreds of rows.
Maybe only a couple will be in the thousands category. But I've also
created my scripts to work in two steps: 1) download the data from
various sources (FTP, HTTP, web scrape, custom API) and save the files
on a server, 2) parse any new files and load into the database.

Occasionally I will modify my schema, drop the old tables and reload
all the data from the "cached" files. Over time that could easily be
millions of rows and some optimization of my technique will be in
order. But for now, the simple approach (parse a row, create a django
model object, stuff it and save it) works fine.

-Dave


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users
-~--~~~~--~~--~--~---



Re: Advice on using model to bulk/batch load database

2006-03-23 Thread Eric Walstad


DavidA wrote:
> I am prepopulating my database with data from a number of flat files.
> I've written a small script to do this using my model class. While this
> works, there are a couple of kludges I made to get it to work and I was
> hoping someone could advise my on a better way to do this.

Hey Dave,

You didn't mention how many records you are importing nor how often you
will be re-importing your flat files.  I regularly have to import files
with ~2.5M records and I've found that using an ORM framework is too
slow for my requirements.

My first script to do the import used Django's ORM.  That script looped
over the lines of the file, slurping in the data, creating a new django
object out of the data and .save()ing the object.  It worked well and
the code was easy to read, but I needed more speed as this script was
going to need to run at least weekly and had to run at night,
completing before my users started hammering on the web app in the
morning.  The second script I wrote would drop existing indexes, then
iterating over the lines of the file, it worked directly with Django's
db object to INSERT the data into the (postgresql) database.  The
indexes were recreated at the end of the script.  If I recall
correctly, that script took around 5-6 hours to complete the import.
The script I'm using now iterates over the lines of the file, formating
the data into a new sql file that is psql (the postgresql command line
tool) friendly.  The script takes four minutes to create the sql file.
It then uses python's os.popen4 to load the python-generated sql file
into psql:

cmd = "psql -U my_user my_db -f %s" % path_to_sql_file
s_in, s_out = os.popen4(cmd)

This approach cut the import time down to just under 2 hours (the sql
file also drops indexes before importing the data, then recreates the
indexes afterwords).

So, If you are importing lots of data and have to do it frequently or
many times and time is of the essence, my recommendation is to give the
read-file/write-file/db-command-line three-step a try.

Good luck and best regards,

Eric.


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users
-~--~~~~--~~--~--~---



Re: Advice on using model to bulk/batch load database

2006-03-23 Thread DavidA

> BTW, if you do this repeatedly and format of those flat files is not a
> requirement it is possible to have all initial data in a form of SQL
> script. If you place it in /sql/.sql then Django will
> include it in 'sqlall' command right after DB creation.

While I'm not in control of the file format in this case, your
suggestion will help solve a different problem I have: I use full-text
searching on some of my tables and I was manually running a little sql
script after I did a 'manage.py sqlall  | mysql ...':

CREATE FULLTEXT INDEX ix_ft_trade_all
ON trade (tradeType, investIdType, investId, portfolio, book,
strategy,
  counterparty, custodian, account);
...

and then I use the where form of get_list(...,
where=["match() against('')"]) to search
for them in one of my views.

Your suggestion automates that step (which I forget to do quite a bit!)

> You can use the very class that is used for POSTs - MultiValueDict. It's
> in django.utils.datastructures:
>
> from django.utils.datastructures import MultiValueDict
> d = MultiValueDict(your_dict)

I made the change and its working fine. One thing you have to be
careful with is to build a MultiValueDict where the values are
lists/tuples. I was putting scalars in and it was interpreting strings
as sequences which caused problems in do_html2python.

-Dave


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users
-~--~~~~--~~--~--~---



Re: Advice on using model to bulk/batch load database

2006-03-23 Thread DavidA

> You can set the environment variable at the top of your scripts by
> importing the Python os module, just like you can import sys and set
> sys.path.

Thanks, I added this and it works fine:

os.environ['DJANGO_SETTINGS_MODULE'] = 'data.settings'

> The Python datetime and time modules are huge resources.  In this case
> you want time.strptime() to parse an arbitrary time format into a tuple
> and then you can use that tuple to create a datetime.date object which
> you can directly assign to the DateField.

I had already done that - I was hoping I could tell the DateField what
format it should expect the date in when it parses it via html2python.
Something like:

myDateField.inputFormats = ['%d-%b-%y', '%m/%d/%y']

and then when I call html2python on it (indirectly via the manipulator)
it will try each of those until it gets a valid date back. Right now,
DateField is hardwired to use '%Y-%m-%d', so I'm pre-converting all
date fields before I send the row from the file to the manipulator.
Like I said, its a small detail, just the one place where I my
parser/loader needs to be aware of the type of columns and I was hoping
to eliminate that.

Thanks,
-Dave


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users
-~--~~~~--~~--~--~---



Re: Advice on using model to bulk/batch load database

2006-03-22 Thread Ivan Sagalaev

DavidA wrote:

>I am prepopulating my database with data from a number of flat files.
>  
>
BTW, if you do this repeatedly and format of those flat files is not a 
requirement it is possible to have all initial data in a form of SQL 
script. If you place it in /sql/.sql then Django will 
include it in 'sqlall' command right after DB creation.

>Second, I'm using a manipulator to convert string values from the file
>to python values but the do_html2python method expects a POST which
>isn't quite the same as a dict. I had to write a little wrapper dict to
>make it compatible:
>  
>
You can use the very class that is used for POSTs - MultiValueDict. It's 
in django.utils.datastructures:

from django.utils.datastructures import MultiValueDict
d = MultiValueDict(your_dict)

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users
-~--~~~~--~~--~--~---



Re: Advice on using model to bulk/batch load database

2006-03-22 Thread Max Battcher

DavidA wrote:
> First, to use the model from a script I had to set the
> DJANGO_SETTINGS_MODULE environment variable and add the base directory
> to sys.path. I've seen this before so I guess this is just the way it
> is but it would be nice not to have dependencies on environment
> variables. Is there a different way to do this?

You can set the environment variable at the top of your scripts by 
importing the Python os module, just like you can import sys and set 
sys.path.

> Finally, for dates I had to manually convert them because they weren't
> in the -mm-dd format required by DateField.html2python. It seems a
> bit limited to only support one date format in DateField but I guess
> that avoids the ambiguities with date formats. Since my model's fields
> are the same names as the column headings in the flat files, now I can
> map all the conversions automatically without doing anything
> field-specific, *except* for date fields. Any suggestions?

The Python datetime and time modules are huge resources.  In this case 
you want time.strptime() to parse an arbitrary time format into a tuple 
and then you can use that tuple to create a datetime.date object which 
you can directly assign to the DateField.

-- 
--Max Battcher--
http://www.worldmaker.net/
"I'm gonna win, trust in me / I have come to save this world / and in 
the end I'll get the grrrl!" --Machinae Supremacy, Hero (Promo Track)

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users
-~--~~~~--~~--~--~---



Advice on using model to bulk/batch load database

2006-03-22 Thread DavidA

I am prepopulating my database with data from a number of flat files.
I've written a small script to do this using my model class. While this
works, there are a couple of kludges I made to get it to work and I was
hoping someone could advise my on a better way to do this.

First, to use the model from a script I had to set the
DJANGO_SETTINGS_MODULE environment variable and add the base directory
to sys.path. I've seen this before so I guess this is just the way it
is but it would be nice not to have dependencies on environment
variables. Is there a different way to do this?

Second, I'm using a manipulator to convert string values from the file
to python values but the do_html2python method expects a POST which
isn't quite the same as a dict. I had to write a little wrapper dict to
make it compatible:

class MyDict(dict):
def getlist(self, key):
return (self[key],)

def setlist(self, key, value):
self[key] = value[0]

Again, not a big deal but it seems kludgy and I thought there might be
a better way. Any ideas?

Finally, for dates I had to manually convert them because they weren't
in the -mm-dd format required by DateField.html2python. It seems a
bit limited to only support one date format in DateField but I guess
that avoids the ambiguities with date formats. Since my model's fields
are the same names as the column headings in the flat files, now I can
map all the conversions automatically without doing anything
field-specific, *except* for date fields. Any suggestions?

Thanks,
-Dave


--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users
-~--~~~~--~~--~--~---