Re: Advice on using model to bulk/batch load database
My motivation for using the Django ORM layer was simply to reuse the logic that parses text values to native values (html2python) and any column name mapping that goes on, both of which are managed by the *Field members. I know that's not a huge win, but since I'm at an early point in the project where I'm tweaking the schema quite a bit, it's nice to know that if I change an int field to a float, add another field that was previously being skipped from the file, or rename a column, it just works. But I see your point - it remains to be seen whether I'm getting enough benefit to justify this approach.

Speed really isn't an issue right now. Each file takes a few seconds to load (up to a minute for the really large ones) and they run as scheduled jobs overnight. And this is all running on a VMware virtual machine on my lowly Dell desktop - I could copy the VM to a server and get a huge boost. If I _really_ cared about speed, since I'm using MySQL, I'd consider mapping a table directly to the file via the CSV storage engine and then just copying it from the CSV table to a "real" table.

You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com. To unsubscribe from this group, send email to [EMAIL PROTECTED]. For more options, visit this group at http://groups.google.com/group/django-users
Re: Advice on using model to bulk/batch load database
On 3/23/06, DavidA <[EMAIL PROTECTED]> wrote:
> Occasionally I will modify my schema, drop the old tables and reload
> all the data from the "cached" files. Over time that could easily be
> millions of rows and some optimization of my technique will be in
> order. But for now, the simple approach (parse a row, create a django
> model object, stuff it and save it) works fine.

For bulk-loading stuff into a database, I would definitely recommend bypassing the Django layer entirely. There's no need to use an ORM for that; use whatever bulk-loading capability your database provides. Or, if you absolutely want to do it in Python, use the cursor.executemany() functionality of the Python DB-API layer.

Adrian

--
Adrian Holovaty
holovaty.com | djangoproject.com
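Adrian's cursor.executemany() suggestion can be sketched in a few lines. This is a minimal illustration using sqlite3 as a stand-in for the real backend; the table and column names are invented here, not taken from Dave's schema:

```python
import sqlite3

# Sketch of bulk-loading via the DB-API layer, bypassing the ORM:
# pre-parsed rows go straight to cursor.executemany(). sqlite3 stands in
# for the real database; table and columns are invented for illustration.
rows = [
    ("AAPL", 100, 62.50),
    ("MSFT", 200, 27.10),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade (symbol TEXT, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trade (symbol, quantity, price) VALUES (?, ?, ?)", rows
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM trade").fetchone()[0]
print(count)  # 2
```

The driver can prepare the statement once and reuse it for every row, which is where the speedup over row-at-a-time ORM .save() calls comes from.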
Re: Advice on using model to bulk/batch load database
Eric,

The typical mode of operation will be to incrementally load new data each night - about 10 files, each containing dozens to hundreds of rows. Maybe only a couple will be in the thousands category. But I've also created my scripts to work in two steps: 1) download the data from various sources (FTP, HTTP, web scrape, custom API) and save the files on a server; 2) parse any new files and load them into the database.

Occasionally I will modify my schema, drop the old tables and reload all the data from the "cached" files. Over time that could easily be millions of rows and some optimization of my technique will be in order. But for now, the simple approach (parse a row, create a django model object, stuff it and save it) works fine.

-Dave
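Step 2 of Dave's pipeline (parse any new files, load each row) could look roughly like the sketch below. The directory layout, the "already loaded" ledger, and the loader callback are all invented for illustration; in Dave's setup the loader would create and .save() a Django model object:

```python
import csv
import pathlib
import tempfile

# Hedged sketch of "parse any new files and load them": scan the cache
# directory, skip files loaded on previous nights, hand each parsed row
# to a loader callback. Layout and ledger are invented for illustration.
def load_new_files(cache_dir, already_loaded, loader):
    for path in sorted(pathlib.Path(cache_dir).glob("*.csv")):
        if path.name in already_loaded:
            continue  # this file was loaded on a previous night
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                loader(row)  # in Dave's case: build a model object, .save()
        already_loaded.add(path.name)

# Demonstrate with a throwaway cache directory and a list as the "loader":
cache = tempfile.mkdtemp()
pathlib.Path(cache, "2006-03-23.csv").write_text("symbol,qty\nAAPL,100\n")
rows = []
load_new_files(cache, set(), rows.append)
print(rows)  # [{'symbol': 'AAPL', 'qty': '100'}]
```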
Re: Advice on using model to bulk/batch load database
DavidA wrote:
> I am prepopulating my database with data from a number of flat files.
> I've written a small script to do this using my model class. While this
> works, there are a couple of kludges I made to get it to work and I was
> hoping someone could advise me on a better way to do this.

Hey Dave,

You didn't mention how many records you are importing nor how often you will be re-importing your flat files. I regularly have to import files with ~2.5M records and I've found that using an ORM framework is too slow for my requirements.

My first script to do the import used Django's ORM. That script looped over the lines of the file, slurping in the data, creating a new django object out of the data and .save()ing the object. It worked well and the code was easy to read, but I needed more speed, as this script was going to need to run at least weekly and had to run at night, completing before my users started hammering on the web app in the morning.

The second script I wrote would drop the existing indexes, then, iterating over the lines of the file, work directly with Django's db object to INSERT the data into the (postgresql) database. The indexes were recreated at the end of the script. If I recall correctly, that script took around 5-6 hours to complete the import.

The script I'm using now iterates over the lines of the file, formatting the data into a new sql file that is psql (the postgresql command line tool) friendly. The script takes four minutes to create the sql file. It then uses python's os.popen4 to load the python-generated sql file into psql:

    cmd = "psql -U my_user my_db -f %s" % path_to_sql_file
    s_in, s_out = os.popen4(cmd)

This approach cut the import time down to just under 2 hours (the sql file also drops indexes before importing the data, then recreates the indexes afterwards).
So, if you are importing lots of data, have to do it frequently or many times, and time is of the essence, my recommendation is to give the read-file/write-file/db-command-line three-step a try.

Good luck and best regards,
Eric.
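The file-to-SQL step of Eric's three-step can be sketched as below. The table name, column names, and the quoting helper are invented for illustration, and the index names are made up; Eric's real script targets postgresql via psql:

```python
import csv
import io

# Hedged sketch of Eric's "read file, write a psql-friendly SQL file" step.
# Table name, columns, and index names are invented for illustration.
def quote(value):
    """Escape a value as a SQL string literal (single quotes doubled)."""
    return "'" + str(value).replace("'", "''") + "'"

def csv_to_sql(csv_text, table):
    """Turn CSV text (header row first) into a block of SQL statements."""
    out = io.StringIO()
    reader = csv.reader(io.StringIO(csv_text))
    columns = next(reader)
    # Drop the index up front, as Eric's generated file does:
    out.write("DROP INDEX IF EXISTS ix_%s;\n" % table)
    for row in reader:
        out.write("INSERT INTO %s (%s) VALUES (%s);\n"
                  % (table, ", ".join(columns),
                     ", ".join(quote(v) for v in row)))
    # Recreate the index after the data is in:
    out.write("CREATE INDEX ix_%s ON %s (%s);\n" % (table, table, columns[0]))
    return out.getvalue()

sql = csv_to_sql("symbol,qty\nAAPL,100\nO'R,5\n", "trade")
print(sql)
```

os.popen4 was current Python in 2006; today `subprocess.run(["psql", "-U", "my_user", "my_db", "-f", path])` would presumably fill the same role for the final step.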
Re: Advice on using model to bulk/batch load database
> BTW, if you do this repeatedly and format of those flat files is not a
> requirement it is possible to have all initial data in a form of SQL
> script. If you place it in <app>/sql/<model>.sql then Django will
> include it in 'sqlall' command right after DB creation.

While I'm not in control of the file format in this case, your suggestion will help solve a different problem I have: I use full-text searching on some of my tables and I was manually running a little sql script after I did a 'manage.py sqlall | mysql ...':

    CREATE FULLTEXT INDEX ix_ft_trade_all ON trade (tradeType,
        investIdType, investId, portfolio, book, strategy, counterparty,
        custodian, account);

... and then I use the where form of get_list(..., where=["match(...) against('...')"]) to search for them in one of my views. Your suggestion automates that step (which I forget to do quite a bit!)

> You can use the very class that is used for POSTs - MultiValueDict. It's
> in django.utils.datastructures:
>
>     from django.utils.datastructures import MultiValueDict
>     d = MultiValueDict(your_dict)

I made the change and it's working fine. One thing you have to be careful with is to build the MultiValueDict from values that are lists/tuples. I was putting scalars in and it was interpreting strings as sequences, which caused problems in do_html2python.

-Dave
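Dave's scalar-vs-list gotcha can be shown without Django at all: a multi-value mapping stores a list per key and pulls an element out of it on lookup, so a bare string gets sliced character by character. The row data below is invented sample data:

```python
# Dave's gotcha, sketched without Django. A multi-value dict stores a list
# per key and indexes into it on lookup; give it a bare string and the
# string itself plays the role of the list. Row data is invented.
row = {"portfolio": "Alpha", "quantity": "100"}

# Wrap every scalar in a one-element list before building the dict:
wrapped = {key: [value] for key, value in row.items()}

# Taking "the last element of the value" (as multi-value lookup does):
bad = row["portfolio"][-1]       # 'a'     -- last *character* of the string
good = wrapped["portfolio"][-1]  # 'Alpha' -- the intended scalar
print(bad, good)
```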
Re: Advice on using model to bulk/batch load database
> You can set the environment variable at the top of your scripts by
> importing the Python os module, just like you can import sys and set
> sys.path.

Thanks, I added this and it works fine:

    os.environ['DJANGO_SETTINGS_MODULE'] = 'data.settings'

> The Python datetime and time modules are huge resources. In this case
> you want time.strptime() to parse an arbitrary time format into a tuple
> and then you can use that tuple to create a datetime.date object which
> you can directly assign to the DateField.

I had already done that - I was hoping I could tell the DateField what format it should expect the date in when it parses it via html2python. Something like:

    myDateField.inputFormats = ['%d-%b-%y', '%m/%d/%y']

and then when I call html2python on it (indirectly via the manipulator) it will try each of those until it gets a valid date back. Right now, DateField is hardwired to use '%Y-%m-%d', so I'm pre-converting all date fields before I send the row from the file to the manipulator. Like I said, it's a small detail, just the one place where my parser/loader needs to be aware of the type of the columns, and I was hoping to eliminate that.

Thanks,
-Dave
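The inputFormats behaviour Dave wishes DateField had can be approximated in plain Python by trying each format in turn. The helper name and the format list are invented for illustration:

```python
from datetime import datetime

# Sketch of the "try each input format" behaviour Dave wanted from
# DateField. Helper name and format list are invented for illustration.
INPUT_FORMATS = ["%d-%b-%y", "%m/%d/%y", "%Y-%m-%d"]

def parse_date(text):
    """Return a datetime.date, trying each known format until one fits."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # wrong format, try the next one
    raise ValueError("no known date format matches %r" % text)

print(parse_date("23-Mar-06"))  # 2006-03-23
print(parse_date("3/23/06"))    # 2006-03-23
```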
Re: Advice on using model to bulk/batch load database
DavidA wrote:
> I am prepopulating my database with data from a number of flat files.

BTW, if you do this repeatedly and the format of those flat files is not a requirement, it is possible to have all initial data in the form of an SQL script. If you place it in <app>/sql/<model>.sql then Django will include it in the 'sqlall' command right after DB creation.

> Second, I'm using a manipulator to convert string values from the file
> to python values but the do_html2python method expects a POST which
> isn't quite the same as a dict. I had to write a little wrapper dict to
> make it compatible:

You can use the very class that is used for POSTs - MultiValueDict. It's in django.utils.datastructures:

    from django.utils.datastructures import MultiValueDict
    d = MultiValueDict(your_dict)
Re: Advice on using model to bulk/batch load database
DavidA wrote:
> First, to use the model from a script I had to set the
> DJANGO_SETTINGS_MODULE environment variable and add the base directory
> to sys.path. I've seen this before so I guess this is just the way it
> is but it would be nice not to have dependencies on environment
> variables. Is there a different way to do this?

You can set the environment variable at the top of your scripts by importing the Python os module, just like you can import sys and set sys.path.

> Finally, for dates I had to manually convert them because they weren't
> in the yyyy-mm-dd format required by DateField.html2python. It seems a
> bit limited to only support one date format in DateField but I guess
> that avoids the ambiguities with date formats. Since my model's fields
> have the same names as the column headings in the flat files, now I can
> map all the conversions automatically without doing anything
> field-specific, *except* for date fields. Any suggestions?

The Python datetime and time modules are huge resources. In this case you want time.strptime() to parse an arbitrary time format into a tuple, and then you can use that tuple to create a datetime.date object which you can directly assign to the DateField.

--
--Max Battcher--
http://www.worldmaker.net/
"I'm gonna win, trust in me / I have come to save this world / and in the end I'll get the grrrl!" --Machinae Supremacy, Hero (Promo Track)
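Max's strptime-to-date step in miniature; the input string and its format are invented sample data:

```python
import time
import datetime

# time.strptime() parses the text into a struct_time tuple, and the first
# three fields build a datetime.date that can be assigned to the DateField.
# The input string and its format are invented sample data.
parsed = time.strptime("23-Mar-2006", "%d-%b-%Y")
day = datetime.date(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
print(day)  # 2006-03-23
```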