three column dataset - additions and deletions
I'm going to be writing a utility that will be pulling three fields from a MySQL table. I've already got a sample dataset - there's a long int (which is a db key), a short string, and a looong string. Many rows. As it is, receive this data from the DB interface as a rather large tuple of tuples. I plan on hashing the long string field (both for convenience and security) and storing the set in a pickle. The idea is that this script will run periodically, pulling the table, and comparing the data gathered at that run to that stored by the previous, acting on changes made, and storing the current data back (to be referenced against in the next invocation) I figure it will be easy enough to determine changed hashes for a given key. What I'm unclear on is what the best type of structure to keep this data in, given that I need to modify the data after it comes in (replacing that long string with, say, an MD5 from hashlib) and both need to act on new rows (rows that don't exist in the 'old' data) and deleted rows (rows that only exist in the 'old' data). Keeping in mind that I'm a newbie here, and I'm probably not aware of most of the different ways to store such things. I shouldn't have any problems with the logic itself - I just know enough to know I don't know the best ways of doing things :) Any suggestions? I'm not asking for code or handholding, but some objects or datatypes to look into would be very helpful at this early stage. Thanks! -- http://mail.python.org/mailman/listinfo/python-list
Re: three column dataset - additions and deletions
On 2010-12-02, draeath draeath.spamt...@gmail.com wrote: The idea is that this script will run periodically, pulling the table, and comparing the data gathered at that run to that stored by the previous, acting on changes made, and storing the current data back (to be referenced against in the next invocation) So, basically, you want to store a local copy of the data and sync it to the original. I figure it will be easy enough to determine changed hashes for a given key. What I'm unclear on is what the best type of structure to keep this data in, given that I need to modify the data after it comes in (replacing that long string with, say, an MD5 from hashlib) and both need to act on new rows (rows that don't exist in the 'old' data) and deleted rows (rows that only exist in the 'old' data). You need to differentiate between the in memory data model and the storage model. Since this data comes from a database in the first place, I would dump it to an sqlite3 database from the beginning. You can use this to store, modify, and change the values as you receive them from the database. If you are looking for in-memory structures, then you haven't really provided us with enough information on the significance and organization of the data. -- http://mail.python.org/mailman/listinfo/python-list
Re: three column dataset - additions and deletions
On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote: Thanks for taking the time to check in on this, Tim! So, basically, you want to store a local copy of the data and sync it to the original. In a way. I only need to store one copy of the data, and make note of changes between it and the current data. You need to differentiate between the in memory data model and the storage model. Since this data comes from a database in the first place, I would dump it to an sqlite3 database from the beginning. You can use this to store, modify, and change the values as you receive them from the database. I thought of doing that, but given that I only need to store a single instance of the data, a simple pickle will do the job nicely (am I correct in reading that it can save/load any python object?) If you are looking for in-memory structures, then you haven't really provided us with enough information on the significance and organization of the data. The data columns: Long Int, String (under 30 chars), String (over 100 chars) The rows can scale up to hundreds, perhaps thousands. The integer is the database key, the shorter string is a user name, and the longer string is an access control definition. The whole idea of this script is to check, daily, for any added or removed users - or any altered access control definition. I realize this could likely all be done from inside the database itself - but altering the DB itself is not an option (as the product vendor is very touchy about that, and altering it can null our support agreement) -- http://mail.python.org/mailman/listinfo/python-list
Re: three column dataset - additions and deletions
On 2010-12-03, draeath draeath.spamt...@gmail.com wrote: On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote: Thanks for taking the time to check in on this, Tim! So, basically, you want to store a local copy of the data and sync it to the original. In a way. I only need to store one copy of the data, and make note of changes between it and the current data. Perhaps I am missing a sublty that makes those statements different. You need to differentiate between the in memory data model and the storage model. Since this data comes from a database in the first place, I would dump it to an sqlite3 database from the beginning. You can use this to store, modify, and change the values as you receive them from the database. I thought of doing that, but given that I only need to store a single instance of the data, a simple pickle will do the job nicely (am I correct in reading that it can save/load any python object?) So, you start by dumping the data from the remote server into an sqlite3 database table. What you end up with is a record=record copy of the original query (plus any other meta data that you want to add). Then, when the data changes, you apply those same changes to your local table (or just regenerate it since you seem to be pulling all of the information from the server anyway. The result is a *single instance of the data*. Why you would want to replace this with a pickle of a nested set up tuples or a homebrew on disk data structure is beyond me. Using sqlite3 is almost certainly faster and more functional then anything you are going to create without some serious work. If you are looking for in-memory structures, then you haven't really provided us with enough information on the significance and organization of the data. The data columns: Long Int, String (under 30 chars), String (over 100 chars) The rows can scale up to hundreds, perhaps thousands. Then those are the columns that you create for your local table. The integer is the database key, the shorter string is a user name, and the longer string is an access control definition. The whole idea of this script is to check, daily, for any added or removed users - or any altered access control definition. The question is how are you going to use this information once you have mirroed it locally. Most likely, from you description, you just need to access it as a local read only data store. Now compare the differences in how you would acces the data: pickle method: 1. You have to load the entire pickle into memory. 2. Unless you add some kind of tree or indexing mechanism, you will have to walk through an average of 1/2 of the records to find the matching id. If you do use an advanced mechanism you have to create the code that inserts and locates the data. sqlite3 method: 1. You open the file using the sqlite3 connector which does not have to read all of the data into memory. 2. You use a select query to get just the record for the id that you are looking for. sqlite3 has already provided you with optimized lookup and indexing capability, as well as modification operations, etc (most likely written in C). As an added bonus, you don't have to worry about locking issues, to keep the clients from accesses the datastore and receiving an inconsistant copy, while you are making your periodic updates to the database Summary: the pickle method is reinventing the wheel. You can do what has already been done for you with the sqlite3 module (and library) that has already been written for you; but, getting anything near the same functionality is going to require considerable effort on your part; and you are probably going to have to write C to get the equivilant performance. Which seems like a better option to you? I realize this could likely all be done from inside the database itself - but altering the DB itself is not an option (as the product vendor is very touchy about that, and altering it can null our support agreement) Altering the remote database is not an option; but, I am talking about modifying only your local copy. If you can rewrite your pickle file, then you can modify the sqlite3 file. -- http://mail.python.org/mailman/listinfo/python-list
Re: three column dataset - additions and deletions
On 03/12/2010 01:42, Tim Harig wrote: On 2010-12-03, draeathdraeath.spamt...@gmail.com wrote: On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote: Thanks for taking the time to check in on this, Tim! So, basically, you want to store a local copy of the data and sync it to the original. In a way. I only need to store one copy of the data, and make note of changes between it and the current data. Perhaps I am missing a sublty that makes those statements different. [snip] I take the difference to be this: The first statement says there are 2 copies, the local one and the original one, with the local one kept in synch with the original one. The second statement says there's 1 copy, plus the changes which have been made to it. -- http://mail.python.org/mailman/listinfo/python-list
Re: three column dataset - additions and deletions
On 2010-12-03, MRAB pyt...@mrabarnett.plus.com wrote: On 03/12/2010 01:42, Tim Harig wrote: On 2010-12-03, draeathdraeath.spamt...@gmail.com wrote: On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote: Thanks for taking the time to check in on this, Tim! So, basically, you want to store a local copy of the data and sync it to the original. In a way. I only need to store one copy of the data, and make note of changes between it and the current data. Perhaps I am missing a sublty that makes those statements different. [snip] I take the difference to be this: The first statement says there are 2 copies, the local one and the original one, with the local one kept in synch with the original one. The second statement says there's 1 copy, plus the changes which have been made to it. Okay, so you keep one local copy of the data pristine with what is on the server, then you want to keep what is effectively a set of local changes from the what is on the server like a patch so that when you want to access the data locally, you can generate return a local copy by applying the patch to the pristine server data. Then you want to be able to pull down and incorporate changes from the server, something like what you would do with update on an SCM. The informated is still changed on a record by record basis. Does that sound correct? In that case, I would use the same basic scheme that I suggested before but adding a row in the table to track the changes for each field that might be locally modified. Null would of course mark fields that have not changed. You could store the changes literaly or using some kind of diff style delta that could be applied against the pristine copy. Updating from the server works as before; but, you need to decide what happens if there is a conflict between what changes on the server and local changes. -- http://mail.python.org/mailman/listinfo/python-list
Re: three column dataset - additions and deletions
On Fri, 03 Dec 2010 02:19:54 +, Tim Harig wrote: a whole bunch of useful stuff Certainly some good points for me to chew on... thanks! -- http://mail.python.org/mailman/listinfo/python-list
Re: three column dataset - additions and deletions
On 12/2/2010 5:06 PM, draeath wrote: On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote: Thanks for taking the time to check in on this, Tim! I realize this could likely all be done from inside the database itself - but altering the DB itself is not an option (as the product vendor is very touchy about that, and altering it can null our support agreement) A local database is probably the way to go. You're already using MySQL, so you know how to do that. You can use MySQL or SQlite on a local machine machine for your local database, while also talking to the remote MySQL database. Locally, you probably want to store the key, the short string, the MD5 of the long string, and the long string. When you get an update, put it in a temporary table, then compare that table with your permanent table. (The comparison is one line of SQL.) What you do with the differences is your problem. I have a system running which does something like this. Every three hours, it fetches PhishTank's database of a few hundred thousand phishign sites, and compares it to my local copy. Another system of mine reads the daily updates to SEC filings and updates my local database. This is all routine database stuff. If you have to work with big, persistent data sets, use a real database. That's what they are for, and they already have good algorithms for the hard stuff. Storing some local data structure with pickle is probably not the right approach. John Nagle -- http://mail.python.org/mailman/listinfo/python-list
Re: three column dataset - additions and deletions
The only reason I want the hash is that I don't want a copy of this string laying around. I also don't need to know what it is, I just need to know if it's different. Think of this as a tripwire - if someone's user access level is changed, we find out. I still think using a separate database (sqlite or not) is entirely unnecessary. The task simply isn't large enough of a scale to justify going to that extreme. That's like cutting a string with a chainsaw. Granted, the next time I'm at my workstation I'll just give it a go and just stick with whatever turns out best The alternatives are what I was looking for, and you all did provide them :) So here's the goal, since it seems it's still unclear: Cron will call my script on an interval (probably once a day... so yea). The script will determine if users were added, removed, or changed. If so, an email will be constructed with generic details users bjoe, jdoe created; user mwallace deleted; user psouter access level changed and sent to the MUA/MTA/whatever for delivery to myself and the other admins. -- http://mail.python.org/mailman/listinfo/python-list