three column dataset - additions and deletions

2010-12-02 Thread draeath
I'm going to be writing a utility that will be pulling three fields from 
a MySQL table. I've already got a sample dataset - there's a long int 
(which is a db key), a short string, and a looong string. Many rows.

As it is, receive this data from the DB interface as a rather large tuple 
of tuples. I plan on hashing the long string field (both for convenience 
and security) and storing the set in a pickle.

The idea is that this script will run periodically, pulling the table, 
and comparing the data gathered at that run to that stored by the 
previous, acting on changes made, and storing the current data back (to 
be referenced against in the next invocation)

I figure it will be easy enough to determine changed hashes for a given 
key. What I'm unclear on is what the best type of structure to keep this 
data in, given that I need to modify the data after it comes in 
(replacing that long string with, say, an MD5 from hashlib) and both need 
to act on new rows (rows that don't exist in the 'old' data) and 
deleted rows (rows that only exist in the 'old' data).

Keeping in mind that I'm a newbie here, and I'm probably not aware of 
most of the different ways to store such things. I shouldn't have any 
problems with the logic itself - I just know enough to know I don't know 
the best ways of doing things :)

Any suggestions? I'm not asking for code or handholding, but some objects 
or datatypes to look into would be very helpful at this early stage.

Thanks!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: three column dataset - additions and deletions

2010-12-02 Thread Tim Harig
On 2010-12-02, draeath draeath.spamt...@gmail.com wrote:
 The idea is that this script will run periodically, pulling the table, 
 and comparing the data gathered at that run to that stored by the 
 previous, acting on changes made, and storing the current data back (to 
 be referenced against in the next invocation)

So, basically, you want to store a local copy of the data and sync it to
the original.

 I figure it will be easy enough to determine changed hashes for a given 
 key. What I'm unclear on is what the best type of structure to keep this 
 data in, given that I need to modify the data after it comes in 
 (replacing that long string with, say, an MD5 from hashlib) and both need 
 to act on new rows (rows that don't exist in the 'old' data) and 
 deleted rows (rows that only exist in the 'old' data).

You need to differentiate between the in memory data model and the storage
model.  Since this data comes from a database in the first place, I would
dump it to an sqlite3 database from the beginning.  You can use this to
store, modify, and change the values as you receive them from the database.

If you are looking for in-memory structures, then you haven't really
provided us with enough information on the significance and organization of
the data.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: three column dataset - additions and deletions

2010-12-02 Thread draeath
On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote:

Thanks for taking the time to check in on this, Tim!

 So, basically, you want to store a local copy of the data and sync it to
 the original.
In a way. I only need to store one copy of the data, and make note of 
changes between it and the current data.

 You need to differentiate between the in memory data model and the
 storage model.  Since this data comes from a database in the first
 place, I would dump it to an sqlite3 database from the beginning.  You
 can use this to store, modify, and change the values as you receive them
 from the database.
I thought of doing that, but given that I only need to store a single 
instance of the data, a simple pickle will do the job nicely (am I 
correct in reading that it can save/load any python object?)

 If you are looking for in-memory structures, then you haven't really
 provided us with enough information on the significance and organization
 of the data.
The data columns:
Long Int, String (under 30 chars), String (over 100 chars)
The rows can scale up to hundreds, perhaps thousands.

The integer is the database key, the shorter string is a user name, and 
the longer string is an access control definition. The whole idea of this 
script is to check, daily, for any added or removed users - or any 
altered access control definition.

I realize this could likely all be done from inside the database itself - 
but altering the DB itself is not an option (as the product vendor is 
very touchy about that, and altering it can null our support agreement)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: three column dataset - additions and deletions

2010-12-02 Thread Tim Harig
On 2010-12-03, draeath draeath.spamt...@gmail.com wrote:
 On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote:

 Thanks for taking the time to check in on this, Tim!

 So, basically, you want to store a local copy of the data and sync it to
 the original.
 In a way. I only need to store one copy of the data, and make note of 
 changes between it and the current data.

Perhaps I am missing a sublty that makes those statements different.

 You need to differentiate between the in memory data model and the
 storage model.  Since this data comes from a database in the first
 place, I would dump it to an sqlite3 database from the beginning.  You
 can use this to store, modify, and change the values as you receive them
 from the database.
 I thought of doing that, but given that I only need to store a single 
 instance of the data, a simple pickle will do the job nicely (am I 
 correct in reading that it can save/load any python object?)

So, you start by dumping the data from the remote server into an sqlite3
database table.  What you end up with is a record=record copy of the
original query (plus any other meta data that you want to add).  Then,
when the data changes, you apply those same changes to your local table
(or just regenerate it since you seem to be pulling all of the information
from the server anyway.  The result is a *single instance of the data*.

Why you would want to replace this with a pickle of a nested set up tuples
or a homebrew on disk data structure is beyond me.  Using sqlite3 is almost
certainly faster and more functional then anything you are going to create
without some serious work.

 If you are looking for in-memory structures, then you haven't really
 provided us with enough information on the significance and organization
 of the data.
 The data columns:
 Long Int, String (under 30 chars), String (over 100 chars)
 The rows can scale up to hundreds, perhaps thousands.

Then those are the columns that you create for your local table.

 The integer is the database key, the shorter string is a user name, and 
 the longer string is an access control definition. The whole idea of this 
 script is to check, daily, for any added or removed users - or any 
 altered access control definition.

The question is how are you going to use this information once you have
mirroed it locally.  Most likely, from you description, you just need to
access it as a local read only data store.  Now compare the differences in
how you would acces the data:

pickle method:
1. You have to load the entire pickle into memory.
2. Unless you add some kind of tree or indexing mechanism, you will have to
walk through an average of 1/2 of the records to find the matching
id.  If you do use an advanced mechanism you have to create the
code that inserts and locates the data.

sqlite3 method:
1. You open the file using the sqlite3 connector which does not have to
read all of the data into memory.
2. You use a select query to get just the record for the id that you are
looking for.  sqlite3 has already provided you with optimized
lookup and indexing capability, as well as modification
operations, etc (most likely written in C).

As an added bonus, you don't have to worry about locking issues, to keep
the clients from accesses the datastore and receiving an inconsistant
copy, while you are making your periodic updates to the database

Summary: the pickle method is reinventing the wheel.  You can do what has
already been done for you with the sqlite3 module (and library)
that has already been written for you; but, getting anything
near the same functionality is going to require considerable effort
on your part; and you are probably going to have to write C to get
the equivilant performance.

Which seems like a better option to you?

 I realize this could likely all be done from inside the database itself - 
 but altering the DB itself is not an option (as the product vendor is 
 very touchy about that, and altering it can null our support agreement)

Altering the remote database is not an option; but, I am talking about
modifying only your local copy.  If you can rewrite your pickle file,
then you can modify the sqlite3 file.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: three column dataset - additions and deletions

2010-12-02 Thread MRAB

On 03/12/2010 01:42, Tim Harig wrote:

On 2010-12-03, draeathdraeath.spamt...@gmail.com  wrote:

On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote:

Thanks for taking the time to check in on this, Tim!


So, basically, you want to store a local copy of the data and sync it to
the original.

In a way. I only need to store one copy of the data, and make note of
changes between it and the current data.


Perhaps I am missing a sublty that makes those statements different.


[snip]
I take the difference to be this:

The first statement says there are 2 copies, the local one and the
original one, with the local one kept in synch with the original one.

The second statement says there's 1 copy, plus the changes which have
been made to it.
--
http://mail.python.org/mailman/listinfo/python-list


Re: three column dataset - additions and deletions

2010-12-02 Thread Tim Harig
On 2010-12-03, MRAB pyt...@mrabarnett.plus.com wrote:
 On 03/12/2010 01:42, Tim Harig wrote:
 On 2010-12-03, draeathdraeath.spamt...@gmail.com  wrote:
 On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote:

 Thanks for taking the time to check in on this, Tim!

 So, basically, you want to store a local copy of the data and sync it to
 the original.
 In a way. I only need to store one copy of the data, and make note of
 changes between it and the current data.

 Perhaps I am missing a sublty that makes those statements different.

 [snip]
 I take the difference to be this:

 The first statement says there are 2 copies, the local one and the
 original one, with the local one kept in synch with the original one.

 The second statement says there's 1 copy, plus the changes which have
 been made to it.

Okay, so you keep one local copy of the data pristine with what is on the
server, then you want to keep what is effectively a set of local changes
from the what is on the server like a patch so that when you want to access
the data locally, you can generate return a local copy by applying the
patch to the pristine server data.  Then you want to be able to pull down
and incorporate changes from the server, something like what you would do
with update on an SCM.  The informated is still changed on a record by
record basis.

Does that sound correct?

In that case, I would use the same basic scheme that I suggested before
but adding a row in the table to track the changes for each field that
might be locally modified.  Null would of course mark fields that have not
changed.  You could store the changes literaly or using some kind of diff
style delta that could be applied against the pristine copy.  Updating from
the server works as before; but, you need to decide what happens if there
is a conflict between what changes on the server and local changes.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: three column dataset - additions and deletions

2010-12-02 Thread draeath
On Fri, 03 Dec 2010 02:19:54 +, Tim Harig wrote:

 a whole bunch of useful stuff

Certainly some good points for me to chew on... thanks!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: three column dataset - additions and deletions

2010-12-02 Thread John Nagle

On 12/2/2010 5:06 PM, draeath wrote:

On Thu, 02 Dec 2010 22:55:53 +, Tim Harig wrote:

Thanks for taking the time to check in on this, Tim!



I realize this could likely all be done from inside the database itself -
but altering the DB itself is not an option (as the product vendor is
very touchy about that, and altering it can null our support agreement)


   A local database is probably the way to go.  You're already using
MySQL, so you know how to do that.  You can use MySQL or SQlite on
a local machine machine for your local database, while also talking
to the remote MySQL database.

   Locally, you probably want to store the key, the short string,
the MD5 of the long string, and the long string.  When you get an
update, put it in a temporary table, then compare that table with
your permanent table.  (The comparison is one line of SQL.)
What you do with the differences is your problem.

   I have a system running which does something like this. Every
three hours, it fetches PhishTank's database of a few hundred
thousand phishign sites, and compares it to my local copy.
Another system of mine reads the daily updates to SEC filings
and updates my local database.  This is all routine database
stuff.

   If you have to work with big, persistent data sets, use a
real database. That's what they are for, and they already have
good algorithms for the hard stuff.  Storing some local data
structure with pickle is probably not the right approach.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: three column dataset - additions and deletions

2010-12-02 Thread draeath
The only reason I want the hash is that I don't want a copy of this 
string laying around. I also don't need to know what it is, I just need 
to know if it's different. Think of this as a tripwire - if someone's 
user access level is changed, we find out.

I still think using a separate database (sqlite or not) is entirely 
unnecessary. The task simply isn't large enough of a scale to justify 
going to that extreme. That's like cutting a string with a chainsaw. 
Granted, the next time I'm at my workstation I'll just give it a go and 
just stick with whatever turns out best

The alternatives are what I was looking for, and you all did provide 
them :)

So here's the goal, since it seems it's still unclear:

Cron will call my script on an interval (probably once a day... so yea). 
The script will determine if users were added, removed, or changed. If 
so, an email will be constructed with generic details users bjoe, jdoe 
created; user mwallace deleted; user psouter access level changed and 
sent to the MUA/MTA/whatever for delivery to myself and the other admins.
-- 
http://mail.python.org/mailman/listinfo/python-list