[sqlalchemy] Handling big Python objects

2014-12-03 Thread Andrea Gavana
Hello list,

sorry for the possibly noob question, I've googled around without much 
success looking for an answer. Basically, I am given a series of instances 
of a huge Python class (a Simulation object), each of which contains an 
enormous amount of information - when I cPickle one (with the highest 
protocol), it can result in files of 200-250 MB, and occasionally up to 
2 GB. I am looking for intelligent ways to store these objects in a 
database. I have to say that I don't have much control over this Simulation 
class, so I can't change its internal structure - I'm just looking for a 
better alternative to what I am doing.

So, what I am doing now is basically storing this huge object as a string. 
I have these two functions:

import cPickle
import zlib
import base64

def serialize(my_simulation):
    # Pickle, compress, then base64-encode so the result is a plain string
    pickled = cPickle.dumps(my_simulation, cPickle.HIGHEST_PROTOCOL)
    my_db_object = base64.b64encode(zlib.compress(pickled))
    return my_db_object

def deserialize(my_db_object):
    # Reverse the steps: decode, decompress, unpickle
    raw = zlib.decompress(base64.b64decode(my_db_object))
    return cPickle.loads(raw)


I can use them to store/retrieve these big Python objects to/from the 
database, but I feel it's not a particularly effective way to handle this 
problem. I've tried to get my head around BLOBs and the LargeBinary type, 
but I'm not sure I'm on the right path here. I appreciate any suggestion on 
how to approach the problem, to make the storing/retrieving of these 
objects a bit less time/memory consuming (especially time).
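
For reference, a minimal sketch of my (very rough) understanding of the 
LargeBinary route, with an invented table layout and connection URL:

import cPickle
import zlib

from sqlalchemy import Column, Integer, LargeBinary, String, create_engine
from sqlalchemy.dialects.mysql import LONGBLOB
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class SimulationBlob(Base):
    __tablename__ = 'simulations'  # invented name

    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    # Raw bytes go straight into a BLOB column: no base64 step needed,
    # which saves roughly a third of the size plus some CPU time.
    # MySQL's plain BLOB tops out at 64 KB, hence the LONGBLOB variant.
    payload = Column(LargeBinary().with_variant(LONGBLOB(), 'mysql'))

engine = create_engine('mysql://user:password@localhost/testdb')  # placeholder
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def save_simulation(session, name, sim):
    data = zlib.compress(cPickle.dumps(sim, cPickle.HIGHEST_PROTOCOL))
    session.add(SimulationBlob(name=name, payload=data))
    session.commit()

def load_simulation(session, sim_id):
    record = session.query(SimulationBlob).get(sim_id)
    return cPickle.loads(zlib.decompress(record.payload))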

On a related note, I am now using MySQL as a backend - but I am open to 
suggestions about possible alternatives that may make this problem more 
tractable: I have some difficulty installing other backends (like 
PostgreSQL with psycopg2), but I know I can use the Access and Oracle 
backends. I know that Access is not particularly popular in the SQLAlchemy 
world. Of course, if any other backend has advantages over MySQL in dealing 
with this issue, I may try to convince IT to get it installed on our 
machines.

All suggestions and comments are most welcome.

Thank you in advance for your help.

Andrea.



Re: [sqlalchemy] Handling big Python objects

2014-12-03 Thread AM
I would recommend just storing them on disk and letting the OS VMM deal 
with caching for speed. If you are not constrained for space I would 
recommend not zlib-ing them either.


AM



Re: [sqlalchemy] Handling big Python objects

2014-12-03 Thread Ed Rahn
What do you want to be able to do with the objects? I'd recommend storing 
them as files and then referencing the filename in the database.
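
A minimal sketch of that idea, with the storage directory and table layout 
invented for illustration:

import os
import uuid
import cPickle

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
STORAGE_DIR = '/data/simulations'  # hypothetical location

class SimulationFile(Base):
    __tablename__ = 'simulation_files'  # invented name

    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    path = Column(String(512))  # where the pickle actually lives

def save_to_disk(session, name, sim):
    # uuid4 gives a collision-free filename without coordinating with the db
    filename = os.path.join(STORAGE_DIR, '%s.pkl' % uuid.uuid4().hex)
    with open(filename, 'wb') as fh:
        cPickle.dump(sim, fh, cPickle.HIGHEST_PROTOCOL)
    session.add(SimulationFile(name=name, path=filename))
    session.commit()

def load_from_disk(session, sim_id):
    record = session.query(SimulationFile).get(sim_id)
    with open(record.path, 'rb') as fh:
        return cPickle.load(fh)

The database row stays tiny; only the metadata and the path go into the 
table, while the heavy bytes live on the filesystem.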


Ed




Re: [sqlalchemy] Handling big Python objects

2014-12-03 Thread Jonathan Vanasco


On Wednesday, December 3, 2014 4:23:31 PM UTC-5, Ams Fwd wrote:

 I would recommend just storing them on disk and let the OS VMM deal with 
 caching for speed. If you are not constrained for space I would 
 recommend not zlib-ing it either. 


I'll second storing them to disk.  Large object support in all the 
databases is a pain and not very efficient.  Just pickle/unpickle a file 
and use the db to keep track of that file.



Re: [sqlalchemy] Handling big Python objects

2014-12-03 Thread Andrea Gavana



Thanks to all of you who replied. A couple of issues that I'm sure I will 
encounter by leaving the files on disk:

1. Other users can easily delete/overwrite/rename the files on disk, which 
is something we really, really do not want;
2. The whole point of a database was to have everything centralized in one 
place, not leaving the simulation files scattered around like a mess across 
the whole network drive;
3. As an aside, not zlib-ing the files saves about 5 seconds/simulation 
(out of a 20-second save) but increases the database size by a factor of 4. 
I'll have to check if this is OK.

Thank you again for your interest.

Andrea.

 



Re: [sqlalchemy] Handling big Python objects

2014-12-03 Thread Claudio Freire
On Wed, Dec 3, 2014 at 7:23 PM, Andrea Gavana andrea.gav...@gmail.com wrote:
 3. As an aside, not zlib-ing the files saves about 5 seconds/simulation
 (over a 20 seconds save) but increases the database size by 4 times. I'll
 have to check if this is OK.

You can usually specify a compression level of 1 to get a considerable
speedup if you care about speed more than disk efficiency.
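
For example (a quick, self-contained sketch; actual numbers depend entirely 
on your data):

import zlib
import cPickle

# Stand-in payload, just to make the example runnable; a real Simulation
# object would be pickled the same way.
payload = cPickle.dumps({'results': range(100000)}, cPickle.HIGHEST_PROTOCOL)

# zlib.compress takes an optional level: 1 is fastest (larger output),
# 9 is slowest (smallest output); the default is 6.
fast = zlib.compress(payload, 1)
small = zlib.compress(payload, 9)
print len(payload), len(fast), len(small)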



Re: [sqlalchemy] Handling big Python objects

2014-12-03 Thread AM

On 12/3/14 2:23 PM, Andrea Gavana wrote:

Thanks to all of you who replied. A couple of issues that I'm sure I 
will encounter by leaving the files on disk:


1. Other users can easily delete/overwrite/rename the files on disk, 
which is something we really, really do not want;


If this is Windows, group policies are your friends :). If this is Linux, 
permissions plus a secondary service that mediates access to the files are 
a decent choice.


2. The whole point of a database was to have everything centralized in 
one place, not leaving the simulation files scattered around like a 
mess across the whole network drive;


The last time I did this, a post-processing step in my data pipeline 
organized the files into a multi-level folder structure based on the 
first x characters of their SHA1.
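
A minimal sketch of that layout (the depth, split and root directory are 
arbitrary choices here):

import os
import hashlib

STORAGE_ROOT = '/data/simulations'  # hypothetical root

def sha1_path(data):
    digest = hashlib.sha1(data).hexdigest()
    # e.g. 'ab/cd/abcdef...': two levels of two characters each keeps any
    # single directory from accumulating a huge number of files
    return os.path.join(STORAGE_ROOT, digest[:2], digest[2:4], digest)

def store_blob(data):
    path = sha1_path(data)
    dirname = os.path.dirname(path)
    if not os.path.isdir(dirname):
        os.makedirs(dirname)
    with open(path, 'wb') as fh:
        fh.write(data)
    return path

As a side benefit, content-addressed names mean identical payloads 
deduplicate themselves.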


3. As an aside, not zlib-ing the files saves about 5 
seconds/simulation (out of a 20-second save) but increases the database 
size by a factor of 4. I'll have to check if this is OK.


Whether to compress depends on your needs, but if the size difference is 
that stark, I would highly recommend compression.


HTH
AM




Re: [sqlalchemy] Handling big Python objects

2014-12-03 Thread Andrea Gavana
Hi,

On Thursday, December 4, 2014 12:02:42 AM UTC+1, Ams Fwd wrote:

  1. Other users can easily delete/overwrite/rename the files on disk, 
  which is something we really, really do not want; 

 If this is Windows, group policies are your friends :). If this is Linux, 
 permissions plus a secondary service that mediates access to the files 
 are a decent choice.



Yes, this is Windows, but no, I can't go around and tell the users that the 
simulation they just saved is not accessible anymore. The database is part 
of a much larger user interface application, where users can generate 
simulations and then decide whether or not they are relevant enough to be 
stored in the database. At 300 MB per simulation (or more), the size issue 
crops up very quickly.



  2. The whole point of a database was to have everything centralized in 
  one place, not leaving the simulation files scattered around like a 
  mess across the whole network drive; 

 The last time I did this, a post-processing step in my data pipeline 
 organized the files into a multi-level folder structure based on the 
 first x characters of their SHA1.



Again, I am dealing with non-Python people - and in general with people who 
are extremely good at what they do but who don't care about the overall IT 
architecture, as long as it works and is recognizable from a Windows 
Explorer point of view. A file-based approach is unfortunately not a good 
option in the current setup.

 


  3. As an aside, not zlib-ing the files saves about 5 
  seconds/simulation (out of a 20-second save) but increases the database 
  size by a factor of 4. I'll have to check if this is OK. 

 Whether to compress depends on your needs, but if the size difference is 
 that stark, I would highly recommend compression.



I will probably go that way; 5 seconds more or less does not make that much 
of a difference overall. I just wish the backends didn't complain when I 
pass them cPickled objects (bytes instead of strings...).
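
I suspect the fix is to declare the column as LargeBinary, or even as 
SQLAlchemy's PickleType, which pickles and unpickles transparently, so the 
DBAPI receives bytes rather than text. A minimal sketch, with invented 
names:

import cPickle

from sqlalchemy import Column, Integer, PickleType, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Simulation(Base):
    __tablename__ = 'simulations_pickled'  # invented name

    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    # PickleType stores the pickled bytes in a LargeBinary/BLOB column
    # under the hood, so no manual str/bytes juggling is needed. Note it
    # does not compress; wrap it manually if zlib is still wanted.
    data = Column(PickleType(protocol=cPickle.HIGHEST_PROTOCOL))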


Andrea.
 



Re: [sqlalchemy] Handling big Python objects

2014-12-03 Thread Jonathan Vanasco
If you need to handle large objects, you should look into DBAPI drivers 
that can stream results.  The only drivers I know of that can handle a 
stream are psycopg2 (postgres) and oursql (mysql). 

There have been a handful of recipes/threads in the archives from people 
using streams for blobs.

I would still do a filesystem-based approach.  You wouldn't have to b64 
your files, and you could use external tools to compress them (e.g. bzip2, 
which takes longer but compresses better).   Your users should never know 
where this data is; it should be treated as a hidden/secure storage area 
that only the app uses -- just like your MySQL data directory.  
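
For the bzip2 piece, a minimal sketch using Python's standard bz2 module 
(the paths are up to you; BZ2File streams the compression, so the whole 
compressed blob never has to sit in memory at once):

import bz2
import cPickle

def save_compressed(sim, path):
    # BZ2File compresses transparently as cPickle writes through it
    fh = bz2.BZ2File(path, 'wb')
    try:
        cPickle.dump(sim, fh, cPickle.HIGHEST_PROTOCOL)
    finally:
        fh.close()

def load_compressed(path):
    fh = bz2.BZ2File(path, 'rb')
    try:
        return cPickle.load(fh)
    finally:
        fh.close()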
