[sqlalchemy] Handling big Python objects
Hello list, sorry for the possibly noob question; I've googled around without much success looking for an answer.

Basically, I am given a series of huge Python objects (Simulation instances), each containing an enormous amount of information. When I cPickle one (with the highest protocol), it can result in files 200-250 MB in size, although occasionally it can reach 2 GB. I am looking for intelligent ways to store these objects in a database. I should say that I don't have much control over this Simulation class, so I can't change its internal structure; I'm just looking for a better alternative to what I am doing now, which is basically storing this huge object as a string. I have these two methods:

    import cPickle
    import zlib
    import base64

    def serialize(my_simulation):
        my_db_object = base64.b64encode(
            zlib.compress(cPickle.dumps(my_simulation, cPickle.HIGHEST_PROTOCOL)))
        return my_db_object

    def deserialize(my_db_object):
        my_simulation = cPickle.loads(zlib.decompress(base64.b64decode(my_db_object)))
        return my_simulation

I can use them to store/retrieve these big Python objects to/from the database, but I feel it's not a particularly effective way to handle the problem. I've tried to get my head around BLOBs and the LargeBinary type, but I'm not sure I'm on the right path here. I'd appreciate any suggestion on how to approach the problem, to make the storing/retrieving of these objects a bit less time/memory consuming (especially time).

On a related note, I am currently using MySQL as a backend, but I am open to suggestions about alternatives that might make this problem more tractable. I have had some difficulty installing other backends (like PostgreSQL or psycopg2), but I know I can use the Access and Oracle backends. I know that Access is not particularly famous in the SQLAlchemy world. Of course, if any other backend has advantages over MySQL in dealing with this issue, I may try to convince IT to get it installed on our machines.
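For what it's worth, if the column is declared as a binary type (SQLAlchemy's LargeBinary, which maps to BLOB on MySQL), the base64 step can be dropped entirely, saving both time and roughly a third of the stored size. A minimal sketch of the two helpers without it, written here with Python 3's pickle (under Python 2, substitute cPickle):

```python
import pickle
import zlib

def serialize(my_simulation):
    # Raw compressed bytes can be bound directly to a LargeBinary/BLOB
    # column; base64 text encoding is only needed for a text column.
    return zlib.compress(pickle.dumps(my_simulation, pickle.HIGHEST_PROTOCOL))

def deserialize(my_db_object):
    return pickle.loads(zlib.decompress(my_db_object))
```

The round trip is lossless; only the on-the-wire representation changes.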
All suggestions and comments are most welcome. Thank you in advance for your help.

Andrea.

--
You received this message because you are subscribed to the Google Groups sqlalchemy group. To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com. To post to this group, send email to sqlalchemy@googlegroups.com. Visit this group at http://groups.google.com/group/sqlalchemy. For more options, visit https://groups.google.com/d/optout.
Re: [sqlalchemy] Handling big Python objects
I would recommend just storing them on disk and letting the OS VMM deal with caching for speed. If you are not constrained for space, I would recommend not zlib-ing them either.

AM

On 12/3/14 1:18 PM, Andrea Gavana wrote:
> [snip]
Re: [sqlalchemy] Handling big Python objects
What do you want to be able to do with the objects? I'd recommend storing them as files and then referencing the filename in the database.

Ed

On 12/03/2014 04:18 PM, Andrea Gavana wrote:
> [snip]
Re: [sqlalchemy] Handling big Python objects
On Wednesday, December 3, 2014 4:23:31 PM UTC-5, Ams Fwd wrote:
> I would recommend just storing them on disk and let the OS VMM deal with caching for speed. If you are not constrained for space I would recommend not zlib-ing it either.

I'll second storing them to disk. Large-object support in all the databases is a pain and not very efficient. Just pickle/unpickle a file and use the db to manage that file.
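As a sketch of what "pickle a file and let the db manage it" could look like (the function names and .pkl suffix are illustrative choices, not anything prescribed), the application would write each simulation to its own file and record only the returned path in the database:

```python
import pickle
import tempfile
from pathlib import Path

def save_simulation(sim, storage_dir):
    # Pickle the object to its own file; the database row then stores
    # only the returned path, not the multi-hundred-MB payload.
    storage_dir = Path(storage_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)
    fd, name = tempfile.mkstemp(suffix=".pkl", dir=str(storage_dir))
    with open(fd, "wb") as f:
        pickle.dump(sim, f, pickle.HIGHEST_PROTOCOL)
    return name

def load_simulation(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

mkstemp guarantees a unique filename, so concurrent saves cannot clobber each other.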
Re: [sqlalchemy] Handling big Python objects
On Wednesday, December 3, 2014 10:42:27 PM UTC+1, Jonathan Vanasco wrote:
> I'll second storing them to disk. Large object support in all the databases is a pain and not very optimal. Just pickle/unpickle a file and use the db to manage that file.

Thanks to all of you who replied. A couple of issues that I'm sure I will encounter by leaving the files on disk:

1. Other users can easily delete/overwrite/rename the files on disk, which is something we really, really do not want;
2. The whole point of a database was to have everything centralized in one place, not leaving the simulation files scattered around like a mess across the whole network drive;
3. As an aside, not zlib-ing the files saves about 5 seconds/simulation (out of a 20-second save) but increases the database size by a factor of 4. I'll have to check if this is OK.

Thank you again for your interest.

Andrea.
Re: [sqlalchemy] Handling big Python objects
On Wed, Dec 3, 2014 at 7:23 PM, Andrea Gavana <andrea.gav...@gmail.com> wrote:
> 3. As an aside, not zlib-ing the files saves about 5 seconds/simulation (out of a 20-second save) but increases the database size by a factor of 4. I'll have to check if this is OK.

You can usually specify a compression level of 1 to get a considerable speedup if you care about speed more than disk efficiency.
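To make the trade-off concrete: zlib.compress takes a level argument from 1 (fastest) to 9 (smallest); the default is 6. A quick sketch on synthetic repetitive data (real simulation pickles will behave differently, so this is only indicative):

```python
import zlib

# Synthetic, highly repetitive payload standing in for a pickled simulation.
payload = b"pressure temperature saturation " * 100_000

fast = zlib.compress(payload, 1)   # level 1: fastest, larger output
small = zlib.compress(payload, 9)  # level 9: slowest, smallest output

# Both round-trip losslessly; only speed and size differ.
assert zlib.decompress(fast) == payload
assert zlib.decompress(small) == payload
```

Timing the two levels on a few real pickles would show whether level 1 recovers most of those 5 seconds while keeping a useful size reduction.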
Re: [sqlalchemy] Handling big Python objects
On 12/3/14 2:23 PM, Andrea Gavana wrote:
> 1. Other users can easily delete/overwrite/rename the files on disk, which is something we really, really do not want;

If this is Windows, group policies are your friends :). If this is Linux, permissions with a secondary service to access the files are a decent choice.

> 2. The whole point of a database was to have everything centralized in one place, not leaving the simulation files scattered around like a mess in the whole network drive;

The last time I did this, a post-processing step in my data pipeline organized the files into a multi-level folder structure based on the first x characters of their sha1.

> 3. As an aside, not zlib-ing the files saves about 5 seconds/simulation (out of a 20-second save) but increases the database size by a factor of 4. I'll have to check if this is OK.

Whether to use compression depends on your needs. Given how stark the difference in size is, I would highly recommend compression.

HTH
AM
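The multi-level layout AM describes might look something like this (a sketch; the fan-out depth/width and the .pkl suffix are arbitrary illustrative choices):

```python
import hashlib
from pathlib import Path

def fanout_path(blob, root, depth=2, width=2):
    # Derive e.g. root/ab/cd/abcd...ef.pkl from the blob's SHA-1 prefix,
    # so that no single directory accumulates millions of files.
    digest = hashlib.sha1(blob).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return Path(root, *parts, digest + ".pkl")
```

The content hash also gives free deduplication: saving the same simulation twice maps to the same path.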
Re: [sqlalchemy] Handling big Python objects
Hi,

On Thursday, December 4, 2014 12:02:42 AM UTC+1, Ams Fwd wrote:
> If this is windows group policies are your friends :). If this is linux, permissions with a secondary service to access the files are a decent choice.

Yes, this is Windows, but no, I can't go around and tell the users that the simulation they just saved is not accessible anymore. The database is part of a much larger user-interface application, where users can generate simulations and then decide whether or not they are relevant enough to be stored in the database. At a rate of 300 MB per simulation (or more), the size issue gets serious very quickly.

> The last time I did it a post processing step in my data pipeline organized the files based on a multi-level folder structure based on the first x-characters of their sha1.

Again, I am dealing with non-Python people, and in general with people who are extremely good at what they do but don't care about the overall IT architecture, as long as it works and is recognizable from a Windows Explorer point of view. A file-based approach is unfortunately not a good option in the current setup.

> To use compression or not depends on your needs. If the difference in time consumed is so stark, I would highly recommend compression.

I will probably go that way; 5 seconds more or less does not make much of a difference overall. I just wish the backends didn't complain when I pass them cPickled objects (bytes instead of strings...).

Andrea.
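On the bytes-vs-strings complaint: when the column is declared as a binary type, DBAPI drivers generally accept raw bytes directly. A small sketch using the standard library's sqlite3 as a stand-in (with SQLAlchemy on MySQL, the analogous column type would be LargeBinary, which maps to BLOB; the table and column names here are made up):

```python
import pickle
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE simulations (id INTEGER PRIMARY KEY, data BLOB)")

sim = {"name": "sim-01", "rates": [1.5, 2.5, 3.5]}
blob = zlib.compress(pickle.dumps(sim, pickle.HIGHEST_PROTOCOL))

# Raw bytes bind to a BLOB column without any base64 wrapping.
conn.execute("INSERT INTO simulations (data) VALUES (?)", (blob,))

row = conn.execute("SELECT data FROM simulations").fetchone()
restored = pickle.loads(zlib.decompress(row[0]))
assert restored == sim
```

The complaint usually arises when bytes are handed to a text (VARCHAR/TEXT) column; declaring the column binary sidesteps it.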
Re: [sqlalchemy] Handling big Python objects
If you need to handle large objects, you should look into DBAPI drivers that can stream results. The only drivers I know of that can handle a stream are psycopg2 (PostgreSQL) and oursql (MySQL). There have been a handful of recipes/threads in the archives of people using streams for BLOBs.

I would still go with a filesystem-based approach. You wouldn't have to base64 your files, and you could use external tools to compress (e.g. bzip2, which takes longer but compresses more). Your users should never know where this data is; it should be treated as a hidden/secure storage area that only the app uses -- just like your MySQL data files.
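The bzip2 suggestion can also be tried in-process via the standard library's bz2 module, which typically compresses tighter than zlib at the cost of CPU time; a quick sketch for comparing the two on your own data (the payload here is synthetic):

```python
import bz2
import zlib

payload = b"oil gas water rates per timestep " * 50_000

z = zlib.compress(payload, 6)  # zlib at its default level
b = bz2.compress(payload, 9)   # bz2 at its highest level

# Both are lossless; compare len(z) and len(b) on real pickled
# simulations to decide whether the extra CPU time of bz2 pays off.
assert zlib.decompress(z) == payload
assert bz2.decompress(b) == payload
```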