Hi,

The previous discussion of gp-export did not cover its design in detail. I think if we're going to bundle gp-export with elephant, it is better to discuss its features and shortcomings, so we can either try to fix them or document them.

The basic idea of how it works: it goes through the objects in the database, feeding them to the serializer one by one, and the serializer writes them to a stream. The serializer is a somewhat modified version of s-serialization from cl-prevalence. It was made more flexible by adding hooks that change how objects are serialized. Deserialization works in reverse: it reads forms from the file and creates structures and instantiates objects.

Problems with this approach:

1) To deal with circular references, or just objects being referenced from multiple places, it keeps track of the objects it has already visited, and if one is encountered again, it writes an object reference instead. That is, there is a hash-table which references all the objects and structures you're exporting. This might be a bad idea if the database is large and does not fit into memory. In particular, it needs to track references to all objects, structures, vectors and conses. It does not need to track object slots and strings, though. So if the database is large because of large pieces of text, it is not a problem. But if it is large because there are lots of objects, it is a problem.

2) Import works by reading objects one by one via CL:READ and then going through the s-expressions, instantiating stuff. So each individual object, with everything it references, has to fit in memory in its s-expression form. That might be a problem if you have some huge btree. In particular, the root is (going to be) exported as a single object, so all data stored in the root must fit in memory.

3) Serialization has its limitations. That is, if you're using some clever objects, it might not work. Basically, it is about as good as cl-prevalence, plus it should be able to handle elephant objects without a problem.

These are quite fundamental limitations of this approach. I don't think we are going to deal with them in this release. In some future version, maybe.
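
To make problem 1 more concrete, here is a minimal sketch of reference tracking in plain Common Lisp. It is not the gp-export or s-serialization code: the names *seen*, reset-tracking, tracked-p and encode are made up for illustration, and the (:ref id) output format is an assumption. It handles only conses and vectors, and leaves strings and other atoms untracked, as described above.

```lisp
;;; Hypothetical sketch of reference tracking; not the gp-export code.

(defvar *seen* (make-hash-table :test 'eq)
  "Maps every visited cons or vector to the id it was given.")

(defun reset-tracking ()
  "Forget all visited objects."
  (clrhash *seen*))

(defun tracked-p (object)
  "Conses and non-string vectors are tracked; strings and atoms are not."
  (or (consp object)
      (and (vectorp object) (not (stringp object)))))

(defun encode (object)
  "Return a plain s-expression describing OBJECT.
A repeated visit yields (:ref id) instead of a second copy."
  (cond ((not (tracked-p object)) object)
        ((gethash object *seen*)
         (list :ref (gethash object *seen*)))
        (t
         (let ((id (1+ (hash-table-count *seen*))))
           ;; Register before descending, so that cycles terminate.
           (setf (gethash object *seen*) id)
           (if (consp object)
               (list :cons id (encode (car object)) (encode (cdr object)))
               (list* :vector id (map 'list #'encode object)))))))

;; The same two-element list is referenced twice from one vector.
(let* ((shared (list 1 2))
       (data (vector shared shared "text")))
  (reset-tracking)
  (encode data))
;; => (:VECTOR 1 (:CONS 2 1 (:CONS 3 2 NIL)) (:REF 2) "text")
```

After this call *seen* holds three entries: the vector and the two conses of the list. The string is written out but never stored. Every tracked object stays in the table until reset-tracking is called, which is why the table grows with the number of objects rather than with the amount of text.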

However, there might be some workarounds for issue number 1:

a) There could be an option to disable reference tracking altogether.

b) Another option is to allow feeding the serializer manually. If you know you have a large database, you can feed the serializer in small batches, resetting its state between them, so object references do not accumulate. But then you're responsible for data integrity: if something is exported more than once, it is your own problem. Also, doing it manually, you might forget to export something.

I'm planning to include only option b), just to increase flexibility. So if you think you need option a) badly, drop me a note.

The current version has even more problems (and I'm going to fix some of them):

I) It does a rather weird thing: it first writes the data to a string, then reads this string into s-expressions, and then writes them into a file. That's because the initial design did customizations on the s-expression level. I've identified this as not flexible enough (and rather cumbersome too) and added hooks to the serializer. But there is still a hook which allows modifications on the s-expression level. I'm going to remove it before the release, so it will write directly to a file.

II) Import and export were very dependent on the elephant backend, to the point that you needed to write a piece of code for each backend pair. Now I'm using a different approach: all recognized elephant objects (descendants of persistent-collection, basically) are serialized in a special way, using the general elephant API rather than backend-specific stuff. Later you can import the file into a database of any type, as the importer also uses only the general elephant API.

III) The approach described above has a consequence: if you have some clever object which is of class persistent-collection but doesn't work like the standard persistent collections, it probably won't work unless you explicitly add support for it.

IV) Transient slots are exported.
I think they should not be, and I'm going to fix this.

V) Instances are created with make-instance. This works fine for simple objects, but some clever objects might suffer. We've already implemented this in elephant, and I hope I can make it work with the same semantics as it has there.

VI) This is not really a problem, but a design decision: objects of type btree-index won't be exported, but will instead be recreated from indexed-btree objects.

VII) The internal structure is rather cumbersome because of the layered approach: the serialization part works mostly independently from the rest, with only a few hooks in the places where I needed them. I think if it had been designed as a whole it could be more flexible and elegant, but who knows... I'm going to keep it as it is for the current release. If we're going to deal with the memory usage problems in future, it might make sense to redesign the architecture... But I'm not going to look that far.

And one more thing: Henrik says it is a good idea to rename gp-export to something else. I dunno, gp-export is not that bad, as for me. Maybe it is hard to tell what it does from the name, but at least the name is recognizable. Henrik's ideas from README.md:

----
gp-export, lob-dump clob-dump
# lob-dump
Name? gp-export, lob-dump clob-dump
lob-dump is lisp objects dump, a way to export lisp objects to a file and restore.
----

I don't particularly like any of these, but if we're going to pick a more accurate name, my suggestion is "clod-exim". It is supposed to mean "Common Lisp object database export and import". It is important to mention both export and import, as some people might think it does only export. (The dictionary says that "clod" is "a big clumsy often slow-witted person", which kind of reflects the issues I've mentioned above :)

_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel