Re: [Rpm-maint] Rpm Database musings
On 03/13/2013 03:19 PM, Michael Schroeder wrote:
> On Fri, Mar 08, 2013 at 03:37:12PM +0100, Michael Schroeder wrote:
>> I kind of like to have all the data in one file. Anyway, attached is a little Packages database implementation I did yesterday and today.
>
> Attached is the current version of my little experiments. The main changes are:
> - I switched to adler32 instead of md5sum
> - I added a little index database implementation, rpmidx.[ch]

Oh, awesome. I was quietly hoping you might do a proof-of-concept index (database) implementation too, and here we are :) Haven't looked deeply into it yet, but in any case with an actual alternative implementation it'll be much easier to work towards a backend abstraction in the rpmdb layer, and actually be able to test it.

> The index database is using mmap to map the database into memory. It uses the main rpmpkg database for locking. Performance and database sizes seem to be promising.
>
> Things I'm not happy about:
> - resizing currently works by rebuilding a new database and calling rename(). I can change this to be in place, though; it just makes the code a little bit slower because I don't want to simply overwrite the old data. I basically want an atomic switch to the new data.
> - The generation count in idxdb is currently not used. My goal is to detect crashed database updates somehow.

Yup, detecting and automatically regenerating out-of-sync indexes is pretty much a must (yet something we currently don't have either, sigh)

Some other issues in the current implementation AFAICS:
- The ability to grab all keys of an index is missing, which would be needed for the newish index iterator API. I always had the feeling that API might come back to bite us at some point...
- Index keys are limited to strings whereas we currently have others too, but then all the actually interesting indexes have string keys, and we might well be able just to eliminate the others (or convert the data into strings)

BTW shouldn't those h2be() and be2h() calls be htonl() and ntohl() instead? The idea seems to be keeping the database and indexes in big-endian, i.e. network byte order (which is good IMO), but currently it's unconditionally byteswapping, so big-endian systems would have the db's in little-endian format and little-endian systems in big-endian. Or am I totally missing something here?

- Panu -

___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint
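For reference, the rebuild-and-rename() resize mentioned in the message above could be sketched roughly as follows. This is not code from rpmpkg or rpmidx: the function name replace_file_atomically() and the ".new" suffix are made up for illustration, and a real implementation would also have to deal with locking and with the mmap'ed old file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread's code): replace `path` with
 * `len` bytes from `data` by writing a sibling file and rename()ing it
 * over the original. Returns 0 on success, -1 on failure. */
int replace_file_atomically(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Write everything, coping with short writes. */
    const char *p = data;
    size_t left = len;
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w <= 0)
            goto fail;
        p += w;
        left -= (size_t)w;
    }

    /* Flush the new contents to disk before they become visible. */
    if (fsync(fd) != 0)
        goto fail;
    if (close(fd) != 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    /* The switch itself: the old file is replaced in a single step. */
    if (rename(tmp, path) != 0)
        goto fail;
    return 0;

fail:
    if (fd >= 0)
        close(fd);
    unlink(tmp);
    return -1;
}
```

With this approach the old file is never overwritten in place, which is the "atomic switch to the new data" property the message asks for; the price is that the whole database is written again on every resize.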
Re: [Rpm-maint] FSM hooks for rpm plugin
On 03/13/2013 01:08 PM, Reshetova, Elena wrote:
>>> Do you want to do the changes? I can also try to do it tomorrow if there aren't objections.
>> I probably should merge (at least some of) the study and link count patches first, as those change the landscape quite a bit. I'll try to do that as soon as the caffeine kicks in for good.
> Sure, I will wait for changes.
>> On a somewhat related note, I'm pondering about changing fsm to do staged removals too, i.e. rename before actually removing. It doesn't make much difference as things are now, but I've also started seriously thinking about changing the fsm to the model we discussed earlier where unpacking and setting permissions etc. is first done for all files, and only if that succeeds completely we actually commit to renaming them all to the final target, and undo the whole lot if anything in unpacking failed.
> I think this would be the safest way, not only from a security but also from a correctness point of view, and it also makes installation more robust in case of sudden power cuts etc.

Indeed. The way rpm currently behaves on failure is just plain embarrassing.

>> ...which of course would actually fundamentally change the landscape again: if commit is changed to consist only of renaming a file, then commit hooks would no longer be the right place to do security labeling etc. Argh! :) In that model we'd be back to the set metadata hook, or actually two of them to preserve the possibility of doing something after rpm did its own business. And in that model, both pre and post metadata hooks should get the temp and final path as separate arguments.
> Yeah, but I guess maybe we can first finish with the current system and check that it works for whatever test cases we have (I can start using the new hooks in the msm plugin) and then change it when you move rpm to a new fsm model. I think this would be a big change for fsm, so it won't be possible to do it fast anyway.

Sure, I'm not suggesting delaying everything until I someday get around to fixing it, just that we could try thinking ahead for that model to hopefully avoid having to change the plugin interfaces later.

I pushed a bunch of fsm changes yesterday, the two more interesting ones that we already talked about being:
1) Reflect the hardlink count in st_nlink so the real files vs hardlinks can be easily detected
2) Set permissions before committing to the rename to final destination.

With 2) in place, we might be able to model the hooks in a way that doesn't require changing later. The question (again) just is, what the hooks should actually be.

I think we'd want those pre- and post-commit hooks in any case: for example a %config versioning system plugin would want to know whether a file is being replaced and if it actually succeeded. The pre-commit hook could of course be used for setting additional permissions, content checking etc. as well, but in the alleged new model of unpack + set permissions on all files first and only then commit, I think one would want to abort the whole thing as early as possible. Not that it matters all that much if we really are able to undo the whole thing.

So I guess we'll just go with the pre- and post-commit hooks for now to be able to move forward with this. At least no-one can say this hasn't been thoroughly discussed :)

- Panu -
Re: [Rpm-maint] Rpm Database musings
On Thu, Mar 14, 2013 at 10:55:07AM +0200, Panu Matilainen wrote:
> Yup, detecting and automatically regenerating out-of-sync indexes is pretty much a must (yet something we currently don't have either, sigh)
>
> Some other issues in the current implementation AFAICS:
> - The ability to grab all keys of an index is missing, which would be needed for the newish index iterator API. I always had the feeling that API might come back to bite us at some point...

I already added both rpmidxList() and rpmpkgList() last night. ;)

> - Index keys are limited to strings whereas we currently have others too, but then all the actually interesting indexes have string keys, and we might well be able just to eliminate the others (or convert the data into strings)

Yes, I noticed that after checking rpm's current database code. I can easily switch the rpmidx functions to use binary keys if you like, it just makes the rpmidxList function a bit awkward as it can no longer return an array of strings.

> BTW shouldn't those h2be() and be2h() calls be htonl() and ntohl() instead?

Yes, we could use those instead. I just didn't like to include the arpa/inet.h header file, it kinda felt wrong. There's also htobe32/be32toh in endian.h if we define _BSD_SOURCE; that seems to be a better choice. As I wasn't sure what to do I decided to postpone the issue by using my own inline functions for now ;)

> The idea seems to be keeping the database and indexes in big-endian, i.e. network byte order (which is good IMO), but currently it's unconditionally byteswapping, so big-endian systems would have the db's in little-endian format and little-endian systems in big-endian. Or am I totally missing something here?

Yes, the code always uses big endian. It doesn't unconditionally swap. (It also does unaligned reads/writes, but we don't really need that.)
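
To make the byte-order point concrete, here is a minimal sketch of helpers that always produce big-endian data without swapping conditionally. The names h2be() and be2h() come from the thread, but the signatures and bodies below are an assumption, not the actual rpmidx code.

```c
#include <stdint.h>

/* Assumed shape of the helpers; only the names appear in the thread. */

/* Store x at p in big-endian order, one byte at a time. Because the
 * bytes are produced by shifting, the result is the same on any host
 * byte order, and p does not need to be aligned. */
static inline void h2be(uint32_t x, unsigned char *p)
{
    p[0] = (unsigned char)(x >> 24);
    p[1] = (unsigned char)(x >> 16);
    p[2] = (unsigned char)(x >> 8);
    p[3] = (unsigned char)x;
}

/* Read a big-endian 32-bit value from p (alignment not required). */
static inline uint32_t be2h(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

Written this way there is no "do nothing on big endian" special case, which may be why the code can look like an unconditional swap at first glance.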

Coming back to automatic regeneration of out-of-sync indexes, there's still another way to do the implementation: keep those indexes in memory and don't store them to disk at all. This means that the indexes need to be generated on the fly at first access by reading all headers; it thus means we need to additionally store a stripped version of each header that just contains the interesting bits.

Advantages:
- just one single database file
- no out-of-sync indexes possible

Disadvantage:
- needs a bit of time to generate the in-core indexes

For my system (2102 installed rpms) the stripped headers would be about 2.2 MBytes to read; that takes about .34 seconds with my slow disk and dropped caches, which is quite noticeable.

Cheers,
  Michael.

--
Michael Schroeder m...@suse.de
SUSE LINUX Products GmbH, GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
Re: [Rpm-maint] FSM hooks for rpm plugin
> Sure, I'm not suggesting delaying everything until I someday get around to fixing it, just that we could try thinking ahead for that model to hopefully avoid having to change the plugin interfaces later.
>
> I pushed a bunch of fsm changes yesterday, the two more interesting ones that we already talked about being:
> 1) Reflect the hardlink count in st_nlink so the real files vs hardlinks can be easily detected
> 2) Set permissions before committing to the rename to final destination.
>
> With 2) in place, we might be able to model the hooks in a way that doesn't require changing later. The question (again) just is, what the hooks should actually be.
>
> I think we'd want those pre- and post-commit hooks in any case: for example a %config versioning system plugin would want to know whether a file is being replaced and if it actually succeeded. The pre-commit hook could of course be used for setting additional permissions, content checking etc as well, but in the alleged new model of unpack + set permissions on all files first and only then commit, I think one would want to abort the whole thing as early as possible. Not that it matters all that much if we really are able to undo the whole thing.
>
> So I guess we'll just go with the pre- and post-commit hooks for now to be able to move forward with this. At least no-one can say this hasn't been thoroughly discussed :)

I just went through your yesterday's changes. I think it now slowly falls together nicely. I think it is right that we need pre and post fsm hooks because even if we were able to unpack everything and successfully set all permissions on all files in tmp location, it isn't a guarantee that committing the whole thing to the final location would be successful. It is always possible that preserving security labels might fail or anything else might happen.

And when you change the fsm model to the new one, we can just add a new hook that would be called after each file is unpacked to the tmp location: this would be the primary hook for setting additional metadata on a file and a good time to scan the content of a file, too (so that we can revert the whole thing and delete a file if malware is found in it).

The only thing I can't find so far is a use for the pre-commit hook in the future: it would be called in much the same context (file is unpacked in the tmp dir) as the future metadata/content screening hook.

One idea can be that for security needs, plugins can actually use pre- and post hooks to verify that permissions were preserved (and set) correctly and abort if they see some mismatch. But maybe this is too paranoid again :)

Best Regards,
Elena.
Re: [Rpm-maint] Rpm Database musings
On 03/14/2013 01:10 PM, Michael Schroeder wrote:
> On Thu, Mar 14, 2013 at 10:55:07AM +0200, Panu Matilainen wrote:
>> Yup, detecting and automatically regenerating out-of-sync indexes is pretty much a must (yet something we currently don't have either, sigh)
>>
>> Some other issues in the current implementation AFAICS:
>> - The ability to grab all keys of an index is missing, which would be needed for the newish index iterator API. I always had the feeling that API might come back to bite us at some point...
>
> I already added both rpmidxList() and rpmpkgList() last night. ;)

Ok, good :)

>> - Index keys are limited to strings whereas we currently have others too, but then all the actually interesting indexes have string keys, and we might well be able just to eliminate the others (or convert the data into strings)
>
> Yes, I noticed that after checking rpm's current database code. I can easily switch the rpmidx functions to use binary keys if you like, it just makes the rpmidxList function a bit awkward as it can no longer return an array of strings.

I think strings are fine, just thought to note that there are those couple of non-string indexes which we need to do something about. Sigmd5 is probably better just axed, Installtid we might want to keep but that can just as well be converted into a string.

>> BTW shouldn't those h2be() and be2h() calls be htonl() and ntohl() instead?
>
> Yes, we could use those instead. I just didn't like to include the arpa/inet.h header file, it kinda felt wrong. There's also htobe32/be32toh in endian.h if we define _BSD_SOURCE; that seems to be a better choice. As I wasn't sure what to do I decided to postpone the issue by using my own inline functions for now ;)

Heh. Including arpa/inet.h for non-networking purposes does indeed feel a bit odd, but that's likely the standard and portably correct way of doing endian conversions, which at least in glibc are system-optimized as well. endian.h is apparently not very standard.

Hmm... rpm seems to include netinet/in.h directly, which works with glibc but is not what standards and man pages say about htonl() and friends.

>> The idea seems to be keeping the database and indexes in big-endian, i.e. network byte order (which is good IMO), but currently it's unconditionally byteswapping, so big-endian systems would have the db's in little-endian format and little-endian systems in big-endian. Or am I totally missing something here?
>
> Yes, the code always uses big endian. It doesn't unconditionally swap. (It also does unaligned reads/writes, but we don't really need that.)

Ok. I'm not having one of my brightest days apparently ;) Guess I was expecting to see those "on big endian do nothing" ifdefs in there.

> Coming back to automatic regeneration of out-of-sync indexes, there's still another way to do the implementation: keep those indexes in memory and don't store them to disk at all. This means that the indexes need to be generated on the fly at first access by reading all headers; it thus means we need to additionally store a stripped version of each header that just contains the interesting bits.
>
> Advantages:
> - just one single database file
> - no out-of-sync indexes possible
>
> Disadvantage:
> - needs a bit of time to generate the in-core indexes
>
> For my system (2102 installed rpms) the stripped headers would be about 2.2 MBytes to read; that takes about .34 seconds with my slow disk and dropped caches, which is quite noticeable.

Yeah, it seems pretty heavy for simple operations. OTOH it wouldn't hurt to have such a mode: for example if we notice indexes are corrupt/out-of-sync but we don't have the permissions to regenerate the on-disk files, it could fall back to in-memory indexes to get correct results even if it's slightly slower.

What I've had in mind is lumping all the index stuff (possibly along with actual data for the critical parts) into a single file so there'd be just two db-related files to worry about.

But for now, I'm just happy to have an alternative implementation for the pkgs + index databases to play around with :)

- Panu -
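
The fallback idea from this message, combined with the so far unused generation count mentioned earlier in the thread, could be expressed as a small decision function. This is only a sketch under assumptions: the thread does not say where the counters live or how they are compared, and index_mode and choose_index_mode() are invented names.

```c
#include <stdint.h>

/* Hypothetical ways of serving index lookups. */
enum index_mode {
    INDEX_ON_DISK,    /* on-disk index matches the package database */
    INDEX_REBUILD,    /* out of sync, and we may regenerate the files */
    INDEX_IN_MEMORY   /* out of sync, read-only: build a private copy */
};

/* Assumption: both the package database and the index store a generation
 * counter that is bumped on every committed update, so a mismatch means
 * an update crashed or the index is otherwise stale. */
enum index_mode choose_index_mode(uint32_t pkg_generation,
                                  uint32_t idx_generation,
                                  int can_write)
{
    if (pkg_generation == idx_generation)
        return INDEX_ON_DISK;
    if (can_write)
        return INDEX_REBUILD;
    return INDEX_IN_MEMORY;
}
```

The in-memory branch is the slower path described above (reading all the stripped headers), so it would only be taken when regenerating the on-disk index is not possible.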
Re: [Rpm-maint] FSM hooks for rpm plugin
On 03/14/2013 03:01 PM, Reshetova, Elena wrote:
>> Sure, I'm not suggesting delaying everything until I someday get around to fixing it, just that we could try thinking ahead for that model to hopefully avoid having to change the plugin interfaces later.
>>
>> I pushed a bunch of fsm changes yesterday, the two more interesting ones that we already talked about being:
>> 1) Reflect the hardlink count in st_nlink so the real files vs hardlinks can be easily detected
>> 2) Set permissions before committing to the rename to final destination.
>>
>> With 2) in place, we might be able to model the hooks in a way that doesn't require changing later. The question (again) just is, what the hooks should actually be.
>>
>> I think we'd want those pre- and post-commit hooks in any case: for example a %config versioning system plugin would want to know whether a file is being replaced and if it actually succeeded. The pre-commit hook could of course be used for setting additional permissions, content checking etc as well, but in the alleged new model of unpack + set permissions on all files first and only then commit, I think one would want to abort the whole thing as early as possible. Not that it matters all that much if we really are able to undo the whole thing.
>>
>> So I guess we'll just go with the pre- and post-commit hooks for now to be able to move forward with this. At least no-one can say this hasn't been thoroughly discussed :)
>
> I just went through your yesterday's changes. I think it now slowly falls together nicely. I think it is right that we need pre and post fsm hooks because even if we were able to unpack everything and successfully set all permissions on all files in tmp location, it isn't a guarantee that committing the whole thing to the final location would be successful. It is always possible that preserving security labels might fail or anything else might happen.
>
> And when you change the fsm model to the new one, we can just add a new hook that would be called after each file is unpacked to the tmp location: this would be the primary hook for setting additional metadata on a file and a good time to scan the content of a file, too (so that we can revert the whole thing and delete a file if malware is found in it).

Yup, and it's this part I'm still pondering about: should we just add that post-unpack hook (or whatever it's called) already and go with that for SELinux and all from the start, because that's what they really want. That's kinda what the setmetadata hook idea was, but perhaps a more generic name would be appropriate now that it wouldn't be limited to that. Maybe something like file pre- and post-process, which can cover metadata, malware scanning and whatnot, both for install and erase (which needs the perhaps otherwise unnecessary pre-hook).

> The only thing I can't find so far is a use for the pre-commit hook in the future: it would be called in much the same context (file is unpacked in the tmp dir) as the future metadata/content screening hook.
>
> One idea can be that for security needs, plugins can actually use pre- and post hooks to verify that permissions were preserved (and set) correctly and abort if they see some mismatch. But maybe this is too paranoid again :)

That's perhaps slightly paranoid, yes :) But then it's not my job to say what the plugins should be used for... for some purposes having yet another layer of verification might be reasonable.

One use-case I see for the pre- and post-commit hooks is a plugin keeping track of config file contents: in pre-commit it would stage the change of content (think of git apply), and in post-commit it would either commit or abort (think of git commit or git reset --hard) depending on whether fsm commit succeeded or not.

- Panu -
Re: [Rpm-maint] Rpm Database musings
On Thu, Mar 14, 2013 at 03:33:44PM +0200, Panu Matilainen wrote:
> On 03/14/2013 01:10 PM, Michael Schroeder wrote:
>> On Thu, Mar 14, 2013 at 10:55:07AM +0200, Panu Matilainen wrote:
>>> Yup, detecting and automatically regenerating out-of-sync indexes is pretty much a must (yet something we currently don't have either, sigh)
>>>
>>> Some other issues in the current implementation AFAICS:
>>> - The ability to grab all keys of an index is missing, which would be needed for the newish index iterator API. I always had the feeling that API might come back to bite us at some point...
>>
>> I already added both rpmidxList() and rpmpkgList() last night. ;)
>
> Ok, good :)

I've set up a repo on github so I don't need to send tarballs to the mailing list anymore:

git://github.com/mlschroe/newrpmdb.git

Panu, do you have a github account so that I can add you as collaborator?

Cheers,
  Michael.