Re: [Rpm-maint] Rpm Database musings

2013-03-14 Thread Panu Matilainen

On 03/13/2013 03:19 PM, Michael Schroeder wrote:

On Fri, Mar 08, 2013 at 03:37:12PM +0100, Michael Schroeder wrote:

I kind of like to have all the data in one file.

Anyway, attached is a little Packages database implementation I did yesterday
and today.


Attached is the current version of my little experiments. The main
changes are:

- I switched to adler32 instead of md5sum
- I added a little index database implementation, rpmidx.[ch]
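
For reference, Adler-32 (defined in RFC 1950) is a cheap, non-cryptographic checksum, which is presumably why it is attractive here compared to md5sum. The attached code is not shown in this thread, so the following is only a minimal, unoptimized sketch of the checksum itself; the name adler32_simple is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest prime below 2^16, the modulus used by Adler-32 (RFC 1950). */
#define ADLER_MOD 65521u

/* Compute the Adler-32 checksum of len bytes at buf.
 * Hypothetical helper, not taken from the attached implementation.
 * Reduces modulo 65521 after every byte, which is simple but slow. */
uint32_t adler32_simple(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;   /* running sum of bytes, starting at 1 */
        b = (b + a) % ADLER_MOD;        /* running sum of the a values */
    }
    return (b << 16) | a;
}
```

An empty input yields 1, and the ASCII string "Wikipedia" yields 0x11E60398.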


Oh, awesome. I was quietly hoping you might do a proof-of-concept index 
(database) implementation too, and here we are :) Haven't looked deeply 
into it yet, but in any case with an actual alternative implementation 
it'll be much easier to work towards a backend abstraction in the rpmdb 
layer, and actually be able to test it.




The index database is using mmap to map the database into memory.
It uses the main rpmpkg database for locking.

Performance and database sizes seem to be promising.

Things I'm not happy about:

- resizing currently works by rebuilding a new database and
   calling rename(). I can change this to be inplace, though,
   it just makes the code a little bit slower because I don't
   want to simply overwrite the old data. I basically want an
   atomic switch to the new data.

- The generation count in idxdb is currently not used. My goal
   is to detect crashed database updates somehow.
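
The rebuild-and-rename() approach described in the first item can be sketched roughly as follows. This is a generic illustration under assumptions, not the code from the attachment: the function name and the ".tmp" suffix are hypothetical, and retrying on EINTR is left out for brevity.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write len bytes to "<path>.tmp", flush them to disk, then rename() the
 * temporary file over path. Returns 0 on success and -1 on failure, in which
 * case the old file at path is left untouched. Hypothetical helper. */
int replace_file_atomically(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    const char *p = data;
    int fd, n;

    assert(path != NULL && (data != NULL || len == 0));
    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0)
            goto fail;
        p += w;
        len -= (size_t)w;
    }
    /* Make sure the new data is on disk before it becomes visible. */
    if (fsync(fd) != 0)
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }
    /* rename() replaces path atomically: a reader opening path gets either
     * the complete old file or the complete new one. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

The switch itself is a single rename() call, which is what gives the "atomic switch to the new data" property; an in-place resize would have to get the same guarantee by other means.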


Yup, detecting and automatically regenerating out-of-sync indexes is 
pretty much a must (yet something we currently don't have either, sigh).


Some other issues in the current implementation AFAICS:
- The ability to grab all keys of an index is missing, which would be 
needed for the newish index iterator API. I always had the feeling that 
API might come back to bite us at some point...
- Index keys are limited to strings whereas we currently have others 
too, but then all the actually interesting indexes have string keys, and 
we might well be able just to eliminate the others (or convert the data 
into strings)


BTW shouldn't those h2be() and be2h() calls be htonl() and ntohl() 
instead? The idea seems to be keeping the database and indexes in 
big-endian, ie network byte order (which is good IMO), but currently it's 
unconditionally byteswapping so big-endian systems would have the db's in 
little endian format and little endian systems in big endian. Or am I 
totally missing something here?


- Panu -
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] FSM hooks for rpm plugin

2013-03-14 Thread Panu Matilainen

On 03/13/2013 01:08 PM, Reshetova, Elena wrote:



Do you want to do the changes? I can also try to do it tomorrow if
there aren't any objections.



I probably should merge (at least some of) the study and link count patches
first, as those change the landscape quite a bit. I'll try to do that as soon
as the caffeine kicks in for good.


Sure, I will wait for changes.


On a somewhat related note, I'm pondering about changing fsm to do staged
removals too, ie rename before actually removing. It doesn't make much
difference as things are now, but I've also started seriously thinking about
changing the fsm to the model we discussed earlier where unpacking and
setting permissions etc is first done for all files, and only if that
succeeds completely we actually commit to renaming them all to the final
target, and undo the whole lot if anything in unpacking failed.


I think this would be the safest way, not only for security but also for
correctness, and it also makes installation more robust in case of sudden
power cuts etc.


Indeed. The way rpm currently behaves on failure is just plain embarrassing.




...which of course would actually fundamentally change the landscape
again: if commit is changed to consist only of renaming a file, then commit
hooks would no longer be the right place to do security labeling etc. Argh! :)
In that model we'd be back to the set metadata hook, or actually two of
them to preserve the possibility of doing something after rpm did its own
business. And in that model, both pre and post metadata hooks should get the
temp and final path as separate arguments.


Yeah, but I guess maybe we can first finish with the current system and check
that it works for whatever test cases we have (I can start using the new hooks
in the msm plugin) and then change it when you move rpm to a new fsm model. I
think this would be a big change for fsm, so it won't be possible to do it
quickly anyway.


Sure, I'm not suggesting delaying everything until I someday get around 
to fixing it, just that we could try thinking ahead for that model to 
hopefully avoid having to change the plugin interfaces later. I pushed a 
bunch of fsm changes yesterday, the two more interesting ones that we 
already talked about being:


1) Reflect the hardlink count in st_nlink so the real files vs hardlinks 
can be easily detected


2) Set permissions before committing to the rename to final destination.

With 2) in place, we might be able to model the hooks in a way that 
doesn't require changing later. The question (again) just is, what the 
hooks should actually be.


I think we'd want those pre- and post-commit hooks in any case: for 
example a %config versioning system plugin would want to know whether a 
file is being replaced and if it actually succeeded. The pre-commit hook 
could of course be used for setting additional permissions, content 
checking etc as well, but in the alleged new model of unpack + set 
permissions on all files first and only then commit, I think one would 
want to abort the whole thing as early as possible.


Not that it matters all that much if we really are able to undo the 
whole thing. So I guess we'll just go with the pre- and post-commit 
hooks for now to be able to move forward with this. At least no-one can 
say this hasn't been thoroughly discussed :)


- Panu -



Re: [Rpm-maint] Rpm Database musings

2013-03-14 Thread Michael Schroeder
On Thu, Mar 14, 2013 at 10:55:07AM +0200, Panu Matilainen wrote:
 Yup, detecting and automatically regenerating out-of-sync indexes is pretty 
 much a must (yet something we currently don't have either, sigh)

 Some other issues in the current implementation AFAICS:
 - The ability to grab all keys of an index is missing, which would be 
 needed for the newish index iterator API. I always had the feeling that API 
 might come back to bite us at some point...

I already added both rpmidxList() and rpmpkgList() last night. ;)

 - Index keys are limited to strings whereas we currently have others too, 
 but then all the actually interesting indexes have string keys, and we 
 might well be able just to eliminate the others (or convert the data into 
 strings)

Yes, I noticed that after checking rpm's current database code. I can
easily switch the rpmidx functions to use binary as keys if you like,
it just makes the rpmidxList function a bit awkward as it can no longer
return an array of strings.

 BTW shouldn't those h2be() and be2h() calls be htonl() and ntohl() instead? 

Yes, we could use those instead. I just didn't like to include the
arpa/inet.h header file, it kinda felt wrong.
There's also htobe32/be32toh in endian.h if we define _BSD_SOURCE; that
seems to be a better choice.
As I wasn't sure what to do I decided to postpone the issue by using
my own inline functions for now ;)

 The idea seems to be keeping the database and indexes in big-endian, ie 
 network byte order (which is good IMO), but currently it's unconditionally 
 byteswapping so big-endian systems would have the db's in little endian 
 format and little endian systems in big endian. Or am I totally missing 
 something here?

Yes, the code always uses big endian. It doesn't unconditionally swap.
(It also does unaligned reads/writes, but we don't really need that.)


Coming back to automatically regenerating out-of-sync indexes, there's
still another way to do the implementation: keep those indexes in memory
and don't store them to disk at all.
This means that the indexes need to be generated on the fly at first
access by reading all headers, which in turn means we need to additionally
store a stripped version of each header that just contains the interesting
bits.

Advantages:
- just one single database file
- no out-of-sync indexes possible

Disadvantage:
- needs a bit of time to generate the in-core indexes

For my system (2102 installed rpms) the stripped headers would be
about 2.2 MBytes to read; that takes about .34 seconds with my slow
disk and dropped caches, which is quite noticeable.

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX Products GmbH,  GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-maint] FSM hooks for rpm plugin

2013-03-14 Thread Reshetova, Elena
Sure, I'm not suggesting delaying everything until I someday get around to 
fixing it, just that we could try thinking ahead for that model to hopefully 
avoid having to change the plugin interfaces later. I pushed a bunch of fsm 
changes yesterday, the two more interesting ones that we already talked 
about being:

1) Reflect the hardlink count in st_nlink so the real files vs hardlinks can 
be easily detected

2) Set permissions before committing to the rename to final destination.

With 2) in place, we might be able to model the hooks in a way that doesn't 
require changing later. The question (again) just is, what the hooks should 
actually be.

I think we'd want those pre- and post-commit hooks in any case: for example a 
%config versioning system plugin would want to know whether a file is being 
replaced and if it actually succeeded. The pre-commit hook could of course be 
used for setting additional permissions, content checking etc as well, but 
in the alleged new model of unpack + set permissions on all files first and 
only then commit, I think one would want to abort the whole thing as early as 
possible.

Not that it matters all that much if we really are able to undo the whole 
thing. So I guess we'll just go with the pre- and post-commit hooks for now 
to be able to move forward with this. At least no-one can say this hasn't 
been thoroughly discussed :)

I just went through yesterday's changes. I think it is now slowly falling 
into place nicely. I think it is right that we need pre and post fsm hooks, 
because even if we were able to unpack everything and successfully set all 
permissions on all files in the tmp location, there is no guarantee that 
committing the whole thing to the final location would be successful. It is 
always possible that preserving security labels might fail or anything else 
might happen. And when you change the fsm to the new model, we can just add a 
new hook that would be called after each file is unpacked to the tmp location: 
this would be the primary hook for setting additional metadata on a file and a 
good time to scan the content of the file, too (so that we can revert the 
whole thing and delete the file if malware is found in it). The only thing I 
can't find a use for so far is the pre-commit hook in the future model: it 
would be called in pretty much the same context (file is unpacked in the tmp 
dir) as the future metadata/content screening hook. One idea is that for 
security needs, plugins can actually use the pre- and post hooks to verify 
that permissions were preserved (and set) correctly and abort if they see some 
mismatch. But maybe this is too paranoid again :)


Best Regards,
Elena.




Re: [Rpm-maint] Rpm Database musings

2013-03-14 Thread Panu Matilainen

On 03/14/2013 01:10 PM, Michael Schroeder wrote:

On Thu, Mar 14, 2013 at 10:55:07AM +0200, Panu Matilainen wrote:

Yup, detecting and automatically regenerating out-of-sync indexes is pretty
much a must (yet something we currently don't have either, sigh)

Some other issues in the current implementation AFAICS:
- The ability to grab all keys of an index is missing, which would be
needed for the newish index iterator API. I always had the feeling that API
might come back to bite us at some point...


I already added both rpmidxList() and rpmpkgList() last night. ;)


Ok, good :)


- Index keys are limited to strings whereas we currently have others too,
but then all the actually interesting indexes have string keys, and we
might well be able just to eliminate the others (or convert the data into
strings)


Yes, I noticed that after checking rpm's current database code. I can
easily switch the rpmidx functions to use binary as keys if you like,
it just makes the rpmidxList function a bit awkward as it can no longer
return an array of strings.


I think strings are fine, just thought to note that there are those 
couple of non-string indexes which we need to do something about. Sigmd5 
is probably better just axed, Installtid we might want to keep but that 
can just as well be converted into a string.
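
The thread does not say which string representation would be used for Installtid. One possible option, shown here purely as an assumption, is a fixed-width, zero-padded hex string, which keeps the lexical order of the keys identical to the numeric order of the ids. The function names are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Format a 32-bit transaction id as an 8-digit, zero-padded hex key.
 * buf must hold at least 9 bytes. Returns 0 on success, -1 if buf is
 * too small. Hypothetical helper for illustration only. */
int tid_to_key(char *buf, size_t size, uint32_t tid)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, size, "%08" PRIx32, tid);
    return (n == 8 && (size_t)n < size) ? 0 : -1;
}

/* Convert such a key back into the numeric id. */
uint32_t key_to_tid(const char *key)
{
    assert(key != NULL);
    return (uint32_t)strtoul(key, NULL, 16);
}
```

With this encoding an id of 1 becomes "00000001", so a string-only index can still be walked in id order.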



BTW shouldn't those h2be() and be2h() calls be htonl() and ntohl() instead?


Yes, we could use those instead. I just didn't like to include the
arpa/inet.h header file, it kinda felt wrong.
There's also htobe32/be32toh in endian.h if we define _BSD_SOURCE; that
seems to be a better choice.
As I wasn't sure what to do I decided to postpone the issue by using
my own inline functions for now ;)


Heh. Including arpa/inet.h for non-networking purposes does indeed 
feel a bit odd, but that's likely the standard and portably correct 
way of doing endian conversions, which at least in glibc are 
system-optimized as well. endian.h is apparently not very standard.


Hmm... rpm seems to include netinet/in.h directly, which works with 
glibc but is not what standards and man pages say about htonl() and friends.
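
For comparison, this is roughly what using the standard functions looks like. POSIX declares htonl() and ntohl() in arpa/inet.h; the write_field and read_field wrappers are hypothetical names for this sketch, and memcpy() is used so the field does not have to be aligned.

```c
#include <arpa/inet.h>  /* htonl(), ntohl() as specified by POSIX */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Serialize a host-order value into a 4-byte big-endian on-disk field. */
void write_field(unsigned char out[4], uint32_t host_value)
{
    uint32_t be = htonl(host_value);    /* no-op on big-endian hosts */

    assert(out != NULL);
    memcpy(out, &be, sizeof(be));
}

/* Read a 4-byte big-endian on-disk field back into host order. */
uint32_t read_field(const unsigned char in[4])
{
    uint32_t be;

    assert(in != NULL);
    memcpy(&be, in, sizeof(be));
    return ntohl(be);
}
```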





The idea seems to be keeping the database and indexes in big-endian, ie
network byte order (which is good IMO), but currently it's unconditionally
byteswapping so big-endian systems would have the db's in little endian
format and little endian systems in big endian. Or am I totally missing
something here?


Yes, the code always uses big endian. It doesn't unconditionally swap.
(It also does unaligned reads/writes, but we don't really need that.)


Ok. I'm not having one of my brightest days apparently ;)
Guess I was expecting to see those "do nothing on big endian" ifdefs in 
there.




Coming back to automatically regenerating out-of-sync indexes, there's
still another way to do the implementation: keep those indexes in memory
and don't store them to disk at all.
This means that the indexes need to be generated on the fly at first
access by reading all headers, which in turn means we need to additionally
store a stripped version of each header that just contains the interesting
bits.

Advantages:
- just one single database file
- no out-of-sync indexes possible

Disadvantage:
- needs a bit of time to generate the in-core indexes

For my system (2102 installed rpms) the stripped headers would be
about 2.2 MBytes to read; that takes about .34 seconds with my slow
disk and dropped caches, which is quite noticeable.


Yeah, it seems pretty heavy for simple operations. OTOH it wouldn't hurt 
to have such a mode: for example if we notice indexes are 
corrupt/out-of-sync but we don't have the permissions to regenerate the 
on-disk files, it could fall back to in-memory indexes to get correct 
results even if it's slightly slower.


What I've had in mind is lumping all the index stuff (possibly along 
with actual data for the critical parts) into a single file so there'd 
be just two db-related files to worry about. But for now, I'm just 
happy to have an alternative implementation for the pkgs + index 
databases to play around with :)


- Panu -



Re: [Rpm-maint] FSM hooks for rpm plugin

2013-03-14 Thread Panu Matilainen

On 03/14/2013 03:01 PM, Reshetova, Elena wrote:

Sure, I'm not suggesting delaying everything until I someday get around to
fixing it, just that we could try thinking ahead for that model to hopefully
avoid having to change the plugin interfaces later. I pushed a bunch of fsm
changes yesterday, the two more interesting ones that we already talked
about being:



1) Reflect the hardlink count in st_nlink so the real files vs hardlinks can
be easily detected



2) Set permissions before committing to the rename to final destination.



With 2) in place, we might be able to model the hooks in a way that doesn't
require changing later. The question (again) just is, what the hooks should
actually be.



I think we'd want those pre- and post-commit hooks in any case: for example a
%config versioning system plugin would want to know whether a file is being
replaced and if it actually succeeded. The pre-commit hook could of course be
used for setting additional permissions, content checking etc as well, but
in the alleged new model of unpack + set permissions on all files first and
only then commit, I think one would want to abort the whole thing as early as
possible.



Not that it matters all that much if we really are able to undo the whole
thing. So I guess we'll just go with the pre- and post-commit hooks for now
to be able to move forward with this. At least no-one can say this hasn't
been thoroughly discussed :)


I just went through yesterday's changes. I think it is now slowly falling
into place nicely. I think it is right that we need pre and post fsm hooks,
because even if we were able to unpack everything and successfully set all
permissions on all files in the tmp location, there is no guarantee that
committing the whole thing to the final location would be successful. It is
always possible that preserving security labels might fail or anything else
might happen. And when you change the fsm to the new model, we can just add a
new hook that would be called after each file is unpacked to the tmp location:
this would be the primary hook for setting additional metadata on a file and a
good time to scan the content of the file, too (so that we can revert the
whole thing and delete the file if malware is found in it).


Yup, and it's this part I'm still pondering about: should we just add 
that post-unpack hook (or whatever it's called) already and go with that 
for SELinux and all from the start, because that's what they really 
want? That's kinda what the setmetadata hook idea was, but perhaps a more 
generic name would be appropriate now that it wouldn't be limited to 
that. Maybe something like file pre- and post-process, which can cover 
metadata, malware scanning and whatnot, both for install and erase 
(which needs the perhaps otherwise unnecessary pre-hook).



The only thing I can't find a use for so far is the pre-commit hook in the
future model: it would be called in pretty much the same context (file is
unpacked in the tmp dir) as the future metadata/content screening hook. One
idea is that for security needs, plugins can actually use the pre- and post
hooks to verify that permissions were preserved (and set) correctly and abort
if they see some mismatch. But maybe this is too paranoid again :)


That's perhaps slightly paranoid, yes :) But then it's not my job to say 
what the plugins should be used for... for some purposes having yet another 
layer of verification might be reasonable.


One use-case I see for the pre- and post-commit hooks is a plugin 
keeping track of config file contents: in pre-commit it would stage the 
change of content (think of git apply), and in post-commit it would 
either commit or abort (think of git commit or git reset --hard) 
depending on whether fsm commit succeeded or not.
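
The config-tracking use case above can be sketched with a pair of callbacks. Everything in this example is hypothetical: the hook table, its signatures and the toy tracker are made up for illustration and are not rpm's plugin interface, which was still being designed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook table; not rpm's actual plugin API. */
struct commit_hooks {
    int  (*pre_commit)(void *state, const char *path);
    void (*post_commit)(void *state, const char *path, int commit_rc);
};

/* Toy config tracker that only counts what happened to each change. */
struct tracker {
    int staged;     /* changes prepared but not yet resolved */
    int committed;  /* changes kept because the fsm commit succeeded */
    int aborted;    /* changes dropped because the fsm commit failed */
};

/* Pre-commit: stage the change (comparable to "git apply"). */
static int tracker_pre(void *state, const char *path)
{
    struct tracker *t = state;
    (void)path;
    t->staged++;
    return 0;
}

/* Post-commit: keep or drop the staged change depending on the result
 * (comparable to "git commit" versus "git reset --hard"). */
static void tracker_post(void *state, const char *path, int commit_rc)
{
    struct tracker *t = state;
    (void)path;
    t->staged--;
    if (commit_rc == 0)
        t->committed++;
    else
        t->aborted++;
}

static const struct commit_hooks tracker_hooks = { tracker_pre, tracker_post };

/* Stand-ins for the fsm's own commit step (the rename to the final path). */
int commit_ok(const char *path)   { (void)path; return 0; }
int commit_fail(const char *path) { (void)path; return -1; }

/* Run one file commit wrapped in the hooks. A non-zero return from the
 * pre-commit hook vetoes the commit; the post-commit hook always receives
 * the commit's return code. */
int commit_with_hooks(const struct commit_hooks *h, void *state,
                      const char *path, int (*do_commit)(const char *))
{
    int rc;

    assert(h != NULL && do_commit != NULL);
    if (h->pre_commit != NULL && h->pre_commit(state, path) != 0)
        return -1;
    rc = do_commit(path);
    if (h->post_commit != NULL)
        h->post_commit(state, path, rc);
    return rc;
}
```

Passing the commit result to the post hook is the one design point the sketch depends on: without it the plugin could not tell whether to keep or discard what it staged.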


- Panu -


Re: [Rpm-maint] Rpm Database musings

2013-03-14 Thread Michael Schroeder
On Thu, Mar 14, 2013 at 03:33:44PM +0200, Panu Matilainen wrote:
 On 03/14/2013 01:10 PM, Michael Schroeder wrote:
 On Thu, Mar 14, 2013 at 10:55:07AM +0200, Panu Matilainen wrote:
 Yup, detecting and automatically regenerating out-of-sync indexes is pretty
 much a must (yet something we currently don't have either, sigh)

 Some other issues in the current implementation AFAICS:
 - The ability to grab all keys of an index is missing, which would be
 needed for the newish index iterator API. I always had the feeling that API
 might come back to bite us at some point...

 I already added both rpmidxList() and rpmpkgList() last night. ;)

 Ok, good :)

I've set up a repo on github so I don't need to send tarballs to
the mailing list anymore:

git://github.com/mlschroe/newrpmdb.git

Panu, do you have a github account so that I can add you as collaborator?

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX Products GmbH,  GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}