[install-discuss] Re: pdo.c - introduction

Vasiliy Tue, 24 Oct 2006 19:43:15 PDT

Data structures.

For me it is obvious that best representation for patches to do most efficient 
dependency analysis is graph not a table. All relation should be direct link 
from one patch to another with backlinks. It is possible of course to implement 
it based on relational algebra but it will not be efficient - you have to do 
search every time you need follow the require relation or any other relation. I 
need to make this remark, because some people for some reason thought and even 
continue thinking that relational database is solution for all our install 
problems. It was 3 year long investigation not so far ago which conclude that 
relation database repository is the possible best way to keep and analyze 
packaging and patching data. It is possible to implement it this way, but it is 
slow and as we saw on practice, when this database "solution" was introduced in 
S10 development version on analyze for some time - extremely heavyweight in 
terms of disk space, memory and processor power, without real performance 
improvement.


Major requirement from zones is performance. Any installation over zones will 
take as many times as many zones we have and 1 min dependency check time 
acceptable for patching when you have no zones will turn in unacceptable 1 hour 
if you have 60 zones. So to keep installation time reasonable and match 
customer expectation (which is not realistic and usually they expect install 
work for 100 zones same amount of times as for system without zones, at least 
not 100 times longer). So major target was performance and so datastructure as 
well as everything is chosen to provide best performance possible.

What attributes patch has.

Patch_ID - which is patch identifier and consist of patch number and patch 
revision. Due to the lack of design for patches there is no defined syntax for 
Patch_ID and in general it may be any directory name. However, in real life it 
is six digit number and two digit revision. All references to patch in all 
system files are done using Patch_ID. So this is actually unique key - 
identifier for each patch. In Pdo.c it is represented as string all patch_id 
are located in array of strings and on top of this array there is hashtable 
which speed up access to the patch by Patch_ID very good. Much faster then 
index. This days when size of memory is ridiculous and no need to squeeze you 
data in graphic memory (because you memory is only 48K, but there is free 64K 
green bank of graphic memory...) I can use way bigger hashtable index then 
recommended by theory 30% - in pdo it is actually few times more then expected 
load. Hash function is specially created to be fast using addition and shifts 
and gives good dispersion. When I tuned hashtable it gives me something about 1 
patch found from 3rd hit, 5 from second hits and all other few hundreds from 
first hits (of course this is statistical results). I guess I can compete 
indexed table in database with this. May be it is overkill, and more like 
assembler kind of programming, but it was fun for me (like back to late 80th) 
and performs very well.

All the other attributes of patch are in general relations.

Let take REQUIRE relation. In pkginfo it is Key REQUIRE followed by PatchID I 
can of course keep it as an string and any time I need to follow this relation 
do the search over PatchID array - it may take minimal number of comparisons 
etc. But it will be much better if this attribute will directly lead me to 
related patch - no search at all, even most efficient, just get it right away. 
If we have this direct link to required patch we may also need direct link 
back, to the patch which require this one. It is good and handy to have anyway, 
but it also need to be known if we removing patch (again following direct links 
work bit faster then request to relational database). 

All other relation are represented same way - not a PatchIDs but direct links 
to related patches with direct back-links. Those are:

REQUIRE - REQUIRE_BY
INCOMPATIBLE - INCOMPATIBLE_BY
OBSOLETE - OBSOLETE_BY
PREVIOUS - NEXT (revisions)

PREVIOUS - NEXT pair is only single links - other links multiple by nature - in 
theory it may be several REQUIRE relation (plus with accumulation any of this 
require relation will be inherited). So each relation is an array of link. 
OBSOLETE_BY is single link, but I keep it as array just to be less confused and 
have all implementation symmetrical.

In general, in real life there is no too much of this relation defined for 
patches. However our new proposal to get rid of accumulation and simplify 
patching to have patch per putback without revisions will require much greater 
use of require relation. 

Other important attribute of patch is type. Initial there are three types of 
patches 
SYSTEM - already installed and l found them in /var/sadm/pkg/*/pkginfo
REQUESTED - patches requested to install and I found them in specified to 
patchadd location 
VIRTUAL - very important for analysis type of patches, they do not exist 
anywhere, but were refered by other patches 

I am generating this data structure while parsing pkginfo files from system and 
form patches - this one way parsing. When I read first patch record I create 
patch object for this record and for each related patch I create also Virtual 
patch records - they just marked Virtual. If I read patch record and find that 
this patch already exist as an virtual I mark it as real (SYSTEM or REQUESTED - 
depends what I am parsing currently). I will talk in detail about parsing 
later. 

To implement this I use unusual approach. All books will suggest in this case 
to have at least array of structures. And links between records have as 
pointers. I developed and get used to different technic for some time in my 
outside of the job, hobby programing for years (I did little organizer database 
for my PalmV and then Sony Clie and other things of this nature just for fun). 
Instead of having array of structures I have set of arrays. This way all links 
are not pointers but longs referring to array item with that index. In result 
this representation is extremely mobile and relocatable, also it is easy to 
understand debugging printouts. It is not tied to fixed structure and when I 
need I can introduce additional field just inside one procedure name space for 
example which I use with other field as part of the structure but then destroy 
- like depth field needed only for ordering during dependency check or circular 
dependency path. 

Best example of this mobility is mechanism of transactions in recursive 
patching - when I may commit or rollback results of recursive dependency check. 
Colors of the knots in tree during tree analysis I keep in array TYPES, first I 
memcpy this array at once and then run check if root patch is approved then I 
COMMIT this results by keeping marks as they are and destroy copy, if not I 
ROLLBACK - relink TYPES to copy and destroy current check results.

vassun
 
 
This message posted from opensolaris.org

[install-discuss] Re: pdo.c - introduction

Reply via email to