Modeling combinations (options and dependencies)

eacil Fri, 19 May 2023 02:36:12 -0700

Hello. I am struggling for the longest time about data I can't fit into a 
satisfying model. I thought about giving up about this project many times...
Context: it's for a web server so speed is of the essence. It's an amateur 
project so expect cheap server. I am also more a novice than a seasoned user so 
please have mercy on me.


Here we go, I hope I will be able to correctly explain my problem because 
chatgpt failed spectacularly to suggest anything even remotely worthwhile (yes, 
pretty desperate move, I know, but I am a desperate man).

-------------------------------------
DATA
-------------------------------------

-You have objects that are linked to many tables describing their properties, 
such as an object__tag table.
-These objects are sold in releases.
-Thing is, objects can be made of objects. I call them superobjects when they 
assume that container role (as objects, they can be part of other 
superobjects). Multiple releases can be linked to one superobject. Yes, it's a 
parts scenario.
-Objects can be shuffled in all sorts of unpredictable ways so it's useless to 
think that if one object is contained inside some superobject A, if that 
superobject A will be contained inside another superobject B, superobject B 
will inherit all objects from superobject A. In fact, my attempt at a solution 
will instantiate objects to link them to superobjects.
-In a superobject, multiple objects can be grouped together into a set of 
options. One option can be made of multiple objects (majority of the time, just 
one). You can choose how many options minimum and maximum (aka a range) you can 
choose from a set (majority of the time, just one). It's possible to have a 
NONE choice aka you don't choose any option.
-Any object can have 0 (common), 1 (common), or more (very rare) dependencies.

That's pretty much all.
It looks like a hierarchy but it is not really one. It looks like a graph but 
probably not one either.
It's made of multiple concepts that, alone, are difficult enough already, but 
now they are all grouped together and it's going well over my head.
In the end, I think it's about combinations with a bit of hierarchy to make 
everything more painful...

Here is a simple "tree" representation of that data. Think of "( )" as radio 
buttons. Empty linebreaks are separating sets of options. X indicates mandatory 
nodes (unless their parent is not chosen). "(n-m)" represents the min-max range 
for a set of options.

(x) Object 0 (it's the superobject)

(1-1) Object A1
(1-1) Object A2
(1-1) Object A3

(x) Object B1

(1-1) Object C1
│ (x) Object D1
│
│ (0-1) Object E1
│ (0-1) Object E2 + Object E3
│
│ (1-2) Object F1
│ (1-2) Object F2 {requires A2}
│
(1-1) Object C2
│ (x) Object G1
│
│ (x) Object E1

As you can see, a user can choose between three options A, E (with the second 
option being made of two objects and the third option making this set 
optional), two options C, F. If they choose object C2, they gets objects G1 and 
E1 by default.
Choosing object F2 forces you to have both objects C1 and A2. It's ugly to 
represent constraints like that but impossible to draw in 2D properly.

The reason why I said that it wasn't really a hierarchy is that you have three 
groups at "depth 1": object 0, object B, and objects C with their children. 
These three groups are all equal to each others and their position could be 
swapped with each other. Same with the children of objects C. Groups D, E, F, 
could be swapped with each other.

When you buy a release, you have to go from the top to the bottom and go 
through every branch until you reach its end.
Here, you could buy objects 0, A2, B1, C1, D1, E2, E3, F2.
Or 0, A1, B1, C1, D1, F1.
Or 0, A3, B1, C2, G1, E1.

Superobjects don't have much depth to begin with, depth 6 max.
Sets of options per superobject won't go probably above 10, but options 
themselves can go pretty high like 50.
Most objects don't have dependencies to tell you. The average superobject is 
made of a bunch of mandatory objects and few sets of options with few options 
(half of them probably being if you want the object or not).
I know it's vague but the data is unpredictable with lot of exceptions and 
that's why I need flexibility and can't hardcode anything.

-------------------------------------
PURPOSE
-------------------------------------

I don't really care at the moment as of how to render that tree above as it 
doesn't appear to be a big challenge. What I care is building a performant 
search engine to find superobjects (and releases above them) given columns of 
objects (subqueries) selected through joining objects with their related tableS 
such as object__tag. You have to find superobjects you can buy, while 
respecting all those annoying constraints. So, if you search for a superobject 
containing E2 and G1, you cannot find superobject 0 because it's on different 
branches! Same, if you search for A1 and A2, you cannot find superobject 0 
because they are in the same set of options and choosing one prevent you from 
choosing the other!

-------------------------------------
BAD SOLUTION 1
-------------------------------------

My first thought was to create a "hierarchy". You put the three groups at depth 
1 one after the other. The order doesn't matter. What matters is to have all 
possibilities the one after the other. One path of such tree would be 
0-A1-B1-C1-D1-E2-E3-F1. It would be a tree representing all possible 
combinations.

The problem is that such a tree grows exponentially.
If you have 3 options, then 4, then 5, the tree will have 60 paths. If you add 
another one of 3 it's 180. I estimated that you have some freak superobjects 
that could spawn more than 1000 possibilities. It's insane just to check few 
sets of options! I just need some freak exceptions I never imagined would exist 
to clog my database, meaning I would have to limit the number of combinations a 
superobject could have to be on the safe side.
And how do you even begin being efficient if it's to throw at such a hierarchy 
four columns of object.id to find paths that hit a combination?? If I have to 
check recursively hundreds of paths that do not even have a predictable order 
just to find a list of superobjects, I don't see my website becoming very 
performant.

I read a quite a "bit" on hierarchies but couldn't find anything useful. Most 
stuff is focused on basic cases such as finding parents or children, shortest 
path, subtrees, but never combinations.

I also tried to flatten all of that and dived into ltree, arrays and jsonb, 
thinking I could use somehow that interesting containment property, but it was 
before finally getting that you can't even throw columns at such types.

-------------------------------------
(NOT COMPLETELY?) BAD SOLUTION 2
-------------------------------------

It's the one I am going for at the moment (given it's the only one I found, on 
my own...) but it's a wasteful solution. It takes a lot of place and every time 
you want to throw another column of objects to the query, it becomes an order 
of magnitude slower because the progressions of relations (and therefore joins) 
follows a triangle sequence (1, 3, 6, 10, 15, 21, etc.) as you will see.

I hesitated before explaining my solution because I don't want to divert the 
discussion about trying to build over it if it's to ignore a better solution, 
but it's better to show that I did some homework before asking for a solution.

CREATE TABLE object (
id integer NOT NULL GENERATED ALWAYS AS IDENTITY,
CONSTRAINT object__pk PRIMARY KEY (id)
);

CREATE TABLE superobject (
id integer NOT NULL GENERATED ALWAYS AS IDENTITY,
object_id integer NOT NULL,
CONSTRAINT superobject_pk PRIMARY KEY (id)
);
ALTER TABLE superobject ADD CONSTRAINT superobject__object_id_fk FOREIGN KEY 
(object_id) REFERENCES object (id);

CREATE TABLE release (
id integer NOT NULL GENERATED ALWAYS AS IDENTITY,
superobject_id integer NOT NULL,
CONSTRAINT release_pk PRIMARY KEY (id)
);
ALTER TABLE release ADD CONSTRAINT release__superobject_id_fk FOREIGN KEY 
(superobject_id) REFERENCES superobject (id);

CREATE TABLE instance (
id integer NOT NULL GENERATED ALWAYS AS IDENTITY ,
object_id integer NOT NULL,
CONSTRAINT instance__pk PRIMARY KEY (id,object_id)
);
ALTER TABLE instance ADD CONSTRAINT instance__object_id_fk FOREIGN KEY 
(object_id) REFERENCES object (id);

CREATE TABLE pairing (
superobject_id integer NOT NULL,
instance_a integer NOT NULL,
instance_b integer NOT NULL,
CONSTRAINT pairing__pk PRIMARY KEY (superobject_id,instance_a,instance_b)
);
ALTER TABLE pairing ADD CONSTRAINT pairing__superobject_id_fk FOREIGN KEY 
(superobject_id) REFERENCES superobject (id);
ALTER TABLE pairing ADD CONSTRAINT pairing__instance_a_fk FOREIGN KEY 
(instance_a) REFERENCES instance (id);
ALTER TABLE pairing ADD CONSTRAINT pairing__instance_b_fk FOREIGN KEY 
(instance_b) REFERENCES instance (id);

Here is a visual representation of my schema if it can help but be aware that 
it doesn't match exactly what I wrote above as it's a stripped down version:
https://i.imgur.com/AUau65W.png

In the table pairing, you will add for every object which object it is linked 
to, no matter the depth.
If one has no dependency, it will have a row for every object.
If one is an option, it won't have a row linking it to its fellow options or 
any of their sub-options.

Example (instances a and b are replaced by the names of the objects for 
clarity):
(sorry, can't format that properly in a mail)

superobject_instance_a_instance_b
0___________0__________A1
0___________0__________A2
0___________0__________A3
0___________0__________B1
0___________0__________C1
0___________0__________D1
0___________0__________E1
0___________0__________E2
0___________0__________E3
0___________0__________F1
0___________0__________F2
0___________0__________C2
0___________0__________G1
0___________0__________E1
0___________A1_________0
0___________A1_________B1
0___________A1_________C1
0___________A1_________D1
0___________A1_________E1
0___________A1_________E2
0___________A1_________E3
0___________A1_________F1
0___________A1_________F2
0___________A1_________C2
0___________A1_________G1
0___________A1_________E1
0___________A2_________0
0___________A2_________B1
0___________A2_________C1
0___________A2_________D1
0___________A2_________E1
0___________A2_________E2
0___________A2_________E3
0___________A2_________F1
0___________A2_________F2
0___________A2_________C2
0___________A2_________G1
0___________A2_________E1
0___________A3_________0
...
0___________B1_________0
...
0___________C2_________0
0___________C2_________A1
0___________C2_________A2
0___________C2_________A3
0___________C2_________B1
0___________C2_________G1
0___________C2_________E1
0___________C1_________0
0___________C1_________A1
0___________C1_________A2
0___________C1_________A3
0___________C1_________B1
0___________C1_________D1
0___________C1_________E1
0___________C1_________E2
0___________C1_________E3
0___________C1_________F1
0___________C1_________F2
0___________D1_________0
...
etc.

As you can see, it is super wasteful as you have the same relation in both 
directions. 0-A1 and A1-0 are identical given that this table has two columns 
semantically identical. You could search in one (instance_a -> instance_b) or 
the other direction (instance_b -> instance_a) and it would be the same. I 
asked on dba.stackexchange 
(https://dba.stackexchange.com/questions/326860/bidirectional-self-join-table) 
if you could split the table in half but I don't think it's possible.

The benefit is that it doesn't grow exponentially. Sure, it's closer to a 
quartesian product (total_of_objects * total_of_objects-1 when no set of 
options involved) but at least it stays "reasonable".

Instantiating is mandatory because if you don't make every object of a tree 
unique, you will get collisions. With paths A-B-C, E-B-D, D will be linked to A 
through A-B and B-D even though they are not. Instantiating B will prevent that.

As for the queries:

--FIND SUPEROBJECTS GIVEN 1 OBJECT (super fast)
SELECT superobject_id FROM pairing
WHERE instance_a IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <first_criterion>)
GROUP BY superobject_id;

--FIND SUPEROBJECTS GIVEN 2 OBJECTS (super fast)
SELECT superobject_id FROM pairing
WHERE instance_a IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <first_criterion>)
AND instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <second_criterion>)
GROUP BY superobject_id;

--FIND SUPEROBJECTS GIVEN 3 OBJECTS (fast?)
SELECT superobject_id
FROM (SELECT p1.superobject_id, p1.instance_b AS b, p2.instance_b AS c FROM 
pairing p1
JOIN pairing p2 ON p1.superobject_id = p2.superobject_id and p1.instance_a = 
p2.instance_a
WHERE p1.instance_a IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <first_criterion>)
AND p1.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <second_criterion>)
AND p2.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <third_criterion>)
intersect
SELECT superobject_id, instance_a AS b, instance_b AS c FROM pairing) AS results
GROUP BY superobject_id

It's basically a triangle.
<->A<->B<->C<->

--FIND SUPEROBJECTS GIVEN 4 OBJECTS (uh oh?)
SELECT first_pass.superobject_id FROM (SELECT p1.superobject_id, p1.instance_b 
AS b, p2.instance_b AS c, p3.instance_b AS d FROM pairing p1
JOIN pairing p2 ON p1.superobject_id = p2.superobject_id and p1.instance_a = 
p2.instance_a
JOIN pairing p3 ON p1.superobject_id = p3.superobject_id and p1.instance_a = 
p3.instance_a
WHERE p1.instance_a IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <first_criterion>)
AND p1.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <second_criterion>)
AND p2.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE <third_criterion>)
AND p3.instance_b IN (SELECT i.id FROM object o JOIN instance i ON o.id = 
i.object_id JOIN object__tag ot ON o.id = ot.object_id WHERE 
<fourth_criterion>)) AS first_pass
JOIN pairing p4 ON first_pass.superobject_id = p4.superobject_id and 
first_pass.b = p4.instance_a and first_pass.c = p4.instance_b
JOIN pairing p5 ON first_pass.superobject_id = p5.superobject_id and 
first_pass.b = p5.instance_a and first_pass.d = p5.instance_b
JOIN pairing p6 ON first_pass.superobject_id = p6.superobject_id and 
first_pass.c = p6.instance_a and first_pass.d = p6.instance_b
GROUP BY first_pass.superobject_id

Here you have a square and its two diagonales.
I am not sure it's the best way to do it.

--FIND SUPEROBJECTS GIVEN 5 OBJECTS (???)
something something NINE JOINS
A pentacle and it's pentagram of joins...

As you can see, it is already quite the beast just to find a superobject made 
of 4 subqueries. It doesn't look sustainable and is quite infuriating to have 
to build such intricate polygons from small instance_a-instance_b pieces when 
the majority of superobjects will be flat.

Anyway, I hope to find some solace in your hands. ヾ(_ _*)
Feel free to ask any question if something is not clear.

Modeling combinations (options and dependencies)

Reply via email to