On 4/25/23 14:11, Ilya Maximets wrote:
> On 4/25/23 00:12, Terry Wilson via discuss wrote:
>> A performance issue that has always bothered me:
>>
>> OVSDB has a set data type that matches up with Python's set data type
>> (an unordered collection of unique items). The in-tree Python library
>> represents this set type as a list. Not only that, but every time you
>> access a set-type column on a Row through Row.__getattr__(), it loops
>> through the values, adds them to a new Python set (presumably to remove
>> duplicates), and then returns them as a sorted list. Every single time
>> the attribute is accessed [1].
>>
>> Some of these sets can be quite large. In OpenStack Neutron, for
>> example, we have a default Port Group that all ports are added to by
>> default. This is many thousands of ports.
>>
>> Now, it would be very simple to just return a set here, and users would
>> get the benefits of both less overhead on attribute access AND the
>> ability to do O(1) lookups on these sets. Things like "find port groups
>> that have this port" would be *much* cheaper. The problem is that this
>> breaks the API. You can no longer do things like Port_Group.ports[0],
>> since set objects are unordered and do not have __getitem__(), and
>> operations like append() don't exist. This would also break a lot of
>> tests, because they tend to rely on the order of objects for simple
>> string matching. That latter issue is probably easy to fix by sorting
>> the results in the tests themselves.
>>
>> It's probably possible to create a wrapper type that makes a set look
>> enough like a list not to break things, but that's also pretty ugly. So
>> I guess my question is: "what do we think about breaking the API at
>> some point to fix this?" The current behavior is pretty terrible, but
>> changing APIs is also annoying.
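The "wrapper type" idea Terry mentions could look something like the sketch below: an immutable, set-backed object that still supports list-style indexing and iteration while offering O(1) membership tests. The class name `SetList` and its interface are purely illustrative, not part of ovs.db.idl.

```python
# Hypothetical read-only view: list-like API on top of a frozenset.
# A sketch of the "wrapper type" idea, not code from ovs.db.idl.
from collections.abc import Sequence


class SetList(Sequence):
    """Immutable set-backed sequence: list-style access, set-style lookups."""

    def __init__(self, items):
        self._set = frozenset(items)          # O(1) __contains__
        self._sorted = None                   # sorted view, built lazily

    def _as_list(self):
        if self._sorted is None:
            self._sorted = sorted(self._set)  # computed at most once
        return self._sorted

    def __getitem__(self, index):
        return self._as_list()[index]

    def __len__(self):
        return len(self._set)

    def __contains__(self, item):             # bypass Sequence's O(n) scan
        return item in self._set


ports = SetList(["p3", "p1", "p2", "p1"])     # duplicates collapse
assert ports[0] == "p1"                       # indexing still works
assert "p2" in ports                          # O(1) membership
assert len(ports) == 3
```

Subclassing `collections.abc.Sequence` gives iteration, `index()`, and `count()` for free, so existing list-style callers keep working; only mutation (`append()`, item assignment) is lost.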
>>
>> Terry
>>
>> [1]
>> https://github.com/openvswitch/ovs/blob/d70688a7291edb432fd66b9230a92842fcfd3607/python/ovs/db/data.py#L498-L504
>
> Hi, Terry. Thank you for bringing that up.
>
> I think the main idea behind the current design is to do whatever the
> C implementation does, and the C implementation presents IDL objects
> to the user as sorted arrays. That translates to ordered lists in
> Python. Another major, and probably the most important, point is that
> columns can contain weak references to rows that do not exist; the
> code in [1] makes sure that such references are not visible to the
> user. That code also resolves UUIDs into row objects.
>
> Another potential reason is that we perform a copy in order to keep
> internal structures from being modified by the user application, i.e.
> to avoid returning direct references to internal structures. I'm not
> sure whether that is really important, but I guess some user apps may
> rely on it in some form even if they are not aware of it.
>
> I fully agree that the way it is implemented today is really not a
> pythonic way of doing things, and this needs to be improved.
>
> If we go the route of an API change, it likely should be a
> per-IDL-instance configuration knob that alters the behavior, so that
> users can opt in to the new API. We may deprecate the old one
> eventually, but we'll need a transitional period with the usual
> deprecation warnings that Python libraries emit. A wrapper object may
> be a solution for the transition period, so we don't have to carry
> two separate implementations. But we still need to translate UUIDs
> into rows, so I'm not sure how to avoid that.
>
> There is an alternative approach to consider. I believe the C IDL
> implementation had the same problem ages ago, and the solution that
> remains in place today is to cache the "parsed" version of a column
> and re-parse it whenever the column changes. For the Python IDL, it
> might look like having an extra field that holds the sorted list.
> Clear it when something changes, and create a new one whenever the
> user accesses the column and the "parsed" version doesn't exist yet.
> It's not ideal, but it should be better than what we have now. This
> does carry the same risk of accidental modification of the shared
> internal object, though, but again, that might not be very important.
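The cache-and-invalidate scheme Ilya describes can be sketched as follows. The `Column` class and its `set()`/`get()` methods are illustrative stand-ins, not the actual ovs.db.idl classes; the point is only the invalidation pattern.

```python
# Minimal sketch of the caching idea: keep a parsed (sorted) view next
# to the raw data and clear it on any change.  Illustrative names only.
class Column:
    def __init__(self, values=()):
        self._raw = set(values)     # internal representation
        self._parsed = None         # cached sorted list

    def set(self, values):
        self._raw = set(values)
        self._parsed = None         # invalidate on any change

    def get(self):
        if self._parsed is None:    # re-parse only after a change
            self._parsed = sorted(self._raw)
        return self._parsed         # shared object: callers could mutate
                                    # it, which is the caveat noted above


c = Column(["b", "a", "b"])
first = c.get()
assert first == ["a", "b"]
assert c.get() is first             # cached: no re-sort on every access
c.set(["z", "a"])
assert c.get() == ["a", "z"]        # cache rebuilt after modification
```

Repeated reads become a cheap attribute check, and the sort cost is paid only once per modification rather than once per access.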
I remembered why I didn't end up writing any code the last time I
wanted to fix this issue. It's because we need to re-parse the source
column whenever a row it references changes. So, if column A contains
UUIDs of rows in table T, and a new row appears in T that one of the
UUIDs in A points to, then we need to re-parse column A in order to
include the resolved row object for the previously missing reference.
To do that, we need to maintain source and destination references
between rows; see the src/dst arcs in the C implementation. The logic
around these is fairly complex, including management of orphan rows
that are referenced but do not exist.

> What do you think?
>
> Best regards, Ilya Maximets.
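The re-parse dependency described above can be sketched roughly as below: a reverse map from each referenced UUID back to the columns whose cached parse mentions it, so that when a missing row appears, those caches can be dropped. This is only a loose analogy to the C IDL's arc machinery; all names (`back_arcs`, `parse_column`, `row_inserted`) are hypothetical.

```python
# Rough sketch: when a previously missing row appears, every column
# whose cached parse referenced its UUID must be invalidated.  This
# needs a reverse map (dst UUID -> src columns), analogous to the
# C IDL's arcs.  All names here are hypothetical.
from collections import defaultdict

# dst UUID -> set of (table, column) pairs whose parsed cache mentions it
back_arcs = defaultdict(set)
parsed_cache = {}                     # (table, column) -> parsed value


def parse_column(table, column, uuids, rows):
    """Resolve UUIDs to rows, recording an arc for each reference."""
    resolved = []
    for uuid in uuids:
        back_arcs[uuid].add((table, column))
        if uuid in rows:              # dangling refs stay invisible
            resolved.append(rows[uuid])
    parsed_cache[(table, column)] = resolved
    return resolved


def row_inserted(uuid):
    """A row appeared: drop every cached parse that referenced it."""
    for key in back_arcs.pop(uuid, ()):
        parsed_cache.pop(key, None)   # force a re-parse on next access


rows = {"u1": "row1"}
parse_column("PG", "ports", ["u1", "u2"], rows)
assert parsed_cache[("PG", "ports")] == ["row1"]  # u2 missing: hidden
rows["u2"] = "row2"
row_inserted("u2")
assert ("PG", "ports") not in parsed_cache        # must re-parse now
```

The complexity Ilya points to lives in maintaining `back_arcs` correctly across deletions and re-parses, including the orphan-row bookkeeping, which this sketch omits.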