Hi,
It's a known issue that strings to and from UDF functions are a pain
(https://dev.mysql.com/blog-archive/a-tale-of-udfs-with-character-sets/,
amongst others).
Most recently I ran into an issue trying to create a UDF to assist with
IPv6 manipulations (in its current, not-working-as-expected form at
https://github.com/jkroonza/uls-mysql/blob/master/src/uls_inet6.c).
The two UDF functions there are intended to find the "network" (first)
and "broadcast" (last) address of a range, with the intention to add
functions to perform other manipulations (like adding some offset to an
IPv6 address, or even adding two IPv6 addresses together; eg, to find
the "next" network for 2c0f:f720::/40 you can add 0:0:100:: to
2c0f:f720::).
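To make the manipulation concrete, here's a minimal standalone sketch
(not the uls-mysql code itself; the helper name is mine) of how the
"network" address falls out of the 16-byte binary form:

```c
#include <arpa/inet.h>
#include <string.h>

/* Illustrative sketch: compute the "network" (first) address by
 * zeroing every bit beyond the prefix length in the 16-byte binary
 * form of an IPv6 address. Assumes prefix <= 128. */
void inet6_network(unsigned char addr[16], unsigned prefix)
{
    for (unsigned i = 0; i < 16; ++i) {
        unsigned keep = prefix >= 8 ? 8 : prefix; /* bits kept in this byte */
        addr[i] &= (unsigned char)(0xFF00 >> keep);
        prefix -= keep;
    }
}
```

Masking ::ffff:192.168.1.123 with prefix 120 this way yields
::ffff:192.168.1.0, matching the query output further down.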
The referenced code works great when you pass values from an inet6
column to the function (they get transformed into textual presentation
before being passed to the UDF), eg:
MariaDB [test]> select uls_inet6_network_address('::ffff:192.168.1.123',
120);
+--------------------------------------------------------+
| uls_inet6_network_address('::ffff:192.168.1.123', 120) |
+--------------------------------------------------------+
| ::ffff:192.168.1.0 |
+--------------------------------------------------------+
1 row in set (0.001 sec)
However, given this:
MariaDB [test]> create table t(t inet6 not null, primary key(t));
Query OK, 0 rows affected (0.024 sec)
MariaDB [test]> insert into t values('::ffff:192.168.1.15');
Query OK, 1 row affected (0.014 sec)
MariaDB [test]> select * from t where
uls_inet6_network_address('::ffff:192.168.1.123', 120) < t;
Empty set, 1 warning (0.009 sec)
MariaDB [test]> show warnings;
+---------+------+---------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------+
| Warning | 1292 | Incorrect inet6 value: '::ffff:192.168.1.0' |
+---------+------+---------------------------------------------+
1 row in set (0.000 sec)
I started to dig, to find this:
MariaDB [test]> select
charset(uls_inet6_network_address('::ffff:192.168.1.123', 120));
+-----------------------------------------------------------------+
| charset(uls_inet6_network_address('::ffff:192.168.1.123', 120)) |
+-----------------------------------------------------------------+
| binary |
+-----------------------------------------------------------------+
1 row in set (0.000 sec)
Then I played a bit and found that if I rig it so that the returned
string has 16 characters, I don't get an error; eg:
MariaDB [test]> select * from t where
uls_inet6_network_address('::ff:f:ffff:ffff', 120) > t;
+---------------------+
| t |
+---------------------+
| ::ffff:192.168.1.15 |
+---------------------+
1 row in set (0.001 sec)
Based on this I'm inferring that:
1. Contextually the difference between BINARY() and CHAR() types is the
character set (ie, binary is implemented as a kind of character set for
strings, or perhaps strings are binaries with a specific character set).
2. The inet6 column type is, I now realize, simply BINARY(16) with a
transform applied to convert strings to and from the BINARY() format (I
think the technical term in the code is "fixed binary storage").
3. When string (character) data gets sent *to* an inet6 column, a
"character set conversion" is performed; as is the case with UDFs, the
value is already BINARY, so no conversion is actually performed, and
since the BINARY length is *not* 16 bytes, we get the above warning,
and thus comparisons behave as if against a NULL value, and are always
false.
I've been contemplating possible "solutions" to the problem without
breaking backwards compatibility, and it's tricky. The below aims at a
simple way of specifying the return "character set" of a string return
in more detail. And possibly even looking at the character sets for
parameters.
MySQL has started down a path (as per the tale above) whereby it uses
some other mechanism to give indications about character sets. I don't
like that particular mechanism, but it could work; it looks like it
only handles the character set though, in which case, for my example
above, I could just force both input and output to "latin1" so that
MySQL/MariaDB is forced to convert between that and INET6 again. I
think we can do better. I'm not convinced their change will permit
backwards compatibility - ie, any UDF that uses those functions will
require them and cannot opt to operate in a degraded mode. I'm not
sure that's even a desirable thing to have though ... so perhaps
copying what they're doing here is the way to go.
My ideas:
Option 1. Simply permit specifying a more precise string type during
function creation.
For example, instead of:
MariaDB [uls]> CREATE FUNCTION uls_inet6_last_address RETURNS STRING
SONAME 'uls_inet6.so';
Do this:
MariaDB [uls]> CREATE FUNCTION uls_inet6_last_address RETURNS INET6
SONAME 'uls_inet6.so';
And this is permitted because INET6 is a specialization of BINARY, which
is really what MySQL passes to/from UDF functions anyway.
The downside is that this will only work for returns. It would,
however, work for my use case, since from INET6 => UDF the decoder is
called anyway, so I get the text presentation on function entry, and
now I just need to return the binary form of INET6. If sent back to
the client, it gets converted by the server to the textual
presentation.
This eliminates at least *one* of the two pointless conversions to/from
binary format.
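To make that concrete, a small sketch (the helper name is mine, and
RETURNS INET6 is of course hypothetical at this point) of handing back
the raw 16-byte form via inet_pton:

```c
#include <arpa/inet.h>

/* Sketch for option 1: instead of formatting the result as text, the
 * UDF returns the raw 16-byte form - exactly what the INET6 column
 * stores internally. Returns the length the UDF would report, or -1
 * on bad input. */
int inet6_to_wire(const char *text, unsigned char out[16])
{
    return inet_pton(AF_INET6, text, out) == 1 ? 16 : -1;
}
```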
Option 2. A mechanism to pass the character set/binary type upon entry
and exit.
What I'm thinking here is a potentially non-backwards-compatible
change. Or we could use the *extension pointer somehow. I'm still
having some trouble navigating the codebase to figure out exactly how
the void *extension pointers in the UDF_ARGS and UDF_INIT structs are
used, but I'm thinking they can be (ab)used to pass an additional
structure, which may perhaps also be how option 3 is achieved?
To be more precise, the current UDF_ARGS looks like:
typedef struct UDF_ARGS {
  unsigned int arg_count;           /**< Number of arguments */
  enum Item_result *arg_type;       /**< Pointer to item_results */
  char **args;                      /**< Pointer to argument */
  unsigned long *lengths;           /**< Length of string arguments */
  char *maybe_null;                 /**< Set to 1 for all maybe_null args */
  char **attributes;                /**< Pointer to attribute name */
  unsigned long *attribute_lengths; /**< Length of attribute arguments */
  void *extension;
} UDF_ARGS;
I can't find any source files that reference both the UDF_ARGS type
and this extension field. As such I'm hoping I can safely assume that
in current implementations it will always be NULL.
If so it becomes possible to turn this into a pointer-sized version
field, or struct size field, ie something like:
union {
void *extension;
size_t argsize; /* one alternative */
long api_version; /* another alternative */
};
For argsize, this way it becomes possible to check that the field is
>= the sizeof(UDF_ARGS) against which the UDF was compiled. This
*feels* flakey, but should be fine. If a user wants more fine-grained
compile-time backwards compat, the user can check that all fields
(beyond those listed) are positioned inside the structure.
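A sketch of that check (the layout is hypothetical, and the existing
fields are elided; an old server's NULL extension simply reads back as
an argsize of 0):

```c
#include <stddef.h>

/* Hypothetical layout for the union idea: the server writes
 * sizeof(UDF_ARGS) where old servers left a NULL extension pointer. */
typedef struct UDF_ARGS_SKETCH {
    unsigned int arg_count;
    /* ... other existing fields elided ... */
    union {
        void *extension; /* NULL on an old server */
        size_t argsize;  /* sizeof of the server's UDF_ARGS otherwise */
    };
} UDF_ARGS_SKETCH;

/* The UDF checks that the server's struct is at least as large as the
 * one it was compiled against. */
int server_args_compatible(const UDF_ARGS_SKETCH *args)
{
    return args->argsize >= sizeof(UDF_ARGS_SKETCH);
}
```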
Another option is simply to have extension be a pointer to some
UDF2_ARGS struct, which is really the UDF_ARGS itself expanded, ie
something like:

typedef struct UDF2_ARGS {
    /* ... everything from above, except void *extension becomes: */
    struct UDF2_ARGS *extension;   /* points back to this struct itself */
    size_t argsize;                /* or long api_version */
    bool udf2_supported;           /* set to false by the server; if the
                                      UDF supports this, set to true */
    const char **string_arg_types; /* in _init this is passed as the raw
                                      type, eg, INET6 if the input is an
                                      INET6 field; can be set in _init to
                                      have the server cast the type, ie,
                                      perform some conversion */
    /* ... additional fields */
} UDF2_ARGS;
So in my example, if udf2_supported comes back as false after _init,
then current behaviour is maintained: INET6 is converted to string,
and return strings are assumed to be BINARY. If, however,
udf2_supported comes back true, my code could set
string_arg_types[0] = "inet6" - which the engine knows is a BINARY(16)
type - and use the INET6 conversion code to convert to and from
BINARY(16) as needed.
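To illustrate, a sketch of what an opting-in _init could look like
(all names are hypothetical, mirroring the UDF2_ARGS sketch above, and
trimmed down so it stands on its own):

```c
#include <stdbool.h>

/* Trimmed-down stand-in for the hypothetical UDF2_ARGS above. */
typedef struct UDF2_ARGS_SKETCH {
    unsigned int arg_count;
    bool udf2_supported;           /* server presets this to false */
    const char **string_arg_types; /* raw type name per argument */
} UDF2_ARGS_SKETCH;

/* _init opts in and asks the server to pass argument 0 in its raw
 * inet6 (BINARY(16)) form instead of the textual presentation. */
bool uls_inet6_init_sketch(UDF2_ARGS_SKETCH *args)
{
    if (args->arg_count < 1)
        return false;
    args->udf2_supported = true;          /* opt in to the v2 contract */
    args->string_arg_types[0] = "inet6";  /* request raw BINARY(16) */
    return true;
}
```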
Option 3. Some other, more formal UDF API versioning/API definition
scheme. Call this UDFv2, and make it extensible so that features can
be "negotiated", or so that the UDF itself can even "fail to load" in
cases where it requires support from the running server that's not
provided.
The idea here would be that a symbol like "mariadb_udf_init" is
exported and passed some structure that defines the capabilities of
the server it's running on; the module can then decline to load, or
proceed but limit itself based on the functionality available.
I'm guessing option 2 is also a way of achieving this, and may be
sufficient. I do however like the idea of having {func}_load and
{func}_unload calls (some of our PBKDF functions would also benefit
from this, in that we could load the openssl algorithms once at
startup instead of every time in _init). There may be other
complications though, but still, this could be one suggestion for v2.
I'm happy to write a more formal specification based on the informal
description above, and I'm inclined to attempt option 2. I just
honestly have no idea what's all REALLY involved, and where to find
the various bits and pieces of code. So I guess the initial steps
would be:
1. Add a _load function which is invoked at *load* time, passing
server and version information. Not sure what else could be useful;
the intention is to initialise the UDF for use as needed. UDFs that
want to maintain backwards compatibility would need to track whether
this was indeed called, and if not, act accordingly - or, if they
can't be backwards compatible without it, error out in _init.
Suggestions as to parameters to pass would be great; I suggest we use
a struct again with a size field as its first field to ensure that we
can safely add extra fields at a later stage. Eg:
typedef struct UDF_LOADINFO {
    size_t load_size;
    const char *server;
    struct {
        uint16_t major, minor, patch; /* 10, 6, 17 - for example, what
                                         I've got running currently */
        const char *extra; /* "MariaDB-log", the stuff after the dash
                              from SELECT version(); */
    } version;
} UDF_LOADINFO;
And so we will call "bool (*func_load)(const UDF_LOADINFO*)" IFF it exists.
2. Add an _unload function, "void (*func_unload)()", the purpose of
which is to clean up after _load.
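To illustrate the pair, a standalone sketch (the hooks and the struct
are part of the proposal, not an existing API; the struct is
re-declared here so the sketch compiles on its own):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the UDF_LOADINFO proposed above. */
typedef struct UDF_LOADINFO_SKETCH {
    size_t load_size;
    const char *server;
    struct { uint16_t major, minor, patch; const char *extra; } version;
} UDF_LOADINFO_SKETCH;

static bool load_called = false;

/* Called once at load time IFF exported; the load_size check lets the
 * struct grow in later server versions without breaking old UDFs. */
bool uls_inet6_load(const UDF_LOADINFO_SKETCH *info)
{
    if (info->load_size < sizeof(UDF_LOADINFO_SKETCH))
        return false;   /* server older than what we were built for */
    load_called = true; /* eg load OpenSSL algorithms once here */
    return true;
}

/* Cleans up whatever _load set up. */
void uls_inet6_unload(void)
{
    load_called = false;
}
```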
Note: It's possible to use __attribute__((constructor)) and destructor
functions, but my experience is that they're messy, and if you have
multiple functions in a UDF library you can't know which ones you're
being loaded for.
3. Expand for UDF2_ARGS as explained in option 2 above, as well as
UDF2_INIT.
4. Ability to flag functions as *deterministic* (same return for same
arguments) rather than merely const or not const. This way SQL can
optimize calls in the same manner as for stored functions. I'm not
going to dig into that, but happy to add the flag.
5. Document and flag this as UDFv2 as and when it hits formal release.
Kind regards,
Jaco
_______________________________________________
developers mailing list -- [email protected]
To unsubscribe send an email to [email protected]