Re: [sqlite] Custom collation of blobs

2019-04-27 Thread James K. Lowden
On Fri, 26 Apr 2019 10:36:34 -0700
Jens Alfke  wrote:

> The only workaround I can think of is to define a custom function
> that converts a blob to a string and collate using those strings —
> e.g. `… ORDER BY collatable_blob(b)`. But this requires expensive
> string conversions,

Your UDF need not convert to a string.  It could return an integer or,
if that's not enough, couldn't it return a different blob whose bit
order yields the correct ordering?  
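James's suggestion can be sketched with Python's sqlite3 module: a deterministic UDF that maps each blob to a memcmp-orderable key. The blob layout here (a little-endian uint32 id followed by a payload) and the name `blob_key` are invented for illustration, not taken from the thread:

```python
import sqlite3
import struct

def blob_key(blob):
    # Hypothetical blob layout: a little-endian uint32 id followed by a
    # payload.  Re-emitting the id big-endian makes raw byte order
    # (memcmp) agree with numeric order.
    (ident,) = struct.unpack_from("<I", blob)
    return struct.pack(">I", ident)

conn = sqlite3.connect(":memory:")
conn.create_function("blob_key", 1, blob_key, deterministic=True)
conn.execute("CREATE TABLE t(b BLOB)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(struct.pack("<I", n) + b"payload",) for n in (256, 1, 512)])
# Because the function is deterministic, SQLite can index the expression:
conn.execute("CREATE INDEX t_key ON t(blob_key(b))")
ids = [struct.unpack_from("<I", row[0])[0]
       for row in conn.execute("SELECT b FROM t ORDER BY blob_key(b)")]
# numeric order [1, 256, 512]; plain ORDER BY b (memcmp) gives [256, 512, 1]
```

One caveat with the expression index: any connection that later opens a database containing it must register blob_key before writing to t, or those writes will fail.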

> and it doesn't work well with indexes.

If indexing is important, that's a clue that the blob is not your
friend to start with.  It might be worthwhile to store the searchable
components in "redundant" columns that can be readily indexed.  By
analogy, if your blob is a PDF, it might be convenient to store the
author and title in columns for search & display purposes.  

If you write UDFs to pull out the pieces, you could invoke them in a
trigger to keep the "redundant" columns consistent.  
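A minimal sketch of that trigger pattern, again via Python's sqlite3; the `blob_author` extractor and the zero-padded 16-byte name field are invented for this sketch:

```python
import sqlite3

def blob_author(b):
    # Hypothetical layout: author name in a zero-padded 16-byte field
    # at the start of the blob.
    return b[:16].rstrip(b"\x00").decode("ascii")

conn = sqlite3.connect(":memory:")
conn.create_function("blob_author", 1, blob_author, deterministic=True)
conn.executescript("""
CREATE TABLE docs(body BLOB, author TEXT);
CREATE INDEX docs_author ON docs(author);   -- the readily-indexed column
CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN
  UPDATE docs SET author = blob_author(NEW.body) WHERE rowid = NEW.rowid;
END;
CREATE TRIGGER docs_au AFTER UPDATE OF body ON docs BEGIN
  UPDATE docs SET author = blob_author(NEW.body) WHERE rowid = NEW.rowid;
END;
""")
conn.execute("INSERT INTO docs(body) VALUES (?)",
             (b"Alice".ljust(16, b"\x00") + b"...pdf bytes...",))
author = conn.execute("SELECT author FROM docs").fetchone()[0]  # "Alice"
```

The pair of triggers keeps the "redundant" column consistent on both insert and blob update, so queries and indexes only ever touch the plain TEXT column.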

--jkl
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Custom collation of blobs

2019-04-27 Thread R Smith
To add to Dominique's suggestion, we use this approach a lot and have 
sort of standardized it internally. Of course 1NF dictates that this is 
not the real RDBMS way, but sometimes you need blobs because you just do.


I'm sure you have already figured out how to do it sans blob collations, 
and that the question is aimed only at the collations rather than at 
alternate ideas, but on the off chance it helps - here goes:


What we do is compress the blob if needed[1], then translate it to a 
string using good old 3-to-4 translation (aka Base64)[2], which is rather 
efficient (much more so than bin --> Hex, which is 1-to-2 and doubles the 
size). We then prepend a prefix to the string "blob": a unique Key for 
the Blob, which is also what it is sorted by, plus a couple more 
characters that say a) whether it is compressed and b) the Blob version, 
so that future software changes remain able to read the current blob 
architecture.
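A rough sketch of that packing scheme. The `|` separator, the flag letters, and zlib as the compressor are my assumptions for illustration, not Ryan's actual format:

```python
import base64
import zlib

COMPRESS_THRESHOLD = 2560  # bytes; see footnote [1]

def pack_blob(key, blob, version=1):
    # key is the sortable prefix; it should be fixed-width and must not
    # contain '|' for plain string ordering on the packed value to work
    compressed = len(blob) > COMPRESS_THRESHOLD
    body = zlib.compress(blob) if compressed else blob
    flags = ("C" if compressed else "U") + str(version)
    return f"{key}|{flags}|" + base64.b64encode(body).decode("ascii")

def unpack_blob(packed):
    key, flags, b64 = packed.split("|", 2)
    body = base64.b64decode(b64)
    if flags[0] == "C":
        body = zlib.decompress(body)
    return key, body

small = pack_blob("K0001", b"\x00\x01 binary payload")   # stays uncompressed
large = pack_blob("K0002", b"x" * 5000)                  # gets compressed
# sorting the packed strings sorts by Key, since the Key is the prefix
```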


From there it's an easy step to make a standardized blob-storage thingy 
with pack/unpack to some structure, JSON, etc.  We even have these as 
UDFs, though I must say, in practice we never really use the UDF version, 
since the content of the blob never really comes into play in SQL - if 
it does, we'd rather add real schema columns for that bit of 
information.[3]


If sorting is all you need from all this, it's rather easy to also 
simply add another column to the table that contains the sort-by key and 
use that column whenever you query-sort it or compare it.
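For that sort-key-column variant, a minimal sketch; the key derivation here (a fixed-width decimal rendering of the first two bytes) is hypothetical:

```python
import sqlite3

def make_key(blob):
    # hypothetical key derivation: fixed-width, zero-padded decimal of
    # the first two bytes, so plain text order matches the wanted order
    return f"{int.from_bytes(blob[:2], 'big'):05d}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(b BLOB, sort_key TEXT)")
conn.execute("CREATE INDEX t_sort ON t(sort_key)")  # an index SQLite can use
for blob in (b"\x01\x00rest", b"\x00\x02rest", b"\x02\x01rest"):
    conn.execute("INSERT INTO t VALUES (?, ?)", (blob, make_key(blob)))
keys = [r[0] for r in conn.execute("SELECT sort_key FROM t ORDER BY sort_key")]
```

The application computes the key at insert time; no collation machinery is involved at all, and ORDER BY sort_key can use the index.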


Lastly, you must already be aware, but just in case someone else is 
reading this: If you do have more than one column, place the Blob column 
at the end of the table definition.



Cheers,
Ryan

[1]: We start compressing if Len > 2560 bytes - we found this number 
through some research of our own; a different figure might work for you. 
The compression algorithm varies for us between maximizing size saving 
vs. speed. We can see a size saving well before 2560 bytes, but for the 
general case, the total time of Base64.decode(compressed data) + 
LZ.decompress() only starts beating Base64.decode(non-compressed data) 
at around 3 thousand bytes - so even though a size saving starts under 
1K, a speed saving only starts around 3K+. Also, in case anyone wonders, 
Base64 output compresses really badly - there is no point in first doing 
Base64 and then compressing; it's always a loss.
Mind you, this was long ago, perhaps I will find the code, modernize it 
and run tests again, some things may have changed.


[2]: As an aside: because the only real problem is the null character in 
strings, we designed another Base64-like translation that, instead of 
8-bit to 6-bit (3x8 bits --> 4x6 bits), does 8-bit to 7-bit (7x8 bits 
--> 8x7 bits), using a full 128-character palette plus a few control 
characters - not hard to find in a 256-character byte range (well, about 
220-ish usable characters). It gives a great 7/8 size ratio (as opposed 
to 3/4), but it breaks UTF-8 encoding (because of bit 7), so you must 
never load the stored strings through a UTF-8 translation. In fact, 
after some work on this, we decided that the size saving from 3/4 to 7/8 
(or put another way: 25% waste down to 12.5% waste) is just not worth 
the complication and the non-standard translation. Perhaps one day when 
we have blobs of 50 megabytes and more - but then binding them as actual 
BLOBs and adding a couple of other control columns would be the better 
answer.
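A sketch of one way such an 8-bit-to-7-bit translation could work: pack the input bit-stream into 7-bit units and set bit 7 on every output byte, so no output byte is ever NUL - and none is valid UTF-8, hence the warning above. This is my reconstruction of the idea, not their actual palette:

```python
def encode_7in8(data):
    # Pack 8-bit bytes into 7-bit units; setting bit 7 on every output
    # byte guarantees no NUL appears (but the result is not valid UTF-8)
    out, acc, nbits = bytearray(), 0, 0
    for b in data:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append(((acc >> nbits) & 0x7F) | 0x80)
    if nbits:  # left-align the leftover bits into a final unit
        out.append(((acc << (7 - nbits)) & 0x7F) | 0x80)
    return bytes(out)

def decode_7in8(enc):
    out, acc, nbits = bytearray(), 0, 0
    for b in enc:
        acc = (acc << 7) | (b & 0x7F)
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    return bytes(out)  # the <7 trailing pad bits fall away on their own

payload = bytes(range(256))
packed = encode_7in8(payload)
# 8/7 expansion instead of Base64's 4/3, and no zero bytes anywhere
```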


[3]: String storage has a hard limit of ~2 GB in some systems, though 
you should have opted for another way of storing things long before you 
reach that limit.




Re: [sqlite] Custom collation of blobs

2019-04-27 Thread Dominique Devienne
On Fri, Apr 26, 2019 at 7:36 PM Jens Alfke  wrote:

> We are using SQLite blobs to store some structured values, and need
> control over how they are collated in queries, i.e. memcmp is not the
> correct ordering. We’ve registered a custom collating function, but
> unfortunately it doesn’t get called. According to the docs, collating
> functions are only used to compare strings, not any other data type. Is
> there any way around this limitation?
>
> The only workaround I can think of is to define a custom function that
> converts a blob to a string and collate using those strings — e.g. `… ORDER
> BY collatable_blob(b)`. But this requires expensive string conversions, and
> it doesn’t work well with indexes.
>
> It would be much cleaner and more efficient if there were a type of
> collating function that operated on all data types.
>

You are hitting what is IMHO one of the last big remaining "holes" in
SQLite: its lack of User-Defined Types (UDTs) for persistent columns.
(For transient values in function-call chains we have subtypes nowadays,
which is as close to a UDT as SQLite has.)

Your blob obviously has internal structure, and while you're happy to store
it as an opaque blob, you'd like to have it sort differently.
You can of course work around not having a real UDT by assigning a custom
collation to the column, but really it belongs on the "type", not the
column.

Coming back to your more specific question: aside from DRH adding support
for User-Defined Collations on blob columns (a long shot, I suspect),
you could store your "structured values" as strings instead of blobs.
AFAIK, SQLite does not check that a string is valid UTF-8, so it can store
arbitrary binary too.
As long as you do not perform UTF8 <-> UTF16 encoding conversions, and you
do not call SQL functions that assume valid UTF-8, you should be fine.
And now you can have your custom collation.
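With the values stored as strings, the standard collation machinery applies. A minimal sketch with Python's sqlite3 - the `byfield` collation and the `tag|number` layout are invented; note that Python's API hands the collation `str` values, so this demo uses printable text, whereas the C API would give your callback the raw bytes:

```python
import sqlite3

def field_collate(a, b):
    # hypothetical layout "tag|number": collate by the numeric field
    ka, kb = int(a.split("|")[1]), int(b.split("|")[1])
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("byfield", field_collate)
conn.execute("CREATE TABLE t(v TEXT COLLATE byfield)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("a|10",), ("b|2",), ("c|300",)])
rows = [r[0] for r in conn.execute("SELECT v FROM t ORDER BY v")]
# collation orders by the number: ["b|2", "a|10", "c|300"]
```

Because the collation is declared on the column, plain ORDER BY v (and indexes on v) pick it up without any per-query COLLATE clause.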

Not friendly to users of the DB, who might see garbage when printing those
"binary" strings, though.
But at this point, that's the only thing you can do that avoids the
expensive conversions you pointed out. --DD


[sqlite] Custom collation of blobs

2019-04-26 Thread Jens Alfke
We are using SQLite blobs to store some structured values, and need control 
over how they are collated in queries, i.e. memcmp is not the correct ordering. 
We’ve registered a custom collating function, but unfortunately it doesn’t get 
called. According to the docs, collating functions are only used to compare 
strings, not any other data type. Is there any way around this limitation?

The only workaround I can think of is to define a custom function that converts 
a blob to a string and collate using those strings — e.g. `… ORDER BY 
collatable_blob(b)`. But this requires expensive string conversions, and it 
doesn’t work well with indexes.

It would be much cleaner and more efficient if there were a type of collating 
function that operated on all data types.

—Jens