> Maybe you're writing for a fixed-space embedded device, which nonetheless
> has space for the gigabytes required

Actually I am at the other end of the spectrum: I am running out of SSD space
on the server side, and the higher SSD storage tiers are considerably more
expensive.
I am also leaving the realm of "baseline" bandwidth requirements for timely
backup & restore.

Cutting down on redundancy could give a couple of years of breathing room,
hopefully enough for terabyte SSDs and 10 Gbps networks to become more
widespread ;)

> [am] I right that in addition to modifying your queries to extract the
> substrings, you'd have the work of handling substring collisions?

Yes, that means replacing a single "=" test in SQL with two (one for
filtering, one for ensuring the match).
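
For illustration, here is roughly what that looks like using SQLite's
expression indexes (available since 3.9); the table and column names are
hypothetical, and the 8-byte prefix is just one possible choice:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hashes "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, value BLOB)")
# Index only the first 8 bytes of the value (expression index)
con.execute("CREATE INDEX idx_hashes_prefix ON hashes(substr(value, 1, 8))")

# Dummy "hashes": 32-byte blobs, distinct per row
con.executemany("INSERT INTO hashes(value) VALUES (?)",
                [(bytes([i]) * 32,) for i in range(100)])

target = bytes([42]) * 32
# Two tests instead of one: the substr() test is satisfiable from the
# prefix index, the full equality test weeds out prefix collisions
row = con.execute(
    "SELECT id FROM hashes "
    "WHERE substr(value, 1, 8) = substr(?, 1, 8) AND value = ?",
    (target, target)).fetchone()
```

Note that the query must repeat the exact indexed expression
(substr(value, 1, 8)) for the planner to consider the prefix index.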

Cutting down to 64 bits is enough in practice to make substring collisions
very unlikely, while dividing the index size by 4 to 8.
I ran some tests, and even 32-bit substrings have few enough collisions not
to be a problem performance-wise (the lookup is just the first step of
typically far more complex queries, so even if it runs 10 times slower,
it does not matter much).
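
As a rough sanity check of those collision rates, the birthday approximation
gives the expected number of colliding pairs; the row count of 10 million
below is an assumption for illustration, not a figure from my data:

```python
# Birthday approximation: among n random b-bit values, the expected number
# of colliding pairs is about n^2 / 2^(b+1). n is a hypothetical row count.
def expected_collisions(n: int, bits: int) -> float:
    return n * n / 2 ** (bits + 1)

n = 10_000_000
print(expected_collisions(n, 64))  # on the order of 1e-6: effectively never
print(expected_collisions(n, 32))  # ~1.2e4: rare relative to 10M rows
```

So with 64-bit substrings collisions essentially never happen, and with
32-bit substrings only a tiny fraction of lookups hit one.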


On Mon, Jul 30, 2018 at 7:23 PM, Donald Griggs <dfgri...@gmail.com> wrote:

> There's a good chance this comment won't be useful to you, Eric.
> Nevertheless,
>
> Any chance of relaxing your space requirement?   I.e., what bad things
> happen if the space is not reduced?
>
> Maybe you're writing for a fixed-space embedded device, which nonetheless
> has space for the gigabytes required, and you can't expand without a chip
> change for all your customers?    If not such a case, would adding, say,
> $3-worth of additional disk space perhaps solve things?
>
> Also -- If you implement your index on the first several bytes of your
> cryptographic hash, then since you have millions of them per gigabyte,  am
> I right that in addition to modifying your queries to extract the
> substrings, you'd have the work of handling substring collisions?
>
> One of the main reasons for using a database is to automate the creation
> and use of efficient search algorithms with the tradeoff of more
> expenditure on disk space.
>
> I hope you find an acceptable solution,
>    Donald
>
>
>
>
>
> On Mon, Jul 30, 2018 at 4:32 AM Eric Grange <zar...@gmail.com> wrote:
>
> > Hi,
> >
> > Is there a way to reduce the size of an index on strings/blobs ?
> >
> > I have tables which are a key + value, the key is an "integer primary key
> > autoincrement", which is used for references in all other tables of the
> > schema.
> > The values are cryptographic GUIDs (so 256 to 512 bits in size) with a
> > "compact" encoding (either base64 or blob rather than hexadecimal
> strings),
> > but they still represent gigabytes of data.
> >
> > Those tables have an index on the value, and my problem is that the size
> of
> > this index (as reported by dbstat or sql3_analyzer) is about the same
> > as the table.
> >
> > As these are cryptographic GUIDs, the first few bytes of a value are in
> > practice unique, so in theory I can index just the first few bytes (using
> > substr()),
> > this indeed results in a much smaller index, but this also requires
> > adapting all queries that search by value.
> >
> > Before starting down that route, is there another way?
> >
> > My searches on those indexes are by either exact value or by value start
> > (human search & auto-completion)
> >
> > Eric
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users@mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >