@Simon Slavin
> Don't index using substr().  That would be slow because it has to keep
working out substr().

I gave it a try, but that grows the size of the tables and would require a
full update of the key/value table,
so it is not something that can be deployed without a lot of I/O.

The substr() itself should not be a problem: I am ready to trade CPU time
for disk space, and I was planning
to use a check() constraint to ensure uniqueness rather than an explicit
unique index (I can probably live
with the odd duplicate in the substr() value if the space gained is worth
it).
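For reference, such a prefix index can be declared directly as an index on an
expression (supported since SQLite 3.9); a minimal sketch with hypothetical
table and index names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- hypothetical key/value table: integer key, cryptographic GUID value
    CREATE TABLE kv (
        k INTEGER PRIMARY KEY AUTOINCREMENT,
        v BLOB NOT NULL
    );
    -- index only the first 8 bytes of the value instead of the whole blob
    CREATE INDEX kv_v_prefix ON kv (substr(v, 1, 8));
""")
con.execute("INSERT INTO kv (v) VALUES (?)", (b"\x01" * 32,))

# the prefix index is visible like any other index on the table
names = [row[1] for row in con.execute("PRAGMA index_list('kv')")]
print(names)
```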

However, the crux of the issue is that all queries doing a "value = xxx"
will have to be rewritten
to something like "(substr(value, 1, 8) = substr(xxx, 1, 8) and value =
xxx)", and the query plan then has to be checked
to make sure the substr() index is actually used.
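The rewrite and the plan check could look like the sketch below (made-up
table and index names; the key point is that the indexed expression has to
appear verbatim in the WHERE clause for SQLite to consider the expression
index):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE kv (k INTEGER PRIMARY KEY, v BLOB NOT NULL);
    CREATE INDEX kv_v_prefix ON kv (substr(v, 1, 8));
""")
con.execute("INSERT INTO kv (k, v) VALUES (1, ?)", (b"\xab" * 32,))

# original:  SELECT k FROM kv WHERE v = ?
# rewritten: add a redundant prefix test matching the indexed expression
sql = "SELECT k FROM kv WHERE substr(v, 1, 8) = substr(?1, 1, 8) AND v = ?1"
row = con.execute(sql, (b"\xab" * 32,)).fetchone()

# check that the substr() index is actually picked by the planner
plan = " ".join(r[3] for r in
                con.execute("EXPLAIN QUERY PLAN " + sql, (b"\xab" * 32,)))
print(row, plan)
```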

So, if there is a "better" way... :)

@Paul Sanderson
> If I understand correctly then changing from a base64 index to a blob 
> containing
the raw bytes would save 25%

Yes it would, but the problem is human searches / auto-completion: some
keys are displayed as base64,
so humans type in the first base64 characters.
It might be possible to do a partial base64 decode and then filter on the
re-encoded base64, but that would be
quite complex for "just" a 25% size gain :/
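(For the record, the 25% figure follows directly from base64's
3-bytes-to-4-characters expansion, e.g. for a 384-bit value:)

```python
import base64
import os

raw = os.urandom(48)            # a 384-bit value
enc = base64.b64encode(raw)     # 3 raw bytes -> 4 base64 characters
print(len(raw), len(enc))       # 48 64
print(1 - len(raw) / len(enc))  # 0.25 -> the 25% saved by storing raw bytes
```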

Indexing on the start of the string still allows the filtering to occur
with "substr(value, 1, 8) between x and y", for instance.
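A sketch of that range filter for auto-completion (hypothetical names;
assumes text values and a typed prefix no longer than the 8 indexed
characters, with char(65535) as an upper sentinel that sorts after any
base64 character):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT NOT NULL);
    CREATE INDEX kv_v_prefix ON kv (substr(v, 1, 8));
""")
con.executemany("INSERT INTO kv (v) VALUES (?)",
                [("QWJjZGVm",), ("QWJwcXJz",), ("ZGlmZmVy",)])

typed = "QWJ"  # the first base64 characters a human typed so far
rows = con.execute(
    "SELECT v FROM kv "
    "WHERE substr(v, 1, 8) BETWEEN ?1 AND ?1 || char(65535) "
    "ORDER BY v",
    (typed,)).fetchall()
print([r[0] for r in rows])  # only the values starting with the prefix
```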

Eric


On Mon, Jul 30, 2018 at 11:16 AM, Paul Sanderson <
sandersonforens...@gmail.com> wrote:

> If I understand correctly then changing from a base64 index to a blob
> containing the raw bytes would save 25%
>
> Paul
> www.sandersonforensics.com
> SQLite Forensics Book <https://www.amazon.co.uk/dp/ASIN/1980293074>
>
> On 30 July 2018 at 09:32, Eric Grange <zar...@gmail.com> wrote:
>
> > Hi,
> >
> > Is there a way to reduce the size of an index on strings/blobs ?
> >
> > I have tables which are a key + value, the key is an "integer primary key
> > autoincrement", which is used for references in all other tables of the
> > schema.
> > The values are cryptographic GUIDs (so 256 to 512 bits in size) with a
> > "compact" encoding (either base64 or blob rather than hexadecimal
> strings),
> > but they still represent gigabytes of data.
> >
> > Those tables have an index on the value, and my problem is that the size
> of
> > this index (as reported by dbstat or sql3_analyzer) is about the same
> > as the table.
> >
> > As these are cryptographic GUIDs, the first few bytes of a value are in
> > practice unique, so in theory I can index just the first few bytes (using
> > substr());
> > this indeed results in a much smaller index, but it also requires
> > adapting all queries that search by value.
> >
> > Before starting down that route, is there another way?
> >
> > My searches on those indexes are by either exact value or by value start
> > (human search & auto-completion)
> >
> > Eric
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
