Re: [HACKERS] How to implement a SP-GiST index as a extension module?

2017-11-12 Thread Connor Wolf
Ok, I've managed to get my custom index working.

It's all on github here: https://github.com/fake-name/pg-spgist_hamming, if
anyone else needs a fuzzy-image searching system
that can integrate into postgresql..

It should be a pretty good basis for anyone else to use if they want to
implement a SP-GiST index too.

Thanks!

On Sun, Nov 5, 2017 at 8:10 PM, Connor Wolf  wrote:

> Never mind, it turns out the issue boiled down to me declaring the
> wrong prefixType in my config function.
>
> TL;DR - PEBKAC
>
> On Sun, Nov 5, 2017 at 1:09 AM, Connor Wolf  com> wrote:
>
>> Ok, I've got everything compiling and it installs properly, but I'm
>> running into problems that I think are either a side-effect of implementing
>> picksplit incorrectly (likely), or a bug in SP-GiST(?).
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/me
>> mcpy-sse2-unaligned.S:159
>> 159 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such
>> file or directory.
>> (gdb) bt
>> #0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/me
>> mcpy-sse2-unaligned.S:159
>> #1  0x004ecd66 in memcpy (__len=16, __src=,
>> __dest=0x13c9dd8) at /usr/include/x86_64-linux-gnu/bits/string3.h:53
>> #2  memcpyDatum (target=target@entry=0x13c9dd8, att=att@entry=0x7fff327325f4,
>> datum=datum@entry=18445692987396472528) at spgutils.c:587
>> #3  0x004ee06b in spgFormInnerTuple (state=state@entry
>> =0x7fff327325e0, hasPrefix=, prefix=18445692987396472528,
>> nNodes=8,
>> nodes=nodes@entry=0x13bd340) at spgutils.c:741
>> #4  0x004f508b in doPickSplit (index=index@entry=0x7f2cf9de7f98,
>> state=state@entry=0x7fff327325e0, current=current@entry=0x7fff32732020,
>> parent=parent@entry=0x7fff32732040, 
>> newLeafTuple=newLeafTuple@entry=0x13b9f00,
>> level=level@entry=0, isNulls=0 '\000', isNew=0 '\000') at
>> spgdoinsert.c:913
>> #5  0x004f6976 in spgdoinsert (index=index@entry=0x7f2cf9de7f98,
>> state=state@entry=0x7fff327325e0, heapPtr=heapPtr@entry=0x12e672c,
>> datum=12598555199787281,
>> isnull=0 '\000') at spgdoinsert.c:2053
>> #6  0x004ee5cc in spgistBuildCallback (index=index@entry
>> =0x7f2cf9de7f98, htup=htup@entry=0x12e6728, values=values@entry
>> =0x7fff327321e0,
>> isnull=isnull@entry=0x7fff32732530 "", tupleIsAlive=tupleIsAlive@entry=1
>> '\001', state=state@entry=0x7fff327325e0) at spginsert.c:56
>> #7  0x00534e8d in IndexBuildHeapRangeScan
>> (heapRelation=heapRelation@entry=0x7f2cf9ddc6c8,
>> indexRelation=indexRelation@entry=0x7f2cf9de7f98,
>> indexInfo=indexInfo@entry=0x1390ad8, allow_sync=allow_sync@entry=1
>> '\001', anyvisible=anyvisible@entry=0 '\000',
>> start_blockno=start_blockno@entry=0,
>> numblocks=4294967295, callback=0x4ee573 ,
>> callback_state=0x7fff327325e0) at index.c:2609
>> #8  0x00534f52 in IndexBuildHeapScan
>> (heapRelation=heapRelation@entry=0x7f2cf9ddc6c8,
>> indexRelation=indexRelation@entry=0x7f2cf9de7f98,
>> indexInfo=indexInfo@entry=0x1390ad8, allow_sync=allow_sync@entry=1
>> '\001', callback=callback@entry=0x4ee573 ,
>> callback_state=callback_state@entry=0x7fff327325e0) at index.c:2182
>> #9  0x004eeb74 in spgbuild (heap=0x7f2cf9ddc6c8,
>> index=0x7f2cf9de7f98, indexInfo=0x1390ad8) at spginsert.c:140
>> #10 0x00535e55 in index_build 
>> (heapRelation=heapRelation@entry=0x7f2cf9ddc6c8,
>> indexRelation=indexRelation@entry=0x7f2cf9de7f98,
>> indexInfo=indexInfo@entry=0x1390ad8, isprimary=isprimary@entry=0
>> '\000', isreindex=isreindex@entry=0 '\000') at index.c:2043
>> #11 0x00536ee8 in index_create 
>> (heapRelation=heapRelation@entry=0x7f2cf9ddc6c8,
>> indexRelationName=indexRelationName@entry=0x12dd600 "int8idx_2",
>> indexRelationId=16416, indexRelationId@entry=0, relFileNode=0,
>> indexInfo=indexInfo@entry=0x1390ad8, indexColNames=indexColNames@en
>> try=0x1390f40,
>> accessMethodObjectId=4000, tableSpaceId=0,
>> collationObjectId=0x12e6b18, classObjectId=0x12e6b38, coloptions=0x12e6b58,
>> reloptions=0, isprimary=0 '\000',
>> isconstraint=0 '\000', deferrable=0 '\000', initdeferred=0 '\000',
>> allow_system_table_mods=0 '\000', skip_build=0 '\000', concurrent=0 '\000',
>> is_internal=0 '\000', if_not_exists=0 '\000') at index.c:1116
>> #12 0x0

Re: [HACKERS] How to implement a SP-GiST index as a extension module?

2017-11-05 Thread Connor Wolf
Never mind, it turns out the issue boiled down to me declaring the
wrong prefixType in my config function.

TL;DR - PEBKAC

On Sun, Nov 5, 2017 at 1:09 AM, Connor Wolf  wrote:

> Ok, I've got everything compiling and it installs properly, but I'm
> running into problems that I think are either a side-effect of implementing
> picksplit incorrectly (likely), or a bug in SP-GiST(?).
>
> Program received signal SIGSEGV, Segmentation fault.
> __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/
> memcpy-sse2-unaligned.S:159
> 159 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such file
> or directory.
> (gdb) bt
> #0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/
> memcpy-sse2-unaligned.S:159
> #1  0x004ecd66 in memcpy (__len=16, __src=,
> __dest=0x13c9dd8) at /usr/include/x86_64-linux-gnu/bits/string3.h:53
> #2  memcpyDatum (target=target@entry=0x13c9dd8, att=att@entry=0x7fff327325f4,
> datum=datum@entry=18445692987396472528) at spgutils.c:587
> #3  0x004ee06b in spgFormInnerTuple (state=state@entry=0x7fff327325e0,
> hasPrefix=, prefix=18445692987396472528, nNodes=8,
> nodes=nodes@entry=0x13bd340) at spgutils.c:741
> #4  0x004f508b in doPickSplit (index=index@entry=0x7f2cf9de7f98,
> state=state@entry=0x7fff327325e0, current=current@entry=0x7fff32732020,
> parent=parent@entry=0x7fff32732040, 
> newLeafTuple=newLeafTuple@entry=0x13b9f00,
> level=level@entry=0, isNulls=0 '\000', isNew=0 '\000') at
> spgdoinsert.c:913
> #5  0x004f6976 in spgdoinsert (index=index@entry=0x7f2cf9de7f98,
> state=state@entry=0x7fff327325e0, heapPtr=heapPtr@entry=0x12e672c,
> datum=12598555199787281,
> isnull=0 '\000') at spgdoinsert.c:2053
> #6  0x004ee5cc in spgistBuildCallback 
> (index=index@entry=0x7f2cf9de7f98,
> htup=htup@entry=0x12e6728, values=values@entry=0x7fff327321e0,
> isnull=isnull@entry=0x7fff32732530 "", tupleIsAlive=tupleIsAlive@entry=1
> '\001', state=state@entry=0x7fff327325e0) at spginsert.c:56
> #7  0x00534e8d in IndexBuildHeapRangeScan
> (heapRelation=heapRelation@entry=0x7f2cf9ddc6c8,
> indexRelation=indexRelation@entry=0x7f2cf9de7f98,
> indexInfo=indexInfo@entry=0x1390ad8, allow_sync=allow_sync@entry=1
> '\001', anyvisible=anyvisible@entry=0 '\000', start_blockno=start_blockno@
> entry=0,
> numblocks=4294967295, callback=0x4ee573 ,
> callback_state=0x7fff327325e0) at index.c:2609
> #8  0x00534f52 in IndexBuildHeapScan 
> (heapRelation=heapRelation@entry=0x7f2cf9ddc6c8,
> indexRelation=indexRelation@entry=0x7f2cf9de7f98,
> indexInfo=indexInfo@entry=0x1390ad8, allow_sync=allow_sync@entry=1
> '\001', callback=callback@entry=0x4ee573 ,
> callback_state=callback_state@entry=0x7fff327325e0) at index.c:2182
> #9  0x004eeb74 in spgbuild (heap=0x7f2cf9ddc6c8,
> index=0x7f2cf9de7f98, indexInfo=0x1390ad8) at spginsert.c:140
> #10 0x00535e55 in index_build 
> (heapRelation=heapRelation@entry=0x7f2cf9ddc6c8,
> indexRelation=indexRelation@entry=0x7f2cf9de7f98,
> indexInfo=indexInfo@entry=0x1390ad8, isprimary=isprimary@entry=0
> '\000', isreindex=isreindex@entry=0 '\000') at index.c:2043
> #11 0x00536ee8 in index_create 
> (heapRelation=heapRelation@entry=0x7f2cf9ddc6c8,
> indexRelationName=indexRelationName@entry=0x12dd600 "int8idx_2",
> indexRelationId=16416, indexRelationId@entry=0, relFileNode=0,
> indexInfo=indexInfo@entry=0x1390ad8, indexColNames=indexColNames@
> entry=0x1390f40,
> accessMethodObjectId=4000, tableSpaceId=0,
> collationObjectId=0x12e6b18, classObjectId=0x12e6b38, coloptions=0x12e6b58,
> reloptions=0, isprimary=0 '\000',
> isconstraint=0 '\000', deferrable=0 '\000', initdeferred=0 '\000',
> allow_system_table_mods=0 '\000', skip_build=0 '\000', concurrent=0 '\000',
> is_internal=0 '\000', if_not_exists=0 '\000') at index.c:1116
> #12 0x005d8fe6 in DefineIndex (relationId=relationId@entry=16413,
> stmt=stmt@entry=0x12dd568, indexRelationId=indexRelationId@entry=0,
> is_alter_table=is_alter_table@entry=0 '\000',
> check_rights=check_rights@entry=1 '\001', check_not_in_use=check_not_in_
> use@entry=1 '\001', skip_build=0 '\000',
> quiet=0 '\000') at indexcmds.c:667
> #13 0x00782057 in ProcessUtilitySlow (pstate=pstate@entry=0x12dd450,
> pstmt=pstmt@entry=0x12db108,
> queryString=queryString@entry=0x12da0a0 "CREATE INDEX int8idx_2 ON
> int8tmp_2 USING spgist ( a vptree_ops );", context=context@entry=PROCESS_
> UTILITY_TOPLEV

Re: [HACKERS] How to implement a SP-GiST index as a extension module?

2017-11-05 Thread Connor Wolf
PLEVEL,
params=, queryEnv=,
dest=dest@entry=0x12db200,
completionTag=0x7fff32732ed0 "") at utility.c:357
#16 0x0077de2e in PortalRunUtility (portal=portal@entry=0x1391a80,
pstmt=pstmt@entry=0x12db108, isTopLevel=isTopLevel@entry=1 '\001',
setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=dest@entry=0x12db200,
completionTag=completionTag@entry=0x7fff32732ed0 "") at pquery.c:1178
#17 0x0077e98e in PortalRunMulti (portal=portal@entry=0x1391a80,
isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0
'\000',
dest=dest@entry=0x12db200, altdest=altdest@entry=0x12db200,
completionTag=completionTag@entry=0x7fff32732ed0 "") at pquery.c:1324
#18 0x0077f782 in PortalRun (portal=portal@entry=0x1391a80,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001',
run_once=run_once@entry=1 '\001', dest=dest@entry=0x12db200,
altdest=altdest@entry=0x12db200, completionTag=0x7fff32732ed0 "") at
pquery.c:799
#19 0x0077bc12 in exec_simple_query
(query_string=query_string@entry=0x12da0a0
"CREATE INDEX int8idx_2 ON int8tmp_2 USING spgist ( a vptree_ops );")
at postgres.c:1120
#20 0x0077d95c in PostgresMain (argc=,
argv=argv@entry=0x12e9948, dbname=0x12bca10 "contrib_regression",
username=)
at postgres.c:4139
#21 0x006fecf4 in BackendRun (port=port@entry=0x12de030) at
postmaster.c:4364
#22 0x00700e32 in BackendStartup (port=port@entry=0x12de030) at
postmaster.c:4036
#23 0x00701112 in ServerLoop () at postmaster.c:1755
#24 0x007023af in PostmasterMain (argc=argc@entry=8,
argv=argv@entry=0x12ba7c0)
at postmaster.c:1363
#25 0x006726c1 in main (argc=8, argv=0x12ba7c0) at main.c:228



It's segfaulting when trying to build the inner tuple after the picksplit
operation.

Adding debugging output to the print function, I see:

NOTICE:  Memcopying from  to 013d7938 with len 16

The first item in my input data file is zero, and if I change it to 1:

NOTICE:  Memcopying from 0001 to 01b45938 with len 16

So pretty clearly, I'm trying to copy from the literal data representation
of the data as an address.
Following the data, this is the value I'm assigning to out->prefixDatum in
my picksplit call. I can confirm this by hard-coding the
value of out->prefixDatum in my picksplit call to a known value, it shows
up as the address in the memcopy call.

However, as far as I can tell, I'm assigning it correctly:  out->prefixDatum
= Int64GetDatum(val);

This is similar to how the other spgist implementations work.
spgkdtreeproc.c does out->prefixDatum = Float8GetDatum(coord);
for example.

I think this is the SP-GiST core failing to handle certain types being
pass-by-value? I'm not totally certain.

As I understand it, the "maybe-pass-by-reference" parameter is a global
flag (USE_FLOAT8_BYVAL), but I'd like to
keep that enabled. What's the proper approach for adding support for this
in the SP-GiST core?

My (somewhat messy) extension module is here
<https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree>, if
it's relevant.

Connor




On Fri, Nov 3, 2017 at 3:12 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Fri, Nov 3, 2017 at 12:37 PM, Connor Wolf  com> wrote:
>
>> EDIT: That's actually exactly how the example I'm working off of works.
>> DERP. The SQL is
>>
>> CREATE TYPE vptree_area AS
>> (
>> center _int4,
>> distance float8
>> );
>>
>> CREATE OR REPLACE FUNCTION vptree_area_match(_int4, vptree_area) RETURNS
>> boolean AS
>> 'MODULE_PATHNAME','vptree_area_match'
>> LANGUAGE C IMMUTABLE STRICT;
>>
>> CREATE OPERATOR <@ (
>> LEFTARG = _int4,
>> RIGHTARG = vptree_area,
>> PROCEDURE = vptree_area_match,
>> RESTRICT = contsel,
>> JOIN = contjoinsel);
>>
>> so I just need to understand how to parse out the custom type in my index
>> operator.
>>
>
> You can see the implementation of vptree_area_match function located in
> vptree.c.  It just calls GetAttributeByNum() function.
>
> There is also alternative approach for that implemented in pg_trgm contrib
> module.  It has "text % text" operator which checks if two strings are
> similar enough.  The similarity threshold is defined by
> pg_trgm.similarity_threshold GUC.  Thus, you can also define GUC with
> threshold distance value.  However, it would place some limitations.  For
> instance, you wouldn't be able to use different distance threshold in the
> same query.
>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: [HACKERS] How to implement a SP-GiST index as a extension module?

2017-11-03 Thread Connor Wolf
Yeah, unfortunately, the way these type of metric trees work, the entire
search procedure is a function of both the target value and the allowed
search distance. The only way I can think of to return ordered results
without just scanning the entire index would be to repeatedly search the
index while gradually incrementing the allowed search distance (which is
horrible).

>From looking at some of the contrib modules, the way other index libraries
that have similar needs manage it is by implementing custom types that
encapsulate the filter parameters. The sp-gist kd-tree and quadtree indexes
store point coordinates in n-dimensional space, but they (ab)use the BOX
type because it's a convenient way of passing multiple values into the
value parameter of the index query.

I'm thinking at this point, I'm mostly stuck having to define a custom type
to encapsulate the relevant parameters. Really, the filter condition is a
integer 2-tuple, so I wonder if I could do something horrible with the
array type. If the value parameter for the query could be a bigint
array[2], that would work, and I'd just have to remember the ordering.

Does that seem viable?


EDIT: That's actually exactly how the example I'm working off of works.
DERP. The SQL is

CREATE TYPE vptree_area AS
(
center _int4,
distance float8
);

CREATE OR REPLACE FUNCTION vptree_area_match(_int4, vptree_area) RETURNS
boolean AS
'MODULE_PATHNAME','vptree_area_match'
LANGUAGE C IMMUTABLE STRICT;

CREATE OPERATOR <@ (
LEFTARG = _int4,
RIGHTARG = vptree_area,
PROCEDURE = vptree_area_match,
RESTRICT = contsel,
JOIN = contjoinsel);

so I just need to understand how to parse out the custom type in my index
operator.
--

Sorry if I'm asking a lot of dumb questions. Postgresql is huge and I have
no idea what I'm really doing.

On Fri, Nov 3, 2017 at 12:20 AM, Robert Haas  wrote:

> On Thu, Nov 2, 2017 at 9:53 AM, Connor Wolf
>  wrote:
> > As such:
> > Will compound queries as I describe above basically require a custom
> type to
> > make it possible? My (admittedly naive) expectation
> > is that the eventual query for this index will look something like
> "SELECT *
> > FROM example_table WHERE indexed_column <=> target_value < 4;",
> > with "<=>" being the operator for the relevant distance calculation
> > (hamming, for the BK tree, numeric for the VP-tree).
> >
> > The existing VP-tree code appears to not support multiple operators
> > whatsoever, probably because it was very preliminary.
>
> I'm not an expert in this area in any way whatsoever; I don't know a
> VP-tree from a BK-tree from a maple tree.
>
> However, I can tell you that as a general rule, PostgreSQL index
> access methods can only apply index quals of the form "WHERE column op
> value" or ordering criteria of the form "ORDER BY column op value".
> So, in the above example, you might think about trying to set up the
> access method so that it can efficiently return values ordered by
> indexed_column <=> target_value and then wrapping the ORDER BY query
> in a subselect to cut off fetching values at the correct point.  But
> no operator class for any access method can directly handle that query
> efficiently as you've written it.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] How to implement a SP-GiST index as a extension module?

2017-11-01 Thread Connor Wolf
Ok, more questions.

I've been studying the implementation Alexander Korotkov sent, and I'm not
seeing how to map
some of the components onto the changes in the SP-GiST system that occured
between Postgresql 9.2 and 9.3.

The changes at that point seem to have been to change xxx_inner_consistent
and xxx_leaf_consistent from taking
an input containing a Datum pointing to the query conditional to a list of
ScanKeys which each encapsulate one filter condition.

The issue here is that for VP trees (or the BK tree I want to eventually
implement), the filter condition requires two parameters:
 - The maximum allowed distance from the target value
 - The actual target value.

The ScanKeys struct appears to only be able to contain a conditional type,
and a single parameter, such as "less then *x*", "above *y*",
and so forth. Looking at the existing spgkdtreeproc.c and
spgquadtreeproc.c, their mechanism for passing more complex conditions
through to the xxx_consistent functions appears to be to encapsulate the
entire condition in a custom type. For example,
their mechanism for querying if something is within a certain envelope is
done by having the envelope be described
by the "BOX" type.

As such:
Will compound queries as I describe above basically require a custom type
to make it possible? My (admittedly naive) expectation
is that the eventual query for this index will look something like "SELECT
* FROM example_table WHERE indexed_column <=> target_value < 4;",
with "<=>" being the operator for the relevant distance calculation
(hamming, for the BK tree, numeric for the VP-tree).

The existing VP-tree code appears to not support multiple operators
whatsoever, probably because it was very preliminary.

Thanks!
Connor

On Mon, Oct 30, 2017 at 7:04 PM, Connor Wolf <
conn...@imaginaryindustries.com> wrote:

> I was mostly unclear on how I'd go about attaching the extension functions
> to the relevant indexing mechanism. From the looks of the vptree.tar.gz
> file (which is really, *really* helpful, incidentally!), a it's done via a
> custom operator class, which then gets passed to the actual index creation
> mechanism when you're declaring the index.
>
> I think I had looked at that at one point, but it's been a while. In my
> case, I'm using discrete-cosine-transform based perceptual hashes for
> searching. They are nice and compact (64-bits per hash), while still
> producing good search results. I have a dataset of ~36 million images, and
> it does searches in < 50 milliseconds with a hamming distance of 4, while
> touching ~0.25% of the tree (And occupying ~18 GB of ram).
>
> My BK tree is up on github here
> <https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp>,
> if anyone needs something like that (BSD licensed, pretty well tested).
> There's also a python wrapper for it.
>
> I'll probably not have time to poke about until this weekend, but thanks!
> Connor
>
>
>
> On Mon, Oct 30, 2017 at 4:50 AM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> Hi!
>>
>> On Sun, Oct 29, 2017 at 12:07 PM, Connor Wolf <
>> w...@imaginaryindustries.com> wrote:
>>
>>> I'm looking at implementing a custom indexing scheme, and I've been
>>> having trouble understanding the proper approach.
>>>
>>> Basically, I need a BK tree, which is a tree-structure useful for
>>> indexing arbitrary discrete metric-spaces (in my case, I'm interested in
>>> indexing across the hamming edit-distance of perceptual hashes, for fuzzy
>>> image searching). I'm *pretty* sure a SP-GiST index is the correct
>>> index type, as my tree is intrinsically unbalanced.
>>>
>>
>> Yes, SP-GiST is appropriate index type for BK tree.  I'm pretty sure BK
>> tree could be implemented as SP-GiST opclass.
>> The only thing worrying me is selection pivot values for nodes.  SP-GiST
>> builds by insertion of index tuples on by one.  First pivot value for root
>> node in SP-GIST would be created once first leaf page overflows.  Thus, you
>> would have to select this pivot value basing on very small fraction in the
>> beginning of dataset.
>> As I know, BK tree is most efficient when root pivot value is selected
>> after looking in whole dataset and then hierarchically to subtrees.
>>
>> BTW, did you try my extension for searching similar images.  It's quite
>> primitive, but works for some cases.
>> https://github.com/postgrespro/imgsmlr
>> https://wiki.postgresql.org/images/4/43/Pgcon_2013_similar_images.pdf
>>
>> I have a functional stand-alone implementation of a BK-Tree,

Re: [HACKERS] How to implement a SP-GiST index as a extension module?

2017-10-30 Thread Connor Wolf
I was mostly unclear on how I'd go about attaching the extension functions
to the relevant indexing mechanism. From the looks of the vptree.tar.gz
file (which is really, *really* helpful, incidentally!), a it's done via a
custom operator class, which then gets passed to the actual index creation
mechanism when you're declaring the index.

I think I had looked at that at one point, but it's been a while. In my
case, I'm using discrete-cosine-transform based perceptual hashes for
searching. They are nice and compact (64-bits per hash), while still
producing good search results. I have a dataset of ~36 million images, and
it does searches in < 50 milliseconds with a hamming distance of 4, while
touching ~0.25% of the tree (And occupying ~18 GB of ram).

My BK tree is up on github here
<https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp>,
if anyone needs something like that (BSD licensed, pretty well tested).
There's also a python wrapper for it.

I'll probably not have time to poke about until this weekend, but thanks!
Connor



On Mon, Oct 30, 2017 at 4:50 AM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> Hi!
>
> On Sun, Oct 29, 2017 at 12:07 PM, Connor Wolf <
> w...@imaginaryindustries.com> wrote:
>
>> I'm looking at implementing a custom indexing scheme, and I've been
>> having trouble understanding the proper approach.
>>
>> Basically, I need a BK tree, which is a tree-structure useful for
>> indexing arbitrary discrete metric-spaces (in my case, I'm interested in
>> indexing across the hamming edit-distance of perceptual hashes, for fuzzy
>> image searching). I'm *pretty* sure a SP-GiST index is the correct index
>> type, as my tree is intrinsically unbalanced.
>>
>
> Yes, SP-GiST is appropriate index type for BK tree.  I'm pretty sure BK
> tree could be implemented as SP-GiST opclass.
> The only thing worrying me is selection pivot values for nodes.  SP-GiST
> builds by insertion of index tuples on by one.  First pivot value for root
> node in SP-GIST would be created once first leaf page overflows.  Thus, you
> would have to select this pivot value basing on very small fraction in the
> beginning of dataset.
> As I know, BK tree is most efficient when root pivot value is selected
> after looking in whole dataset and then hierarchically to subtrees.
>
> BTW, did you try my extension for searching similar images.  It's quite
> primitive, but works for some cases.
> https://github.com/postgrespro/imgsmlr
> https://wiki.postgresql.org/images/4/43/Pgcon_2013_similar_images.pdf
>
> I have a functional stand-alone implementation of a BK-Tree, and it works
>> very well, but the complexity of managing what is basically a external
>> index for my database has reached the point where it's significantly
>> problematic, and it seems to be it should be moved into the database.
>>
>
> Sure, moving this index to the database is right decision.
>
> Anyways, looking at the contents of postgres/src/backend/access/spgist,
>> it looks pretty straightforward in terms of the actual C implementation,
>> but I'm stuck understanding how to "install" a custom SP-GiST
>> implementation. There are several GiST indexing implementations in the
>> contrib directory, but no examples for how I'd go about implementing a
>> loadable SP-GiST index.
>>
>> Basically, my questions are:
>>
>>- Is it possible to implement a SP-GiST indexing scheme as a loadable
>>module?
>>
>>  Yes, it's possible to define SP-GiST.
>
>>
>>- If so, how?
>>
>> The pretty same way as GiST opclass extension.  You have to define
> supporting functions and operators and then define operator class over them.
>
>>
>>- And is there an example I can base my implementation off of?
>>
>>
>>
>> I'm relatively comfortable with C (much moreso with C++), but I haven't
>> spent a lot of time looking at the postgresql codebase.  I don't think I
>> could start from a empty folder and make a properly-implemented module in
>> any reasonable period of time, so if I have a working example for some sort
>> of index that uses the same interfaces that would really help a lot
>>
> I don't think there is an example in PostgreSQL source code tree or on
> github.  But I've attached by early experiment with VP-tree (seems to be
> pretty same as BK tree) using SP-GiST (see vptree.tar.gz).  Basing on this
> experiment I realized that it's important to select root pivot value basing
> on the whole dataset.  However, for your metric/dataset/quer

[HACKERS] How to implement a SP-GiST index as a extension module?

2017-10-30 Thread Connor Wolf

Hi there!

I'm looking at implementing a custom indexing scheme, and I've been 
having trouble understanding the proper approach.


Basically, I need a BK tree, which is a tree-structure useful for 
indexing arbitrary discrete metric-spaces (in my case, I'm interested in 
indexing across the hamming edit-distance of perceptual hashes, for 
fuzzy image searching). I'm /pretty/ sure a SP-GiST index is the correct 
index type, as my tree is intrinsically unbalanced.


I have a functional stand-alone implementation of a BK-Tree, and it 
works very well, but the complexity of managing what is basically a 
external index for my database has reached the point where it's 
significantly problematic, and it seems to be it should be moved into 
the database.


Anyways, looking at the contents of postgres/src/backend/access/spgist, 
it looks pretty straightforward in terms of the actual C implementation, 
but I'm stuck understanding how to "install" a custom SP-GiST 
implementation. There are several GiST indexing implementations in the 
contrib directory, but no examples for how I'd go about implementing a 
loadable SP-GiST index.


Basically, my questions are:

 * Is it possible to implement a SP-GiST indexing scheme as a loadable
   module?
 o If so, how?
 o And is there an example I can base my implementation off of?

I'm relatively comfortable with C (much moreso with C++), but I haven't 
spent a lot of time looking at the postgresql codebase.  I don't think I 
could start from a empty folder and make a properly-implemented module 
in any reasonable period of time, so if I have a working example for 
some sort of index that uses the same interfaces that would really help 
a lot.


Thanks!
Connor



Re: [HACKERS] [GENERAL] Understanding and implementing a GiST Index

2014-10-10 Thread Connor Wolf
I'd be ok with either SP-GiST or GiST, though there are tree structures 
(BK tree) that are particularly suited to this application that I 
/think/ map to GiST better then SP-GiST.


Re: hamming in the RD-tree implementation: Where? I cloned the smlar git 
repo, and poked around, and it looks like it only can operate on set 
intersections, which is a non-starter for what I need to do.  I spent a 
while looking through the source, and I didn't see anything that looked 
like hamming distance calculation (through I probably missed some stuff. 
The indirection through `FCall2()` makes things very hard to follow).


Anyways, right now, smlar is not useful to me, because it completely 
ignores the structure of incoming arrays:


postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
 smlar
---
 1
(1 row)


For two radically different hashes (shortened for example purposes) 
which compare as identical.


I spent some time trying to see if I could easily disable the array 
uniquefication, and by disabling the calls to `qsort_arg`, and modifying 
`numOfIntersect` so it just counts the number of times the array 
elements do not match, and I'm getting the behaviour I want out of 
`smlar()` at this point:


postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
 smlar

 0.4375
(1 row)

postgres=# SELECT smlar(
'{0,1,1,0}'::int[],
'{1,0,1,0}'
);
 smlar
---
   0.5
(1 row)


But the index does not seem to work after the changes I made, and the 
debugging print statements (well, `elog()` statements) I inserted into 
`cmpArrayElem()` and `numOfIntersect()` are not being hit, so I don't 
understand how the index is even being built.


Anyways, I'm rambling a bit. Any input would be great.
Thanks,
Connor

On 10/9/2014 8:11 AM, Oleg Bartunov wrote:

Just quick recommendation.
You need to ask -hackers mailing list for that. Also, do you want GiST 
or SP-GiST ?
We already use hamming distance in rd-tree, which implemented in GiST, 
see our presentation 
http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf 
<http://www.sai.msu.su/%7Emegera/postgres/talks/pgcon-2012.pdf>, for 
example.


Oleg

On Thu, Oct 9, 2014 at 11:09 AM, Connor Wolf 
mailto:w...@imaginaryindustries.com>> 
wrote:


Hi there!

I'm trying to implement a custom indexing system using the GiST
index framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to
search across a set of 64 bit integers by hamming distance. This
is intended to be used in searching for similar images, where the
64 bit integer is actually a phash
<http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html>
of an image, and similarity between two images is reflected in the
hamming distance between two integers.

Anyways, The appropriate approach here is to use something called
a BK tree
<http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>,
for which I've written some test code

<https://github.com/fake-name/MangaCMS/blob/master/deduplicator/hamDb.py#L27>
and I think I have a decent grip of (my test code seems to work,
in any event).

That said:

  * Is there a /simple/ piece of example-code for implementing a
GiST index for a single index? I've been looking through the
/contrib/ directory, particularly the /contrib/btree_gist/
files, but it looks like 90% of the complexity there is
related to supporting all the column types Postgres has,
rather then actually tied to producing a functional index.
  * Once I have something compiling, how can I check to be sure
that I'm actually using the indexing module I created? I can
enable my compiled extension using `CREATE EXTENSION`, but is
there an option for `CREATE INDEX test_index ON testing USING
gist (val);` that lets me specify *which* GiST index is
actually employed? Is this even a valid question?
This is probably something that's available in one of the
system tables somewhere, but I haven't had much luck with
google finding out where.
  * Testing: What's the appropriate way to examine the generated
tree structure of the index? I certainly went through a few
bugs with my test tree system (and that was in python!). Is
there any way to examine the index structure for debugging
purposes?
  * Also, it looks like `ereport()` is the proper way to emit
debugging information. Is this correct?
  * In that vein, is there any way to have information that is on
the operations of an entire q