Re: [GENERAL] Bit datatype performance?

2011-09-15 Thread Harald Fuchs
In article CAPHN3JX1YNxnGsu3q5A0wGqMMwjXMcmu8LnZ72jepE2A=t2...@mail.gmail.com,
Antonio Vieiro anto...@antonioshome.net writes:

 Hi all,
 One of my entities 'E' may be 'tagged' with an arbitrary set of 256 tags 'T'.

 A first approach could be to add a M:N relationship between 'E' and 'T'.

 A second way to do this could be to add a BIT(256) datatype to 'E',
 setting bits to '1' if the entity is tagged with each one of the 256
 tags (i.e. using a 'bitmask' on the set of tags).

 Since querying entities 'E' with a certain set of tags 'T' must be
 very fast I was wondering if the second approach would be faster. What
 do you think?

I think the best way is to put the tags into a hstore column.  With a
GiST index on that column access is very fast.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bit datatype performance?

2011-09-14 Thread David Johnston

On Sep 14, 2011, at 6:00, Antonio Vieiro anto...@antonioshome.net wrote:

 Hi all,
 
 One of my entities 'E' may be 'tagged' with an arbitrary set of 256 tags 'T'.
 
 A first approach could be to add a M:N relationship between 'E' and 'T'.
 
 A second way to do this could be to add a BIT(256) datatype to 'E',
 setting bits to '1' if the entity is tagged with each one of the 256
 tags (i.e. using a 'bitmask' on the set of tags).
 
 Since querying entities 'E' with a certain set of tags 'T' must be
 very fast I was wondering if the second approach would be faster. What
 do you think?
 
 Thanks for any hints,
 Antonio
 
 -- 
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general

Dealing with 256 arbitrary ones and zeros instead of meaningful named tags 
seems to be asking for mega-confusion.

If performance is really that important do both and run some performance tests.

If the tag set ever changes a schema change will be needed for the bit version 
but not the two-table version.

David J.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bit datatype performance?

2011-09-14 Thread Radosław Smogura

On Wed, 14 Sep 2011 12:00:35 +0200, Antonio Vieiro wrote:

Hi all,

One of my entities 'E' may be 'tagged' with an arbitrary set of 256 
tags 'T'.


A first approach could be to add a M:N relationship between 'E' and 
'T'.


A second way to do this could be to add a BIT(256) datatype to 'E',
setting bits to '1' if the entity is tagged with each one of the 256
tags (i.e. using a 'bitmask' on the set of tags).

Since querying entities 'E' with a certain set of tags 'T' must be
very fast I was wondering if the second approach would be faster. 
What

do you think?

Thanks for any hints,
Antonio


I assume each entity may have one or more different tags.

Actually performing test like
... where (t.bits  :mymask) = :mymask
should be quite fast and faster then creating additional relations, but 
only if it's highly probable that your query will almost always scan 
whole table.


The advantage of indexes is that the index is used 1st and tail 
(slower) parts of query will always get subset of table. In bitset, 
You will probably scan whole table.


So I think, you should do some performance test for large number of 
data, and compare both ways. I think bitset will be fast for really 
small data, but M:N relations may be faster for really large data sets.


You need to measure size of your database too, in M:N case with 256 
tags it may be quite large.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bit datatype performance?

2011-09-14 Thread pasman pasmański
Other option is use an array of int2 instead of bit(256). It can be indexed.







2011/9/14, Radosław Smogura rsmog...@softperience.eu:
 On Wed, 14 Sep 2011 12:00:35 +0200, Antonio Vieiro wrote:
 Hi all,

 One of my entities 'E' may be 'tagged' with an arbitrary set of 256
 tags 'T'.

 A first approach could be to add a M:N relationship between 'E' and
 'T'.

 A second way to do this could be to add a BIT(256) datatype to 'E',
 setting bits to '1' if the entity is tagged with each one of the 256
 tags (i.e. using a 'bitmask' on the set of tags).

 Since querying entities 'E' with a certain set of tags 'T' must be
 very fast I was wondering if the second approach would be faster.
 What
 do you think?

 Thanks for any hints,
 Antonio

 I assume each entity may have one or more different tags.

 Actually performing test like
 ... where (t.bits  :mymask) = :mymask
 should be quite fast and faster then creating additional relations, but
 only if it's highly probable that your query will almost always scan
 whole table.

 The advantage of indexes is that the index is used 1st and tail
 (slower) parts of query will always get subset of table. In bitset,
 You will probably scan whole table.

 So I think, you should do some performance test for large number of
 data, and compare both ways. I think bitset will be fast for really
 small data, but M:N relations may be faster for really large data sets.

 You need to measure size of your database too, in M:N case with 256
 tags it may be quite large.

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



-- 

pasman

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bit datatype performance?

2011-09-14 Thread Antonio Vieiro

Hi again,

Thanks for the tip. In fact I was thinking of creating an index on the 
bitmask, so I could use:


... where t.bits = :mymask

directly, avoiding a full table scan. I assume this is possible 
(indexing bit and comparing bits), isn't it?


Thanks,
Antonio

El 14/09/11 15:58, Radosław Smogura escribió:

On Wed, 14 Sep 2011 12:00:35 +0200, Antonio Vieiro wrote:

Hi all,

One of my entities 'E' may be 'tagged' with an arbitrary set of 256
tags 'T'.

A first approach could be to add a M:N relationship between 'E' and 'T'.

A second way to do this could be to add a BIT(256) datatype to 'E',
setting bits to '1' if the entity is tagged with each one of the 256
tags (i.e. using a 'bitmask' on the set of tags).

Since querying entities 'E' with a certain set of tags 'T' must be
very fast I was wondering if the second approach would be faster. What
do you think?

Thanks for any hints,
Antonio


I assume each entity may have one or more different tags.

Actually performing test like
... where (t.bits  :mymask) = :mymask
should be quite fast and faster then creating additional relations, but
only if it's highly probable that your query will almost always scan
whole table.

The advantage of indexes is that the index is used 1st and tail (slower)
parts of query will always get subset of table. In bitset, You will
probably scan whole table.

So I think, you should do some performance test for large number of
data, and compare both ways. I think bitset will be fast for really
small data, but M:N relations may be faster for really large data sets.

You need to measure size of your database too, in M:N case with 256 tags
it may be quite large.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bit datatype performance?

2011-09-14 Thread Radosław Smogura

Hi,

I think it's not bad approach if performance is important. I don't know 
how b-tree index will work with bitset datatype, but I assume it should 
treat is as 256bit number (maybe someone more competive in internals 
will answer this).


Please bear in mind, that this approach will work well only on query 
You have written.


Because You ask on performance, I will add this topic You may want to 
test and think about it


PG by default uses text transfer mode, so if you transfer your data 
from/to server those will be transferred as 256 0/1 character string. 
You may to think about storing tags as e.g. 4 long (64bit) fields, or 2 
type 4 UUIDs (128bit) and use composite index. If you have ability to 
use binary transfer and on your client side bitest will be mapped to 
some reasonable type, then You won, otherwise (in binary mode) you 
should get nice boost when you will store, those values in types I have 
wrote.


Of course those are only some concepts, personally I have never made 
such things.


Regards,
Radek

On Wed, 14 Sep 2011 17:58:58 +0200, Antonio Vieiro wrote:

Hi again,

Thanks for the tip. In fact I was thinking of creating an index on
the bitmask, so I could use:

... where t.bits = :mymask

directly, avoiding a full table scan. I assume this is possible
(indexing bit and comparing bits), isn't it?

Thanks,
Antonio

El 14/09/11 15:58, Radosław Smogura escribió:

On Wed, 14 Sep 2011 12:00:35 +0200, Antonio Vieiro wrote:

Hi all,

One of my entities 'E' may be 'tagged' with an arbitrary set of 256
tags 'T'.

A first approach could be to add a M:N relationship between 'E' and 
'T'.


A second way to do this could be to add a BIT(256) datatype to 'E',
setting bits to '1' if the entity is tagged with each one of the 
256

tags (i.e. using a 'bitmask' on the set of tags).

Since querying entities 'E' with a certain set of tags 'T' must be
very fast I was wondering if the second approach would be faster. 
What

do you think?

Thanks for any hints,
Antonio


I assume each entity may have one or more different tags.

Actually performing test like
... where (t.bits  :mymask) = :mymask
should be quite fast and faster then creating additional relations, 
but

only if it's highly probable that your query will almost always scan
whole table.

The advantage of indexes is that the index is used 1st and tail 
(slower)
parts of query will always get subset of table. In bitset, You 
will

probably scan whole table.

So I think, you should do some performance test for large number of
data, and compare both ways. I think bitset will be fast for really
small data, but M:N relations may be faster for really large data 
sets.


You need to measure size of your database too, in M:N case with 256 
tags

it may be quite large.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bit datatype performance?

2011-09-14 Thread Eduardo Piombino
Hi, if you are thinking to access data in that manner, what's the point of
bits (or tags)?

The idea behind having a value and then using a bitmask is to be able to
test the value against different bitmasks, each bitmask corresponding to a
different tag or tag combination.

The *where *statement you are suggesting differs in nothing from a regular
id or in this case a category id (instead of a combination of tags)You will
be fetching all records that only have a specific mask.

I think you are a little bit confused:
Let's say you have several tags each with an identifier:
TAG_NATURE = 1
TAG_ANIMALS = 2
TAG_CARS = 4
TAG_SPORTS = 8

then you have a record ... the idea to use bits is to be able to assign that
record a single value, formed by the combination of the different tags.

For example if an element corresponds to TAG_NATURE and TAG_ANIMALS, you
would want to have that element with a value of TAG_NATURE + TAG_ANIMALS
resulting in a tag value of 3.

Then if you want to extract all ANIMALS you just do:

*... where value  TAG_ANIMALS = TAG_ANIMALS;*

because if you just do:

*... where value = TAG_ANIMALS*

you will only get the elements that *exclusively *have the tag TAG_ANIMALS.
You will miss for instance those that have the NATURE *and* ANIMALS (or any
other tag).

So, your simple index on value willl not be of any help, since you won't be
doing

*... where value = ANY_SPECIFIC_TAG*

because of the latter.

Now, if you are going to have a different TAG for every TAG COMBINATION
well, you can do that, but that would be no different than any regular id,
in this case, it would be more of a CATEGORY, and elements will only be
able to have one single category for them.

One alternative would be to try to make some helping indexes on expressions,
maybe with the help of a function like:

create or replace function hasTag(data integer, tag integer) returns boolean
as $$
declare
begin
return (data  tag) = tag;
end;
$$ language plpgsql immutable;

-- this function would return
select hasTag(1, 1); -- true
select hasTag(3, 1); -- true
select hasTag(4, 1); -- false

This way, you could reformulate your query in the following fashion:

... where hasTag(value, TAG_NATURE);

and you could now build an index on yourtable based on that expression like:

create index idx_yourtable_hasTag_1 on yourtable (hasTag(value, 1 /*
TAG_NATURE */));

If you would like to fetch a combination of tags, you could do:

... where hasTag(value, TAG_NATURE) and hasTag(value, TAG_ANIMALS)

requiring an extra index on (hasTag(value, TAG_ANIMALS)).

In this way, you will end up requiring 256 indexes :) (which can be from
acceptable to *ridiculous*, depending on how much often the indexes should
be updated, volume, etc), it's up to you. I'm not actually suggesting you
use this approach, it's just a raw idea, and it's just the conclusion of one
line of thought, that may or may have not crossed your mind. Maybe with some
refinement, you can get to something more practical.

But nonetheless (if I'm not missing something huge), the *where *statement
you provided is just the wrong approach to tags.

hope it helps,
regards,
eduardo


On Wed, Sep 14, 2011 at 12:58 PM, Antonio Vieiro
anto...@antonioshome.netwrote:

 Hi again,

 Thanks for the tip. In fact I was thinking of creating an index on the
 bitmask, so I could use:

 ... where t.bits = :mymask

 directly, avoiding a full table scan. I assume this is possible (indexing
 bit and comparing bits), isn't it?

 Thanks,
 Antonio

 El 14/09/11 15:58, Radosław Smogura escribió:

  On Wed, 14 Sep 2011 12:00:35 +0200, Antonio Vieiro wrote:

 Hi all,

 One of my entities 'E' may be 'tagged' with an arbitrary set of 256
 tags 'T'.

 A first approach could be to add a M:N relationship between 'E' and 'T'.

 A second way to do this could be to add a BIT(256) datatype to 'E',
 setting bits to '1' if the entity is tagged with each one of the 256
 tags (i.e. using a 'bitmask' on the set of tags).

 Since querying entities 'E' with a certain set of tags 'T' must be
 very fast I was wondering if the second approach would be faster. What
 do you think?

 Thanks for any hints,
 Antonio


 I assume each entity may have one or more different tags.

 Actually performing test like
 ... where (t.bits  :mymask) = :mymask
 should be quite fast and faster then creating additional relations, but
 only if it's highly probable that your query will almost always scan
 whole table.

 The advantage of indexes is that the index is used 1st and tail (slower)
 parts of query will always get subset of table. In bitset, You will
 probably scan whole table.

 So I think, you should do some performance test for large number of
 data, and compare both ways. I think bitset will be fast for really
 small data, but M:N relations may be faster for really large data sets.

 You need to measure size of your database too, in M:N case with 256 tags
 it may be quite large.



 --
 Sent via pgsql-general mailing list 

Re: [GENERAL] Bit datatype performance?

2011-09-14 Thread Antonio Vieiro

Hi,

Thanks for the tip. Maybe two UUIDs are a best approach. I'll see which 
is more performant.


Kind regards,
Antonio

El 14/09/11 19:32, Radosław Smogura escribió:

Hi,

I think it's not bad approach if performance is important. I don't know
how b-tree index will work with bitset datatype, but I assume it should
treat is as 256bit number (maybe someone more competive in internals
will answer this).

Please bear in mind, that this approach will work well only on query You
have written.

Because You ask on performance, I will add this topic You may want to
test and think about it

PG by default uses text transfer mode, so if you transfer your data
from/to server those will be transferred as 256 0/1 character string.
You may to think about storing tags as e.g. 4 long (64bit) fields, or 2
type 4 UUIDs (128bit) and use composite index. If you have ability to
use binary transfer and on your client side bitest will be mapped to
some reasonable type, then You won, otherwise (in binary mode) you
should get nice boost when you will store, those values in types I have
wrote.

Of course those are only some concepts, personally I have never made
such things.

Regards,
Radek

On Wed, 14 Sep 2011 17:58:58 +0200, Antonio Vieiro wrote:

Hi again,

Thanks for the tip. In fact I was thinking of creating an index on
the bitmask, so I could use:

... where t.bits = :mymask

directly, avoiding a full table scan. I assume this is possible
(indexing bit and comparing bits), isn't it?

Thanks,
Antonio

El 14/09/11 15:58, Radosław Smogura escribió:

On Wed, 14 Sep 2011 12:00:35 +0200, Antonio Vieiro wrote:

Hi all,

One of my entities 'E' may be 'tagged' with an arbitrary set of 256
tags 'T'.

A first approach could be to add a M:N relationship between 'E' and
'T'.

A second way to do this could be to add a BIT(256) datatype to 'E',
setting bits to '1' if the entity is tagged with each one of the 256
tags (i.e. using a 'bitmask' on the set of tags).

Since querying entities 'E' with a certain set of tags 'T' must be
very fast I was wondering if the second approach would be faster. What
do you think?

Thanks for any hints,
Antonio


I assume each entity may have one or more different tags.

Actually performing test like
... where (t.bits  :mymask) = :mymask
should be quite fast and faster then creating additional relations, but
only if it's highly probable that your query will almost always scan
whole table.

The advantage of indexes is that the index is used 1st and tail (slower)
parts of query will always get subset of table. In bitset, You will
probably scan whole table.

So I think, you should do some performance test for large number of
data, and compare both ways. I think bitset will be fast for really
small data, but M:N relations may be faster for really large data sets.

You need to measure size of your database too, in M:N case with 256 tags
it may be quite large.





--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general