Re: [GENERAL] Comparison of Oracle and PostgreSQL full text search

2010-07-29 Thread Howard Rogers
On Thu, Jul 29, 2010 at 10:33 PM, Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 28 July 2010 02:58, Howard Rogers h...@diznix.com wrote:
 For what it's worth, I wrote up the performance comparison here:
 http://diznix.com/dizwell/archives/153


 Thanks, very interesting results. I wonder, are the results being
 sorted by the database? The performance degradation for large numbers
 of results might be explained by it switching over from an internal to
 an external sort, in which case tweaking work_mem might make a
 difference.

 Of course this is pure speculation without the EXPLAIN ANALYSE output.

 Regards,
 Dean


Yes, the results were being sorted. I did various tests, changing
work_mem, shared_buffers and much else, one by one, until I arrived at
the combination of settings that gave me the best 'total search time'
results. Personally, I couldn't see any difference in the explain
plans, but I was in a bit of a hurry and I may have missed it.
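
(For anyone repeating the experiment, a minimal sketch of that sort of
per-session test - the work_mem value is illustrative, not a
recommendation from the thread:

set work_mem = '64MB';
explain analyze
select * from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english', 'woman')
order by sort_id
limit 10;

Varying work_mem and comparing the Sort Method lines in the output
shows whether the sort ever spills from an internal to an external sort.)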

For the search term 'woman', which matches 1,590,275 documents, here's
the explain plan:

Sort  (cost=185372.88..185372.93 rows=20 width=312) (actual time=10537.152..10537.154 rows=20 loops=1)
  Sort Key: a.rf, a.sort_id
  Sort Method:  quicksort  Memory: 48kB
  ->  Result  (cost=109119.55..185372.45 rows=20 width=312) (actual time=4309.020..10537.116 rows=20 loops=1)
        ->  Append  (cost=109119.55..185372.45 rows=20 width=312) (actual time=4309.018..10537.108 rows=20 loops=1)
              ->  Subquery Scan a  (cost=109119.55..109119.68 rows=10 width=312) (actual time=4309.018..4309.026 rows=10 loops=1)
                    ->  Limit  (cost=109119.55..109119.58 rows=10 width=641) (actual time=4309.016..4309.019 rows=10 loops=1)
                          ->  Sort  (cost=109119.55..109121.94 rows=957 width=641) (actual time=4309.014..4309.015 rows=10 loops=1)
                                Sort Key: search_rm.sort_id
                                Sort Method:  top-N heapsort  Memory: 35kB
                                ->  Bitmap Heap Scan on search_rm  (cost=6651.07..109098.87 rows=957 width=641) (actual time=272.851..4021.458 rows=583275 loops=1)
                                      Recheck Cond: (to_tsvector('english'::regconfig, (textsearch)::text) @@ '''woman'''::tsquery)
                                      Filter: ((bitand(sales_method_code, 1) <> 0) AND (bitand(subsiter, 1) <> 0) AND (bitand(filetype, 1) <> 0))
                                      ->  Bitmap Index Scan on rmsearch_idx  (cost=0.00..6650.83 rows=25826 width=0) (actual time=165.711..165.711 rows=586235 loops=1)
                                            Index Cond: (to_tsvector('english'::regconfig, (textsearch)::text) @@ '''woman'''::tsquery)
              ->  Subquery Scan b  (cost=76252.65..76252.77 rows=10 width=312) (actual time=6228.073..6228.080 rows=10 loops=1)
                    ->  Limit  (cost=76252.65..76252.67 rows=10 width=727) (actual time=6228.072..6228.075 rows=10 loops=1)
                          ->  Sort  (cost=76252.65..76254.29 rows=655 width=727) (actual time=6228.071..6228.072 rows=10 loops=1)
                                Sort Key: search_rf.sort_id
                                Sort Method:  top-N heapsort  Memory: 38kB
                                ->  Bitmap Heap Scan on search_rf  (cost=5175.18..76238.49 rows=655 width=727) (actual time=363.684..5748.279 rows=1007000 loops=1)
                                      Recheck Cond: (to_tsvector('english'::regconfig, (textsearch)::text) @@ '''woman'''::tsquery)
                                      Filter: ((bitand(sales_method_code, 1) <> 0) AND (bitand(subsiter, 1) <> 0) AND (bitand(filetype, 1) <> 0))
                                      ->  Bitmap Index Scan on rfsearch_idx  (cost=0.00..5175.02 rows=17694 width=0) (actual time=242.859..242.859 rows=1030282 loops=1)
                                            Index Cond: (to_tsvector('english'::regconfig, (textsearch)::text) @@ '''woman'''::tsquery)
Total runtime: 10538.832 ms


And here's the plan for the search term clover,  which matches only
2,808 records in total:

  ->  Result  (cost=109119.55..185372.45 rows=20 width=312) (actual time=16.807..23.990 rows=20 loops=1)
        ->  Append  (cost=109119.55..185372.45 rows=20 width=312) (actual time=16.806..23.985 rows=20 loops=1)
              ->  Subquery Scan a  (cost=109119.55..109119.68 rows=10 width=312) (actual time=16.806..16.812 rows=10 loops=1)
                    ->  Limit  (cost=109119.55..109119.58 rows=10 width=641) (actual time=16.805..16.807 rows=10 loops=1)
                          ->  Sort  (cost=109119.55..109121.94 rows=957 width=641) (actual time=16.804..16.805 rows=10 loops=1)
                                Sort Key: search_rm.sort_id
                                Sort Method:  top-N heapsort  Memory: 35kB
                                ->  Bitmap Heap Scan on search_rm  (cost=6651.07..109098.87 rows=957 width=641) (actual time=1.054..15.577 rows=1807 loops=1

Re: [GENERAL] Comparison of Oracle and PostgreSQL full text search

2010-07-28 Thread Howard Rogers
On Wed, Jul 28, 2010 at 8:38 PM, Daniel Verite dan...@manitou-mail.org wrote:
        zhong ming wu wrote:

 I always thought there is a clause in their user agreement preventing
 the users from publishing benchmarks like that. I must be mistaken.

 No you're correct. Currently, to download the current Oracle 11.2g, one must
 agree to:
 http://www.oracle.com/technetwork/licenses/standard-license-152015.html

 which contains:
 <quote>
 [...]
 You may not:
 [...]
 - disclose results of any program benchmark tests without our prior consent.
 [...]
 </quote>

 Not having such frustrating license terms is also what makes PostgreSQL a
 nicer alternative!

 Best regards,
 --
 Daniel
 PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general


Hi Daniel:

Nice catch - the thing is, you've linked to the *technet* license: the
one you agree to when you download the product for free, for
development, prototyping and self-learning purposes. That's not the
same license as the one you sign up to when you pay them stacks of
cash for the 'proper' product for a production deployment (which I
haven't read lately, so I can't say the same silly term isn't in
there; I'm just saying that the license you linked to is not the one
that applies).

Also, I would argue that what I did was not a 'benchmark test'. We
capture the results and timings of queries as part of our production
application, for management and review purposes. Those are real
results, experienced by real users... not what I'd call a benchmark
test. (The PostgreSQL results are, certainly, an artificial
benchmark, but then the Oracle license doesn't cover those, happily!)

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Comparison of Oracle and PostgreSQL full text search

2010-07-27 Thread Howard Rogers
Thanks to some very helpful input here in earlier threads, I was
finally able to pull together a working prototype Full Text Search
'engine' on PostgreSQL and compare it directly to the way the
production Oracle Text works. The good news is that PostgreSQL is
bloody fast! The slightly iffy news is that the boss is now moaning
about possible training costs!

For what it's worth, I wrote up the performance comparison here:
http://diznix.com/dizwell/archives/153

Maybe it will be of use to anyone else wondering if it's possible to
do full text search and save a couple hundred thousand dollars whilst
you're at it!

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bitmask trickiness

2010-07-23 Thread Howard Rogers
On Fri, Jul 23, 2010 at 3:02 PM, Scott Marlowe scott.marl...@gmail.com wrote:
 If you mean, did I read the bit in the doco where it said nothing at
 all in the 'these are great advantages' style I've just described, but
 instead makes the fairly obvious point that a bit string takes 8 bits
 to store a group of 8 bits (well, stone me!!)

 Wow, I'm surprised you get any help with your attitude.  I posted a
 link and asked a question and right up front got my head handed to me.

 To quote:  Why on Earth would I want to store this sort of stuff in a
 bit string?!

 I don't know about you, but I find looking at 21205 a darned sight
 easier than staring blankly at 101001011010101!!

 Like I'd somehow bitten your hand when I asked my question.

That is not attitude, Scott. That is me asking a question riddled with
rhetorical exclamations.

You asked me why I wasn't using bitstrings. I asked why on Earth I
would. That's all. The only attitude you could reasonably detect
there, I think, is surprise that, when I ask a precise technical
question, someone would feel it fair game to question the entire basis
of a database design about which they know extremely little (which is
not their - or your - fault, because you only get to know the details I
include in the question, after all; the fact remains that, on such slim
foundations, I wouldn't build a mountain of questions about the reasons
behind someone else's work).

 PLUS has extra overhead,
 then yes, I did read that part of your first link... and nevertheless
 concluded that, overall, there is... er, some extra overhead in
 storing bitstrings.

 Well, your initial answer certainly didn't give ANY idea that you'd
 read that page.

It neither gave the impression I had, nor gave any indication that I
hadn't! That's the point: you've assumed something you needn't have.

 So what precisely about that first article, which I did indeed read,
 would you have expected to lead me to the conclusion that I'd SAVE
 significant amounts of space or find some other technically-compelling
 reason for switching?

 I didn't expect such.  I asked why you weren't using them, and gave
 you some links to read on it.  It clearly states that bit strings use
 a bit per bit, plus some overhead.  Now, I had no idea if you were
 dealing with bigints and 60 bit strings or 5 bit strings.  In fact,
 you did little to really describe your project and preferences in your
 post.  Which is why my response was short and concise, I had little to
 go on.

Exactly. Now, I grant you it's difficult when you can't see me; when
there's only the written word to go on; when you don't know my
personal modes of expression, nor I yours. But that's precisely why I
wouldn't answer a narrowly-scoped technical question with even a hint
of a suggestion that the entire basis of the question was suspect: I
didn't give you any details of my project or preferences, because I
was asking a precise, narrowly-focussed question about getting one
specific result with one specific query structure.

Honestly, when I want general consulting, I pay for it. You really
don't have to try and give it away for free.

 My point is that there's nothing much in it, storage-wise, either way.

 Well, there is the fact that bit strings can restrict the size of the
 entry so you don't accidentally get an int stored that's got more bits
 than your model can handle.

Well, that's fortunately not something I have to worry about.
Somewhere around the 192nd bit, I'll start panicking. When I've only
got 15 of the blighters to worry about, I think I'll cope.

There's also the issue that if / when you
 ever get close to the last bit in an int, a bitstring may behave oddly
 because of sign issues.

And also something I don't have to worry about.

 So there's no compelling technical reason to switch.

 I never said there was.  I simply asked a question, and got my hand bitten.

No, you got a question asked back at you: *WHY* would you think I
should be using a bitstring? It was an honest question, basically
wondering if there's something about bitstrings that makes them such a
great idea. It's not a data type Oracle users are greatly familiar
with, you see.

 And without a
 technically-compelling reason, the rest of the post I was referring to
 simply boiled down, as far as I could tell, to a matter of personal
 preference. No less valid for that, of course. But ultimately, not
 something that would hold much sway with me.

 Sure, fine, whatever you want.  I wasn't trying to convince you either
 way.  I do think using the right type for the job makes more sense,
 but again, it's personal preference.

Indeed. Although in this case, it's not even the right type for the job.

 But simply saying "your design is broken"... wo! might well scare
 the children, but doesn't really do anything for me, because I know
 for a certainty that it's not broken at all.

 I asked if there was a reason you were avoiding bit strings.  Hardly a
 "your design is broken" point.

 I'm 

Re: [GENERAL] Bitmask trickiness

2010-07-23 Thread Howard Rogers
On Fri, Jul 23, 2010 at 7:57 PM, Stephen Cook scli...@gmail.com wrote:
 On 7/23/2010 5:33 AM, Howard Rogers wrote:

 ...so select * from table where 21205 | 4097 = 21205 would correctly
 grab that record. So I'm assuming you mean the 'stored value' should
 be on both sides of the equals test. If so, that would indeed seem to
 be the ultimate answer to the question (though I wouldn't myself call
 it a 'plain old equals' :-) )

 Hope I was clearer this time. Originally I just fired off a quickie email
 to
 get you past your coder's block.

 I do indeed think the magic of BIT OR is the missing ingredient I
 was looking for, and I very much appreciate your help leading me to
 it. My apologies for being too dense to spot what you were talking
 about before.

 I think I misunderstood you the whole time actually, or maybe was injecting
 some of my other thoughts into your problem. I figured you meant you wanted
 to find records where your probe value has exactly the same bit pattern as
 your stored value (probe bits, and only probe bits, set; hence the plain
 old equals). Rather (and I just confirmed this looking at the OP) you want
 any records where the stored value has all of the probe value's bits set,
 regardless of the other bits in the stored value.

 So yeah, check if ORing the stored and probe values equals the stored value.

 Oh well, even if I misread, glad to help you stumble upon what you wanted
 eventually.

 -- Stephen



No worries. We got there in the end!

Thanks again,
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bitmask trickiness

2010-07-23 Thread Howard Rogers
 Hate to interrupt your flame war, and I apologize for not being precise in
 my meaning first try... You don't need any bitwise anything to compare two
 bitmasks-hiding-in-integers, just check for equality.

 Instead of select * from coloursample where colour & 10 = 10; just try
 select * from coloursample where colour = 10;.

Which works for that example. But please consider the more realistic
one I also posted. The stored value is 21205 (I think... it was all a
long time ago). I probe that with 4098. I do NOT want that returned,
because the '2' bit is not set in the 21205 value.

So select * from table where 21205=4098 won't cut it, will it?

 If you want to probe for two values, that MUST be in there, and WITHOUT
 anything else, bitwise OR them together as the probe value and use plain old
 equals there too. You only need the bitwise AND stuff for checking for a
 value that MUST be in there, regardless of whether or not other values are
 in there as well.

A Bitwise OR? Ah ha: I think that might be the key:

ims=# select 21205 | 4098;
 ?column?
--
21207
(1 row)

So again, what exactly am I supposed to test for here? I mean, select
* from table where 21205 | 4098 = 21205 would do it, I suppose,
because that would correctly reject the row. But so would select *
from table where 21205 | 4098 = 4098. However, if I stumble on a bit
further, I do manage this:

ims=# select 21205 | 4097;
 ?column?
--
21205
(1 row)

...so select * from table where 21205 | 4097 = 21205 would correctly
grab that record. So I'm assuming you mean the 'stored value' should
be on both sides of the equals test. If so, that would indeed seem to
be the ultimate answer to the question (though I wouldn't myself call
it a 'plain old equals' :-) )
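
(A compact sketch of that final form, using the thread's numbers - the
table and column names here are placeholders, not from the original
schema:

select * from docs where stored_bits | 4097 = stored_bits;

A row holding 21205 passes, because ORing in 4097 sets nothing new;
probing the same row with 4098 would fail, because its '2' bit is
missing from 21205.)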

 Hope I was clearer this time. Originally I just fired off a quickie email to
 get you past your coder's block.

I do indeed think the magic of BIT OR is the missing ingredient I
was looking for, and I very much appreciate your help leading me to
it. My apologies for being too dense to spot what you were talking
about before.

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bitmask trickiness

2010-07-23 Thread Howard Rogers
On Fri, Jul 23, 2010 at 6:17 PM, Alban Hertroys
dal...@solfertje.student.utwente.nl wrote:
 I thought to do

 select * from coloursample where colour & 10 = 10;

 ...but that's not right, because it finds the third record is a match.


 What's not entirely clear to me is whether you only want to find colours that 
 have BOTH Yellow and Orange set and nothing else, or colours that have EITHER 
 Yellow or Orange set and nothing else.

 The first case has been answered by Stephen (use a straight 'equals'). The 
 other case is a bit more complicated.

 That 11 matches using & 10 is because you filtered out all the other bits 
 in your comparison by anding them with '0', while they /are/ relevant: they 
 aren't allowed to be '1' after all. You probably need to look at the inverted 
 versions of these numbers to get what you need.

 My bit-foo is a bit rusty,

Hehe. Mine too, it would seem!

but this looks like what you need (I used bit-strings for my own convenience):

 development=> select (~ '01010'::bit(5)) | '01010'::bit(5) = '11111'::bit(5);
  ?column?
 --
  t
 (1 row)

 development=> select (~ '01011'::bit(5)) | '01010'::bit(5) = '11111'::bit(5);
  ?column?
 --
  f
 (1 row)

 development=> select (~ '01110'::bit(5)) | '01010'::bit(5) = '11111'::bit(5);
  ?column?
 --
  f
 (1 row)

 development=> select (~ '11010'::bit(5)) | '01010'::bit(5) = '11111'::bit(5);
  ?column?
 --
  f
 (1 row)

 development=> select (~ '00010'::bit(5)) | '01010'::bit(5) = '11111'::bit(5);
  ?column?
 --
  t
 (1 row)

 development=> select (~ '01000'::bit(5)) | '01010'::bit(5) = '11111'::bit(5);
  ?column?
 --
  t
 (1 row)


 Alban Hertroys


Thanks Alban. Steve, too, has joined in again above: the bit-wise OR,
together with an equality test on the stored value, would seem to be
what's called for.
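
(Pulling the thread's two tests together on the original coloursample
table - a sketch, not a quote from the thread:

-- rows whose colour includes ALL the probe bits (Yellow + Orange = 10),
-- whatever else is set:
select * from coloursample where colour | 10 = colour;

-- rows whose colour has NO bits outside the probe - Yellow, Orange or
-- both, and nothing else:
select * from coloursample where colour & ~10 = 0;

The plain equality test, colour = 10, is exactly the overlap of the two.)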

Appreciate the contribution.

I think we can all go home now!!  :-)

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bitmask trickiness

2010-07-22 Thread Howard Rogers
On Thu, Jul 22, 2010 at 1:02 PM, Scott Marlowe scott.marl...@gmail.com wrote:



 Why on Earth would I want to store this sort of stuff in a bit string?!

 Because you are manipulating bits and not integers?  I guess there are
 10 kinds of people, those who like to think in binary and those who
 don't.

 I don't know about you, but I find looking at 21205 a darned sight
 easier than staring blankly at 101001011010101!!

 If the fifth bit means one thing, and the 7th bit means something
 else, quick which of the following have the fifth bit set and the 7th
 bit off:

 01001101 (base2)
 or
 77 (base 10)


 And, fundamentally,
 they mean precisely the same thing.

 Of course.  But that wasn't my point, and by my example above, one is
 much easier to figure out than the other if you're interested in bit
 twiddling.

 And the '&' function works as
 nicely with boring old decimals as it does with long-winded binaries,
 so I really don't see the point of making it more complicated than it
 needs to be -but I'm open to be enlightened on the matter!

 I fail to see how storing a binary as a binary and showing it as a
 binary makes things more complicated.  But I'm open to someone showing
 me how that's true.  At least hex or octal have direct and simple
 conversions where each hex or octal digit represents 4 or 3 bits
 respectively.  Decimal does not.

 Thanks for the second link though. I hadn't realised that PostgreSQL
 was so richly-endowed with bitwise functions. Specifically, it's got
 the bitwise XOR I was thinking it would be nice to have in these sorts
 of situations:

 ims=# select * from coloursample where colour # 10 = 0;
  recid | colour |     descript
 -------+--------+-------------------
      2 |     10 | Yellow and Orange
 (1 row)

 Not quite sure how to apply that to my more realistic example just
 yet, but I think this will be very helpful, so thank you!

 Note you can cast integer to bitstring, but there may be some odd
 behaviour for sign bits and such.  Which is again why I'd use the
 right type for the job, bit string.  But it's your project.


Quoting...

 Because you are manipulating bits and not integers?  I guess there are
 10 kinds of people, those who like to think in binary and those who
 don't.

Er, no. 21205 is not an integer. It's an encoded bit of magic.


 quick which of the following have the fifth bit set and the 7th
bit off:

01001101 (base2)
or
77 (base 10)

I'll give you that one (except that the fifth bit isn't set and the
7th bit is on!!). Now repeat for 10 million records and see how
you get on.

Really, I don't have to visually inspect a record to work this stuff
out! So what is easier on your eye for one record is completely
irrelevant as far as my code is concerned!

one is
much easier to figure out than the other if you're interested in bit
twiddling.

As I say, take an incredibly simple example and everything looks, er,
simple. Point of fact, I happen to know that the '1' bit is set in
21205 simply by looking at the last digit and spotting that it's not
even. Fat lot of good that does me when fetching 350,000 records that
happen to match 'insurance claim'.

I fail to see how storing a binary as a binary and showing it as a
binary makes things more complicated

Because it's NOT binary. It's an encoding. Whether that encoding is
displayed in binary, fluent hebrew or klingon or imaginary numbers is
really irrelevant to me. It happens that I can look at a decimal
number and work out most things (if the number is 29438, it's a fair
chance the 16384 bit is set, for example; if it's 4098, I know it's
4096 + 2. And so on.). If you're happier working with fancifully long
strings of 1s and 0s, good on you: but it doesn't alter the fact that
I'm working with encoded meanings, not binary digits and I prefer a
nice, compact display of that encoding which doesn't involve
hieroglyphics.
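
(Both views are available on demand, for what it's worth - a small
sketch of the integer/bit-string casts Scott alluded to, with the
documented caveat about sign behaviour near the top bit of an integer:

select 21205::bit(15);           -- 101001011010101
select b'101001011010101'::int;  -- 21205
)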

But it's your project.

Indeed.

Doesn't mean I can do it all on my own, of course. But if it's simply
a question of personal preference, I'll take mine over yours, for
projects I work on, if that's OK. No hard feelings! :-)

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bitmask trickiness

2010-07-22 Thread Howard Rogers
On Fri, Jul 23, 2010 at 12:35 AM, Peter Hunsberger
peter.hunsber...@gmail.com wrote:
 On Wed, Jul 21, 2010 at 11:13 PM, Howard Rogers h...@diznix.com wrote:

 On Thu, Jul 22, 2010 at 1:02 PM, Scott Marlowe scott.marl...@gmail.com 
 wrote:


 
  Why on Earth would I want to store this sort of stuff in a bit string?!
 
  Because you are manipulating bits and not integers?  I guess there are
  10 kinds of people, those who like to think in binary and those who
  don't.
 
  I don't know about you, but I find looking at 21205 a darned sight
  easier than staring blankly at 101001011010101!!

 snip lots of stuff/snip

 
  Note you can cast integer to bitstring, but there may be some odd
  behaviour for sign bits and such.  Which is again why I'd use the
  right type for the job, bit string.  But it's your project.
 

 Quoting...

  Because you are manipulating bits and not integers?  I guess there are
  10 kinds of people, those who like to think in binary and those who
  don't.

 Er, no. 21205 is not an integer. It's an encoded bit of magic.


 In that case your database design is fundamentally broken.  A database
 should have content fields that map to the needs of the application.
 As you describe your application requirements, that is a bit string
 and not an integer.  Use bit strings and your application logic is
 transparent, obvious and easy to maintain.  Use integers and you have
 to resort to magic.  As you say, it's your choice, but you came here
 looking for advice and the advice you were given is very good

 --
 Peter Hunsberger

Hi Peter:

It wasn't, as the original poster pointed out, 'advice' that was given
so much as personal preference. Had someone said, 'ah, but you see
storing your 15 meanings in decimal uses up 5 bytes, whereas a
bitstring only requires 15 bits, and over 10,000,000 records, the
saving of 3 bytes per record adds up...', then that would be technical
advice I could listen to, assess and make a call on.
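
(That sort of claim is easy to measure, for what it's worth - a quick
sketch, with the caveat that exact sizes vary with version and
alignment:

select pg_column_size(21205::int4) as integer_bytes,
       pg_column_size(b'101001011010101'::bit(15)) as bitstring_bytes;

The integer side is 4 bytes; the bit string carries a varlena header
plus a bit-length word before its 15 bits of payload, which is the
'extra overhead' mentioned elsewhere in these threads.)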

But simply saying your design is broken... wo! might well scare
the children, but doesn't really do anything for me, because I know
for a certainty that it's not broken at all.

It comes down to this: I can do Boyce-Codd normal form in my sleep
(...and falling asleep happens quite frequently when doing it, strangely
enough), and have been doing so since 1987. I'm certainly not perfect,
but I reckon I can tell from a mile away when one of my designs is
broken, as you put it -and this one isn't. I haven't even begun to
describe a scintilla of a percentage point of the design decisions
this thing has to deal with, nor the fact that it's been running quite
happily in this manner for a good couple of years... so you'll just
have to take it from me that there's a room-full of users who can look
at code '4097' and know precisely what it means and would be mortified
if I suddenly started displaying exactly the same meanings in what, to
them, would look like utter gibberish.

Unless you, or someone else, can come up with some hard, *technical*
facts as to why working with bitstring encodings of meaning is so much
better than working in decimal, we're sticking with the decimal
representation. I'll buy "you're forever doing implicit casts which
are poor performers" or "implicit casts might break in a future
release" or "it's costing you three bytes per record" ...or anything
else in that vein. But matters of transparency and ease of maintenance
are entirely subjective things (about which I sought no advice at all,
incidentally), and what works for you on those scores doesn't work for
me.

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bitmask trickiness

2010-07-22 Thread Howard Rogers
On Fri, Jul 23, 2010 at 8:37 AM, Scott Marlowe scott.marl...@gmail.com wrote:
 On Thu, Jul 22, 2010 at 3:09 PM, Howard Rogers h...@diznix.com wrote:
 On Fri, Jul 23, 2010 at 12:35 AM, Peter Hunsberger
 peter.hunsber...@gmail.com wrote:
 On Wed, Jul 21, 2010 at 11:13 PM, Howard Rogers h...@diznix.com wrote:

 On Thu, Jul 22, 2010 at 1:02 PM, Scott Marlowe scott.marl...@gmail.com 
 wrote:


 
  Why on Earth would I want to store this sort of stuff in a bit string?!
 
  Because you are manipulating bits and not integers?  I guess there are
  10 kinds of people, those who like to think in binary and those who
  don't.
 
  I don't know about you, but I find looking at 21205 a darned sight
  easier than staring blankly at 101001011010101!!

 snip lots of stuff/snip

 
  Note you can cast integer to bitstring, but there may be some odd
  behaviour for sign bits and such.  Which is again why I'd use the
  right type for the job, bit string.  But it's your project.
 

 Quoting...

  Because you are manipulating bits and not integers?  I guess there are
  10 kinds of people, those who like to think in binary and those who
  don't.

 Er, no. 21205 is not an integer. It's an encoded bit of magic.


 In that case your database design is fundamentally broken.  A database
 should have content fields that map to the needs of the application.
 As you describe your application requirements, that is a bit string
 and not an integer.  Use bit strings and your application logic is
 transparent, obvious and easy to maintain.  Use integers and you have
 to resort to magic.  As you say, it's your choice, but you came here
 looking for advice and the advice you were given is very good

 --
 Peter Hunsberger

 Hi Peter:

 It wasn't, as the original poster pointed out, 'advice' that was given
 so much as personal preference. Had someone said, 'ah, but you see
 storing your 15 meanings in decimal uses up 5 bytes, whereas a
 bitstring only requires 15 bits, and over 10,000,000 records, the
 saving of 3 bytes per record adds up...', then that would be technical
 advice I could listen to, assess and make a call on.

 You do realize the first page I linked to told you that, right?  It's
 not a particularly big page.  I had made the erroneous assumption
 you'd read the link I posted.

If you mean, did I read the bit in the doco where it said nothing at
all in the 'these are great advantages' style I've just described, but
instead makes the fairly obvious point that a bit string takes 8 bits
to store a group of 8 bits (well, stone me!!) PLUS has extra overhead,
then yes, I did read that part of your first link... and nevertheless
concluded that, overall, there is... er, some extra overhead in
storing bitstrings.

So what precisely about that first article, which I did indeed read,
would you have expected to lead me to the conclusion that I'd SAVE
significant amounts of space or find some other technically-compelling
reason for switching?

My point is that there's nothing much in it, storage-wise, either way.
So there's no compelling technical reason to switch. And without a
technically-compelling reason, the rest of the post I was referring to
simply boiled down, as far as I could tell, to a matter of personal
preference. No less valid for that, of course. But ultimately, not
something that would hold much sway with me.

 But simply saying "your design is broken"... wo! might well scare
 the children, but doesn't really do anything for me, because I know
 for a certainty that it's not broken at all.

 I asked if there was a reason you were avoiding bit strings.  Hardly a
 "your design is broken" point.

I'm getting a bit fed up of this thread now. It wasn't YOU that ever
said 'the design is broken', and I never suggested it was. That was
Peter Hunsberger, about three posts up in the thread, who wrote "In
that case your database design is fundamentally broken".

If you're going to take umbrage at something, please take umbrage at
things that were actually directed at you in the first place!

 You've now said why you are not using
 the type that was designed to handle bit strings for bit strings.

 I personally would store them as bit strings and change representation
 for users.

I'm a user, too. I get to see this stuff every time I do a select
statement. At the command line. Which I use a lot.

 There are some issues that come up if your bit strings are
 long enough to get close to the last bit in an integer (also mentioned
 on the links I posted that didn't get read).

Don't make false assumptions about other people, please. You don't
know what I read or didn't read. Just because you didn't make a
compelling technical argument in favour of bitstrings doesn't mean I
didn't read the article you linked to ...that also didn't make a
compelling technical argument in favour of bitstrings.

 But other than that it
 should work fine.

Yes, I know. I've only been using this technique for five years on
Oracle! I would be very surprised indeed

[GENERAL] Bitmask trickiness

2010-07-21 Thread Howard Rogers
Suppose 1=Red, 2=Yellow, 4=Green and 8=Orange.

Now suppose the following data structures and rows exist:

create table coloursample (recid integer, colour integer, descript varchar);
insert into coloursample values (1,2,'Yellow only');
insert into coloursample values (2,10,'Yellow and Orange');
insert into coloursample values (3,11,'Red, Yellow and Orange');
insert into coloursample values (4,12,'Green and Orange');
insert into coloursample values (5,13,'Red, Green and Orange');

Selecting things which have some yellow in them somewhere is easy:

select * from coloursample where colour & 2 > 0;

It's also easy to find records which have either some yellow or some
orange (or both) in them:

select * from coloursample where colour & 10 > 0;

But how do I find records which are ONLY yellow and orange, and
exclude records which have some other colour mixed in, in one simple
query without a lot of 'not this, not that' additions, and without
using multiple separate AND tests to nail it down?

I thought to do

select * from coloursample where colour & 10 = 10;

...but that's not right, because it finds the third record is a match.

(The question comes about because I'm trying to test for up to 15
attributes per record. One record has a bitmask value of 21205, say.
That should mean the '1' bit is set (because 21205 =
1+4+16+64+128+512+4096+16384), but when I do the above queries for
21205 & 4098 (which is 4096 + 2, and therefore should not be finding
records with the '1' bit set), the record is being returned because
the 4096 bit is being detected (correctly) as 'on'. I want the query
to only return records where both bits are true, but I don't want to
have to test the records 15 times to find out!).

I suspect the answer is really simple... but I'm having writer's block
today! All help appreciated.

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bitmask trickiness

2010-07-21 Thread Howard Rogers
On Wed, Jul 21, 2010 at 9:17 PM, Mathieu De Zutter math...@dezutter.org wrote:
 On Wed, Jul 21, 2010 at 5:59 AM, Howard Rogers h...@diznix.com wrote:
 It's also easy to find records which have either some yellow or some
 orange (or both) in them:

 select * from coloursample where colour & 10 > 0;

 But how do I find records which are ONLY yellow and orange, and
 exclude records which have some other colour mixed in, in one simple
 query without a lot of 'not this, not that' additions, and without
 using multiple separate AND tests to nail it down?

 What about:
 WHERE colour & ~10 = 0

 Kind regards,
 Mathieu


Hi Mathieu:

Yes, that works for the simple case I gave by way of illustration (as
does the bitwise XOR -#- function), so thanks. But neither works very
obviously for the more realistic example I gave. If the stored value
is 21205 and I'm 'probing' it with 4098, the record should not be
returned, because 21205 implies a '1' bit is set, not the '2' bit. But
the # and ~ functions cause a non-negative result to be returned, so I
am none the wiser as to how to spot when or when not to return the
record:

In other words, this one is 'wrong' and should not be in the resultset:

ims=# select 21205 & ~ 4098;
 ?column?
--
   17109
(1 row)

But this one is 'right' and should be returned:

ims=# select 21205 & ~ 4097;
 ?column?
--
   17108
(1 row)

But looking at the outcome of both queries, there's nothing that
particularly leaps out at me that screams '17108' is right and '17109'
is wrong.

If I was 'probing' with a simple value (1,2,4,8 etc), then I can just
test for a non-zero return: if it's non-zero, the probe value is
implied by the stored value and the record should be returned. So,
right:

ims=# select 21205 & 1;
 ?column?
--
       1
(1 row)

And wrong:

ims=# select 21205 & 2;
 ?column?
--
       0
(1 row)

The minute you start probing with a complex value, however (that is,
a probe value which is made up of multiple basic values, for example
4098, which is 4096 + 2) interpreting the output of the bitwise
operations becomes more than my head can cope with!

I fear I'm going to have to decompose the probe value supplied by a
user and perform multiple simple probes that match their meaning...
but with 15 possible attributes to deal with, that could get a lot
slower than I was hoping.
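
(For the record, the decomposition can be collapsed into a single mask
test, equivalent to the stored | probe = stored form the thread
eventually settles on - a sketch using the numbers above:

select 21205 & 4098 = 4098;  -- false: the '2' bit is missing, so reject
select 21205 & 4097 = 4097;  -- true: both probe bits present, so return
)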

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Full Text Search ideas

2010-07-18 Thread Howard Rogers
I asked recently about a performance problem I'd been having with some
full text queries, and got really useful help that pointed me to the
root issues. Currently, I'm trying to see if our document search
(running on Oracle Text) can be migrated to PostgreSQL, and the reason
I asked that earlier question points to a fundamental design issue
we'll have with PostgreSQL that doesn't affect us in Oracle (not, I
hasten to add, that that means Oracle is better/right-er/whatever.
It's just different -but the difference will cause us a problem).

Consider the following example (which is just one of 40-odd I could
have picked).

Some of our documents are in panoramic format, for example. But not
many (say, 30,000 out of 10,000,000). We have a flag for 'panoramic',
called 'sb12'. It's either 'y' or 'n' for any document. So a search
for 'sb12n' (find me all documents which are not panoramic) is
logically the same as a search for 'not sb12y'. However, 95% or more
of documents will be an sb12n, because hardly any documents are
panoramic in the first place. So, although the numeric outcome of
'sb12n' and 'not sb12y' will always be the same, you would have to
check the entire table to find which ones are 'sb12n' (because most
documents are marked that way), whereas you'd only have to check the
5% of records to find 'sb12y', because so few are marked that way.

But in Oracle Text, this doesn't seem to happen:

SQL> select count(*) from search_digital_rm where
contains(textsearch,'bat and sb12n') > 0;

 COUNT(*)
--
     3040

Elapsed: 00:00:00.10

SQL> select count(*) from search_digital_rm where
contains(textsearch,'bat not sb12y') > 0;

 COUNT(*)
--
     3040

Elapsed: 00:00:00.06

In both cases, the same number of records are returned. But, within a
margin of error, the time taken to do each test is about the same.
Even though the first test must be matching 'sb12n' for many millions
of records, it's taking not much longer than the search for 'sb12y',
which can only match about 90,000. It would seem (I can't tell from
the explain plan itself) as though what's happened is that the set of
'bat' records has been fetched first (about 8000 in all). For so few
records, whether you're looking for sb12y or sb12n then becomes mostly
irrelevant for timing purposes, and hence the duration equivalence of
both queries.

This is not what happens in PostgreSQL, however (as I now know, thanks
to the help on my question from a couple of days ago):

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
count
---
 3849
(1 row)

Time: 408.962 ms

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','bat
 & !sb12y');
count
---
 3849
(1 row)

Time: 11.533 ms

Now, one test takes about 40 times longer than the other, though the
one taking just 11ms is as fast as Oracle can manage (impressive,
considering I've done absolutely nothing to tune this PostgreSQL
testbed as yet!). Logically equivalent the two tests may be, but
hunting through lots of sb12n records and working out which are
related to bats is apparently a lot slower than finding things the
other way around, it would seem.

I'm wondering firstly if there's any way I can configure PostgreSQL
FTS so that it produces the sort of results we've gotten used to from
Oracle, i.e., where search speeds do not go up wildly when a 'search
term' is applied that happens to be used by the vast majority of
document records. (For example, we currently allow searches for file
types, where 80% of documents would be Word documents, another 19%
would be PDFs and the remaining 1% of documents could be pretty much
anything else! We can't have people searching for "definitely want
only Word documents" if that means matching 8 million records and
search speeds shooting to the stratosphere as a result).

Secondly, I'm open to any suggestions as to how you would organise
things or re-write the SQL so that the attribute filter is only
applied to the small subset of records which match the relevant
real-word search term, if that's what's needed here. In other words, is
my best bet in the earlier examples to fetch *all* 'bat' records, and
then nest that query within an outer query that adds a test for a
separate attribute column? Or is something else called for here?
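
(A sketch of that nested form - the sb12 attribute column is
hypothetical here, since the flags currently live inside the searchable
text itself:

select count(*) from (
  select * from search_rm
  where to_tsvector('english', textsearch) @@ to_tsquery('english', 'bat')
) as core
where core.sb12 = 'n';
)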

Would appreciate any thoughts on the subject!

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full Text Search ideas

2010-07-18 Thread Howard Rogers
On Mon, Jul 19, 2010 at 6:16 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Howard Rogers h...@diznix.com writes:
 ims=# select count(*) from search_rm
 where to_tsvector('english', textsearch) @@ to_tsquery('english','bat &
 sb12n');
 count
 ---
  3849
 (1 row)

 Time: 408.962 ms

 ims=# select count(*) from search_rm
 where to_tsvector('english', textsearch) @@ to_tsquery('english','bat &
 !sb12y');
 count
 ---
  3849
 (1 row)

 Time: 11.533 ms

 Yeah, I imagine that the first of these will involve examining all the
 index entries for sb12n.  There's not a lot of smarts about that inside
 the GIN index machinery, AFAIK: it'll just fetch all the relevant TIDs
 for both terms and then AND them.

 I'm wondering firstly if there's any way I can configure PostgreSQL
 FTS so that it produces the sort of results we've gotten used to from
 Oracle, i.e., where search speeds do not go up wildly when a 'search
 term' is applied that happens to be used by the vast majority of
 document records.

 If you're willing to split out the search terms that are like this,
 you could probably get better results with something like

 select count(*) from search_rm
 where to_tsvector('english', textsearch) @@ to_tsquery('english','bat') AND
      to_tsvector('english', textsearch) @@ to_tsquery('english','sb12n');

 That will put it on the optimizer's head as to whether to use the index
 for one term or both terms.

 It might be worth noting that the optimizer will of course not get this
 right unless it has decent statistics about both search terms --- and
 there is an as-yet-unreleased patch about tsvector stats gathering:
 http://archives.postgresql.org/pgsql-committers/2010-05/msg00360.php
 I am not sure that the situation addressed by that patch applies in
 your case, but it might.

                        regards, tom lane


Thanks, Tom.

The breaking out into separate search terms does make a difference,
but not much:

ims=# select count(*) from search_rm
where to_tsvector('english',textsearch) @@ to_tsquery('english','bat & sb12n');
 count
---
  3849
(1 row)

Time: 413.329 ms

ims=# select count(*) from search_rm
ims-# where to_tsvector('english',textsearch) @@ to_tsquery('english','bat') AND
ims-# to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');
 count
---
  3849
(1 row)

Time: 352.583 ms

So it's shaving about a sixth of the time off, which isn't bad, but
not spectacularly good either!

I'd also thought of trying something like this:

ims=# select count(*) from
(
  select * from search_rm where
  to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
) as core
where to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');

 count
---
  3849
(1 row)

Time: 357.248 ms

...in the hope that the sb12n test would only be applied to the set of
'bat' records acquired by the inner query. But as you can tell from
the time, that's not particularly better or worse than your suggestion
(bearing in mind that 'bat' on its own is a 12ms search).

I'm currently constructing a separate column containing a single
bitmask value for about 15 of the 45 attributes, just to see if
evaluating the bits with a bitand test for the bat records is faster
than trying to FTS them in the first place. Something like

select count (*) from
   (
  select * from search_rm where
  to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
   ) as core
where bitand(searchbits,4096) > 0;

But it's taking a while to get that extra column constructed in the
original table!
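
(A side note: bitand() is not built in to PostgreSQL - a minimal sketch
of such a helper, assuming it simply mirrors Oracle's BITAND, which is
presumably how the bitand() calls in the plans elsewhere in these
threads were made to work:

create or replace function bitand(bigint, bigint) returns bigint
language sql immutable as 'select $1 & $2;';
)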

Fingers crossed, because if not, it's all a bit of a show-stopper for
our migration effort, I think. :-(

Regards  thanks
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full Text Search dictionary issues

2010-07-16 Thread Howard Rogers
On Fri, Jul 16, 2010 at 10:23 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Howard Rogers h...@diznix.com writes:
 I have 10 million rows in a table, with full text index created on one
 of the columns. I submit this query:

 ims=# select count(*) from search_rm
 ims-# where to_tsvector('english', textsearch)
 ims-# @@ to_tsquery('english', 'woman & beach & ball');
  count
 ---
    646
 (1 row)
 Time: 107.570 ms

 ...and those are excellent times. But if I alter the query to read:

 ims=# select count(*) from search_rm
 where to_tsvector('english', textsearch)
 @@ to_tsquery('english', 'woman & beach & ftx1');
  count
 ---
  38343
 (1 row)
 Time: 640.985 ms

 ...then, as you see, it slows the query down by a factor of about 6,

 ... um, but it increased the number of matching rows by a factor of
 almost 60.  I think your complaint of poor scaling is misplaced.

 which is not so good! The problem is that we need to be able to search
 for ftx1, since that's a flag we put in our document records to tell
 us the file type, and we need to be able to retrieve different file
 types at different times.

 You might want to rethink how you're doing that --- it seems like a file
 type flag ought to be a separate column rather than a word in a text
 field.

                        regards, tom lane

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general


OK, Tom: I did actually account for the number of rows difference
before I posted, though I accept I didn't show you that. So here goes:

ims=# select count(*)
ims-# from search_rm
ims-# where to_tsvector('english', textsearch) @@
to_tsquery('english', 'woan & batt & ftxa')
ims-# limit 20;
 count
---
0
(1 row)

Time: 0.593 ms
ims=# select count(*)
from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english',
'woan & batt & ftx1')
limit 20;
 count
---
0
(1 row)

Time: 489.362 ms

Both queries return zero rows. One takes an awful lot longer than the
other. The only difference between them is that one searches for
'ftx1' and the other searches for 'ftx0'. My complaint of poor
scalability (actually, it was an enquiry about the role of dictionary
types!) is valid, I think. As a PostgreSQL newbie, I'm happy to accept
that I've done something plonkingly stupid to account for these
results, but I'd then like to know what it is I've done wrong! A
simple scale-up of the number of hits isn't, however, the problem, I
don't think.

With this amount of data, and with 45 different document attributes
that may or may not be searched for, some of them involving names and
places and dates, some just yes/no flags, it is utterly impossible to
have them as separate attribute columns and search on them with
anything like decent performance. We adopted this approach with Oracle
Text two years ago precisely because it was the only way to keep
web-based searches of 10,000,000 records coming back in less than a
second. So, no, we're not going to re-think the storage of 'attribute
data' as part of the searchable keyword field, though I'm more than
prepared to alter the precise format of that data if it helps
PostgreSQL any.

That said, however, we have people supplying us with document
references in the form DA3-76374YY-001, so alpha-numerics simply have
to be searchable with good speed, and I can't always magic-away the
alpha-numeric components, even if I wanted to.

So, I would still like to know if this performance difference when
encountering alpha-numeric words is dictionary-related, and if so
what I can do to fix that, please.

Cheers,
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full Text Search dictionary issues

2010-07-16 Thread Howard Rogers
On Sat, Jul 17, 2010 at 3:14 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Howard Rogers h...@diznix.com writes:
 OK, Tom: I did actually account for the number of rows difference
 before I posted, though I accept I didn't show you that. So here goes:
 ...
 Both queries return zero rows. One takes an awful lot longer than the
 other. The only difference between them is that one searches for
 'ftx1' and the other searches for 'ftx0'.

 Well, this still doesn't tell us anything about what I think the
 critical point is, namely how many actual matches there are for
 ftx1 versus ftx0.  Could we see counts for *just* those words without
 the other conditions?

 So, I would still like to know if this performance difference when
 encountering alpha-numeric words is dictionary-related,

 AFAIK there is no significant difference between treatment of pure alpha
 and mixed alphanumeric words, at least not once you get past
 to_tsquery.  I'm still expecting this is just a matter of how many index
 entries match.  It's barely possible that you've got a dictionary
 configuration that makes the to_tsquery() function itself a lot slower
 in the alphanumeric case, but that should affect ftx1 and ftx0 equally.

                        regards, tom lane

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



My apologies for making this harder than it should have been. My
queries were correct, my description of them wasn't. The only
difference was 'ftx1' and 'ftxa', not 'ftx0'.

Anyway, I think I've finally got the point being made by you and
Richard (and sorry for being so slow about it!): ftx1 is a very, very
common flag. I'd guess 99% of records would be tagged with it. And
ftxa doesn't exist at all, of course. So, you're right:

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','ball
 & beach & salsm1');
 count
---
  1753
(1 row)

Time: 557.010 ms
ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','ball
 & beach & salsm4');
 count
---
97
(1 row)

Time: 39.518 ms


Salsm1 is another very common flag, applied to about 80% of documents.
Salsm4 applies to about 160 documents in total. So it's really not the
alpha-numerics causing the issue after all. It's genuinely the number
of records matching each term, as you originally said. (And a note to
Steve: it makes no difference separating out the 'flag factor', I'm
afraid).

It's given me lots to think about. The hard part ought to be finding
the women, or the beaches; the flags should be mere filters applied
after those have been fetched. Clearly, that's not how this is
behaving, though I've been fooled into thinking it ought to be because
of the apparent default optimization done in Oracle Text. Some new
technique is called for, I guess!  (If you've got any ideas, I'm all
ears...)

Anyway: at least you've all helped me realise that it's not a
dictionary problem (or a word-type) problem, so thank you all very
much for that; much appreciated.

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Full Text Search dictionary issues

2010-07-15 Thread Howard Rogers
I have 10 million rows in a table, with full text index created on one
of the columns. I submit this query:

ims=# select count(*) from search_rm
ims-# where to_tsvector('english', textsearch)
ims-# @@ to_tsquery('english', 'woman & beach & ball');
 count
---
   646
(1 row)
Time: 107.570 ms

...and those are excellent times. But if I alter the query to read:

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftx1');
 count
---
 38343
(1 row)
Time: 640.985 ms

...then, as you see, it slows the query down by a factor of about 6,
which is not so good! The problem is that we need to be able to search
for ftx1, since that's a flag we put in our document records to tell
us the file type, and we need to be able to retrieve different file
types at different times.

Now, I *think* the problem is that 'ftx1' is not being treated as
though it were a proper word:

ims=# select * from ts_debug('english','woman ball ftx1');
   alias   |       description        | token |  dictionaries  |  dictionary  | lexemes
-----------+--------------------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII          | woman | {english_stem} | english_stem | {woman}
 blank     | Space symbols            |       | {}             |              |
 asciiword | Word, all ASCII          | ball  | {english_stem} | english_stem | {ball}
 blank     | Space symbols            |       | {}             |              |
 numword   | Word, letters and digits | ftx1  | {simple}       | simple       | {ftx1}
(5 rows)

Instead of being an asciiword that uses the english_stem dictionary,
'ftx1' gets regarded as a numword in the simple dictionary.

If I simply use ftxa instead of ftx1, it *is* then regarded as an
asciiword, and performance of the original query reverts to being just
fine, too:

ims=# select * from ts_debug('english','woman ball ftxa');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
---+-+---++--+-
 asciiword | Word, all ASCII | woman | {english_stem} | english_stem | {woman}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ball  | {english_stem} | english_stem | {ball}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ftxa  | {english_stem} | english_stem | {ftxa}

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftxa');
 count
---
 0
(1 row)

Time: 88.603 ms

As you can see, 88ms for a search with 'ftxa' compared to 600+ms for
one with 'ftx1'.

I should mention that we have about 45 different 'nonsense word' flags
we use for all sorts of different purposes, such as telling us which
region a document is visible in, whether it is in portrait or
landscape mode and so on. All of these flag-words take the form of a
nonsense stem followed by '1', or '2' and so on. So there are a lot of
these things causing the problem, not just 'ftx1' specifically.

My question is, then, what I can do to stop this slowdown? Is there
some way to add 'ftx1' (and the others) as a word in the english_stem
dictionary so that it gets regarded as an asciiword, for example? Or
is there something else I can do to address the problem?
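
(One possibility along those lines - a hedged sketch, untested against
this setup, which routes numword tokens through the same stemmer as
plain words by copying the built-in configuration:

create text search configuration ims_english ( copy = english );
alter text search configuration ims_english
  alter mapping for numword with english_stem;

Whether that changes the timings is another matter; as emerges
elsewhere in this thread, the real cost here is match volume, not the
dictionary.)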

I'm fairly new to PostgreSQL's full text search. I've read Chapter 12
of the doco with rapt attention, but I don't see anything that leaps
at me as a fix for this issue. All help gratefully received,
therefore, and apologies in advance if this is a bit of a newbie
question.

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Commit every N rows in PL/pgsql

2010-07-06 Thread Howard Rogers

On 06/03/2010 08:26 AM, Chris Browne wrote:

len.wal...@gmail.com (Len Walter) writes:

I need to populate a new column in a Postgres 8.3 table. The SQL would be something 
like update t set col_c = col_a +
col_b. Unfortunately, this table has 110 million rows, so running that query 
runs out of memory.


Unnecessary.  On Oracle, the typical scenario is ORA-1562 FAILED TO
EXTEND ROLLBACK SEGMENT.

PostgreSQL doesn't have a rollback segment, so there's nothing to run
out of here.  Where Oracle would tend to encourage you to keep your
transactions rather small, PostgreSQL doesn't require you to care about
that.

Big transactions, on PostgreSQL, are really no big deal.


Oracle would most certainly NOT tend to encourage you to keep your 
transactions rather small! As in any other relational database, they 
would -I suggest- encourage you to commit when it is transactionally 
appropriate.


Do you care if your load fails part-way through and the rows inserted up 
to that point remain behind? If not, commit frequently. But if it has to 
be 'all 110 million or none', then you have one large transaction on 
your hands, and you'll have to resource for that -no matter which 
database you're using.
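
If frequent commits are acceptable, note that on 8.3 the batching has
to be driven from the client: a PL/pgSQL function runs inside a single
transaction and cannot commit part-way through. A minimal sketch,
assuming a hypothetical integer primary key 'id' on t, run from psql
with autocommit on so each slice commits separately:

update t set col_c = col_a + col_b where id between 1 and 1000000;
-- each statement commits on its own; repeat for the next slice:
update t set col_c = col_a + col_b where id between 1000001 and 2000000;
-- ...and so on, up through the highest id.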


Oracle may run out of rollback segment space (for the past 7 years, 
they're called undo segments, but no real matter). But other databases 
will run out of transaction log space, or have a problem handling 
multi-version concurrency control issues for an extended length of time. 
No matter what a particular RDBMS may call it, all would have resource 
issues to deal with when coping with a long-running, multi-million row 
transaction.





Re: [GENERAL] DBI::Oracle problems

2010-07-02 Thread Howard Rogers
Just thought I'd write back to say that (I think!) it had nothing to do with
the software version, and everything to do with me knowing more about Oracle
than PostgreSQL!

I had not been sure how to get PostgreSQL to re-start every time the server
bounced, so I followed the advice I'd seen in a number of tutorials dotted
around the place (for example,
http://www.cyberciti.biz/faq/howto-fedora-linux-install-postgresql-server/
 or http://wiki.scinterface.com/index.php/HowTo:_Install_PostgreSQL), which
involved simply issuing the command 'chkconfig on'.

That turns out to have been a mistake: from what I can guess, it means the
PostgreSQL service starts before the Oracle environment variables are set.

If instead I simply did 'chkconfig off' and then manually started PostgreSQL
(postgres -D /usr/local/pgsql/data >logfile 2>&1 &), then all worked fine,
as many times as I liked, and no matter how often I logged off and back on.
I dare say some fiddling with the startup script will enable me to do the
chkconfig on AND have my environment variables picked up correctly.

Frustrating, nevertheless, but at least it's fixable.

Regards
HJR




On Thu, Jul 1, 2010 at 5:24 AM, David Fetter da...@fetter.org wrote:

 On Wed, Jun 30, 2010 at 10:10:02AM +1000, Howard Rogers wrote:
  I am stumped, despite working on this for a week! I am trying to create a
  64-bit postgresql 8.4 database server which can retrieve data from
 various
  64-bit Oracle 10gR2 and 11gR2 databases.

 Try downloading the latest version of DBI-Link using the Download
 Source link at http://github.com/davidfetter/DBI-Link

 There is also a low-traffic mailing list for the project, where
 questions like this are more on point :)

 Cheers,
 David.
 --
 David Fetter da...@fetter.org http://fetter.org/
 Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
 Skype: davidfetter  XMPP: david.fet...@gmail.com
 iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

 Remember to vote!
 Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [GENERAL] DBI::Oracle problems

2010-06-30 Thread Howard Rogers
Thank you David.

I must say, I find mailing lists extremely confusing, and wish there was a
proper forum-type place to go to! My apologies for mailing to the wrong
place: I am now not sure whether to keep it here or not! I only wrote here
after noting that previous questions about DBI-Link (some going back to
2005, admittedly) were posted here -but, as I say, apologies if that was
wrong.

Anyway: I have been using version 2.0.0 of DBI-Link, which was last
updated three years ago. I see from your link that there are 2010 files
available, so I'll definitely give that a whirl.

@Alexander: yes, my server is configured properly. At least, the postgres
user and the root user (as well as the oracle user) can all use SQL*Plus to
connect to the remote database without having to set any *additional*
environment variables, as I mentioned originally. Quite what else I'm
supposed to set, if anything, I can't imagine!

I do note, however, that after I run the DBI-Link SQL statements, I get
nothing at all in the dbi_link.dbi_connection_environment table, which is
probably why it's not working the second time around. Why the function being
called can't read the environment variables which are most definitely set,
or why it can't insert what it's read into the relevant table, I have no
idea.
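
For what it's worth, another thing I mean to try is populating that
table by hand. This is a blind sketch only -I'm guessing at the column
names (data_source_id, env_name, env_value), so checking the table's
actual definition with \d dbi_link.dbi_connection_environment comes
first:

insert into dbi_link.dbi_connection_environment
       (data_source_id, env_name, env_value)
values (1, 'ORACLE_HOME', '/u01/app/oracle/product/11.2.0/db_1'),
       (1, 'LD_LIBRARY_PATH', '/u01/app/oracle/product/11.2.0/db_1/lib');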

I'll try with the latest software David linked to and see how I get on.

Thanks to you both,
HJR




[GENERAL] DBI::Oracle problems

2010-06-29 Thread Howard Rogers
I am stumped, despite working on this for a week! I am trying to create a
64-bit postgresql 8.4 database server which can retrieve data from various
64-bit Oracle 10gR2 and 11gR2 databases.


   - I have a freshly-installed 64-bit CentOS 5.5, no firewall, no SELinux.
   - I create an oracle user and do a run-time 11gR2 Client installation
   (so, the full-blown client, not the instant client)
   - I set ORACLE_HOME, ORACLE_BASE, PATH, LD_LIBRARY_PATH, CLASSPATH,
   ORA_NLS10, TWO_TASK in /etc/profile (see below for precise details)
   - I can connect to my Oracle database (on a remote server) in SQL*Plus,
   both as the root user and as the oracle user.
   - I then create a new postgres user, install postgresql, create a new
   superuser, create a new database owned by the new superuser and confirm the
   new user can connect to the new database.
   - As root, I used cpan to install DBI, DBD::Oracle and YAML.
   - As root, I did *yum install postgresql-plperl*
   - As the new postgres superuser, I did *create language plperlu*
   - I downloaded the dbi-link software from
   http://pgfoundry.org/projects/dbi-link.
   - Still as the new postgres superuser, I ran the *dbi_link.sql* script
   contained in that download
   - Then I ran the two SQL statements contained in the README found in that
   download. The second of these causes a bunch of _shadow tables and views to
   be created as a select from a schema in one of the remote Oracle databases,
   which is the good news part.

If I then immediately select from one of those tables, I get data returned,
which is really excellent news ...but the good news ends shortly after that,
as this demonstrates:

ims=# select BRAND_ID from usdata.BRAND_S;
NOTICE:  SELECT dbi_link.cache_connection( 1 ) at line 12.

 BRAND_ID
--
 1032
 1115
 1254
...
 2454
 2455
 2114
 2474
 2475
(290 rows)

ims=# \q
[postg...@pgx64 ~]$ psql -d ims
psql (8.4.4)
Type "help" for help.

ims=# select BRAND_ID from usdata.BRAND_S;
NOTICE:  Setting bail in %_SHARED hash. at line 25.

CONTEXT:  SQL statement SELECT dbi_link.dbi_link_init()
NOTICE:  Setting quote_literal in %_SHARED hash. at line 25.

CONTEXT:  SQL statement SELECT dbi_link.dbi_link_init()
NOTICE:  Setting get_connection_info in %_SHARED hash. at line 25.

CONTEXT:  SQL statement SELECT dbi_link.dbi_link_init()
NOTICE:  Setting quote_ident in %_SHARED hash. at line 25.

CONTEXT:  SQL statement SELECT dbi_link.dbi_link_init()
NOTICE:  Setting get_dbh in %_SHARED hash. at line 25.

CONTEXT:  SQL statement SELECT dbi_link.dbi_link_init()
NOTICE:  Setting remote_exec_dbh in %_SHARED hash. at line 25.

CONTEXT:  SQL statement SELECT dbi_link.dbi_link_init()
NOTICE:  SELECT dbi_link.cache_connection( 1 ) at line 12.

NOTICE:  In cache_connection, there's no shared dbh 1 at line 7.

CONTEXT:  SQL statement SELECT dbi_link.cache_connection( 1 )
NOTICE:  Entering get_connection_info at line 44.

CONTEXT:  SQL statement SELECT dbi_link.cache_connection( 1 )
NOTICE:  ref($args) is HASH
---
data_source_id: 1

CONTEXT:  SQL statement SELECT dbi_link.cache_connection( 1 )
NOTICE:  Leaving get_connection_info at line 75.

CONTEXT:  SQL statement SELECT dbi_link.cache_connection( 1 )
NOTICE:  ---
auth: password
data_source: dbi:Oracle:database=usdata;sid=usdata;host=192.168.0.60
dbh_attributes: |
  ---
  AutoCommit: 1
  RaiseError: 1
local_schema: usdata
remote_catalog: ~
remote_schema: ~
user_name: remoteuser

CONTEXT:  SQL statement SELECT dbi_link.cache_connection( 1 )
NOTICE:  In get_dbh, input connection info is
---
auth: password
data_source: dbi:Oracle:database=usdata;sid=usdata;host=192.168.0.60
dbh_attributes: |
  ---
  AutoCommit: 1
  RaiseError: 1
local_schema: usdata
remote_catalog: ~
remote_schema: ~
user_name: remoteuser

CONTEXT:  SQL statement SELECT dbi_link.cache_connection( 1 )
ERROR:  error from Perl function remote_select: error from Perl function
cache_connection: DBI
connect('database=usdata;sid=usdata;host=192.168.0.60','remoteuser',...)
failed: ERROR OCIEnvNlsCreate. Check ORACLE_HOME (Linux) env var  or PATH
(Windows) and or NLS settings, permissions, etc. at line 137 at line 13.
ims=#

In my many, many tests, I have read a lot of posts on Google and elsewhere
about the need to set ORACLE_HOME, LD_LIBRARY_PATH and so on to avoid these
problems... but what I don't get is that the only difference between my two
selects is that I quit out of psql! If the environment variables were wrong
second time round, why were they OK the first time?!
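
One thing I intend to try is asking the backend itself what it thinks
its environment is. A throwaway plperlu function along these lines
(the name show_oracle_env is just mine) should reveal whether %ENV
differs between the first session and the second:

create or replace function show_oracle_env() returns text as $$
  # report the variables DBD::Oracle cares about, as this backend sees them
  return join("\n",
    map { "$_=" . (defined $ENV{$_} ? $ENV{$_} : '(unset)') }
    qw(ORACLE_HOME LD_LIBRARY_PATH TWO_TASK NLS_LANG ORA_NLS10));
$$ language plperlu;

select show_oracle_env();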

For the record, here's the contents of my /etc/profile:

ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=/u01/app/oracle/product/11.2.0/db_1
ORACLE_SID=usdata
export ORACLE_BASE ORACLE_HOME ORACLE_SID
export ORA_NLS10=/u01/app/oracle/product/11.2.0/db_1/nls/data
export TWO_TASK=usdata
export ORA_USERID=remoteuser/password
export NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
PATH=$ORACLE_HOME/bin:$PATH:.
export PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/lib64:/lib:/usr/lib64:/usr/lib