Re: [HACKERS] Unaccent performance

2013-06-22 Thread Andres Freund
On 2013-06-21 22:52:04 +0100, Thom Brown wrote:
  CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
   RETURNS text
   LANGUAGE sql
   IMMUTABLE
  AS $function$
  SELECT
  replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
  ;
  $function$
 

 Another test passing in a string of 10 characters gives the following
 timings:
 
 unaccent: 240619.395 ms
 myunaccent: 785.505 ms

The reason for that is that unaccent is 'stable' while your function is
'immutable', so the planner recognizes that and computes it only once
since you're always passing the same text string to it.

 Another test inserting long text strings into a text column of a table
 100,000 times, then updating another column to have that unaccented value
 using both methods:
 
 unaccent: 3867.306 ms
 myunaccent: 43611.732 ms

Whereas it cannot recognize that in this case.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unaccent performance

2013-06-21 Thread Thom Brown
Hi,

The unaccent extension is great, especially with its customisability, but
it's not always easy to recommend.  I witnessed a customer using no less
than 56 nested replace functions in an SQL function.  I looked to see how
much this can be mitigated by unaccent.  It turns out that not all the
characters they were replacing can be replaced by unaccent, either because
they replace more than 1 character at a time, or the character they're
replacing, for some reason, isn't processed by unaccent, even with a custom
rules file.

So there were 20 characters I could identify that they were replacing.  I
made a custom rules file and compared its performance to the
difficult-to-manage set of nested replace calls.

CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
 RETURNS text
 LANGUAGE sql
 IMMUTABLE
AS $function$
SELECT
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
;
$function$

postgres=# SELECT myunaccent(sometext::text) FROM (SELECT
'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,100)) x OFFSET
99 LIMIT 1;
  myunaccent
--
 AAAaaaAaAaAa
(1 row)

Time: 726.282 ms
postgres=# SELECT unaccent(sometext::text) FROM (SELECT
'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,100)) x OFFSET
99 LIMIT 1;
   unaccent
--
 AAAaaaAaAaAa
(1 row)

Time: 3305.252 ms

The timings are actually pretty much the same even if I introduce 187
nested replace calls for every line in the unaccent.rules file for 187
characters.  But the same character set with unaccent increases to 7418.526
ms with the same type of query as above.  That's 10 times more expensive.

Is there a way to boost the performance to make its adoption more palatable?

-- 
Thom


Re: [HACKERS] Unaccent performance

2013-06-21 Thread Thom Brown
On 21 June 2013 19:04, Thom Brown t...@linux.com wrote:

 Hi,

 The unaccent extension is great, especially with its customisability, but
 it's not always easy to recommend.  I witnessed a customer using no less
 than 56 nested replace functions in an SQL function.  I looked to see how
 much this can be mitigated by unaccent.  It turns out that not all the
 characters they were replacing can be replaced by unaccent, either because
 they replace more than 1 character at a time, or the character they're
 replacing, for some reason, isn't processed by unaccent, even with a custom
 rules file.

 So there were 20 characters I could identify that they were replacing.  I
 made a custom rules file and compared its performance to the
 difficult-to-manage set of nested replace calls.

 CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
 AS $function$
 SELECT
 replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
 ;
 $function$

 postgres=# SELECT myunaccent(sometext::text) FROM (SELECT
 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,100)) x OFFSET
 99 LIMIT 1;
   myunaccent
 --
  AAAaaaAaAaAa
 (1 row)

 Time: 726.282 ms
 postgres=# SELECT unaccent(sometext::text) FROM (SELECT
 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,100)) x OFFSET
 99 LIMIT 1;
unaccent
 --
  AAAaaaAaAaAa
 (1 row)

 Time: 3305.252 ms

 The timings are actually pretty much the same even if I introduce 187
 nested replace calls for every line in the unaccent.rules file for 187
 characters.  But the same character set with unaccent increases to 7418.526
 ms with the same type of query as above.  That's 10 times more expensive.

 Is there a way to boost the performance to make its adoption more
 palatable?


Another test passing in a string of 10 characters gives the following
timings:

unaccent: 240619.395 ms
myunaccent: 785.505 ms

I guess this must indicate that unaccent is processing all rows, and
myunaccent is only being run on the 1 select row?  I can't account for
myunaccent always being almost the same duration regardless of string
length otherwise.  This is probably an incorrect assessment of performance.

Another test inserting long text strings into a text column of a table
100,000 times, then updating another column to have that unaccented value
using both methods:

unaccent: 3867.306 ms
myunaccent: 43611.732 ms

So I guess this complaint about performance is all just noise.

However, pushing that pointless complaint to one side, I would like to have
the ability to have unaccent support more characters that it doesn't
currently seem to support, such as bullet points, ellipses etc., and also
more than 1 character being replaced.  Naturally these aren't appropriate
to fall under the unaccent function itself, but the rules file is good
starting point.  It would be a bit like translate, except it would use a
rules file instead of providing strings of single characters to convert.

So say we wanted (trademark) to be converted into ™ just as an example,
or ; to ..  We can't do that with unaccent, but in order to avoid a
huge list of replace functions, a function like unaccent, with a few
adaptations, would solve the problem.

e.g.:

SELECT transform(my_custom_dictionary, 'Commodore Amiga(trademark);')

would return

Commodore Amiga™.

This would ideally somehow cater for replacing tabs and spaces too.

-- 
Thom