[HACKERS] easy way of copying regex_t ?

2012-01-05 Thread Tomas Vondra
Hi all,

I've been working on moving an extension that allows to move the ispell
dictionaries to shared segment. It's almost complete, the last FIXME is
about copying regex_t structure (stored in AFFIX).

According to regex.h the structure is fairly complex and not exactly easy
to understand, so I'd like to know if anyone here already implemented that
or something that may serve the same purpose. Any ideas?

kind regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] easy way of copying regex_t

2016-01-24 Thread Artur Zakirov

Hi all,

I've been working on moving an extension that allows to move the ispell
dictionaries to shared segment. It's almost complete, the last FIXME is
about copying regex_t structure (stored in AFFIX).

According to regex.h the structure is fairly complex and not exactly easy
to understand, so I'd like to know if anyone here already implemented that
or something that may serve the same purpose. Any ideas?

kind regards
Tomas


This message is the reply to the message 
http://www.postgresql.org/message-id/dd02a31fdeffbf5cb24771e34213b40f.squir...@sq.gransy.com

Sorry, I can't reply to it directly. I can't get it from archive.

Thank you for your extension shared_ispell. It is very useful. I have 
got it from https://github.com/tvondra/shared_ispell
With this message I want to send some patch to your repository with 
draft of a code, which allows shared_ispell to copy regex_t.


The main idea of the patch is:
- we doesn't need copy all regex_t structure
- most of fields and structures used only in a compile time
- we need copy structures: guts, colormap, subre, cnfa
- from the subre structure we need only cnfa

colormap represents a directed acyclic graph. cnfa represents a 
nondeterministic finite automaton.


In this patch also was done the following:
- added regression tests
- deleted spell.h and spell.c since they have duplicate code
- added shared_ispell.h which declares some structures
- fix an issue when stopFile can be NULL
- fixed countCMPDAffixes since theoretically could be zero affix
- added copyCMPDAffix

Question to hackers. Can such patch be useful as a PostgreSQL patch to 
Full-Text search? Is it needed?


shared_ispell loads dictionaries into a shared memory. The main benefits 
are:
- saving of memory. Every dictionary is loaded only once. Dictionaries 
are not loaded for each backend. In current version of PostgreSQL 
dictionaires are loaded for each backend where it was requested.
- saving of time. The first load of a dictionary takes much time. With 
this patch dictionaries will be loaded only once.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/Makefile
--- b/Makefile
***
*** 1,18 
  MODULE_big = shared_ispell
! OBJS = src/shared_ispell.o src/spell.o
  
  EXTENSION = shared_ispell
! DATA = sql/shared_ispell--1.0.0.sql
! MODULES = shared_ispell
  
! CFLAGS=`pg_config --includedir-server`
  
  PG_CONFIG = pg_config
  PGXS := $(shell $(PG_CONFIG) --pgxs)
  include $(PGXS)
! 
! all: shared_ispell.so
! 
! shared_ispell.so: $(OBJS)
! 
! %.o : src/%.c
--- 1,20 
+ # contrib/shared_ispell/Makefile
+ 
  MODULE_big = shared_ispell
! OBJS = src/shared_ispell.o
  
  EXTENSION = shared_ispell
! DATA = sql/shared_ispell--1.1.0.sql
  
! REGRESS = shared_ispell
  
+ ifdef USE_PGXS
  PG_CONFIG = pg_config
  PGXS := $(shell $(PG_CONFIG) --pgxs)
  include $(PGXS)
! else
! subdir = contrib/shared_ispell
! top_builddir = ../..
! include $(top_builddir)/src/Makefile.global
! include $(top_srcdir)/contrib/contrib-global.mk
! endif
*** a/README.md
--- b/README.md
***
*** 13,28  If you need just snowball-type dictionaries, this extension is not
  really interesting for you. But if you really need an ispell
  dictionary, this may save you a lot of resources.
  
- Warning
- ---
- The extension does not yet handle affixes that require full regular
- expressions (regex_t, implemented in regex.h). This is indicated by
- an error when initializing the dictionary.
- 
- Simple affixes and affixes that can be handled by fast regex subset
- (as implemented in regis.h) are handled just fine.
- 
- 
  Install
  ---
  Installing the extension is quite simple, especially if you're on 9.1.
--- 13,18 
*** /dev/null
--- b/expected/shared_ispell.out
***
*** 0 
--- 1,193 
+ CREATE EXTENSION shared_ispell;
+ -- Test ISpell dictionary with ispell affix file
+ CREATE TEXT SEARCH DICTIONARY shared_ispell (
+ Template=shared_ispell,
+ DictFile=ispell_sample,
+ AffFile=ispell_sample
+ );
+ SELECT ts_lexize('shared_ispell', 'skies');
+  ts_lexize 
+ ---
+  {sky}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'bookings');
+ts_lexize
+ 
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'booking');
+ts_lexize
+ 
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'foot');
+  ts_lexize 
+ ---
+  {foot}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'foots');
+  ts_lexize 
+ ---
+  {foot}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'rebookings');
+ts_lexize
+ 
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'rebooking');
+ts_lexize
+ 
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'rebook');
+  ts_lexize 
+ ---
+  
+ (1 row)
+ 
+ SELECT ts_lexize('

Re: [HACKERS] easy way of copying regex_t

2016-01-24 Thread Tom Lane
Artur Zakirov  writes:
> With this message I want to send some patch to your repository with 
> draft of a code, which allows shared_ispell to copy regex_t.

Allowing ispell.c to know that much about regex internals strikes me as
completely unacceptable from a modularity or maintainability standpoint.
If we want to do that, the appropriate thing would be to add a function
to backend/regex/ that copies a regex_t.

However, I'm rather suspicious of the safety of copying a regex_t into
shared memory in the first place.  It contains function pointers, which
we have not historically assumed would be portable between different
backend processes.  And the regex library is old enough to have never
heard of thread safety, so I'm not really sure that it considers the
regex_t structures to be read-only at execution time.

> shared_ispell loads dictionaries into a shared memory. The main benefits 
> are:
> - saving of memory. Every dictionary is loaded only once. Dictionaries 
> are not loaded for each backend. In current version of PostgreSQL 
> dictionaires are loaded for each backend where it was requested.
> - saving of time. The first load of a dictionary takes much time. With 
> this patch dictionaries will be loaded only once.

Where does the shared memory space come from?  It would not take too
many dictionaries to use up whatever slop is available.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] easy way of copying regex_t

2016-01-25 Thread Tomas Vondra

On 01/24/2016 10:41 PM, Tom Lane wrote:

Artur Zakirov  writes:

With this message I want to send some patch to your repository with
draft of a code, which allows shared_ispell to copy regex_t.


Allowing ispell.c to know that much about regex internals strikes me
as completely unacceptable from a modularity or maintainability
standpoint. If we want to do that, the appropriate thing would be to
add a function to backend/regex/ that copies a regex_t.


I share the feeling - that's too much insight into internal regex data 
structures, so should be part of backend/regex and not in the extension.



However, I'm rather suspicious of the safety of copying a regex_t into
shared memory in the first place.  It contains function pointers, which
we have not historically assumed would be portable between different
backend processes.  And the regex library is old enough to have never
heard of thread safety, so I'm not really sure that it considers the
regex_t structures to be read-only at execution time.


Right, it's definitely not thread-safe so there'd need to be some lock 
protecting the regex_t copy. I was thinking about either using a group 
of locks, each protecting a small subset of the affixes (thus making it 
possible to work in parallel to some extent), or simply using a single 
lock and each process would make a private copy at the beginning.


In the end, I've decided to do it differently, and simply parse the 
affix list from scratch in each process. The affix list is tiny and 
takes less than a millisecond to parse in most cases, and I don't have 
to care about the regex stuff at all. The main benefit is from sharing 
parsed wordlist anyway.





shared_ispell loads dictionaries into a shared memory. The main benefits
are:
- saving of memory. Every dictionary is loaded only once. Dictionaries
are not loaded for each backend. In current version of PostgreSQL
dictionaires are loaded for each backend where it was requested.
- saving of time. The first load of a dictionary takes much time. With
this patch dictionaries will be loaded only once.


Where does the shared memory space come from? It would not take too
many dictionaries to use up whatever slop is available.


It's an old-school shared segment created by the extension at init time. 
You're right the size is fixed so it's possible to run out of space by 
loading too many dictionaries, but that was not a big deal for the type 
of setups it was designed for - in those cases the list of dictionaries 
is stable, so it's possible to size the segment accordingly in advance.


But I guess we could do better now that we have dynamic shared memory, 
possibly allocating one segment per dictionary as needed, or something 
like that.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] easy way of copying regex_t

2016-01-25 Thread Artur Zakirov

On 25.01.2016 13:07, Tomas Vondra wrote:


Right, it's definitely not thread-safe so there'd need to be some lock
protecting the regex_t copy. I was thinking about either using a group
of locks, each protecting a small subset of the affixes (thus making it
possible to work in parallel to some extent), or simply using a single
lock and each process would make a private copy at the beginning.

In the end, I've decided to do it differently, and simply parse the
affix list from scratch in each process. The affix list is tiny and
takes less than a millisecond to parse in most cases, and I don't have
to care about the regex stuff at all. The main benefit is from sharing
parsed wordlist anyway.


This is nice decision since the affix list is small. For our task I will 
change shared_ispell to use this solution.




It's an old-school shared segment created by the extension at init time.
You're right the size is fixed so it's possible to run out of space by
loading too many dictionaries, but that was not a big deal for the type
of setups it was designed for - in those cases the list of dictionaries
is stable, so it's possible to size the segment accordingly in advance.

But I guess we could do better now that we have dynamic shared memory,
possibly allocating one segment per dictionary as needed, or something
like that.

regards



Yes it would be better as we will not need to define the maximum size of 
the shared segment in postgresql.conf.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers