Re: [GENERAL] store key name pattern search

2017-04-04 Thread Oleg Bartunov
On Tue, Apr 4, 2017 at 11:41 AM, Armand Pirvu (home)  wrote:

> Hi
>
> I have the following case
>
>
> select * from foo;
>                                     col1
> --------------------------------------------------------------------------
>  "show_id"=>"1", "group_id"=>"32", "group_name"=>"slb",
> "group_add_by"=>"557651"
>  "show_id"=>"2", "group_id"=>"33", "group_name"=>"slc",
> "item_add_by"=>"557652"
> (2 rows)
>
> Is there anyway I can do a pattern search by hstore key name something like
>
> select * from foo where skeys(col1) like '%add_by%';
>
> I looked on the doc but did not see anything , or did I miss it ?
>

No. You may convert the hstore to jsonb and use jsquery:
https://github.com/postgrespro/jsquery
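
If you only need the pattern match as written, a plain-SQL workaround
(untested sketch, no index support) is to expand each row's keys and test
them against the pattern:

SELECT *
FROM foo
WHERE EXISTS (
    SELECT 1
    FROM skeys(foo.col1) AS k    -- set-returning: one row per key
    WHERE k LIKE '%add_by%'
);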

>
> Thanks
>
> Armand
>
>
>


Re: [GENERAL] Incremental / Level -1 backup in PG

2017-03-22 Thread Oleg Bartunov
On Wed, Mar 22, 2017 at 3:27 AM, Rakesh Kumar 
wrote:

> PG does not have a concept of incremental  backup. The way it works in
> Oracle and other RDBMS is that incremental backup only backups up changed
> blocks since the last full backup.  So if only 10% of blocks changed since
> the last full backup, incremental backup will be only for 10%.
> I am wondering whether it is technically feasible to implement it like
> this:
>

Have you seen pg_probackup? It's in the development stage and needs somebody
to test it.



>
> 1 - At the time of full backup, note the last modified time of each data
> file in a repository.
> 2 - Next time when incremental backup runs, for every data file it will
> check the last modified time of it with the one in the repository to
> determine whether it has changed since last full backup. If yes, back it up.
>
> Now on to restore:
>
> 1 - First restore full backup.
> 2 - Restore incremental backup.
>
> My question: Will it work in PG?
>
>


Re: [GENERAL] Full Text Search combined with Fuzzy

2017-02-27 Thread Oleg Bartunov
On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris  wrote:

> Hello,
>
> AFAIK there is no built-in way to combine full text search and fuzzy
> matching
> (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
> By example, phrase searching with tipos in it.
>
> First I don't know if postgresql concurrents (lucene based...) are able
> to do so.
>

Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used
for this.
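
A minimal sketch of the usual pg_trgm setup (the table and column names are
made up):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX docs_body_trgm_idx ON docs USING gin (body gin_trgm_ops);

-- '%' is the trigram similarity operator; it tolerates typos like 'postgrez'
SELECT body, similarity(body, 'postgrez') AS sml
FROM docs
WHERE body % 'postgrez'
ORDER BY sml DESC
LIMIT 10;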

>
> Second, is such feature is in the road map ?
>
> Third, I wonder if it is a good idea to use the postgresql synonyms
> feature for such prupose.(https://www.postgresql.org/docs/current/
> static/textsearch-dictionaries.html)
> I mean, building up a synonyms dictionnary containing tipos. By eg:
>
> postgrespgsql
> postgresql  pgsql
> postgrezpgsql
> postgre pgsql
> gogle   googl
> gooogle googl
>
> There is multiple way to build such dictionary. But my question is about
> the implementation of dictionnaries in postgresql: Is postgresql
> supposed to take advantage of billion entries dictionaries ?
>

A dictionary is just a program, so it's up to the developer to write an
efficient program that can deal with billions of entries. The synonym
dictionary specifically is not intended to work with a large number of
entries. Btw, have a look at the contrib/dict_xsyn dictionary, which is more
flexible than synonym.
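
A rough sketch of wiring up dict_xsyn for typo mapping (the rules file name
is an assumption; the file lives in $SHAREDIR/tsearch_data/):

CREATE EXTENSION dict_xsyn;

-- typos.rules maps each typo to the canonical word, one rule per line, e.g.:
--   postgrez pgsql
--   postgre pgsql
CREATE TEXT SEARCH DICTIONARY typo_dict (
    TEMPLATE = xsyn_template,
    RULES = 'typos',
    MATCHORIG = true,   -- accept the headword (the typo)
    KEEPORIG = false    -- emit only the synonyms (the canonical form)
);

SELECT ts_lexize('typo_dict', 'postgrez');  -- {pgsql}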

>
> Thanks by advance for you answers,
>
>


Re: [GENERAL] Searching array for multiple items

2017-01-25 Thread Oleg Bartunov
On Wed, Jan 25, 2017 at 11:29 AM, Alex Magnum  wrote:

> Hi,
> I can search an array with 1 = ANY('{1,3,4,7}'::int[])
>
> I need to check for one or multiple items in the array.
>
> e.g. '1,7,3'  = ANY('{1,3,4,7}'::int[]
>
> I do need to check if
> a) all items exist in the array
> b) at least one item exists in the array
>
> Is there a an operator that allows me to do these two?
> Does the order of left and right side matter?
> Right now I have a small function but I guess there is a more efficient
> way.
>

Look at our contrib/intarray.
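
Roughly (untested, assuming a table t with a column arr int[]):

CREATE EXTENSION intarray;

-- a) all items must exist in the array: containment (sides are not symmetric)
SELECT * FROM t WHERE arr @> '{1,7,3}'::int[];

-- b) at least one item exists in the array: overlap (symmetric)
SELECT * FROM t WHERE arr && '{1,7,3}'::int[];

-- both operators are supported by the intarray GIN opclass:
CREATE INDEX t_arr_idx ON t USING gin (arr gin__int_ops);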


>
> Thanks for any help.
>
> A
>
>
>


Re: [GENERAL] Incorrect information about GIN-index in RUM's docs

2016-11-14 Thread Oleg Bartunov
On Tue, Nov 15, 2016 at 10:37 AM, Andreas Joseph Krogh 
wrote:

> This item isn't valid, is it?
>
>
>- There isn't phrase search with GIN index. This problem relates with
>previous problem. It is need position information to perform phrase search.
>
> RUM being an extention, and having index-access-methods as extentions is
> new in 9.6, where phrase-search was introduced, this isn't really true, or
> am I missing something?
>

RUM is very good for phrase search, since it has access to word positions in
the index; check the slides from PGConf.EU (
http://www.sai.msu.su/~megera/postgres/talks/pgconfeu-fts-2016.pdf). Where
did you find that citation?



>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>


Re: [GENERAL] Trigram is slow when 10m rows

2016-11-13 Thread Oleg Bartunov
On Sun, Nov 13, 2016 at 5:25 PM, Aaron Lewis <the.warl0ck.1...@gmail.com>
wrote:

> Thanks Oleg.
>
> I've increased work_mem to 128MB, now the query falls down to 1.7s,
> faster but still not good enough.
>
> Is there any other thing I can do about it?
>

Your query 'x264' is short in terms of the number of trigrams, so a trigram
index isn't a good fit. Did you try text_pattern_ops for btree? Something like:
create index title_btree_idx on mytable using btree (title text_pattern_ops
);
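
Untested, but since your EXPLAIN below runs ilike with no wildcards at all,
a case-folded expression index may fit even better:

create index title_lower_idx on mytable (lower(title));
select * from mytable where lower(title) = lower('x264');

-- text_pattern_ops pays off for left-anchored patterns:
select * from mytable where title like 'x264%';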

>
> test=# explain analyze select * from mytable where title ilike 'x264';
>                                   QUERY PLAN
> --------------------------------------------------------------------------
>  Bitmap Heap Scan on mytable  (cost=462.69..5639.67 rows=1380
> width=83) (actual time=1754.656..1754.656 rows=0 loops=1)
>Recheck Cond: (title ~~* 'x264'::text)
>Rows Removed by Index Recheck: 1220793
>Heap Blocks: exact=197567
>->  Bitmap Index Scan on title_trgm_idx  (cost=0.00..462.35
> rows=1380 width=0) (actual time=346.663..346.663 rows=1220793 loops=1)
>  Index Cond: (title ~~* 'x264'::text)
>  Planning time: 1.168 ms
>  Execution time: 1755.944 ms
>
>
> On Sun, Nov 13, 2016 at 10:04 PM, Oleg Bartunov <obartu...@gmail.com>
> wrote:
> >
> >
> > On Sun, Nov 13, 2016 at 2:54 PM, Aaron Lewis <the.warl0ck.1...@gmail.com
> >
> > wrote:
> >>
> >> I have a simple table with Trigram index,
> >>
> >> create table mytable(hash char(40), title text);
> >> create index title_trgm_idx on mytable using gin(title gin_trgm_ops);
> >>
> >> When I run a query with 10m rows, it uses the Trigram index, but takes
> >> 3s to execute, very slow.
> >> (I have 80m rows, but only inserted 10m for testing purpose)
> >>
> >> test=# select count(*) from mytable;
> >>   count
> >> --
> >>  13971887
> >> (1 row)
> >>
> >> test=# explain analyze select * from mytable where title ilike 'x264';
> >>                                  QUERY PLAN
> >> ------------------------------------------------------------------------
> >>  Bitmap Heap Scan on mytable  (cost=462.69..5639.67 rows=1380
> >> width=83) (actual time=2937.308..2937.308 rows=0 loops=1)
> >>Recheck Cond: (title ~~* 'x264'::text)
> >>Rows Removed by Index Recheck: 11402855
> >>Heap Blocks: exact=39557 lossy=158010
> >>->  Bitmap Index Scan on title_trgm_idx  (cost=0.00..462.35
> >> rows=1380 width=0) (actual time=342.440..342.440 rows=1220793 loops=1)
> >>  Index Cond: (title ~~* 'x264'::text)
> >>  Planning time: 0.611 ms
> >>  Execution time: 2937.729 ms
> >> (8 rows)
> >>
> >> Any ideas to speed things up?
> >
> >
> >Rows Removed by Index Recheck: 11402855
> >Heap Blocks: exact=39557 lossy=158010
> >
> > You need to increase work_mem
> >>
> >>
> >> --
> >> Best Regards,
> >> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> >> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
> >>
> >>
> >
> >
>
>
>
> --
> Best Regards,
> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
>


Re: [GENERAL] Why is this query not using GIN index?

2016-11-13 Thread Oleg Bartunov
On Sun, Nov 13, 2016 at 6:05 PM, Aaron Lewis <the.warl0ck.1...@gmail.com>
wrote:

> Sigh, didn't notice that. Thanks for the heads up.
>
> It takes 500ms with 10m rows, could it be faster?
>

Sure. The recheck with a function call is pretty expensive, so I'd not
recommend a functional index here; just create a separate column of type
tsvector (materialize the to_tsvector result) and create a gin index on it.
You'll be surprised.
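
In other words, something like this (a sketch; the column and trigger names
are up to you):

alter table mytable add column title_tsv tsvector;
update mytable set title_tsv = to_tsvector('english', title);
create index title_tsv_idx on mytable using gin (title_tsv);

-- keep the column in sync on insert/update:
create trigger mytable_tsv_update
    before insert or update on mytable
    for each row
    execute procedure tsvector_update_trigger(title_tsv, 'pg_catalog.english', title);

-- the recheck now reads the stored tsvector instead of calling to_tsvector():
select * from mytable where title_tsv @@ to_tsquery('x264');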


> I've increased work_mem to 256MB
>
> test=# explain analyze select * from mytable where
> to_tsvector('english', title) @@ 'x264'::tsquery limit 1000 offset 10;
>                                   QUERY PLAN
> --------------------------------------------------------------------------
>  Limit  (cost=684.06..2949.42 rows=1000 width=83) (actual
> time=348.506..536.483 rows=1000 loops=1)
>->  Bitmap Heap Scan on mytable  (cost=661.41..158917.22 rows=69859
> width=83) (actual time=345.354..536.199 rows=1010 loops=1)
>  Recheck Cond: (to_tsvector('english'::regconfig, title) @@
> '''x264'''::tsquery)
>  Rows Removed by Index Recheck: 12242
>  Heap Blocks: exact=20 lossy=186
>  ->  Bitmap Index Scan on name_fts  (cost=0.00..643.95
> rows=69859 width=0) (actual time=333.703..333.703 rows=1044673
> loops=1)
>Index Cond: (to_tsvector('english'::regconfig, title)
> @@ '''x264'''::tsquery)
>  Planning time: 0.144 ms
>  Execution time: 537.212 ms
> (9 rows)
>
> On Sun, Nov 13, 2016 at 10:33 PM, Julien Rouhaud
> <julien.rouh...@dalibo.com> wrote:
> > On 13/11/2016 15:26, Aaron Lewis wrote:
> >> Hi Oleg,
> >>
> >> Can you elaborate on the title column? I don't get it.
> >>
> >
> >>>> create table mytable(hash char(40), title varchar(500));
> >>>> create index name_fts on mytable using gin(to_tsvector('english',
> >>>> 'title'));
> >
> > You created an index on the text 'title', not on the title column, so
> > the index is useless.
> >
> > Drop the existing index and create this one instead:
> >
> > create index name_fts on mytable using gin(to_tsvector('english',
> title));
> >
> >> On Sun, Nov 13, 2016 at 10:10 PM, Oleg Bartunov <obartu...@gmail.com>
> wrote:
> >>>
> >>>
> >>> On Sun, Nov 13, 2016 at 2:50 PM, Aaron Lewis <
> the.warl0ck.1...@gmail.com>
> >>> wrote:
> >>>>
> >>>> I have a simple table, and a gin index,
> >>>>
> >>>> create table mytable(hash char(40), title varchar(500));
> >>>> create index name_fts on mytable using gin(to_tsvector('english',
> >>>> 'title'));
> >>>
> >>>
> >>>
> >>> ^
> >>>
> >>>>
> >>>> create unique index md5_uniq_idx on mytable(hash);
> >>>>
> >>>> When I execute a query with tsquery, the GIN index was not in use:
> >>>>
> >>>> test=# explain analyze select * from mytable where
> >>>> to_tsvector('english', title) @@ 'abc | def'::tsquery limit 10;
> >>>>                                 QUERY PLAN
> >>>> ----------------------------------------------------------------------
> >>>>  Limit  (cost=0.00..277.35 rows=10 width=83) (actual
> >>>> time=0.111..75.549 rows=10 loops=1)
> >>>>->  Seq Scan on mytable  (cost=0.00..381187.45 rows=13744 width=83)
> >>>> (actual time=0.110..75.546 rows=10 loops=1)
> >>>>  Filter: (to_tsvector('english'::regconfig, (title)::text) @@
> >>>> '''abc'' | ''def'''::tsquery)
> >>>>  Rows Removed by Filter: 10221
> >>>>  Planning time: 0.176 ms
> >>>>  Execution time: 75.564 ms
> >>>> (6 rows)
> >>>>
> >>>> Any ideas?
> >>>>
> >
> > --
> > Julien Rouhaud
> > http://dalibo.com - http://dalibo.org
>
>
>
> --
> Best Regards,
> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
>


Re: [GENERAL] Why is this query not using GIN index?

2016-11-13 Thread Oleg Bartunov
On Sun, Nov 13, 2016 at 2:50 PM, Aaron Lewis 
wrote:

> I have a simple table, and a gin index,
>
> create table mytable(hash char(40), title varchar(500));
> create index name_fts on mytable using gin(to_tsvector('english',
> 'title'));
>


^
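
The marked line indexes the string literal 'title', not the title column, so
the index can never match the query's expression. The fix is simply to drop
the quotes:

drop index name_fts;
create index name_fts on mytable using gin (to_tsvector('english', title));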


> create unique index md5_uniq_idx on mytable(hash);
>
> When I execute a query with tsquery, the GIN index was not in use:
>
> test=# explain analyze select * from mytable where
> to_tsvector('english', title) @@ 'abc | def'::tsquery limit 10;
>                                   QUERY PLAN
> --------------------------------------------------------------------------
>  Limit  (cost=0.00..277.35 rows=10 width=83) (actual
> time=0.111..75.549 rows=10 loops=1)
>->  Seq Scan on mytable  (cost=0.00..381187.45 rows=13744 width=83)
> (actual time=0.110..75.546 rows=10 loops=1)
>  Filter: (to_tsvector('english'::regconfig, (title)::text) @@
> '''abc'' | ''def'''::tsquery)
>  Rows Removed by Filter: 10221
>  Planning time: 0.176 ms
>  Execution time: 75.564 ms
> (6 rows)
>
> Any ideas?
>
>
> --
> Best Regards,
> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
>
>


Re: [GENERAL] Trigram is slow when 10m rows

2016-11-13 Thread Oleg Bartunov
On Sun, Nov 13, 2016 at 2:54 PM, Aaron Lewis 
wrote:

> I have a simple table with Trigram index,
>
> create table mytable(hash char(40), title text);
> create index title_trgm_idx on mytable using gin(title gin_trgm_ops);
>
> When I run a query with 10m rows, it uses the Trigram index, but takes
> 3s to execute, very slow.
> (I have 80m rows, but only inserted 10m for testing purpose)
>
> test=# select count(*) from mytable;
>   count
> --
>  13971887
> (1 row)
>
> test=# explain analyze select * from mytable where title ilike 'x264';
>                                   QUERY PLAN
> --------------------------------------------------------------------------
>  Bitmap Heap Scan on mytable  (cost=462.69..5639.67 rows=1380
> width=83) (actual time=2937.308..2937.308 rows=0 loops=1)
>Recheck Cond: (title ~~* 'x264'::text)
>Rows Removed by Index Recheck: 11402855
>Heap Blocks: exact=39557 lossy=158010
>->  Bitmap Index Scan on title_trgm_idx  (cost=0.00..462.35
> rows=1380 width=0) (actual time=342.440..342.440 rows=1220793 loops=1)
>  Index Cond: (title ~~* 'x264'::text)
>  Planning time: 0.611 ms
>  Execution time: 2937.729 ms
> (8 rows)
>
> Any ideas to speed things up?
>

   Rows Removed by Index Recheck: 11402855
   Heap Blocks: exact=39557 lossy=158010

You need to increase work_mem
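
The lossy heap blocks mean the bitmap did not fit into work_mem and degraded
to per-page granularity, which forces the recheck of 11M rows. For example
(the value is only an illustration):

set work_mem = '128MB';  -- session-local
explain analyze select * from mytable where title ilike 'x264';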

>
> --
> Best Regards,
> Aaron Lewis - PGP: 0x13714D33 - http://pgp.mit.edu/
> Finger Print:   9F67 391B B770 8FF6 99DC  D92D 87F6 2602 1371 4D33
>
>


Re: [GENERAL] Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

2016-11-08 Thread Oleg Bartunov
On Tue, Nov 8, 2016 at 8:43 PM, otar shavadze  wrote:

> I have table with 500 000 rows, I have int[] column "my_array" in this
> table, this array column contains minimum 1 and maximum 5 different values.
>

You didn't show us your Postgres version.


>
> I have GIN index on my_array column:
>
> * "CREATE INDEX idx ON table_name USING GIN(my_array gin__int_ops)"*
>
> Then I use this query: "*SELECT * FROM  table_name WHERE my_array @>
> '{3}'::integer[]  ORDER BY id LIMIT 50"*
>
> Execution time of this query is approximately 500-1000 ms. Then if I drop
> gin index "*idx*", query works extremely fast, less than 20 ms.
>

EXPLAIN ANALYZE output would help us see the problem.



>
> But, if I search value, which does not exists at all, for example no one
> array not contains number "77" and I search: * "WHERE my_array @>
> '{77}'::integer[]" *, then using gin index is much better and fast, (less
> than 20 ms), but without index, query takes 500-1000 ms.
>
>
> So, what to do? For values which does not in any one rows, using index is
> much better, but for values,  which are at least in several rows, using
>  index, slows down performance.
>
> Can somehow make, that searching was always fast (when value exist in
> array and when not)
>
>
>


Re: [GENERAL] ZSON, PostgreSQL extension for compressing JSONB

2016-10-04 Thread Oleg Bartunov
On Tue, Oct 4, 2016 at 4:20 PM, Simon Riggs <si...@2ndquadrant.com> wrote:

> On 30 September 2016 at 16:58, Aleksander Alekseev
> <a.aleks...@postgrespro.ru> wrote:
>
> > I've just uploaded ZSON extension on GitHub:
> >
> > https://github.com/afiskon/zson
> >
> > ZSON learns on your common JSONB documents and creates a dictionary
> > with strings that are frequently used in all documents. After that you
> > can use ZSON type to compress documents using this dictionary. When
> > documents schema evolve and compression becomes inefficient you can
> > re-learn on new documents. New documents will be compressed with a new
> > dictionary, old documents will be decompressed using old dictionary.
> >
> > In some cases ZSON can save half of your disk space and give you about
> > 10% more TPS. Everything depends on your data and workload though.
> > Memory is saved as well. For more details see README.md.
> >
> > Please don't hesitate to ask any questions. Any feedback and pull
> > requests are welcome too!
>
> Very good. Oleg had mentioned that dictionary compression was being
> considered.
>
> It would be useful to be able to define compression dictionaries for
> many use cases.
>


Dictionary compression is a different project; we'll publish it after
testing.


>
> Will you be submitting this to core?
>
> --
> Simon Riggshttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


Re: [GENERAL] Results in ERROR: syntax error in tsquery:

2016-09-19 Thread Oleg Bartunov
On Mon, Sep 19, 2016 at 4:19 AM, Oleg Bartunov <obartu...@gmail.com> wrote:
> On Mon, Sep 19, 2016 at 3:46 AM, Kiran <bangalore.ki...@gmail.com> wrote:
>> Dear All,
>>
>> I have a ts_vector column in question table called weighted_tsv.
>> I am trying to search using ts_query as follows
>>
>> SELECT *
>>
>> FROM question
>>
>> WHERE weighted_tsv @@ to_tsquery('Hur ofta');
>
> you should use to_tsquery('Hur & ofta') ! Read
> https://www.postgresql.org/docs/9.5/static/datatype-textsearch.html

or use plainto_tsquery('Hur ofta')
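
That is, with the default text search configuration both forms produce the
same AND-ed query:

SELECT to_tsquery('Hur & ofta');       -- explicit & between lexemes
SELECT plainto_tsquery('Hur ofta');    -- parses plain text and ANDs the words
-- both yield a query like: 'hur' & 'ofta'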

>
>>
>>
>> But the query results in an error as follows:
>>
>> ERROR:  function ts_query(unknown) does not exist
>>
>> LINE 1: ...* from question where weighted_tsv @@ ts_query('...
>>
>> HINT:  No function matches the given name and argument types. You might need
>> to add explicit type casts.
>>
>>
>> What is the reason I am getting this error I am not able to figure it out.
>>
>> It would be of great help if anyone point me into right direction.
>>
>>
>> Thanks
>>
>> Kiran
>>
>>
>>
>>




Re: [GENERAL] Results in ERROR: syntax error in tsquery:

2016-09-19 Thread Oleg Bartunov
On Mon, Sep 19, 2016 at 3:46 AM, Kiran  wrote:
> Dear All,
>
> I have a ts_vector column in question table called weighted_tsv.
> I am trying to search using ts_query as follows
>
> SELECT *
>
> FROM question
>
> WHERE weighted_tsv @@ to_tsquery('Hur ofta');

You should use to_tsquery('Hur & ofta')! Read
https://www.postgresql.org/docs/9.5/static/datatype-textsearch.html

>
>
> But the query results in an error as follows:
>
> ERROR:  function ts_query(unknown) does not exist
>
> LINE 1: ...* from question where weighted_tsv @@ ts_query('...
>
> HINT:  No function matches the given name and argument types. You might need
> to add explicit type casts.
>
>
> What is the reason I am getting this error I am not able to figure it out.
>
> It would be of great help if anyone point me into right direction.
>
>
> Thanks
>
> Kiran
>
>
>
>




Re: [GENERAL] Predicting query runtime

2016-09-14 Thread Oleg Ivanov

Hi Vinicius,

I recommend reading this paper:
(http://www.doc.ic.ac.uk/~nb605/IO%20performance%20modeling%20research/Learning-based%20Query%20Performance%20-%202011.pdf)
The authors make a nice classification of different query performance
prediction methods and propose their own solution to this problem.


You can also read (http://www.vldb.org/pvldb/vol9/p204-leis.pdf) to be
warned about possible pitfalls in the PostgreSQL query optimizer. In my
opinion, the most unpleasant one is that you often cannot rely on the
cardinality estimations PostgreSQL makes for path nodes. Typically, the more
complicated the query is, the less reliable the cardinality estimations
become. The good news is that the cost model allows query execution time to
be predicted precisely enough when the cardinality estimations are good.


The paper (http://pages.cs.wisc.edu/~wentaowu/papers/prediction-full.pdf)
involves no machine learning. Nevertheless, you may find it interesting: it
contains a good description of the PostgreSQL cost model and a method for
automatic cost calibration (similar to the one proposed by Jeff in this thread).
The issue with the calibration is as follows: the multipliers for each
*_cost factor are not provided, or even directly computed, in PostgreSQL
for the majority of path nodes. The typical computation is not,
for example, total_cost = 10 * seq_page_cost + 25 * random_page_cost,
but total_cost = 10 * (seq_page_cost + 2 * random_page_cost) + 10 *
(random_page_cost / 2). Mathematically these formulas are equivalent,
but in practice you will spend more time and write more code to extract
the multipliers in the second case.
In the above paper the authors decided to calibrate costs using only those
nodes for which the computations are not very complicated and,
therefore, the multipliers can be extracted relatively easily. Anyway,
the cost models live in src/backend/optimizer/path/costsize.c, and
you will have to get inside that file somehow to obtain extra information.
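
For reference, these are the kinds of settings such a calibration adjusts
(shown with their stock defaults; just an illustration):

SHOW seq_page_cost;               -- 1.0 by default
SHOW random_page_cost;            -- 4.0 by default
SET cpu_tuple_cost = 0.01;        -- per-tuple CPU cost factor
SET cpu_operator_cost = 0.0025;   -- per-operator/function cost factor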


To me, the paper
(http://2014.eswc-conferences.org/sites/default/files/eswc2014pd_submission_30.pdf)
is interesting mostly for its introduction of graph edit distance as
a distance on the space of paths. That matters because some machine
learning methods do not require feature representations of objects, only a
valid distance function on each pair of them.


The paper (http://www.vldb.org/pvldb/vol6/p925-wu.pdf) is about
predicting query execution time for concurrent workloads and also
uses machine learning.


I hope the listed papers will be useful for your master's thesis.

The post related to (https://pgconf.ru/en/2016/89977) is available here
(http://tigvarts.livejournal.com/691.html). Please note that it was
published in February 2016, so the information in it is partially outdated:
some main principles changed during my work, some issues for further
research are now closed, while other issues appeared. I believe I will have
a paper on my current results completed in early October.


--
Oleg Ivanov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

On 09/12/2016 05:03 PM, Vinicius Segalin wrote:

Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be 
extremely precise). I've been reading some papers about it, and people 
are using machine learning to do so. For the feature vector, they use 
what the DBMS's query planner provide, such as operators and their 
cost. The thing is that I haven't found any work using PostgreSQL, so 
I'm struggling to adapt it.
My question is if anyone is aware of a work that uses machine learning 
and PostgreSQL to predict query runtime, or maybe some other method to 
perform this.


Thank you.

Best regards,

Vinicius Segalin




Re: [GENERAL] Predicting query runtime

2016-09-13 Thread Oleg Bartunov
On Tue, Sep 13, 2016 at 2:54 PM, Istvan Soos  wrote:
> On Tue, Sep 13, 2016 at 2:06 AM, Vinicius Segalin  
> wrote:
>> 2016-09-12 18:22 GMT-03:00 Istvan Soos :
>>> At Heap we have non-trivial complexity in our analytical queries, and
>>> some of them can take a long time to complete. We did analyze features
>>> like the query planner's output, our query properties (type,
>>> parameters, complexity) and tried to automatically identify factors
>>> that contribute the most into the total query time. It turns out that
>>> you don't need to use machine learning for the basics, but at this
>>> point we were not aiming for predictions yet.
>>
>> And how did you do that? Manually analyzing some queries?
>
> In this case, it was automatic analysis and feature discovery. We were
> generating features out of our query parameters, out of the SQL
> string, and also out of the explain analyze output. For each of these
> features, we have examined the P(query is slow | feature is present),
> and measured its statistical properties (precision, recall,
> correlations...).
>
> With these we have built a decision tree-based partitioning, where our
> feature-predicates divided the queries into subsets. Such a tree could
> be used for predictions, or if we would like to be fancy, we could use
> the feature vectors to train a neural network.

FYI, please check https://pgconf.ru/2016/89977

>
> Hope this helps for now,
>   Istvan
>
>




[GENERAL] Fwd: [Snowball-discuss] Greek stemmer

2016-08-25 Thread Oleg Bartunov
This is a chance to add a default configuration for the Greek language, if
somebody with good knowledge of it could follow this development.

Oleg

-- Forwarded message --
From: Oleg Smirnov <oleg.smir...@gmail.com>
Date: Thu, Aug 25, 2016 at 5:26 PM
Subject: [Snowball-discuss] Greek stemmer
To: "snowball-discu." <snowball-disc...@lists.tartarus.org>


Hi all,

 I have implemented a stemmer for the Modern Greek language [1] based on a
thesis by G. Ntais [2] with improvements proposed by S. Saroukos [3]

 I'm pretty new to Snowball so it will be great if someone could
review my code. Any feedback is much appreciated.

 1. https://github.com/snowballstem/snowball/pull/44
 2. http://sais.se/mthprize/2007/ntais2007.pdf
 3. http://tampub.uta.fi/bitstream/handle/10024/80480/gradu03463.pdf

--
Regards,
Oleg Smirnov

___
Snowball-discuss mailing list
snowball-disc...@lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss




Re: [GENERAL] Updated RUM-index and support for bigint as part of index

2016-08-25 Thread Oleg Bartunov
Andreas,

Sorry for the delay. It looks like a bug to me. Could you please share your
dataset with me, so I can reproduce the behaviour?

Regards,
Oleg

On Sun, Aug 7, 2016 at 11:05 AM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På søndag 07. august 2016 kl. 08:27:06, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
> [snip]
> have you considered <=| and |=> operators ? <=> in ORDER BY works like KNN.
>
>
> I don't get how these operators should work. Neither give me the expected
> results.
>
> Using <=>
>
>
> SELECT del.entity_id,
>        del.folder_id,
>        del.received_timestamp
> FROM origo_email_delivery del
> WHERE del.fts_all @@ to_tsquery('simple', 'andreas:*:*')
> ORDER BY '2000-01-01'::TIMESTAMP <=> del.received_timestamp
> LIMIT 10;
>
>  entity_id | folder_id |   received_timestamp
> -----------+-----------+-------------------------
>   1224278 |   1068087 | 2015-08-17 23:53:26
>   1224382 |   1068087 | 2015-08-18 03:07:55
>   1224404 |   1068087 | 2015-08-18 03:49:02
>   1505713 | 48496 | 2015-10-27 14:51:45
>142132 | 66658 | 2012-12-03 14:14:05.488
>122565 | 90115 | 2012-11-20 15:41:04.936
>200744 | 66655 | 2013-01-28 21:47:44.561
>   1445927 |888665 | 2015-09-29 00:26:56
>123671 | 83509 | 2012-11-21 14:16:26.448
>   1129928 | 66658 | 2015-05-09 08:39:14.128
> (10 rows)
>
>
> Using <=|
>
> SELECT del.entity_id,
>        del.folder_id,
>        del.received_timestamp
> FROM origo_email_delivery del
> WHERE del.fts_all @@ to_tsquery('simple', 'andreas:*:*')
> ORDER BY '2000-01-01'::TIMESTAMP <=| del.received_timestamp
> LIMIT 10;
>
>
>  entity_id | folder_id |   received_timestamp
> -----------+-----------+-------------------------
>   1224278 |   1068087 | 2015-08-17 23:53:26
>   1224382 |   1068087 | 2015-08-18 03:07:55
>   1224404 |   1068087 | 2015-08-18 03:49:02
>   1505713 | 48496 | 2015-10-27 14:51:45
>142132 | 66658 | 2012-12-03 14:14:05.488
>122565 | 90115 | 2012-11-20 15:41:04.936
>200744 | 66655 | 2013-01-28 21:47:44.561
>   1445927 |888665 | 2015-09-29 00:26:56
>123671 | 83509 | 2012-11-21 14:16:26.448
>   1129928 | 66658 | 2015-05-09 08:39:14.128
> (10 rows)
>
>
> Neither are ordered by received_timestamp
>
> Can you explain how to get ORDER BY received_timestamp DESC?
>
> Thanks.
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
>
>


Re: [GENERAL] Updated RUM-index and support for bigint as part of index

2016-08-07 Thread Oleg Bartunov
On Tue, Aug 2, 2016 at 9:08 PM, Andreas Joseph Krogh 
wrote:

> Hi.
>
> I see the RUM-index is updated, which is great!
>
> I wonder, to be able to sort by timestamp one has to create the index like
> this:
>
>
> CREATE INDEX rumidx ON origo_email_delivery
>     USING rum (fts_all rum_tsvector_timestamp_ops, received_timestamp)
>     WITH (attach = 'received_timestamp', TO = 'fts_all', order_by_attach = TRUE);
>
> Then, to be able to use the index for sorting by the
> "received_timestamp"-column one has to issue a query like this:
>
> EXPLAIN ANALYZE SELECT del.entity_id,
> del.subject,
> del.received_timestamp,
> fts_all <=> to_tsquery('simple', 'andreas') AS rank
> FROM origo_email_delivery del
> WHERE del.fts_all @@ to_tsquery('simple', 'andreas')
> ORDER BY '2000-01-01' :: TIMESTAMP <=> del.received_timestamp
> LIMIT 10;
>
>                                   QUERY PLAN
> --------------------------------------------------------------------------
>  Limit  (cost=14.40..26.47 rows=10 width=89) (actual time=10.908..10.952 
> rows=10 loops=1)
>->  Index Scan using rumidx on origo_email_delivery del  
> (cost=14.40..3221.22 rows=2657 width=89) (actual time=10.906..10.947 rows=10 
> loops=1)
>  Index Cond: (fts_all @@ '''andreas'' & ''kr'''::tsquery)
>  Order By: (received_timestamp <=> '2000-01-01 00:00:00'::timestamp 
> without time zone)
>  Planning time: 0.491 ms
>  Execution time: 11.010 ms
> (6 rows)
>
>
> The ORDER BY part seems strange; It seems one has to find a value
> "lower than any other value" to use as a kind of base, why is this
> necessary? It also seems that in order to be able to sort DESC one has to
> provide a timestamp value "higher than any other value", is this correct?
>

have you considered <=| and |=> operators ? <=> in ORDER BY works like KNN.


>
> It would be great if the docs explained this.
>
> I really miss the opportunity to include a BIGINT as part of the index, so
> that the WHERE-clause could be like this:
>
> WHERE del.fts_all @@ to_tsquery('simple', 'andreas') AND del.folder_id IN
> (1,2,3)
>
> Having this would be perfect for my use-case searching in email in
> folders, sorted by received_date, and having it use ONE index.
>
> Will this be supported?
>
> Thanks.
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>


Re: [GENERAL] jsonb search

2016-06-28 Thread Oleg Bartunov
On Tue, Jun 28, 2016 at 5:15 PM, Armand Pirvu (home)
 wrote:
> Hi
>
> In my quest of JSONB querying and searching without having to actually cast
> into a text, I found JSQuery
>
> I do admit my JSONB knowledge shortcoming and I am not a developer but a
> DBA. As such some examples would be greatly appreciated since I tend to
> understand better
>
> I compiled and installed the extension
>
> 1 - Exact matching without knowing the hierarchy, just the key and element,
> I built a set like
>
>  col1 |                        col2
> ------+-----------------------------------------------------
>     1 | {"Home Email": {"EmailAddress": "1...@yahoo.com"}}
>     2 | {"Home Email": {"EmailAddress": "2...@yahoo.com"}}
>     3 | {"Home Email": {"EmailAddress": "3...@yahoo.com"}}
>
>
> JSQuqery is super
>
> SELECT * FROM test1 WHERE col2 @@ '*.EmailAddress = "1...@yahoo.com"';
>
> Now I can do a performance boost using
>
> CREATE INDEX idx1 ON test1 USING GIN (col2 jsonb_value_path_ops);
>
> I see this yield
>
> from
>
> testdb=# explain analyze SELECT * FROM test1 WHERE col2 @@ '*.EmailAddress
> = "1...@yahoo.com"';
> Seq Scan on test1  (cost=0.00..12423.00 rows=500 width=68) (actual
> time=0.016..160.777 rows=1 loops=1)
>   Filter: (col2 @@ '*."EmailAddress" = "1...@yahoo.com"'::jsquery)
>   Rows Removed by Filter: 49
> Planning time: 0.042 ms
> Execution time: 160.799 ms
> (5 rows)
>
>
> to
>
> testdb-# SELECT * FROM test1 WHERE col2 @@ '*.EmailAddress =
> "1...@yahoo.com"';
> Bitmap Heap Scan on test1  (cost=31.88..1559.32 rows=500 width=68) (actual
> time=0.018..0.019 rows=1 loops=1)
>   Recheck Cond: (col2 @@ '*."EmailAddress" = "1...@yahoo.com"'::jsquery)
>   Heap Blocks: exact=1
>   ->  Bitmap Index Scan on idx1  (cost=0.00..31.75 rows=500 width=0) (actual
> time=0.011..0.011 rows=1 loops=1)
> Index Cond: (col2 @@ '*."EmailAddress" = "1...@yahoo.com"'::jsquery)
> Planning time: 0.039 ms
> Execution time: 0.038 ms
> (7 rows)
>
> A whooping 4000 times improvement
>
>
>
>
> But I also noticed a vodka index
>
>
> testdb=# CREATE INDEX idx2 ON
> testdb-# test1 USING vodka (col2);
> ERROR:  access method "vodka" does not exist
>
> What am I missing ?
>
> 2 - Is there anyway I can accomplish a pattern and/or case insensitive
> search using JSQuery similar to
>
>
> select * from test2 where upper((col2 -> 'Home Email') ->> 'EmailAddress')
> ilike '%3%YAH%';
>
> select * from test2 where (col2 -> 'Home Email') ->> 'EmailAddress' like
> '%3%yah%';
>
>
> If so what indexing strategy can be used to have similar gains as above ?
>
>
> Many thanks for any help

Vodka is our experimental prototype of a next-generation access method,
and it doesn't exist in production-ready form. You can
check our presentation
http://www.sai.msu.su/~megera/postgres/talks/highload-2014-vodka.pdf
to understand jsquery's limitations and why we stopped its development.
Also, 2 years ago I wrote (in Russian)
http://obartunov.livejournal.com/179422.html about the jsonb query
language and our plans. Google Translate might help:

https://translate.google.com/translate?sl=auto&tl=en&u=http%3A%2F%2Fobartunov.livejournal.com%2F179422.html


>
>
> Armand
>




Re: [GENERAL] questions about how to implement a gist index

2016-06-27 Thread Oleg Bartunov
On Tue, Jun 28, 2016 at 12:44 AM, Riccardo Vianello
 wrote:
> Hi all,
>
> I'm trying to contribute some improvements to the implementation of a gist
> index that is part of an open source project and it would be really nice if
> anyone could help me answer some questions.
>
> I would like to use different data structures to represent the internal and
> leaf entries. I think I have found several examples in the postgresql source
> code where a variable length array is used to pack different data
> consistently with the value of a "type" attribute. Is this the suggested way
> to proceed, or would it be also possible to use two actually different data
> structures? I at first considered this latter approach, but some erroneous
> behavior suggested that maybe I am not fully clear about which data
> structure should be produced or expected by some parts of the
> implementation.
>
> Could you please also help me understand the difference (if any) between
> using the GIST_LEAF macro or the leafkey attribute of the GISTENTRY data
> structure?

Yes, this is confusing.

GIST_LEAF is TRUE if the key is on a leaf page.

The bool leafkey, if TRUE, indicates that the key contains a value from the
heap.

>
> Thank you very much in advance,
>
> Riccardo
>
>




Re: [GENERAL] Updated RUM?

2016-06-27 Thread Oleg Bartunov
On Fri, Jun 24, 2016 at 2:36 PM, Andreas Joseph Krogh 
wrote:

> Hi.
>
> @PostgresPro; Will there be any updates to the rum git-repo any time soon?
>

Sure.


>
> I understand you're working hard on this, care to share some
> vision/roadmap? I'm specifically interested in if RUM will support the same
> datatypes as GIN supports (arrays, JSON etc.), and what opclasses you plan
> to implement (for equalty and sorting). Will there be some kind of
> btree_rum like there is for btree_gin, to add RUM-support for the types
> there exists Btree-support for?
>

We are testing the fts+timestamp combination right now. The challenge is to
have the posting list/tree sorted by addinfo (the timestamp), not by item
pointer. It works already, but we are not quite satisfied with the generated
WAL traffic and are researching some options for improving posting tree
building and generic WAL.

We have in mind some sort of generalization like anyarray, but don't have
any concrete ideas yet.



>
> Thanks.
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>


Re: [GENERAL] Question about RUM-index

2016-06-17 Thread Oleg Bartunov
On Fri, Jun 17, 2016 at 4:50 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På fredag 17. juni 2016 kl. 15:47:08, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
>
>
> On Fri, Jun 17, 2016 at 3:52 PM, Andreas Joseph Krogh <andr...@visena.com>
> wrote:
>>
>> På fredag 17. juni 2016 kl. 13:53:34, skrev Oleg Bartunov <
>> obartu...@gmail.com>:
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:10 PM, Oleg Bartunov <obartu...@gmail.com>
>> wrote:
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 9:32 AM, Andreas Joseph Krogh <
>>> andr...@visena.com> wrote:
>>>>
>>>> På torsdag 16. juni 2016 kl. 00:50:45, skrev Jeff Janes <
>>>> jeff.ja...@gmail.com>:
>>>>
>>>> On Wed, Jun 15, 2016 at 3:56 AM, Andreas Joseph Krogh <
>>>> andr...@visena.com> wrote:
>>>>>
>>>>> Hi.
>>>>>
>>>>> First; Is this the correct forum to ask questions about the Postgres
>>>>> Pro's new RUM-index?
>>>>>
>>>>> If not, please point me to the right forum.
>>>>>
>>>>
>>>> I think that https://github.com/postgrespro/rum/issues might be the
>>>> best forum.
>>>>
>>>>
>>>> Oleg and friends; Should we use GitHub-issues as forum (one issue per
>>>> question/thread?), pgsql-general or something else?
>>>>
>>>
>>> Andreas,
>>>
>>> we are hardly working on our internal version of rum and will open it
>>> after resolving some issues. I think the best place to discuss it is
>>> -hackers.
>>>
>>
>> Ah, as someone corrected me, we are working hard !
>>
>>
>> He he, I figured that was what you meant:-)
>>
>> Ok, so basically - will RUM-index support the same indexing-properties as
>> GIN (being able to index tsvector, BIGINT-arrays, JSONB etc.) *and* be
>> able to use index for sorting on ie. timestamp, tsrank or some
>> BIGINT-column?
>>
>> Like my example, will it be possible to issue a query like this:
>>
>>
>> SELECT del.id, del.sent
>> FROM delivery del
>> WHERE 1 = 1
>>   AND del.fts_all @@ to_tsquery('simple', 'hi:*')
>>   AND del.folder_id = ANY(ARRAY[2,3]::BIGINT[])
>> ORDER BY del.sent DESC LIMIT 101 OFFSET 0;
>>
>>
>> and have it use *one* RUM-index so the whole operation, including
>> sorting, is as efficient as possible?
>>
>>
>
> we have implementation for timestamp. One need to write opclass to deal
> with arrays, it shouldn't be difficult.
>
>
> Will the opclass dealing with bigint-arrays also handle the single-element
> case, that is only one bigint?
>

That is up to the opclass author.



>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> <https://www.visena.com>
>
>


Re: [GENERAL] Question about RUM-index

2016-06-17 Thread Oleg Bartunov
On Fri, Jun 17, 2016 at 3:52 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På fredag 17. juni 2016 kl. 13:53:34, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
>
>
> On Fri, Jun 17, 2016 at 2:10 PM, Oleg Bartunov <obartu...@gmail.com>
> wrote:
>>
>>
>>
>> On Fri, Jun 17, 2016 at 9:32 AM, Andreas Joseph Krogh <andr...@visena.com
>> > wrote:
>>>
>>> På torsdag 16. juni 2016 kl. 00:50:45, skrev Jeff Janes <
>>> jeff.ja...@gmail.com>:
>>>
>>> On Wed, Jun 15, 2016 at 3:56 AM, Andreas Joseph Krogh <
>>> andr...@visena.com> wrote:
>>>>
>>>> Hi.
>>>>
>>>> First; Is this the correct forum to ask questions about the Postgres
>>>> Pro's new RUM-index?
>>>>
>>>> If not, please point me to the right forum.
>>>>
>>>
>>> I think that https://github.com/postgrespro/rum/issues might be the
>>> best forum.
>>>
>>>
>>> Oleg and friends; Should we use GitHub-issues as forum (one issue per
>>> question/thread?), pgsql-general or something else?
>>>
>>
>> Andreas,
>>
>> we are hardly working on our internal version of rum and will open it
>> after resolving some issues. I think the best place to discuss it is
>> -hackers.
>>
>
> Ah, as someone corrected me, we are working hard !
>
>
> He he, I figured that was what you meant:-)
>
> Ok, so basically - will RUM-index support the same indexing-properties as
> GIN (being able to index tsvector, BIGINT-arrays, JSONB etc.) *and* be
> able to use index for sorting on ie. timestamp, tsrank or some
> BIGINT-column?
>
> Like my example, will it be possible to issue a query like this:
>
>
> SELECT del.id, del.sent
> FROM delivery del
> WHERE 1 = 1
>   AND del.fts_all @@ to_tsquery('simple', 'hi:*')
>   AND del.folder_id = ANY(ARRAY[2,3]::BIGINT[])
> ORDER BY del.sent DESC LIMIT 101 OFFSET 0;
>
>
> and have it use *one* RUM-index so the whole operation, including
> sorting, is as efficient as possible?
>
>

We have an implementation for timestamp. One needs to write an opclass to
deal with arrays; it shouldn't be difficult.



> Thanks.
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
>
>


Re: [GENERAL] Question about RUM-index

2016-06-17 Thread Oleg Bartunov
On Fri, Jun 17, 2016 at 2:10 PM, Oleg Bartunov <obartu...@gmail.com> wrote:

>
>
> On Fri, Jun 17, 2016 at 9:32 AM, Andreas Joseph Krogh <andr...@visena.com>
> wrote:
>
>> På torsdag 16. juni 2016 kl. 00:50:45, skrev Jeff Janes <
>> jeff.ja...@gmail.com>:
>>
>> On Wed, Jun 15, 2016 at 3:56 AM, Andreas Joseph Krogh <andr...@visena.com
>> > wrote:
>>>
>>> Hi.
>>>
>>> First; Is this the correct forum to ask questions about the Postgres
>>> Pro's new RUM-index?
>>>
>>> If not, please point me to the right forum.
>>>
>>
>> I think that https://github.com/postgrespro/rum/issues might be the best
>> forum.
>>
>>
>> Oleg and friends; Should we use GitHub-issues as forum (one issue per
>> question/thread?), pgsql-general or something else?
>>
>
> Andreas,
>
> we are hardly working on our internal version of rum and will open it
> after resolving some issues. I think the best place to discuss it is
> -hackers.
>

Ah, as someone corrected me, we are working hard !



>
>
>
>>
>>
>> Note that GIN does almost what I want, except use the index when sorting
>>> by "sent"-timestamp.
>>>
>>> So I wonder if RUM can do any better?
>>> What I don't understand is how to have "folder_id" as part of the
>>> RUM-index so that I can search in *an array* of folders using the
>>> index, *AND* have the whole result sorted by "sent"-timestamp also
>>> using the RUM-index.
>>>
>>
>> I think you would have to implement an operator for integers for RUM much
>> like btree_gin does for GIN.  Sorry don't know how to do that, except to
>> say look in the RUM code to see how it does it for time-stamps.
>>
>>
>>>
>>> In the (limited) documentation sorting using timestamp is done like this:
>>>
>>> ORDER BY sent <-> '2000-01-01'::TIMESTAMP
>>>
>>> which I don't understand; Why must one specify a value here, and how
>>> does that value affect the result?
>>>
>>
>>
>> This is essentially identical to ORDER BY ABS(sent -
>> '2000-01-01'::TIMESTAMP);  except it can use the index.
>>
>> So maybe pick a constant outside the range of possible values, and use
>> that as one argument to <->.
>>
>>
>> This should be unnecessary and hidden from the user. Maybe some "ORDER BY
>> rum_timestamp(sent)" or something could abstract away stuff to make it much
>> clearer to the user?
>> --
>> *Andreas Joseph Krogh*
>> CTO / Partner - Visena AS
>> Mobile: +47 909 56 963
>> andr...@visena.com
>> www.visena.com
>>
>>
>
>


Re: [GENERAL] Question about RUM-index

2016-06-17 Thread Oleg Bartunov
On Fri, Jun 17, 2016 at 9:32 AM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På torsdag 16. juni 2016 kl. 00:50:45, skrev Jeff Janes <
> jeff.ja...@gmail.com>:
>
> On Wed, Jun 15, 2016 at 3:56 AM, Andreas Joseph Krogh <andr...@visena.com>
> wrote:
>>
>> Hi.
>>
>> First; Is this the correct forum to ask questions about the Postgres
>> Pro's new RUM-index?
>>
>> If not, please point me to the right forum.
>>
>
> I think that https://github.com/postgrespro/rum/issues might be the best
> forum.
>
>
> Oleg and friends; Should we use GitHub-issues as forum (one issue per
> question/thread?), pgsql-general or something else?
>

Andreas,

we are hardly working on our internal version of rum and will open it after
resolving some issues. I think the best place to discuss it is -hackers.



>
>
> Note that GIN does almost what I want, except use the index when sorting
>> by "sent"-timestamp.
>>
>> So I wonder if RUM can do any better?
>> What I don't understand is how to have "folder_id" as part of the
>> RUM-index so that I can search in *an array* of folders using the index,
>> *AND* have the whole result sorted by "sent"-timestamp also using the
>> RUM-index.
>>
>
> I think you would have to implement an operator for integers for RUM much
> like btree_gin does for GIN.  Sorry don't know how to do that, except to
> say look in the RUM code to see how it does it for time-stamps.
>
>
>>
>> In the (limited) documentation sorting using timestamp is done like this:
>>
>> ORDER BY sent <-> '2000-01-01'::TIMESTAMP
>>
>> which I don't understand; Why must one specify a value here, and how does
>> that value affect the result?
>>
>
>
> This is essentially identical to ORDER BY ABS(sent -
> '2000-01-01'::TIMESTAMP);  except it can use the index.
>
> So maybe pick a constant outside the range of possible values, and use
> that as one argument to <->.
>
>
> This should be unnecessary and hidden from the user. Maybe some "ORDER BY
> rum_timestamp(sent)" or something could abstract away stuff to make it much
> clearer to the user?
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-31 Thread Oleg Bartunov
On Sun, May 29, 2016 at 8:53 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På søndag 29. mai 2016 kl. 19:49:06, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
> [snip]
>>
>> I want to run 9.6 beta in production right now because of this:-)
>>
>
> wait-wait :)  We'd be happy to have feedback from production, of course,
> but please, wait a bit. We are adding support of sorting posting list/tree
> not by item pointer as in gin, but make use of additional information, for
> example, timestamp, which will provide additional speedup to the existing
> one.
>
>
> Awesome!
>
>
>
> Also, we are sure there are some bugs :)
>
>
> He he, I reported 1st issue: https://github.com/postgrespro/rum/issues/1
>
> Would be cool to see this fixed so I actually could have a sip of the
> rum:-)
>


It's not easy to fix this. We don't want rum to depend on btree_gin, so
probably the easiest way is to have a separate operator <=> in rum.

>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-30 Thread Oleg Bartunov
On Sun, May 29, 2016 at 12:59 AM, Oleg Bartunov <obartu...@gmail.com> wrote:

>
>
> On Thu, May 26, 2016 at 11:26 PM, Andreas Joseph Krogh <andr...@visena.com
> > wrote:
>
>> Hi.
>>
>> Any news about when slides for $subject will be available?
>>
>
> I submitted slides to pgcon site, but it usually takes awhile, so you can
> download our presentation directly
> http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf
>
>
Please download the new version of the slides. I added CREATE INDEX
commands to the examples.



> There are some missing features in rum index, but I hope we'll update
> github repository really soon.
>
>
>>
>> --
>> *Andreas Joseph Krogh*
>> CTO / Partner - Visena AS
>> Mobile: +47 909 56 963
>> andr...@visena.com
>> www.visena.com
>>
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-30 Thread Oleg Bartunov
On Sun, May 29, 2016 at 10:04 PM, Karsten Hilbert
 wrote:
>>> I submitted slides to pgcon site, but it usually takes awhile, so you can
>>> download our presentation directly
>>> http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf
>
> Looking at slide 39 (attached) I get the impression that I
> should be able to do the following:
>
>
> - turn a coding system (say, ICD-10) into a dictionary
>   by splitting the terms into single words
>
> say, "diabetes mellitus -> "diabetes", "mellitus"
>
> - define stop words like "left", "right", ...
>
> say, "fracture left ulna" -> the "left" doesn't
> matter as far as coding is concerned
>
> - also turn that coding system into queries by splitting
>   the terms into single words, concatenating them
>   with "&", and setting the ICD 10 code as tag on them
>
> say, "diabetes mellitus" -> "diabetes & mellitus [E11]"
>
> - run an inverse FTS (FQS) against a user supplied string
>   thereby finding queries (= tags = ICD10 codes) likely
>   relevant to the input
>
> say, to_tsvector("patient was suspected to suffer from diabetes 
> mellitus")
> -> tag = E11
>
>
> Possible, not possible, insane, unintended use ?

Why not? It's the same kind of usage as on slide #39.

create table icd10 (q tsquery, code text);
insert into icd10 values(to_tsquery('diabetes & mellitus'), '[E11]');
select * from icd10 where to_tsvector('patient was suspected to suffer
from diabetes mellitus') @@ q;
           q           | code
-----------------------+-------
 'diabet' & 'mellitus' | [E11]
(1 row)



>
> Thanks,
> Karsten
> --
> GPG key ID E4071346 @ eu.pool.sks-keyservers.net
> E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346
>
>




Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-29 Thread Oleg Bartunov
On Sun, May 29, 2016 at 2:43 PM, Stefan Keller <sfkel...@gmail.com> wrote:

> Hi,
>
> Nice work from you postgrespro.ru guys! Especially the RUM index which
> demonstrates the power of 9.6 to let third party SW create access methods
> as extension: https://github.com/postgrespro/rum
>
> 1. I don't understand the benchmarks on slide 25 "20 mln descriptions"
> (and the one before "6.7 mln classifieds"): What does "Queries in 8 h 9.2
> +patch (9.6 rum)" mean?
>

We ran queries for 8 hours and recorded the number of executed queries.
Four years ago, when Alexander and I developed an initial version of the
patch, we got the results marked "9.2+patch"; now we ran the same queries on
the same database and put the rum results in parentheses. I wouldn't read
too much into these numbers, since we used queries from the 6 mln database.
We'd be happy if somebody ran independent benchmarks.


>
> 2. What does R-U-M mean? (can't mean "Range Usage Metadata" which was
> finally coined range index BRIN)?
>


We chose RUM just because there are GIN and VODKA :) But some people have
already suggested several meanings, like Really Useful iMdex :)  We are open
to suggestions.


>
> :Stefan, co-organizer of Swiss PGDay
>
>
> 2016-05-29 11:29 GMT+02:00 Andreas Joseph Krogh <andr...@visena.com>:
>
>> På lørdag 28. mai 2016 kl. 23:59:55, skrev Oleg Bartunov <
>> obartu...@gmail.com>:
>>
>>
>>
>> On Thu, May 26, 2016 at 11:26 PM, Andreas Joseph Krogh <
>> andr...@visena.com> wrote:
>>>
>>> Hi.
>>>
>>> Any news about when slides for $subject will be available?
>>>
>>
>> I submitted slides to pgcon site, but it usually takes awhile, so you can
>> download our presentation directly
>> http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf
>>
>> There are some missing features in rum index, but I hope we'll update
>> github repository really soon.
>>
>>
>> This is simply amazing!
>>
>> I want to run 9.6 beta in production right now because of this:-)
>>
>> Hats off guys, congrats to PostgresPro, and huge thanks!!
>>
>> --
>> *Andreas Joseph Krogh*
>> CTO / Partner - Visena AS
>> Mobile: +47 909 56 963
>> andr...@visena.com
>> www.visena.com
>>
>>
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-29 Thread Oleg Bartunov
On Sun, May 29, 2016 at 12:29 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> På lørdag 28. mai 2016 kl. 23:59:55, skrev Oleg Bartunov <
> obartu...@gmail.com>:
>
>
>
> On Thu, May 26, 2016 at 11:26 PM, Andreas Joseph Krogh <andr...@visena.com
> > wrote:
>>
>> Hi.
>>
>> Any news about when slides for $subject will be available?
>>
>
> I submitted slides to pgcon site, but it usually takes awhile, so you can
> download our presentation directly
> http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf
>
> There are some missing features in rum index, but I hope we'll update
> github repository really soon.
>
>
> This is simply amazing!
>
> I want to run 9.6 beta in production right now because of this:-)
>

wait-wait :)  We'd be happy to have feedback from production, of course,
but please, wait a bit. We are adding support for sorting the posting
list/tree not by item pointer, as in gin, but by additional information, for
example a timestamp, which will provide additional speedup over the existing
one. Also, we are sure there are some bugs :)


>
> Hats off guys, congrats to PostgresPro, and huge thanks!!
>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
>
>


Re: [GENERAL] Slides for PGCon2016; "FTS is dead ? Long live FTS !"

2016-05-28 Thread Oleg Bartunov
On Thu, May 26, 2016 at 11:26 PM, Andreas Joseph Krogh 
wrote:

> Hi.
>
> Any news about when slides for $subject will be available?
>

I submitted the slides to the PGCon site, but it usually takes a while, so
you can download our presentation directly:
http://www.sai.msu.su/~megera/postgres/talks/pgcon-2016-fts.pdf

There are some missing features in the rum index, but I hope we'll update
the GitHub repository really soon.


>
> --
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com
> 
>


Re: [GENERAL] Ascii Elephant for text based protocols - Final

2016-05-16 Thread Oleg Bartunov
On Mon, May 16, 2016 at 2:47 PM, Charles Clavadetscher
 wrote:
> SELECT row_dat FROM elephant ORDER BY row_num;


Very good ! Thanks !




Re: [GENERAL] Insert only table and size of GIN index JSONB field.

2016-05-03 Thread Oleg Bartunov
On Tue, May 3, 2016 at 1:35 PM, Maxim Boguk  wrote:

>
>
> On Tue, May 3, 2016 at 8:09 PM, Maxim Boguk  wrote:
>
>> Hi,
>>
>> I started with empty table with index over
>>  custom_fields | jsonb
>> field
>> defined as:
>> "idx_learners_custom_fields" gin (custom_fields)
>> Globally gin_pending_list_limit set to 2MB.
>> Database version is 9.5.2.
>>
>> Now question:
>> If table populated with 1M records in single transaction then the final
>> size of the GIN index is:
>> 4265 MB
>> but after I performed reindex index idx_learners_custom_fields;
>> the index size had been reduced 15x to 295 MB.
>>
>> Is this behavior expected?
>>
>
> It looks easy to reproduce.
> I can send a sample dataset for analyze.
>

sample data are always welcome !
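
One hedged observation in the meantime: a GIN index populated row by row
through the pending list ends up far less packed than one built in bulk,
which is why REINDEX shrinks it so much. For an initial load it may be
cheaper to build the index after the data, e.g.:

DROP INDEX test_gin_key;
-- ... bulk INSERT into test ...
CREATE INDEX test_gin_key ON test USING GIN (custom_fields);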



>
> drop table if exists test;
> create table test (custom_fields jsonb);
> create index test_gin_key on test USING GIN(custom_fields);
> insert into test select custom_fields from public.learners;
> INSERT 0 100
> \di+ test_gin_key
> List of relations
>  Schema | Name | Type  |  Owner   | Table |  Size   | Description
> +--+---+--+---+-+-
>  public | test_gin_key | index | postgres | test  | 4211 MB |
>
> reindex index test_gin_key;
>
> \di+ test_gin_key
> List of relations
>  Schema | Name | Type  |  Owner   | Table |  Size  | Description
> +--+---+--+---++-
>  public | test_gin_key | index | postgres | test  | 295 MB |
>
>
> ​
>
>
> --
> Maxim Boguk
> Senior Postgresql DBA
> http://www.postgresql-consulting.com/
>
> Phone RU: +7 910 405 4718
> Phone AU: +61 45 218 5678
>
> LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
> Skype: maxim.boguk
> Jabber: maxim.bo...@gmail.com
>
> "People problems are solved with people.
> If people cannot solve the problem, try technology.
> People will then wish they'd listened at the first stage."
>
>


Re: [GENERAL] postgresql & Fulltext & ranking & my own functions

2016-05-03 Thread Oleg Bartunov
On Tue, May 3, 2016 at 3:21 PM, Nicolas Paris  wrote:

> Hello,
>
> Documentation says : (
> http://www.postgresql.org/docs/9.5/static/textsearch-controls.html#TEXTSEARCH-RANKING
> )
> "The built-in ranking functions are only examples. You can write your own
> ranking functions and/or combine their results with additional factors to
> fit your specific needs."
>
> The build-in seems not enough for my use case. How to write my own ranking
> functions ?
>
> I would like to create something like  ts_rank_cd( tsv, 'foo & bar',
> my_new_normalization_integer)
>
>
you can look on src/backend/utils/adt/tsrank.c and write your own function.
google can help
https://www.google.ru/search?q=how+to+write+c+function+postgresql
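
Before dropping to C, a thin SQL wrapper is often enough for prototyping -
a hedged sketch, where my_rank, docs, tsv, title and the boost factor are
all hypothetical:

CREATE FUNCTION my_rank(tsv tsvector, q tsquery, boost float8)
RETURNS float8 AS $$
  -- scale the built-in cover-density rank by an application-supplied factor
  SELECT ts_rank_cd(tsv, q) * boost;
$$ LANGUAGE sql IMMUTABLE;

SELECT title FROM docs
ORDER BY my_rank(tsv, to_tsquery('foo & bar'), 2.0) DESC;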


> 
> pg 9.5
>


Re: [GENERAL] JSONB performance enhancement for 9.6

2016-05-01 Thread Oleg Bartunov
On Sun, May 1, 2016 at 6:46 AM, Tom Smith  wrote:

> Hello:
>
> I'd like to bring up this JSONB performance issue again.
> Below is a link to MySQL's way of storing/retrieving JSON key/value:
>
> https://dev.mysql.com/doc/refman/5.7/en/json.html
>
> Instead of providing column indexing (like GIN for JSONB in PostgreSQL),
> it provides only internal data-structure-level indexing within each
> individual JSON object for fast retrieval. Compression is not used.
>
> Perhaps without implementing complicated column-level GIN indexing,
> implementing a new variant JSON type that only handles individual JSON
> object indexing would be feasible? Combined with the current JSONB
> implementation, both common use cases (one is global doc indexing, the
> other is fast retrieval of individual values) would work out and make
> PostgreSQL unbeatable.
>

It's called an expression index?
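
That is, instead of (or alongside) a GIN index over the whole document, you
index the one key you need fast - a sketch reusing the json_tables example
that appears elsewhere in this digest:

CREATE INDEX ON json_tables ((data ->> 'name'));
-- a plain btree on the extracted value; equality lookups become index scans
SELECT data FROM json_tables WHERE data ->> 'name' = 'AC3 Case Red';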


>
>
>
>
>
>
>
>
>
> On Tue, Jan 19, 2016 at 8:51 PM, Bruce Momjian  wrote:
>
>> On Mon, Jan 11, 2016 at 09:01:03PM -0500, Tom Smith wrote:
>> > Hi,
>> >
>> > Congrats on the official release of 9.5
>> >
>> > And I'd like bring up the issue again about if 9.6 would address the
>> jsonb
>> > performance issue
>> > with large number of top level keys.
>> > It is true that it does not have to use JSON format. it is about
>> serialization
>> > and fast retrieval
>> > of dynamic tree structure objects. (at top level, it might be called
>> dynamic
>> > columns)
>> > So if postgresql can have its own way, that would work out too as long
>> as it
>> > can have intuitive query
>> > (like what are implemented for json and jsonb) and fast retrieval of a
>> tree
>> > like object,
>> > it can be called no-sql data type. After all, most motivations of using
>> no-sql
>> > dbs like MongoDB
>> > is about working with dynamic tree object.
>> >
>> > If postgresql can have high performance on this, then many no-sql dbs
>> would
>> > become history.
>>
>> I can give you some backstory on this.  TOAST was designed in 2001 as a
>> way to store, in a data-type-agnostic way, long strings compressed and
>> any other long data type, e.g. long arrays.
>>
>> In all previous cases, _part_ of the value wasn't useful.  JSONB is a
>> unique case because it is one of the few types that can be processed
>> without reading the entire value, e.g. it has an index.
>>
>> We are going to be hesitant to do something data-type-specific for
>> JSONB.  It would be good if we could develop a data-type-agnostic
>> approach to how TOAST can be improved.  I know of no such work for 9.6,
>> and it is unlikely it will be done in time for 9.6.
>>
>> --
>>   Bruce Momjian  http://momjian.us
>>   EnterpriseDB http://enterprisedb.com
>>
>> + As you are, so once was I. As I am, so you will be. +
>> + Roman grave inscription +
>>
>
>


Re: [GENERAL] Shrinking TSvectors

2016-04-05 Thread Oleg Bartunov
On Tue, Apr 5, 2016 at 2:37 PM, Howard News 
wrote:

> Hi,
>
> does anyone have any pointers for shrinking tsvectors
>
> I have looked at the contents of some of these fields and they contain
> many details that are not needed. For example...
>
> "'+1':935,942 '-0500':72 '-0578':932 '-0667':938 '-266':937 '-873':944
> '-9972':945 '/partners/application.html':222
> '/partners/program/program-agreement.pdf':271
> '/partners/reseller.html':181,1073 '01756':50,1083 '07767':54,1087
> '1':753,771 '12':366 '14':66 (...)"
>

> I am not interested in keeping the numbers or urls in the indexes.
>


select strip ('asd:23');
 strip
---
 'asd'
(1 row)
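
To stop numbers and URLs from being indexed at all, the corresponding token
types can also be dropped from a text search configuration - a hedged
sketch, assuming the default parser and a private copy of the english
configuration:

CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);
ALTER TEXT SEARCH CONFIGURATION my_english
    DROP MAPPING FOR int, uint, float, sfloat, url, url_path, host, version;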



>
> Thanks,
>
> Howard.
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] MongoDB 3.2 beating Postgres 9.5.1?

2016-03-13 Thread Oleg Bartunov
On Mar 11, 2016 4:40 PM, "Paul Jones"  wrote:
>
> I have been running the EDB benchmark that compares Postgres and MongoDB.
> I believe EDB ran it against PG 9.4 and Mongo 2.6.  I am running it
> against PG 9.5.1 and Mongo 3.2 with WiredTiger storage using 10,000,000
> JSON records generated by the benchmark.  It looks like Mongo is winning,
> and apparently because of its cache management.

Dmitry was working on the same benchmarks. I think the EDB benchmark is
broken by design. Better to use the YCSB benchmarks. I hope Dmitry will
share his results.
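
One hedged aside on the plans quoted below: "Heap Blocks: ... lossy=872059"
means the bitmap overflowed work_mem and fell back to page-level
granularity, which is what forces the recheck of millions of rows. Raising
it (the value below is only an illustration) is worth a try:

SET work_mem = '512MB';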

>
> The first queries on both run in ~30 min.  And, once PG fills its cache,
> it whips Mongo on repeats of the *same* query (vmstat shows no disk
> reads for PG).
>
> However, when different query on the same table is issued to both,
> vmstat shows that PG has to read the *entire* table again, and it takes
> ~30 min.  Mongo does a lot of reads initially but after about 5 minutes,
> it stops reading and completes the query, most likely because it is
> using its cache very effectively.
>
> Host:   Virtual Machine
> 4 CPUs
> 16 Gb RAM
> 200 Gb Disk
> RHEL 6.6
>
> PG: 9.5.1 compiled from source
> shared_buffers = 7GB
> effective_cache_size = 12GB
>
> Mongo:  3.2 installed with RPM from Mongo
>
> In PG, I created the table by:
>
> CREATE TABLE json_tables
> (
> dataJSONB
> );
>
> After loading, it creates the index:
>
> CREATE INDEX json_tables_idx ON json_tables USING GIN (data jsonb_path_ops);
>
> After a lot of experimentation, I discovered that the benchmark was not
> using PG's index, so I modified the four queries to be:
>
> SELECT data FROM json_tables WHERE data @> '{"brand": "ACME"}';
> SELECT data FROM json_tables WHERE data @> '{"name": "Phone Service Basic
Plan"}';
> SELECT data FROM json_tables WHERE data @> '{"name": "AC3 Case Red"}';
> SELECT data FROM json_tables WHERE data @> '{"type": "service"}';
>
> Here are two consecutive explain analyze for PG, for the same query.
> No functional difference in the plans that I can tell, but the effect
> of PG's cache on the second is dramatic.
>
> If anyone has ideas on how I can get PG to more effectively use the cache
> for subsequent queries, I would love to hear them.
>
> ---
>
> benchmark=# explain analyze select data from json_tables where data @> '{"name": "AC3 Case Red"}';
>
>                               QUERY PLAN
> ---------------------------------------------------------------------------
>  Bitmap Heap Scan on json_tables  (cost=113.50..37914.64 rows=1 width=1261) (actual time=2157.118..1259550.327 rows=909091 loops=1)
>    Recheck Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)
>    Rows Removed by Index Recheck: 4360296
>    Heap Blocks: exact=37031 lossy=872059
>    ->  Bitmap Index Scan on json_tables_idx  (cost=0.00..111.00 rows=1 width=0) (actual time=2141.250..2141.250 rows=909091 loops=1)
>          Index Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)
>  Planning time: 291.932 ms
>  Execution time: 1259886.920 ms
> (8 rows)
>
> Time: 1261191.844 ms
>
> benchmark=# explain analyze select data from json_tables where data @> '{"name": "AC3 Case Red"}';
>
>                               QUERY PLAN
> ---------------------------------------------------------------------------
>  Bitmap Heap Scan on json_tables  (cost=113.50..37914.64 rows=1 width=1261) (actual time=779.261..29815.262 rows=909091 loops=1)
>    Recheck Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)
>    Rows Removed by Index Recheck: 4360296
>    Heap Blocks: exact=37031 lossy=872059
>    ->  Bitmap Index Scan on json_tables_idx  (cost=0.00..111.00 rows=1 width=0) (actual time=769.081..769.081 rows=909091 loops=1)
>          Index Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)
>  Planning time: 33.967 ms
>  Execution time: 29869.381 ms
> (8 rows)
>
> Time: 29987.122 ms
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Replacement for Oracle Text

2016-02-19 Thread Oleg Bartunov
On Fri, Feb 19, 2016 at 8:28 PM, Josh berkus  wrote:

> On 02/19/2016 05:49 AM, s d wrote:
>
>> On 19 February 2016 at 14:19, Bruce Momjian > > wrote:
>>
>> I wonder if PLPerl could be used to extract the words from a PDF
>> document and create a tsvector column from it.
>>
>>
>>   I don't know about PLPerl(I'm pretty sure it could be used for this
>> purpose, though.).  On the other hand I've written code for this in
>> Python which should be easy to adapt for PLPython, if necessary.
>>
>
> I'd swear someone already built something to do this.  All you need is a
> library which reads PDF and transforms it into text, and then you can FTS
> it.  I know there's a module for OpenOffice docs somewhere as well, but
> heck if I can remember where.
>

I used pdftotext for that.
I think it'd be useful to have extensions that can be used to convert
anything to text. I remember someone indexed chemical formulae, TeX/LaTeX,
and DOC files.
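
The database side is simple once the text is extracted - a minimal sketch,
where the table and column names are made up and the extraction itself
happens outside the database (e.g. with pdftotext):

CREATE TABLE pdf_docs (fname text, body text, tsv tsvector);
-- body is filled by the loader with the pdftotext output
UPDATE pdf_docs SET tsv = to_tsvector('english', body);
CREATE INDEX pdf_docs_tsv_idx ON pdf_docs USING gin (tsv);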


>
> --
> --
> Josh Berkus
> Red Hat OSAS
> (any opinions are my own)
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Query plan not updated after dropped index

2016-02-18 Thread Oleg Bartunov
On Thu, Feb 18, 2016 at 11:17 AM, Victor Blomqvist  wrote:

> Hello!
>
> We just had a major issue on our databases: after an index was replaced, a
> user-defined function didn't change its query plan to use the new index. At
> least this is our theory, since the function in question became much slower
> and as a result brought our system to a halt.
>
> Basically it went:
> 1. create new index (a simple btree on a bigint column index)
> 2. drop old index
> 3. rename new index to old index name
>

Why did you do this!?
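
If you have to swap indexes this way, one hedged workaround (not verified
against this particular case) is to make the pooled sessions throw away
their cached plans once the swap is done:

DISCARD PLANS;  -- per session; e.g. via pgbouncer's server_reset_query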


> 3. analyze table
>
> After these steps normally our functions will update their plans and use
> the new index just fine. However this time the function (only one function
> uses this particular index) seemed to take forever to complete. This is a
> 40GB table so querying for something not indexed would take a long time.
> Therefore my suspicion is that the function didn't start to use the new index.
>
> Adding to the strangeness is that if I ran the function manually it was
> fast, only when called from our application through pg_bouncer it was slow.
> I should also say that the function is only used on our 3 read slaves setup
> to our database.
>
> Things we tried to fix this:
> 1. Analyze table
> 2. Restart our application
> 3. Recreate the function
> 4. Kill the slow running queries with pg_cancel_backend()
>
> These things did not help.
>
> Instead what helped in the end was to replace the function with an extra
> useless where clause (in the hope that it would force it to create a new
> plan)
>
> So, the function only have a single SELECT inside:
> RETURN QUERY
> SELECT * FROM table
>   WHERE bigint_column = X
>   LIMIT 100 OFFSET 0;
>
> And this is my modification that made it work again:
> RETURN QUERY
> SELECT * FROM table
>   WHERE bigint_column = X AND 1=1
>   LIMIT 100 OFFSET 0;
>
>
> Obviously we are now worried why this happened and how we can avoid it in
> the future? We run Postgres 9.3 on CentOS 6.
>
> Thanks!
> Victor
>


Re: [GENERAL] pgDay Asia / talks / lightning talks

2016-02-17 Thread Oleg Bartunov
It's a pity I missed this conference; I learned about it too late.

On Wed, Feb 17, 2016 at 8:46 AM, Satoshi Nagayasu  wrote:

> Hi Dan,
>
> 2016-02-16 20:43 GMT+09:00 Daniel Pocock :
> > Is this the place to ask questions about pgDay Asia[1] or is there
> > another mailing list for it?  The mailing list link on the pgDay Asia
> > web site just takes me to a marketing list[2].  The seasiapug list[3]
> > looks very quiet.
>
> > 2.
> http://uptime.us2.list-manage.com/subscribe/post?u=8b6e2840d44be26e9f646b9f9=128a96a18a
>
> This is the list I have set up, and it is intended to provide some
> update information
> about the conference and the web site, especially for the attendees.
>
> And if you have any question or comments, please send to pgday-asia
> [at] googlegroups.co.jp
>
> Regards,
> --
> Satoshi Nagayasu 
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] fast refresh materialized view

2016-02-11 Thread Oleg Bartunov
It looks like you understand Russian!

The community will accept your work only under the BSD license. If anything
is unclear, write to me.

Best regards,
Oleg

On Thu, Feb 11, 2016 at 7:56 PM, Nguyễn Trần Quốc Vinh  wrote:

> We would like to contribute to the PostgreSQL community. What can you
> recommend us?
>
> Thank you.
>
> Dr. Nguyễn Trần Quốc Vinh
> ---
> Dean, Faculty of Information Technology
> Danang University of Education
> Website: http://it.ued.vn ; http://www.ued.vn
> ; http://www.ued.udn.vn
> CV: http://scv.ued.vn/~ntquocvinh
> Phone: 0511.6-512-586
> Mobile: 0914.78-08-98
> 
> Nguyen Tran Quoc Vinh, PhD
> Dean
> Faculty of Information Technology
> Danang University of Education
> Website: http://it.ued.udn.vn; http://www.ued.vn 
> ; http://www.ued.udn.vn
> SCV: http://scv.ued.vn/~ntquocvinh 
> Phone: (+84) 511.6-512-586
> Mobile: (+84) 914.78-08-98
>
> On Tue, Feb 9, 2016 at 7:51 PM, Michael Paquier  > wrote:
>
>> On Tue, Feb 9, 2016 at 12:32 AM, Nguyễn Trần Quốc Vinh
>>  wrote:
>> >
>> > Thank you very much. We didn't think about that. We would like to
>> choose
>> > APACHE LICENSE. We apologize for late reply.
>>
>> And that would be incompatible with the PostgreSQL license I guess,
>> per concerns with patents and similar stuff.
>> --
>> Michael
>>
>
>


Re: [GENERAL] BSD initdb without ICU support and switch later

2016-01-29 Thread Oleg Bartunov
On Fri, Jan 29, 2016 at 5:46 PM, Maeldron T.  wrote:

> Hello,
>
> the ICU patch isn’t ready for PostgreSQL on FreeBSD.
>
> Is there any risk (more than 0) in executing the initdb without ICU
> support and recompiling PostgreSQL later when the ICU patch is ready? I
> mean any risk without making a dump and import before the switch.
>

Teodor has made a patch for ICU and PostgreSQL 9.5.


>
> If this is okay for sure, what should I do later when the ICU is
> available? Do I have to reindex everything with the ICU patched database?
>

I think, yes.
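
A minimal sketch of that step (any index on text/varchar/char columns is
affected by a collation library change; the database name is hypothetical):

REINDEX DATABASE mydb;  -- per-index or per-table REINDEX works too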


>
> Thank you.
>
> M.
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Using a german affix file for compound words

2016-01-28 Thread Oleg Bartunov
On Thu, Jan 28, 2016 at 6:04 PM, Wolfgang Winkler <
wolfgang.wink...@digital-concepts.com> wrote:

> Hi!
>
> We have a problem with importing a compound dictionary file for german.
>
> I downloaded the files here:
>
>
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
>
> and converted them to utf-8 with iconv. The affix file seems ok when
> opened with an editor.
>
> When I try to create or alter a dictionary to use this affix file, I get
> the following error:
>
> alter TEXT SEARCH DICTIONARY german_ispell (
>   DictFile = german,
>   AffFile = german,
>   StopWords = german
> );
> ERROR:  syntax error
> CONTEXT:  line 224 of configuration file
> "/usr/local/pgsql/share/tsearch_data/german.affix": "   ABE > -ABE,äBIN
> "
>
> This is the first occurrence of an umlaut character in the file. I've
> found a few postings where the same file is used, e.g.:
>
>
> http://www.postgresql.org/message-id/flat/556c1411.4010...@tbz-pariv.de#556c1411.4010...@tbz-pariv.de
>
> This user has been able to import the file. Am I missing something
> obvious?
>

Arthur Zakirov could help you.


>
> ww
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] JSONB performance enhancement for 9.6

2016-01-20 Thread Oleg Bartunov
On Wed, Jan 20, 2016 at 4:51 AM, Bruce Momjian  wrote:

> On Mon, Jan 11, 2016 at 09:01:03PM -0500, Tom Smith wrote:
> > Hi,
> >
> > Congrats on the official release of 9.5
> >
> > And I'd like bring up the issue again about if 9.6 would address the
> jsonb
> > performance issue
> > with large number of top level keys.
> > It is true that it does not have to use JSON format. it is about
> serialization
> > and fast retrieval
> > of dynamic tree structure objects. (at top level, it might be called
> dynamic
> > columns)
> > So if postgresql can have its own way, that would work out too as long
> as it
> > can have intuitive query
> > (like what are implemented for json and jsonb) and fast retrieval of a
> tree
> > like object,
> > it can be called no-sql data type. After all, most motivations of using
> no-sql
> > dbs like MongoDB
> > is about working with dynamic tree object.
> >
> > If postgresql can have high performance on this, then many no-sql dbs
> would
> > become history.
>
> I can give you some backstory on this.  TOAST was designed in 2001 as a
> way to store, in a data-type-agnostic way, long strings compressed and
> any other long data type, e.g. long arrays.
>
> In all previous cases, _part_ of the value wasn't useful.  JSONB is a
> unique case because it is one of the few types that can be processed
> without reading the entire value, e.g. it has an index.
>
> We are going to be hesitant to do something data-type-specific for
> JSONB.  It would be good if we could develop a data-type-agnostic
> approach to how TOAST can be improved.  I know of no such work for 9.6,
> and it is unlikely it will be done in time for 9.6.
>

I'm looking at this from time to time.


>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I. As I am, so you will be. +
> + Roman grave inscription +
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


[GENERAL] Fwd: Data Packaging/Data Unpacking

2016-01-13 Thread oleg yusim
Apologies for posting this again, but I didn't get any responses so far.
Looks like I posted it too late in the evening and it went unnoticed.

Oleg

-- Forwarded message --
From: oleg yusim <olegyu...@gmail.com>
Date: Tue, Jan 12, 2016 at 10:00 PM
Subject: Data Packaging/Data Unpacking
To: PostgreSQL General <pgsql-general@postgresql.org>


Greetings,

I have a matching pair of security requirements concerning preservation of
data confidentiality and integrity in a PostgreSQL DB during packaging for
transmission / unpacking from transmission.

Important: let's assume data at rest is encrypted using EFS and data at
transit is encrypted using ciphers, provided by OpenSSL.

So, with that in mind, please help me to understand the movement and
location of the data between the moment it is pulled from the file system
and the moment it is encrypted as a network packet going through the SSL
tunnel.

And in reverse - between the moment a network packet that arrived through
the SSL tunnel is decrypted and the moment its content is placed into the
file system.



For those interested, here are requirements themselves, quoted:

1) The DBMS must maintain the confidentiality and integrity of information
during preparation for transmission.

Information can be either unintentionally or maliciously disclosed or
modified during preparation for transmission, including, for example,
during aggregation, at protocol transformation points, and during
packing/unpacking. These unauthorized disclosures or modifications
compromise the confidentiality or integrity of the information.

Use of this requirement will be limited to situations where the data owner
has a strict requirement for ensuring data integrity and confidentiality is
maintained at every step of the data transfer and handling process.

When transmitting data, the DBMS, associated applications, and
infrastructure must leverage transmission protection mechanisms.

2) The DBMS must maintain the confidentiality and integrity of information
during reception.

Information can be either unintentionally or maliciously disclosed or
modified during reception, including, for example, during aggregation, at
protocol transformation points, and during packing/unpacking. These
unauthorized disclosures or modifications compromise the confidentiality or
integrity of the information.

This requirement applies only to those applications that are either
distributed or can allow access to data non-locally. Use of this
requirement will be limited to situations where the data owner has a strict
requirement for ensuring data integrity and confidentiality is maintained
at every step of the data transfer and handling process.

When receiving data, the DBMS, associated applications, and infrastructure
must leverage protection mechanisms.


Thanks,

Oleg


Re: [GENERAL] Data Packaging/Data Unpacking

2016-01-13 Thread oleg yusim
Kevin,

Thank you for your reply. I understand what you are saying, but I guess I
need a bit deeper understanding for my assessment. Let's dive a bit here:

Say I got a network packet. The packet was decrypted by OpenSSL. Where is
this data, physically, at this moment?

Opposite situation: we are about to send the results of a SELECT statement
over the OpenSSL tunnel. Where is the data, physically, between the moment
the DB engine returned results for my query and the moment the packet was
encrypted by OpenSSL?

Thanks,

Oleg

On Wed, Jan 13, 2016 at 2:46 PM, Kevin Grittner <kgri...@gmail.com> wrote:

> On Tue, Jan 12, 2016 at 10:00 PM, oleg yusim <olegyu...@gmail.com> wrote:
>
> > Important: let's assume data at rest is encrypted using EFS and data at
> > transit is encrypted using ciphers, provided by OpenSSL.
> >
> > So, with that in mind, please, help me to understand movement and
> location
> > of the data between the moment when it is pulled from file system and
> > encrypted as network package going through the SSL tunnel.
> >
> > And reversing it - between the moment network package arrived through the
> > SSL tunnel is decrypted and the moment its content is placed into the
> file
> > system.
>
> At all times the data is present only in files owned by the OS user
> which runs the database server or in RAM allocated to processes run
> by that user.  Files and RAM are freed without overwrite; we count
> on the OS to not gratuitously show the old values to processes
> making new allocations.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [GENERAL] Data Packaging/Data Unpacking

2016-01-13 Thread oleg yusim
OK, Kevin, David,

Thank you very much for the explanation. Now, who is the owner of this
process? My understanding is that the data is then located physically in
RAM, in the memory assigned by the OS to this process. So the question is:
who owns the process?

Thanks,

Oleg

On Wed, Jan 13, 2016 at 3:29 PM, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Wed, Jan 13, 2016 at 2:19 PM, Kevin Grittner <kgri...@gmail.com> wrote:
>
>> On Wed, Jan 13, 2016 at 2:57 PM, oleg yusim <olegyu...@gmail.com> wrote:
>>
>> > Say, I got network package. The package was decrypted by OpenSSL. Where
>> this
>> > data are, physically, at this moment?
>>
>> Process-local memory for the PostgreSQL backend process associated
>> with the database connection.
>>
>> > Opposite situation: we are about to send results of SELECT statement
>> over
>> > the OpenSSL tunnel. Where are the data, physically, between the moment
>> DB
>> > engine returned results for my query and moment the package was
>> encrypted by
>> > OpenSSL?
>>
>> Process-local memory for the PostgreSQL backend process associated
>> with the database connection.
>>
>
> In other words (?), while OpenSSL may be a separate project, the method of
> its use involves loading all relevant code into the process space
> of PostgreSQL - specifically the process that was forked when the SSL
> client connection was established.
>
> David J.​
>


Re: [GENERAL] Data Packaging/Data Unpacking

2016-01-13 Thread oleg yusim
David,

Answer "postgres" would suffice. I have fairly decent idea of what happens
below that (you are right system utility - Memory Manager is what
marshaling data back and forth to RAM, abstracting absolute addresses from
application along the way, and once in RAM, security kernel of the system
protects data, terminating any process which tries to obtain data from
another process - provided data were not shared to begin with).

Thanks a lot for your consultation guys.

Oleg

On Wed, Jan 13, 2016 at 3:49 PM, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Wed, Jan 13, 2016 at 2:37 PM, oleg yusim <olegyu...@gmail.com> wrote:
>
>> OK, Kevin, David,
>>
>> Thanks you very much for explanation. Now who is the owner of this
>> process? My understanding is, data then located physically in RAM, in the
>> memory stack assigned by OS to this process. Now the question is who owns
>> the process?
>>
>>
> Who owns the PostgreSQL backend user process?
>
> Typically the "postgres" user.
> Who owns the "process" responsible for marshalling data back and forth
> from RAM?  There isn't really a concept of "process" when dealing with
> system resources.  The kernel is responsible for management of those
> resources.  Processes are a user-space concern.
>
> There may be some here concerned with the deep workings of the various O/S
> that PostgreSQL runs on but ultimately the answer, from PostgreSQL's
> perspective, is that we use system APIs to request and use resources and
> expect the underlying system to handle those securely.  If you want to know
> how that layer works you should ask there since it seems out-of-scope for
> this forum.
>
> David J.
>
>


Re: [GENERAL] Data Packaging/Data Unpacking

2016-01-13 Thread oleg yusim
Kevin,

Understood. But it doesn't change things from my perspective. It would in
any case be a user with sole access, whose privileges nobody can elevate
to, and that only root can su to. The name, in that case, is irrelevant,
but the answer "postgres" gave me an idea of the protection level here.

Thanks,

Oleg

On Wed, Jan 13, 2016 at 4:04 PM, Kevin Grittner <kgri...@gmail.com> wrote:

> On Wed, Jan 13, 2016 at 3:54 PM, oleg yusim <olegyu...@gmail.com> wrote:
>
> > Answer "postgres" would suffice.
>
> But the user would not always be "postgres".  To be accurate, it is
> the user which owns the files for the "cluster" (database instance)
> and which runs the database service.  If a machine contains
> multiple clusters it is (IMO) best practice, for both security and
> operational reasons, to use a separate OS user for each cluster.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


[GENERAL] Data Packaging/Data Unpacking

2016-01-12 Thread oleg yusim
Greetings,

I have matching couple of security requirements, speaking about preserving
data confidentiality and integrity in PostgreSQL DB during packaging for
transmission / unpacking from transmission.

Important: let's assume data at rest is encrypted using EFS and data at
transit is encrypted using ciphers, provided by OpenSSL.

So, with that in mind, please, help me to understand movement and location
of the data between the moment when it is pulled from file system and
encrypted as network package going through the SSL tunnel.

And reversing it - between the moment network package arrived through the
SSL tunnel is decrypted and the moment its content is placed into the file
system.



For those interested, here are requirements themselves, quoted:

1) The DBMS must maintain the confidentiality and integrity of information
during preparation for transmission.

Information can be either unintentionally or maliciously disclosed or
modified during preparation for transmission, including, for example,
during aggregation, at protocol transformation points, and during
packing/unpacking. These unauthorized disclosures or modifications
compromise the confidentiality or integrity of the information.

Use of this requirement will be limited to situations where the data owner
has a strict requirement for ensuring data integrity and confidentiality is
maintained at every step of the data transfer and handling process.

When transmitting data, the DBMS, associated applications, and
infrastructure must leverage transmission protection mechanisms.

2) The DBMS must maintain the confidentiality and integrity of information
during reception.

Information can be either unintentionally or maliciously disclosed or
modified during reception, including, for example, during aggregation, at
protocol transformation points, and during packing/unpacking. These
unauthorized disclosures or modifications compromise the confidentiality or
integrity of the information.

This requirement applies only to those applications that are either
distributed or can allow access to data non-locally. Use of this
requirement will be limited to situations where the data owner has a strict
requirement for ensuring data integrity and confidentiality is maintained
at every step of the data transfer and handling process.

When receiving data, the DBMS, associated applications, and infrastructure
must leverage protection mechanisms.


Thanks,

Oleg


Re: [GENERAL] WIP: CoC V4

2016-01-12 Thread Oleg Bartunov
On Tue, Jan 12, 2016 at 7:50 PM, Joshua D. Drake 
wrote:

> Tl;dr;
>
> * Removed specific examples and reworded #3 to be more concise
>
> PostgreSQL Global Development Group (PGDG) Code of Conduct (CoC):
>
> 1. The CoC is to provide community guidelines for creating and enforcing a
> safe, respectful, productive, and collaborative place for any person who is
> willing to contribute in a safe, respectful, productive and collaborative
> way.
>
> 2. The CoC is not about being offended. As with any diverse community,
> anyone can get offended at anything.
>
> 3. A safe, respectful, productive and collaborative environment is free of
> personal attacks and disparaging remarks of any kind.
>
> 4. Any sustained disruption of the collaborative space (mailing lists, IRC
> etc..) or other PostgreSQL events shall be construed as a violation of the
> CoC and appropriate action will be taken by the CoC committee.
>

Then we need to define what the "CoC committee" is.


>
> 5. The CoC is only about interaction with the PostgreSQL community. Your
> private and public lives outside of the PostgreSQL community are your own.
>


* We should specify here the canonical location of the CoC, say, a URL on postgresql.org
* We should translate it into other languages




>
> --
> Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
> PostgreSQL Centered full stack support, consulting and development.
> Announcing "I'm offended" is basically telling the world you can't
> control your own emotions, so everyone else should do it for you.
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Code of Conduct: Is it time?

2016-01-10 Thread Oleg Bartunov
On Sun, Jan 10, 2016 at 9:44 PM, Regina Obe  wrote:

> > On 01/10/2016 08:07 AM, Bill Moran wrote:
>
> >> So, the purpose of a CoC is twofold:
> >>
> >> A) Define what "being excellent" means to this particular
> >> community.
> >> B) Provide a process for how to resolve things when "being
> >> excellent" doesn't happen.
> >>
> >> Without #1, nobody will want to do #2, as it's basically a
> >> job that can never be done correctly.
>
> > I agree with you completely. That is actually why I included the link to
> > the graphic in the last post. My point was, I have no intention of
> > having a CoC that is full of drivel. I would want a clear, concise,
> > no-B.S. CoC.
>
> > JD
>
> This may come as a big shock to many of you, but as a contributor
> I don't care if you are racist, sexist, transphobic or whatever as long as
> you
>
> 1) Are helpful when I ask a question
> 2) Stick to the topic
> 3) Don't get into petty etiquettes like "Please stop top posting"
> and if you really need to - A polite we prefer top posting would do
>
> 4) Are sensitive to people on other operating systems other than your
> preferred.
>

+1


>
>
> My other concern about CoCs is I fear someone is going to come and demand
> we change Master/Slave  to Leader/Follower, because Master is a male term
> and Slave is insensitive to grand-children of slaves.
>


Some people don't understand all these issues with she/he; for example, we
in Russia are not really concerned about this.



>
>
> Thanks,
> Regina
>
>
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Failing to known state

2016-01-05 Thread oleg yusim
Hi Adrian,

Thank you very much for that link. It confirms what JD and John said, plus
explains couple other moments to me.

Thanks,

Oleg

On Tue, Jan 5, 2016 at 7:04 PM, Adrian Klaver <adrian.kla...@aklaver.com>
wrote:

> On 01/05/2016 04:12 PM, oleg yusim wrote:
> > Hi Adrian,
> >
> > I meant a scenario, when user is trying to connect to database (doesn't
> > matter what interface) and database fails at this moment. If all
> > authentication/authorization/validation functions are written to return
> > false in case of abnormal termination, we are fine. If not, we can
> > potentially encounter the situation when database fails into state where
> > user is given greater privileges than he/she should or even
> > authenticated, when he/she shouldn't.
>
> Might want to take a look at:
>
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/postmaster/postmaster.c;h=41dbdf4bf9eeb54ae0a774ab21fc1c1362aa55f9;hb=d25c7d70ff46d1b2f2400f29d100190efe84d70d
>
> /*
>  * CleanupBackend -- cleanup after terminated backend.
>  *
>  * Remove all local state associated with backend.
>  *
>  * If you change this, see also CleanupBackgroundWorker.
>  */
> static void
> CleanupBackend
>
>
> /*
>  * HandleChildCrash -- cleanup after failed backend, bgwriter,
> checkpointer,
>  * walwriter, autovacuum, or background worker.
>  *
>  * The objectives here are to clean up our local state about the child
>  * process, and to signal all other remaining children to quickdie.
>  */
> static void
> HandleChildCrash(in
>
> etc
>
> Just do a find on crash.
>
>
> >
> > Thanks,
> >
> > Oleg
> >
>
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: [GENERAL] Code of Conduct: Is it time?

2016-01-05 Thread Oleg Bartunov
On Wed, Jan 6, 2016 at 7:41 AM, Jim Nasby  wrote:

> On 1/5/16 10:03 PM, John R Pierce wrote:
>
>> On 1/5/2016 5:31 PM, Jim Nasby wrote:
>>
>>> IMHO, the real problem here is not simply a CoC, it is that the
>>> Postgres community doesn't focus on developing the community itself.
>>> The closest we come to "focus" is occasional talk on -hackers about
>>> how we need more developers. There is no formal
>>> discussion/leadership/coordination towards actively building and
>>> strengthening our community. Until that changes, I fear we will always
>>> have a lack of developers. More importantly, we will continue to lack
>>> all the other ways that people could contribute beyond writing code.
>>> IE: the talk shouldn't be about needing more developers, it should be
>>> about needing people who want to contribute time to growing the
>>> community.
>>>
>>
>>
>> That sounds like a bunch of modern marketing graduate mumbo-jumbo to
>> me. The postgres community are the people who actually support it on
>> the email lists and IRC, as well as the core development teams, and
>> IMO, they are quite strong and effective. When you start talking
>> about social marketing and facebook and twitter and stuff, that's just a
>> bunch of feelgood smoke and mirrors. The project's output is what
>> supports it, not having people going out 'growing community'; that is
>> just a bunch of hot air. You actively 'grow community' when you're
>> pushing worthless products (soda pop, etc) based on slick marketing
>> plans rather than actually selling something useful.
>>
>
> Then why is it that there is almost no contribution to the community other
> than code and mailing list discussion?
>
> Why is the infrastructure team composed entirely of highly experienced
> code contributors, of which there are ~200 on the planet, when there are
> literally 100s of thousands (if not millions) of people out there that
> could do that work (and could probably do it better if it's what they do
> for a living, no offense to the efforts of the infrastructure team).
>
> Why is there a lack of developers? And a serious lack of code reviewers?
>

I agree with Jim, something is wrong: I see our developer community isn't
growing and is getting older. There is no formal barrier to starting to
contribute, but the steep learning curve and lack of mentoring practice
scare people away.

> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Failing to known state

2016-01-05 Thread oleg yusim
Hi Joe,

Exactly how I marked it :)

Thanks,

Oleg

On Tue, Jan 5, 2016 at 6:50 PM, Joe Conway <m...@joeconway.com> wrote:

> On 01/05/2016 04:32 PM, John R Pierce wrote:
> > On 1/5/2016 4:12 PM, oleg yusim wrote:
> >> I meant a scenario, when user is trying to connect to database
> >> (doesn't matter what interface) and database fails at this moment. If
> >> all authentication/authorization/validation functions are written to
> >> return false in case of abnormal termination, we are fine. If not, we
> >> can potentially encounter the situation when database fails into state
> >> where user is given greater privileges than he/she should or even
> >> authenticated, when he/she shouldn't.
> >
> > if the postgres server processes terminate for any reason, there's
> > nothing to connect to.  the client application will get a error like
> > 'connection refused' back from the connection attempt, or if it was
> > already connected and the server aborts, the next query will return an
> > error like CONNECTION_BAD.  there's no possible privilege elevation.
>
> +1
>
> I think you can call this one "Applicable -- Inherently Meets"
>
> Joe
>
> --
> Crunchy Data - http://crunchydata.com
> PostgreSQL Support for Secure Enterprises
> Consulting, Training, & Open Source Development
>
>


Re: [GENERAL] Failing to known state

2016-01-05 Thread oleg yusim
John,

Thanks, what you are saying makes sense. I agree, it would cause all users
to go through the authentication/authorization loop all over again and
terminate all running transactions too.

Thanks,

Oleg

On Tue, Jan 5, 2016 at 6:32 PM, John R Pierce <pie...@hogranch.com> wrote:

> On 1/5/2016 4:12 PM, oleg yusim wrote:
>
> I meant a scenario, when user is trying to connect to database (doesn't
> matter what interface) and database fails at this moment. If all
> authentication/authorization/validation functions are written to return
> false in case of abnormal termination, we are fine. If not, we can
> potentially encounter the situation when database fails into state where
> user is given greater privileges than he/she should or even authenticated,
> when he/she shouldn't.
>
>
>
>
>
> if the postgres server processes terminate for any reason, there's nothing
> to connect to.  the client application will get a error like
> 'connection refused' back from the connection attempt, or if it was already
> connected and the server aborts, the next query will return an error like
> CONNECTION_BAD.  there's no possible privilege elevation.
>
>
>
>
>
> --
> john r pierce, recycling bits in santa cruz
>
>


Re: [GENERAL] Failing to known state

2016-01-05 Thread oleg yusim
Thanks JD.

Let me confirm I got you right. So, by exception you mean the
authentication/authorization/validation functions would return false in
case of DB failure?

Thanks,

Oleg


On Tue, Jan 5, 2016 at 5:33 PM, Joshua D. Drake <j...@commandprompt.com>
wrote:

> On 01/05/2016 03:21 PM, oleg yusim wrote:
>
>> Thanks JD.
>>
>>  From what I read about WAL (you have been referring to this:
>> http://www.postgresql.org/docs/current/static/wal-internals.html
>> pg_xlog, right?) it allows us to know what happened, but does it
>> guarantee a known secure state? I mean, I do not think it would help with
>> this:
>>
>> "In general, security mechanisms should be designed so that a failure
>> will follow the same execution path as disallowing the operation. For
>> example, application security methods, such as isAuthorized(),
>> isAuthenticated(), and validate(), should all return false if there is
>> an exception during processing. If security controls can throw
>> exceptions, they must be very clear about exactly what that condition
>> means. "
>>
>
> You are correct, that isn't the pg_xlog but yes, PostgreSQL will throw an
> exception in those types of cases.
>
> Sincerely,
>
> JD
>
>
> --
> Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
> PostgreSQL Centered full stack support, consulting and development.
> Announcing "I'm offended" is basically telling the world you can't
> control your own emotions, so everyone else should do it for you.
>


Re: [GENERAL] Failing to known state

2016-01-05 Thread oleg yusim
Hi Adrian,

I meant a scenario, when user is trying to connect to database (doesn't
matter what interface) and database fails at this moment. If all
authentication/authorization/validation functions are written to return
false in case of abnormal termination, we are fine. If not, we can
potentially encounter the situation when database fails into state where
user is given greater privileges than he/she should or even authenticated,
when he/she shouldn't.

Thanks,

Oleg

On Tue, Jan 5, 2016 at 5:34 PM, Adrian Klaver <adrian.kla...@aklaver.com>
wrote:

> On 01/05/2016 03:21 PM, oleg yusim wrote:
>
>> Thanks JD.
>>
>>  From what I read about WAL (you have been referring to this:
>> http://www.postgresql.org/docs/current/static/wal-internals.html
>> pg_xlog, right?) it allows us to know what happened, but does it
>> guarantee a known secure state? I mean, I do not think it would help with
>> this:
>>
>> "In general, security mechanisms should be designed so that a failure
>> will follow the same execution path as disallowing the operation. For
>> example, application security methods, such as isAuthorized(),
>> isAuthenticated(), and validate(), should all return false if there is
>> an exception during processing. If security controls can throw
>> exceptions, they must be very clear about exactly what that condition
>> means. "
>>
>
> Not sure what you are talking about above. the application as in the
> client application connecting to the database or the database application
> itself?
>
>
>> Right?
>>
>> Thanks,
>>
>> Oleg
>>
>>
>> On Tue, Jan 5, 2016 at 5:14 PM, Joshua D. Drake <j...@commandprompt.com
>> <mailto:j...@commandprompt.com>> wrote:
>>
>> On 01/05/2016 03:09 PM, oleg yusim wrote:
>>
>>
>>
>> The question here, what is PostgreSQL 9.4.5 (hosted on Linux box)
>> behavior? Does it fail to known/secure state in these 3 cases? I
>> tried
>> to find the description of the way PostgreSQL fails in this
>> regard, but
>> didn't find much.
>>
>>
>> Based on what you pasted, PostgreSQL does fail to a known state.
>> That is the whole point of the xlog.
>>
>> Sincerely,
>>
>> JD
>>
>>
>> Thanks,
>>
>> Oleg
>>
>>
>>
>> --
>> Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
>> 
>> PostgreSQL Centered full stack support, consulting and development.
>> Announcing "I'm offended" is basically telling the world you can't
>> control your own emotions, so everyone else should do it for you.
>>
>>
>>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


[GENERAL] Failing to known state

2016-01-05 Thread oleg yusim
Greetings,

One more security requirement I'm battling with:



The DBMS must fail to a secure state if system initialization fails,
shutdown fails, or aborts fail.

Failure to a known state can address safety or security in accordance with
the mission/business needs of the organization.

Failure to a known secure state helps prevent a loss of confidentiality,
integrity, or availability in the event of a failure of the information
system or a component of the system.

Failure to a known safe state helps prevent systems from failing to a state
that may cause loss of data or unauthorized access to system resources.
Systems that fail suddenly and with no incorporated failure state planning
may leave the hosting system available but with a reduced security
protection capability. Preserving information system state data also
facilitates system restart and return to the operational mode of the
organization with less disruption of mission/business processes.

Databases must fail to a known consistent state. Transactions must be
successfully completed or rolled back.

In general, security mechanisms should be designed so that a failure will
follow the same execution path as disallowing the operation. For example,
application security methods, such as isAuthorized(), isAuthenticated(),
and validate(), should all return false if there is an exception during
processing. If security controls can throw exceptions, they must be very
clear about exactly what that condition means.

Abort refers to stopping a program or function before it has finished
naturally. The term abort refers to both requested and unexpected
terminations.



The question here: what is PostgreSQL 9.4.5's behavior (hosted on a Linux
box)? Does it fail to a known/secure state in these 3 cases? I tried to
find a description of the way PostgreSQL fails in this regard, but didn't
find much.

Thanks,

Oleg


Re: [GENERAL] Failing to known state

2016-01-05 Thread oleg yusim
Thanks JD.

From what I read about WAL (you have been referring to this:
http://www.postgresql.org/docs/current/static/wal-internals.html pg_xlog,
right?), it allows us to know what happened, but does it guarantee a known
secure state? I mean, I do not think it would help with this:

"In general, security mechanisms should be designed so that a failure will
follow the same execution path as disallowing the operation. For example,
application security methods, such as isAuthorized(), isAuthenticated(),
and validate(), should all return false if there is an exception during
processing. If security controls can throw exceptions, they must be very
clear about exactly what that condition means. "

Right?

Thanks,

Oleg



On Tue, Jan 5, 2016 at 5:14 PM, Joshua D. Drake <j...@commandprompt.com>
wrote:

> On 01/05/2016 03:09 PM, oleg yusim wrote:
>
>
>>
>> The question here, what is PostgreSQL 9.4.5 (hosted on Linux box)
>> behavior? Does it fail to known/secure state in these 3 cases? I tried
>> to find the description of the way PostgreSQL fails in this regard, but
>> didn't find much.
>>
>>
> Based on what you pasted, PostgreSQL does fail to a known state. That is
> the whole point of the xlog.
>
> Sincerely,
>
> JD
>
>
> Thanks,
>>
>> Oleg
>>
>
>
> --
> Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
> PostgreSQL Centered full stack support, consulting and development.
> Announcing "I'm offended" is basically telling the world you can't
> control your own emotions, so everyone else should do it for you.
>


Re: [GENERAL] Shared system resources

2015-12-23 Thread oleg yusim
HI George,

Thanks, this information clears up the situation. Now, a question to you
and David.

Might we run into a situation where an attacker dumps memory and analyses
it for valuable content, instead of reserving it for his own process, where
it would be zeroed? My understanding is that it is a possibility. Does the
kernel have any safeguard against it?

Thanks,

Oleg

On Wed, Dec 23, 2015 at 2:13 AM, George Neuner <gneun...@comcast.net> wrote:

> On Tue, 22 Dec 2015 23:21:27 +, David Wilson <dw...@hmmz.org>
> wrote:
>
> >On Linux the memory pages of an exiting process aren't sanitized at
> >exit, however it is impossible(?) for userspace to reallocate them
> >without the kernel first zeroing their contents.
>
> Not impossible, but it requires a non-standard kernel.
>
> Since 2.6.33, mmap() accepts the flag MAP_UNINITIALIZED which allows
> pages to be mapped without being cleared.  The flag has no effect
> unless the kernel was built with CONFIG_MMAP_ALLOW_UNINITIALIZED.
>
>
> No mainstream distro enables this.  AFAIK, there is NO distro at all
> that enables it ... it's too big a security risk for a general purpose
> system.  It's intended to support embedded systems where the set of
> programs is known.
>
> George
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Shared system resources

2015-12-23 Thread oleg yusim
Jim,

Help me out with this statement:

"There's tons of raw data stored in the shared memory segment, and some of
that can be copied to process local memory at any time. If they OS doesn't
secure that adequately there's certainly nothing that Postgres or any other
database can do about it."

To my knowledge, many databases are using what is called TDE to encrypt
data at rest and thus protect it from being accessed by an attacker on the
host. Here is a reference to a quick guide on it:
https://www.simple-talk.com/sql/database-administration/transparent-data-encryption/

Now, when you are saying "tons of raw data stored in the shared memory
segment, and some of that can be copied to process local memory at any
time", what kind of memory are you referring to? Is it the files, which
generally end up being protected with TDE, or is it buffer memory, which
gets used by database processes but doesn't belong to the database's
permanent storage?

Can you give me more details here, so I can understand the actual mapping
and scale of the issue?

Thanks,

Oleg

On Wed, Dec 23, 2015 at 9:55 AM, Jim Nasby <jim.na...@bluetreble.com> wrote:

> On 12/23/15 7:55 AM, oleg yusim wrote:
>
>> Sure David. For simplicity of modeling here, let's assume raw database
>> data was encrypted and the only possibility for attacker to get
>> something from raw data is to go and dig into sessions leftovers. Now,
>> with that has been said, do you happen to know what information actually
>> gets stored during the session into memory, reserved by session process?
>> I'm trying to determine, basically, does it even worth a talk - maybe
>> there is nothing at all valuable.
>>
>
> There's tons of raw data stored in the shared memory segment, and some of
> that can be copied to process local memory at any time. If they OS doesn't
> secure that adequately there's certainly nothing that Postgres or any other
> database can do about it.
>
> As David said, by the time you're concerned about someone getting access
> to raw memory it's already way too late.
>
> As for memory pages being zero'd after they are returned to the OS, that's
> entirely up to the OS. The only thing you could do on the Postgres side is
> to compile with memory debugging enabled, which will over-write any memory
> that's freed with a magic value. That's done to help hunt down memory
> access bugs, but would have the obvious side effect of obliterating any
> data that was in the page.
>
> Uh, only thing is, I don't know if this is done if we're going to be
> returning the memory to the OS.
>
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [GENERAL] Shared system resources

2015-12-23 Thread oleg yusim
Sure, David. For simplicity of modeling here, let's assume the raw database
data was encrypted and the only possibility for an attacker to get
something from raw data is to go and dig into session leftovers. Now, with
that said, do you happen to know what information actually gets stored
during the session in the memory reserved by the session process? I'm
trying to determine, basically, whether it is even worth a talk - maybe
there is nothing valuable at all.

Thanks,

Oleg

On Wed, Dec 23, 2015 at 7:41 AM, David Wilson <dw...@hmmz.org> wrote:

> On Wed, Dec 23, 2015 at 07:07:31AM -0600, oleg yusim wrote:
>
> > May we run into situation, when attacker dumps memory and analyses it
> > for valuable content, instead of reserving it for own process, where
> > it would be zeroed? My understanding, it is a possibility. Does kernel
> > have any safeguard against it?
>
> Sure it might be possible, but they would not have much useful
> information about which old processes the pages belonged to, and
> besides, they could most likely simply dump memory of a connected client
> in this case, or indeed just examine the filesystem or cache to get at
> the raw PG database files.
>
> Once someone has this level of access to the system it's not really
> useful to model threats much further.
>
> One minor correction from my first mail: MAP_UNINITIALIZED is indeed
> accessible to non-root, but as George mentions only when a non-default
> kernel parameter has been enabled.
>
>
> David
>


Re: [GENERAL] Shared system resources

2015-12-23 Thread oleg yusim
John,

To my knowledge, TDE is employed not only by Microsoft but by Oracle too.
I recall it also has a mechanism for protecting keys. Here are references:

https://msdn.microsoft.com/en-us/library/bb934049.aspx
http://www.oracle.com/technetwork/database/options/advanced-security/index-099011.html

Thank you very much for that piece:

"In PostgreSQL 'shared memory' has a quite specific meaning, its referring
to the pool of buffer memory (ram) shared by all postgres server
processes.   this is primarily used as the buffer cache. In a properly
secured operating system, ONLY the postgres server processes have access to
this shared memory segment"

It helped me to understand the terminology used by other responders better.

Thanks,

Oleg

On Wed, Dec 23, 2015 at 10:48 AM, John R Pierce <pie...@hogranch.com> wrote:

> On 12/23/2015 8:16 AM, oleg yusim wrote:
>
>>
>> To my knowledge, many databases are using what called TDE to encrypt data
>> at rest and protect data from being accessed by attacker on host this way.
>> Here is the reference to quick guide on it:
>> https://www.simple-talk.com/sql/database-administration/transparent-data-encryption/
>>
>
> that article is talking about a specific feature of Microsoft SQL Server
> Enterprise Edition, which upon a quick skim sounds to me to be smoke and
> mirrors 'security-by-checklist' protection.   If the encryption keys are
> stored on the system, then anyone with access to the raw data can decrypt
> it, no matter how much smoke and mirrors you wave around to obfuscate this
> fact.
>
> In PostgreSQL 'shared memory' has a quite specific meaning, its referring
> to the pool of buffer memory (ram) shared by all postgres server
> processes.   this is primarily used as the buffer cache. In a properly
> secured operating system, ONLY the postgres server processes have access to
> this shared memory segment, but the details of OS level memory management
> are outside postgres's scope, since it's portable and designed to be able to
> run on most any OS that provides basic memory management, multiple
> processes, and a reliable/robust file system, with tcp/ip socket support.
>
>
>
> --
> john r pierce, recycling bits in santa cruz
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Shared system resources

2015-12-23 Thread oleg yusim
Thank you very much George, that is exactly the piece of information I was
missing.

Oleg

On Wed, Dec 23, 2015 at 10:55 AM, George Neuner <gneun...@comcast.net>
wrote:

> Hi Oleg,
>
> On Wed, 23 Dec 2015 07:07:31 -0600, oleg yusim <olegyu...@gmail.com>
> wrote:
>
> >May we run into situation, when attacker dumps memory and analyses it for
> >valuable content, instead of reserving it for own process, where it would
> >be zeroed? My understanding, it is a possibility. Does kernel have any
> >safeguard against it?
>
> With recent kernels, by default there is no way for a userspace
> process (even root) to dump memory.  Older kernels by default
> permitted a root process unrestricted access to /dev/mem and
> /dev/kmem, however in general that isn't needed and has long been
> disabled by the major distros.  [see CONFIG_STRICT_DEVMEM].  IIRC, the
> default setting was changed in 2011.
>
> With sufficient privileges, a debugger-like process can attach and
> examine the memory of a running - or just terminated - process, but it
> won't have access to discarded (unmapped) memory.
>
> The MAP_UNINITIALIZED trick, even if it works, is not a predictable
> attack vector.  There is no way to ask for any *particular* VMM page -
> mmap() just gives you a set of pages sufficient to cover the requested
> address range ... you don't know what process those pages previously
> belonged to.  Obviously there is a known algorithm for satisfying the
> page requests, but the set of free pages includes both code and data
> and depends on the history of system activity.  There's no guarantee
> to get anything useful.
>
> I'm not sure any of this really answers your question.
> George
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Session Identifiers

2015-12-22 Thread oleg yusim
Thanks Michael, you are right, that is a very good alternative solution.

Oleg

On Tue, Dec 22, 2015 at 6:27 AM, Michael Paquier <michael.paqu...@gmail.com>
wrote:

> On Tue, Dec 22, 2015 at 1:42 AM, Stephen Frost <sfr...@snowman.net> wrote:
> > Oleg,
> >
> > * oleg yusim (olegyu...@gmail.com) wrote:
> >> tcp_keepalives_idle = 900
> >> tcp_keepalives_interval=0
> >> tcp_keepalives_count=0
> >>
> >> It doesn't terminate the connection to the database after 15 minutes of
> >> inactivity at the psql prompt. So, it looks like that would work only for
> >> the case where the network connection is broken and the session is left
> >> hanging. For the psql prompt case it looks like pg_terminate_backend()
> >> would be the only solution.
> >
> > Those settings aren't for controlling idle timeout of a connection.
> >
> > pg_terminate_backend() will work and could be run out of a cronjob.
>
> Or a background worker if you are using PG >= 9.3:
> https://github.com/michaelpq/pg_plugins/tree/master/kill_idle
> This has the advantage of not having the cronjob error out should the
> server be stopped. That's less error handling to take care of at the
> frontend level.
> --
> Michael
>


[GENERAL] Shared system resources

2015-12-22 Thread oleg yusim
Greetings,

I'm looking at the following security control right now:

*The DBMS must prevent unauthorized and unintended information transfer via
shared system resources.*

The purpose of this control is to prevent information, including encrypted
representations of information, produced by the actions of a prior
user/role (or the actions of a process acting on behalf of a prior
user/role) from being available to any current user/role (or current
process) that obtains access to a shared system resource (e.g., registers,
main memory, secondary storage) after the resource has been released back
to the information system. Control of information in shared resources is
also referred to as object reuse.

From previous discussions, I understood that a session in PostgreSQL is
closely associated with a process, and it is essentially a new process for
every user connection. In regards to that, my question would be:

When a user disconnects, the process is terminated and all resources are
released. Does the memory the session/process stored its information in get
"sanitized", or is it just released as-is?

Thanks,

Oleg


Re: [GENERAL] Shared system resources

2015-12-22 Thread oleg yusim
Jim,

Yes, you are right. Generally the security control here is encryption of
data at rest (TDE), but PostgreSQL doesn't support it, to my knowledge. I
know about that vulnerability, but here I posed the question about a
different one. I agree it is a smaller one compared to the absence of TDE,
but I would like to find out whether these gates are open too or not.

Thanks,

Oleg

On Tue, Dec 22, 2015 at 8:48 PM, Jim Nasby <jim.na...@bluetreble.com> wrote:

> On 12/22/15 6:03 PM, oleg yusim wrote:
>
>> Absolutely. But we are not talking about that type of data leakage here.
>> We are talking about a potential situation when a user who doesn't have
>> access to the database, but has (or gained) access to the Linux box the DB
>> is installed on, gets his hands on data the database processes stored in
>> memory (memory would be a common resource here).
>>
>
> Of far larger concern at that point is unauthorized access to the database
> files.
>
> Basically, if someone gains access to the OS user that Postgres is running
> as, or to root, it's game-over.
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: [GENERAL] Shared system resources

2015-12-22 Thread oleg yusim
John,

Absolutely. But we are not talking about that type of data leakage here. We
are talking about a potential situation when a user who doesn't have access
to the database, but has (or gained) access to the Linux box the DB is
installed on, gets his hands on data the database processes stored in memory
(memory would be a common resource here).

Thanks,

Oleg

On Tue, Dec 22, 2015 at 5:28 PM, John R Pierce <pie...@hogranch.com> wrote:

> On 12/22/2015 2:52 PM, oleg yusim wrote:
>
>
> *The DBMS must prevent unauthorized and unintended information transfer
> via shared system resources.*
>
>
>
> you realize the database *itself* is a shared system resource and of
> and by itself has no idea about unauthorized/unintended information
> transfer, eg, any user with the proper credentials to connect to the
> database can query any tables those credentials are allowed to?
>
> --
> john r pierce, recycling bits in santa cruz
>
>


Re: [GENERAL] Session Identifiers

2015-12-21 Thread oleg yusim
Melvin,

I promised to let you know results of my experiment, so here is goes:


tcp_keepalives_idle = 900
tcp_keepalives_interval=0
tcp_keepalives_count=0

It doesn't terminate the connection to the database after 15 minutes of
inactivity at the psql prompt. So, it looks like that would work only for
the case where the network connection is broken and the session is left
hanging. For the psql prompt case it looks like pg_terminate_backend()
would be the only solution.

Thanks,

Oleg



On Sun, Dec 20, 2015 at 11:33 AM, Melvin Davidson <melvin6...@gmail.com>
wrote:

> Actually, I'm not an expert on the tcp_keepalives, but I  believe the 
> tcp_keepalives_count
> should be 1, otherwise it will take 45 minutes to time out. Then
> again, I could be wrong.
>
> On Sun, Dec 20, 2015 at 12:28 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
>> oleg yusim <olegyu...@gmail.com> writes:
>> > Got it, thanks... Now, is it any protection in place currently against
>> > replacing Session ID (my understanding, it is kept in memory, belonging
>> to
>> > the session process) or against guessing Session ID (i.e. is Session ID
>> > generated using FIPS 140-2 compliant algorithms, or anything of that
>> sort)?
>>
>> I don't think Postgres even has any concept that matches what you seem
>> to think a Session ID is.
>>
>> If you're looking for communication security/integrity checking, that's
>> something we leave to other software such as SSL.
>>
>> regards, tom lane
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
>
>
> --
> *Melvin Davidson*
> I reserve the right to fantasize.  Whether or not you
> wish to share my fantasy is entirely up to you.
>


Re: [GENERAL] Session Identifiers

2015-12-21 Thread oleg yusim
Melvin, Stephen,

Thanks for your responses, guys. I think we can finally put this topic to
bed with that - I have a satisfactory answer. For those who would be
interested and would dig into this topic later on, here is a fairly detailed
explanation of how to use pg_terminate_backend in this case, coupled with
usage of pg_stat_activity and cron (it also has code):
http://stackoverflow.com/questions/12391174/how-to-close-idle-connections-in-postgresql-automatically
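
For reference, the core of that approach boils down to a query along these
lines (a minimal sketch, assuming the 9.2+ pg_stat_activity column names;
the 15-minute threshold is an assumption):

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND (now() - state_change) > interval '15 minutes'
  AND pid <> pg_backend_pid();  -- don't terminate the session running the sweep

Run from cron (or from a background worker, as suggested earlier in the
thread), this terminates sessions that have sat idle past the threshold.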

Thanks everybody,

Oleg

On Mon, Dec 21, 2015 at 10:51 AM, Melvin Davidson <melvin6...@gmail.com>
wrote:

> Pursuant to Stephen's suggestion, I've attached a script that you can
> execute from a cron. I wrote it when I was working for a previous company
> that used to have users that opened connections
> and transactions that did nothing for a long time.
> Just adjust the max_time for your liking. You can also add OR
> current_query = '' to kill stagnant connections.
>
> On Mon, Dec 21, 2015 at 11:42 AM, Stephen Frost <sfr...@snowman.net>
> wrote:
>
>> Oleg,
>>
>> * oleg yusim (olegyu...@gmail.com) wrote:
>> > tcp_keepalives_idle = 900
>> > tcp_keepalives_interval=0
>> > tcp_keepalives_count=0
>> >
>> > It doesn't terminate the connection to the database after 15 minutes of
>> > inactivity at the psql prompt. So, it looks like that would work only for
>> > the case where the network connection is broken and the session is left
>> > hanging. For the psql prompt case it looks like pg_terminate_backend()
>> > would be the only solution.
>>
>> Those settings aren't for controlling idle timeout of a connection.
>>
>> pg_terminate_backend() will work and could be run out of a cronjob.
>>
>> Thanks!
>>
>> Stephen
>>
>
>
>
> --
> *Melvin Davidson*
> I reserve the right to fantasize.  Whether or not you
> wish to share my fantasy is entirely up to you.
>


[GENERAL] Session Identifiers

2015-12-20 Thread oleg yusim
Greetings!

I'm new to PostgreSQL, working on it from the point of view of Cyber
Security assessment. In regards to that, here are my questions:

From the security standpoint we have to assure that the database invalidates
session identifiers upon user logout or other session termination (timeouts
count too).

Does PostgreSQL perform this type of action? If so, where are those
Session IDs stored, so I can verify it?

Thanks,

Oleg


Re: [GENERAL] Session Identifiers

2015-12-20 Thread oleg yusim
Hi Melvin,

Thank you very much, that logging option really helps (I need to give
instructions that people who are not very code literate are capable of
executing). And, point taken about exact version and environment -
PostgreSQL 9.4.5, Linux box.

Thanks,

Oleg
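
For reference, a minimal postgresql.conf sketch of the log_line_prefix
suggestion quoted below (%c is the session ID escape, %p the PID, %u@%d the
user and database; the exact prefix layout here is an assumption):

log_line_prefix = '%m [%p] %c %u@%d '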

On Sun, Dec 20, 2015 at 10:19 AM, Melvin Davidson <melvin6...@gmail.com>
wrote:

> PostgreSQL does not "store" the session_id per se in any system
> catalogs/tables, however, you can configure the log_line_prefix in
> postgresql.conf to record it for each connection. It will then be stored in
> the postgresql log file.
> Please note that in the future, it is always helpful to provide the exact
> version of PostgreSQL and the O/S you are working with.
>
> On Sun, Dec 20, 2015 at 11:08 AM, Pavel Stehule <pavel.steh...@gmail.com>
> wrote:
>
>> Hi
>>
>> 2015-12-20 16:16 GMT+01:00 oleg yusim <olegyu...@gmail.com>:
>>
>>> Greetings!
>>>
>>> I'm new to PostgreSQL, working on it from the point of view of Cyber
>>> Security assessment. In regards to that, here are my questions:
>>>
>>> From the security standpoint we have to assure that database invalidates
>>> session identifiers upon user logout or other session termination (timeout
>>> counts too).
>>>
>>> Does PostgreSQL perform this type of action? If so, where are those
>>> Session IDs stored, so I can verify it?
>>>
>>
>> Postgres is based on processes - for any session is created new process
>> when user is logged and this process is destroyed when user does logout.
>> Almost all data are in process memory only, but shared data related to
>> sessions are stored in shared memory - in array of PGPROC structures.
>> Postgres invalidates these data immediately when process is destroyed.
>> Search PGPROC in our code. Look to postmaster.c, where these operations are
>> described.
>>
>> What I know, there are not any other session data - so when process is
>> destroyed, then all is destroyed by o.s.
>>
>> Can be totally different if you use some connection pooler like pgpool or
>> pgbouncer - these applications can reuse Postgres server sessions for more
>> user sessions.
>>
>> Regards
>>
>> Pavel
>>
>>
>>>
>>> Thanks,
>>>
>>> Oleg
>>>
>>
>>
>
>
> --
> *Melvin Davidson*
> I reserve the right to fantasize.  Whether or not you
> wish to share my fantasy is entirely up to you.
>


Re: [GENERAL] Session Identifiers

2015-12-20 Thread oleg yusim
Got it, thanks... Now, is there any protection in place currently against
replacing the Session ID (my understanding is it is kept in memory, belonging
to the session process) or against guessing the Session ID (i.e. is the
Session ID generated using FIPS 140-2 compliant algorithms, or anything of
that sort)?

Oleg

On Sun, Dec 20, 2015 at 11:02 AM, Pavel Stehule <pavel.steh...@gmail.com>
wrote:

>
>
> 2015-12-20 17:52 GMT+01:00 oleg yusim <olegyu...@gmail.com>:
>
>> Hi Pavel,
>>
>> Thanks, for your response, it helps. Now, from my observations
>> (PostgreSQL 9.4.5, installed on Linux box), if I enter psql prompt at my
>> ssh to the box session and leave it open like that, it doesn't time out. Is
>> it really a case? Session to PostgreSQL DB doesn't terminate on timeout (or
>> rather doesn't have one), or I just happened to miss configuration option?
>>
>>
> any unbound process started as custom session means critical error - and
> there are not any related known bug. Postgres hasn't any build option for
> terminating session. If you need it - the pgbouncer has one or you can
> terminate session via pg_terminate_backend and cron. Maybe somebody will
> write background worker for this purpose. Internally, the system processes
> and sessions has pretty strong relation in Postgres. - there cannot be
> process without session and session without process.
>
> Pavel
>
>
>> Thanks,
>>
>> Oleg
>>
>> On Sun, Dec 20, 2015 at 10:08 AM, Pavel Stehule <pavel.steh...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> 2015-12-20 16:16 GMT+01:00 oleg yusim <olegyu...@gmail.com>:
>>>
>>>> Greetings!
>>>>
>>>> I'm new to PostgreSQL, working on it from the point of view of Cyber
>>>> Security assessment. In regards to that, here are my questions:
>>>>
>>>> From the security standpoint we have to assure that database
>>>> invalidates session identifiers upon user logout or other session
>>>> termination (timeout counts too).
>>>>
>>>> Does PostgreSQL perform this type of action? If so, where are those
>>>> Session IDs stored, so I can verify it?
>>>>
>>>
>>> Postgres is based on processes - for any session is created new process
>>> when user is logged and this process is destroyed when user does logout.
>>> Almost all data are in process memory only, but shared data related to
>>> sessions are stored in shared memory - in array of PGPROC structures.
>>> Postgres invalidates these data immediately when process is destroyed.
>>> Search PGPROC in our code. Look to postmaster.c, where these operations are
>>> described.
>>>
>>> What I know, there are not any other session data - so when process is
>>> destroyed, then all is destroyed by o.s.
>>>
>>> Can be totally different if you use some connection pooler like pgpool
>>> or pgbouncer - these applications can reuse Postgres server sessions for
>>> more user sessions.
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Oleg
>>>>
>>>
>>>
>>
>


Re: [GENERAL] Session Identifiers

2015-12-20 Thread oleg yusim
Thank you very much Melvin, once again, very useful. So, let me see if I
got it right: the following configuration should cause my database connection
to terminate in 15 minutes, right?

tcp_keepalives_idle = 900
tcp_keepalives_interval=1
tcp_keepalives_count=3

Oleg

On Sun, Dec 20, 2015 at 11:14 AM, Melvin Davidson <melvin6...@gmail.com>
wrote:

> Regarding timeouts, PostgreSQL will use the system tcp_keepalives_* parms
> by default, but you can also configure it separately in postgresql.conf.
> http://www.postgresql.org/docs/9.4/static/runtime-config-connection.html
>
> I suggest you review all available parameters in the postgresql.conf, as
> it will probably answer some additional questions for you.
>
> On Sun, Dec 20, 2015 at 12:02 PM, Pavel Stehule <pavel.steh...@gmail.com>
> wrote:
>
>>
>>
>> 2015-12-20 17:52 GMT+01:00 oleg yusim <olegyu...@gmail.com>:
>>
>>> Hi Pavel,
>>>
>>> Thanks, for your response, it helps. Now, from my observations
>>> (PostgreSQL 9.4.5, installed on Linux box), if I enter psql prompt at my
>>> ssh to the box session and leave it open like that, it doesn't time out. Is
>>> it really a case? Session to PostgreSQL DB doesn't terminate on timeout (or
>>> rather doesn't have one), or I just happened to miss configuration option?
>>>
>>>
>> any unbound process started as custom session means critical error - and
>> there are not any related known bug. Postgres hasn't any build option for
>> terminating session. If you need it - the pgbouncer has one or you can
>> terminate session via pg_terminate_backend and cron. Maybe somebody will
>> write background worker for this purpose. Internally, the system processes
>> and sessions has pretty strong relation in Postgres. - there cannot be
>> process without session and session without process.
>>
>> Pavel
>>
>>
>>> Thanks,
>>>
>>> Oleg
>>>
>>> On Sun, Dec 20, 2015 at 10:08 AM, Pavel Stehule <pavel.steh...@gmail.com
>>> > wrote:
>>>
>>>> Hi
>>>>
>>>> 2015-12-20 16:16 GMT+01:00 oleg yusim <olegyu...@gmail.com>:
>>>>
>>>>> Greetings!
>>>>>
>>>>> I'm new to PostgreSQL, working on it from the point of view of Cyber
>>>>> Security assessment. In regards to that, here are my questions:
>>>>>
>>>>> From the security standpoint we have to assure that database
>>>>> invalidates session identifiers upon user logout or other session
>>>>> termination (timeout counts too).
>>>>>
>>>>> Does PostgreSQL perform this type of action? If so, where are those
>>>>> Session IDs stored, so I can verify it?
>>>>>
>>>>
>>>> Postgres is based on processes - for any session is created new process
>>>> when user is logged and this process is destroyed when user does logout.
>>>> Almost all data are in process memory only, but shared data related to
>>>> sessions are stored in shared memory - in array of PGPROC structures.
>>>> Postgres invalidates these data immediately when process is destroyed.
>>>> Search PGPROC in our code. Look to postmaster.c, where these operations are
>>>> described.
>>>>
>>>> What I know, there are not any other session data - so when process is
>>>> destroyed, then all is destroyed by o.s.
>>>>
>>>> Can be totally different if you use some connection pooler like pgpool
>>>> or pgbouncer - these applications can reuse Postgres server sessions for
>>>> more user sessions.
>>>>
>>>> Regards
>>>>
>>>> Pavel
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Oleg
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> *Melvin Davidson*
> I reserve the right to fantasize.  Whether or not you
> wish to share my fantasy is entirely up to you.
>


Re: [GENERAL] Session Identifiers

2015-12-20 Thread oleg yusim
Tom,

I understand the idea that for external communication you rely on SSL.
However, how about me opening a psql prompt into the database directly from
the Linux box my db is installed on? I thought it would be considered a
local connection and would not go through the SSL channels. If that is the
case, here we would be dealing with Session IDs belonging to the DB itself,
not OpenSSL.

Please, correct me if I'm wrong.

Thanks,

Oleg

On Sun, Dec 20, 2015 at 11:28 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> oleg yusim <olegyu...@gmail.com> writes:
> > Got it, thanks... Now, is it any protection in place currently against
> > replacing Session ID (my understanding, it is kept in memory, belonging
> to
> > the session process) or against guessing Session ID (i.e. is Session ID
> > generated using FIPS 140-2 compliant algorithms, or anything of that
> sort)?
>
> I don't think Postgres even has any concept that matches what you seem
> to think a Session ID is.
>
> If you're looking for communication security/integrity checking, that's
> something we leave to other software such as SSL.
>
> regards, tom lane
>


Re: [GENERAL] Session Identifiers

2015-12-20 Thread oleg yusim
Thanks Melvin,

Let me experiment with it for a bit. I will let you know results.

Oleg

On Sun, Dec 20, 2015 at 11:33 AM, Melvin Davidson <melvin6...@gmail.com>
wrote:

> Actually, I'm not an expert on the tcp_keepalives, but I  believe the 
> tcp_keepalives_count
> should be 1, otherwise it will take 45 minutes to time out. Then
> again, I could be wrong.
>
> On Sun, Dec 20, 2015 at 12:28 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
>> oleg yusim <olegyu...@gmail.com> writes:
>> > Got it, thanks... Now, is it any protection in place currently against
>> > replacing Session ID (my understanding, it is kept in memory, belonging
>> to
>> > the session process) or against guessing Session ID (i.e. is Session ID
>> > generated using FIPS 140-2 compliant algorithms, or anything of that
>> sort)?
>>
>> I don't think Postgres even has any concept that matches what you seem
>> to think a Session ID is.
>>
>> If you're looking for communication security/integrity checking, that's
>> something we leave to other software such as SSL.
>>
>> regards, tom lane
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
>
>
> --
> *Melvin Davidson*
> I reserve the right to fantasize.  Whether or not you
> wish to share my fantasy is entirely up to you.
>


Re: [GENERAL] Session Identifiers

2015-12-20 Thread oleg yusim
So Pavel, are you saying there is no such thing as a Session ID in the
PostgreSQL DB at all? Everything is tied to the process the session is
associated with, so in essence the pid is the session id?

Oleg

On Sun, Dec 20, 2015 at 11:40 AM, Pavel Stehule <pavel.steh...@gmail.com>
wrote:

>
>
> 2015-12-20 18:37 GMT+01:00 oleg yusim <olegyu...@gmail.com>:
>
>> Tom,
>>
>> I understand the idea that for external communication you rely on SSL.
>> However, how about me opening psql prompt into the database directly from
>> my Linux box, my db is installed at? I thought, it would be considered
>> local connection and would not go through the SSL channels. If that is the
>> case, here we would be dealing with Session IDs belonging to DB itself, not
>> OpenSSL.
>>
>
> all necessary data are stored local in process memory. No session ID is
> required.
>
> Pavel
>
>
>>
>> Please, correct me if I'm wrong.
>>
>> Thanks,
>>
>> Oleg
>>
>> On Sun, Dec 20, 2015 at 11:28 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>
>>> oleg yusim <olegyu...@gmail.com> writes:
>>> > Got it, thanks... Now, is it any protection in place currently against
>>> > replacing Session ID (my understanding, it is kept in memory,
>>> belonging to
>>> > the session process) or against guessing Session ID (i.e. is Session ID
>>> > generated using FIPS 140-2 compliant algorithms, or anything of that
>>> sort)?
>>>
>>> I don't think Postgres even has any concept that matches what you seem
>>> to think a Session ID is.
>>>
>>> If you're looking for communication security/integrity checking, that's
>>> something we leave to other software such as SSL.
>>>
>>> regards, tom lane
>>>
>>
>>
>


Re: [GENERAL] Session Identifiers

2015-12-20 Thread oleg yusim
Hi Pavel,

Thanks for your response, it helps. Now, from my observations (PostgreSQL
9.4.5, installed on a Linux box), if I enter a psql prompt in my ssh session
to the box and leave it open like that, it doesn't time out. Is that really
the case? Does a session to the PostgreSQL DB not terminate on timeout (or
rather, not have one), or did I just happen to miss a configuration option?

Thanks,

Oleg

On Sun, Dec 20, 2015 at 10:08 AM, Pavel Stehule <pavel.steh...@gmail.com>
wrote:

> Hi
>
> 2015-12-20 16:16 GMT+01:00 oleg yusim <olegyu...@gmail.com>:
>
>> Greetings!
>>
>> I'm new to PostgreSQL, working on it from the point of view of Cyber
>> Security assessment. In regards to that, here are my questions:
>>
>> From the security standpoint we have to assure that database invalidates
>> session identifiers upon user logout or other session termination (timeout
>> counts too).
>>
>> Does PostgreSQL perform this type of action? If so, where are those
>> Session IDs stored, so I can verify it?
>>
>
> Postgres is based on processes - for any session is created new process
> when user is logged and this process is destroyed when user does logout.
> Almost all data are in process memory only, but shared data related to
> sessions are stored in shared memory - in array of PGPROC structures.
> Postgres invalidates these data immediately when process is destroyed.
> Search PGPROC in our code. Look to postmaster.c, where these operations are
> described.
>
> What I know, there are not any other session data - so when process is
> destroyed, then all is destroyed by o.s.
>
> Can be totally different if you use some connection pooler like pgpool or
> pgbouncer - these applications can reuse Postgres server sessions for more
> user sessions.
>
> Regards
>
> Pavel
>
>
>>
>> Thanks,
>>
>> Oleg
>>
>
>


Re: [GENERAL] Session Identifiers

2015-12-20 Thread oleg yusim
Hi Andrew,

Exactly! The vulnerability is a direct analogue of the Unix shell one. The
way we generally deal with the Unix shell vulnerability is to configure the
shell to terminate on its own if a timeout is exceeded. The question here is,
can we configure the psql client to behave the same way?

Thanks,

Oleg


On Sun, Dec 20, 2015 at 1:38 PM, Andrew Sullivan <a...@crankycanuck.ca>
wrote:

> On Sun, Dec 20, 2015 at 11:25:45AM -0600, oleg yusim wrote:
> > Thanks you very much Melvin, once again, very useful. So, let me see if I
> > got it right, following configuration should cause my database connection
> > to terminate in 15 minutes, right?
> >
> > tcp_keepalives_idle = 900
> > tcp_keepalives_interval=1
> > tcp_keepalives_count=3
>
> Only if your psql session ends.  Psql is a client program.  It keeps
> its connection to the database alive.
>
> In this sense, the vulnerability you're looking at is analogous to the
> case where someone logs into a UNIX shell and then leaves the shell
> open.  If the system can be compromised such that someone else can get
> control of that shell, you have a problem.  Otherwise, the session
> can't really be taken over.  So, your exposure is exactly as great as
> the exposure from UNIX process takeover.
>
> You can prove to yourself that the process doesn't linger by opening
> up a TCP connection (or for that matter a UNIX socket connection) and
> somehow making the containing program fail (e.g. open a psql
> connection and then sever your connection to the machine that had the
> shell that initiated the psql session, without properly closing the
> shell so that the session hangs around).  Eventually, the Postgres
> backend will try to talk to the session and discover it isn't there,
> and you'll get a termination logged (assuming you have logging turned
> up that high).
>
> A
>
> --
> Andrew Sullivan
> a...@crankycanuck.ca
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] json indexing and data types

2015-12-12 Thread Oleg Bartunov
On Wed, Dec 2, 2015 at 11:48 AM, Kaare Rasmussen  wrote:

> Hi
>
> As json essentially only has three basic scalar data types (string,
> number, and boolean), I wonder how much of this - to index, search, and sort on
> unstructured data -  is possible. I guess part of the answer would be
> 'jsquery and vodka', but let me describe the problem first.
>
> The basics is, that I have a column with what is essentially json data; a
> number of data structures of different depths. Perhaps 10 - 30 top levels,
> and probably no more than 3, max 4 levels deep. In total there are some
> hundred thousands of rows in each table. It would probably be best stored
> as jsonb. Right now it's text, because it's only used by the application
> itself.
>
> It would be incredibly useful to add an index to this column, and to be
> able to search, using the index, on arbitrary elements. This part seems
> already there, with jsquery.
>
> The hard part is that some of the data items really have another type.
> There are dates and floating points, as the most important ones. And the
> really hard part is that sorting and range searches are important,
> especially for these two types. Having dates is iso-format, and
> left-padding floats with zeros is a low tech solution, and especially the
> latter is not very efficient.
>
> The solution might be to add functional indexes for these data items, but
> it's cumbersome and not easily maintainable. If a one-stop solution is in
> the works, or already there, it could save a lot of time.
>
>
This is a known problem; that's why we stopped developing jsquery and are
working on an sql-level query language for jsonb - then you'll use all the
power and extendability of SQL.  The idea is to use the power of subselects
and unnest to unroll jsonb to the sql level.
There is a presentation from pgconf.eu on this:
https://wiki.postgresql.org/images/4/4e/Createam.pdf, see slide #27

But I'm afraid it won't come until 9.6.
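
In the meantime, the functional-index workaround mentioned in the question
looks roughly like this (a minimal sketch; the table and key names are made
up for illustration):

CREATE TABLE docs (doc jsonb);

-- a btree expression index gives typed range scans and sorting:
CREATE INDEX docs_price_idx ON docs (((doc->>'price')::numeric));

-- a range search that can use the index:
SELECT * FROM docs
WHERE (doc->>'price')::numeric BETWEEN 10.0 AND 20.0
ORDER BY (doc->>'price')::numeric;

For date keys the cast has to be wrapped in an IMMUTABLE function, because
casting text to date depends on the datestyle setting.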




> /kaare
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
Andreas, Adrian,

Thank you very much for both pieces of information. It was very helpful.
Now, let me ask you one more question on the same topic. Is there a more
granular way to control logging that PostgreSQL provides, or am I pretty
much reduced to choosing between mod and all?

The reason I'm asking is because with 'all' the volume of daily logging
becomes truly ginormous. And for my purposes, I really do not need all the
SELECT statements to be logged - only those which are responsible for
explicit querying of roles/privileges (so, \du, \dp, \z, in essence).

Thanks,

Oleg

On Thu, Dec 10, 2015 at 12:14 PM, Andreas Kretschmer <
akretsch...@spamfence.net> wrote:

> oleg yusim <olegyu...@gmail.com> wrote:
>
> > Greetings!
> >
> > I'm new to PostgreSQL, working on it from the point of view of Cyber
> Security
> > assessment. In regards to that, here is my question:
> >
> > Is there a way to enable logging for psql prompt meta-commands, such as
> \du, \dp,
> > \z, etc?
>
> start psql with -E
>
>
> Andreas
> --
> Really, I'm not out to destroy Microsoft. That will just be a completely
> unintentional side effect.  (Linus Torvalds)
> "If I was god, I would recompile penguin with --enable-fly."   (unknown)
> Kaufbach, Saxony, Germany, Europe.  N 51.05082°, E 13.56889°
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
Thanks John, I realized that and confirmed it in my logs. What I'm trying to
determine now is whether I can log only some SELECT statements, or whether I
have to log all of them or none of them.

Oleg

On Thu, Dec 10, 2015 at 1:40 PM, John R Pierce <pie...@hogranch.com> wrote:

> On 12/10/2015 9:58 AM, oleg yusim wrote:
>
>> I'm new to PostgreSQL, working on it from the point of view of Cyber
>> Security assessment. In regards to that, here is my question:
>>
>> Is there a way to enable logging for psql prompt meta-commands, such as \du,
>> \dp, \z, etc?
>>
>
> what the other two gentlemen are trying to say is the metacommands are
> shortcuts for more complex SQL queries of the pg_catalog schema, so to log
> them, you would need to log all queries and filter for accesses to the
> pg_catalog
>
>
>
> --
> john r pierce, recycling bits in santa cruz
>
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
Hmm... let me see if I got it right. I can set log_statement to mod
overall, and then modify it with ALTER USER postgres SET log_statement=all;
for postgres only? Also... how do we control who can run meta-commands?

Thanks,

Oleg

On Thu, Dec 10, 2015 at 2:16 PM, Jerry Sievers <gsiever...@comcast.net>
wrote:

> Scott Mead <sco...@openscg.com> writes:
>
> > On Thu, Dec 10, 2015 at 2:50 PM, oleg yusim <olegyu...@gmail.com> wrote:
> >
> > Thanks John, I realized that and confirmed in my logs. What I'm
> trying to determine now, can I only log some SELECT statements, or I should
> log all of them or none
> > of them.
> >
> > You can configure this to multiple levels:
> >
> >  Global, per-user, per-database
> >
> > ALTER USER postgres SET log_min_duration_statement=0;
> > ALTER DATABASE xyz SET log_min_duration_statement=0;
> >
> >   That being said, you would want to make sure that the user
> > issuing the largest volume of queries is not set with this, otherwise,
> > you could potentially flood your logs with every single query
> > issued.  This has a tendency to cause performance
> > problems.
>
> > The other item of note is that, once logged in, the
> > user could change that value with a similar ALTER statement.
>
> No, not really.  Unprivileged users can't frob those settings.
>
>
> > select name, context from pg_settings where name ~ '^log.*statement$'
> order by 1;
            name            |  context
----------------------------+-----------
 log_min_duration_statement | superuser
 log_min_error_statement    | superuser
 log_statement              | superuser
(3 rows)
>
> > select version();
                                           version
---------------------------------------------------------------------------------------------
 PostgreSQL 9.3.4 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
(1 row)
>
> >
>
>
> > create role foo;
> CREATE ROLE
> > set role foo;
> SET
> > set log_statement to 'none';
> ERROR:  permission denied to set parameter "log_statement"
> > set log_min_duration_statement to -1;
> ERROR:  permission denied to set parameter "log_min_duration_statement"
> >
>
> >
> > --Scott
> > PostgreSQL database experts
> > http://www.openscg.com
> >
> > Oleg
> >
> > On Thu, Dec 10, 2015 at 1:40 PM, John R Pierce <pie...@hogranch.com>
> wrote:
> >
> > On 12/10/2015 9:58 AM, oleg yusim wrote:
> >
> > I'm new to PostgreSQL, working on it from the point of view
> of Cyber Security assessment. In regards to that, here is my question:
> >
> > Is there a way to enable logging for psql prompt meta-commands,
> such as \du, \dp, \z, etc?
> >
> > what the other two gentlemen are trying to say is the
> metacommands are shortcuts for more complex SQL queries of the pg_catalog
> schema, so to log them, you
> > would need to log all queries and filter for accesses to the
> pg_catalog
> >
> > --
> > john r pierce, recycling bits in santa cruz
> >
> > --
> > Sent via pgsql-general mailing list (
> pgsql-general@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-general
> >
>
> --
> Jerry Sievers
> Postgres DBA/Development Consulting
> e: postgres.consult...@comcast.net
> p: 312.241.7800
>


Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
Adrian,

What I hope to achieve is to meet this requirement from Database SRG:

*Review DBMS documentation to verify that audit records can be produced
when privileges/permissions/role memberships are retrieved.*

To do that I would need to enable logging of such commands as \du, \dp, \z.
At the same time, I do not want to get 20 GB of logs on a daily basis by
setting log_statement = 'all'. So, I'm trying to find a way in between.

Thanks,

Oleg



On Thu, Dec 10, 2015 at 3:29 PM, Adrian Klaver <adrian.kla...@aklaver.com>
wrote:

> On 12/10/2015 12:56 PM, oleg yusim wrote:
>
>> So what I want to accomplish is logging queries for roles/privileges
>> with minimal increasing volume of logs along the way. The idea I got
>> from responses in this thread so far is:
>>
>> 1) Set log_statement on postgresql.conf to 'mod'
>> 2) Raise log_statement to 'all' but only for postgres superuser
>>
>> What seems to be open questions to me with this model:
>>
>> 1) Way to check what log_statement set to on per user basis (what table
>> should I query?)
>> 2) Way to ensure that only superuser can run meta commands, such as \du,
>> \dp, \z
>>
>
> Maybe if you tell us what you hope to achieve, monitoring or access denial
> and to what purpose, it might be possible to come up with a more complete
> answer.
>
>
>> Thanks,
>>
>> Oleg
>>
>> On Thu, Dec 10, 2015 at 2:50 PM, David G. Johnston
>> <david.g.johns...@gmail.com <mailto:david.g.johns...@gmail.com>> wrote:
>>
>> On Thu, Dec 10, 2015 at 1:46 PM, oleg yusim <olegyu...@gmail.com
>> <mailto:olegyu...@gmail.com>>wrote:
>>
>> Hi David,
>>
>> Can you, please, give me example?
>>
>>
>> ​Not readily...maybe others can.  Putting forth specific examples of
>> what you want to accomplish may help.
>>
>> David J.​
>>
>>
>>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
Adrian,

You seem to be familiar with the STIG world, so how about the V-ID from the
Database SRG? I'm looking into STIG ID: SRG-APP-91-DB-66 right now.
Now, I do not really think it is a tall order, since the requirement speaks
about explicit calls for privilege/permission/role membership information.
Internal checks, which are going on all the time, do not count.

Thanks,

Oleg

On Thu, Dec 10, 2015 at 4:03 PM, Adrian Klaver <adrian.kla...@aklaver.com>
wrote:

> On 12/10/2015 01:36 PM, oleg yusim wrote:
>
>> Adrian,
>>
>> What I hope to achieve is to meet this requirement from Database SRG:
>>
>
> So some aspect of this:
>
> https://www.stigviewer.com/stig/database_security_requirements_guide/
>
> Can you be more specific?
>
>
>> /Review DBMS documentation to verify that audit records can be produced
>> when privileges/permissions/role memberships are retrieved./
>>
>
> That is a tall order, that is an almost constant process.
>
> /
>> /
>> To do that I would need to enable logging of such commands as \du, \dp,
>> \z. At the same time, I do not want to get 20 GB of logs on the daily
>> basis, by setting log_statement = 'all'. So, I'm trying to find a way in
>> between.
>>
>
> Any way you look at this is going to require pulling in and analyzing a
> great deal of information. That is why I asked for the specific
> requirement, to help determine exactly what is being required?
>
>
>> Thanks,
>>
>> Oleg
>>
>>
>>
>> On Thu, Dec 10, 2015 at 3:29 PM, Adrian Klaver
>> <adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com>> wrote:
>>
>> On 12/10/2015 12:56 PM, oleg yusim wrote:
>>
>> So what I want to accomplish is logging queries for
>> roles/privileges
>> with minimal increasing volume of logs along the way. The idea I
>> got
>> from responses in this thread so far is:
>>
>> 1) Set log_statement on postgresql.conf to 'mod'
>> 2) Raise log_statement to 'all' but only for postgres superuser
>>
>> What seems to be open questions to me with this model:
>>
>> 1) Way to check what log_statement set to on per user basis
>> (what table
>> should I query?)
>> 2) Way to ensure that only superuser can run meta commands, such
>> as \du,
>> \dp, \z
>>
>>
>> Maybe if you tell us what you hope to achieve, monitoring or access
>> denial and to what purpose, it might be possible to come up with a
>> more complete answer.
>>
>>
>> Thanks,
>>
>> Oleg
>>
>> On Thu, Dec 10, 2015 at 2:50 PM, David G. Johnston
>> <david.g.johns...@gmail.com <mailto:david.g.johns...@gmail.com>
>> <mailto:david.g.johns...@gmail.com
>> <mailto:david.g.johns...@gmail.com>>> wrote:
>>
>>  On Thu, Dec 10, 2015 at 1:46 PM, oleg yusim
>> <olegyu...@gmail.com <mailto:olegyu...@gmail.com>
>>  <mailto:olegyu...@gmail.com
>> <mailto:olegyu...@gmail.com>>>wrote:
>>
>>  Hi David,
>>
>>  Can you, please, give me example?
>>
>>
>>  ​Not readily...maybe others can.  Putting forth specific
>> examples of
>>  what you want to accomplish may help.
>>
>>  David J.​
>>
>>
>>
>>
>> --
>> Adrian Klaver
>> adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com>
>>
>>
>>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
So what I want to accomplish is logging queries for roles/privileges with
minimal increase in log volume along the way. The idea I got from
responses in this thread so far is:

1) Set log_statement on postgresql.conf to 'mod'
2) Raise log_statement to 'all' but only for postgres superuser

What seems to be open questions to me with this model:

1) A way to check what log_statement is set to on a per-user basis (what
table should I query? - see the sketch after this list)
2) A way to ensure that only a superuser can run meta-commands such as \du,
\dp, \z
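
For question 1, a sketch of the catalog query (pg_db_role_setting holds the
per-role and per-database overrides; a zero setrole/setdatabase means the
setting applies to all roles/databases):

SELECT coalesce(r.rolname, '(all roles)')     AS role,
       coalesce(d.datname, '(all databases)') AS database,
       s.setconfig
FROM pg_db_role_setting s
LEFT JOIN pg_roles    r ON r.oid = s.setrole
LEFT JOIN pg_database d ON d.oid = s.setdatabase;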

Thanks,

Oleg

On Thu, Dec 10, 2015 at 2:50 PM, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Thu, Dec 10, 2015 at 1:46 PM, oleg yusim <olegyu...@gmail.com> wrote:
>
>> Hi David,
>>
>> Can you, please, give me example?
>>
>>
> ​Not readily...maybe others can.  Putting forth specific examples of what
> you want to accomplish may help.
>
> David J.​
>
>


Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
Hi David,

Can you, please, give me example?

Thanks,

Oleg

On Thu, Dec 10, 2015 at 2:25 PM, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Thu, Dec 10, 2015 at 1:20 PM, oleg yusim <olegyu...@gmail.com> wrote:
>
>> Also... how do we control who can run meta commands?
>>
>
> ​You cannot do so directly but you can control permissions to the
> underlying schema that the meta-command queries touch so that attempting to
> run the meta-command fails.
>
> This is not as simple as it may seem; visibility of the schema is
> considerably more permissive than visibility of data.
>
> David J.​
>
>
>


Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
Adrian,

I used all those settings you suggested already, and I will suggest using
logrotate or syslog on top of it, so no more than a day's worth of logs would
be kept in the system. Still, I view it as a big drawback. Do you know of any
third-party tools (the log monitoring/analysis software you mentioned)
PostgreSQL integrates nicely with?

Oleg

On Thu, Dec 10, 2015 at 4:45 PM, Adrian Klaver <adrian.kla...@aklaver.com>
wrote:

> On 12/10/2015 02:13 PM, oleg yusim wrote:
>
>> Adrian,
>>
>> You seemed to be familiar with the STIG world, so how about V-ID from
>>
>
> No, I am just familiar with how search engines work:)
>
> Database SRG? I'm looking into STIG ID: SRG-APP-91-DB-66 right
>> now. Now, I do not really think it is a tall order, since the
>> requirement speaks about explicit calls for privilege/permission/role
>> membership information. Internal checks, which are going on all the time
>> do not count.
>>
>
> http://securityrules.info/about/xivik-hagym-bupih-dityx/SV-42701r3_rule
>
> Hope there is another rule that keeps someone from writing code that
> masquerades as internal checks.
>
> At any rate it seems the immediate solution is going to lie in some sort
> of log monitoring/analysis software that pulls out queries of interest and
> adds them to another file. You can limit the Postgres logs that are kept
> around using the settings here:
>
> http://www.postgresql.org/docs/9.4/interactive/runtime-config-logging.html
>
> Some of which are:
>
> log_rotation_age (integer)
>
> When logging_collector is enabled, this parameter determines the
> maximum lifetime of an individual log file. After this many minutes have
> elapsed, a new log file will be created. Set to zero to disable time-based
> creation of new log files. This parameter can only be set in the
> postgresql.conf file or on the server command line.
>
> log_rotation_size (integer)
>
> When logging_collector is enabled, this parameter determines the
> maximum size of an individual log file. After this many kilobytes have been
> emitted into a log file, a new log file will be created. Set to zero to
> disable size-based creation of new log files. This parameter can only be
> set in the postgresql.conf file or on the server command line.
>
> log_truncate_on_rotation (boolean)
>
> When logging_collector is enabled, this parameter will cause
> PostgreSQL to truncate (overwrite), rather than append to, any existing log
> file of the same name. However, truncation will occur only when a new file
> is being opened due to time-based rotation, not during server startup or
> size-based rotation. When off, pre-existing files will be appended to in
> all cases. For example, using this setting in combination with a
> log_filename like postgresql-%H.log would result in generating twenty-four
> hourly log files and then cyclically overwriting them. This parameter can
> only be set in the postgresql.conf file or on the server command line.
>
> Example: To keep 7 days of logs, one log file per day named
> server_log.Mon, server_log.Tue, etc, and automatically overwrite last
> week's log with this week's log, set log_filename to server_log.%a,
> log_truncate_on_rotation to on, and log_rotation_age to 1440.
>
> Example: To keep 24 hours of logs, one log file per hour, but also
> rotate sooner if the log file size exceeds 1GB, set log_filename to
> server_log.%H%M, log_truncate_on_rotation to on, log_rotation_age to 60,
> and log_rotation_size to 1000000. Including %M in log_filename allows any
> size-driven rotations that might occur to select a file name different from
> the hour's initial file name.
>
>
>
>> Thanks,
>>
>> Oleg
>>
>> On Thu, Dec 10, 2015 at 4:03 PM, Adrian Klaver
>> <adrian.kla...@aklaver.com <mailto:adrian.kla...@aklaver.com>> wrote:
>>
>> On 12/10/2015 01:36 PM, oleg yusim wrote:
>>
>> Adrian,
>>
>> What I hope to achieve is to meet this requirement from Database
>> SRG:
>>
>>
>> So some aspect of this:
>>
>> https://www.stigviewer.com/stig/database_security_requirements_guide/
>>
>> Can you be more specific?
>>
>>
>> /Review DBMS documentation to verify that audit records can be
>> produced
>> when privileges/permissions/role memberships are retrieved./
>>
>>
>> That is a tall order, that is an almost constant process.
>>
>> /
>> /
>> To do that I would need to enable logging of such commands as
>> \du, \dp,
>> \z. At the same time, I do not want to

Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
Thanks Tom, I get what you are saying and that seems to be final at this
stage. I will write pg_audit down, though.

Oleg

On Thu, Dec 10, 2015 at 4:41 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> oleg yusim <olegyu...@gmail.com> writes:
> > What I hope to achieve is to meet this requirement from Database SRG:
> > *Review DBMS documentation to verify that audit records can be produced
> > when privileges/permissions/role memberships are retrieved.*
>
> > To do that I would need to enable logging of such commands as \du, \dp,
> \z.
> > At the same time, I do not want to get 20 GB of logs on the daily basis,
> by
> > setting log_statement = 'all'. So, I'm trying to find a way in between.
>
> As multiple people have noted, it's a serious error to imagine that your
> requirement is "log \du etc".  Those are just handy macros for queries on
> the system catalogs, which could also be done in other ways.  What you
> seem to need is server-side logging of queries that access specific system
> catalog columns.  There's no out-of-the-box facility for that right now,
> short of log_statement = all which you've already rejected.
>
> It'd be possible to write a C-code extension that did something like
> that, and some work in that direction has already gone on; the pg_audit
> extension that didn't quite get into 9.5 might come close to your
> requirements.
>
> regards, tom lane
>


Re: [GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
John,

I can answer that - Oracle and MS SQL do, or at least they were able to
convince DISA that they do (STIGs for them are present here:
http://iase.disa.mil/stigs/Pages/a-z.aspx). That actually benefits those
products greatly - from the point of view of security they, once hardened,
meet Federal security requirements and as such can be used in multiple
products other DBs can't be (for that very reason).

Thanks,

Oleg

On Thu, Dec 10, 2015 at 4:52 PM, John R Pierce <pie...@hogranch.com> wrote:

> On 12/10/2015 2:03 PM, Adrian Klaver wrote:
>
>>
>> So some aspect of this:
>>
>> https://www.stigviewer.com/stig/database_security_requirements_guide/
>>
>
> that's a rather insane bunch of requirements.   Reads like a wish list by
> academic security researchers.
>
> for instance
>
> https://www.stigviewer.com/stig/database_security_requirements_guide/2015-06-23/finding/V-58123
>
> ??!?   The database server has no clue about the difference between an
> "application that it supports" and a user directly querying.  The PSQL
> shell, or dbadmin, is an 'application that it supports'.
>
> at this point, speaking purely as a interested outsider (I am in no way
> representing hte PG Development Group), I'd guess PostgreSQL probably
> doesn't meet 2/3rds of those 'findings'.   I truly wonder if any standard
> RDBMS supports all or even most of them?!?
>
>
>
>
> --
> john r pierce, recycling bits in santa cruz
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


[GENERAL] Logging psql meta-commands

2015-12-10 Thread oleg yusim
Greetings!

I'm new to PostgreSQL, working on it from the point of view of Cyber
Security assessment. In regards to that, here is my question:

Is there a way to enable logging for psql prompt meta-commands, such as \du,
\dp, \z, etc?

Thanks,

Oleg


Re: [GENERAL] JSON path wild cards?

2015-11-24 Thread Oleg Bartunov
On Tue, Nov 24, 2015 at 12:39 PM, Dennis <denn...@visi.com> wrote:

> Is there a way to specify a wild card in a json path?
>
> For example I have the following json doc:
>
> [ {"a":1,"b": [ { "x": 7,"y":8,"z":9} ] },  {"a":2,"b": [ { "x":
> 4,"y":5,"z":6} ] }, ... ]
>
> How do I write a select clause that can return the values for all b.x
> values - something like [{b:x}] - that would return all the b:x values in
> the array? e.g. 7 and 4 ...
>
> Also is there a definition of the syntax of a proper json path for use in
> postgres?
>

Not in postgres currently, but you may try our jsquery extension
https://github.com/postgrespro/jsquery.


Oleg
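
For reference, plain SQL can already unroll this particular document without
jsquery (a minimal sketch, assuming the array sits in a jsonb column "doc"
of a table "t"):

SELECT b_elem->>'x' AS x
FROM t,
     jsonb_array_elements(doc)         AS a_elem,
     jsonb_array_elements(a_elem->'b') AS b_elem;

For the sample document this returns 7 and 4. What it cannot give you is an
indexed wildcard path, which is where jsquery comes in.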


>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


[GENERAL] Is there any way to measure disk activity for each query?

2015-06-18 Thread Oleg Serov
Hello!

I'm wondering if there is any way to measure how much disk I/O was generated
by a query?



-- 
Best Regards,
Oleg
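
For reference, two built-in ways to get at this (a sketch; the table name is
made up, and pg_stat_statements must first be listed in
shared_preload_libraries):

-- one-off, per query: buffer hits/reads/writes shown in the plan
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM some_table;

-- aggregated per normalized query, via the pg_stat_statements extension:
CREATE EXTENSION pg_stat_statements;
SELECT query, shared_blks_read, shared_blks_written,
       temp_blks_read, temp_blks_written
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;

Note that these counters are from the PostgreSQL buffer cache's point of
view; a "read" may still be satisfied by the OS page cache rather than the
physical disk.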


Re: [GENERAL] [to_tsvector] German Compound Words

2015-05-28 Thread Oleg Bartunov
ts_debug() ?

=# select * from ts_debug('english', 'messages');
   alias   |   description   |  token   |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+----------+----------------+--------------+----------
 asciiword | Word, all ASCII | messages | {english_stem} | english_stem | {messag}
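
For reference, the configuration step described in the quoted question below
typically looks like this (a sketch; the configuration name is an
assumption):

CREATE TEXT SEARCH CONFIGURATION german_compound (COPY = pg_catalog.german);
ALTER TEXT SEARCH CONFIGURATION german_compound
    ALTER MAPPING FOR asciiword, word, hword, hword_part
    WITH german_hunspell, german_stem;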


On Thu, May 28, 2015 at 2:05 PM, Sven R. Kunze srku...@tbz-pariv.de wrote:

 Hi everybody,

 what do I need to do in order to enable compound word handling in
 PostgreSQL tsvector implementation?

 I run an Ubuntu 14.04 machine, PostgreSQL 9.3, have installed package
 hunspell-de-de and already created a new dictionary as described here:
 http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY

 CREATE TEXT SEARCH DICTIONARY german_hunspell (
 TEMPLATE = ispell,
 DictFile = de_de,
 AffFile = de_de,
 StopWords = german
 );

 Furthermore, created a new test text search configuration (copied from
 german) and updated all parser parts where the german_stem dictionary is
 used so that it uses german_hunspell first and then german_stem.

 However, ts_vector still does not work for the compound words such as:

 wasserkraft - wasserkraft, kraft
 schifffahrt - schifffahrt, fahrt
 blindflansch - blindflansch, flansch

 etc.


 What have I done wrong here?

 --
 Sven R. Kunze
 TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
 Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
 e-mail: srku...@tbz-pariv.de
 web: www.tbz-pariv.de

 Managing Director: Dr. Reiner Wohlgemuth
 Registered office: Chemnitz
 Commercial register: Chemnitz HRB 8543



 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Oleg Bartunov
You can ask http://snowball.tartarus.org/ for a stemmer fix. Meanwhile,
you can have a small personal dictionary (before the stemmer) with such
exceptions; for example, use the synonym template

system system

Oleg
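
A sketch of wiring that up (the file and configuration names are
assumptions; the synonym file lives in $SHAREDIR/tsearch_data/):

-- german_fix.syn contains lines like:  system system
CREATE TEXT SEARCH DICTIONARY german_fix (
    TEMPLATE = synonym,
    SYNONYMS = german_fix
);
CREATE TEXT SEARCH CONFIGURATION german_fixed (COPY = pg_catalog.german);
ALTER TEXT SEARCH CONFIGURATION german_fixed
    ALTER MAPPING FOR asciiword WITH german_fix, german_stem;

Words listed in the synonym file are matched before the stemmer runs, so
to_tsvector('german_fixed', 'system') keeps 'system' instead of re-stemming
it to 'syst'.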


On Tue, May 26, 2015 at 11:18 AM, Sven R. Kunze srku...@tbz-pariv.de
wrote:

 Hi everybody,

 the following stemming results made me curious:

 select to_tsvector('german', 'systeme');  'system':1
 select to_tsvector('german', 'systemes');  'system':1
 select to_tsvector('german', 'systems');  'system':1
 select to_tsvector('german', 'systemen');  'system':1
 select to_tsvector('german', 'system');   'syst':1


 First of all, this seems to be a bug in the German stemmer. Where can I
 fix it?

 Second, and more importantly, as I understand it, the stemmed version of a
 word should be considered normalized. That is, all other versions of that
 stem should be mapped to it as well. The interesting problem here is that
 PostgreSQL maps the stem itself ('system') to a completely different stem
 ('syst').

 Should a stem not remain stable even when to_tsvector is called on it
 multiple times?

 --
 Sven R. Kunze
 TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
 Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
 e-mail: srku...@tbz-pariv.de
 web: www.tbz-pariv.de

 Managing Director: Dr. Reiner Wohlgemuth
 Registered office: Chemnitz
 Commercial register: Chemnitz HRB 8543



 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-22 Thread Oleg Bartunov
Have you seen http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf ?
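
For reference, the counting half of this looks roughly like the sketch below
with pg_trgm (the 0.5 threshold is an assumption; collapsing each similarity
cluster to a single representative row is the part the question calls out
and that this sketch does not solve):

SELECT set_limit(0.5);  -- similarity threshold used by the % operator

SELECT d1.my_value, count(*) AS similar_count
FROM data d1
JOIN data d2 ON d1.my_value % d2.my_value  -- can use the gin_trgm_ops index
GROUP BY d1.my_value;

Each row also matches itself (similarity 1), so the counts include the
self-match.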

On Thu, May 14, 2015 at 9:58 PM, Cory Tucker cory.tuc...@gmail.com wrote:

 [pg version 9.3 or 9.4]

 Suppose I have a simple table:

 create table data (
   my_value  TEXT NOT NULL
 );
 CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);


 Now I would like to essentially do group by to get a count of all the
 values that are sufficiently similar.  I can do it using something like a
 CROSS JOIN to join the table on itself, but then I still am getting all the
 rows with duplicate counts.

 Is there a way to do a group by query and only return a single my_value
 column and a count of the number of times other values are similar while
 also not returning the included similar values in the output, too?




Re: [GENERAL] Find similar records (compare tsvectors)

2015-03-06 Thread Oleg Bartunov
On Fri, Mar 6, 2015 at 5:05 PM, Patrick Dung patrick_...@yahoo.com.hk wrote:
 Resend.

 How to quickly compare the similarity of two tsvector?


check http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf



 On Monday, March 2, 2015 11:01 PM, Patrick Dung patrick_...@yahoo.com.hk
 wrote:


 Hello,

 I had a database with articles or attachment stored in bytea format.
 I also had a trigger: it insert/update the tsv column when a record is
 added/updated.
 The tsv column had a GIN index.
 With this setting, I can do very fast keyword search on the tsv.

 Suppose I had a specific record (id=10).
 How to list similar records based on ranking?
 In that case, I had to compare a tsvector with another tsvector.

 I had this SQL which makes the original tsv into text and then into a tsquery.
 Then I can compare a tsv and a tsquery.
 SELECT ts_rank(i.tsv, replace(strip(original.tsv)::text, ' ', '|')::tsquery)
 as similarity, i.company, i.industry, i.post_timestamp, i.id FROM items i,
 (SELECT tsv, id FROM items WHERE id=10) AS original WHERE i.id !=
 original.id ORDER BY similarity;

 items table:
 id bigint
 company varchar
 industry varchar
 description varchar
 post_timestamp timestamp
 attachment bytea
 tsv tsvector

 The problem is that this is very slow.
 Any comment?

 Thank and regards,
 Patrick




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] WAL supported extension

2015-01-18 Thread Oleg Bartunov
On Sun, Jan 18, 2015 at 9:15 AM, Michael Paquier michael.paqu...@gmail.com
wrote:

 On Sat, Jan 17, 2015 at 11:18 PM, Kouhei Sutou k...@cozmixng.org wrote:
  (Is this mailing list right mailing list for asking this
  question...?)
 Hackers would have been fine as well.

  Is there any plan to implement PostgreSQL API to implement
  WAL supported extension?
 Not that I know of, the last discussion I recall on the matter being this
 one:

 http://www.postgresql.org/message-id/capphfdsxwzmojm6dx+tjnpyk27kt4o7ri6x_4oswcbyu1rm...@mail.gmail.com
 --


We are eager for development of this API.


 Michael


 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



Re: [GENERAL] WAL supported extension

2015-01-18 Thread Oleg Bartunov
On Sun, Jan 18, 2015 at 2:01 PM, Michael Paquier michael.paqu...@gmail.com
wrote:

 Oleg Bartunov wrote:
  We are eager for development of this API.
 Yeah, me too actually :) Oleg, are there plans on your side to do
 something in this area for 9.6?


Yes, Alexander Korotkov will continue working on this. Are you coming to
Moscow for pgconf.ru? We'll have a good time discussing this.

Oleg


 --
 Michael


