On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> >> Sushant Sinha writes:
> >>> Doesn't this force the headline to be taken from the first N words of
> >>> the document, independent of where the match was? That seems rather
> >>> unworkable,
The rank counts 1/coversize, so bigger covers will not have much impact
anyway. What is the need for the patch?
-Sushant.
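
To illustrate the point about 1/coversize, here is a toy sketch in Python. It is not the actual ts_rank_cd code: the function name `toy_cover_rank` is hypothetical, and real cover density ranking also applies lexeme weights and normalization that are ignored here. It only shows why a large cover adds little to the total.

```python
def toy_cover_rank(cover_sizes):
    """Sum 1/size over all covers (toy model, ignores weights and normalization)."""
    return sum(1.0 / size for size in cover_sizes if size > 0)


# One tight cover of 2 words versus the same cover plus a 100-word cover.
tight_only = toy_cover_rank([2])
with_big_cover = toy_cover_rank([2, 100])

# The 100-word cover contributes only 1/100 to the total.
print(tight_only, with_big_cover - tight_only)
```

Under this simplified model, dropping covers above some length changes the rank only by the small 1/size terms of those covers.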
On Fri, 2012-01-27 at 18:06 +0200, karave...@mail.bg wrote:
> Hello,
>
> I have developed a variation of cover density ranking functions that
> counts only covers that are le
On Mon, 2011-12-19 at 12:41 -0300, Euler Taveira de Oliveira wrote:
> On 19-12-2011 12:30, Sushant Sinha wrote:
> > I recently upgraded my postgres server from 9.0 to 9.1.2 and I am
> > finding a peculiar problem. I have a program that periodically adds rows
> > to
On Mon, 2011-12-19 at 19:08 +0200, Marti Raudsepp wrote:
> Another thought -- have you read about the GIN "fast updates" feature?
> This existed in 9.0 too. Instead of updating the index directly, GIN
> appends all changes to a sequential list, which needs to be scanned in
> whole for read queries.
I recently upgraded my postgres server from 9.0 to 9.1.2 and I am
finding a peculiar problem. I have a program that periodically adds rows
to this table using INSERT. Typically the number of rows is just 1-2
thousand when the table already has 500K rows. Whenever the program is
adding rows, the perf
.
-Sushant.
On Tue, 2011-10-25 at 23:45 +0530, Sushant Sinha wrote:
> On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
>
> > Assume, for example, that the postgres mailing list archive search used
> > tsearch (which I think it does, but I'm not sure). It'd then
On Fri, 2011-11-04 at 11:22 +0100, Pavel Stehule wrote:
> Hello
>
> I found an interesting issue when I checked tsearch prefix searching.
>
> We use an ispell-based dictionary
>
> CREATE TEXT SEARCH DICTIONARY cspell
>   (template=ispell, dictfile = czech, afffile=czech, stopwords=czech);
> CRE
On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
> Assume, for example, that the postgres mailing list archive search used
> tsearch (which I think it does, but I'm not sure). It'd then probably make
> sense to add "postgres" to the list of stopwords, because it's bound to
> appear in near
On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
> On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
> > I am currently using the prefix search feature in text search. I find
> > that the prefix characters are treated the same as a normal lexeme and
> > passed through
I am currently using the prefix search feature in text search. I find
that the prefix characters are treated the same as a normal lexeme and
passed through stemming and stopword dictionaries. This seems like a bug
to me.
db=# select to_tsquery('english', 's:*');
NOTICE: text-search query contain
>
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document? How is that helpful?
>
> regards, tom lane
>
There is
> > Here is a simple patch that limits the number of words during the
> > tokenization phase and puts an upper-bound on the headline generation.
>
> Doesn't this force the headline to be taken from the first N words of
> the document, independent of where the match was? That seems rather
> unwor
Given a document and a query, the goal of headline generation is to
produce text excerpts in which the query appears. Currently, headline
generation in postgres proceeds in the following steps:
1. Tokenize the documents and obtain the lexemes
2. Decide on lexemes that should be the part of the head
On Thu, 2011-07-21 at 15:31 +0200, Jan Urbański wrote:
> On 21/07/11 15:27, Sushant Sinha wrote:
> > I am using plpythonu on postgres 9.0.2. One of my python functions was
> > throwing a TypeError exception. However, I only see the exception in the
> > database and not the st
I am using plpythonu on postgres 9.0.2. One of my python functions was
throwing a TypeError exception. However, I only see the exception in the
database and not the stack trace. It becomes difficult to debug if the
stack trace is absent in Python.
logdb=# select get_words(forminput) from fi;
I am using pg_trgm for spelling correction as prescribed in the
documentation. But I see that it does not work for Unicode strings. The
database was initialized with utf8 encoding and the C locale.
Here is the table:
\d words
Table "public.words"
Column | Type | Modifiers
+---
I agree that it will be a good idea to rewrite the entire thing. However, in
the meantime, I sent a proposal earlier
http://archives.postgresql.org/pgsql-hackers/2010-08/msg00019.php
And a patch later:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00476.php
Tom asked me to look into
Do not know if this mail got lost in between or no one noticed it!
On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote:
Just a reminder that this patch is discussing how to break URLs, emails,
etc. into their components.
>
> On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane wrote:
> [
Just a reminder that this patch is discussing how to break URLs, emails, etc.
into their components.
On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane wrote:
> [ sorry for not responding on this sooner, it's been hectic the last
> couple weeks ]
>
> Sushant Sinha writes:
> >> I
Sorry for sounding the false alarm. I was not running the vanilla
postgres and that is why I was seeing that problem. Should have checked
with the vanilla one.
-Sushant
On Tue, 2010-12-21 at 23:03 -0500, Tom Lane wrote:
> Sushant Sinha writes:
> > There is a bug in ts_rank_cd. It
MY PREV EMAIL HAD A PROBLEM. Please reply to this one
==
There is a bug in ts_rank_cd. It does not correctly give rank when the
query lexeme is the first one in the tsvector.
Example:
select ts_rank_cd(to_tsvector('english', 'abc sdd'),
plainto
There is a bug in ts_rank_cd. It does not correctly give rank when the
query lexeme is the first one in the tsvector.
Example:
select ts_rank_cd(to_tsvector('english', 'abc sdd'),
plainto_tsquery('english', 'abc'));
ts_rank_cd
0
select ts_rank_cd(to_tsvector('english'
ski wrote:
> On 24/10/10 14:44, Sushant Sinha wrote:
> > I am using gin index on a tsvector and doing basic search. I see the
> > row-estimate of the planner to be horribly wrong. It is returning
> > row-estimate as 4843 for all queries whether it matches zero rows, a
> >
I am using gin index on a tsvector and doing basic search. I see the
row-estimate of the planner to be horribly wrong. It is returning
row-estimate as 4843 for all queries whether it matches zero rows, a
medium number of rows (88,000) or a large number of rows (726,000).
The table has roughly a mi
On Tue, 2010-10-12 at 19:31 -0400, Tom Lane wrote:
> This seems much of a piece with the existing proposal to allow
> individual "words" of a URL to be reported separately:
> https://commitfest.postgresql.org/action/patch_view?id=378
>
> As I said in that thread, this could be done in a backwards
Any updates on this?
On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha wrote:
> > I looked at this patch a bit. I'm fairly unhappy that it seems to be
> > inventing a brand new mechanism to do something the ts parser can
> > already do. Why didn't you code the
Your changes are mostly fine. They will get you tokens with "_"
characters in them. However, it is not nice to mix your new token with
an existing token like NUMWORD. Give a new name to your new type of
token, probably UnderscoreWord. Then on seeing "_", move to a state
that can identify the new token
> I looked at this patch a bit. I'm fairly unhappy that it seems to be
> inventing a brand new mechanism to do something the ts parser can
> already do. Why didn't you code the url-part mechanism using the
> existing support for compound words?
I am not familiar with compound word implementatio
For the headline generation to work properly, email/file/url/host need
to become skip tokens. Updating the patch with that change.
-Sushant.
On Sat, 2010-09-04 at 13:25 +0530, Sushant Sinha wrote:
> Updating the patch with emitting parttoken and registering it with
> snowball
Updating the patch with emitting parttoken and registering it with
snowball config.
-Sushant.
On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
> On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha wrote:
> > I have attached a patch that emits parts of a host token, a url token,
>
complicate the patch with that, I wanted to get feedback on any
other major problem with the patch.
-Sushant.
On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha writes:
> >> This would needlessly increase the number of tokens. Instead you'd
> >> better make
On Mon, 2010-08-02 at 09:32 -0400, Robert Haas wrote:
> On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha wrote:
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal eng
> On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> > 1. We do not have separate tokens "wikipedia" and "org"
> > 2. If we have the two tokens we should have them at adjacent position so
> > that a phrase search for "wikipedia org" should work.
>
Currently the english parser in text search does not support multiple
words in the same position. Consider a word "wikipedia.org". The text
search would return a single token "wikipedia.org". However if someone
searches for "wikipedia org" then there will not be a match. There are
two problems here
It seems like the ordering of lexemes in tsvector has changed from 8.3
to 8.4.
For example in 8.3.1,
postgres=# select to_tsvector('english', 'quit everytime');
to_tsvector
---
'quit':1 'everytim':2
The lexemes are arranged by length and then by string comparison
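
The ordering described for 8.3 (by length first, then by string comparison) can be sketched in Python as below. This is only an illustration of the stated rule, not PostgreSQL's code: the function name is hypothetical, and it assumes ASCII lexemes so that Python's character length and string comparison stand in for the server's byte-wise comparison.

```python
def order_lexemes_83_style(lexemes):
    """Sort lexemes by length, breaking ties by plain string comparison."""
    return sorted(lexemes, key=lambda lexeme: (len(lexeme), lexeme))


# 'quit' (4 characters) sorts before 'everytim' (8 characters), matching
# the 8.3.1 output quoted above.
print(order_lexemes_83_style(["everytim", "quit"]))  # ['quit', 'everytim']
```

A purely alphabetical ordering would instead put 'everytim' first, which is the kind of difference the message is pointing at.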
ts_headline calls a ts_lexize equivalent to break the text. Of course there
is an algorithm to process the tokens and generate the headline. I would be
really surprised if the algorithm to generate the headline is somehow
dependent on language (as it only processes the tokens). So Oleg is right
when he
On Tue, 2009-06-02 at 17:26 -0700, Josh Berkus wrote:
>
> * possible bug in cover density ranking?
>
> -- From Teodor's response, this is maybe a doc patch and not a code
> patch. Teodor? Oleg?
I personally think that this is a bug, because we are assigning very
high rank when we are no
Thanks,
Sushant.
On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall wrote:
> On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
> > Sushant Sinha wrote:
> >
> > > I think that dot should be considered as a word delimiter because
> > > when dot is not
Currently it seems that the dot is not considered a word delimiter
by the English parser.
lawdb=# select to_tsvector('english', 'Mr.J.Sai Deepak');
to_tsvector
-
'deepak':2 'mr.j.sai':1
(1 row)
So the word obtained is "mr.j.sai" rather than three words "
I see this as an open item here
http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
Any interest in fixing this?
-Sushant.
On Thu, 2009-01-29 at 13:54 -0500, Sushant Sinha wrote:
>
>
> On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev
> wrote:
> Is this what
:57 -0400, Tom Lane wrote:
> Sushant Sinha writes:
> > Sorry for the delay. Here is the patch with FragmentDelimiter option.
> > It requires an extra option in HeadlineParsedText and uses that option
> > during generateHeadline.
>
> I did some editing of the docum
Yeah, you are right. I did not know that you can pass a space using double
quotes.
-Sushant.
On Sun, 2009-03-01 at 20:49 -0500, Tom Lane wrote:
> Sushant Sinha writes:
> > FragmentDelimiter is an argument of the ts_headline function that separates
> > different headline fragments. The de
FragmentDelimiter is an argument of the ts_headline function that separates
different headline fragments. The default delimiter is " ... ".
Currently if someone specifies the delimiter as an option to the
function, no extra space is added around the delimiter. However, it does
not look good without spac
Sorry ... I thought you were running the development branch.
-Sushant.
On Sat, 2009-02-14 at 16:34 -0500, Tom Lane wrote:
> Sushant Sinha writes:
> > I think we currently do that.
>
> ... since about four months ago.
>
> 2008-10-17 14:05 teodor
>
> * doc/sr
> to '...' || _headline(v1.copy, query, 'MinWords = 17') || '...', but as you
> can clearly see this would always occur, and not be intelligent regarding
> the fragments. I hope that you're correct and that it is implemented, and
> not documented
>
>
I think we currently do that. We add ellipses only when we encounter a
new fragment. So there should not be ellipses if we are at the end of
the document or if that is the first fragment (includes the beginning of
the document). Here is the code in generateHeadline, ts_parse.c that
adds the ellipse
On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev wrote:
> Is this what is desired? It seems to me that Wdoc is getting a high
>> ranking even when we are not sure of the position information.
>>
> 0.1 is not very high rank, and we could not suggest any reasonable rank in
> this case. This document
I am running postgres 8.3.1. In tsrank.c I am looking at the cover
density function used for ranking while doing text search:
float4
calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method)
Here is the excerpt of code that I think may possibly have a bug when
the document is big enough to
m was wrong. ;-)
>
> ---
>
> Heikki Linnakangas wrote:
> > Sushant Sinha wrote:
> > > Patch #2. I think this is a straightforward bug fix.
> >
> > Yes, I think you're right. In hlCover(), *q is 0 when the only match is
> > the first item in th
AIL PROTECTED]
> wrote:
> Sushant Sinha escribió:
> > Any status updates on the following patches?
> >
> > 1. Fragments in tsearch2 headlines:
> > http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php
> >
> > 2. Bug in hlCover:
> > http://arc
Any status updates on the following patches?
1. Fragments in tsearch2 headlines:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php
2. Bug in hlCover:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00089.php
-Sushant.
On Mon, 2008-08-04 at 00:36 -0300, Euler Taveira de Oliveira wrote:
> Sushant Sinha escreveu:
> > I think there is a slight bug in hlCover function in wparser_def.c
> >
> The bug is not in the hlCover. In prsd_headline, if we didn't find a
> suitable bestlen (i.e. >
Has anyone noticed this?
-Sushant.
On Wed, 2008-07-16 at 23:01 -0400, Sushant Sinha wrote:
> I think there is a slight bug in hlCover function in wparser_def.c
>
> If there is only one query item and that is the first word in the text,
> then hlCover does not return any cover. Thi
file that tests different aspects of the
new headline generation function.
Let me know if anything else is needed.
-Sushant.
On Thu, 2008-07-24 at 00:28 +0400, Oleg Bartunov wrote:
> On Wed, 23 Jul 2008, Sushant Sinha wrote:
>
> > I guess it is more readable to add cover separator a
I guess it is more readable to add cover separator at the end of a fragment
than in the front. Let me know what you think and I can update it.
I think the right place for cover separator is in the structure
HeadlineParsedText just like startsel and stopsel. This will enable users to
specify their
I looked at query operators for tsquery and here are some of the new
query operators for position based queries. I am just proposing some
changes and the questions I have.
1. What is the meaning of such a query operator?
foo #5 bar -> true if the document has word "foo" followed by "bar" at
5th p
008, Sushant Sinha wrote:
>
> > I will add test queries and their results for the corner cases in a
> > separate file. I guess the only thing I am confused about is what should
> > be the behavior of headline generation when Query items have words of
> > size less than Short
I think there is a slight bug in hlCover function in wparser_def.c
If there is only one query item and that is the first word in the text,
then hlCover does not return any cover. This is evident in this example
when ts_headline only generates the min_words:
testdb=# select ts_headline('1 2 3 4 5
ur code:
>
> =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
> ts_headline
> --
> ... 2 ...
>
> and so on
>
>
> Oleg
> On Tue, 15 Jul 2008, Sushant Sinha wrote:
>
> >
Attached are two patches:
1. documentation
2. regression tests
for headline with fragments.
-Sushant.
On Tue, 2008-07-15 at 13:29 +0400, Teodor Sigaev wrote:
> > Attached a new patch that:
> >
> > 1. fixes previous bug
> > 2. better handles the case when cover size is greater than the MaxWords
Attached a new patch that:
1. fixes previous bug
2. better handles the case when cover size is greater than the MaxWords.
Basically it divides a cover greater than MaxWords into fragments of
MaxWords, resizes each such fragment so that each end of the fragment
contains a query word and then evalua
You are right. I did not do make clean last time. After make clean, make
all, and make install it works fine.
-Sushant.
On Thu, 2008-07-10 at 17:55 +0530, Pavan Deolasee wrote:
> On Thu, Jul 10, 2008 at 5:36 PM, Sushant Sinha <[EMAIL PROTECTED]> wrote:
> >
> >
> >
>
I am trying to generate a patch with respect to the current CVS head. So
I rsynced the tree, then did cvs up and installed the db. However, when
I did initdb on a data directory it is stuck:
It is stuck after printing creating template1
creating template1 database in /home/postgres/data/base/1 ..
I have attached an updated patch with the following changes:
1. Respects ShortWord and MinWords
2. Uses hlCover instead of Cover
3. Does not store norm (or lexeme) for headline marking
4. Removes ts_rank.h
5. Earlier it was counting even NONWORDTOKEN in the headline. Now it
only counts the actual w
On Tue, 2008-06-03 at 22:16 +0400, Teodor Sigaev wrote:
> > This is far more complicated than I thought.
> >> Of course, phrase search should be able to use indexes.
> > I can probably look into how to use index. Any pointers on this?
>
> src/backend/utils/adt/tsginidx.c, if you invent operation #
My main argument for using Cover instead of hlCover was that Cover will
be faster. I tested the default headline generation that uses hlCover
with the current patch that uses Cover. There was not much difference.
So I think you are right in that we do not need norms and we can just
use hlCover.
I
On Mon, 2008-06-02 at 19:39 +0400, Teodor Sigaev wrote:
>
> > I have attached a patch for phrase search with respect to the cvs head.
> > Basically it takes a phrase (text) and a TSVector. It checks if the
> > relative positions of lexeme in the phrase are same as in their
> > positions in TSVec
Efficiency: I realized that we do not need to store all norms. We only
need to store norms that are in the query. So I moved the addition
of norms from addHLParsedLex to hlfinditem. This should add very little
memory overhead to existing headline generation.
If this is still not acceptable f
I have attached a patch for phrase search with respect to the cvs head.
Basically it takes a phrase (text) and a TSVector. It checks if the
relative positions of the lexemes in the phrase are the same as their
relative positions in the TSVector.
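
The position check described above can be sketched as follows. This is a minimal Python illustration, not the attached patch: the function name `phrase_matches` and the dict-of-sets representation of a TSVector are assumptions, and it assumes the phrase lexemes sit at consecutive positions (no stopwords dropped in between).

```python
def phrase_matches(phrase_lexemes, tsvector_positions):
    """Return True if the lexemes occur at consecutive positions, in order.

    tsvector_positions maps each lexeme to the set of positions where it
    occurs (a stand-in for a TSVector, assumed for this sketch).
    """
    if not phrase_lexemes:
        return False
    # Try every occurrence of the first lexeme as a possible phrase start.
    for start in tsvector_positions.get(phrase_lexemes[0], ()):
        if all(
            (start + offset) in tsvector_positions.get(lexeme, ())
            for offset, lexeme in enumerate(phrase_lexemes)
        ):
            return True
    return False


doc = {"wikipedia": {1}, "org": {2}, "free": {4}}
print(phrase_matches(["wikipedia", "org"], doc))   # True
print(phrase_matches(["org", "wikipedia"], doc))   # False
```

Comparing offsets from the first lexeme keeps the check independent of where in the document the phrase starts.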
If the configuration for text search is "simple", then this will prod
I have attached a new patch with respect to the current cvs head. This
produces a headline for a document given a query. Basically it
identifies fragments of text that contain the query and displays them.
DESCRIPTION
HeadlineParsedText contains an array of actual words but not
information abou
ould we have an option to pass TSVector to headline
function?
-Sushant.
On Sat, 2008-05-24 at 07:57 +0400, Teodor Sigaev wrote:
> [moved to -hackers, because talk is about implementation details]
>
> > I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
>