Re: [GENERAL] multi terabyte fulltext searching

2007-03-22 Thread Arturo Perez
On Wed, 21 Mar 2007 08:57:39 -0700, Benjamin Arai wrote:

 Hi Oleg,
 
 I am currently using GiST indexes because I receive about 10GB of new data
 a week (and I am not deleting any information).  I do not expect to stop
 receiving text for about 5 years, so the data is not going to become static
 any time soon.  The reason I am concerned with performance is that I am
 providing a search system for several newspapers going back essentially to
 the beginning of time.  Many bibliographers etc. would like to use this
 utility, but if each search takes too long I am not going to be able to
 support many concurrent users.
 
 Benjamin



At a previous job, I built a system to do this.  We had 3,000 publications
and approx. 70M newspaper articles.  Total content size (postprocessed) was
on the order of 100GB, IIRC.  We used a proprietary (closed-source,
not ours) search engine.

In order to reach subsecond response time we needed to horizontally scale
to about 50-70 machines, each a low-end Dell 1650.  This was after about 5
years of trying to vertically scale.

-arturo




[GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Benjamin Arai

Hi,

I have been struggling to get fulltext searching working for very large
databases.  I can fulltext index tens of gigs without any problem, but
when I start getting to hundreds of gigs it becomes slow.  My current
system is a quad core with 8GB of memory.  I have the resources to
throw more hardware at it, but realistically it is not cost effective
to buy a system with 128GB of memory.  Are there any solutions that
people have come up with for indexing very large text databases?


Essentially I have several terabytes of text that I need to index.
Each record is about 5 paragraphs of text.  I am currently using
TSearch2 (stemming, etc.) and getting sub-optimal results: queries
take more than a second to execute.  Has anybody implemented such a
database using multiple systems or some special add-on to TSearch2 to
make things faster?  I want to do something like partitioning the
data across multiple systems and merging the ranked results at some
master node.  Is something like this possible for PostgreSQL, or does
it require a custom software solution?


Benjamin



Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Oleg Bartunov

Benjamin,

as one of the authors of tsearch2, I'd like to know more about your setup.
tsearch2 in 8.2 has GIN index support, which scales much better than the
old GiST index.
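
For example, a minimal sketch with contrib/tsearch2 on 8.2 (the table and
column names here are illustrative, not from Benjamin's actual schema):

  -- assumes contrib/tsearch2 is installed; 'default' is its stock config
  ALTER TABLE documents ADD COLUMN tsv tsvector;
  UPDATE documents SET tsv = to_tsvector('default', body);

  -- keep tsv current on insert/update with the standard tsearch2 trigger
  CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON documents
      FOR EACH ROW EXECUTE PROCEDURE tsearch2(tsv, body);

  -- 8.2: GIN instead of the old GiST index
  CREATE INDEX documents_tsv_gin ON documents USING gin (tsv);

  SELECT id FROM documents
  WHERE tsv @@ to_tsquery('default', 'newspaper & archive');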

Oleg

On Wed, 21 Mar 2007, Benjamin Arai wrote:


Hi,

I have been struggling to get fulltext searching working for very large
databases.  I can fulltext index tens of gigs without any problem, but when I
start getting to hundreds of gigs it becomes slow.  My current system is a
quad core with 8GB of memory.  I have the resources to throw more hardware at
it, but realistically it is not cost effective to buy a system with 128GB of
memory.  Are there any solutions that people have come up with for indexing
very large text databases?


Essentially I have several terabytes of text that I need to index.  Each
record is about 5 paragraphs of text.  I am currently using TSearch2
(stemming, etc.) and getting sub-optimal results: queries take more than a
second to execute.  Has anybody implemented such a database using multiple
systems or some special add-on to TSearch2 to make things faster?  I want to
do something like partitioning the data across multiple systems and merging
the ranked results at some master node.  Is something like this possible for
PostgreSQL, or does it require a custom software solution?


Benjamin



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Joshua D. Drake
Benjamin Arai wrote:
 Hi,
 
 I have been struggling to get fulltext searching working for very large
 databases.  I can fulltext index tens of gigs without any problem, but
 when I start getting to hundreds of gigs it becomes slow.  My current
 system is a quad core with 8GB of memory.  I have the resources to throw
 more hardware at it, but realistically it is not cost effective to buy a
 system with 128GB of memory.  Are there any solutions that people have
 come up with for indexing very large text databases?

GiST indexes are very large.
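
One way to see this for yourself (size functions available since 8.1; the
table and index names are illustrative, carried over from the sketch above):

  SELECT pg_size_pretty(pg_relation_size('documents_tsv_gist')) AS index_size,
         pg_size_pretty(pg_relation_size('documents'))          AS table_size;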

 Essentially I have several terabytes of text that I need to index.  Each
 record is about 5 paragraphs of text.  I am currently using TSearch2
 (stemming, etc.) and getting sub-optimal results: queries take more
 than a second to execute.

You are complaining about more than a second with a terabyte of text?


  Has anybody implemented such a database using
 multiple systems or some special add-on to TSearch2 to make things
 faster?  I want to do something like partitioning the data across multiple
 systems and merging the ranked results at some master node.  Is
 something like this possible for PostgreSQL, or does it require a custom
 software solution?
 
 Benjamin
 
 


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Benjamin Arai

Hi Oleg,

I am currently using GiST indexes because I receive about 10GB of new
data a week (and I am not deleting any information).  I do not expect to
stop receiving text for about 5 years, so the data is not going to become
static any time soon.  The reason I am concerned with performance is that
I am providing a search system for several newspapers going back
essentially to the beginning of time.  Many bibliographers etc. would
like to use this utility, but if each search takes too long I am not
going to be able to support many concurrent users.


Benjamin

On Mar 21, 2007, at 8:42 AM, Oleg Bartunov wrote:


Benjamin,

as one of the authors of tsearch2, I'd like to know more about your setup.
tsearch2 in 8.2 has GIN index support, which scales much better than the
old GiST index.

Oleg

On Wed, 21 Mar 2007, Benjamin Arai wrote:


[...]








Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Benjamin Arai
True, but what happens when my database reaches 100 terabytes?  Is 5
seconds OK?  How about 10?  My problem is that I do not believe the
performance loss I am experiencing as the data grows is logarithmic in
the number of records.  This worries me because I could be doing
something wrong, or I might be able to do something better.


Benjamin

On Mar 21, 2007, at 8:49 AM, Joshua D. Drake wrote:


Benjamin Arai wrote:

[...]


GiST indexes are very large.

[...]


You are complaining about more than a second with a terabyte of text?



[...]










Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Teodor Sigaev
I'm afraid that fulltext search on a multi-terabyte set of documents cannot
be implemented on any RDBMS, at least not on a single box.  Specialized
fulltext search engines (with exact matching and a time to search of about
one second) have a practical limit near 20 million docs; a cluster, near
100 million.  Bigger collections require engines like Google's.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Benjamin Arai
By the way, what is the largest TSearch2 database that you know of,
and how fast does it return results?  Maybe my expectations are
unrealistic.


Benjamin

On Mar 21, 2007, at 8:42 AM, Oleg Bartunov wrote:


Benjamin,

as one of the authors of tsearch2, I'd like to know more about your setup.
tsearch2 in 8.2 has GIN index support, which scales much better than the
old GiST index.

Oleg

On Wed, 21 Mar 2007, Benjamin Arai wrote:


[...]








Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Joshua D. Drake
Benjamin Arai wrote:
 True, but what happens when my database reaches 100 terabytes?  Is 5
 seconds OK?  How about 10?  My problem is that I do not believe the
 performance loss I am experiencing as the data grows is logarithmic in
 the number of records.  This worries me because I could be doing
 something wrong, or I might be able to do something better.

Well, there are a couple of things you could do, especially if you have
the ability to throw hardware at it.

How many spindles do you have?

J


 
 Benjamin
 
 On Mar 21, 2007, at 8:49 AM, Joshua D. Drake wrote:
 
 [...]
 






Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Oleg Bartunov

On Wed, 21 Mar 2007, Benjamin Arai wrote:


Hi Oleg,

I am currently using GiST indexes because I receive about 10GB of new data a
week (and I am not deleting any information).  I do not expect to stop
receiving text for about 5 years, so the data is not going to become static
any time soon.  The reason I am concerned with performance is that I am
providing a search system for several newspapers going back essentially to
the beginning of time.  Many bibliographers etc. would like to use this
utility, but if each search takes too long I am not going to be able to
support many concurrent users.




GiST is OK for your feed, but the archive part should use a GIN index.
Inheritance + CE (constraint exclusion) should make your life easier.
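
A rough sketch of that layout, with illustrative names (none of these are
from Benjamin's actual schema):

  -- one parent, two children: a writable feed and a static archive
  CREATE TABLE articles (id int, published date, body text, tsv tsvector);

  CREATE TABLE articles_feed    () INHERITS (articles);  -- receives new data
  CREATE TABLE articles_archive () INHERITS (articles);  -- never changes

  -- GiST copes better with heavy insert/update traffic;
  -- GIN searches static data faster
  CREATE INDEX articles_feed_tsv    ON articles_feed    USING gist (tsv);
  CREATE INDEX articles_archive_tsv ON articles_archive USING gin  (tsv);

Queries against the parent table see both children.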



Benjamin

On Mar 21, 2007, at 8:42 AM, Oleg Bartunov wrote:


[...]






Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Benjamin Arai

24.

Benjamin

On Mar 21, 2007, at 9:09 AM, Joshua D. Drake wrote:


Benjamin Arai wrote:

[...]


Well, there are a couple of things you could do, especially if you have
the ability to throw hardware at it.

How many spindles do you have?

J




[...]






Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Benjamin Arai
Can't you implement something similar to Google by aggregating
results from TSearch2 over many machines?


Benjamin
On Mar 21, 2007, at 8:59 AM, Teodor Sigaev wrote:

I'm afraid that fulltext search on a multi-terabyte set of documents
cannot be implemented on any RDBMS, at least not on a single box.
Specialized fulltext search engines (with exact matching and a time to
search of about one second) have a practical limit near 20 million docs;
a cluster, near 100 million.  Bigger collections require engines like
Google's.










Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Benjamin Arai

What is inheritance+CE?

Benjamin
On Mar 21, 2007, at 9:10 AM, Oleg Bartunov wrote:


inheritance+CE




Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Teodor Sigaev
I am currently using GiST indexes because I receive about 10GB of new
data a week (and I am not deleting any information).  I do not expect to
stop receiving text for about 5 years, so the data is not going to become
static any time soon.  The reason I am concerned with performance is that
I am providing a search system for several newspapers going back
essentially to the beginning of time.  Many bibliographers etc. would
like to use this utility, but if each search takes too long I am not
going to be able to support many concurrent users.


Use GiST and GIN indexes together: keep any data older than one month
(which doesn't change) under a GIN index and new data under GiST, and once
a month move data from the GiST side to the GIN side.
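
A sketch of that monthly move, reusing the illustrative feed/archive tables
from the partitioning sketch above:

  -- shift everything older than the current month into the GIN-indexed archive
  BEGIN;
  INSERT INTO articles_archive
      SELECT * FROM articles_feed
      WHERE published < date_trunc('month', now());
  DELETE FROM articles_feed
      WHERE published < date_trunc('month', now());
  COMMIT;
  VACUUM ANALYZE articles_feed;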





Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Oleg Bartunov

On Wed, 21 Mar 2007, Benjamin Arai wrote:


What is inheritance+CE?


Hmm,

http://www.postgresql.org/docs/8.2/static/ddl-inherit.html
http://www.postgresql.org/docs/8.2/static/ddl-partitioning.html
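
Concretely, with the illustrative children from the earlier sketch: give
each child a CHECK constraint on the partitioning key, turn constraint
exclusion on, and the planner skips partitions that cannot match (the
cutoff date below is made up):

  ALTER TABLE articles_archive ADD CONSTRAINT archive_old
      CHECK (published <  DATE '2007-03-01');
  ALTER TABLE articles_feed    ADD CONSTRAINT feed_new
      CHECK (published >= DATE '2007-03-01');

  SET constraint_exclusion = on;

  -- only articles_feed is scanned for this query
  EXPLAIN SELECT * FROM articles
  WHERE published >= DATE '2007-03-01'
    AND tsv @@ to_tsquery('default', 'election');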




Benjamin
On Mar 21, 2007, at 9:10 AM, Oleg Bartunov wrote:


inheritance+CE










Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Oleg Bartunov

On Wed, 21 Mar 2007, Benjamin Arai wrote:

Can't you implement something similar to Google by aggregating results from
TSearch2 over many machines?


tsearch2 doesn't use any global statistics, so, in principle, you
should be able to run FTS on several machines and combine the results
using dblink (contrib/dblink).
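
A hand-rolled sketch of that merge, with made-up shard hosts and the
illustrative articles table from earlier; each shard ranks its own hits and
the master re-sorts the union:

  SELECT id, r
  FROM (
      SELECT * FROM dblink('host=shard1 dbname=news',
          'SELECT id, rank(tsv, to_tsquery(''default'', ''election'')) AS r
             FROM articles
            WHERE tsv @@ to_tsquery(''default'', ''election'')
            ORDER BY r DESC LIMIT 10') AS t1(id int, r real)
      UNION ALL
      SELECT * FROM dblink('host=shard2 dbname=news',
          'SELECT id, rank(tsv, to_tsquery(''default'', ''election'')) AS r
             FROM articles
            WHERE tsv @@ to_tsquery(''default'', ''election'')
            ORDER BY r DESC LIMIT 10') AS t2(id int, r real)
  ) AS merged
  ORDER BY r DESC
  LIMIT 10;

Because tsearch2's rank() depends only on the document itself (no global
corpus statistics), the per-shard ranks are directly comparable.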



Benjamin
On Mar 21, 2007, at 8:59 AM, Teodor Sigaev wrote:

[...]










Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Joshua D. Drake
Benjamin Arai wrote:
 24.

I can think of a couple of things.

1. Increase your spindle count.
2. Push your GiST indexes off to another array entirely (with separate
controllers).
3. Split your actual tables between other arrays.

Or... buy a SAN (but then again, I just replaced a million dollar SAN
with a split array/controller environment that cost 10% as much, and
the multiple-controller solution is kicking its butt).
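
Point 2 maps directly onto tablespaces; a sketch with a made-up mount point
on the second array, reusing the illustrative index from earlier:

  CREATE TABLESPACE fast_idx LOCATION '/mnt/array2/pg_indexes';

  -- (re)build the fulltext index on its own array/controller
  CREATE INDEX articles_archive_tsv ON articles_archive
      USING gin (tsv) TABLESPACE fast_idx;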


Joshua D. Drake


 
 Benjamin
 
 On Mar 21, 2007, at 9:09 AM, Joshua D. Drake wrote:
 
 [...]

 






Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Benjamin Arai
Are there any examples of dblink being used in commercial environments?
I am curious to understand how it deals with node failures, etc.


Benjamin

On Mar 21, 2007, at 9:35 AM, Oleg Bartunov wrote:


On Wed, 21 Mar 2007, Benjamin Arai wrote:

Can't you implement something similar to Google by aggregating results
from TSearch2 over many machines?


tsearch2 doesn't use any global statistics, so, in principle, you
should be able to run FTS on several machines and combine the results
using dblink (contrib/dblink).



Benjamin
On Mar 21, 2007, at 8:59 AM, Teodor Sigaev wrote:

[...]





