Re: [GENERAL] multi terabyte fulltext searching
On Wed, 21 Mar 2007 08:57:39 -0700, Benjamin Arai wrote:
> I am currently using GiST indexes because I receive about 10GB of new
> data a week (then again, I am not deleting any information). [...] if
> each search takes too long I am not going to be able to support many
> concurrent users.

At a previous job, I built a system to do this. We had 3,000 publications and approximately 70M newspaper articles. Total content size (post-processed) was on the order of 100GB, IIRC. We used a proprietary (closed-source, not ours) search engine. In order to reach sub-second response times we needed to scale horizontally to about 50-70 machines, each a low-end Dell 1650. This was after about five years of trying to scale vertically.

-arturo
[GENERAL] multi terabyte fulltext searching
Hi,

I have been struggling to get fulltext searching working for very large databases. I can fulltext index 10s of gigs without any problem, but when I start getting into hundreds of gigs it becomes slow. My current system is a quad core with 8GB of memory. I have the resources to throw more hardware at it, but realistically it is not cost-effective to buy a system with 128GB of memory. Are there any solutions people have come up with for indexing very large text databases?

Essentially I have several terabytes of text that I need to index. Each record is about 5 paragraphs of text. I am currently using TSearch2 (stemming etc.) and getting sub-optimal results: queries take more than a second to execute. Has anybody implemented such a database using multiple systems, or some special add-on to TSearch2, to make things faster? I want to do something like partitioning the data across multiple systems and merging the ranked results at some master node. Is something like this possible with PostgreSQL, or must it be an external software solution?

Benjamin
Re: [GENERAL] multi terabyte fulltext searching
Benjamin,

as one of the authors of tsearch2 I'd like to know more about your setup. tsearch2 in 8.2 has GIN index support, which scales much better than the old GiST index.

Oleg

On Wed, 21 Mar 2007, Benjamin Arai wrote:
> I have been struggling to get fulltext searching working for very
> large databases. I can fulltext index 10s of gigs without any problem,
> but when I start getting into hundreds of gigs it becomes slow. [...]
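For context, a minimal sketch of the GIN setup Oleg is describing, using the 8.2-era tsearch2 contrib API; the table and column names here are illustrative assumptions, not from the thread:

    -- Assumes contrib/tsearch2 is installed in the database.
    ALTER TABLE articles ADD COLUMN tsv tsvector;
    UPDATE articles SET tsv = to_tsvector('default', body);

    -- GIN index (new in 8.2); for mostly-static data it searches much
    -- faster than GiST, at the cost of slower updates.
    CREATE INDEX articles_tsv_gin ON articles USING gin (tsv);

    -- Keep tsv current on insert/update with the tsearch2 trigger.
    CREATE TRIGGER articles_tsv_update BEFORE INSERT OR UPDATE ON articles
        FOR EACH ROW EXECUTE PROCEDURE tsearch2(tsv, body);

    -- A ranked query:
    SELECT id, rank(tsv, to_tsquery('default', 'press & freedom')) AS score
    FROM articles
    WHERE tsv @@ to_tsquery('default', 'press & freedom')
    ORDER BY score DESC
    LIMIT 20;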
Re: [GENERAL] multi terabyte fulltext searching
Benjamin Arai wrote:
> I have been struggling to get fulltext searching working for very
> large databases. [...] Are there any solutions people have come up
> with for indexing very large text databases?

GiST indexes are very large.

> Essentially I have several terabytes of text that I need to index.
> [...] Queries take more than a second to execute.

You are complaining about more than a second with a terabyte of text?

--
The PostgreSQL Company: Command Prompt, Inc.
http://www.commandprompt.com/
Re: [GENERAL] multi terabyte fulltext searching
Hi Oleg,

I am currently using GiST indexes because I receive about 10GB of new data a week (then again, I am not deleting any information). I do not expect to be able to stop receiving text for about 5 years, so the data is not going to become static any time soon. The reason I am concerned with performance is that I am providing a search system over several newspapers going back essentially to the beginning of time. Many bibliographers etc. would like to use this utility, but if each search takes too long I am not going to be able to support many concurrent users.

Benjamin

On Mar 21, 2007, at 8:42 AM, Oleg Bartunov wrote:
> as one of the authors of tsearch2 I'd like to know more about your
> setup. tsearch2 in 8.2 has GIN index support, which scales much better
> than the old GiST index.
Re: [GENERAL] multi terabyte fulltext searching
True, but what happens when my database reaches 100 terabytes? Is 5 seconds OK? How about 10? My problem is that I do not believe the performance loss I am experiencing as the data grows is logarithmic in the number of records. This worries me because I could be doing something wrong, or I might be able to do something better.

Benjamin

On Mar 21, 2007, at 8:49 AM, Joshua D. Drake wrote:
> GiST indexes are very large. [...] You are complaining about more than
> a second with a terabyte of text?
Re: [GENERAL] multi terabyte fulltext searching
I'm afraid that fulltext search over a multi-terabyte set of documents cannot be implemented on any RDBMS, at least not on a single box. Specialized fulltext search engines (with exact matching and a time-to-search of about one second) have a practical limit near 20 million documents; a cluster, near 100 million. Bigger collections require engines like Google's.

--
Teodor Sigaev                 E-mail: [EMAIL PROTECTED]
                              WWW: http://www.sigaev.ru/
Re: [GENERAL] multi terabyte fulltext searching
By the way, what is the largest TSearch2 database that you know of, and how fast does it return results? Maybe my expectations are unrealistic.

Benjamin

On Mar 21, 2007, at 8:42 AM, Oleg Bartunov wrote:
> as one of the authors of tsearch2 I'd like to know more about your
> setup. tsearch2 in 8.2 has GIN index support, which scales much better
> than the old GiST index.
Re: [GENERAL] multi terabyte fulltext searching
Benjamin Arai wrote:
> True, but what happens when my database reaches 100 terabytes? Is 5
> seconds OK? How about 10? [...] This worries me because I could be
> doing something wrong, or I might be able to do something better.

Well, there are a couple of things you could do, especially if you have the ability to throw hardware at it. How many spindles do you have?

J
Re: [GENERAL] multi terabyte fulltext searching
On Wed, 21 Mar 2007, Benjamin Arai wrote:
> I am currently using GiST indexes because I receive about 10GB of new
> data a week (then again, I am not deleting any information). [...] if
> each search takes too long I am not going to be able to support many
> concurrent users.

GiST is OK for your feed, but the archive part should use a GIN index. Inheritance + CE (constraint exclusion) should make your life easier.

Regards,
Oleg
Re: [GENERAL] multi terabyte fulltext searching
24.

Benjamin

On Mar 21, 2007, at 9:09 AM, Joshua D. Drake wrote:
> Well, there are a couple of things you could do, especially if you
> have the ability to throw hardware at it. How many spindles do you
> have?
Re: [GENERAL] multi terabyte fulltext searching
Can't you implement something similar to Google by aggregating results for TSearch2 over many machines?

Benjamin

On Mar 21, 2007, at 8:59 AM, Teodor Sigaev wrote:
> I'm afraid that fulltext search over a multi-terabyte set of documents
> cannot be implemented on any RDBMS, at least not on a single box. [...]
Re: [GENERAL] multi terabyte fulltext searching
What is inheritance+CE?

Benjamin

On Mar 21, 2007, at 9:10 AM, Oleg Bartunov wrote:
> inheritance+CE
Re: [GENERAL] multi terabyte fulltext searching
> I am currently using GiST indexes because I receive about 10GB of new
> data a week (then again, I am not deleting any information). [...] if
> each search takes too long I am not going to be able to support many
> concurrent users.

Use GiST and GIN indexes together: keep any data older than one month (which doesn't change) under a GIN index and new data under GiST, and once per month move data from the GiST part to the GIN part.

--
Teodor Sigaev                 E-mail: [EMAIL PROTECTED]
                              WWW: http://www.sigaev.ru/
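A minimal sketch of that hot/cold split, building on the tsearch2 setup shown earlier; the table names and the "published" column are illustrative assumptions, not from the thread:

    -- Hot part: the incoming feed, GiST-indexed (cheaper to update).
    CREATE TABLE articles_current (LIKE articles INCLUDING DEFAULTS);
    CREATE INDEX articles_current_tsv_gist
        ON articles_current USING gist (tsv);

    -- Cold part: the static archive "articles" keeps its GIN index.

    -- Once a month, append aged rows to the archive, then clear them
    -- from the hot part.
    BEGIN;
    INSERT INTO articles
        SELECT * FROM articles_current
        WHERE published < date_trunc('month', now());
    DELETE FROM articles_current
        WHERE published < date_trunc('month', now());
    COMMIT;

    -- Search both parts together.
    SELECT id, rank(tsv, to_tsquery('default', 'press & freedom')) AS score
    FROM (SELECT id, tsv FROM articles_current
          UNION ALL
          SELECT id, tsv FROM articles) AS u
    WHERE tsv @@ to_tsquery('default', 'press & freedom')
    ORDER BY score DESC
    LIMIT 20;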
Re: [GENERAL] multi terabyte fulltext searching
On Wed, 21 Mar 2007, Benjamin Arai wrote:
> What is inheritance+CE?

Hmm,

http://www.postgresql.org/docs/8.2/static/ddl-inherit.html
http://www.postgresql.org/docs/8.2/static/ddl-partitioning.html

Regards,
Oleg
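In short, those docs describe splitting a big table into smaller children that the planner can skip. A minimal, self-contained sketch of month-based partitioning with inheritance plus constraint exclusion (all names illustrative):

    -- Parent table: queries go through it, the data lives in the children.
    CREATE TABLE articles (
        id        serial,
        published date NOT NULL,
        body      text,
        tsv       tsvector
    );

    -- One child per month, with a CHECK constraint the planner can use.
    CREATE TABLE articles_2007_03 (
        CHECK (published >= DATE '2007-03-01'
           AND published <  DATE '2007-04-01')
    ) INHERITS (articles);

    CREATE INDEX articles_2007_03_tsv
        ON articles_2007_03 USING gin (tsv);

    -- With constraint exclusion on, a query that constrains "published"
    -- scans only the children whose CHECK constraints can match.
    SET constraint_exclusion = on;
    SELECT id FROM articles
    WHERE published >= DATE '2007-03-01'
      AND tsv @@ to_tsquery('default', 'press & freedom');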
Re: [GENERAL] multi terabyte fulltext searching
On Wed, 21 Mar 2007, Benjamin Arai wrote:
> Can't you implement something similar to Google by aggregating results
> for TSearch2 over many machines?

tsearch2 doesn't use any global statistics, so, in principle, you should be able to run FTS on several machines and combine the results using dblink (contrib/dblink).

Regards,
Oleg
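A rough sketch of that idea, assuming contrib/dblink is installed on the master and each shard holds a slice of the same (illustrative) articles table; the host and database names are made up:

    -- Fan the ranked query out to two shards and merge on the master.
    -- Because tsearch2's rank() needs no global statistics, per-shard
    -- scores are directly comparable.
    SELECT id, score
    FROM (
        SELECT * FROM dblink(
            'host=shard1 dbname=news',
            $$SELECT id,
                     rank(tsv, to_tsquery('default', 'press & freedom'))
              FROM articles
              WHERE tsv @@ to_tsquery('default', 'press & freedom')
              ORDER BY 2 DESC LIMIT 20$$
        ) AS s1(id integer, score real)
        UNION ALL
        SELECT * FROM dblink(
            'host=shard2 dbname=news',
            $$SELECT id,
                     rank(tsv, to_tsquery('default', 'press & freedom'))
              FROM articles
              WHERE tsv @@ to_tsquery('default', 'press & freedom')
              ORDER BY 2 DESC LIMIT 20$$
        ) AS s2(id integer, score real)
    ) AS merged
    ORDER BY score DESC
    LIMIT 20;

Note that this only merges results; it does nothing for fault tolerance, since dblink simply raises an error if a shard is unreachable, which is the node-failure concern Benjamin raises below.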
Re: [GENERAL] multi terabyte fulltext searching
Benjamin Arai wrote:
> 24.

I can think of a couple of things:

1. Increase your spindle count.
2. Push your GiST indexes off to another array entirely (with separate
   controllers).
3. Split your actual tables between other arrays.

Or... buy a SAN (but then again, I just replaced a million-dollar SAN with a split array/controller environment that cost 10% of the price, and the multiple-controller solution is kicking its butt).

Joshua D. Drake
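Point 2 maps directly onto tablespaces (available since 8.0). A minimal sketch; the mount points are assumptions, and the table and index are the illustrative ones from earlier:

    -- Each tablespace lives on its own array/controller (paths assumed).
    CREATE TABLESPACE data_space LOCATION '/mnt/array2/pgdata';
    CREATE TABLESPACE idx_space  LOCATION '/mnt/array3/pgdata';

    -- Keep the heap and its fulltext index on different spindles so
    -- index probes and heap fetches don't compete for the same disks.
    ALTER TABLE articles SET TABLESPACE data_space;
    CREATE INDEX articles_tsv_gin ON articles
        USING gin (tsv) TABLESPACE idx_space;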
Re: [GENERAL] multi terabyte fulltext searching
Are there any examples of dblink being used in commercial environments? I am curious to understand how it deals with node failures and so on.

Benjamin

On Mar 21, 2007, at 9:35 AM, Oleg Bartunov wrote:
> tsearch2 doesn't use any global statistics, so, in principle, you
> should be able to run FTS on several machines and combine the results
> using dblink (contrib/dblink).