Re: Your professional opinion Please...
Thanks Peter:

----- Original Message -----
From: "Peter L. Berghold" <[EMAIL PROTECTED]>
To: "Brian" <[EMAIL PROTECTED]>
Cc: "MySQL" <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 4:07 PM
Subject: Re: Your professional opinion Please...

> On Tue, 2003-03-25 at 18:11, Brian wrote:
> > What mechanism do you recommend?
> > Something in perl, python or php?
>
> Well... I tend to be a Perl bigot so I'd choose Perl. I would
> do a couple of things. 8^)
>
> 1) I'd develop a list of words to ignore such as "and", "if",
>    "but", etc. This may take time and iterations.
> 2) Read each file in, split on word boundaries, and tally
>    the words that are not in the exclusion list; theoretically
>    what is left will be keywords.
> 3) Use the number of times that a keyword is found in each
>    flat text file as a "weight" to be used later as a scoring
>    mechanism for the search to determine relevance.
> 4) Write all this to a table. Once all the documents are scanned
>    THEN build your index.
>
> > Are there prebuilt modules that would develop such an index?
>
> I don't know for sure, check CPAN (www.cpan.org) and see.
> There may well be as I'm sure someone else has had to do this
> before.

I will check CPAN for binary tolerant text search engines.

Thanks for your thoughts.

Best regards,

Brian

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
Re: Your professional opinion Please...
Thanks Mark:

----- Original Message -----
From: "Mark C. Roduner, Jr." <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 3:45 PM
Subject: RE: Your professional opinion Please...

> Brian,
> Here are some hints on an efficient way to index the data.
>
> Regular Expressions:
>     ([\w\d]{5,64})  - Matches each run of 5 to 64 word or
>                       numeric characters in a given string
>
> Database Tables
>     file : [int id][char*255 name]
>            (Populate this with file names)
>     word : [int id][char*64 name]
>            (Populate this with *unique* words)
>     map  : [int id][int word][int file]
>            (Populate this with `file`.`id`, `word`.`id` where
>             `word`.`name` is found in the file named by
>             `file`.`name`)
>
> Queries
>     To find a file with given words:
>         SELECT `file`.`name` FROM `file`, `word`, `map`
>         WHERE (`word`.`name` IN ('word1', 'word2', 'word3')) AND
>               (`map`.`word` = `word`.`id` AND
>                `map`.`file` = `file`.`id`)
>         GROUP BY `file`.`name`;
>
> Room for Improvement
>     Add a field to the MAP table that gives the offset (in
>     words) where the word was found. This would prove useful
>     for "Quoted Queries" (i.e. phrase searching).
>     Add a blob segment to the FILE table for easier access to
>     the data (very optional, _will_ bloat your database).

Probably a little more than I can do in the allotted time.

> If you're willing to pay for it, I'll write it for you.

Unfortunately there is no budget for this project.

> BTW, I recommend Java for writing the reader program
> (regular expressions are much easier and cleaner there), and
> PHP (v4.x) for the search program (easier UI).

Understood. Appreciate the feedback.

Best regards,

Brian
Re: Your professional opinion Please...
On Tue, 2003-03-25 at 18:11, Brian wrote:
> What mechanism do you recommend?
> Something in perl, python or php?

Well... I tend to be a Perl bigot so I'd choose Perl. I would do a couple of things.

1) I'd develop a list of words to ignore such as "and", "if", "but", etc. This may take time and iterations.
2) Read each file in, split on word boundaries, and tally the words that are not in the exclusion list; theoretically what is left will be keywords.
3) Use the number of times that a keyword is found in each flat text file as a "weight" to be used later as a scoring mechanism for the search to determine relevance.
4) Write all this to a table. Once all the documents are scanned THEN build your index.

> Are there prebuilt modules that would develop such an index?

I don't know for sure, check CPAN (www.cpan.org) and see. There may well be as I'm sure someone else has had to do this before.

--
Peter L. Berghold <[EMAIL PROTECTED]>
The New Jersey Bergholds
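Peter's four steps above could be sketched roughly as follows. This is only an illustration in Python rather than the Perl he prefers; the stop-word list, the function names, and the idea of returning (file, keyword, weight) rows are all assumptions made for the example, not something from the thread.

```python
import re
from collections import Counter

# Hypothetical exclusion list (step 1); expect to extend it over several iterations.
STOP_WORDS = {"and", "if", "but", "the", "a", "of", "to", "in"}

def keyword_weights(text):
    """Steps 2 and 3: split on word boundaries and count non-excluded words."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def index_documents(docs):
    """Step 4: flatten the tallies into (file, keyword, weight) rows for a table.

    `docs` maps a file name to its already-extracted plain text.
    """
    rows = []
    for name, text in sorted(docs.items()):
        for word, weight in sorted(keyword_weights(text).items()):
            rows.append((name, word, weight))
    return rows

if __name__ == "__main__":
    for row in index_documents({"a.txt": "The cat and the hat, and the cat."}):
        print(row)
```

The rows could then be bulk-loaded into a table, with the database index built once after loading, as step 4 suggests.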
RE: Your professional opinion Please...
Brian,

Here are some hints on an efficient way to index the data.

Regular Expressions:
    ([\w\d]{5,64})  - Matches each run of 5 to 64 word or numeric characters in a given string

Database Tables
    file : [int id][char*255 name]
           (Populate this with file names)
    word : [int id][char*64 name]
           (Populate this with *unique* words)
    map  : [int id][int word][int file]
           (Populate this with `file`.`id`, `word`.`id` where `word`.`name` is found in the file named by `file`.`name`)

Queries
    To find a file with given words:
        SELECT `file`.`name` FROM `file`, `word`, `map`
        WHERE (`word`.`name` IN ('word1', 'word2', 'word3')) AND
              (`map`.`word` = `word`.`id` AND `map`.`file` = `file`.`id`)
        GROUP BY `file`.`name`;

Room for Improvement
    Add a field to the MAP table that gives the offset (in words) where the word was found. This would prove useful for "Quoted Queries" (i.e. phrase searching).
    Add a blob segment to the FILE table for easier access to the data (very optional, _will_ bloat your database).

If you're willing to pay for it, I'll write it for you.

BTW, I recommend Java for writing the reader program (regular expressions are much easier and cleaner there), and PHP (v4.x) for the search program (easier UI).

Mark C. Roduner, Jr.
Medical Systematics Research

-----Original Message-----
From: Brian [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 25, 2003 3:12 PM
To: Peter L. Berghold
Cc: MySQL
Subject: Re: Your professional opinion Please...

> On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote:
> > I have a client with approximately 2 gigabytes of un-indexed
> > document files (includes text and graphics).
> > He wants to be able to enter a few parameters and bring up
> > a list of all...
>
> If they are flat text files this should not be too big an issue
> although a very large project nonetheless. Develop an index by yanking
> out keywords of interest and developing a table to index them either by
> filename, title or whatever.

What mechanism do you recommend?
Something in perl, python or php?

Are there prebuilt modules that would develop such an index?

> I'd leave them as flat text files and go from there. If they are
> adding or removing from the "library" then do a re-index at an
> interval that makes sense.

Understood - could be done once a night during slow time.

Thanks for the feedback.

Best regards,

Brian
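Mark's three-table layout and lookup query can be tried out with a small sketch like the one below. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL so that it runs without a server; the function names and the UNIQUE constraints are assumptions added for the example, and the SQL would need minor adjustment for MySQL.

```python
import re
import sqlite3

# Same length bounds as the pattern in the message (\w already covers digits).
WORD_RE = re.compile(r"\w{5,64}")

def build_index(conn, docs):
    """Create the file/word/map tables and fill them from {file name: text}."""
    conn.executescript("""
        CREATE TABLE file (id INTEGER PRIMARY KEY, name VARCHAR(255) UNIQUE);
        CREATE TABLE word (id INTEGER PRIMARY KEY, name VARCHAR(64) UNIQUE);
        CREATE TABLE map  (id INTEGER PRIMARY KEY, word INTEGER, file INTEGER,
                           UNIQUE (word, file));
    """)
    for fname, text in docs.items():
        conn.execute("INSERT INTO file (name) VALUES (?)", (fname,))
        file_id = conn.execute(
            "SELECT id FROM file WHERE name = ?", (fname,)).fetchone()[0]
        for w in set(WORD_RE.findall(text.lower())):
            # Each unique word is stored once; map links it to every file it is in.
            conn.execute("INSERT OR IGNORE INTO word (name) VALUES (?)", (w,))
            word_id = conn.execute(
                "SELECT id FROM word WHERE name = ?", (w,)).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO map (word, file) VALUES (?, ?)",
                (word_id, file_id))

def find_files(conn, words):
    """Return names of files containing ANY of the given words."""
    marks = ",".join("?" * len(words))
    sql = f"""SELECT file.name FROM file, word, map
              WHERE word.name IN ({marks})
                AND map.word = word.id AND map.file = file.id
              GROUP BY file.name ORDER BY file.name"""
    return [row[0] for row in conn.execute(sql, list(words))]
```

Note that with the {5,64} bound, words shorter than five characters (such as "bob") are never indexed, so a search for them returns nothing; lowering the minimum trades index size for recall.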
Re: Your professional opinion Please...
----- Original Message -----
From: "James E Hicks III" <[EMAIL PROTECTED]>
To: "Joe Lewis" <[EMAIL PROTECTED]>; "MySQL" <[EMAIL PROTECTED]>
Cc: "Brian" <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 8:56 AM
Subject: RE: Your professional opinion Please...

> > I'd use MySQL, Apache, and UDMSEARCH. It provides the web
> > interface for the google search engine (Apache and UDMSearch),
> > while connecting to MySQL. If you want, the re-indexing can occur
> > using a cron, and then by making apache serve the documents from
> > the root and doing the fancy indexing. I suppose this is getting off
> > topic, though. [grin]
>
> UDMSEARCH is now mnoGoSearch. I pity the fool that is forced to
> run this on Windows!

I am looking to run the file indexer/server off a Linux box.

> It seems the linux version is GPL'd and the Windows version
> is going to cost you. Ha ha...

Have you had some positive experiences with mnoGoSearch?

Thanks for the feedback.

Best regards,

Brian
Re: Your professional opinion Please...
----- Original Message -----
From: "Joe Lewis" <[EMAIL PROTECTED]>
To: "MySQL" <[EMAIL PROTECTED]>
Cc: "Brian" <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 8:42 AM
Subject: Re: Your professional opinion Please...

> Peter L. Berghold wrote:
> > On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote:
> >>I have a client with approximately 2 gigabytes of un-indexed
> >>document files (includes text and graphics).
> >>He wants to be able to enter a few parameters and bring up
> >>a list of all
> [snip]
> > I'd leave them as flat text files and go from there. If they are
> > adding or removing from the "library" then do a re-index at
> > an interval that makes sense.
>
> I'd use MySQL, Apache, and UDMSEARCH. It provides the web
> interface for the google search engine (Apache and UDMSearch),
> while connecting to MySQL. If you want, the re-indexing can occur
> using a cron, and then by making apache serve the documents from
> the root and doing the fancy indexing. I suppose this is getting off
> topic, though. [grin]

Ok, udmsearch...

Thanks, I will certainly pursue this line of investigation.

Best regards,

Brian
Re: Your professional opinion Please...
> On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote:
> > I have a client with approximately 2 gigabytes of un-indexed
> > document files (includes text and graphics).
> > He wants to be able to enter a few parameters and bring up
> > a list of all...
>
> If they are flat text files this should not be too big an issue although
> a very large project nonetheless. Develop an index by yanking out
> keywords of interest and developing a table to index them either by
> filename, title or whatever.

What mechanism do you recommend?
Something in perl, python or php?

Are there prebuilt modules that would develop such an index?

> I'd leave them as flat text files and go from there. If they are adding
> or removing from the "library" then do a re-index at an interval that
> makes sense.

Understood - could be done once a night during slow time.

Thanks for the feedback.

Best regards,

Brian
Re: Your professional opinion Please...
Hi Nick:

----- Original Message -----
From: "Nick Arnett" <[EMAIL PROTECTED]>
To: "Brian" <[EMAIL PROTECTED]>; "MySQL" <[EMAIL PROTECTED]>
Sent: Monday, March 24, 2003 7:47 PM
Subject: RE: Your professional opinion Please...

> > I have a client with approximately 2 gigabytes of
> > un-indexed document files (includes text and graphics).
> > He wants to be able to enter a few parameters and bring
> > up a list of all documents that fit, and then be able to
> > download them over a web interface - sort of like a
> > private Google search engine.
>
> How many documents?

500+
Documents are added at the rate of ~10 per week.

> What format are they in?

Microsoft Word .doc   80%
Microsoft Excel .xls  10%
Text .rtf             10%

> Does this require just text searching or is there fielded data, too?

There is no fielded data - just a basic text search.

e.g. Search [Bob Evans]
results: 1020.doc
         1024.doc
         1030.doc

The ideal situation would be to provide results like a Google search, so that the files can be downloaded to the Windows desktop for analysis and processing and perhaps inclusion in new documents.

> How many users would search
> simultaneously?

There are approximately 12 users at present, each running perhaps 5-10 searches per day - very low volume actually.

> There are various search engine vendors, including Google itself.
> The leader is Verity. Autonomy is probably its top current
> competitor. But since you've posted here, are you considering
> MySQL? It doesn't have a particularly rich query language for
> text, and it's up to you to get them into the database in a usable
> form.

I am looking at creating a basic database/index, perhaps employing MySQL, but I am open to any feedback.

This is a low budget project - it is a foot in the door for me.

Thanks for your feedback.

Best regards,

Brian
RE: Your professional opinion Please...
> I'd use MySQL, Apache, and UDMSEARCH. It provides the web interface for
> the google search engine (Apache and UDMSearch), while connecting to
> MySQL. If you want, the re-indexing can occur using a cron, and then by
> making apache serve the documents from the root and doing the fancy
> indexing. I suppose this is getting off topic, though. [grin]
>
> Joe

UDMSEARCH is now mnoGoSearch. I pity the fool that is forced to run this on Windows! It seems the linux version is GPL'd and the Windows version is going to cost you. Ha ha...

James

sql, query
Re: Your professional opinion Please...
Peter L. Berghold wrote:
> On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote:
>>I have a client with approximately 2 gigabytes of un-indexed
>>document files (includes text and graphics).
>>He wants to be able to enter a few parameters and bring up
>>a list of all
[snip]
> I'd leave them as flat text files and go from there. If they are
> adding or removing from the "library" then do a re-index at
> an interval that makes sense.

I'd use MySQL, Apache, and UDMSEARCH. It provides the web interface for the google search engine (Apache and UDMSearch), while connecting to MySQL. If you want, the re-indexing can occur using a cron, and then by making apache serve the documents from the root and doing the fancy indexing. I suppose this is getting off topic, though. [grin]

Joe
Re: Your professional opinion Please...
On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote:
> I have a client with approximately 2 gigabytes of un-indexed document files
> (includes text and graphics).
>
> He wants to be able to enter a few parameters and bring up a list of all

If they are flat text files this should not be too big an issue although a very large project nonetheless. Develop an index by yanking out keywords of interest and developing a table to index them either by filename, title or whatever.

I'd leave them as flat text files and go from there. If they are adding or removing from the "library" then do a re-index at an interval that makes sense.

--
Peter L. Berghold [EMAIL PROTECTED]
"Those who fail to learn from history are condemned to repeat it."
AIM: redcowdawg  Yahoo IM: blue_cowdawg  ICQ: 11455958
RE: Your professional opinion Please...
> -----Original Message-----
> From: Brian [mailto:[EMAIL PROTECTED]

...

> I have a client with approximately 2 gigabytes of un-indexed
> document files
> (includes text and graphics).
>
> He wants to be able to enter a few parameters and bring up a list of all
> documents that fit, and then be able to download them over a web
> interface -
> sort of like a private Google search engine.

How many documents? What format are they in? Does this require just text searching or is there fielded data, too? How many users would search simultaneously?

There are various search engine vendors, including Google itself. The leader is Verity. Autonomy is probably its top current competitor. But since you've posted here, are you considering MySQL? It doesn't have a particularly rich query language for text, and it's up to you to get them into the database in a usable form.

Nick
Re: Your professional opinion Please...
Brian-

Why not use grep or fgrep on the files and convert the resulting list of matching files into formatted HTML hyperlinks?

Regards,
Martin

----- Original Message -----
From: "Brian" <[EMAIL PROTECTED]>
To: "MySQL" <[EMAIL PROTECTED]>
Sent: Monday, March 24, 2003 7:41 PM
Subject: Your professional opinion Please...

> Hello Dear Friends:
>
> I have a client with approximately 2 gigabytes of un-indexed document files
> (includes text and graphics).
>
> He wants to be able to enter a few parameters and bring up a list of all
> documents that fit, and then be able to download them over a web interface -
> sort of like a private Google search engine.
>
> What advice do you have for me?
>
> Thanks for your input,
>
> Brian
> Network Services
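Martin's grep-to-HTML idea might look something like the sketch below, written in Python instead of shelling out to fgrep. The function names and the "/docs/" base URL are hypothetical, and the fixed-string match assumes the search term appears in the files as plain bytes, which may not hold for binary formats such as Word documents.

```python
import html
from pathlib import Path
from urllib.parse import quote

def files_containing(root, needle):
    """Return sorted relative paths of files under root that contain needle.

    This is a fixed-string, case-sensitive match on raw bytes,
    roughly what `fgrep -rl` would report.
    """
    target = needle.encode("utf-8")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and target in path.read_bytes():
            hits.append(path.relative_to(root).as_posix())
    return hits

def as_html_links(paths, base_url="/docs/"):
    """Format the matching paths as an HTML list of download links."""
    items = [
        f'<li><a href="{base_url}{quote(p)}">{html.escape(p)}</a></li>'
        for p in paths
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Because every search rescans every file, this approach needs no index to maintain, at the cost of reading the whole collection on each query.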
Your professional opinion Please...
Hello Dear Friends:

I have a client with approximately 2 gigabytes of un-indexed document files (includes text and graphics).

He wants to be able to enter a few parameters and bring up a list of all documents that fit, and then be able to download them over a web interface - sort of like a private Google search engine.

What advice do you have for me?

Thanks for your input,

Brian
Network Services