Re: Your professional opinion Please...

2003-03-27 Thread Brian
Thanks Peter:

- Original Message - 
From: "Peter L. Berghold" <[EMAIL PROTECTED]>
To: "Brian" <[EMAIL PROTECTED]>
Cc: "MySQL" <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 4:07 PM
Subject: Re: Your professional opinion Please...


> On Tue, 2003-03-25 at 18:11, Brian wrote:
> > What mechanism do you recommend?
> > Something in perl, python or php?

> Well... I tend to be a Perl bigot so I'd choose Perl. I would 
> do a couple of things. 

8^)

> 1) I'd develop a list of words to ignore such as "and", "if",
> "but", etc.  This may take time and iterations. 

> 2) Read each file in and split on word boundaries and tally 
> the words that are not in the exclusion list and theoretically 
> what is left will be keywords. 

> 3) Use the number of times that a keyword is found in each 
> flat text file as a "weight" to be used later as a scoring
> mechanism for the search to determine relevance. 

> 4) Write all this to a table. Once all the documents are scanned 
> THEN build your index. 

> > Are there prebuilt modules that would develop such an index?
 
> I don't know for sure, check CPAN (www.cpan.org) and see. 
> There may well be as I'm sure someone else has had to do this 
> before. 

I will check CPAN for binary tolerant text search engines.

Thanks for your thoughts.

Best regards,

Brian



-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]



Re: Your professional opinion Please...

2003-03-27 Thread Brian
Thanks Mark:

- Original Message - 
From: "Mark C. Roduner, Jr." <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 3:45 PM
Subject: RE: Your professional opinion Please...


> Brian,
> Here are some hints on an efficient way to index the data.
>
> Regular Expressions:
>     ([\w\d]{5,64})  - Matches each run of 5 to 64 word characters
>                       (letters, digits, underscore) in a given string
> Database
>     Tables
>         file : [int id][char*255 name]
>                (Populate this with file names)
>         word : [int id][char*64 name]
>                (Populate this with *unique* words)
>         map  : [int id][int word][int file]
>                (Populate this with `word`.`id`, `file`.`id`
>                 where `word`.`name` is found in the file
>                 named by `file`.`name`)
>     Queries
>         To find files containing any of the given words:
>             SELECT `file`.`name` FROM `file`, `word`, `map`
>             WHERE (`word`.`name` IN ('word1', 'word2', 'word3')) AND
>                   (`map`.`word` = `word`.`id` AND
>                    `map`.`file` = `file`.`id`)
>             GROUP BY `file`.`name`;
>     Room for Improvement
>         Add a field to the MAP table that gives the offset (in words)
>         where the word was found.  This would prove useful for
>         "Quoted Queries" (i.e. phrase searching).
>         Add a blob segment to the FILE table for easier access to the
>         data (very optional, _will_ bloat your database).

Probably a little more than I can do in the allotted time.

> If you're willing to pay for it, I'll Write it for you. 

Unfortunately there is no budget for this project.

> BTW, I recommend JAVA for writing the reader program, 
> much easier and clean cut to do regular expressions, and 
> PHP (v4.x) for the search program (easier UI).

Understood.

Appreciate the feedback.

Best regards,

Brian





Re: Your professional opinion Please...

2003-03-25 Thread Peter L. Berghold
On Tue, 2003-03-25 at 18:11, Brian wrote:

> 
> What mechanism do you recommend?
> 
> Something in perl, python or php?
> 

Well... I tend to be a Perl bigot so I'd choose Perl. I would do a
couple of things. 

1) I'd develop a list of words to ignore such as "and", "if", "but", etc.
This may take time and iterations. 

2) Read each file in and split on word boundaries and tally the words
that are not in the exclusion list and theoretically what is left will
be keywords. 

3) Use the number of times that a keyword is found in each flat text
file as a "weight" to be used later as a scoring mechanism for the
search to determine relevance. 

4) Write all this to a table. Once all the documents are scanned THEN 
build your index. 
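
The four steps above can be sketched roughly as follows. This is only an
illustration in Python rather than Perl (Python being one of the languages
Brian asked about); the stop-word list, the function names keyword_weights
and build_rows, and the sample documents are all hypothetical, and the rows
it produces would still need to be written to a real table.

```python
import re
from collections import Counter

# Step 1: hypothetical exclusion list; a real one would grow over iterations.
STOP_WORDS = {"and", "if", "but", "the", "a", "of", "to"}


def keyword_weights(text, stop_words=STOP_WORDS):
    """Steps 2 and 3: split text into words and count those not excluded."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def build_rows(documents):
    """Step 4: turn {filename: text} into (filename, keyword, weight) rows."""
    rows = []
    for filename, text in sorted(documents.items()):
        for word, weight in sorted(keyword_weights(text).items()):
            rows.append((filename, word, weight))
    return rows


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    docs = {
        "1020.txt": "Bob Evans met Bob and the auditors.",
        "1024.txt": "Evans signed the report.",
    }
    for row in build_rows(docs):
        print(row)
```

Each row could then be inserted into a (filename, keyword, weight) table,
with the database index built only after all documents have been scanned.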

> Are there prebuilt modules that would develop such an index?
> 

I don't know for sure, check CPAN (www.cpan.org) and see. There may well
be as I'm sure someone else has had to do this before. 


-- 
Peter L. Berghold <[EMAIL PROTECTED]>
The New Jersey Bergholds





RE: Your professional opinion Please...

2003-03-25 Thread Mark C. Roduner, Jr.
Brian,
Here are some hints on an efficient way to index the data.

Regular Expressions:
    ([\w\d]{5,64})  - Matches each run of 5 to 64 word characters
                      (letters, digits, underscore) in a given string
Database
    Tables
        file : [int id][char*255 name]
               (Populate this with file names)
        word : [int id][char*64 name]
               (Populate this with *unique* words)
        map  : [int id][int word][int file]
               (Populate this with `word`.`id`, `file`.`id`
                where `word`.`name` is found in the file
                named by `file`.`name`)
    Queries
        To find files containing any of the given words:
            SELECT `file`.`name` FROM `file`, `word`, `map`
            WHERE (`word`.`name` IN ('word1', 'word2', 'word3')) AND
                  (`map`.`word` = `word`.`id` AND
                   `map`.`file` = `file`.`id`)
            GROUP BY `file`.`name`;
    Room for Improvement
        Add a field to the MAP table that gives the offset (in words)
        where the word was found.  This would prove useful for
        "Quoted Queries" (i.e. phrase searching).
        Add a blob segment to the FILE table for easier access to the
        data (very optional, _will_ bloat your database).
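
For anyone who wants to try the three-table layout above before committing
to it, here is a small sketch. It is written in Python and uses the standard
sqlite3 module as a stand-in for MySQL, so the column types are only
approximate; the function names build_index and find_files and the
require_all option are hypothetical additions, not part of the design
described above.

```python
import re
import sqlite3

# Same effect as ([\w\d]{5,64}): \d is already covered by \w.
WORD_RE = re.compile(r"\w{5,64}")


def build_index(documents):
    """Create the file/word/map tables in memory and fill them from {name: text}."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE file (id INTEGER PRIMARY KEY, name VARCHAR(255) UNIQUE);
        CREATE TABLE word (id INTEGER PRIMARY KEY, name VARCHAR(64) UNIQUE);
        CREATE TABLE map  (id INTEGER PRIMARY KEY, word INTEGER, file INTEGER);
    """)
    for name, text in documents.items():
        file_id = db.execute(
            "INSERT INTO file (name) VALUES (?)", (name,)).lastrowid
        # One map row per distinct word per file.
        for w in sorted(set(WORD_RE.findall(text.lower()))):
            db.execute("INSERT OR IGNORE INTO word (name) VALUES (?)", (w,))
            word_id = db.execute(
                "SELECT id FROM word WHERE name = ?", (w,)).fetchone()[0]
            db.execute(
                "INSERT INTO map (word, file) VALUES (?, ?)", (word_id, file_id))
    return db


def find_files(db, words, require_all=False):
    """Return names of files containing any (or, optionally, all) of the words."""
    words = sorted({w.lower() for w in words})
    marks = ",".join("?" * len(words))
    sql = (f"SELECT file.name FROM file, word, map "
           f"WHERE word.name IN ({marks}) "
           f"AND map.word = word.id AND map.file = file.id "
           f"GROUP BY file.name")
    params = list(words)
    if require_all:
        # Keep only files that matched every distinct search word.
        sql += " HAVING COUNT(DISTINCT word.id) = ?"
        params.append(len(words))
    sql += " ORDER BY file.name"
    return [row[0] for row in db.execute(sql, params)]
```

Two things become visible when running this. The IN (...) query returns
files that contain any one of the words; the HAVING COUNT(DISTINCT ...)
clause is one way to require all of them. And because the pattern only
keeps runs of five or more characters, shorter words are never indexed and
so can never be found.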

If you're willing to pay for it, I'll Write it for you. 
BTW, I recommend JAVA for writing the reader program, much easier and
clean cut to do regular expressions, and PHP (v4.x) for the search
program (easier UI).

Mark C. Roduner, Jr.
Medical Systematics Research 
-Original Message-
From: Brian [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 25, 2003 3:12 PM
To: Peter L. Berghold
Cc: MySQL
Subject: Re: Your professional opinion Please...


> On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote:
> > I have a client with approximately 2 gigabytes of un-indexed
> > document files (includes text and graphics).

> > He wants to be able to enter a few parameters and bring up
> > a list of all...

> If they are flat text files this should not be too big an issue 
> although a very large project nonetheless. Develop an index by yanking
> out keywords of interest and developing a table to index them either by
> filename, title, or whatever.

What mechanism do you recommend?

Something in perl, python or php?

Are there prebuilt modules that would develop such an index?

> I'd leave them as flat text files and go from there. If they are 
> adding or removing from the "library" then do a re-index at an 
> interval that makes sense.

Understood - could be done once a night during slow time.

Thanks for the feedback.

Best regards,

Brian






Re: Your professional opinion Please...

2003-03-25 Thread Brian
- Original Message - 
From: "James E Hicks III" <[EMAIL PROTECTED]>
To: "Joe Lewis" <[EMAIL PROTECTED]>; "MySQL" <[EMAIL PROTECTED]>
Cc: "Brian" <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 8:56 AM
Subject: RE: Your professional opinion Please...


> > I'd use MySQL, Apache, and UDMSEARCH.  It provides the web 
> > interface for the google search engine (Apache and UDMSearch), 
> > while connecting to MySQL.  If you want, the re-indexing can occur 
> > using a cron, and then by making apache serve the documents from 
> > the root and doing the fancy indexing.  I suppose this is getting off 
> > topic, though.  [grin]

> UDMSEARCH is now mnoGoSearch. I pity the fool that is forced to 
> run this on Windows! 

I am looking to run the file indexer/server off a Linux box.

> It seems the linux version is GPL'd and the Windows version
> is going to cost you. Ha ha...

Have you had some positive experiences with mnoGoSearch?

Thanks for the feedback.

Best regards,

Brian






Re: Your professional opinion Please...

2003-03-25 Thread Brian
- Original Message - 
From: "Joe Lewis" <[EMAIL PROTECTED]>
To: "MySQL" <[EMAIL PROTECTED]>
Cc: "Brian" <[EMAIL PROTECTED]>
Sent: Tuesday, March 25, 2003 8:42 AM
Subject: Re: Your professional opinion Please...


> Peter L. Berghold wrote:
> > On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote: 
> >>I have a client with approximately 2 gigabytes of un-indexed 
> >>document files (includes text and graphics).

> >>He wants to be able to enter a few parameters and bring up 
> >>a list of all
 
> [snip]

> > I'd leave them as flat text files and go from there. If they are 
> > adding or removing from the "library" then do a re-index at 
> > an interval that makes sense. 
 
> I'd use MySQL, Apache, and UDMSEARCH.  It provides the web 
> interface for the google search engine (Apache and UDMSearch), 
> while connecting to MySQL.  If you want, the re-indexing can occur 
> using a cron, and then by making apache serve the documents from 
> the root and doing the fancy indexing.  I suppose this is getting off 
> topic, though.  [grin]

Ok, udmsearch...

Thanks, I will certainly pursue this line of investigation.

Best regards,

Brian






Re: Your professional opinion Please...

2003-03-25 Thread Brian
> On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote:
> > I have a client with approximately 2 gigabytes of un-indexed 
> > document files (includes text and graphics).

> > He wants to be able to enter a few parameters and bring up 
> > a list of all...

> If they are flat text files this should not be too big an issue although
> a very large project nonetheless. Develop an index by yanking out 
> keywords of interest and developing a table to index them either by 
> filename, title, or whatever.

What mechanism do you recommend?

Something in perl, python or php?

Are there prebuilt modules that would develop such an index?

> I'd leave them as flat text files and go from there. If they are adding
> or removing from the "library" then do a re-index at an interval that 
> makes sense. 

Understood - could be done once a night during slow time.

Thanks for the feedback.

Best regards,

Brian






Re: Your professional opinion Please...

2003-03-25 Thread Brian
Hi Nick:

- Original Message -
From: "Nick Arnett" <[EMAIL PROTECTED]>
To: "Brian" <[EMAIL PROTECTED]>; "MySQL" <[EMAIL PROTECTED]>
Sent: Monday, March 24, 2003 7:47 PM
Subject: RE: Your professional opinion Please...

> > I have a client with approximately 2 gigabytes of
> > un-indexed document files (includes text and graphics).

> > He wants to be able to enter a few parameters and bring
> > up a list of all documents that fit, and then be able to
> > download them over a web interface - sort of like a
> > private Google search engine.

> How many documents?

500+

Documents are added at the rate of ~10 per week.

> What format are they in?

Microsoft Word .docs 80%
Microsoft Excel .xls 10%
Text .rtf 10%

> Does this require just text searching or is there fielded data, too?

There is no fielded data - just a basic text search.

e.g.

Search [Bob Evans]

results: 1020.doc
         1024.doc
         1030.doc

The ideal situation would be to provide results like a Google search, so that
the files can be downloaded to the Windows desktop for analysis and processing
and perhaps inclusion in new documents.

> How many users would search
> simultaneously?

There are approximately 12 users at present, each running perhaps 5-10
searches per day - very low volume actually.

> There are various search engine vendors, including Google itself.

> The leader is Verity.  Autonomy is probably its top current
> competitor.  But since you've posted here, are you considering
> MySQL?  It doesn't have a particularly rich query language for
> text, and it's up to you to get them into the database in a usable
> form.

I am looking at creating a basic database/index, perhaps employing MySQL but
I am open to any feedback.

This is a low budget project - it is a foot in the door for me.

Thanks for your feedback.

Best regards,

Brian






RE: Your professional opinion Please...

2003-03-25 Thread James E Hicks III
> I'd use MySQL, Apache, and UDMSEARCH.  It provides the web interface for 
> the google search engine (Apache and UDMSearch), while connecting to 
> MySQL.  If you want, the re-indexing can occur using a cron, and then by 
> making apache serve the documents from the root and doing the fancy 
> indexing.  I suppose this is getting off topic, though.  [grin]
>
> Joe

UDMSEARCH is now mnoGoSearch. I pity the fool that is forced to run this
on Windows! It seems the linux version is GPL'd and the Windows version
is going to cost you. Ha ha...


James


sql, query






Re: Your professional opinion Please...

2003-03-25 Thread Joe Lewis
Peter L. Berghold wrote:
> On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote:
>
>> I have a client with approximately 2 gigabytes of un-indexed document files
>> (includes text and graphics).
>> He wants to be able to enter a few parameters and bring up a list of all

[snip]

> I'd leave them as flat text files and go from there. If they are adding
> or removing from the "library" then do a re-index at an interval that 
> makes sense. 

I'd use MySQL, Apache, and UDMSEARCH.  It provides the web interface for 
the google search engine (Apache and UDMSearch), while connecting to 
MySQL.  If you want, the re-indexing can occur using a cron, and then by 
making apache serve the documents from the root and doing the fancy 
indexing.  I suppose this is getting off topic, though.  [grin]

Joe



Re: Your professional opinion Please...

2003-03-25 Thread Peter L. Berghold
On Mon, Mar 24, 2003 at 06:41:07PM -0800, Brian wrote:
> I have a client with approximately 2 gigabytes of un-indexed document files
> (includes text and graphics).
> 
> He wants to be able to enter a few parameters and bring up a list of all

If they are flat text files this should not be too big an issue although
a very large project nonetheless. Develop an index by yanking out keywords
of interest and developing a table to index them either by filename,
title, or whatever.

I'd leave them as flat text files and go from there. If they are adding
or removing from the "library" then do a re-index at an interval that 
makes sense. 

-- 

Peter L. Berghold [EMAIL PROTECTED] 
"Those who fail to learn from history are condemned to repeat it."
AIM: redcowdawg  Yahoo IM: blue_cowdawg  ICQ: 11455958 




RE: Your professional opinion Please...

2003-03-24 Thread Nick Arnett
> -Original Message-
> From: Brian [mailto:[EMAIL PROTECTED]

...

> I have a client with approximately 2 gigabytes of un-indexed
> document files
> (includes text and graphics).
>
> He wants to be able to enter a few parameters and bring up a list of all
> documents that fit, and then be able to download them over a web
> interface -
> sort of like a private Google search engine.

How many documents?  What format are they in?  Does this require just text
searching or is there fielded data, too?  How many users would search
simultaneously?

There are various search engine vendors, including Google itself.  The
leader is Verity.  Autonomy is probably its top current competitor.  But
since you've posted here, are you considering MySQL?  It doesn't have a
particularly rich query language for text, and it's up to you to get them
into the database in a usable form.

Nick





Re: Your professional opinion Please...

2003-03-24 Thread Martin Gainty
Brian-
Why not use grep or fgrep on the files and format the result set as
HTML hyperlinks?
Regards,
Martin
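
A rough sketch of the grep-style approach suggested above, written in Python
instead of grep itself. It assumes the documents are plain text under a
single directory; the function name grep_to_html and the idea of linking
with paths relative to that directory are hypothetical choices made for
illustration. Binary formats would need to be converted to text first.

```python
import html
from pathlib import Path
from urllib.parse import quote


def grep_to_html(root, term):
    """Return an HTML list linking every file under root that contains term."""
    needle = term.lower()
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Undecodable bytes are dropped rather than raising an error.
            text = path.read_text(errors="ignore")
            if needle in text.lower():
                hits.append(path.relative_to(root).as_posix())
    items = "\n".join(
        f'<li><a href="{quote(hit)}">{html.escape(hit)}</a></li>'
        for hit in hits
    )
    return f"<ul>\n{items}\n</ul>"
```

This rescans every file on every search, so it needs no index at all, but
each query costs a full pass over the document collection.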
- Original Message -
From: "Brian" <[EMAIL PROTECTED]>
To: "MySQL" <[EMAIL PROTECTED]>
Sent: Monday, March 24, 2003 7:41 PM
Subject: Your professional opinion Please...


> Hello Dear Friends:
>
> I have a client with approximately 2 gigabytes of un-indexed document
> files (includes text and graphics).
>
> He wants to be able to enter a few parameters and bring up a list of all
> documents that fit, and then be able to download them over a web
> interface - sort of like a private Google search engine.
>
> What advice do you have for me?
>
> Thanks for your input,
>
> Brian
> Network Services



Your professional opinion Please...

2003-03-24 Thread Brian
Hello Dear Friends:

I have a client with approximately 2 gigabytes of un-indexed document files
(includes text and graphics).

He wants to be able to enter a few parameters and bring up a list of all
documents that fit, and then be able to download them over a web interface -
sort of like a private Google search engine.

What advice do you have for me?

Thanks for your input,

Brian
Network Services


