On 12/27/2009 7:46 AM, Shawn Milochik wrote:
The special features of the Shrek DVD showed how the rendering took so much processing 
power that everyone's workstation was used overnight as a rendering farm. Some kind of 
video rendering would make a great example. However, it might be a lot of overhead for 
you to set up, unless you can find someone with expertise in the area. The nice thing 
about this is that it would be relevant to the audience. Also, if you describe what goes 
into processing a single frame in enough depth that they appreciate it, they'll really 
"get" the power of distributed processing.

Something else incredibly time-expensive but much easier to set up would be 
matching of names and addresses. I worked at a company where this was, at its 
very core, the primary function of the business model. Considering the 
different ways of entering simple data, many comparisons must be made. This 
takes a lot of time, and even then the match rates aren't necessarily going to 
be very high.

Here are some problems with matching:

'Bill' | 'William'
'52 10th Street' | '52 tenth street'
'E. Smith street' | 'E smith street' | 'east smith street'
'Bill Smith' | 'Smith, Bill'
'William Smith Jr' | 'William Smith Junior'
'Dr. W. Smith' | 'William Smith'
'Michael Norman Smith' | 'Michael N. Smith' | 'Michael Smith' |
'Smith, Michael' | 'Smith, Michael N.' | 'Smith, Michael Norman'

The list goes on and on, ad nauseam. Not to mention geocoding, married and 
maiden names, and scoring partial name matches with distance proximity matches. 
Another nice thing is that, depending on how much time you want to spend on it, 
you can have the students refine the matching rules over time, and see how 
those rules affect the match rate and the processing time. On the downside, 
your class will not have the joy of being taught the 'ideal solution' to this 
problem at the end; if you come up with that, you'll be able to go into 
business and make millions of dollars a year. ^_^
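
To make the cost concrete, here is a minimal sketch of pairwise fuzzy matching. It assumes the standard library's difflib.SequenceMatcher as the scoring function, which is only a stand-in and not the method any real matching business uses. The names similarity and find_matches, the sample records, and the 0.8 threshold are all made up for illustration. With n records there are n*(n-1)/2 pairs to score, which is why the work grows so quickly and splits so naturally across machines.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a score between 0.0 and 1.0 for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(records, threshold):
    """Score every pair of records and keep pairs at or above the threshold."""
    matches = []
    for a, b in combinations(records, 2):
        score = similarity(a, b)
        if score >= threshold:
            matches.append((a, b, score))
    return matches


# Hypothetical sample data, taken from the variants listed above.
records = [
    "William Smith Jr",
    "William Smith Junior",
    "Smith, William",
    "Jane Doe",
]

for a, b, score in find_matches(records, threshold=0.8):
    print(f"{a!r} ~ {b!r}: {score:.2f}")
```

Raising or lowering the threshold, or swapping in a different similarity function, is one way students could watch the match rate and the run time change.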

IMHO, name matching is a poor example. Rather than writing a fuzzy search algorithm, it's easier to write a normalizer that runs before data enters the index (or before the search string is compared with the corpus's strings).
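
A rough sketch of what I mean by a normalizer, under some loud assumptions: the lookup tables below are tiny, hypothetical stand-ins for real reference data, the function name normalize is my own, and a single comma is treated as a "Last, First" name (which would be wrong for addresses that contain commas). The idea is only that both the stored value and the search string pass through the same function, so a plain equality test or exact index lookup can replace fuzzy scoring for many of the variants.

```python
import re

# Hypothetical lookup tables; a real system would need far larger ones.
NICKNAMES = {"bill": "william"}
ABBREVIATIONS = {"e": "east", "jr": "junior", "10th": "tenth"}
TITLES = {"dr", "mr", "mrs", "ms"}


def normalize(text):
    """Reduce a name or address to a canonical lowercase form."""
    text = text.lower().strip()
    # Assumption: a comma means "Last, First", so swap the two parts.
    if "," in text:
        last, _, first = text.partition(",")
        text = f"{first} {last}"
    # Keep only runs of letters and digits, which drops punctuation.
    tokens = re.findall(r"[a-z0-9]+", text)
    canonical = []
    for token in tokens:
        if token in TITLES:
            continue  # drop honorifics such as "Dr."
        token = NICKNAMES.get(token, token)
        token = ABBREVIATIONS.get(token, token)
        canonical.append(token)
    return " ".join(canonical)


print(normalize("Smith, Bill") == normalize("William Smith"))
print(normalize("E. Smith street") == normalize("east smith street"))
```

Initials such as "W." are left alone here, since they cannot be expanded from the string by itself.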
--
http://mail.python.org/mailman/listinfo/python-list
