As you suggest, the easiest way is to ignore all the blanks and then
try to find words, probably with a greedy approach that backs off when
it hits a dead end.  However, the original email explained that extra
spaces are much more likely than missing spaces.  This information
could be used to rank the candidate splits and get better results.
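
One way to use that asymmetry is to score each candidate against the
spacing actually found in the damaged text.  The following is only a
rough sketch in Python, not Richard's program: the function names and
the 1-to-3 cost ratio are placeholders I made up, and the real ratio
would have to be tuned against sample data.  Dropping a space that was
in the input is cheap (extra spaces are common); adding a space that
was not there is expensive (missing spaces are rare).

```python
def space_positions(text):
    """Return the letter offsets at which text has a space.

    Offsets count non-space characters only, so two strings with the
    same letters but different spacing can be compared directly.
    """
    positions, letters_seen = set(), 0
    for ch in text.strip():
        if ch.isspace():
            positions.add(letters_seen)
        else:
            letters_seen += 1
    return positions


def spacing_cost(original, candidate, extra_cost=1.0, missing_cost=3.0):
    """Cost of turning the original spacing into the candidate spacing.

    extra_cost and missing_cost are assumed weights, not measured ones.
    """
    before = space_positions(original)
    after = space_positions(candidate)
    removed = len(before - after)  # input spaces treated as spurious
    added = len(after - before)    # spaces the input was missing
    return removed * extra_cost + added * missing_cost


def best_candidate(original, candidates, **costs):
    """Pick the candidate whose spacing is cheapest to reach."""
    return min(candidates, key=lambda c: spacing_cost(original, c, **costs))


damaged = "Arti factrefers"
options = ["Art I fact refers", "Artifact refers"]
print(best_candidate(damaged, options))
```

With these weights "Artifact refers" costs 4 (one space dropped, one
added) and "Art I fact refers" costs 6 (two added), so the first wins.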

Thanks to Richard Barbalace for sending his program.
I can run it, and now I need to look at how to revise it.

Thanks,
Steve 

-----Original Message-----
From: Chris Devers [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 05, 2007 10:44 PM
To: Tolkin, Steve
Cc: boston perl mongers
Subject: Re: [Boston.pm] Program wanted to recover text that has spaces
inserted or deleted

On Apr 5, 2007, at 6:42 PM, Tolkin, Steve wrote:

> Also, this is somewhat more complicated because sometimes
> spaces can be removed, although occasionally with much lower  
> frequency.
> For example "Arti factrefers" ought to be "Artifact refers"

How is the program supposed to select from variants such as

   Artifact refers
   Art I fact refers

   documents and
   document sand

?

It almost seems like you can't trust the spaces at all, so you might  
as well just throw them all out and then look for valid word chains  
in the remaining text.
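
To make that concrete, here is a minimal sketch in Python of throwing
out the spaces and listing every valid word chain.  The tiny word list
is a stand-in for a real dictionary, and the function name is just
illustrative.  It shows the ambiguity from the examples above: both
readings come back, so something else still has to choose between them.

```python
from functools import lru_cache

# Hypothetical toy dictionary; a real run would load a full word list.
WORDS = {"art", "i", "fact", "artifact", "refers",
         "document", "documents", "and", "sand"}


def segmentations(text, words=WORDS):
    """Return every way to split text, spaces ignored, into known words."""
    letters = "".join(text.split()).lower()

    @lru_cache(maxsize=None)
    def split_from(start):
        # All word chains covering letters[start:]; memoized per offset.
        if start == len(letters):
            return [[]]
        chains = []
        for end in range(start + 1, len(letters) + 1):
            piece = letters[start:end]
            if piece in words:
                for rest in split_from(end):
                    chains.append([piece] + rest)
        return chains

    return [" ".join(chain) for chain in split_from(0)]


print(segmentations("Arti factrefers"))
print(segmentations("documents and"))
```

An empty list means no chain of dictionary words covers the text, which
is what happens with any word the dictionary does not know.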

If nothing else, that would also solve the ancillary problem of a  
space before punctuation marks...



-- 
Chris Devers

 
_______________________________________________
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm
