[Apertium-stuff] soft hyphens and tokenisation

2012-04-17 Thread Kevin Brubeck Unhammer
Hi, I notice that soft/hidden hyphens (&#173;) can split words, e.g. in Jesper­sen there's a soft hyphen between n and t, but it should be analysed as one word. I've noticed this a lot in web pages; I guess a lot of news sites and such use programs that hyphenate using that character. The

Re: [Apertium-stuff] soft hyphens and tokenisation

2012-04-17 Thread Kevin Brubeck Unhammer
Kevin Brubeck Unhammer unham...@fsfe.org writes:
> Hi, I notice that soft/hidden hyphens (&#173;) can split words, e.g. in Jesper­sen there's a soft hyphen between n and t, but it should be analysed as one
Wops, between r and s!
> word. I've noticed this a lot in web pages, I guess a lot

Re: [Apertium-stuff] soft hyphens and tokenisation

2012-04-17 Thread Jimmy O'Regan
On 17 April 2012 14:51, Kevin Brubeck Unhammer unham...@fsfe.org wrote:
> Hi, I notice that soft/hidden hyphens (&#173;) can split words, e.g. in Jesper­sen there's a soft hyphen between n and t, but it should be analysed as one word. I've noticed this a lot in web pages, I guess a lot of
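
To make the problem in this thread concrete, here is a minimal Python sketch. It is not Apertium's tokeniser or its fix: `naive_tokenise` and `strip_soft_hyphens` are hypothetical helpers, and the letters-only regex is only an assumed stand-in for a tokeniser that treats the soft hyphen (U+00AD, written `&#173;` in HTML) as a word boundary.

```python
import re

# U+00AD SOFT HYPHEN: invisible unless a line is actually broken at that point.
SOFT_HYPHEN = "\u00ad"


def strip_soft_hyphens(text):
    """Remove every soft hyphen so hyphenated words become contiguous again."""
    return text.replace(SOFT_HYPHEN, "")


def naive_tokenise(text):
    """Hypothetical stand-in tokeniser: a token is a maximal run of letters.

    U+00AD is a format character, not a letter, so it ends a token here.
    """
    return re.findall(r"[^\W\d_]+", text)


if __name__ == "__main__":
    word = "Jesper" + SOFT_HYPHEN + "sen"
    print(naive_tokenise(word))                      # split into two tokens
    print(naive_tokenise(strip_soft_hyphens(word)))  # one token
```

Stripping the character before tokenisation is only one option; it discards the hyphenation hints, so a real pipeline might instead need to carry them through and restore them in the output.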