This is an issue that has come up in the context of the Ngram Statistics Package (NSP), which is one of the underlying pieces of software for SenseClusters. There is fairly extensive discussion of encoding issues on the NSP mailing list, and if that is of interest to you, you might want to subscribe (it's a Yahoo group):

http://tech.groups.yahoo.com/group/ngram/

The short version of that discussion is to add the following line at the top of each .pl file:

use locale;

This is a bit of a hack, however, with some drawbacks that are discussed on the mailing list above. Still, you might want to try it and see if it helps! Let us know how things go, or if you come across any different or better solutions.

Thanks,
Ted

On Thu, Mar 5, 2009 at 7:50 AM, Savas Yildirim <[email protected]> wrote:
> SenseCluster And Ngram delete special character (ü,ö,ş) in context.
> E.g. the word müssen occur as "m ssen" in SenseCluster and n-gram as well.
> Is there any solution for this ?
>
> I know that romanian language is used with SenseCluster.
>
> My simple solution is replacing such word "ü" with "xxu"
>
> --
> Savas Yildirim
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
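For anyone who wants to see the "m ssen" symptom in isolation, here is a minimal sketch. It is not NSP or SenseClusters code: it assumes tokens are matched with a pattern like \w+ and that the input is UTF-8, and the variable names are made up for illustration. It also shows decoding the text with the core Encode module, which is a different approach from the "use locale;" suggestion above and is untested against NSP itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "müssen", written with escapes so the example does not
# depend on the encoding of this source file.
my $bytes = "m\xC3\xBCssen";

# On an undecoded byte string, with no locale or Unicode rules in effect,
# \w matches only ASCII word characters, so the two bytes of "ü" split
# the word. (Assumption: tokens are matched with a pattern like \w+.)
my @byte_tokens = $bytes =~ /(\w+)/g;

# After decoding, the string holds the single character U+00FC and \w
# follows Unicode rules, so the word stays in one piece.
my $chars = decode('UTF-8', $bytes);
my @char_tokens = $chars =~ /(\w+)/g;

binmode(STDOUT, ':encoding(UTF-8)');
print "bytes:   @byte_tokens\n";   # prints "bytes:   m ssen"
print "decoded: @char_tokens\n";   # prints "decoded: müssen"
```

Whether "use locale;" gives the same result as the decoded case depends on the locale settings of the machine and on the encoding of the input files, so it is worth running a small check like this on your own data first.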
