Hi Scott,

Sounds like an interesting project. In the headless format, the goal of SenseClusters is to cluster your blog posts based on their contextual similarity (rather than tagging individual words with meanings). In the headed format, the goal is to cluster the occurrences of a particular word or phrase based on their contextual similarity to one another. Neither of these sounds like what you want to do (at least as I understand it...).

If you want to tag words with meanings, you might want to try out WordNet::SenseRelate::AllWords.

http://senserelate.sourceforge.net

I hope this helps. Let me know if I've misunderstood something too...

Cordially,
Ted

On Fri, Dec 5, 2008 at 11:15 AM, Scott Salley <[EMAIL PROTECTED]> wrote:
> I have gigabytes of text (blog posts) I am using for creating a statistical
> language model (SLM) for an embedded LVCSR -- speech recognition on a cell
> phone for writing email messages or sms.
>
> I would like to tag the words in this text with identifiers to distinguish
> meanings of words and hopefully result in a lower perplexity score for the
> SLM.
>
> As a first experiment I called discriminate.pl on a 1gig portion of this
> text (converted to the senseval headless format) so I could start matching
> documentation with reality. This text seems like it's going to require more
> resources to process than I'd like to devote.
>
> Can someone suggest how I should go about tagging words in a large corpus?
> I'm working my way through the documentation, but that is going slowly.
>
>
> Note that I'm willing to share the data (I got it from the web in the first
> place), but I don't have bandwidth for allowing everyone to download it. I
> archived it as directories for each blogger (from the US), each blog post as
> a file. I'm not sure of the actual amount of English text, but the fraction
> I use for experiments is 1gig and I used around 1/5 of the data.
>
>
> ________________________________
> Send e-mail anywhere. No map, no compass. Get your Hotmail(R) account now.
> ------------------------------------------------------------------------------
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> The future of the web can't happen without you. Join us at MIX09 to help
> pave the way to the Next Web now.
> Learn more and register at
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> _______________________________________________
> senseclusters-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/senseclusters-users
>

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
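
For readers following the thread, here is a rough illustration of what word-level sense tagging (as opposed to clustering whole contexts) might look like. It is only a minimal sketch of a simplified Lesk-style gloss overlap in Python. It is not the API or the algorithm of WordNet::SenseRelate::AllWords, which works against WordNet. The sense inventory `SENSES`, the stopword list, the `word#n` tag format, and the function name `tag_senses` are all hypothetical and exist only for this example.

```python
import re

# Hypothetical toy sense inventory (word -> {sense tag: gloss}).
# A real system would draw senses and glosses from WordNet instead.
SENSES = {
    "bank": {
        "bank#1": "financial institution that accepts deposits and lends money",
        "bank#2": "sloping land beside a river or lake",
    },
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "at", "beside", "by", "in", "of", "on",
             "or", "that", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def content_words(text):
    """Return the set of non-stopword tokens in the text."""
    return {t for t in tokenize(text) if t not in STOPWORDS}


def tag_senses(sentence, window=4):
    """Replace each known ambiguous word with the sense tag whose gloss
    shares the most content words with the surrounding window of tokens.
    Ties go to the first sense listed in SENSES."""
    tokens = tokenize(sentence)
    tagged = []
    for i, tok in enumerate(tokens):
        senses = SENSES.get(tok)
        if not senses:
            tagged.append(tok)  # word not in the inventory: leave untagged
            continue
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        context = set(neighbours) - STOPWORDS
        best = max(senses, key=lambda s: len(context & content_words(senses[s])))
        tagged.append(best)
    return tagged


if __name__ == "__main__":
    print(" ".join(tag_senses("She sat on the bank of the river")))
    print(" ".join(tag_senses("The bank lends money to farmers")))
```

The tagged token stream could then be fed to language-model training in place of the raw words. Whether that actually lowers perplexity on a corpus like the one described is an open question that would need to be measured.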
