Hi Carlos,

Thanks for your note, and it's great to hear that SenseClusters is proving to be useful for your purposes.
I think your option settings sound pretty reasonable. I usually start with first order contexts (o1), especially when I have larger amounts of data.

Given the amount of data you have, I might consider using a smaller window size, perhaps even 2 (which would force bigrams or co-occurrences to be adjacent). Using very big window sizes is something I do when I have smaller amounts of data, as it's a nice way to boost the number of features.

For larger amounts of data like you have, a --remove of 10 is a good value. I have sometimes gone even higher than that and used 20 or 50, depending on the nature of the data. I don't think you want to go any lower than 10, because then you get overwhelmed with features (as you probably already observed :).

You don't mention how you cut off the pmi test. You have the option of using a rank or a score: when using a rank I sometimes take the top 100 or 1000 features, and when using the score as a cutoff I will sometimes use 5 or 10 as values.

In general I do not use svd with order 1, at least not at first, and I think by setting the rf and iter values you would be getting svd, so you might want to avoid those.

I have almost always found bigrams to be at least as effective as co-occurrences, and often more so, so I tend to prefer bigrams.

For clustering method I often start with direct (which is kmeans), and for cluster stopping I will normally use pk2 or gap, so your choice there is good.

For all of the above, though, there is room for quite a bit of experimentation. But I think for a first pass through your data the above should be a pretty good start.

I think your suggestion about including huge-count.pl is a good one. I'll make a note of that and see if we can figure out a way to get that into the "path" of discriminate.pl (which is what the web interface relies upon as well).
Of course, as you've discovered, you can use huge-count.pl if you set up your own sequence of programs to run (rather than using discriminate.pl or the web interface).

Thanks again, I hope this helps. BTW, I took the liberty of copying the users list, because these issues frequently come up when people are starting to use the package (which settings should I use...), so hopefully the above will be generally useful.

Cordially,
Ted

On Thu, Feb 14, 2008 at 4:57 AM, Carlos Troncoso Alarcon <[EMAIL PROTECTED]> wrote:
> Dear Ted,
>
> I recently discovered your helpful software, first NSP and now
> SenseClusters. They are saving me hours and hours of programming and
> research. Thank you for them!
>
> I am trying to apply SC to a big task: clustering approximately 4
> million messages. The messages are unlabeled transcriptions of
> voicemails, so each of them is an individual entity, though of course
> you can find very similar messages. Even though I saw your video
> tutorial and slides and I read the documentation, I don't have the
> necessary experience/knowledge to decide which values I should use for
> each of the parameters of SC, so I would be very grateful if you could
> help me decide.
>
> Let me explain what I am trying to use so far:
>
> I read in your slides that for lots of data first order is better than
> second order, so I guess I should use --context o1.
> Since LSA is only used with first order representation when we need to
> cluster words, I shouldn't use --lsa (and no --svd).
> I am using co-occurrences as feature type, but I don't really know if I
> should be using bigrams instead.
> As measure of association I am using pmi, as you did in your paper UMND2.
> I am using the stopfile from the Demos/Regexs directory.
> The value for --remove is 10.
> For --window it is 12, again as in your paper.
> I am using default values for --k and --rf, and --iter set to 900.
> For cluster stopping I am using gap with default values.
> The clustering method is rbr, and the clustering function is h1.
>
> If you could comment on these values I would really appreciate it.
>
> Finally, I have a suggestion for SenseClusters. It would be nice if
> you added an option to use huge-count.pl instead of count.pl. I added
> it myself (I also love Perl) and found it very helpful when you have
> to deal with lots of data.
>
> Thank you for your time and for your software again!
>
> --
> Carlos
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
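
For readers who want to see what the window, --remove, and pmi cutoff advice in the reply amounts to, here is a rough Python sketch. It is not the NSP or SenseClusters code, and every function and parameter name in it (windowed_bigrams, pmi_scores, select_features, min_count, rank, score) is made up for illustration. It assumes a window of 2 means adjacent words (as the reply says), treats min_count as a stand-in for --remove, computes a textbook PMI in bits from the pairs that survive the frequency cutoff, and then keeps features either by rank or by score. The real tools may differ in how they recompute marginal counts and how they break ties.

```python
import math
from collections import Counter


def windowed_bigrams(tokens, window=2):
    """Yield ordered pairs (w1, w2) where w2 follows w1 within the window.

    window=2 yields only adjacent pairs; a larger window allows up to
    window - 2 intervening tokens between w1 and w2.
    """
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + window]:
            yield (w1, w2)


def pmi_scores(contexts, window=2, min_count=1):
    """Return {pair: PMI in bits} for pairs seen at least min_count times.

    Assumption: marginal counts are taken over the surviving pairs only.
    """
    pair_counts = Counter()
    for tokens in contexts:
        pair_counts.update(windowed_bigrams(tokens, window))

    # Frequency cutoff: a hypothetical stand-in for the --remove setting.
    kept = {pair: c for pair, c in pair_counts.items() if c >= min_count}

    total = sum(kept.values())
    left = Counter()   # how often each word appears as the first element
    right = Counter()  # how often each word appears as the second element
    for (w1, w2), c in kept.items():
        left[w1] += c
        right[w2] += c

    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), written with raw counts.
    return {
        (w1, w2): math.log2(c * total / (left[w1] * right[w2]))
        for (w1, w2), c in kept.items()
    }


def select_features(scores, rank=None, score=None):
    """Keep pairs with PMI >= score (if given), then the top `rank` (if given)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if score is not None:
        ranked = [kv for kv in ranked if kv[1] >= score]
    if rank is not None:
        ranked = ranked[:rank]
    return [pair for pair, _ in ranked]


if __name__ == "__main__":
    # Toy stand-ins for voicemail transcriptions.
    messages = [
        "please call me back".split(),
        "call me back tomorrow".split(),
        "please call the office".split(),
    ]
    scores = pmi_scores(messages, window=2, min_count=2)
    print(select_features(scores, rank=100))
```

Raising min_count or lowering window shrinks the feature set, which is the trade-off the reply describes for large collections; with little data, a wider window does the opposite.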
