Hi Carlos,

Thanks for your note, and it's great to hear that SenseClusters is
proving to be useful for your purposes.

Your option settings sound pretty reasonable to me - I usually
start with first order contexts (o1), especially when I have larger
amounts of data. Given the amount of data you have, I might consider
using a smaller window size, perhaps even 2 (which would force bigrams
or co-occurrences to be adjacent). I use very big window sizes when I
have smaller amounts of data, as it's a nice way to boost the number
of features. For larger amounts of data like you have, a --remove of
10 is a good value. I have sometimes gone even higher than that and
used 20 or 50, depending on the nature of the data. I don't think you
want to go any lower than 10, because then you get overwhelmed with
features (as you have probably already observed :).
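
To make the effect of the window and --remove settings concrete, here
is a minimal Python sketch. It is not the count.pl code, just an
illustration under simple assumptions: tokens are split on whitespace,
pairs keep their left-to-right order, and the function name and the
sample sentence are made up for this example.

```python
from collections import Counter

def count_bigrams(tokens, window=2, remove=1):
    """Count ordered word pairs that fit inside a window of `window`
    tokens, then drop pairs seen fewer than `remove` times.
    (Hypothetical helper for illustration only.)"""
    counts = Counter()
    for i, first in enumerate(tokens):
        # With window=2 a token is only paired with its right neighbour;
        # larger windows allow intervening words between the pair.
        for second in tokens[i + 1 : i + window]:
            counts[(first, second)] += 1
    # Frequency cutoff: keep only pairs that occur at least `remove` times.
    return {pair: n for pair, n in counts.items() if n >= remove}

tokens = "please call me back please call me at home".split()
adjacent = count_bigrams(tokens, window=2)            # adjacent pairs only
wide = count_bigrams(tokens, window=4)                # many more features
frequent = count_bigrams(tokens, window=2, remove=2)  # rare pairs dropped
```

Even on this toy sentence the wider window produces noticeably more
distinct pairs than the adjacent-only setting, and the cutoff trims
the pairs that occur only once.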

You don't mention how you cut off the pmi test - you have the option
of using a rank or a score - when using a rank I sometimes take the
top 100 or 1000 features, and when using the score as a cutoff I will
sometimes use 5 or 10 as values...
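
As a rough illustration of the two kinds of cutoff, here is a small
Python sketch. It assumes the usual definition of pointwise mutual
information over a 2x2 table with log base 2, the function names and
counts are hypothetical, and it ignores how tied scores are ranked, so
treat it as a sketch rather than what the package does internally.

```python
import math

def pmi(n11, n1p, np1, npp):
    """Pointwise mutual information (log base 2) for a word pair.
    n11 = times the pair occurs, n1p = times the first word starts a
    pair, np1 = times the second word ends a pair, npp = total pairs."""
    return math.log2((n11 * npp) / (n1p * np1))

def select_features(scored, rank=None, score=None):
    """Keep the top `rank` pairs and/or the pairs scoring >= `score`."""
    ordered = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    if rank is not None:
        ordered = ordered[:rank]
    if score is not None:
        ordered = [(pair, s) for pair, s in ordered if s >= score]
    return [pair for pair, _ in ordered]

# Made-up counts, chosen only so the arithmetic is easy to follow.
scored = {
    ("voice", "mail"): pmi(8, 16, 16, 1024),
    ("call", "back"): pmi(16, 32, 64, 1024),
    ("the", "a"): pmi(10, 100, 100, 1000),
}
by_rank = select_features(scored, rank=2)
by_score = select_features(scored, score=5)
```

A rank cutoff always returns a fixed number of features, while a score
cutoff returns however many pairs clear the threshold, so the two can
behave quite differently from one data set to the next.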

In general I do not use svd with order 1, at least not at first, and I
think that by setting the rf and iter values you would be getting svd,
so you might want to avoid those. I have almost always found bigrams
to be at least as effective as co-occurrences, and often more so, so I
tend to prefer bigrams. For the clustering method I often start with
direct (which is kmeans), and for cluster stopping I normally use pk2
or gap, so your choice there is good.
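
For readers new to the idea, a first order representation with bigram
features can be pictured roughly as follows. This is a simplified
Python sketch, not the package's own code: the function name, the
feature list, and the sample messages are invented, and storing counts
(rather than binary values) in each cell is a choice made here purely
for illustration.

```python
def first_order_vectors(contexts, features):
    """Build one row per context and one column per feature; each cell
    counts how often that adjacent word pair occurs in the context.
    (Hypothetical helper for illustration only.)"""
    rows = []
    for text in contexts:
        tokens = text.split()
        pairs = list(zip(tokens, tokens[1:]))  # adjacent word pairs
        rows.append([pairs.count(f) for f in features])
    return rows

features = [("call", "back"), ("voice", "mail")]
contexts = [
    "please call back soon",
    "check your voice mail",
    "call back or leave voice mail",
]
vectors = first_order_vectors(contexts, features)
```

Each message becomes a vector over the selected features, and those
vectors are what a method like kmeans then groups into clusters.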

For all of the above though, there is room for quite a bit of
experimentation. But, I think for a first pass through your data the
above should be a pretty good start.

I think your suggestion about including huge-count.pl is a good one -
I'll make a note of that and see if we can't figure out a way to get
that into the "path" of discriminate.pl (which is what the web
interface relies upon as well). Of course as you've discovered you can
use huge-count.pl if you set up your own sequence of programs to run
(rather than using discriminate.pl or the web interface).

Thanks again, I hope this helps. BTW, I took the liberty of copying
the users list just because these issues frequently come up when
people are starting to use the package (which settings should I
use...) so hopefully the above will be generally useful.

Cordially,
Ted

On Thu, Feb 14, 2008 at 4:57 AM, Carlos Troncoso Alarcon
<[EMAIL PROTECTED]> wrote:
> Dear Ted,
>
>  I recently discovered your helpful software, first NSP and now
>  SenseClusters. They are saving me hours and hours of programming and
>  research. Thank you for them!
>
>  I am trying to apply SC to a big task: clustering approximately 4
>  million messages. The messages are unlabeled transcriptions of
>  voicemails, so each of them is an individual entity, though of course
>  you can find very similar messages. Even though I saw your video
>  tutorial and slides and I read the documentation, I don't have the
>  necessary experience/knowledge to decide which values I should use for
>  each of the parameters of SC, so I would be very grateful if you could
>  help me decide.
>
>  Let me explain what I am trying to use so far:
>
>  I read in your slides that for lots of data first order is better than
>  second order, so I guess I should use --context o1.
>  Since LSA is only used with first order representation when we need to
>  cluster words, I shouldn't use --lsa (and no --svd).
>  I am using co-occurrences as feature type, but I don't really know if I
>  should be using bigrams instead.
>  As measure of association I am using pmi, as you did in your paper UMND2.
>  I am using the stopfile from the Demos/Regexs directory.
>  The value for --remove is 10.
>  For --window it is 12, again as in your paper.
>  I am using default values for --k and --rf, and --iter set to 900.
>  For cluster stopping I am using gap with default values.
>  The clustering method is rbr, and the clustering function is h1.
>
>  If you could comment on these values I would really appreciate it.
>
>  Finally, I have a suggestion for SenseClusters. It would be nice if
>  you added an option to use huge-count.pl instead of count.pl. I added
>  it myself (I also love Perl) and found it very helpful when you have
>  to deal with lots of data.
>
>  Thank you for your time and for your software again!
>
>  --
>  Carlos
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
