> Thanks Ted,
>
> > We use NSP and then nsp2regex as we have found it to be useful and
> > interesting, but there is nothing to prevent you from defining your own
> > features via a regex file, and then SenseClusters will use those instead.
>
> OK, I will look at nsp2regex, try to understand what it does and to use it
> as a model for my own pre-processing (I'm afraid I haven't fully grasped
> the difference between feature vectors and word vectors in the SC system...)
>
> Best regards,
>
> Marco
>
Hi Marco,

I have included an example below from some experiments I was recently
running, where I provided a set of features that were not determined with
NSP but were instead specified by me manually. I will also give a little
explanation of this in general in another note, but thought the script
might be useful as a general example of how to put such things together.

What you'll see below is that I still used nsp2regex to create my feature
file. In fact, you don't even need to do that; you can create the feature
file itself simply using Perl regular expressions (one per line).

I guess the more general comment here is that discriminate.pl, the
SenseClusters driver, does require you to use NSP, but you aren't required
to use discriminate.pl. You can mix and match the individual programs in
the SenseClusters toolkit to create a much broader range of systems than
is provided by discriminate.pl.

I hope this gives some ideas, and I'll try to elaborate a bit more as
well.

Cordially,
Ted

-----------------------------------------------------------------------

#!/bin/csh

# Ted Pedersen
# September 1, 2006

# This script allows you to provide a feature set and a test data file
# and then have first order native and LSA context discrimination carried
# out on the data. Note that order 2 cannot be performed on this data
# since the intended features are unigrams, bigrams, and trigrams.
# Order 2 requires bigrams or co-occurrences only as features.

# I used this script with a set of features that I manually created. It
# included a mix of unigrams, bigrams, and trigrams as features. The
# format of the features file was like this:
#
# house<>10
# car<>10
# million<>dollar<>10
# big<>time<>10
# new<>york<>city<>10
#
# Since the feature set was handcrafted, the count after the unigrams,
# bigrams, and trigrams was included simply to satisfy the required
# format.

set testfile = smoking-train.xml
set features = manual.tdpless
set clusters = 2

rm -fr key*
rm -fr *clabel
rm -fr *rlabel
rm -fr *rclass

# convert features file to regular expressions

nsp2regex.pl $features > $features.regex

################ ORDER 1 NATIVE

order1vec.pl --rlabel $features.rlabel --rclass $features.rclass --clabel $features.clabel $testfile $features.regex > $features.o1

mv keyfile*.key keyfile

vcluster --clustfile $features.cluster_solution --rlabelfile $features.rlabel --rclassfile $features.rclass \
 --clmethod direct --colmodel none --rowmodel log --sim cos $features.o1 $clusters > $features.$clusters

cluto2label.pl $features.cluster_solution keyfile > $features.confusion

label.pl $features.confusion > $features.label

report.pl $features.label $features.confusion > $features.report

################ ORDER 1 LSA

order1vec.pl --transpose $testfile $features.regex --testregex $features.lsa.regex > $features.lsa.o1

order2vec.pl $testfile $features.lsa.o1 $features.lsa.regex --rclass $features.lsa.rclass --rlabel $features.lsa.rlabel > $features.lsa.o2

vcluster -clustfile $features.lsa.cluster_solution -rlabelfile $features.lsa.rlabel -rclassfile $features.lsa.rclass \
 -clmethod direct -sim cos -rowmodel log -colmodel none $features.lsa.o2 $clusters > $features.lsa.$clusters

cluto2label.pl $features.lsa.cluster_solution keyfile > $features.lsa.confusion

label.pl $features.lsa.confusion > $features.lsa.label

report.pl $features.lsa.label $features.lsa.confusion > $features.lsa.report

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
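
The handcrafted features file described in the script comments (tokens joined by <>, followed by a placeholder count) can be generated from a plain phrase list instead of being typed by hand. What follows is only a minimal sketch in POSIX sh rather than csh: the function name phrases_to_features and the one-phrase-per-line input are assumptions made for illustration, and nothing here is part of SenseClusters or NSP.

```shell
#!/bin/sh
# Hypothetical helper, not part of SenseClusters or NSP.
# Reads phrases on stdin (one per line, tokens separated by whitespace)
# and writes them in the format shown in the script comments, e.g.
#   "new york city"  becomes  "new<>york<>city<>10"
# The trailing count is a placeholder that only satisfies the format.

phrases_to_features() {
    # $1 = placeholder count (defaults to 10)
    awk -v count="${1:-10}" '
        NF == 0 { next }                  # skip blank lines
        {
            line = ""
            for (i = 1; i <= NF; i++)     # join tokens with the <> separator
                line = line $i "<>"
            print line count              # append the placeholder count
        }'
}

# Example: build a small feature list from unigrams, bigrams, and trigrams.
printf 'house\nmillion dollar\nnew york city\n' | phrases_to_features 10
```

Redirecting the output to a file (for example the manual.tdpless name used in the script) would give something that can be passed to nsp2regex.pl in the same way as the handcrafted file.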
