> Thanks Ted,
> 
> > We use NSP and then nsp2regex as we have found it to be useful and
> > interesting, but there is nothing to prevent you from defining your own
> > features via a regex file, and then SenseClusters will use those instead.
> 
> OK, I will look at nsp2regex, try to understand what it  does and to use it 
> as a model for my own pre-processing (I'm afraid I haven't fully grasped 
> the difference between feature vectors and word vectors in the SC system...)
> 
> Best regards,
> 
> Marco
> 

Hi Marco,

I have included an example below of some experiments I was recently
running where I provided a set of features that were not determined
with NSP, but were rather specified by me manually. I will also give
a little explanation of this in general in another note, but thought
the script might be useful as a general example of how to put 
such things together.

What you'll see below is that I still used nsp2regex to create my
feature file. In fact, you don't even need to do that; you can create
the feature file directly by writing Perl regular expressions (one
per line).
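
If you do go through nsp2regex, the count-style input file can be produced from a plain list of n-grams. Here is a minimal sketch, assuming one n-gram per line with tokens separated by single spaces. The file name ngrams.txt and the placeholder count of 10 are illustrative choices only; the token<>...<>count layout is the one described in the script header below.

```shell
# Hypothetical input: one n-gram per line, tokens separated by single spaces.
printf 'house\nmillion dollar\nnew york city\n' > ngrams.txt

# Join the tokens with <> and append a placeholder count of 10,
# giving the token<>...<>count layout of a hand-made features file.
sed -e 's/ /<>/g' -e 's/$/<>10/' ngrams.txt > manual.tdpless

cat manual.tdpless
```

For the three sample lines this writes house<>10, million<>dollar<>10 and new<>york<>city<>10 to manual.tdpless, which can then be passed to nsp2regex.pl as in the script.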

I guess the more general comment here is that discriminate.pl, the
SenseClusters driver, does require you to use NSP, but you aren't
required to use discriminate.pl. You can mix and match the individual
programs in the SenseClusters toolkit to create a much broader
range of systems than is provided by discriminate.pl.

I hope this gives some ideas, and I'll try and elaborate a bit more 
as well.

Cordially,
Ted

-----------------------------------------------------------------------

#!/bin/csh

# Ted Pedersen
# September 1, 2006

# This script allows you to provide a feature set and a test data file
# and then have first order native and LSA context discrimination carried
# out on the data. Note that order 2 cannot be performed on this data,
# since the intended features are unigrams, bigrams, and trigrams, and
# order 2 requires only bigrams or co-occurrences as features.

# I used this script with a set of features that I manually created. It
# included a mix of unigrams, bigrams, and trigrams as features. The
# format of the features file was like this:
# 
# house<>10
# car<>10
# million<>dollar<>10
# big<>time<>10
# new<>york<>city<>10
#
# Since the feature set was handcrafted, the count after each unigram,
# bigram, or trigram was included simply to satisfy the required
# format.

set testfile = smoking-train.xml
set features = manual.tdpless
set clusters = 2

rm -fr key*
rm -fr *clabel
rm -fr *rlabel 
rm -fr *rclass 

# convert features file to regular expressions

nsp2regex.pl $features > $features.regex

################ ORDER 1 NATIVE

order1vec.pl --rlabel $features.rlabel --rclass $features.rclass --clabel \
$features.clabel $testfile $features.regex > $features.o1

mv keyfile*.key keyfile

vcluster --clustfile $features.cluster_solution --rlabelfile $features.rlabel \
--rclassfile $features.rclass \
--clmethod direct --colmodel none --rowmodel log --sim cos $features.o1 \
$clusters > $features.$clusters

cluto2label.pl $features.cluster_solution keyfile > $features.confusion

label.pl $features.confusion > $features.label

report.pl $features.label $features.confusion > $features.report

################ ORDER 1 LSA

order1vec.pl --transpose $testfile $features.regex --testregex \
$features.lsa.regex > $features.lsa.o1

order2vec.pl $testfile $features.lsa.o1 $features.lsa.regex --rclass \
$features.lsa.rclass --rlabel $features.lsa.rlabel > $features.lsa.o2

vcluster -clustfile $features.lsa.cluster_solution -rlabelfile \
$features.lsa.rlabel -rclassfile $features.lsa.rclass \
-clmethod direct -sim cos -rowmodel log -colmodel none $features.lsa.o2 \
$clusters > $features.lsa.$clusters

cluto2label.pl $features.lsa.cluster_solution keyfile > $features.lsa.confusion

label.pl $features.lsa.confusion > $features.lsa.label

report.pl $features.lsa.label $features.lsa.confusion > $features.lsa.report
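
Since a handcrafted features file has to follow the token<>...<>count layout, a quick check before running the script can catch hand-editing slips. This is a rough sketch, assuming tokens contain no spaces or angle brackets; sample.features and bad_lines.txt are hypothetical names, and in practice you would point the check at your own features file.

```shell
# Hypothetical sample with one deliberately malformed line (the third).
printf 'house<>10\nmillion<>dollar<>10\nbig time 10\n' > sample.features

# List, with line numbers, every line that is NOT token<>...<>count.
grep -n -v -E '^([^<> ]+<>)+[0-9]+$' sample.features > bad_lines.txt

cat bad_lines.txt
```

With the sample above, only the third line is reported (3:big time 10); an empty bad_lines.txt would mean every line matched the expected layout.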

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

