28 mar 2007 kl. 15.24 skrev Sengly Heng:

Thank you but I still have have no clue of how to do that by using Weka
after taking a look at its API. Let me reformulate my problem :

I have a collection of vector of terms (actually each vector of terms
represents the list of tokens extracted from a file) and I do not have the original files. I would like to calculate TF as well as TFIDF of each term
and sorted them by these value respectively. As suggested by Grant
Ingersoll, I could index those vectors of terms again using Lucene and then
use its API to measure TF and TFIDF. However I guess there should be a
simpler way or API just fit-in this case.

To my knowledge there is no thing in Lucene that makes it simpler for you than what Grant suggests. And according to me, Weka really must be the simplest way around. However, perhaps you should supply us with an example of what these vectors look like. That might change everything. Perhaps we are talking of completely different things here.

Let me reformulate my suggestion:

1. rebuild your vector to a string.
2. put the data in a file called myvectors.arff:

@relation termvectors
@attribute the_vector string
@data
"first term vector as a string"
"second term vector as a string"

3. open the file in the weka explorer application.
4. select filter/unsupervised/attribute/string to word vector
5. set your preferences of normalization, et c.
6. apply the filter.
7. save the output.

All this can be done progamatically too, with only a few lines of code.


Thanks once again everyone.

Best regards,

Sengly


On 3/28/07, karl wettin <[EMAIL PROTECTED]> wrote:


28 mar 2007 kl. 10.36 skrev Sengly Heng:

> Does anyone of you know any Java API that directly handle this
> problem?
> or I have to implement from scratch.

You can also try
weka.filters.unsupervised.attribute.StringToWordVector, it has many
neat features you might be interested in. And if applicable to what
you attempt to do, the feature selection algorithms of the same
project (Weka) does a great job reducing the data set.

http://www.cs.waikato.ac.nz/ml/weka/

It is GPL.

--
karl


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to