Re: TF-IDF API

Sengly Heng Wed, 28 Mar 2007 19:40:30 -0800

Thank you very much for your time. Here is a sample vector of terms:


v1 = {"sad", "john", "intelligent", "news", "USA", "disneyland", "MIT",
"cambridge", "marry", ...}

I'll try out your method.

Best regards,

Sengly



On 3/28/07, karl wettin <[EMAIL PROTECTED]> wrote:



28 mar 2007 kl. 15.24 skrev Sengly Heng:

> Thank you, but I still have no clue how to do that with Weka
> after taking a look at its API. Let me reformulate my problem:
>
> I have a collection of term vectors (each vector is the list of
> tokens extracted from a file), and I do not have the original
> files. I would like to calculate the TF as well as the TF-IDF of
> each term and sort the terms by each of these values. As suggested
> by Grant Ingersoll, I could index those term vectors again using
> Lucene and then use its API to measure TF and TF-IDF. However, I
> guess there should be a simpler way or an API that fits this case.

To my knowledge there is nothing in Lucene that makes it simpler for
you than what Grant suggests. And in my opinion, Weka really must
be the simplest way around. However, perhaps you should supply us
with an example of what these vectors look like. That might change
everything. Perhaps we are talking about completely different things here.
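
For comparison, the calculation described in the quoted question (TF and TF-IDF per term, sorted by value) can also be sketched directly in plain Java, without Lucene or Weka. This is only a minimal sketch under stated assumptions: the class name SimpleTfIdf and its methods are hypothetical, TF is taken as the raw count, and IDF as ln(N / df), which is one common variant and not necessarily the weighting Lucene or Weka applies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of Lucene or Weka.
public class SimpleTfIdf {

    // Raw term frequency: how many times each term occurs in one vector.
    static Map<String, Integer> termFrequencies(List<String> doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    // Document frequency: how many vectors contain each term at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // TF-IDF for one vector, assuming tf * ln(N / df).
    // Assumption: doc is one of the vectors in docs, so every term has df >= 1.
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> docs) {
        Map<String, Integer> df = documentFrequencies(docs);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies(doc).entrySet()) {
            double idf = Math.log((double) docs.size() / df.get(e.getKey()));
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }

    // Entries sorted by descending score.
    static List<Map.Entry<String, Double>> sortedByScore(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // Made-up example vectors, for illustration only.
        List<List<String>> docs = List.of(
            List.of("john", "john", "news"),
            List.of("john", "MIT"));
        for (Map.Entry<String, Double> e : sortedByScore(tfIdf(docs.get(0), docs))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Sorting by TF alone works the same way on the map returned by termFrequencies. A term that occurs in every vector gets an IDF of zero under this variant, so smoothing may be wanted for small collections.
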

Let me reformulate my suggestion:

1. convert each vector to a single string.
2. put the data in a file called myvectors.arff:

@relation termvectors
@attribute the_vector string
@data
"first term vector as a string"
"second term vector as a string"

3. open the file in the Weka Explorer application.
4. select the filter filters/unsupervised/attribute/StringToWordVector.
5. set your preferences for normalization, etc.
6. apply the filter.
7. save the output.

All this can be done programmatically too, with only a few lines of code.
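
As a rough illustration of steps 1 and 2 only, the ARFF file could be generated in plain Java along these lines. The class name TermVectorsToArff and its methods are hypothetical, and the backslash escaping of quotes is an assumption about what Weka's ARFF reader accepts. Applying the filter itself (steps 3 to 7) would need the Weka library and is not shown.

```java
import java.util.List;

// Hypothetical helper class; it only produces ARFF text and does not use Weka.
public class TermVectorsToArff {

    // Step 1: join one term vector into a single quoted ARFF string value.
    static String toArffString(List<String> terms) {
        String joined = String.join(" ", terms);
        // Assumption: backslashes and double quotes are escaped with a backslash.
        String escaped = joined.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Step 2: build the ARFF document, one data row per term vector.
    static String toArff(List<List<String>> vectors) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation termvectors\n");
        sb.append("@attribute the_vector string\n");
        sb.append("@data\n");
        for (List<String> vector : vectors) {
            sb.append(toArffString(vector)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up example vectors, for illustration only.
        List<List<String>> vectors = List.of(
            List.of("sad", "john", "news"),
            List.of("john", "MIT", "cambridge"));
        // Prints the ARFF text; redirect stdout to myvectors.arff to save it.
        System.out.print(toArff(vectors));
    }
}
```

Joining with spaces assumes that no term itself contains whitespace; otherwise the tokenizer of the filter would split it again.
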

>
> Thanks once again everyone.
>
> Best regards,
>
> Sengly
>
>
> On 3/28/07, karl wettin <[EMAIL PROTECTED]> wrote:
>>
>>
>> 28 mar 2007 kl. 10.36 skrev Sengly Heng:
>>
>> > Does any of you know a Java API that directly handles this
>> > problem, or do I have to implement it from scratch?
>>
>> You can also try
>> weka.filters.unsupervised.attribute.StringToWordVector, it has many
>> neat features you might be interested in. And if applicable to what
>> you attempt to do, the feature selection algorithms of the same
>> project (Weka) do a great job of reducing the data set.
>>
>> http://www.cs.waikato.ac.nz/ml/weka/
>>
>> It is GPL.
>>
>> --
>> karl
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>

