Re: Problem converting tokenized documents into TFIDF vectors

Scott C. Cote Sun, 26 Jan 2014 10:34:51 -0800

Drew,

I'm sorry - I'm derelict (as opposed to dirichlet) in responding that I
got past my problem.


It was the min freq that was killing me.  Forgot about that parameter.
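
For anyone hitting the same symptom, here is a minimal sketch of the effect.
It is plain Python, not Mahout code, and it assumes the cutoff is applied to
each term's total count across the whole corpus. The function names and the
threshold of 4 are hypothetical; the thread does not say which value was
actually in use. The documents are the ten tokenized entries quoted below.

```python
from collections import Counter

# Tokenized documents copied from the seqdumper output quoted in this thread.
docs = {
    "1": ["rest", "web", "services"],
    "2": ["soa", "design", "build", "service", "oriented", "architecture",
          "using", "java"],
    "3": ["oracle", "jdbc", "build", "java", "database", "connectivity",
          "layer", "oracle"],
    "4": ["spring", "injection", "use", "spring", "templates", "inversion",
          "control"],
    "5": ["j2ee", "create", "device", "enterprise", "java", "beans",
          "integrate", "spring"],
    "6": ["can", "deploy", "web", "archive", "war", "files", "tomcat"],
    "7": ["java", "graphics", "uses", "android", "graphics", "packages",
          "create", "user", "interfaces"],
    "8": ["core", "java", "understand", "core", "libraries", "java",
          "development", "kit"],
    "9": ["design", "develop", "jdbc", "sql", "queries"],
    "10": ["multithreading", "thread", "synchronization"],
}


def build_dictionary(docs, min_support):
    """Map each term whose corpus-wide count is >= min_support to an index."""
    counts = Counter(t for tokens in docs.values() for t in tokens)
    kept = sorted(t for t, c in counts.items() if c >= min_support)
    return {term: index for index, term in enumerate(kept)}


def tf_vectors(docs, dictionary):
    """Build sparse term-frequency vectors keyed by dictionary index.

    Documents left with no surviving terms are dropped, which is an
    assumption made to match the five-entry tf-vectors output in the thread.
    """
    vectors = {}
    for key, tokens in docs.items():
        vec = Counter(dictionary[t] for t in tokens if t in dictionary)
        if vec:
            vectors[key] = dict(vec)
    return vectors


strict = build_dictionary(docs, min_support=4)   # hypothetical high cutoff
relaxed = build_dictionary(docs, min_support=2)  # hypothetical lower cutoff
print(strict)
print(tf_vectors(docs, strict))
print(sorted(relaxed))
```

In this data 'java' occurs six times, 'spring' three times, and terms such as
'web' twice, so under the sketch's assumptions any cutoff from 4 through 6
leaves 'java' as the only dictionary entry, while a cutoff of 2 lets 'web'
and the other repeated terms through.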

Thank you for your assist.

Hope to be able to return the favor.

Am on the hook to update documentation for Mahout already - maybe that
will do it :)

This week, I'll be testing my code against the .9 distribution.

SCott

On 1/26/14 10:57 AM, "Drew Farris" <d...@apache.org> wrote:

>Scott,
>
>Based on the dictionary output, it looks like the process of generating
>vectors from your tokenized text is not working properly. The only term
>that's making it into your dictionary is 'java' - everything else is being
>filtered out. Furthermore, your tf vectors have a single dimension '0'
>with a weight that corresponds to the frequency of the term 'java' in each
>document.
>
>I would check the settings for minimum document frequency in the
>vectorization process. What is the command you are using to create vectors
>from your tokenized documents?
>
>Drew
>
>
>On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote <scottcc...@gmail.com>
>wrote:
>
>> All,
>>
>> Not a Mahout .9 problem - once I have this working with .8 Mahout, will
>> immediately pull in the .9 stuff...
>>
>> I am trying to make a small data set work (perhaps it is too small?) where I
>> am clustering skills (phrases).  For sake of brevity (my steps are long), I
>> have not documented the steps that I took to get my text of skills into
>> tokenized form...
>>
>> By the time I get to the TFIDF vectors (step 4), my output is zero...
>> No tfidf vectors generated.
>>
>>
>> I have broken this down into 4 steps.
>>
>>
>>
>> Step 1. Tokenize docs.  Here is output validating success of tokenization.
>>
>> mahout seqdumper -i tokenized-documents/part-m-00000
>>
>> yields
>>
>> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.common.StringTuple
>> Key: 1: Value: [rest, web, services]
>> Key: 2: Value: [soa, design, build, service, oriented, architecture, using, java]
>> Key: 3: Value: [oracle, jdbc, build, java, database, connectivity, layer, oracle]
>> Key: 4: Value: [spring, injection, use, spring, templates, inversion, control]
>> Key: 5: Value: [j2ee, create, device, enterprise, java, beans, integrate, spring]
>> Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
>> Key: 7: Value: [java, graphics, uses, android, graphics, packages, create, user, interfaces]
>> Key: 8: Value: [core, java, understand, core, libraries, java, development, kit]
>> Key: 9: Value: [design, develop, jdbc, sql, queries]
>> Key: 10: Value: [multithreading, thread, synchronization]
>> Count: 10
>>
>>
>> Step 2. Create term frequency vectors from the tokenized sequence file
>> (step 1).
>>
>> mahout seqdumper -i dictionary.file-0
>>
>> Yields
>>
>> Key: java: Value: 0
>> Count: 1
>>
>> mahout seqdumper -i tf-vectors/part-r-00000
>>
>> Yields
>>
>> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
>> Key: 2: Value: 2:{0:1.0}
>> Key: 3: Value: 3:{0:1.0}
>> Key: 5: Value: 5:{0:1.0}
>> Key: 7: Value: 7:{0:1.0}
>> Key: 8: Value: 8:{0:2.0}
>> Count: 5
>>
>>
>> Step 3. Create the document frequency data.
>>
>> mahout seqdumper -i frequency.file-0
>>
>> Yields
>>
>> Key: 0: Value: 5
>> Count: 1
>>
>> NOTE to READER:  Java is NOT the only common word - web occurs more than
>> once - how come it's not included?
>>
>>
>>
>>
>>
>> Step 4. Create the tfidf vectors: (can't remember if partials were created
>> in the past step)
>>
>> mahout seqdumper -i partial-vectors-0/part-r-00000
>>
>> yields
>>
>> INFO: Command line arguments: {--endPhase=[2147483647], --input=[part-r-00000], --startPhase=[0], --tempDir=[temp]}
>> 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from SCDynamicStore
>> Input Path: part-r-00000
>> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
>> Key: 2: Value: 2:{}
>> Key: 3: Value: 3:{}
>> Key: 5: Value: 5:{}
>> Key: 7: Value: 7:{}
>> Key: 8: Value: 8:{}
>> Count: 5
>>
>> NOTE to READER:  What do the empty brackets mean here?
>>
>>
>> mahout seqdumper -i tfidf-vectors/part-r-00000
>>
>> Yields
>>
>> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
>> Count: 0
>>
>> Why 0?
>>
>> What am I NOT understanding here?
>>
>> SCott
>>
>>
>>
