I understand that it is not official. I'm just trying to provide another test opportunity for the .9 release.
SCott

On 1/26/14 1:05 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>Scott,
>
>FYI... the 0.9 release is not official yet. The project trunk is still at
>0.9-SNAPSHOT.
>
>Please feel free to update the documentation.
>
>On Sunday, January 26, 2014 1:34 PM, Scott C. Cote <scottcc...@gmail.com>
>wrote:
>
>Drew,
>
>I'm sorry - I'm derelict (as opposed to Dirichlet) in responding that I
>got past my problem.
>
>It was the min freq that was killing me. I forgot about that parameter.
>
>Thank you for your assistance.
>
>I hope to be able to return the favor.
>
>I'm already on the hook to update documentation for Mahout - maybe that
>will do it :)
>
>This week, I'll be testing my code against the .9 distribution.
>
>SCott
>
>On 1/26/14 10:57 AM, "Drew Farris" <d...@apache.org> wrote:
>
>>Scott,
>>
>>Based on the dictionary output, it looks like the process of generating
>>vectors from your tokenized text is not working properly. The only term
>>that's making it into your dictionary is 'java' - everything else is
>>being filtered out. Furthermore, your tf vectors have a single
>>dimension, '0', whose weight corresponds to the frequency of the term
>>'java' in each document.
>>
>>I would check the settings for minimum document frequency in the
>>vectorization process. What is the command you are using to create
>>vectors from your tokenized documents?
>>
>>Drew
>>
>>On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote <scottcc...@gmail.com>
>>wrote:
>>
>>> All,
>>>
>>> This is not a Mahout .9 problem - once I have this working with Mahout
>>> .8, I will immediately pull in the .9 stuff...
>>>
>>> I am trying to make a small data set work (perhaps it is too small?)
>>> where I am clustering skills (phrases). For the sake of brevity (my
>>> steps are long), I have not documented the steps that I took to get my
>>> text of skills into tokenized form...
>>>
>>> By the time I get to the TFIDF vectors (step 4), my output is empty...
>>> No tfidf vectors are generated.
>>>
>>> I have broken this down into 4 steps.
>>>
>>> Step 1. Tokenize the docs. Here is output validating the success of
>>> tokenization.
>>>
>>> mahout seqdumper -i tokenized-documents/part-m-00000
>>>
>>> yields
>>>
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.mahout.common.StringTuple
>>> Key: 1: Value: [rest, web, services]
>>> Key: 2: Value: [soa, design, build, service, oriented, architecture,
>>> using, java]
>>> Key: 3: Value: [oracle, jdbc, build, java, database, connectivity,
>>> layer, oracle]
>>> Key: 4: Value: [spring, injection, use, spring, templates, inversion,
>>> control]
>>> Key: 5: Value: [j2ee, create, device, enterprise, java, beans,
>>> integrate, spring]
>>> Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
>>> Key: 7: Value: [java, graphics, uses, android, graphics, packages,
>>> create, user, interfaces]
>>> Key: 8: Value: [core, java, understand, core, libraries, java,
>>> development, kit]
>>> Key: 9: Value: [design, develop, jdbc, sql, queries]
>>> Key: 10: Value: [multithreading, thread, synchronization]
>>> Count: 10
>>>
>>> Step 2. Create term frequency vectors from the tokenized sequence file
>>> (step 1).
>>>
>>> mahout seqdumper -i dictionary.file-0
>>>
>>> yields
>>>
>>> Key: java: Value: 0
>>> Count: 1
>>>
>>> mahout seqdumper -i tf-vectors/part-r-00000
>>>
>>> yields
>>>
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.mahout.math.VectorWritable
>>> Key: 2: Value: 2:{0:1.0}
>>> Key: 3: Value: 3:{0:1.0}
>>> Key: 5: Value: 5:{0:1.0}
>>> Key: 7: Value: 7:{0:1.0}
>>> Key: 8: Value: 8:{0:2.0}
>>> Count: 5
>>>
>>> Step 3. Create the document frequency data.
>>>
>>> mahout seqdumper -i frequency.file-0
>>>
>>> yields
>>>
>>> Key: 0: Value: 5
>>> Count: 1
>>>
>>> NOTE to READER: 'java' is NOT the only common word - 'web' also occurs
>>> more than once. How come it's not included?
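In answer to the note above, the corpus-wide term counts can be recomputed from the Step 1 token lists with standard shell tools. This is a sketch, not Mahout output; the token lists are copied verbatim from the seqdumper dump above:

```shell
# Count how often each term occurs across the whole corpus, using the ten
# tokenized documents from Step 1. Output lines look like "6 java".
counts=$(
tr ' ' '\n' <<'EOF' | sort | uniq -c | sort -rn | awk '{print $1, $2}'
rest web services
soa design build service oriented architecture using java
oracle jdbc build java database connectivity layer oracle
spring injection use spring templates inversion control
j2ee create device enterprise java beans integrate spring
can deploy web archive war files tomcat
java graphics uses android graphics packages create user interfaces
core java understand core libraries java development kit
design develop jdbc sql queries
multithreading thread synchronization
EOF
)
echo "$counts"
```

Running this shows 'java' occurring 6 times and 'spring' 3 times, with 'web', 'design', 'build', 'jdbc', 'oracle', 'core', 'create', and 'graphics' each occurring twice - so a corpus-frequency cutoff of 2 alone would not leave only 'java' in the dictionary, which points at a stricter filter setting.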
>>>
>>> Step 4. Create the tfidf vectors. (I can't remember if the partials
>>> were created in the previous step.)
>>>
>>> mahout seqdumper -i partial-vectors-0/part-r-00000
>>>
>>> yields
>>>
>>> INFO: Command line arguments: {--endPhase=[2147483647],
>>> --input=[part-r-00000], --startPhase=[0], --tempDir=[temp]}
>>> 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from
>>> SCDynamicStore
>>> Input Path: part-r-00000
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.mahout.math.VectorWritable
>>> Key: 2: Value: 2:{}
>>> Key: 3: Value: 3:{}
>>> Key: 5: Value: 5:{}
>>> Key: 7: Value: 7:{}
>>> Key: 8: Value: 8:{}
>>> Count: 5
>>>
>>> NOTE to READER: What do the empty brackets mean here?
>>>
>>> mahout seqdumper -i tfidf-vectors/part-r-00000
>>>
>>> yields
>>>
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.mahout.math.VectorWritable
>>> Count: 0
>>>
>>> Why 0?
>>>
>>> What am I NOT understanding here?
>>>
>>> SCott
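For readers hitting the same wall: the empty brackets in Step 4 are sparse vectors with no non-zero entries - every dimension has been pruned away. The pruning thresholds Scott and Drew are discussing are options on the seq2sparse job that builds the dictionary, tf-vectors, and tfidf-vectors. A hedged sketch, assuming the seq2sparse options in Mahout 0.8/0.9 (the input and output paths here are illustrative, not Scott's actual paths):

```shell
# Illustrative paths. The relevant filtering knobs on seq2sparse are:
#   --minSupport   minimum corpus-wide occurrences a term needs (default 2)
#   --minDF        minimum number of documents a term must appear in (default 1)
#   --maxDFPercent prune terms in more than this percent of documents (default 99)
# Loosening them keeps low-frequency phrases in a tiny corpus like this one.
mahout seq2sparse \
  -i skills-seqfiles \
  -o skills-vectors \
  -wt tfidf \
  --minSupport 1 \
  --minDF 1 \
  --maxDFPercent 100
```

With only ten short documents, near-default thresholds can easily prune every term, which would produce exactly the symptoms above: a one-entry dictionary, then empty partial vectors, then zero tfidf vectors.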