Re: Problem converting tokenized documents into TFIDF vectors

Drew Farris Sun, 26 Jan 2014 08:57:49 -0800

Scott,

Based on the dictionary output, it looks like the process of generating
vectors from your tokenized text is not working properly. The only term
that's making it into your dictionary is 'java'; everything else is being
filtered out. Furthermore, your tf vectors have a single dimension, '0',
with a weight that corresponds to the frequency of the term 'java' in each
document.


I would check the settings for minimum document frequency in the
vectorization process. What is the command you are using to create vectors
from your tokenized documents?
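
To make the filtering concrete, here is a rough Python sketch (not Mahout
code) that takes the ten tokenized documents quoted below, counts how often
each term occurs overall and in how many documents it appears, and then
applies a generic minimum-count cutoff. The names `docs`, `term_count`,
`doc_freq`, and `surviving_terms` are made up for this illustration, and the
cutoff is only an assumed stand-in for whatever thresholds your vectorization
step actually applies.

```python
from collections import Counter

# Tokenized documents copied from the seqdumper output in the quoted message.
docs = {
    "1": ["rest", "web", "services"],
    "2": ["soa", "design", "build", "service", "oriented", "architecture",
          "using", "java"],
    "3": ["oracle", "jdbc", "build", "java", "database", "connectivity",
          "layer", "oracle"],
    "4": ["spring", "injection", "use", "spring", "templates", "inversion",
          "control"],
    "5": ["j2ee", "create", "device", "enterprise", "java", "beans",
          "integrate", "spring"],
    "6": ["can", "deploy", "web", "archive", "war", "files", "tomcat"],
    "7": ["java", "graphics", "uses", "android", "graphics", "packages",
          "create", "user", "interfaces"],
    "8": ["core", "java", "understand", "core", "libraries", "java",
          "development", "kit"],
    "9": ["design", "develop", "jdbc", "sql", "queries"],
    "10": ["multithreading", "thread", "synchronization"],
}

# Total occurrences of each term across the whole collection.
term_count = Counter(t for tokens in docs.values() for t in tokens)

# Number of distinct documents each term appears in (document frequency).
doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))


def surviving_terms(counts, threshold):
    """Return the terms whose count is at least `threshold`, sorted by name.

    `threshold` is a hypothetical cutoff used only for illustration.
    """
    return sorted(t for t, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    for cutoff in (2, 4):
        print(cutoff, surviving_terms(term_count, cutoff))
```

With a cutoff of 2 on total occurrences, ten terms survive, including 'web'
and 'spring'. 'java' is left on its own only once the cutoff reaches 4, which
would be consistent with a threshold being set higher than you intended.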

Drew


On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote <scottcc...@gmail.com> wrote:

> All,
>
> Not a Mahout .9 problem - once I have this working with .8 Mahout, will
> immediately pull in the .9 stuff...
>
> I am trying to make a small data set work (perhaps it is too small?) where I
> am clustering skills (phrases). For the sake of brevity (my steps are long), I
> have not documented the steps that I took to get my text of skills into
> tokenized form...
>
> By the time I get to the TFIDF vectors (step 4), my output is of size zero:
> no tfidf vectors are generated.
>
>
> I have broken this down into 4 steps.
>
>
>
> Step 1. Tokenize docs.  Here is output validating success of tokenization.
>
> mahout seqdumper -i tokenized-documents/part-m-00000
>
> yields
>
> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.common.StringTuple
> Key: 1: Value: [rest, web, services]
> Key: 2: Value: [soa, design, build, service, oriented, architecture, using, java]
> Key: 3: Value: [oracle, jdbc, build, java, database, connectivity, layer, oracle]
> Key: 4: Value: [spring, injection, use, spring, templates, inversion, control]
> Key: 5: Value: [j2ee, create, device, enterprise, java, beans, integrate, spring]
> Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
> Key: 7: Value: [java, graphics, uses, android, graphics, packages, create, user, interfaces]
> Key: 8: Value: [core, java, understand, core, libraries, java, development, kit]
> Key: 9: Value: [design, develop, jdbc, sql, queries]
> Key: 10: Value: [multithreading, thread, synchronization]
> Count: 10
>
>
> Step 2. Create term frequency vectors from the tokenized sequence file
> (step 1).
>
> mahout seqdumper -i dictionary.file-0
>
> Yields
>
> Key: java: Value: 0
> Count: 1
>
> mahout seqdumper -i tf-vectors/part-r-00000
>
> Yields
>
> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
> Key: 2: Value: 2:{0:1.0}
> Key: 3: Value: 3:{0:1.0}
> Key: 5: Value: 5:{0:1.0}
> Key: 7: Value: 7:{0:1.0}
> Key: 8: Value: 8:{0:2.0}
> Count: 5
>
>
> Step 3. Create the document frequency data.
>
> mahout seqdumper -i frequency.file-0
>
> Yields
>
> Key: 0: Value: 5
> Count: 1
>
> NOTE to READER: Java is NOT the only common word. "web" occurs more than
> once, so how come it's not included?
>
>
>
>
>
> Step 4. Create the tfidf vectors: (can't remember if partials were created
> in the past step)
>
> mahout seqdumper -i partial-vectors-0/part-r-00000
>
> yields
>
> INFO: Command line arguments: {--endPhase=[2147483647], --input=[part-r-00000], --startPhase=[0], --tempDir=[temp]}
> 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from SCDynamicStore
> Input Path: part-r-00000
> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
> Key: 2: Value: 2:{}
> Key: 3: Value: 3:{}
> Key: 5: Value: 5:{}
> Key: 7: Value: 7:{}
> Key: 8: Value: 8:{}
> Count: 5
>
> NOTE to READER:  What do the empty brackets mean here?
>
>
> mahout seqdumper -i tfidf-vectors/part-r-00000
>
> Yields
>
> Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
> Count: 0
>
> Why 0?
>
> What am I NOT understanding here?
>
> Scott
>
>
>
