I understand that it is not official. I'm just trying to provide another test opportunity for the .9 release.
SCott

On 1/26/14 1:05 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>Scott,
>
>FYI... the 0.9 release is not official yet. The project trunk is still at
>0.9-SNAPSHOT.
>
>Please feel free to update the documentation.
>
>On Sunday, January 26, 2014 1:34 PM, Scott C. Cote <scottcc...@gmail.com>
>wrote:
>
>Drew,
>
>I'm sorry - I'm derelict (as opposed to Dirichlet) in responding that I
>got past my problem.
>
>It was the min freq that was killing me. I forgot about that parameter.
>
>Thank you for your assistance.
>
>I hope to be able to return the favor.
>
>I'm already on the hook to update documentation for Mahout - maybe that
>will do it :)
>
>This week, I'll be testing my code against the .9 distribution.
>
>SCott
>
>On 1/26/14 10:57 AM, "Drew Farris" <d...@apache.org> wrote:
>
>>Scott,
>>
>>Based on the dictionary output, it looks like the process of generating
>>vectors from your tokenized text is not working properly. The only term
>>that's making it into your dictionary is 'java' - everything else is
>>being filtered out. Furthermore, your tf vectors have a single
>>dimension, '0', whose weight corresponds to the frequency of the term
>>'java' in each document.
>>
>>I would check the settings for minimum document frequency in the
>>vectorization process. What is the command you are using to create
>>vectors from your tokenized documents?
>>
>>Drew
>>
>>On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote <scottcc...@gmail.com>
>>wrote:
>>
>>> All,
>>>
>>> This is not a Mahout .9 problem - once I have this working with Mahout
>>> .8, I will immediately pull in the .9 stuff...
>>>
>>> I am trying to make a small data set work (perhaps it is too small?)
>>> where I am clustering skills (phrases). For the sake of brevity (my
>>> steps are long), I have not documented the steps that I took to get my
>>> text of skills into tokenized form...
>>>
>>> By the time I get to the TFIDF vectors (step 4), my output is empty...
>>> No tfidf vectors are generated.
>>>
>>> I have broken this down into 4 steps.
>>>
>>> Step 1. Tokenize the docs. Here is output validating the success of
>>> tokenization.
>>>
>>> mahout seqdumper -i tokenized-documents/part-m-00000
>>>
>>> yields
>>>
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.mahout.common.StringTuple
>>> Key: 1: Value: [rest, web, services]
>>> Key: 2: Value: [soa, design, build, service, oriented, architecture,
>>> using, java]
>>> Key: 3: Value: [oracle, jdbc, build, java, database, connectivity,
>>> layer, oracle]
>>> Key: 4: Value: [spring, injection, use, spring, templates, inversion,
>>> control]
>>> Key: 5: Value: [j2ee, create, device, enterprise, java, beans,
>>> integrate, spring]
>>> Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
>>> Key: 7: Value: [java, graphics, uses, android, graphics, packages,
>>> create, user, interfaces]
>>> Key: 8: Value: [core, java, understand, core, libraries, java,
>>> development, kit]
>>> Key: 9: Value: [design, develop, jdbc, sql, queries]
>>> Key: 10: Value: [multithreading, thread, synchronization]
>>> Count: 10
>>>
>>> Step 2. Create term frequency vectors from the tokenized sequence file
>>> (step 1).
>>>
>>> mahout seqdumper -i dictionary.file-0
>>>
>>> yields
>>>
>>> Key: java: Value: 0
>>> Count: 1
>>>
>>> mahout seqdumper -i tf-vectors/part-r-00000
>>>
>>> yields
>>>
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.mahout.math.VectorWritable
>>> Key: 2: Value: 2:{0:1.0}
>>> Key: 3: Value: 3:{0:1.0}
>>> Key: 5: Value: 5:{0:1.0}
>>> Key: 7: Value: 7:{0:1.0}
>>> Key: 8: Value: 8:{0:2.0}
>>> Count: 5
>>>
>>> Step 3. Create the document frequency data.
>>>
>>> mahout seqdumper -i frequency.file-0
>>>
>>> yields
>>>
>>> Key: 0: Value: 5
>>> Count: 1
>>>
>>> NOTE to READER: 'java' is NOT the only common word - 'web' also occurs
>>> more than once. How come it's not included?
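In answer to the note above, the corpus-wide term counts can be recomputed from the Step 1 token lists with standard shell tools. This is a sketch, not Mahout output; the token lists are copied verbatim from the seqdumper dump above:

```shell
# Count how often each term occurs across the whole corpus, using the ten
# tokenized documents from Step 1. Output lines look like "6 java".
counts=$(
tr ' ' '\n' <<'EOF' | sort | uniq -c | sort -rn | awk '{print $1, $2}'
rest web services
soa design build service oriented architecture using java
oracle jdbc build java database connectivity layer oracle
spring injection use spring templates inversion control
j2ee create device enterprise java beans integrate spring
can deploy web archive war files tomcat
java graphics uses android graphics packages create user interfaces
core java understand core libraries java development kit
design develop jdbc sql queries
multithreading thread synchronization
EOF
)
echo "$counts"
```

Running this shows 'java' occurring 6 times and 'spring' 3 times, with 'web', 'design', 'build', 'jdbc', 'oracle', 'core', 'create', and 'graphics' each occurring twice - so a corpus-frequency cutoff of 2 alone would not leave only 'java' in the dictionary, which points at a stricter filter setting.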
>>>
>>> Step 4. Create the tfidf vectors. (I can't remember if the partials
>>> were created in the previous step.)
>>>
>>> mahout seqdumper -i partial-vectors-0/part-r-00000
>>>
>>> yields
>>>
>>> INFO: Command line arguments: {--endPhase=[2147483647],
>>> --input=[part-r-00000], --startPhase=[0], --tempDir=[temp]}
>>> 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from
>>> SCDynamicStore
>>> Input Path: part-r-00000
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.mahout.math.VectorWritable
>>> Key: 2: Value: 2:{}
>>> Key: 3: Value: 3:{}
>>> Key: 5: Value: 5:{}
>>> Key: 7: Value: 7:{}
>>> Key: 8: Value: 8:{}
>>> Count: 5
>>>
>>> NOTE to READER: What do the empty brackets mean here?
>>>
>>> mahout seqdumper -i tfidf-vectors/part-r-00000
>>>
>>> yields
>>>
>>> Key class: class org.apache.hadoop.io.Text Value Class: class
>>> org.apache.mahout.math.VectorWritable
>>> Count: 0
>>>
>>> Why 0?
>>>
>>> What am I NOT understanding here?
>>>
>>> SCott
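For readers hitting the same wall: the empty brackets in Step 4 are sparse vectors with no non-zero entries - every dimension has been pruned away. The pruning thresholds Scott and Drew are discussing are options on the seq2sparse job that builds the dictionary, tf-vectors, and tfidf-vectors. A hedged sketch, assuming the seq2sparse options in Mahout 0.8/0.9 (the input and output paths here are illustrative, not Scott's actual paths):

```shell
# Illustrative paths. The relevant filtering knobs on seq2sparse are:
#   --minSupport   minimum corpus-wide occurrences a term needs (default 2)
#   --minDF        minimum number of documents a term must appear in (default 1)
#   --maxDFPercent prune terms in more than this percent of documents (default 99)
# Loosening them keeps low-frequency phrases in a tiny corpus like this one.
mahout seq2sparse \
  -i skills-seqfiles \
  -o skills-vectors \
  -wt tfidf \
  --minSupport 1 \
  --minDF 1 \
  --maxDFPercent 100
```

With only ten short documents, near-default thresholds can easily prune every term, which would produce exactly the symptoms above: a one-entry dictionary, then empty partial vectors, then zero tfidf vectors.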