Re: Problem converting tokenized documents into TFIDF vectors
Scott,

Based on the dictionary output, it looks like the process of generating vectors from your tokenized text is not working properly. The only term that's making it into your dictionary is 'java' - everything else is being filtered out. Furthermore, your tf vectors have a single dimension '0' with a weight that corresponds to the frequency of the term 'java' in each document. I would check the settings for minimum document frequency in the vectorization process. What is the command you are using to create vectors from your tokenized documents?

Drew

On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote scottcc...@gmail.com wrote:

All,

Not a Mahout .9 problem - once I have this working with Mahout .8, I will immediately pull in the .9 stuff...

I am trying to make a small data set work (perhaps it is too small?) where I am clustering skills (phrases). For the sake of brevity (my steps are long), I have not documented the steps that I took to get my text of skills into tokenized form... By the time I get to the TFIDF vectors (step 4), my output is zero... No tfidf vectors generated. I have broken this down into 4 steps.

Step 1. Tokenize docs. Here is output validating success of tokenization.
mahout seqdumper -i tokenized-documents/part-m-0

yields

Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.common.StringTuple
Key: 1: Value: [rest, web, services]
Key: 2: Value: [soa, design, build, service, oriented, architecture, using, java]
Key: 3: Value: [oracle, jdbc, build, java, database, connectivity, layer, oracle]
Key: 4: Value: [spring, injection, use, spring, templates, inversion, control]
Key: 5: Value: [j2ee, create, device, enterprise, java, beans, integrate, spring]
Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
Key: 7: Value: [java, graphics, uses, android, graphics, packages, create, user, interfaces]
Key: 8: Value: [core, java, understand, core, libraries, java, development, kit]
Key: 9: Value: [design, develop, jdbc, sql, queries]
Key: 10: Value: [multithreading, thread, synchronization]
Count: 10

Step 2. Create term frequency vectors from the tokenized sequence file (step 1).

mahout seqdumper -i dictionary.file-0

yields

Key: java: Value: 0
Count: 1

mahout seqdumper -i tf-vectors/part-r-0

yields

Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: 2: Value: 2:{0:1.0}
Key: 3: Value: 3:{0:1.0}
Key: 5: Value: 5:{0:1.0}
Key: 7: Value: 7:{0:1.0}
Key: 8: Value: 8:{0:2.0}
Count: 5

Step 3. Create the document frequency data.

mahout seqdumper -i frequency.file-0

yields

Key: 0: Value: 5
Count: 1

NOTE to READER: 'java' is NOT the only common word - 'web' occurs in more than one document - how come it's not included?

Step 4.
Create the tfidf vectors (can't remember if partials were created in the past step):

mahout seqdumper -i partial-vectors-0/part-r-0

yields

INFO: Command line arguments: {--endPhase=[2147483647], --input=[part-r-0], --startPhase=[0], --tempDir=[temp]}
2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from SCDynamicStore
Input Path: part-r-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: 2: Value: 2:{}
Key: 3: Value: 3:{}
Key: 5: Value: 5:{}
Key: 7: Value: 7:{}
Key: 8: Value: 8:{}
Count: 5

NOTE to READER: What do the empty brackets mean here?

mahout seqdumper -i tfidf-vectors/part-r-0

yields

Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Count: 0

Why 0? What am I NOT understanding here?

SCott
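The filtering Drew describes can be reproduced outside Mahout. The sketch below is plain Python (not Mahout code), and the cutoff value of 3 is an assumption chosen to match the observed output: applied to the ten tokenized documents above, a minimum document-frequency threshold of 3 leaves only 'java' in the dictionary, since every other term occurs in at most 2 of the 10 documents.

```python
from collections import Counter

# Scott's ten tokenized documents, copied from the seqdumper output above.
docs = {
    1: ["rest", "web", "services"],
    2: ["soa", "design", "build", "service", "oriented", "architecture", "using", "java"],
    3: ["oracle", "jdbc", "build", "java", "database", "connectivity", "layer", "oracle"],
    4: ["spring", "injection", "use", "spring", "templates", "inversion", "control"],
    5: ["j2ee", "create", "device", "enterprise", "java", "beans", "integrate", "spring"],
    6: ["can", "deploy", "web", "archive", "war", "files", "tomcat"],
    7: ["java", "graphics", "uses", "android", "graphics", "packages", "create", "user", "interfaces"],
    8: ["core", "java", "understand", "core", "libraries", "java", "development", "kit"],
    9: ["design", "develop", "jdbc", "sql", "queries"],
    10: ["multithreading", "thread", "synchronization"],
}

# Document frequency: the number of documents containing each term at
# least once (duplicates within a document count once, via set()).
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))

# A minimum document-frequency cutoff of 3 (hypothetical value) keeps
# only 'java' (df = 5) and drops everything else (df <= 2).
min_df = 3
dictionary = sorted(t for t, n in df.items() if n >= min_df)
print(dictionary)  # ['java']
```

This matches the dumps above exactly: a one-entry dictionary, and frequency.file-0 reporting df = 5 for term 0.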
Re: Problem converting tokenized documents into TFIDF vectors
Drew,

I'm sorry - I'm derelict (as opposed to dirichlet) in responding that I got past my problem. It was the min freq that was killing me. Forgot about that parameter. Thank you for your assist. Hope to be able to return the favor. Am on the hook to update documentation for Mahout already - maybe that will do it :)

This week, I'll be testing my code against the .9 distribution.

SCott

On 1/26/14 10:57 AM, Drew Farris d...@apache.org wrote:
Scott, Based on the dictionary output, it looks like the process of generating vectors from your tokenized text is not working properly. The only term that's making it into your dictionary is 'java' - everything else is being filtered out. ...
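The min-freq parameter also explains the numbers in the earlier dumps: 'java' has df = 5 out of N = 10 documents, and document 8 contains it twice (tf = 2). As a hedged sketch of the classic tf-idf weight - Mahout's exact smoothing and normalization may differ - the weight for 'java' in document 8 would be:

```python
from math import log

N = 10        # total number of documents
df_java = 5   # 'java' appears in 5 of the 10 documents (frequency.file-0)
tf = 2        # 'java' occurs twice in document 8 (tf-vectors dump)

# Classic tf-idf: term frequency scaled by log of inverse document
# frequency. This is the textbook form, not necessarily Mahout's.
tfidf = tf * log(N / df_java)
print(round(tfidf, 3))  # 1.386, i.e. 2 * ln(2)
```

With too aggressive a minimum-frequency setting, no terms survive at all, which is why the tfidf-vectors dump above showed Count: 0.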
Re: generic latent variable recommender question
On Sun, Jan 26, 2014 at 9:36 AM, Pat Ferrel p...@occamsmachete.com wrote:

I think I’ll leave dithering out until it goes live because it would seem to make the eyeball test easier. I doubt all these experiments will survive.

With anti-flood, if you turn the epsilon parameter to 1 (making log(epsilon) = 0), then no re-ordering is done. I like knobs that go to 11, but also have an off position.
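Ted's epsilon knob can be sketched roughly as follows. This is an illustrative reconstruction, not Mahout code: the function name and the log-rank-plus-Gaussian-noise form are assumptions based on the description above. It shows why epsilon = 1 is the "off position": log(1) = 0, so the noise scale is zero and the original ranking is preserved.

```python
import math
import random

def dither(ranked_items, epsilon, rng=None):
    """Re-rank items by a noisy score: log(rank) plus Gaussian noise
    with scale log(epsilon). epsilon = 1 gives log(epsilon) = 0,
    i.e. zero noise and the original order (the 'off position')."""
    rng = rng or random.Random(0)
    scale = math.log(epsilon)
    scored = [(math.log(r + 1) + rng.gauss(0, scale), item)
              for r, item in enumerate(ranked_items)]
    return [item for _, item in sorted(scored)]

items = list("ABCDEFGH")
print(dither(items, epsilon=1.0))  # unchanged: ['A', 'B', ..., 'H']
print(dither(items, epsilon=2.0))  # mildly shuffled near the top
```

Larger epsilon widens the noise relative to the log-rank gaps, so items further down the list get occasional exposure near the top.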
Re: Problem converting tokenized documents into TFIDF vectors
Scott,

FYI... the 0.9 release is not official yet. The project trunk is still at 0.9-SNAPSHOT. Please feel free to update the documentation.

On Sunday, January 26, 2014 1:34 PM, Scott C. Cote scottcc...@gmail.com wrote:
Drew, I'm sorry - I'm derelict (as opposed to dirichlet) in responding that I got past my problem. It was the min freq that was killing me. Forgot about that parameter. ...
Re: Problem converting tokenized documents into TFIDF vectors
I understand that it is not official. Am just trying to provide another test opportunity for the .9 release.

SCott

On 1/26/14 1:05 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
Scott, FYI... 0.9 Release is not official yet. The project trunk is still at 0.9-SNAPSHOT. Please feel free to update the documentation. ...
Re: generic latent variable recommender question
Thanks for the answers. Actually, I worked on a similar issue, increasing the diversity of top-N lists (http://link.springer.com/article/10.1007%2Fs10844-013-0252-9). Clustering-based approaches produce good results, and they are very fast compared to some optimization-based techniques. It also turned out that introducing randomization (such as choosing 20 random items from among the top 100) can decrease diversity if the diversity of the top-N lists is better than the diversity of a set of random items, which can sometimes be the case.

On Sun, Jan 26, 2014 at 8:49 PM, Ted Dunning ted.dunn...@gmail.com wrote:
With anti-flood, if you turn the epsilon parameter to 1 (making log(epsilon) = 0), then no re-ordering is done. I like knobs that go to 11, but also have an off position.
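The randomization scheme mentioned above (choosing 20 random items from among the top 100) might look like this minimal sketch; the function name and parameters are illustrative:

```python
import random

def random_top_n(ranked_items, n=20, pool=100, seed=42):
    """Pick n items uniformly at random from the top `pool` entries of
    a ranked list. This trades a little ranking precision for exposure,
    but as noted above it can *decrease* diversity when the
    deterministic top-n is already more diverse than a random sample."""
    candidates = ranked_items[:pool]
    return random.Random(seed).sample(candidates, n)

ranked = ["item%d" % i for i in range(1000)]
picks = random_top_n(ranked)
print(len(picks))  # 20, all drawn from the top 100
```

Whether this helps depends on the baseline: if the top 20 is already diverse, sampling from a larger pool mostly injects noise.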