Re: Problem converting tokenized documents into TFIDF vectors

2014-01-26 Thread Drew Farris
Scott,

Based on the dictionary output, it looks like the process of generating
vectors from your tokenized text is not working properly. The only term
that's making it into your dictionary is 'java' - everything else is being
filtered out. Furthermore, your tf vectors have a single dimension '0'
whose weight corresponds to the frequency of the term 'java' in each
document.

I would check the settings for minimum document frequency in the
vectorization process. What is the command you are using to create vectors
from your tokenized documents?
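
For reference, a directory layout like yours (dictionary.file-0, tf-vectors,
frequency.file-0, partial-vectors-0, tfidf-vectors) is what the seq2sparse
driver produces. A rough sketch of such a run over a small corpus - the paths
and values below are illustrative, not your actual command - might look like:

mahout seq2sparse \
  -i skills-seqfiles \
  -o skills-vectors \
  -wt tfidf \
  --minSupport 1 \
  --minDF 1 \
  --maxDFPercent 100 \
  -ow -nv

--minSupport and --minDF are the pruning settings to check: any term whose
corpus-wide count or document frequency falls below them is dropped from the
dictionary before the tf and tfidf vectors are built.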

Drew


On Tue, Jan 21, 2014 at 6:30 PM, Scott C. Cote scottcc...@gmail.com wrote:

 All,

 Not a Mahout .9 problem - once I have this working with .8 Mahout, will
 immediately pull in the .9 stuff ...

 I am trying to make a small data set work (perhaps it is too small?) where I
 am clustering skills (phrases).  For the sake of brevity (my steps are long),
 I have not documented the steps that I took to get my text of skills into
 tokenized form ...

 By the time I get to the TFIDF vectors (step 4), my output is empty ...
 no tfidf vectors are generated.


 I have broken this down into 4 steps.



 Step 1. Tokenize docs.  Here is output validating success of tokenization.

 mahout seqdumper -i tokenized-documents/part-m-0

 yields

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.common.StringTuple
 Key: 1: Value: [rest, web, services]
 Key: 2: Value: [soa, design, build, service, oriented, architecture, using,
 java]
 Key: 3: Value: [oracle, jdbc, build, java, database, connectivity, layer,
 oracle]
 Key: 4: Value: [spring, injection, use, spring, templates, inversion,
 control]
 Key: 5: Value: [j2ee, create, device, enterprise, java, beans, integrate,
 spring]
 Key: 6: Value: [can, deploy, web, archive, war, files, tomcat]
 Key: 7: Value: [java, graphics, uses, android, graphics, packages, create,
 user, interfaces]
 Key: 8: Value: [core, java, understand, core, libraries, java, development,
 kit]
 Key: 9: Value: [design, develop, jdbc, sql, queries]
 Key: 10: Value: [multithreading, thread, synchronization]
 Count: 10


 Step 2. Create term frequency vectors from the tokenized sequence file
 (step
 1).

 mahout seqdumper -i dictionary.file-0

 Yields

 Key: java: Value: 0
 Count: 1

 mahout seqdumper -i tf-vectors/part-r-0

 Yields

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Key: 2: Value: 2:{0:1.0}
 Key: 3: Value: 3:{0:1.0}
 Key: 5: Value: 5:{0:1.0}
 Key: 7: Value: 7:{0:1.0}
 Key: 8: Value: 8:{0:2.0}
 Count: 5
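
 (Aside: seqdumper shows only dictionary indices. To see the actual term
 behind each dimension, Mahout's vectordump tool can join the vectors against
 the dictionary - a sketch using the same paths as above; exact option names
 can vary slightly between releases:

 mahout vectordump -i tf-vectors/part-r-0 -d dictionary.file-0 -dt sequencefile

 In this case it would resolve the one surviving dimension to 'java'.)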


 Step 3. Create the document frequency data.

 mahout seqdumper -i frequency.file-0

 Yields

 Key: 0: Value: 5
 Count: 1

 NOTE to READER:  Java is NOT the only common word - 'web' occurs more than
 once - how come it's not included?
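
 (One way to investigate: seq2sparse normally writes an intermediate
 word-count directory alongside the dictionary; if the run kept it, dumping
 it shows the raw per-term counts before any minimum-frequency pruning, e.g.

 mahout seqdumper -i wordcount/part-r-0

 The directory name and part-file suffix are assumptions here - adjust them
 to whatever the vectorizer output directory actually contains.)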





 Step 4. Create the tfidf vectors. (I can't remember whether the partial
 vectors were created in a previous step.)

 mahout seqdumper -i partial-vectors-0/part-r-0

 yields

 INFO: Command line arguments: {--endPhase=[2147483647],
 --input=[part-r-0], --startPhase=[0], --tempDir=[temp]}
 2014-01-21 16:57:23.661 java[24565:1203] Unable to load realm info from
 SCDynamicStore
 Input Path: part-r-0
 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Key: 2: Value: 2:{}
 Key: 3: Value: 3:{}
 Key: 5: Value: 5:{}
 Key: 7: Value: 7:{}
 Key: 8: Value: 8:{}
 Count: 5

 NOTE to READER:  What do the empty brackets mean here?


 mahout seqdumper -i tfidf-vectors/part-r-0

 Yields

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Count: 0

 Why 0?

 What am I NOT understanding here?

 SCott





Re: Problem converting tokenized documents into TFIDF vectors

2014-01-26 Thread Scott C. Cote
Drew,

I'm sorry - I've been derelict (as opposed to dirichlet) in responding that I
got past my problem.

It was the min freq that was killing me.  Forgot about that parameter.

Thank you for your assist.

Hope to be able to return the favor.

Am on the hook to update documentation for Mahout already - maybe that
will do it :)

This week, I'll be testing my code against the .9 distribution.

SCott



Re: generic latent variable recommender question

2014-01-26 Thread Ted Dunning
On Sun, Jan 26, 2014 at 9:36 AM, Pat Ferrel p...@occamsmachete.com wrote:

 I think I’ll leave dithering out until it goes live because it would seem
 to make the eyeball test easier. I doubt all these experiments will survive.


With anti-flood, if you turn the epsilon parameter to 1 (which makes
log(epsilon) = 0), then no re-ordering is done.

I like knobs that go to 11, but also have an off position.


Re: Problem converting tokenized documents into TFIDF vectors

2014-01-26 Thread Suneel Marthi
Scott,

FYI... 0.9 Release is not official yet. The project trunk's still at 
0.9-SNAPSHOT.

Please feel free to update the documentation.








Re: Problem converting tokenized documents into TFIDF vectors

2014-01-26 Thread Scott C. Cote
I understand that it is not official.

Am just trying to provide another test opportunity for the .9 release.

SCott



Re: generic latent variable recommender question

2014-01-26 Thread Tevfik Aytekin
Thanks for the answers. Actually, I worked on a similar issue,
increasing the diversity of top-N lists
(http://link.springer.com/article/10.1007%2Fs10844-013-0252-9).
Clustering-based approaches produce good results and they are very
fast compared to some optimization-based techniques. It also turned
out that introducing randomization (such as choosing 20 random items
from among the top 100) might decrease diversity if the diversity of
the top-N lists is already better than the diversity of a set of random
items, which might sometimes be the case.
