Simple Mahout classification example

2013-02-21 Thread sapan . parikh


I want to train a model for classification. The text comes from a database, and I 
really do not want to write it out to files just for Mahout training. I checked 
out the MIA source code and adapted the following code for a very basic training 
task. The usual issue with Mahout examples is that they either show how to use 
Mahout from the command prompt with the 20 newsgroups data, or the code has a lot 
of dependencies on Hadoop, ZooKeeper, etc. I would really appreciate it if someone 
could have a look at my code, or point me to a very simple tutorial that shows how 
to train a model and then use it.
As of now the following code never gets past if (best != null), because 
learningAlgorithm.getBest() always returns null!
Sorry for posting the whole code, but I didn't see any other option.

public class Classifier {

  private static final int FEATURES = 1;
  private static final TextValueEncoder encoder = new TextValueEncoder("body");
  private static final FeatureVectorEncoder bias = new ConstantValueEncoder("Intercept");
  private static final String[] LEAK_LABELS = {"none", "month-year", "day-month-year"};

  /** @param args the command line arguments */
  public static void main(String[] args) throws Exception {
    int leakType = 0;
    // TODO code application logic here
    AdaptiveLogisticRegression learningAlgorithm =
        new AdaptiveLogisticRegression(20, FEATURES, new L1());
    Dictionary newsGroups = new Dictionary();
    //ModelDissector md = new ModelDissector();
    ListMultimap<String, String> noteBySection = LinkedListMultimap.create();
    noteBySection.put("good", "I love this product, the screen is a pleasure to work with and is a great choice for any business");
    noteBySection.put("good", "What a product!! Really amazing clarity and works pretty well");
    noteBySection.put("good", "This product has good battery life and is a little bit heavy but I like it");
    noteBySection.put("bad", "I am really bored with the same UI, this is their 5th version(or fourth or sixth, who knows) and it looks just like the first one");
    noteBySection.put("bad", "The phone is bulky and useless");
    noteBySection.put("bad", "I wish i had never bought this laptop. It died in the first year and now i am not able to return it");
    encoder.setProbes(2);
    double step = 0;
    int[] bumps = {1, 2, 5};
    double averageCorrect = 0;
    double averageLL = 0;
    int k = 0;
    for (String key : noteBySection.keySet()) {
      System.out.println(key);
      List<String> notes = noteBySection.get(key);
      for (Iterator<String> it = notes.iterator(); it.hasNext();) {
        String note = it.next();
        int actual = newsGroups.intern(key);
        Vector v = encodeFeatureVector(note);
        learningAlgorithm.train(actual, v);
        k++;
        int bump = bumps[(int) Math.floor(step) % bumps.length];
        int scale = (int) Math.pow(10, Math.floor(step / bumps.length));
        State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best =
            learningAlgorithm.getBest();
        double maxBeta;
        double nonZeros;
        double positive;
        double norm;
        double lambda = 0;
        double mu = 0;
        if (best != null) {
          CrossFoldLearner state = best.getPayload().getLearner();
          averageCorrect = state.percentCorrect();
          averageLL = state.logLikelihood();
          OnlineLogisticRegression model = state.getModels().get(0);
          // finish off pending regularization
          model.close();
          Matrix beta = model.getBeta();
          maxBeta = beta.aggregate(Functions.MAX, Functions.ABS);
          nonZeros = beta.aggregate(Functions.PLUS, new DoubleFunction() {
            @Override
            public double apply(double v) {
              return Math.abs(v) > 1.0e-6 ? 1 : 0;
            }
          });
          positive = beta.aggregate(Functions.PLUS, new DoubleFunction() {
            @Override
            public double apply(double v) {
              return v > 0 ? 1 : 0;
            }
          });
          norm = beta.aggregate(Functions.PLUS, Functions.ABS);
          lambda = learningAlgorithm.getBest().getMappedParams()[0];
          mu = learningAlgorithm.getBest().getMappedParams()[1];
        } else {
          maxBeta = 0;
          nonZeros = 0;
          positive = 0;
          norm = 0;
        }
        System.out.println(k % (bump * scale));
        if (k % (bump * scale) == 0) {
          if (learningAl
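
The code above calls encodeFeatureVector(note), a helper that is not shown in the post. A minimal sketch of what such a helper might look like, assuming the MIA-style hashed encoders declared at the top of the class (my illustration, not the original author's code):

  private static Vector encodeFeatureVector(String text) {
    Vector v = new RandomAccessSparseVector(FEATURES);
    bias.addToVector("", 1, v);    // constant intercept feature
    encoder.addToVector(text, v);  // hashed text features from the "body" encoder
    return v;
  }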

RE: Simple Mahout classification example

2013-02-21 Thread sapan . parikh

Oops I messed up in my first Mahout mailing list post!!!
Please see the formatted questions here
http://stackoverflow.com/questions/14998250/simple-mahout-classification-example/14998762#14998762

-Original Message-
From: sapan.par...@eclinicalworks.com
Sent: Thursday, February 21, 2013 4:57am
To: user@mahout.apache.org
Subject: Simple Mahout classification example





Logistic Regression - Questions

2013-02-21 Thread Anbarasan Murthy
I am trying to understand the logistic regression implemented in Mahout in
"org.apache.mahout.classifier.sgd.AbstractOnlineLogisticRegression".

What is the role of the protected Matrix beta, of size (numCategories - 1) x numFeatures?

How and where is this beta matrix updated?

How and where is the local minimum computed?

Y = Beta * X
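
For reference, a (numCategories - 1) x numFeatures beta is the standard parameterization of multinomial logistic regression with category 0 as the reference class (my summary, not from the original mail):

   p(y = k | x) = exp(beta_k . x) / (1 + sum_j exp(beta_j . x)),  for k = 1 .. numCategories - 1
   p(y = 0 | x) = 1 / (1 + sum_j exp(beta_j . x))

so Beta * X gives the per-category log-odds relative to category 0, and training adjusts the rows of beta to (locally) maximize the likelihood.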






Thanks,
Anbu






GenericUserBasedRecommender vs GenericItemBasedRecommender

2013-02-21 Thread Koobas
In the GenericUserBasedRecommender the concept of a neighborhood seems to
be fundamental.
I.e., it is a classic implementation of the kNN algorithm.

But it is not the case with the GenericItemBasedRecommender.
I understand that the two approaches are not meant to be completely
symmetric,
but still, wouldn't it make sense, from the performance perspective, to
compute items' neighborhoods first,
and then use them to compute recommendations?

If kNN were run on items first, then every item-item similarity would be
computed once.
It looks like in the GenericItemBasedRecommender each item-item similarity
will be computed multiple times.
(How many times depends on the data, but still.)

I am wondering if anybody has any thoughts on the validity of doing
item-item kNN in the context of:
1) performance,
2) quality of recommendations.


Re: GenericUserBasedRecommender vs GenericItemBasedRecommender

2013-02-21 Thread Sean Owen
It's also valid, yes. The difference is partly due to asymmetry, but also
just historical (i.e. no great reason). The item-item system uses a
different strategy for picking candidates based on CandidateItemStrategy.


On Thu, Feb 21, 2013 at 2:37 PM, Koobas  wrote:

> In the GenericUserBasedRecommender the concept of a neighborhood seems to
> be fundamental.
> I.e., it is a classic implementation of the kNN algorithm.
>
> But it is not the case with the GenericItemBasedRecommender.
> I understand that the two approaches are not meant to be completely
> symmetric,
> but still, wouldn't it make sense, from the performance perspective, to
> compute items' neighborhoods first,
> and then use them to compute recommendations?
>
> If kNN was run on items first, then every item-item similarity would be
> computed once.
> It looks like in the GenericItemBasedRecommender each item-item similarity
> will be computed multiple times.
> (How much, depends on the data, but still.)
>
> I am wondering if anybody has any thoughts on the validity of doing
> item-item kNN in the context of:
> 1) performance,
> 2) quality of recommendations.
>


Re: GenericUserBasedRecommender vs GenericItemBasedRecommender

2013-02-21 Thread Koobas
On Thu, Feb 21, 2013 at 9:39 AM, Sean Owen  wrote:

> It's also valid, yes. The difference is partly due to asymmetry, but also
> just historical (i.e. no great reason).* The item-item system uses a
> different strategy for picking candidates based on CandidateItemStrategy.*
>
> Where do I find more information about this?
And thanks for the instantaneous reply :)

>
> On Thu, Feb 21, 2013 at 2:37 PM, Koobas  wrote:
>
> > In the GenericUserBasedRecommender the concept of a neighborhood seems to
> > be fundamental.
> > I.e., it is a classic implementation of the kNN algorithm.
> >
> > But it is not the case with the GenericItemBasedRecommender.
> > I understand that the two approaches are not meant to be completely
> > symmetric,
> > but still, wouldn't it make sense, from the performance perspective, to
> > compute items' neighborhoods first,
> > and then use them to compute recommendations?
> >
> > If kNN was run on items first, then every item-item similarity would be
> > computed once.
> > It looks like in the GenericItemBasedRecommender each item-item
> similarity
> > will be computed multiple times.
> > (How much, depends on the data, but still.)
> >
> > I am wondering if anybody has any thoughts on the validity of doing
> > item-item kNN in the context of:
> > 1) performance,
> > 2) quality of recommendations.
> >
>


LDA Convergence

2013-02-21 Thread David LaBarbera
I've been running some performance tests with the LDA algorithm and I'm unsure 
how to gauge them. I ran 10 iterations each time and collected the perplexity 
value every 2 iterations with the test fraction set to 0.1. These were all run on 
an AWS cluster with 10 nodes (70 mappers, 30 reducers). I'm not sure about the 
memory or CPU specs. I also stored the documents on HDFS in 1MB blocks to get 
some parallelization. The documents I have were very short - 10-100 words each. 
Hopefully these results are clear.

Document Count  corpus size (MB)  Topic Count  Perplexity                              Dictionary Size  Runtime (min/iteration)
40,044          3.2               10           16.326, 15.418, 15.191, 15.088, 15.028  14,097           1.5
40,044          3.2               20           26.461, 24.517, 23.996, 23.805, 23.882  14,097           6
40,044          3.2               40           19.722, 18.185, 17.823, 17.680, 17.608  14,097           11.5
40,046          3.7               10           19.286, 18.373, 18.092, 17.958, 17.865  98,283           5.5
40,046          3.7               20           18.574, 17.448, 17.143, 17.018, 16.940  98,283           10.5
44,767          4                 10           19.928, 18.815, 18.521, 18.350, 18.225  31,727           2.5
44,767          4                 20           21.838, 20.421, 20.087, 19.963, 19.903  31,727           4.5
616,957         58.5              10           14.467, 13.830, 13.583, 13.435, 13.381  151,807          8.5
616,957         58.5              20           13.590, 12.787, 12.605, 12.522, 12.476  151,807          16
616,972         58.4              10           14.646, 13.904, 13.646, 13.573, 13.543  54,280           4
616,967         54.1              10           13.363, 12.634, 12.432, 12.345, 12.283  32,101           2.5
616,967         54.1              20           13.195, 12.307, 12.065, 11.764, 11.732  32,101           4.5

The question is how to interpret the results. In particular, is there anything 
telling me when to stop running LDA? I've tried running until convergence, but 
I've never had the patience to see it finish. Does the perplexity give some 
hint to the quality of the results? In attempting to reach convergence, I saw 
runs going to 200 iterations. If an iteration takes around 5.5 minutes, that's 
18 hours of processing - and that doesn't include overhead between iterations.

David

Re: GenericUserBasedRecommender vs GenericItemBasedRecommender

2013-02-21 Thread Julian Ortega
The javadoc should be a nice start
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/impl/recommender/AbstractCandidateItemsStrategy.html

Apart from that, I'd say you should have a look around the code.
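
To make the entry points concrete, here is a minimal sketch of an in-memory item-based recommender (the ratings file name and user ID are made up for illustration); the candidate-picking strategy Sean mentions can be supplied through an alternate GenericItemBasedRecommender constructor:

  import java.io.File;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

  public class ItemRecsExample {
    public static void main(String[] args) throws Exception {
      DataModel model = new FileDataModel(new File("ratings.csv"));  // userID,itemID,value per line
      ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
      GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);
      for (RecommendedItem item : recommender.recommend(123L, 10)) {  // top 10 items for user 123
        System.out.println(item);
      }
    }
  }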


On Thu, Feb 21, 2013 at 3:42 PM, Koobas  wrote:

> On Thu, Feb 21, 2013 at 9:39 AM, Sean Owen  wrote:
>
> > It's also valid, yes. The difference is partly due to asymmetry, but also
> > just historical (i.e. no great reason).* The item-item system uses a
> > different strategy for picking candidates based on
> CandidateItemStrategy.*
> >
> > Where do I find more information about this?
> And thanks for the instantaneous reply :)
>
> >
> > On Thu, Feb 21, 2013 at 2:37 PM, Koobas  wrote:
> >
> > > In the GenericUserBasedRecommender the concept of a neighborhood seems
> to
> > > be fundamental.
> > > I.e., it is a classic implementation of the kNN algorithm.
> > >
> > > But it is not the case with the GenericItemBasedRecommender.
> > > I understand that the two approaches are not meant to be completely
> > > symmetric,
> > > but still, wouldn't it make sense, from the performance perspective, to
> > > compute items' neighborhoods first,
> > > and then use them to compute recommendations?
> > >
> > > If kNN was run on items first, then every item-item similarity would be
> > > computed once.
> > > It looks like in the GenericItemBasedRecommender each item-item
> > similarity
> > > will be computed multiple times.
> > > (How much, depends on the data, but still.)
> > >
> > > I am wondering if anybody has any thoughts on the validity of doing
> > > item-item kNN in the context of:
> > > 1) performance,
> > > 2) quality of recommendations.
> > >
> >
>


Re: LDA Convergence

2013-02-21 Thread Jake Mannix
I really can't read your results here, the formatting of your columns is
pretty destroyed...  you look like you've got results for 20 topics, as
well as for 10, with different sized corpora?

You can't compare convergence between corpora sizes - the perplexity will
vary by order of magnitude between them.  The only thing you should be
comparing is that for a single fixed corpus, as you run it for 5, 10, 15,
20,... iterations, what does the (held-out) perplexity look like after each
of these?  Does it start to level off?  At some point you may start
overfitting and having the perplexity go back up.  Your convergence
happened before that.

I don't think I've ever needed to run more than 50 iterations, and usually
I stop after 20-30.  The bigger the corpus, the more this becomes true.


On Thu, Feb 21, 2013 at 6:45 AM, David LaBarbera <
davidlabarb...@localresponse.com> wrote:

> I've been running some performance test with the LDA algorithm and I'm
> unsure how to gauge them. I ran 10 iterations each time and collected the
> perplexity value every 2 iterations with test fraction set to 0.1. These
> were all run on an AWS cluster with 10 nodes (70 mapper, 30 reducers). I'm
> not sure about the memory or cpu specs. I also stored the documents on hdfs
> in 1MB blocks to get some parallelization. The documents I have were very
> short - 10-100 words each.  Hopefully these results are clear.
>
> Document Count
> corpus size (MB)
> Topic Count
> Perplexity
> Dictionary Size
> Runtime  (min/iteration)
> 40,044   3.2 10 16.326, 15.418, 15.191, 15.088, 15.028  14,097  1.5
>  40,044 3.2
> 20
>  26.461, 24.517, 23.996, 23.805, 23.882 14,097
>  6
>  40,,0443.2
> 40
> 19.722, 18.185, 17.823, 17.680, 17.608  14,097
> 11.5
>  40,046  3.710
> 19.286, 18.373, 18.092, 17.958, 17.865  98,283   5.5
>  40,046  3.720
> 18.574, 17.448, 17.143, 17.018, 16.940
> 98,283   10.5
>
>
>
>
>
>
> 44,767  4
> 10
> 19.928, 18.815, 18.521, 18.350, 18.225  317272.5
> 44,767  4
> 20
> 21.838, 20.421, 20.087, 19.963, 19.903  317274.5
>  616,95758.5
> 10
> 14.467, 13.830, 13.583, 13.435, 13.381  151,807
>  8.5
>  616,95758.5
> 20
> 13.590, 12.787, 12.605, 12.522, 12.476  151,807
>  16
>  616,972 58.410 14.646, 13.904, 13.646, 13.573, 13.543
> 54,280  4
>  616,967 54.110 13.363, 12.634, 12.432, 12.345, 12.283
> 32,101  2.5
>  616,967 54.120 13.195, 12.307, 12.065, 11.764, 11.732
>  32,101
>  4.5
>
> The question is how to interpret the results. In particular, Is there
> anything telling me when to stop running LDA? I've tried running until
> convergence, but I've never had the patience to see it finish. Does the
> perplexity give some hint to the quality of the results? In attempting to
> reach convergence, I saw runs going to 200 iterations. If an iteration
> takes around 5.5 minutes, that's 18 hours of processing - and that doesn't
> include overhead between iterations.
>
> David




-- 

  -jake


Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

2013-02-21 Thread Jake Mannix
This looks like you've got an old version of Mahout - are you running on
trunk?  This has been fixed on trunk; there was a bug in the 0.6 (roughly)
timeframe in which vectors for vectordump --sort were assumed, incorrectly,
to be of size MAX_INT, which led to heap problems no matter how much heap
you gave it.  Well, maybe you could have worked around it with 2^32 * (4 +
8) bytes ~ 48GB, but really the solution is to upgrade to run off of trunk.


On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 <20525entrad...@gmail.com> wrote:

> My trial as below. However still doesn't get through...
>
> Increased MAHOUT_HEAPSIZE as below and also deleted out the comment mark
> from mahout shell script so that I can check it's actually taking effect.
> Added JAVA_HEAP_MAX=-Xmx4g (Default was 3GB)
>
> ~bin/mahout~
> JAVA=$JAVA_HOME/bin/java
> JAVA_HEAP_MAX=-Xmx4g  * <- Increased from the original 3g to 4g*
> # check envvars which might override default args
> if [ "$MAHOUT_HEAPSIZE" != "" ]; then
>   echo "run with heapsize $MAHOUT_HEAPSIZE"
>   JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
>   echo $JAVA_HEAP_MAX
> fi
>
> Also added the same heap size as 4G in hadoop-env.sh as
>
> ~hadoop-env.sh~
> # The maximum amount of heap to use, in MB. Default is 1000.
> export HADOOP_HEAPSIZE=4000
>
> [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
> [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
> NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
> --vectorSize 5 --printKey TRUE --sortVectors TRUE
> run with heapsize 4000* <- Looks like RunJar is taking 4G heap?*
> -Xmx4000m   *<- Right?*
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> --dictionaryType=[sequencefile], --endPhase=[2147483647],
> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>  at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
>  at
>
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.(VectorHelper.java:221)
>  at
>
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.(VectorHelper.java:218)
>  at
>
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
>  at
>
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
>  at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  at
> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at
>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> [hadoop@localhost NHTSA]$
> I've also monitored that at least all the Hadoop tasks are taking 4GB of
> heap through VisualVM utility.
>
> I have done ClusterDump to extract the top 10 terms from the result of
> K-Means as below using the exactly same input data sets as below, however,
> this tasks requires no extra heap other that the default.
>
> $ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
> NHTSA-vectors01/dictionary.file-* -i
> NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
> -b 30-n 10
>
> I believe the vectordump utility and the clusterdump derive from different
> roots in terms of it's heap requirement.
>
> Still waiting for some advise from you people.
> Regards,,,
> Y.Mandai
> 2013/2/19 万代豊 <20525entrad...@gmail.com>
>
> >
> > Well , the --sortVectors for the vectordump utility to evaluate the
> result
> > for CVB clistering unfortunately brought me OutofMemory issue...
> >
> > Here is the case that seem to goes well without --sortVectors option.
> > $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> > NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> > --printKey TRUE
> > ...
> > WHILE FOR:1.3623429635926918E-6

Cross recommendation

2013-02-21 Thread Pat Ferrel
I am quite interested in trying this but have a few questions.

To use/abuse mahout to do this:

A and B can be thought of as having the same size, in other words they must be 
constructed to have the same dimension definitions (userID for rows, itemID for 
columns) as well as row and column rank. So A and B should be constructed to 
have the same users and items, even though a column or row in either matrix may 
be 0. Some work will be needed to produce this so I want to be sure I 
understand.

The conclusion is B'A h_v + B'B h_p will produce view-based cross 
recommendations along with purchase based recs.

Using mahout I'd get purchase based recs by constructing the usual purchase 
matrix [user x item] and getting recs from the framework, as I'm already doing. 
Assuming here that B'B is what you are calling the self-join?

So the primary work to be done is to calculate B'A. Since the CF/taste 
framework does the self-join to prepare for making recs for users, I would have 
to replace the self-joined matrix with B'A then allow the rest of the framework 
to produce recs in the usual way as if it had calculated the self-joined 
matrix. Does this seem reasonable?

Then the question is how to blend B'A h_v with B'B h_p? Since I'm using values 
= 1 or 0 the strengths will be on the same scale, but are they really 
comparable? I'd be inclined to try sorting B'A recs + B'B recs by strength and 
try some other experiments with blending techniques then look at eval metrics.


On Feb 10, 2013, at 9:36 AM, Ted Dunning  wrote:

Actually treating the different interactions separately can lead to very
good recommendations.  The only issue is that the interactions are no
longer dyadic.

If you think about it, having two different kinds of interactions is like
adjoining interaction matrices for the two different kinds of interaction.
Suppose that you have user x views in matrix A and you have user x
purchases in matrix B.  The complete interaction matrix of user x (views +
purchases) is [A | B].

When you compute cooccurrence in this matrix, you get

                      [ A' ]             [ A'A   A'B ]
   [A | B]' [A | B] = [    ] [A | B]  =  [           ]
                      [ B' ]             [ B'A   B'B ]

This matrix is (view + purchase) x (view + purchase).  But we don't care
about predicting views so we only really need a matrix that is purchase x (view
+ purchase).  This is just the bottom part of the matrix above, or [ B'A |
B'B ].  When you produce purchase recommendations r_p by multiplying by a
mixed view and purchase history vector h which has a view part h_v and a
purchase part h_p, you get

 r_p = [ B' A  B' B ] h = B'A h_v + B'B h_p

That is a prediction of purchases based on past views and past purchases.
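
As a toy sketch of that product using Mahout's in-memory math types (the tiny 0/1 matrices below are invented purely to show the shapes, not real data):

  import org.apache.mahout.math.DenseMatrix;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Matrix;
  import org.apache.mahout.math.Vector;

  // 3 users x 2 items
  Matrix A = new DenseMatrix(new double[][] {{1, 0}, {1, 1}, {0, 1}});  // views
  Matrix B = new DenseMatrix(new double[][] {{1, 0}, {0, 1}, {0, 1}});  // purchases
  Vector hv = new DenseVector(new double[] {1, 0});   // one user's view history
  Vector hp = new DenseVector(new double[] {0, 1});   // one user's purchase history
  // r_p = B'A h_v + B'B h_p : purchase scores driven by both kinds of history
  Vector rp = B.transpose().times(A).times(hv)
               .plus(B.transpose().times(B).times(hp));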

Note that this general form applies for both decomposition techniques such
as SVD, ALS and LLL as well as for sparsification techniques such as the
LLR sparsification.  All that changes is the mechanics of how you do the
multiplications.  Weighting of components works the same as well.

What is very different here is that we have a component of cross
recommendation.  That is the B'A in the formula above.  This is very
different from a normal recommendation and cannot be computed with the
simple self-join that we normally have in Mahout cooccurrence computation
and also very different from the decompositions that we normally do.  It
isn't hard to adapt the Mahout computations, however.

When implementing a recommendation using a search engine as the base, these
same techniques also work extremely well in my experience.  What happens is
that for each item that you would like to recommend, you would have one
field that has components of B'A and one field that has components of B'B.
It is handy to simply use the binary values of the sparsified versions of
these and let the search engine handle the weighting of different
components at query time.  Having these components separated into different
fields in the search index seems to help quite a lot, which makes a fair
bit of sense.

On Sun, Feb 10, 2013 at 9:55 AM, Sean Owen  wrote:
> 
> I think you'd have to hack the code to not exclude previously-seen items,
> or at least, not of the type you wish to consider. Yes you would also have
> to hack it to add rather than replace existing values. Or for test
> purposes, just do the adding yourself before inputting the data.
> 
> My hunch is that it will hurt non-trivially to treat different interaction
> types as different items. You probably want to predict that someone who
> viewed a product over and over is likely to buy it, but this would only
> weakly tend to occur if the bought-item is not the same thing as the
> viewed-item. You'd learn they go together but not as strongly as ought to
> be obvious from the fact that they're the same. Still, interesting
thought.
> 
> There ought to be some 'signal' in this data, just a question of how much
> vs noise. A purchase means much more than a page view of

Re: LDA Convergence

2013-02-21 Thread David LaBarbera
Is there a rule of thumb for determining "leveling off" of perplexity? Is this 
value controlled by the convergence delta?

Sorry for the table view. I reformatted it with just space.

Document Count  Corpus Size (MB)  Topic Count  Perplexity                              Dictionary Size  Runtime (min/iteration)
40,044          3.2               10           16.326, 15.418, 15.191, 15.088, 15.028  14,097           1.5
40,044          3.2               20           26.461, 24.517, 23.996, 23.805, 23.882  14,097           6
40,044          3.2               40           19.722, 18.185, 17.823, 17.680, 17.608  14,097           11.5

40,046          3.7               10           19.286, 18.373, 18.092, 17.958, 17.865  98,283           5.5
40,046          3.7               20           18.574, 17.448, 17.143, 17.018, 16.940  98,283           10.5

44,767          4                 10           19.928, 18.815, 18.521, 18.350, 18.225  31,727           2.5
44,767          4                 20           21.838, 20.421, 20.087, 19.963, 19.903  31,727           4.5

616,957         58.5              10           14.467, 13.830, 13.583, 13.435, 13.381  151,807          8.5
616,957         58.5              20           13.590, 12.787, 12.605, 12.522, 12.476  151,807          16

616,972         58.4              10           14.646, 13.904, 13.646, 13.573, 13.543  54,280           4

616,967         54.1              10           13.363, 12.634, 12.432, 12.345, 12.283  32,101           2.5
616,967         54.1              20           13.195, 12.307, 12.065, 11.764, 11.732  32,101           4.5



On Feb 21, 2013, at 12:00 PM, Jake Mannix  wrote:

> I really can't read your results here, the formatting of your columns is
> pretty destroyed...  you look like you've got results for 20 topics, as
> well as for 10, with different sized corpora?
> 
> You can't compare convergence between corpora sizes - the perplexity will
> vary by order of magnitude between them.  The only thing you should be
> comparing is that for a single fixed corpus, as you run it for 5, 10, 15,
> 20,... iterations, what does the (held-out) perplexity look like after each
> of these?  Does it start to level off?  At some point you may start
> overfitting and having the perplexity go back up.  Your convergence
> happened before that.
> 
> I don't think I've ever needed to run more than 50 iterations, and usually
> I stop after 20-30.  The bigger the corpus, the more this becomes true.
> 
> 
> On Thu, Feb 21, 2013 at 6:45 AM, David LaBarbera <
> davidlabarb...@localresponse.com> wrote:
> 
>> I've been running some performance test with the LDA algorithm and I'm
>> unsure how to gauge them. I ran 10 iterations each time and collected the
>> perplexity value every 2 iterations with test fraction set to 0.1. These
>> were all run on an AWS cluster with 10 nodes (70 mapper, 30 reducers). I'm
>> not sure about the memory or cpu specs. I also stored the documents on hdfs
>> in 1MB blocks to get some parallelization. The documents I have were very
>> short - 10-100 words each.  Hopefully these results are clear.
>> 
>> Document Count
>> corpus size (MB)
>> Topic Count
>> Perplexity
>> Dictionary Size
>> Runtime  (min/iteration)
>> 40,044   3.2 10 16.326, 15.418, 15.191, 15.088, 15.028  14,097  1.5
>> 40,044 3.2
>> 20
>> 26.461, 24.517, 23.996, 23.805, 23.882 14,097
>> 6
>> 40,,0443.2
>> 40
>> 19.722, 18.185, 17.823, 17.680, 17.608  14,097
>> 11.5
>> 40,046  3.710
>> 19.286, 18.373, 18.092, 17.958, 17.865  98,283   5.5
>> 40,046  3.720
>> 18.574, 17.448, 17.143, 17.018, 16.940
>> 98,283   10.5
>> 
>> 
>> 
>> 
>> 
>> 
>> 44,767  4
>> 10
>> 19.928, 18.815, 18.521, 18.350, 18.225  317272.5
>> 44,767  4
>> 20
>> 21.838, 20.421, 20.087, 19.963, 19.903  317274.5
>> 616,95758.5
>> 10
>> 14.467, 13.830, 13.583, 13.435, 13.381  151,807
>> 8.5
>> 616,95758.5
>> 20
>> 13.590, 12.787, 12.605, 12.522, 12.476  151,807
>> 16
>> 616,972 58.410 14.646, 13.904, 13.646, 13.573, 13.543
>> 54,280  4
>> 616,967 54.110 13.363, 12.634, 12.432, 12.345, 12.283
>> 32,101  2.5
>> 616,967 54.120 13.195, 12.307, 12.065, 11.764, 11.732
>> 32,101
>> 4.5
>> 
>> The question is how to interpret the results. In particular, Is there
>> anything telling me when to stop running LDA? I've tried running until
>> convergence, but I've never had the patience to see it finish. Does the
>> perplexity give some hint to the quality of the results? In attempting to
>> reach convergence, I saw ru

Re: Running CVB command

2013-02-21 Thread Wilson Chu

> 
> The job get stuck and I don't know why, which means that the job doesn't
> finish and at the same it's not using CPU. Here it is the log's tail:
> 

I saw the same.  The command does not quit back to the shell and is not using CPU.
After one day of waiting it is still the same.

...
INFO: total training time time: 22811.054701ms
Feb 21, 2013 3:43:05 AM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Feb 21, 2013 3:43:05 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: printTopics time: 323.275654ms
Feb 21, 2013 3:43:05 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 25559 ms (Minutes: 0.42598333)

Anyone has clue on what is wrong?

--Wilson






Re: LDA Convergence

2013-02-21 Thread Jake Mannix
On Thu, Feb 21, 2013 at 11:48 AM, David LaBarbera <
davidlabarb...@localresponse.com> wrote:

> Is there a rule of thumb for determining "leveling off" of perplexity? Is
> this value controlled by the convergence delta?
>

The point at which the driver will automatically stop issuing new
iterations is determined by the convergence delta (if
(perplexity(iteration_n) / perplexity(iteration_n-1) ) - 1 < delta, stop),
but what value to set the convergence delta to is hard to say; it must
be found empirically.
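
As a concrete reading of that criterion (taking the magnitude of the relative change): in the 40,044-document / 10-topic run above, the last reported step goes from 15.088 to 15.028, i.e. |15.028 / 15.088 - 1| is roughly 0.004, so a delta of 0.005 would stop there while a delta of 0.001 would keep iterating.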


> Sorry for the table view. I reformatted it with just space.
>

Ah ok, much more readable.

Document Count  corpus size(MB) Topic Count Perplexity
>  Dictionary Size Runtime(min/iteration)
> 40,0443.2 10
>   16.326,15.418,15.191,15.088,15.028  14,097   1.5
> 40,0443.2 20
>   26.461,24.517,23.996,23.805,23.882  14,097   6
> 40,0443.2 40
>   19.722,18.185,17.823,17.680,17.608  14,097  11.5
>
> 40,0463.7 10
>   19.286,18.373,18.092,17.958,17.865  98,283  5.5
> 40,0463.7 20
>   18.574,17.448,17.143,17.018,16.940  98,283  10.5
>
> 44,767410
> 19.928,18.815,18.521,18.350,18.225  31727   2.5
> 44,767420
> 21.838,20.421,20.087,19.963,19.903  31727   4.5
>
> 616,957  58.5  10
> 14.467,13.830,13.583,13.435,13.381  151,8078.5
> 616,957  58.5  20
> 13.590,12.787,12.605,12.522,12.476  151,807   16
>
> 616,972  58.4  10
> 14.646,13.904,13.646,13.573,13.543  54,280  4
>
> 616,967  54.1  10
> 13.363,12.634,12.432,12.345,12.283  32,101  2.5
> 616,967  54.1  20
> 13.195,12.307,12.065,11.764,11.732  32,101  4.5
>

If you could pick one of these corpora and topic sizes, and run it out to
25-50 iterations, and graph the perplexity after every 2 iterations, you
should be able to visually see where the perplexity levels off.
 Alternately, look at the topics themselves for some of these iterations
(like say iteration 10, 15, 20, 25, 30), and see where they start to
visually gel into something sensible.  After some point, they won't even
appear to change very much at all (i.e. if you're inspecting using
vectordump --sort, then the top 50 terms per topic will stop changing
typically after around 20-30 iterations), at this point they're pretty much
converged.

This latter method (looking at your final output topic clusters) tends to
be what I've used to know when I've converged "enough", until I've found
that for my corpora, I have an intuition for how far they need to go with
this algorithm before it's usually far enough.


>
>
>
> On Feb 21, 2013, at 12:00 PM, Jake Mannix  wrote:
>
> > I really can't read your results here, the formatting of your columns is
> > pretty destroyed...  you look like you've got results for 20 topics, as
> > well as for 10, with different sized corpora?
> >
> > You can't compare convergence between corpora sizes - the perplexity will
> > vary by order of magnitude between them.  The only thing you should be
> > comparing is that for a single fixed corpus, as you run it for 5, 10, 15,
> > 20,... iterations, what does the (held-out) perplexity look like after
> each
> > of these?  Does it start to level off?  At some point you may start
> > overfitting and having the perplexity go back up.  Your convergence
> > happened before that.
> >
> > I don't think I've ever needed to run more than 50 iterations, and
> usually
> > I stop after 20-30.  The bigger the corpus, the more this becomes true.
> >
> >
> > On Thu, Feb 21, 2013 at 6:45 AM, David LaBarbera <
> > davidlabarb...@localresponse.com> wrote:
> >
> >> I've been running some performance test with the LDA algorithm and I'm
> >> unsure how to gauge them. I ran 10 iterations each time and collected
> the
> >> perplexity value every 2 iterations with test fraction set to 0.1. These
> >> were all run on an AWS cluster with 10 nodes (70 mapper, 30 reducers).
> I'm
> >> not sure about the memory or cpu specs. I also stored the documents on
> hdfs
> >> in 1MB blocks to get some parallelization. The documents I have were
> very
> >> short - 10-100 words each.  Hopefully these results are clear.
> >>
> >> Document Count
> >> corpus size (MB)
> >> Topic Count
> >> Perplexity
> >> Dictionary Size
> >> Runtime  (min/iteration)
> >> 40,044   3.2 10 16.326, 15.418, 15.191, 15.088, 15.028  14,097
>  1.5
> >> 40,044 3.2
> >> 20
> >> 26.

Re: Running CVB command

2013-02-21 Thread Jake Mannix
If it ends with "total training time..." it's done.  The JVM isn't exiting,
but I bet if you check your HDFS (or local FS if running without hadoop),
you'll see it has already created and populated the output directories.

What version of Mahout are you running - are you running on trunk?


On Thu, Feb 21, 2013 at 2:04 PM, Wilson Chu  wrote:

>
> >
> > The job get stuck and I don't know why, which means that the job doesn't
> > finish and at the same it's not using CPU. Here it is the log's tail:
> >
>
> I saw the same.  The command does not quit back to shell.  Not using CPU.
> After one day waiting still the same.
>
> ...
> INFO: total training time time: 22811.054701ms
> Feb 21, 2013 3:43:05 AM org.apache.hadoop.io.compress.CodecPool
> getCompressor
> INFO: Got brand-new compressor
> Feb 21, 2013 3:43:05 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: printTopics time: 323.275654ms
> Feb 21, 2013 3:43:05 AM org.slf4j.impl.JCLLoggerAdapter info
> INFO: Program took 25559 ms (Minutes: 0.42598333)
>
> Anyone has clue on what is wrong?
>
> --Wilson
>
>
>
>
>


-- 

  -jake


Re: Cross recommendation

2013-02-21 Thread Ted Dunning
On Thu, Feb 21, 2013 at 12:13 PM, Pat Ferrel  wrote:

> I am quite interested in trying this but have a few questions.
>
> To use/abuse mahout to do this:
>
> A and B can be thought of as having the same size, in other words they
> must be constructed to have the same dimension definitions (userID for
> rows, itemID for columns) as well as row and column rank.
>

I generally only assume the same userID space (i.e. same number of rows).


> So A and B should be constructed to have the same users and items, even
> though a column or row in either matrix may be 0. Some work will be needed
> to produce this so I want to be sure I understand.
>

Less work if you only assume user identity.

If you co-group your transaction data, then you commonly will get a list of
A transactions and a list of B transactions for each user which is all you
should need.

The conclusion is B'A h_v + B'B h_p will produce view-based cross
> recommendations along with purchase based recs.
>

This term only produces purchase recommendations. If you want both, you
need to include the A'A h_v + A'B h_p part of the vector.
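
Spelled out from the block product earlier in the thread, the full recommendation vector over both item spaces is (my restatement of the same derivation):

   r_v = A'A h_v + A'B h_p   (view recommendations)
   r_p = B'A h_v + B'B h_p   (purchase recommendations)

with r_p alone being the term quoted above.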


> Using mahout I'd get purchase based recs by constructing the usual
> purchase matrix [user x item] and getting recs from the framework, as I'm
> already doing. Assuming here that B'B is what you are calling the self-join?
>

Yes.


> So the primary work to be done is to calculate B'A. Since the CF/taste
> framework does the self-join to prepare for making recs for users, I would
> have to replace the self-joined matrix with B'A then allow the rest of the
> framework to produce recs in the usual way as if it had calculated the
> self-joined matrix. Does this seem reasonable?
>

Yes.  In pig, there is a co-group that would be helpful.  Alternately, you
can group views and purchases separately and then join.  Co-group saves a
map-reduce step.


> Then the question is how to blend B'A h_v with B'B h_p?
>

The range of both of these will be identical.  Each row of [B'A | B'B]
corresponds to a document.  One field (the view=>purchase indicators)
contains a row of B'A and another field (the purchase=>purchase indicators)
will contain a row of B'B.

The query will ultimately contain two fields corresponding to recent views
and recent purchases.  The search engine will combine the scores from these
intelligently without any effort on your part.  You can tune how this
works, but I haven't ever found that very useful.
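
As a purely hypothetical illustration of that layout (field names and item IDs invented here), an indexed item document might look like:

   id:                           item-42
   view_purchase_indicators:     item-3 item-17 item-29   (non-zero entries of item-42's row of B'A)
   purchase_purchase_indicators: item-8 item-29            (non-zero entries of item-42's row of B'B)

and the query for a user would put their recently viewed item IDs against the first field and their recently purchased item IDs against the second, letting the engine's scoring do the blending.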

Since I'm using values = 1 or 0 the strengths will be on the same scale,
> but are they really comparable?
>

Missing from this formulation are the weights that the search engine places
on things.  This has to do with how many items have each indicator.
 Common indicators will have little weight and rare ones great weight.  For
instance, it might be that everybody (well, all the car buffs, anyway)
looks at the Ferraris, but few buy them.  This would mean that the views
would have little weight but the purchase would have a large weight.

I'd be inclined to try sorting B'A recs + B'B recs by strength and try some
> other experiments with blending techniques then look at eval metrics.
>

Start simple.

Remember that these are systems with feedback.  That means that once the
system goes live (and if you have dithering set up) it will quickly tune
away the false-positive mistakes it is making by showing the events in
question and seeing that they don't lead to success.  This closed loop
nature makes excessive refinement in weighting schemes largely unnecessary.


Re: Network Traffic and Security Analysis

2013-02-21 Thread Mahesh Balija
Hi Ted,

         My apologies for the delay in replying; I was brushing up my
networking skills before I could discuss this.
         A few of the topics I want to start with:

          1) Deep packet inspection - useful for intrusion detection
(NIDS) by port mirroring and analyzing the data packets
          2) Identifying trends in high network usage - this will
help network administrators avoid downtime and network congestion
          3) Flow of traffic - to visualize what is happening within
the data center network
          4) Identifying network hot-spot links

         I will have access to Syslog, SNMP data and data packets
at this point in time.
         There is scope for running predictive analytics over network
usage.

         I will share more information as I progress.
         Your suggestions are most welcome.

Thanks,
Mahesh Balija,
CalsoftLabs.

On Wed, Jan 30, 2013 at 1:25 PM, Ted Dunning  wrote:

> I don't have any such references.  It would actually be interesting if you
> could summarize some of the white papers you have read to the list.
>
> That might strike up some good discussions.
>
> On Tue, Jan 29, 2013 at 11:15 PM, Mahesh Balija
> wrote:
>
> > Hi All / Ted,
> >
> > Currently I am working on a Network project for doing Traffic and
> > Security analysis using BigData stack.
> > I have gone through various white papers related to Network
> > Traffic.
> > Can you please point out to me any advanced analytics problems
> and
> > approaches in Network domain.
> > I am currently gathering an Enterprise network traffic data
> > especially *Syslog and SNMP traps,* in future I will collect a
> > data-center's log as well.
> >
> > Thanks,
> > Mahesh Balija,
> > CalsoftLabs.
> >
>